Supervised Fine-Tuning (SFT): Transforming Pre-trained Base Model into Instruct — Llama-3-Instruct Anatomy
Supervised Fine-Tuning (SFT) anatomy: pre-trained base model → instruction-following model. Instruction dataset (Alpaca, OASST, Dolly), chat template application, loss masking (loss only on response), hyperparameter differences (lr 1/10 of pre-train), Llama-3-Instruct production recipe, practical Turkish fine-tune.
Şükrü Yusuf KAYA
75 min read
Advanced 🎯 SFT: from pre-trained base to production chatbot
The Llama-3-8B base model is pre-trained on 15T tokens. An excellent language model, but it does not follow instructions: say 'hi' and it may produce a random Wikipedia-style paragraph. Supervised Fine-Tuning (SFT) turns this base into an instruction-following model. ChatGPT, Claude, Llama-3-Instruct: all of them went through SFT. At its core it is simple: instruction → response pairs, formatted with a chat template, trained with standard cross-entropy loss. BUT the details are critical: loss masking, hyperparameter sensitivity, dataset quality. 75 minutes from now you will have a deep grasp of the mathematical anatomy of SFT, the Alpaca/OASST datasets, the Llama-3-Instruct production recipe, and Turkish SFT.
Lesson Map (10 Sections)#
- Pre-trained base vs Instruct: why the base is not enough
- Instruction dataset format: instruction-response pairs
- Popular datasets: Alpaca, OASST, Dolly, ShareGPT
- Applying the chat template: formatting
- Loss masking: loss only on response tokens
- Hyperparameter differences: pre-train vs SFT
- Llama-3-Instruct recipe: Meta's approach
- Catastrophic forgetting: knowledge loss during fine-tuning
- Turkish SFT: dataset collection + training
- Evaluation: instruction-following quality
1-7. SFT Pipeline#
1.1 What the pre-trained model 'lacks'#
Base model: P(next_token | context). Predicts likely next text.
User: 'Hi'
Base model: 'Hi, my name is John' or random continuation.
User wants: 'Hello! How can I help you?'
The base model has never seen a conversation. It was trained on web text only.
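A minimal sketch of the contrast, assuming a recent transformers version with chat-aware text-generation pipelines; the Meta checkpoints are gated on the Hub and the printed outputs are illustrative:

```python
from transformers import pipeline

# A base checkpoint only continues text; it was never taught to "answer"
base = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B")
out = base("Hi", max_new_tokens=20, do_sample=True)
print(out[0]["generated_text"])
# e.g. "Hi, my name is John. I am a student at ..." (random continuation)

# The Instruct variant, after SFT, responds like an assistant
chat = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")
out = chat([{"role": "user", "content": "Hi"}], max_new_tokens=20)
print(out[0]["generated_text"][-1]["content"])
# e.g. "Hello! How can I help you today?"
```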
1.2 SFT solution#
Labeled examples:
Input: 'Hello' → Output: 'Hello! How can I help you today?'
Input: 'What is 2+2?' → Output: '2+2 equals 4.'
Input: 'Türkiye'nin başkenti?' → Output: 'Ankara, Türkiye'nin başkentidir.' (Turkish: 'The capital of Türkiye?' → 'Ankara is the capital of Türkiye.')
10K-1M examples like these. Base model + SFT → instruction-following model.
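Concretely, such a dataset is just rows of instruction-response pairs; a toy sketch in Python (field names follow the Alpaca convention, where the response field is often called 'output'):

```python
# Typical instruction-tuning rows; each row becomes one training sequence
sft_rows = [
    {"instruction": "Hello", "response": "Hello! How can I help you today?"},
    {"instruction": "What is 2+2?", "response": "2+2 equals 4."},
    {"instruction": "Türkiye'nin başkenti?", "response": "Ankara, Türkiye'nin başkentidir."},
]
```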
1.3 Popular SFT datasets#
- Alpaca (Stanford 2023): 52K instructions generated with OpenAI's text-davinci-003 (GPT-3.5)
- OASST (LAION 2023): community-collected conversations, multilingual
- Dolly 15K (Databricks 2023): human-generated
- ShareGPT (community): crawled user-shared ChatGPT conversations
- OpenHermes: curated mix
Turkish: translated TR-Alpaca, the Turkish OASST subset, community Turkish data.
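A quick way to inspect one of these is to pull it from the HF Hub; the repo id below is Stanford Alpaca as hosted on the Hub:

```python
from datasets import load_dataset

# Stanford Alpaca: 52K instruction/input/output triples
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(len(alpaca))   # ~52K rows
print(alpaca[0])     # {'instruction': ..., 'input': ..., 'output': ..., 'text': ...}
```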
1.4 Format#
Apply the chat template (Module 6.7):
```
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
2+2 equals 4.<|im_end|>
```
Full sequence: input + response.
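A minimal sketch of template application; the Qwen tokenizer is used here because its template happens to be ChatML, matching the snippet above (Llama-3 uses its own header tokens instead):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
]
# tokenize=False returns the rendered string instead of token ids;
# Qwen also prepends a default system block before the user turn
print(tokenizer.apply_chat_template(messages, tokenize=False))
```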
1.5 Loss masking: CRITICAL#
Normal training: cross-entropy loss over the entire sequence.
In SFT: loss is computed only over the response tokens. The instruction (input) tokens are ignored.
```python
# Pseudo-code: mask the instruction so loss falls only on the response
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # PyTorch cross_entropy's default ignore_index

tokens = tokenizer(instruction + response, return_tensors="pt").input_ids[0]
labels = tokens.clone()
n_instr = len(tokenizer(instruction).input_ids)
labels[:n_instr] = IGNORE_INDEX  # Mask input tokens (token count, not characters)
# Next-token shift: logits at position t predict the token at t+1
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
```
Why: the model should learn to generate the response, not memorize the input.
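In practice you rarely hand-roll this masking: TRL ships DataCollatorForCompletionOnlyLM, which sets every label up to the response marker to the ignore index. A sketch, assuming a ChatML-style template (adjust response_template to your own template):

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# Labels before (and including) the response marker are set to -100,
# so cross-entropy is computed only on the assistant's tokens
collator = DataCollatorForCompletionOnlyLM(
    response_template="<|im_start|>assistant",
    tokenizer=tokenizer,
)
# Pass data_collator=collator to SFTTrainer (this collator requires packing=False)
```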
1.6 SFT hyperparameters#
- lr: 1e-5 to 5e-5 (about 1/10 of the pre-training lr)
- warmup: 100-500 steps (roughly 1/4 of pre-training warmup)
- epochs: 1-3 (limited data)
- batch size: 32-128 sequences
- weight decay: 0.0 (no regularization, want to adapt)
- LR schedule: cosine decay
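These settings map directly onto the standard transformers scheduler utilities; a minimal sketch with illustrative step counts and a stand-in module:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in for the LLM being fine-tuned
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
# 100-step warmup, then cosine decay toward zero over the full run
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=3000
)
for step in range(3000):
    # ... forward pass, loss.backward() ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```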
1.7 Llama-3-Instruct recipe#
From the Meta paper:
- Base: Llama-3-8B
- Dataset: 10M+ human + synthetic instruction examples
- Multiple iterations: SFT → RLHF/DPO (Module 15) → SFT round 2
- 4-8 epochs over instruction data
- lr: 2e-5
- Cosine schedule
Final: Llama-3-8B-Instruct. ChatGPT-quality.
```python
# SFT with HuggingFace TRL (Transformer Reinforcement Learning)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# 1. Load base model
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 2. Load Turkish SFT dataset
dataset = load_dataset("merve/turkish_instructions", split="train")
# Format: {'instruction': '...', 'response': '...'}

# 3. Format with chat template
# (base tokenizers may not ship a chat template; set tokenizer.chat_template if needed)
def format_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": text}

dataset = dataset.map(format_chat)

# 4. SFT config
sft_config = SFTConfig(
    output_dir="./llama-3-8b-tr-instruct",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_steps=100,
    lr_scheduler_type="cosine",
    weight_decay=0.0,
    bf16=True,
    logging_steps=10,
    save_steps=500,
    save_total_limit=3,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=True,  # Sequence packing
)

# 5. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=sft_config,
)

trainer.train()
trainer.save_model("./llama-3-8b-tr-instruct/final")
```
SFT with HuggingFace TRL: Turkish Llama-3-Instruct training
8-10. Catastrophic Forgetting + Turkish + Eval#
8.1 Catastrophic forgetting#
Risk in fine-tuning: the model forgets its pre-trained knowledge.
Example: during Turkish SFT, English capability drops. The SFT corpus is small (10K examples) while the pre-train corpus is huge (15T tokens); a few epochs on narrow data can overwrite broad knowledge.
8.2 Mitigation#
- Lower lr (1e-5 instead of 1e-4)
- Fewer epochs (1-3)
- Mix in general data (some pre-training-style data in the SFT mix); see the sketch after this list
- LoRA (Lesson 14.2): freeze the base, train only a small adapter
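For the data-mixing point, a sketch using datasets.interleave_datasets with toy stand-in datasets (the 90/10 ratio is illustrative):

```python
from datasets import Dataset, interleave_datasets

# Toy stand-ins: instruction data plus generic pre-training-style text
sft_ds = Dataset.from_dict({"text": ["<|im_start|>user ...", "..."]})
general_ds = Dataset.from_dict({"text": ["A Wikipedia-style paragraph ...", "..."]})

# Keeping ~10% general text in the mix helps anchor pre-trained capabilities
mixed = interleave_datasets(
    [sft_ds, general_ds],
    probabilities=[0.9, 0.1],
    seed=42,
    stopping_strategy="all_exhausted",
)
```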
8.3 Turkish SFT in practice#
Dataset:
- TR-Alpaca (52K translated)
- OASST Turkish subset
- Manual curation
- Total: ~50-100K examples
Resources: 8x H100 for 4-8 hours (Llama-3-8B SFT).
Cost: $100-300 (8 GPUs × 4-8 hours × roughly $3-5 per GPU-hour).
8.4 Evaluation#
- Turkish MT-Bench: GPT-4 as the judge (sketch below)
- Custom Türkçe benchmark (TR-MMLU)
- User-rating A/B test
- Coherence + correctness checks
Failure modes to watch for: a bad SFT run produces 'sycophantic' (overly agreeable), 'verbose' (needlessly long answers), and 'hallucinating' (fabricates facts) behavior.
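A minimal LLM-as-judge sketch in the MT-Bench spirit; the prompt wording and the 1-10 scale here are our own illustration, not the official MT-Bench prompt:

```python
from openai import OpenAI

client = OpenAI()

def judge(instruction: str, answer: str) -> str:
    # Ask a strong model to grade the SFT model's answer
    prompt = (
        "Rate the following answer to the instruction on a 1-10 scale for "
        "helpfulness, correctness and conciseness. Reply with the score only.\n\n"
        f"Instruction: {instruction}\n\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(judge("Türkiye'nin başkenti?", "Ankara, Türkiye'nin başkentidir."))
```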
✅ Lesson 14.1 Summary: SFT
SFT: pre-trained base → instruction-following. Instruction-response pairs, chat template applied, loss masking (loss only on response tokens). Datasets: Alpaca, OASST, Dolly, ShareGPT. Hyperparameters: lr 2e-5 (about 1/10 of pre-training), 1-3 epochs, cosine decay. Llama-3-Instruct: 10M+ instructions, multiple SFT iterations. Catastrophic forgetting mitigation: lower lr, fewer epochs, mix in general data. Turkish SFT: 50-100K examples, 8x H100 for 4-8 hours, $100-300. In Lesson 14.2 we move on to LoRA (Hu 2021) parameter-efficient fine-tuning.
Next Lesson: LoRA Parameter-Efficient Fine-Tuning#
Lesson 14.2: LoRA (Hu 2021): low-rank decomposition, ~1% of the parameters, comparable quality. QLoRA (Dettmers 2023): 4-bit quantization + LoRA, fine-tuning a 70B model on a consumer GPU.
Frequently Asked Questions
How many instruction examples does SFT need?
Modern practice: 10K-1M instructions. Adequate quality starts around ~50K curated examples plus a proper chat template. Llama-3-Instruct used 10M+ (heavily curated). Less is more: quality > quantity.