Supervised Fine-Tuning (SFT): Transforming Pre-trained Base Model into Instruct — Llama-3-Instruct Anatomy
Supervised Fine-Tuning (SFT) anatomy: pre-trained base model → instruction-following model. Instruction dataset (Alpaca, OASST, Dolly), chat template application, loss masking (loss only on response), hyperparameter differences (lr 1/10 of pre-train), Llama-3-Instruct production recipe, practical Turkish fine-tune.
Şükrü Yusuf KAYA
75 min read
Advanced 🎯 SFT: from pre-trained base to production chatbot
The Llama-3-8B base model is pre-trained on 15T tokens. An excellent language model, but it does not follow instructions: say 'hi' and it may produce a random Wikipedia-style paragraph. Supervised Fine-Tuning (SFT) turns this base into an instruction-following model. ChatGPT, Claude, Llama-3-Instruct: all of them went through SFT. At its core it is simple: instruction → response pairs, formatted with a chat template, trained with standard cross-entropy loss. BUT the details are critical: loss masking, hyperparameter sensitivity, dataset quality. 75 minutes from now you will have a deep grasp of the mathematical anatomy of SFT, the Alpaca/OASST datasets, the Llama-3-Instruct production recipe, and Turkish SFT.
Lesson Map (10 Sections)#
- Pre-trained base vs Instruct: why the base is not enough
- Instruction dataset format: instruction-response pairs
- Popular datasets: Alpaca, OASST, Dolly, ShareGPT
- Applying the chat template: formatting
- Loss masking: loss only on response tokens
- Hyperparameter differences: pre-train vs SFT
- Llama-3-Instruct recipe: Meta's approach
- Catastrophic forgetting: knowledge loss during fine-tuning
- Turkish SFT: dataset collection + training
- Evaluation: instruction-following quality
1-7. SFT Pipeline#
1.1 What the pre-trained model 'lacks'#
Base model: P(next_token | context). Predicts likely next text.
User: 'Hi'
Base model: 'Hi, my name is John' or random continuation.
User wants: 'Hello! How can I help you?'
The base model has never seen a conversation. It was trained on web text only.
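A minimal sketch of the contrast, assuming a recent transformers version with chat-aware text-generation pipelines; the Meta checkpoints are gated on the Hub and the printed outputs are illustrative:

```python
from transformers import pipeline

# A base checkpoint only continues text; it was never taught to "answer"
base = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B")
out = base("Hi", max_new_tokens=20, do_sample=True)
print(out[0]["generated_text"])
# e.g. "Hi, my name is John. I am a student at ..." (random continuation)

# The Instruct variant, after SFT, responds like an assistant
chat = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")
out = chat([{"role": "user", "content": "Hi"}], max_new_tokens=20)
print(out[0]["generated_text"][-1]["content"])
# e.g. "Hello! How can I help you today?"
```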
1.2 SFT solution#
Labeled examples:
Input: 'Hello' → Output: 'Hello! How can I help you today?'
Input: 'What is 2+2?' → Output: '2+2 equals 4.'
Input: 'Türkiye'nin başkenti?' → Output: 'Ankara, Türkiye'nin başkentidir.' (Turkish: 'The capital of Türkiye?' → 'Ankara is the capital of Türkiye.')
10K-1M examples like these. Base model + SFT → instruction-following model.
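Concretely, such a dataset is just rows of instruction-response pairs; a toy sketch in Python (field names follow the Alpaca convention, where the response field is often called 'output'):

```python
# Typical instruction-tuning rows; each row becomes one training sequence
sft_rows = [
    {"instruction": "Hello", "response": "Hello! How can I help you today?"},
    {"instruction": "What is 2+2?", "response": "2+2 equals 4."},
    {"instruction": "Türkiye'nin başkenti?", "response": "Ankara, Türkiye'nin başkentidir."},
]
```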
1.3 Popular SFT datasets#
- Alpaca (Stanford 2023): 52K instructions generated with OpenAI's text-davinci-003 (GPT-3.5)
- OASST (LAION 2023): community-collected conversations, multilingual
- Dolly 15K (Databricks 2023): human-generated
- ShareGPT (community): crawled user-shared ChatGPT conversations
- OpenHermes: curated mix
Turkish: translated TR-Alpaca, the Turkish OASST subset, community Turkish data.
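A quick way to inspect one of these is to pull it from the HF Hub; the repo id below is Stanford Alpaca as hosted on the Hub:

```python
from datasets import load_dataset

# Stanford Alpaca: 52K instruction/input/output triples
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(len(alpaca))   # ~52K rows
print(alpaca[0])     # {'instruction': ..., 'input': ..., 'output': ..., 'text': ...}
```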
1.4 Format#
Apply the chat template (Module 6.7):
```
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
2+2 equals 4.<|im_end|>
```
Full sequence: input + response.
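A minimal sketch of template application; the Qwen tokenizer is used here because its template happens to be ChatML, matching the snippet above (Llama-3 uses its own header tokens instead):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
]
# tokenize=False returns the rendered string instead of token ids;
# Qwen also prepends a default system block before the user turn
print(tokenizer.apply_chat_template(messages, tokenize=False))
```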
1.5 Loss masking: CRITICAL#
Normal training: cross-entropy loss over the entire sequence.
In SFT: loss is computed only over the response tokens. The instruction (input) tokens are ignored.
```python
# Pseudo-code: mask the instruction so loss falls only on the response
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # PyTorch cross_entropy's default ignore_index

tokens = tokenizer(instruction + response, return_tensors="pt").input_ids[0]
labels = tokens.clone()
n_instr = len(tokenizer(instruction).input_ids)
labels[:n_instr] = IGNORE_INDEX  # Mask input tokens (token count, not characters)
# Next-token shift: logits at position t predict the token at t+1
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
```
Why: the model should learn to generate the response, not memorize the input.
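In practice you rarely hand-roll this masking: TRL ships DataCollatorForCompletionOnlyLM, which sets every label up to the response marker to the ignore index. A sketch, assuming a ChatML-style template (adjust response_template to your own template):

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# Labels before (and including) the response marker are set to -100,
# so cross-entropy is computed only on the assistant's tokens
collator = DataCollatorForCompletionOnlyLM(
    response_template="<|im_start|>assistant",
    tokenizer=tokenizer,
)
# Pass data_collator=collator to SFTTrainer (this collator requires packing=False)
```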
1.6 SFT hyperparameters#
- lr: 1e-5 to 5e-5 (about 1/10 of the pre-training lr)
- warmup: 100-500 steps (roughly 1/4 of pre-training warmup)
- epochs: 1-3 (limited data)
- batch size: 32-128 sequences
- weight decay: 0.0 (no regularization, want to adapt)
- LR schedule: cosine decay
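These settings map directly onto the standard transformers scheduler utilities; a minimal sketch with illustrative step counts and a stand-in module:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in for the LLM being fine-tuned
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
# 100-step warmup, then cosine decay toward zero over the full run
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=3000
)
for step in range(3000):
    # ... forward pass, loss.backward() ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```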
1.7 Llama-3-Instruct recipe#
From the Meta paper:
- Base: Llama-3-8B
- Dataset: 10M+ human + synthetic instruction examples
- Multiple iterations: SFT → RLHF/DPO (Module 15) → SFT round 2
- 4-8 epochs over instruction data
- lr: 2e-5
- Cosine schedule
Final: Llama-3-8B-Instruct. ChatGPT-quality.
```python
# SFT with HuggingFace TRL (Transformer Reinforcement Learning)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# 1. Load base model
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 2. Load Turkish SFT dataset
dataset = load_dataset("merve/turkish_instructions", split="train")
# Format: {'instruction': '...', 'response': '...'}

# 3. Format with chat template
# (base tokenizers may not ship a chat template; set tokenizer.chat_template if needed)
def format_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": text}

dataset = dataset.map(format_chat)

# 4. SFT config
sft_config = SFTConfig(
    output_dir="./llama-3-8b-tr-instruct",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_steps=100,
    lr_scheduler_type="cosine",
    weight_decay=0.0,
    bf16=True,
    logging_steps=10,
    save_steps=500,
    save_total_limit=3,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=True,  # Sequence packing
)

# 5. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=sft_config,
)

trainer.train()
trainer.save_model("./llama-3-8b-tr-instruct/final")
```
SFT with HuggingFace TRL: Turkish Llama-3-Instruct training
8-10. Catastrophic Forgetting + Turkish + Eval#
8.1 Catastrophic forgetting#
Risk in fine-tuning: the model forgets its pre-trained knowledge.
Example: during Turkish SFT, English capability drops. The SFT corpus is small (10K examples) while the pre-train corpus is huge (15T tokens); a few epochs on narrow data can overwrite broad knowledge.
8.2 Mitigation#
- Lower lr (1e-5 instead of 1e-4)
- Fewer epochs (1-3)
- Mix in general data (some pre-training-style data in the SFT mix); see the sketch after this list
- LoRA (Lesson 14.2): freeze the base, train only a small adapter
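For the data-mixing point, a sketch using datasets.interleave_datasets with toy stand-in datasets (the 90/10 ratio is illustrative):

```python
from datasets import Dataset, interleave_datasets

# Toy stand-ins: instruction data plus generic pre-training-style text
sft_ds = Dataset.from_dict({"text": ["<|im_start|>user ...", "..."]})
general_ds = Dataset.from_dict({"text": ["A Wikipedia-style paragraph ...", "..."]})

# Keeping ~10% general text in the mix helps anchor pre-trained capabilities
mixed = interleave_datasets(
    [sft_ds, general_ds],
    probabilities=[0.9, 0.1],
    seed=42,
    stopping_strategy="all_exhausted",
)
```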
8.3 Turkish SFT in practice#
Dataset:
- TR-Alpaca (52K translated)
- OASST Turkish subset
- Manual curation
- Total: ~50-100K examples
Resources: 8x H100 for 4-8 hours (Llama-3-8B SFT).
Cost: $100-300 (8 GPUs × 4-8 hours × roughly $3-5 per GPU-hour).
8.4 Evaluation#
- Turkish MT-Bench: GPT-4 as the judge (sketch below)
- Custom Türkçe benchmark (TR-MMLU)
- User-rating A/B test
- Coherence + correctness checks
Failure modes to watch for: a bad SFT run produces 'sycophantic' (overly agreeable), 'verbose' (needlessly long answers), and 'hallucinating' (fabricates facts) behavior.
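A minimal LLM-as-judge sketch in the MT-Bench spirit; the prompt wording and the 1-10 scale here are our own illustration, not the official MT-Bench prompt:

```python
from openai import OpenAI

client = OpenAI()

def judge(instruction: str, answer: str) -> str:
    # Ask a strong model to grade the SFT model's answer
    prompt = (
        "Rate the following answer to the instruction on a 1-10 scale for "
        "helpfulness, correctness and conciseness. Reply with the score only.\n\n"
        f"Instruction: {instruction}\n\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(judge("Türkiye'nin başkenti?", "Ankara, Türkiye'nin başkentidir."))
```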
✅ Lesson 14.1 Summary: SFT
SFT: pre-trained base → instruction-following. Instruction-response pairs, chat template applied, loss masking (loss only on response tokens). Datasets: Alpaca, OASST, Dolly, ShareGPT. Hyperparameters: lr 2e-5 (about 1/10 of pre-training), 1-3 epochs, cosine decay. Llama-3-Instruct: 10M+ instructions, multiple SFT iterations. Catastrophic forgetting mitigation: lower lr, fewer epochs, mix in general data. Turkish SFT: 50-100K examples, 8x H100 for 4-8 hours, $100-300. In Lesson 14.2 we move on to LoRA (Hu 2021) parameter-efficient fine-tuning.
Next Lesson: LoRA Parameter-Efficient Fine-Tuning#
Lesson 14.2: LoRA (Hu 2021): low-rank decomposition, ~1% of the parameters, comparable quality. QLoRA (Dettmers 2023): 4-bit quantization + LoRA, fine-tuning a 70B model on a consumer GPU.
Frequently Asked Questions
How many instruction examples does SFT need?
Modern practice: 10K-1M instructions. Adequate quality starts around ~50K curated examples plus a proper chat template. Llama-3-Instruct used 10M+ (heavily curated). Less is more: quality > quantity.