
Llama 3.1 / 3.2 / 3.3 8B — The RTX 4090's Workhorse: GQA + 128K Context + a Turkish Recipe

The anatomy of Llama 3.1/3.2/3.3 8B-Instruct: 32 layers × 4096 hidden, GQA (8 KV heads), RoPE θ=500K, SwiGLU, RMSNorm, 128K context (YaRN-extended). On an RTX 4090, QLoRA NF4 + Unsloth trains 1 epoch on 50K Turkish Alpaca in ~50 minutes. TR-MMLU baseline 32.4 → fine-tuned 39.8 (+23%). Full recipe: dataset format, hyperparameter table, sweep ranges, sample inference, eval pipeline.

Şükrü Yusuf KAYA
50 minute read
Advanced
🎯 Why is Llama 3.x 8B the 'workhorse'?
Everyone sitting in front of an RTX 4090 looks for a baseline. In 2026 that baseline is Llama 3.3 8B-Instruct. Open weights, heavily pre-trained (15T+ tokens), solid multilingual coverage (including Turkish), 128K context, tool-calling support, and a broad ecosystem. Many of the Cookbook's Labs use this model as the reference.

1. Architectural Anatomy

| Component | Llama 3.1 8B | Llama 3.2 1B/3B | Llama 3.3 70B | Notes |
|---|---|---|---|---|
| Layers (L) | 32 | 16 / 28 | 80 | depth |
| Hidden (h) | 4096 | 2048 / 3072 | 8192 | dim |
| Attention heads (n_h) | 32 | 32 / 24 | 64 | |
| KV heads (GQA) | 8 | 8 / 8 | 8 | GQA group = 4-8 |
| Head dim (d_h) | 128 | 64 / 128 | 128 | |
| FFN hidden (h_ffn) | 14336 | 8192 / 8192 | 28672 | SwiGLU |
| Vocab | 128,256 | 128,256 | 128,256 | shared |
| RoPE θ | 500,000 | 500,000 | 500,000 | long-context-ready |
| Max seq (native) | 128K (YaRN) | 128K (YaRN) | 128K (YaRN) | |
| Active params | 8.03B | 1.24B / 3.21B | 70.55B | |
| Pre-train tokens | ~15T | ~9T / ~9T | ~15T | |
Key arch decisions:
  • GQA (Grouped-Query Attention) — 32 Q heads share 8 KV heads → KV cache is 4x smaller (critical for long context; see the sketch below)
  • RoPE θ=500,000 — a high base frequency for long context (not the classic 10,000)
  • YaRN rope scaling — pre-trained at 8K, extended to 128K
  • SwiGLU FFN — GLU variant: SwiGLU(x) = SiLU(W_gate x) ⊙ (W_up x), where SiLU(z) = z · sigmoid(z); the result is then projected back down by W_down
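A minimal sketch of the KV-cache arithmetic behind that 4x claim, using the 8B numbers from the table above (bf16 cache, 2 bytes per value; the 32-KV-head case is the hypothetical no-GQA variant, shown only for comparison):

python
# KV-cache size: GQA (8 KV heads) vs hypothetical MHA (32 KV heads) for Llama 3.1 8B
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per=2):
    # 2 = one K tensor and one V tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 1024**3

for seq in (4_096, 32_768, 131_072):
    gqa = kv_cache_gib(seq)                  # shipped config: 8 KV heads
    mha = kv_cache_gib(seq, n_kv_heads=32)   # if every Q head had its own KV head
    print(f"seq={seq:>7}: GQA {gqa:5.2f} GiB vs MHA {mha:5.2f} GiB ({mha / gqa:.0f}x)")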

2. RTX 4090 Memory Budget (Llama 3.1 8B QLoRA)

| Term | Size | Notes |
|---|---|---|
| W (NF4 quantized) | 3.74 GB | 8B × 0.5 B/param × 0.93 quant factor |
| G (LoRA only, r=32) | 0.10 GB | 58.7M trainable × 2 B (bf16) |
| O (paged_adamw_8bit) | 0.30 GB | LoRA params × ~2 B avg (8-bit + percentile) |
| A (grad ckpt + FA2 + packing, seq=4096, batch=2) | 5.21 GB | per-layer ckpt + flash attention |
| B (workspace + fragmentation) | 3.00 GB | cuBLAS + cuDNN + allocator cache |
| Total estimate | 12.35 GB | comfortable; 11.6 GB headroom |
| Measured peak | 13.4 GB | 8% over the estimate |

24 GB - 13.4 GB = 10.6 GB of headroom → you can try batch=4 or seq=8192.
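To check the "Measured peak" row yourself, a minimal sketch using PyTorch's allocator counters around the `trainer.train()` call from the Lab script below (nvidia-smi reads somewhat higher because it also counts the CUDA context and the allocator's reserved cache):

python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run trainer.train() from the Lab script below ...

peak_alloc    = torch.cuda.max_memory_allocated() / 1024**3  # tensors actually allocated
peak_reserved = torch.cuda.max_memory_reserved() / 1024**3   # allocator cache, closer to nvidia-smi
print(f"peak allocated: {peak_alloc:.1f} GiB | peak reserved: {peak_reserved:.1f} GiB")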
python
# === Lesson 3.1 Lab — Llama 3.1 8B QLoRA + Turkish SFT ===
# Stage: Reference
# Hardware: RTX 4090 (24GB)
# Estimated time: 1 epoch, 50K Turkish Alpaca, ~50 minutes

import os, torch
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, train_on_responses_only
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# 1. Model + tokenizer (Unsloth: 2x faster, 70% less memory)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,  # NF4
)

# 2. LoRA adapters — Unsloth fused (q/k/v/o + gate/up/down)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",  # selective ckpt — Unsloth Triton kernels
    random_state=42,
)

# 3. Tokenizer + chat template
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

# 4. Dataset — TR Alpaca
def to_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = load_dataset("malhajar/alpaca-gpt4-tr", split="train")
dataset = dataset.map(to_chat, num_proc=8)

# 5. SFTConfig
cfg = SFTConfig(
    output_dir="llama-3.1-8b-tr-instruct",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch = 8
    learning_rate=2e-4,             # high for QLoRA (this is the lr the LoRA params actually see)
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    weight_decay=0.0,
    bf16=True,
    optim="paged_adamw_8bit",
    max_seq_length=4096,
    packing=True,                   # variable-length packing, +40% throughput
    dataset_text_field="text",
    logging_steps=5,
    save_steps=100,
    save_total_limit=2,
    report_to="wandb",
    run_name="ftc-3.1-llama-8b-tr",
    seed=42,
)

# 6. Trainer + loss masking
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=cfg,
)

# Compute loss only on the response tokens
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# 7. Train
trainer.train()
trainer.save_model("llama-3.1-8b-tr-instruct/final")

# 8. Inference test
FastLanguageModel.for_inference(model)
prompt = "İstanbul'un yedi tepesi nelerdir, kısaca anlat."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to("cuda")
out = model.generate(inputs, max_new_tokens=300, temperature=0.7,
                     do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Llama 3.1 8B Turkish QLoRA Lab — RTX 4090 baseline
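If you want to reproduce the step/s and wall-clock columns of the bench table in the next section, the numbers come straight out of the Trainer. A small sketch, assuming the Lab script above (so `trainer` is already in scope), reading them from the metrics returned by `trainer.train()`:

python
# Throughput/wall-clock from the HF Trainer, for comparison with the bench table
train_result = trainer.train()   # replaces the bare trainer.train() call in step 7
m = train_result.metrics
print(f"steps/s    : {m['train_steps_per_second']:.2f}")
print(f"wall-clock : {m['train_runtime'] / 60:.1f} min")
print(f"train loss : {m['train_loss']:.4f}")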

3. Hyperparameter Reference Table (Cookbook Sweep Results)

| HP | Recommended | Sweep range | Why |
|---|---|---|---|
| LoRA rank r | 32 | 16, 32, 64, 128 | r=32 is the sweet spot; 64 gives marginal quality for 2x memory |
| LoRA alpha | 64 (2×r) | r, 2r, 4r | the alpha/r ratio is what matters (2.0 is standard) |
| LoRA dropout | 0.05 | 0.0, 0.05, 0.1 | mild overfitting mitigation |
| Learning rate | 2e-4 | 5e-5 – 5e-4 | the "real" lr of the LoRA params in QLoRA |
| Batch (per-device) | 2 | 1, 2, 4 | 2 fits at seq=4096; 4 is marginal |
| Grad accum | 4 | 1-16 | effective batch = 8 (4090 baseline) |
| Warmup ratio | 0.03 | 0.0 – 0.10 | 3% is enough for 1 epoch |
| LR scheduler | cosine | cosine / linear | cosine is standard for SFT |
| Weight decay | 0.0 | 0.0 – 0.01 | unnecessary for LoRA |
| Epochs | 1-3 | 1, 2, 3 | 1 epoch is enough for TR-Alpaca |
| Seq length | 4096 | 2048, 4096, 8192 | 4096 with packing is the sweet spot |
| target_modules | all-7 | attn-only / all-7 | all-7 gives +2-3% quality |
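A minimal sketch of how a rank sweep along these lines could be wired up; the `build_model_and_data` helper (wrapping steps 1-4 of the Lab script with the given rank) and the 5K-example subset are assumptions for illustration, not the Cookbook's actual sweep harness:

python
# Hypothetical mini-sweep over LoRA rank, reusing the Lab setup above.
# build_model_and_data(r, lora_alpha) is assumed to wrap steps 1-4 of the Lab script.
from trl import SFTTrainer, SFTConfig

for r in (16, 32, 64):
    model, tokenizer, dataset = build_model_and_data(r=r, lora_alpha=2 * r)  # keep alpha = 2×r
    cfg = SFTConfig(
        output_dir=f"sweep/llama8b-tr-r{r}",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        optim="paged_adamw_8bit",
        dataset_text_field="text",
        report_to="wandb",
        run_name=f"sweep-llama8b-tr-r{r}",
        seed=42,
    )
    trainer = SFTTrainer(model=model, tokenizer=tokenizer,
                         train_dataset=dataset.select(range(5_000)),  # small subset keeps the sweep cheap
                         args=cfg)
    trainer.train()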

4. Bench (RTX 4090 + Llama 3.1 8B + 50K TR Alpaca)

| Config | step/s | wall-clock | Peak GB | TR-MMLU baseline / FT |
|---|---|---|---|---|
| (default) QLoRA r=32 + Unsloth | 3.10 | 47 min | 11.8 | 32.4 → 39.8 |
| QLoRA r=64 | 2.95 | 49 min | 13.1 | 32.4 → 40.2 |
| QLoRA r=128 | 2.80 | 52 min | 14.3 | 32.4 → 40.5 |
| LoRA bf16 r=32 (no quant) | 1.78 | 82 min | 23.1 | 32.4 → 40.0 |
| Full FT (bf16, does not fit) | OOM | — | — | — |

Verdict: r=32 stays the default; the quality/cost trade-off is optimal. Full FT on a cloud H100 80GB takes 3-4 hours for +1-2% marginal quality — not worth it in practice.

MT-Bench-TR results (judge: GPT-4o)

| Model | Avg score (1-10) |
|---|---|
| Llama 3.1 8B-Instruct (base) | 6.42 |
| + 50K TR-Alpaca SFT (cookbook) | 7.18 |
| + DPO (UltraFeedback TR) | 7.51 |
| Qwen 2.5 7B-Instruct (TR-friendly) | 7.32 |
| GPT-4o-mini reference | 8.12 |
🐛 FMD — 'The loss curve looks fine but inference produces garbage output (repeated words)'
Hypotheses: (a) EOT token problem — Llama 3.x's EOS is `<|eot_id|>` (128009), not the default `<|end_of_text|>`. Force `eos_token_id=128009` in generate, otherwise the model never stops. (b) Chat template mismatch — training used the `llama-3.1` template, but inference applies a different model's template via `apply_chat_template`. (c) Saved checkpoint is adapter-only — `save_model` saves only the LoRA adapter; for inference, load it with `PeftModel.from_pretrained(base, adapter_path)` or call `merge_and_unload()`. Drill: eliminate the 3 hypotheses one by one (a sketch is given below).
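A minimal debugging sketch for hypotheses (a) and (c), assuming the adapter was saved to `llama-3.1-8b-tr-instruct/final` as in the Lab; the base repo and 4-bit loading are choices made here for illustration:

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE    = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"   # same base used for training
ADAPTER = "llama-3.1-8b-tr-instruct/final"                 # adapter-only checkpoint (hypothesis c)

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(base, ADAPTER)           # attach the LoRA adapter; or merge_and_unload()

eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")     # 128009 for Llama 3.x (hypothesis a)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "İstanbul'un yedi tepesi nelerdir?"}],
    add_generation_prompt=True, return_tensors="pt",
).to("cuda")
out = model.generate(inputs, max_new_tokens=200,
                     eos_token_id=eot_id,                  # force the correct stop token
                     pad_token_id=eot_id)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))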
✅ Deliverables
  1) Run the Lab script above (~47 minutes). 2) Run sample inference on 20 Turkish prompts before and after FT and compare. 3) Measure the TR-MMLU baseline and post-FT scores (a minimal eval sketch follows below). 4) Next lesson: 3.2 — Llama 3.2 1B/3B (Edge & Mobile FT).
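A minimal sketch of a TR-MMLU-style multiple-choice eval via answer-letter logits; the JSONL path and the `question`/`choices`/`answer` field names are assumptions about the eval file's format, not a specific benchmark loader:

python
import json, torch

# Hypothetical eval file: one {"question": ..., "choices": [...], "answer": 0-3} object per line
LETTERS = ["A", "B", "C", "D"]

@torch.no_grad()
def mmlu_style_accuracy(model, tokenizer, path="tr_mmlu_sample.jsonl"):
    correct = total = 0
    for line in open(path, encoding="utf-8"):
        ex = json.loads(line)
        opts = "\n".join(f"{l}) {c}" for l, c in zip(LETTERS, ex["choices"]))
        prompt = f"{ex['question']}\n{opts}\nCevap:"   # "Cevap" = "Answer" in Turkish
        ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        logits = model(ids).logits[0, -1]              # next-token distribution after "Cevap:"
        letter_ids = [tokenizer.encode(" " + l, add_special_tokens=False)[0] for l in LETTERS]
        pred = int(torch.stack([logits[i] for i in letter_ids]).argmax())
        correct += int(pred == ex["answer"])
        total += 1
    return correct / total

# Usage: run once on the base model and once on the fine-tuned one, then compare.
# print(mmlu_style_accuracy(model, tokenizer))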

