Llama 3.1 / 3.2 / 3.3 8B — The Workhorse of RTX 4090: GQA + 128K Context + Turkish Recipe
Anatomy of Llama 3.1/3.2/3.3 8B-Instruct: 32-layer × 4096-hidden, GQA (8 KV-head), RoPE θ=500K, SwiGLU, RMSNorm, 128K context. QLoRA NF4 + Unsloth on RTX 4090 with 50K Turkish Alpaca for 1 epoch ~50 min. TR-MMLU baseline 32.4 → fine-tune 39.8 (+23%). Full recipe.
Şükrü Yusuf KAYA
50 min read
Advanced
🎯 Why is Llama 3.x 8B the 'workhorse'?
Everyone sitting in front of an RTX 4090 looks for a baseline. In 2026 that baseline is Llama 3.1 8B-Instruct: open weights, heavily pre-trained (15T+ tokens), good multilingual coverage (Turkish included), 128K context, tool-calling support, and a broad ecosystem. Many of the cookbook's Labs use this model as the reference.
1. Architectural Anatomy
| Component | Llama 3.1 8B | Llama 3.2 1B/3B | Llama 3.3 70B | Notes |
|---|---|---|---|---|
| Layers (L) | 32 | 16 / 28 | 80 | depth |
| Hidden (h) | 4096 | 2048 / 3072 | 8192 | dim |
| Attention heads (n_h) | 32 | 32 / 24 | 64 | |
| KV heads (GQA) | 8 | 8 / 8 | 8 | GQA group=4-8 |
| Head dim (d_h) | 128 | 64 / 128 | 128 | |
| FFN hidden (h_ffn) | 14336 | 8192 / 8192 | 28672 | SwiGLU |
| Vocab | 128,256 | 128,256 | 128,256 | shared |
| RoPE θ | 500,000 | 500,000 | 500,000 | long-context-ready |
| Max seq | 128K | 128K | 128K | extended via RoPE scaling |
| Params (total) | 8.03B | 1.24B / 3.21B | 70.55B | |
| Pre-train tokens | ~15T | ~9T / ~9T | ~15T | |
Key arch decisions:
- GQA (Grouped-Query Attention) — 32 query heads, 8 KV heads → the KV cache is 4x smaller (critical for long context; see the sketch after this list)
- RoPE θ=500,000 — a high base frequency for long context (not the classic 10,000)
- RoPE scaling — the 8K pre-training context is extended to 128K (Llama 3.1 uses its own RoPE frequency-scaling scheme rather than YaRN)
- SwiGLU FFN — GLU variant: SwiGLU(x) = SiLU(W_gate x) ⊙ (W_up x), followed by W_down, where SiLU(z) = z · sigmoid(z)
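To make the GQA and SwiGLU bullets concrete, here is a minimal sketch (my own illustration, not Meta's reference code) using the dimensions from the table above: a SwiGLU FFN block with h=4096 / h_ffn=14336, plus a back-of-envelope KV-cache size check showing where the 4x saving comes from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Illustrative SwiGLU block with Llama 3.1 8B dimensions (h=4096, h_ffn=14336)."""
    def __init__(self, hidden: int = 4096, ffn_hidden: int = 14336):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, ffn_hidden, bias=False)
        self.up_proj = nn.Linear(hidden, ffn_hidden, bias=False)
        self.down_proj = nn.Linear(ffn_hidden, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(W_gate x) gates the up projection, then W_down maps back to hidden size
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

print(SwiGLUFFN()(torch.randn(2, 16, 4096)).shape)  # torch.Size([2, 16, 4096])

# GQA KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes/value
def kv_cache_gib(kv_heads: int, layers: int = 32, head_dim: int = 128,
                 seq_len: int = 131_072, bytes_per_val: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val / 1024**3

print(f"8 KV heads (GQA): {kv_cache_gib(8):.1f} GiB vs 32 heads (MHA): {kv_cache_gib(32):.1f} GiB")
```

At 128K context in bf16, 8 KV heads give ~16 GiB of KV cache versus ~64 GiB with full 32-head MHA, which is exactly the 4x factor mentioned above.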
2. RTX 4090 Memory Budget (Llama 3.1 8B QLoRA)
| Term | Size | Notes |
|---|---|---|
| W (NF4 quantized) | 3.74 GB | 8.03B params × 0.5 byte/param (NF4) ≈ 3.74 GiB |
| G (LoRA only, r=32) | 0.10 GB | 58.7M trainable × 2 B (bf16) |
| O (paged_adamw_8bit) | 0.30 GB | LoRA params × ~2 B avg (8-bit states + percentile clipping) |
| A (grad-ckpt + FA2 + pack, seq=4096, batch=2) | 5.21 GB | per-layer ckpt + flash |
| B (workspace + frag) | 3.00 GB | cuBLAS + cuDNN + allocator cache |
| Total estimate | 12.35 GB | comfortable; 11.6 GB headroom |
| Measured peak | 13.4 GB | ~8% over the estimate |
24 GB - 13.4 GB = 10.6 GB headroom → you can try batch=4 or seq=8192.
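For reference, the W and G rows plus the headroom check reduce to a few lines of arithmetic. The sketch below uses rounded constants (NF4 ≈ 0.5 byte/param; quant-constant and allocator overheads ignored), so it only approximately reproduces the table.

```python
# Back-of-envelope versions of the W and G rows above, plus the headroom check.
GiB = 1024 ** 3
params = 8.03e9            # Llama 3.1 8B
lora_params = 58.7e6       # r=32, all 7 target modules (value from the table)

W = params * 0.5 / GiB     # NF4: 4 bits ≈ 0.5 byte/param
G = lora_params * 2 / GiB  # bf16 gradients kept only for the LoRA params
print(f"W ≈ {W:.2f} GiB, G ≈ {G:.2f} GiB")        # ≈ 3.74 GiB and ≈ 0.11 GiB

measured_peak_gb = 13.4
print(f"headroom ≈ {24 - measured_peak_gb:.1f} GB -> room for batch=4 or seq=8192")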
```python
# === Lesson 3.1 Lab — Llama 3.1 8B QLoRA + Turkish SFT ===
# Stage: Reference
# Hardware: RTX 4090 (24GB)
# Estimated time: 1 epoch, 50K Turkish Alpaca, ~50 minutes

import os, torch
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, train_on_responses_only
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# 1. Model + tokenizer (Unsloth: 2x faster, 70% less mem)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,  # NF4
)

# 2. LoRA adapters — Unsloth fused (q/k/v/o + gate/up/down)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",  # selective ckpt — Unsloth Triton
    random_state=42,
)

# 3. Tokenizer + chat template
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

# 4. Dataset — TR Alpaca
def to_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = load_dataset("malhajar/alpaca-gpt4-tr", split="train")
dataset = dataset.map(to_chat, num_proc=8)

# 5. SFTConfig
cfg = SFTConfig(
    output_dir="llama-3.1-8b-tr-instruct",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch = 8
    learning_rate=2e-4,  # high for QLoRA (this is the LoRA params' actual lr)
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    weight_decay=0.0,
    bf16=True,
    optim="paged_adamw_8bit",
    max_seq_length=4096,
    packing=True,  # variable-length packing, ~+40% throughput
    dataset_text_field="text",
    logging_steps=5,
    save_steps=100,
    save_total_limit=2,
    report_to="wandb",
    run_name="ftc-3.1-llama-8b-tr",
    seed=42,
)

# 6. Trainer + loss masking
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=cfg,
)

# Compute loss only on the response tokens
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# 7. Train
trainer.train()
trainer.save_model("llama-3.1-8b-tr-instruct/final")

# 8. Inference test
FastLanguageModel.for_inference(model)
prompt = "İstanbul'un yedi tepesi nelerdir, kısaca anlat."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")
out = model.generate(inputs, max_new_tokens=300, temperature=0.7, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
Llama 3.1 8B Turkish QLoRA Lab — RTX 4090 baseline
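An optional sanity check, assuming the `trainer` and `tokenizer` objects from the Lab above: decode the tokens that `train_on_responses_only` excluded from the loss (label `-100`) and confirm they are the prompt side, not the assistant responses.

```python
# Sanity check: confirm that loss masking really excludes the prompt tokens.
batch = next(iter(trainer.get_train_dataloader()))
ids, labels = batch["input_ids"][0], batch["labels"][0]
masked = ids[labels == -100]        # tokens that contribute no loss
print(tokenizer.decode(masked))     # expect user/system turns and headers, not the answers
```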
3. Hyperparameter Reference Table (Cookbook Sweep Results)
| HP | Recommended | Sweep range | Why |
|---|---|---|---|
| LoRA rank r | 32 | 16, 32, 64, 128 | r=32 is the sweet spot; 64 gives marginal quality for 2x memory |
| LoRA alpha | 64 (2×r) | r, 2r, 4r | the alpha/r ratio is what matters (2.0 is standard; see the sketch below) |
| LoRA dropout | 0.05 | 0.0, 0.05, 0.1 | mild overfitting mitigation |
| Learning rate | 2e-4 | 5e-5 – 5e-4 | under QLoRA this is the "real" lr of the LoRA params |
| Batch (per-device) | 2 | 1, 2, 4 | 2 fits at seq=4096; 4 is marginal |
| Grad accum | 4 | 1-16 | effective batch = 8 (4090 baseline) |
| Warmup ratio | 0.03 | 0.0 – 0.10 | 3% is enough for a single epoch |
| LR scheduler | cosine | cosine / linear | cosine is the standard for SFT |
| Weight decay | 0.0 | 0.0 – 0.01 | unnecessary for LoRA |
| Epochs | 1-3 | 1, 2, 3 | 1 epoch is enough on TR-Alpaca |
| Seq length | 4096 | 2048, 4096, 8192 | 4096 is the sweet spot with packing |
| target_modules | all-7 | attn-only / all-7 | all-7 gives +2-3% quality |
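A tiny sketch of why the table recommends tuning alpha together with r: in standard LoRA the applied delta is ΔW = (alpha / r) · B·A, so keeping alpha = 2r holds the update scale at 2.0 across the whole rank sweep (illustrative shapes only, not taken from the Lab code).

```python
import torch

# Standard LoRA update: W_eff = W0 + (alpha / r) * B @ A, with B initialized to zero.
h, r, alpha = 4096, 32, 64
A = torch.randn(r, h) * 0.01
B = torch.zeros(h, r)                 # zero init -> no change to W0 at step 0
delta_W = (alpha / r) * (B @ A)
print(delta_W.shape, alpha / r)       # torch.Size([4096, 4096]) 2.0 — same scale for r=16/32/64/128
```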
4. Bench (RTX 4090 + Llama 3.1 8B + 50K TR Alpaca)
| Config | step/s | wall-clock | Peak GB | TR-MMLU baseline / FT |
|---|---|---|---|---|
| (default) QLoRA r=32 + Unsloth | 3.10 | 47 min | 11.8 | 32.4 → 39.8 |
| QLoRA r=64 | 2.95 | 49 min | 13.1 | 32.4 → 40.2 |
| QLoRA r=128 | 2.80 | 52 min | 14.3 | 32.4 → 40.5 |
| LoRA bf16 r=32 (no quant) | 1.78 | 82 min | 23.1 | 32.4 → 40.0 |
| Full FT (bf16, does not fit) | OOM | — | — | — |
Verdict: r=32 as the default; its quality/cost trade-off is optimal. Full FT on a cloud H100 80GB takes 3-4 hours for +1-2% marginal quality, which is unnecessary in practice.
MT-Bench-TR results (judge: GPT-4o)
| Model | Avg score (1-10) |
|---|---|
| Llama 3.1 8B-Instruct (base) | 6.42 |
| + 50K TR-Alpaca SFT (cookbook) | 7.18 |
| + DPO (UltraFeedback TR) | 7.51 |
| Qwen 2.5 7B-Instruct (TR-friendly) | 7.32 |
| GPT-4o-mini reference | 8.12 |
🐛 FMD — 'The loss curve looks fine but inference produces garbage output (repeated words)'
Hypotheses: (a) EOT token problem — Llama 3.x's EOS is `<|eot_id|>` (128009), not the default `<|end_of_text|>`. Force `eos_token_id=128009` in generate, otherwise the model never stops. (b) Chat template mismatch — the `llama-3.1` template at train time, but a different model's template via `apply_chat_template` at inference. (c) The saved checkpoint is adapter-only — `save_model` stores the LoRA adapter; for inference, load it with `PeftModel.from_pretrained(base, adapter_path)` or merge it with `merge_and_unload()`. Drill: eliminate the 3 hypotheses one by one.
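A hedged sketch for eliminating hypotheses (a) and (c) outside Unsloth: load the base model plus the adapter-only checkpoint explicitly with PEFT and force generation to stop on `<|eot_id|>`. The model id and adapter path below match the Lab; adjust them to your checkpoint layout.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # or the bnb-4bit variant used in training
adapter_path = "llama-3.1-8b-tr-instruct/final"      # adapter-only checkpoint from the Lab

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_path)   # hypothesis (c): load the adapter explicitly
# model = model.merge_and_unload()                      # optional: merge for faster inference

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "İstanbul'un yedi tepesi nelerdir, kısaca anlat."}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)

out = model.generate(
    inputs,
    max_new_tokens=300,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),  # hypothesis (a): stop on <|eot_id|> (128009)
    do_sample=True, temperature=0.7, top_p=0.9,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```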
✅ Deliverables
1) Run the Lab script above (~47 minutes).
2) Run sample inference on 20 Turkish prompts pre-FT and post-FT, and compare.
3) Measure TR-MMLU for the baseline and the post-FT model (a sketch follows below).
4) Next lesson: 3.2 — Llama 3.2 1B/3B (Edge & Mobile FT).
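For deliverable 3, a sketch using the lm-evaluation-harness Python API; the TR-MMLU task id is a placeholder, since the exact name depends on which Turkish MMLU port you have installed or registered.

```python
# Deliverable 3 sketch: TR-MMLU before/after with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,"
        "peft=llama-3.1-8b-tr-instruct/final,"   # drop this line to get the pre-FT baseline
        "dtype=bfloat16"
    ),
    tasks=["<tr_mmlu_task_name>"],               # placeholder task id
    batch_size=8,
)
print(results["results"])
```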