
Llama 3.1 / 3.2 / 3.3 8B — The RTX 4090's Workhorse: GQA + 128K Context + a Turkish Recipe

The anatomy of Llama 3.1/3.2/3.3 8B-Instruct: 32 layers × 4096 hidden, GQA (8 KV heads), RoPE θ=500K, SwiGLU, RMSNorm, 128K context (YaRN-extended). On an RTX 4090, QLoRA NF4 + Unsloth trains 1 epoch on 50K Turkish Alpaca in ~50 minutes. TR-MMLU baseline 32.4 → fine-tuned 39.8 (+23%). Full recipe: dataset format, hyperparameter table, sweep ranges, sample inference, eval pipeline.

Şükrü Yusuf KAYA
50 minute read
Advanced
🎯 Why is Llama 3.x 8B the 'workhorse'?
Everyone sitting in front of an RTX 4090 looks for a baseline. In 2026 that baseline is Llama 3.3 8B-Instruct. Open weights, heavily pre-trained (15T+ tokens), solid multilingual coverage (including Turkish), 128K context, tool-calling support, and a broad ecosystem. Many of the Cookbook's Labs use this model as the reference.

1. Architectural Anatomy

| Component | Llama 3.1 8B | Llama 3.2 1B/3B | Llama 3.3 70B | Notes |
|---|---|---|---|---|
| Layers (L) | 32 | 16 / 28 | 80 | depth |
| Hidden (h) | 4096 | 2048 / 3072 | 8192 | dim |
| Attention heads (n_h) | 32 | 32 / 24 | 64 | |
| KV heads (GQA) | 8 | 8 / 8 | 8 | GQA group = 4-8 |
| Head dim (d_h) | 128 | 64 / 128 | 128 | |
| FFN hidden (h_ffn) | 14336 | 8192 / 8192 | 28672 | SwiGLU |
| Vocab | 128,256 | 128,256 | 128,256 | shared |
| RoPE θ | 500,000 | 500,000 | 500,000 | long-context-ready |
| Max seq (native) | 128K (YaRN) | 128K (YaRN) | 128K (YaRN) | |
| Active params | 8.03B | 1.24B / 3.21B | 70.55B | |
| Pre-train tokens | ~15T | ~9T / ~9T | ~15T | |
Key arch decisions:
  • GQA (Grouped-Query Attention) — 32 Q heads share 8 KV heads → KV cache is 4x smaller (critical for long context; see the sketch below)
  • RoPE θ=500,000 — a high base frequency for long context (not the classic 10,000)
  • YaRN rope scaling — pre-trained at 8K, extended to 128K
  • SwiGLU FFN — GLU variant: SwiGLU(x) = SiLU(W_gate x) ⊙ (W_up x), where SiLU(z) = z · sigmoid(z); the result is then projected back down by W_down
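A minimal sketch of the KV-cache arithmetic behind that 4x claim, using the 8B numbers from the table above (bf16 cache, 2 bytes per value; the 32-KV-head case is the hypothetical no-GQA variant, shown only for comparison):

python
# KV-cache size: GQA (8 KV heads) vs hypothetical MHA (32 KV heads) for Llama 3.1 8B
def kv_cache_gib(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per=2):
    # 2 = one K tensor and one V tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 1024**3

for seq in (4_096, 32_768, 131_072):
    gqa = kv_cache_gib(seq)                  # shipped config: 8 KV heads
    mha = kv_cache_gib(seq, n_kv_heads=32)   # if every Q head had its own KV head
    print(f"seq={seq:>7}: GQA {gqa:5.2f} GiB vs MHA {mha:5.2f} GiB ({mha / gqa:.0f}x)")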

2. RTX 4090 Memory Budget (Llama 3.1 8B QLoRA)

| Term | Size | Notes |
|---|---|---|
| W (NF4 quantized) | 3.74 GB | 8B × 0.5 B/param × 0.93 quant factor |
| G (LoRA only, r=32) | 0.10 GB | 58.7M trainable × 2 B (bf16) |
| O (paged_adamw_8bit) | 0.30 GB | LoRA params × ~2 B avg (8-bit + percentile) |
| A (grad ckpt + FA2 + packing, seq=4096, batch=2) | 5.21 GB | per-layer ckpt + flash attention |
| B (workspace + fragmentation) | 3.00 GB | cuBLAS + cuDNN + allocator cache |
| Total estimate | 12.35 GB | comfortable; 11.6 GB headroom |
| Measured peak | 13.4 GB | 8% over the estimate |

24 GB - 13.4 GB = 10.6 GB of headroom → you can try batch=4 or seq=8192.
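To check the "Measured peak" row yourself, a minimal sketch using PyTorch's allocator counters around the `trainer.train()` call from the Lab script below (nvidia-smi reads somewhat higher because it also counts the CUDA context and the allocator's reserved cache):

python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run trainer.train() from the Lab script below ...

peak_alloc    = torch.cuda.max_memory_allocated() / 1024**3  # tensors actually allocated
peak_reserved = torch.cuda.max_memory_reserved() / 1024**3   # allocator cache, closer to nvidia-smi
print(f"peak allocated: {peak_alloc:.1f} GiB | peak reserved: {peak_reserved:.1f} GiB")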
python
# === Lesson 3.1 Lab — Llama 3.1 8B QLoRA + Turkish SFT ===
# Stage: Reference
# Hardware: RTX 4090 (24GB)
# Estimated time: 1 epoch, 50K Turkish Alpaca, ~50 minutes

import os, torch
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, train_on_responses_only
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# 1. Model + tokenizer (Unsloth: 2x faster, 70% less memory)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,  # NF4
)

# 2. LoRA adapters — Unsloth fused (q/k/v/o + gate/up/down)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",  # selective ckpt — Unsloth Triton kernels
    random_state=42,
)

# 3. Tokenizer + chat template
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

# 4. Dataset — TR Alpaca
def to_chat(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = load_dataset("malhajar/alpaca-gpt4-tr", split="train")
dataset = dataset.map(to_chat, num_proc=8)

# 5. SFTConfig
cfg = SFTConfig(
    output_dir="llama-3.1-8b-tr-instruct",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch = 8
    learning_rate=2e-4,             # high for QLoRA (this is the lr the LoRA params actually see)
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    weight_decay=0.0,
    bf16=True,
    optim="paged_adamw_8bit",
    max_seq_length=4096,
    packing=True,                   # variable-length packing, +40% throughput
    dataset_text_field="text",
    logging_steps=5,
    save_steps=100,
    save_total_limit=2,
    report_to="wandb",
    run_name="ftc-3.1-llama-8b-tr",
    seed=42,
)

# 6. Trainer + loss masking
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=cfg,
)

# Compute loss only on the response tokens
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# 7. Train
trainer.train()
trainer.save_model("llama-3.1-8b-tr-instruct/final")

# 8. Inference test
FastLanguageModel.for_inference(model)
prompt = "İstanbul'un yedi tepesi nelerdir, kısaca anlat."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to("cuda")
out = model.generate(inputs, max_new_tokens=300, temperature=0.7,
                     do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Llama 3.1 8B Turkish QLoRA Lab — RTX 4090 baseline
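If you want to reproduce the step/s and wall-clock columns of the bench table in the next section, the numbers come straight out of the Trainer. A small sketch, assuming the Lab script above (so `trainer` is already in scope), reading them from the metrics returned by `trainer.train()`:

python
# Throughput/wall-clock from the HF Trainer, for comparison with the bench table
train_result = trainer.train()   # replaces the bare trainer.train() call in step 7
m = train_result.metrics
print(f"steps/s    : {m['train_steps_per_second']:.2f}")
print(f"wall-clock : {m['train_runtime'] / 60:.1f} min")
print(f"train loss : {m['train_loss']:.4f}")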

3. Hyperparameter Reference Table (Cookbook Sweep Results)

| HP | Recommended | Sweep range | Why |
|---|---|---|---|
| LoRA rank r | 32 | 16, 32, 64, 128 | r=32 is the sweet spot; 64 gives marginal quality for 2x memory |
| LoRA alpha | 64 (2×r) | r, 2r, 4r | the alpha/r ratio is what matters (2.0 is standard) |
| LoRA dropout | 0.05 | 0.0, 0.05, 0.1 | mild overfitting mitigation |
| Learning rate | 2e-4 | 5e-5 – 5e-4 | the "real" lr of the LoRA params in QLoRA |
| Batch (per-device) | 2 | 1, 2, 4 | 2 fits at seq=4096; 4 is marginal |
| Grad accum | 4 | 1-16 | effective batch = 8 (4090 baseline) |
| Warmup ratio | 0.03 | 0.0 – 0.10 | 3% is enough for 1 epoch |
| LR scheduler | cosine | cosine / linear | cosine is standard for SFT |
| Weight decay | 0.0 | 0.0 – 0.01 | unnecessary for LoRA |
| Epochs | 1-3 | 1, 2, 3 | 1 epoch is enough for TR-Alpaca |
| Seq length | 4096 | 2048, 4096, 8192 | 4096 with packing is the sweet spot |
| target_modules | all-7 | attn-only / all-7 | all-7 gives +2-3% quality |
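A minimal sketch of how a rank sweep along these lines could be wired up; the `build_model_and_data` helper (wrapping steps 1-4 of the Lab script with the given rank) and the 5K-example subset are assumptions for illustration, not the Cookbook's actual sweep harness:

python
# Hypothetical mini-sweep over LoRA rank, reusing the Lab setup above.
# build_model_and_data(r, lora_alpha) is assumed to wrap steps 1-4 of the Lab script.
from trl import SFTTrainer, SFTConfig

for r in (16, 32, 64):
    model, tokenizer, dataset = build_model_and_data(r=r, lora_alpha=2 * r)  # keep alpha = 2×r
    cfg = SFTConfig(
        output_dir=f"sweep/llama8b-tr-r{r}",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        optim="paged_adamw_8bit",
        dataset_text_field="text",
        report_to="wandb",
        run_name=f"sweep-llama8b-tr-r{r}",
        seed=42,
    )
    trainer = SFTTrainer(model=model, tokenizer=tokenizer,
                         train_dataset=dataset.select(range(5_000)),  # small subset keeps the sweep cheap
                         args=cfg)
    trainer.train()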

4. Bench (RTX 4090 + Llama 3.1 8B + 50K TR Alpaca)

| Config | step/s | wall-clock | Peak GB | TR-MMLU baseline / FT |
|---|---|---|---|---|
| (default) QLoRA r=32 + Unsloth | 3.10 | 47 min | 11.8 | 32.4 → 39.8 |
| QLoRA r=64 | 2.95 | 49 min | 13.1 | 32.4 → 40.2 |
| QLoRA r=128 | 2.80 | 52 min | 14.3 | 32.4 → 40.5 |
| LoRA bf16 r=32 (no quant) | 1.78 | 82 min | 23.1 | 32.4 → 40.0 |
| Full FT (bf16, does not fit) | OOM | — | — | — |

Verdict: r=32 stays the default; the quality/cost trade-off is optimal. Full FT on a cloud H100 80GB takes 3-4 hours for +1-2% marginal quality — not worth it in practice.

MT-Bench-TR results (judge: GPT-4o)

| Model | Avg score (1-10) |
|---|---|
| Llama 3.1 8B-Instruct (base) | 6.42 |
| + 50K TR-Alpaca SFT (cookbook) | 7.18 |
| + DPO (UltraFeedback TR) | 7.51 |
| Qwen 2.5 7B-Instruct (TR-friendly) | 7.32 |
| GPT-4o-mini reference | 8.12 |
🐛 FMD — 'The loss curve looks fine but inference produces garbage output (repeated words)'
Hypotheses: (a) EOT token problem — Llama 3.x's EOS is `<|eot_id|>` (128009), not the default `<|end_of_text|>`. Force `eos_token_id=128009` in generate, otherwise the model never stops. (b) Chat template mismatch — training used the `llama-3.1` template, but inference applies a different model's template via `apply_chat_template`. (c) Saved checkpoint is adapter-only — `save_model` saves only the LoRA adapter; for inference, load it with `PeftModel.from_pretrained(base, adapter_path)` or call `merge_and_unload()`. Drill: eliminate the 3 hypotheses one by one (a sketch is given below).
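A minimal debugging sketch for hypotheses (a) and (c), assuming the adapter was saved to `llama-3.1-8b-tr-instruct/final` as in the Lab; the base repo and 4-bit loading are choices made here for illustration:

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE    = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"   # same base used for training
ADAPTER = "llama-3.1-8b-tr-instruct/final"                 # adapter-only checkpoint (hypothesis c)

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(base, ADAPTER)           # attach the LoRA adapter; or merge_and_unload()

eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")     # 128009 for Llama 3.x (hypothesis a)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "İstanbul'un yedi tepesi nelerdir?"}],
    add_generation_prompt=True, return_tensors="pt",
).to("cuda")
out = model.generate(inputs, max_new_tokens=200,
                     eos_token_id=eot_id,                  # force the correct stop token
                     pad_token_id=eot_id)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))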
✅ Deliverables
  1) Run the Lab script above (~47 minutes). 2) Run sample inference on 20 Turkish prompts before and after FT and compare. 3) Measure the TR-MMLU baseline and post-FT scores (a minimal eval sketch follows below). 4) Next lesson: 3.2 — Llama 3.2 1B/3B (Edge & Mobile FT).
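A minimal sketch of a TR-MMLU-style multiple-choice eval via answer-letter logits; the JSONL path and the `question`/`choices`/`answer` field names are assumptions about the eval file's format, not a specific benchmark loader:

python
import json, torch

# Hypothetical eval file: one {"question": ..., "choices": [...], "answer": 0-3} object per line
LETTERS = ["A", "B", "C", "D"]

@torch.no_grad()
def mmlu_style_accuracy(model, tokenizer, path="tr_mmlu_sample.jsonl"):
    correct = total = 0
    for line in open(path, encoding="utf-8"):
        ex = json.loads(line)
        opts = "\n".join(f"{l}) {c}" for l, c in zip(LETTERS, ex["choices"]))
        prompt = f"{ex['question']}\n{opts}\nCevap:"   # "Cevap" = "Answer" in Turkish
        ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        logits = model(ids).logits[0, -1]              # next-token distribution after "Cevap:"
        letter_ids = [tokenizer.encode(" " + l, add_special_tokens=False)[0] for l in LETTERS]
        pred = int(torch.stack([logits[i] for i in letter_ids]).argmax())
        correct += int(pred == ex["answer"])
        total += 1
    return correct / total

# Usage: run once on the base model and once on the fine-tuned one, then compare.
# print(mmlu_style_accuracy(model, tokenizer))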

