TR SFT: Quality > Quantity — 5K Curated TR Data > 100K Noisy
The main insight of TR SFT: a small, high-quality dataset beats a large, noisy one. 5K human-curated Turkish examples outperform 100K poorly machine-translated Alpaca examples. This lesson covers how to mix TR-Alpaca, OASST-TR, Mukayese, and custom-domain TR data. A curated 5K dataset trains for 1 epoch in 12 minutes on an RTX 4090.
Şükrü Yusuf KAYA
28 min read
Advanced
1. LIMA Insight (Zhou et al. 2023): Adapting It to Turkish
LIMA (Meta, 2023): LLaMA 65B was fine-tuned on only 1,000 curated SFT examples, and in human evaluation its responses were rated equivalent to or better than GPT-4's in 43% of cases.
The same insight holds for Turkish:
- 100K MT-translated TR Alpaca → mediocre quality
- 3-5K human-curated TR → much higher quality
Why? Models pick up instruction-following behavior from a small number of examples. The actual knowledge comes from pre-training. Curated data teaches format, tone, and edge cases.
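The lesson does not spell out how the curated set is produced, and human review is the core of it. As a rough illustration only, a cheap first pass over machine-translated rows could look like the sketch below. The field names (`instruction`, `response`), the English-leak word list, and every threshold are assumptions made for this example, not part of the Cookbook.

```python
import re

TURKISH_CHARS = set("çğıöşüÇĞİÖŞÜ")
# Hypothetical markers of English text that leaked through machine translation.
ENGLISH_LEAKS = re.compile(r"\b(the|and|please|however|as an ai)\b", re.IGNORECASE)


def looks_clean(instruction: str, response: str, min_chars: int = 20) -> bool:
    """Cheap first-pass filter; thresholds are illustrative, not tuned."""
    text = f"{instruction} {response}"
    if len(response.strip()) < min_chars:
        return False  # too short to teach format or tone
    if not any(ch in TURKISH_CHARS for ch in text):
        return False  # probably untranslated or ASCII-mangled
    if len(ENGLISH_LEAKS.findall(text)) >= 3:
        return False  # several English function words suggest a partial translation
    return True


def curate(rows: list[dict]) -> list[dict]:
    """Drop duplicate instructions (after whitespace/case normalization) and rows failing the filter."""
    seen, kept = set(), []
    for row in rows:
        key = " ".join(row["instruction"].lower().split())
        if key in seen or not looks_clean(row["instruction"], row["response"]):
            continue
        seen.add(key)
        kept.append(row)
    return kept
```

A filter like this only shrinks the pile that a human has to read; it does not replace the human pass that makes the data "curated".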
2. TR SFT Dataset Table (2026)
| Dataset | Size | Quality | Source | License |
|---|---|---|---|---|
| malhajar/alpaca-gpt4-tr | 52K | medium (MT) | GPT-4 EN → TR | unknown |
| OpenAssistant/oasst1 (TR filter) | 8K | high (human) | community | Apache 2.0 |
| Mukayese (an eval set, but also usable as SFT data) | 2K | very high (Turkish NLP researchers) | curated | Apache 2.0 |
| Cosmos-LLaMA-TR-instruct | 10K | high | curated + filter | MIT |
| ShareGPT-TR | 30K | medium-low (social) | ChatGPT logs | CC |
| Custom domain | 1-5K | highest | your data | yours |
The Cookbook's rule:
- Generic chat: TR-Alpaca + OASST-TR mix (60K total)
- Domain-specific (e-commerce, legal, etc.): 20% generic + 80% custom
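To make the 20/80 rule concrete, here is a minimal sketch that works out how many generic rows to draw for a given amount of custom data and builds the shuffled mix. The helper names `generic_quota` and `build_domain_mix` are hypothetical, and the rows are plain Python lists rather than any particular dataset type.

```python
import random


def generic_quota(n_custom: int, custom_share: float = 0.8) -> int:
    """Number of generic rows so that custom data makes up `custom_share` of the mix."""
    if not 0 < custom_share < 1:
        raise ValueError("custom_share must be strictly between 0 and 1")
    return round(n_custom * (1 - custom_share) / custom_share)


def build_domain_mix(custom: list, generic: list, custom_share: float = 0.8, seed: int = 42) -> list:
    """Keep all custom rows, sample just enough generic rows, then shuffle."""
    rng = random.Random(seed)
    # Never ask for more generic rows than are available.
    k = min(generic_quota(len(custom), custom_share), len(generic))
    mix = list(custom) + rng.sample(list(generic), k)
    rng.shuffle(mix)
    return mix


print(generic_quota(4000))  # 4000 custom rows call for 1000 generic rows
```

Starting from the custom set and sizing the generic part around it keeps every scarce domain example in the mix, which seems to match the intent of the rule.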
```python
# === Curated TR SFT Mix Recipe ===
import numpy as np
from datasets import load_dataset, interleave_datasets

# Mix
tr_alpaca = load_dataset("malhajar/alpaca-gpt4-tr", split="train").select(range(20000))

oasst_tr = load_dataset("OpenAssistant/oasst1", split="train")
oasst_tr = oasst_tr.filter(lambda x: x["lang"] == "tr")
# Cap at 5000 rows, but do not fail if the TR subset is smaller
oasst_tr = oasst_tr.select(range(min(5000, len(oasst_tr))))

cosmos = load_dataset("cosmos/cosmos-llama-tr-instruct", split="train").select(range(8000))

# Custom domain (e.g. customer service)
custom_cs = load_dataset("user/custom-cs-tr", split="train").select(range(3000))

# NOTE: interleave_datasets requires all datasets to share the same columns,
# so map each one to a common schema before this step.

# Interleave with temperature τ=0.4: p_i ∝ n_i^τ
sizes = np.array([len(tr_alpaca), len(oasst_tr), len(cosmos), len(custom_cs)])
weights = (sizes ** 0.4) / (sizes ** 0.4).sum()

mixed = interleave_datasets(
    [tr_alpaca, oasst_tr, cosmos, custom_cs],
    probabilities=weights.tolist(),
    seed=42,
)
# Length depends on stopping_strategy; the default "first_exhausted"
# stops as soon as one source runs out.
print(f"Mixed: {len(mixed)}")

# SFT: Llama 3.1 8B QLoRA, 12 minutes
# Quality: TR-MMLU 39.8 (50K Alpaca alone) → 41.7 (curated mix)
```
curated TR SFT mix
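The τ=0.4 weighting in the recipe is temperature-scaled sampling: each source is drawn with probability proportional to its size raised to τ, which boosts small sources relative to plain proportional sampling. The dependency-free sketch below shows the effect for the four source sizes used above; the helper names are hypothetical, and the length estimate is only an approximation of what random interleaving produces.

```python
def temperature_weights(sizes: list[int], tau: float) -> list[float]:
    """p_i proportional to n_i ** tau; tau=1 is proportional, tau=0 is uniform."""
    scaled = [n ** tau for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


def expected_length_first_exhausted(sizes: list[int], weights: list[float]) -> float:
    """Approximate mix length when sampling stops as soon as one source runs out."""
    return min(n / w for n, w in zip(sizes, weights))


sizes = [20_000, 5_000, 8_000, 3_000]  # alpaca, oasst, cosmos, custom
for tau in (1.0, 0.4, 0.0):
    weights = temperature_weights(sizes, tau)
    print(tau, [round(w, 3) for w in weights],
          round(expected_length_first_exhausted(sizes, weights)))
```

With these sizes and τ=0.4, the 3,000-row custom source is the first to run out, so a "first exhausted" interleave ends at roughly 17.5K rows and leaves most of the Alpaca rows unused. If every row should appear at least once, the `stopping_strategy="all_exhausted"` option of `interleave_datasets` oversamples the smaller sources instead.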
✅ Deliverables
1) Build a curated 5K TR SFT mix.
2) Fine-tune the same model on 50K Alpaca-only vs. 5K curated data.
3) Compare the two on MT-Bench-TR.
4) Next lesson: 9.6, TR Models Reverse Engineering.
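For step 3, a per-question comparison is often more informative than a single average. The sketch below assumes each run yields one judge score per question ID (MT-Bench-style, 1 to 10); the function name and the score dictionaries are made up for illustration and are not real results.

```python
from statistics import mean


def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Compare two runs on the questions they share: mean scores plus win and tie rates."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no question IDs")
    wins = sum(candidate[q] > baseline[q] for q in shared)
    ties = sum(candidate[q] == baseline[q] for q in shared)
    return {
        "n": len(shared),
        "baseline_mean": mean(baseline[q] for q in shared),
        "candidate_mean": mean(candidate[q] for q in shared),
        "win_rate": wins / len(shared),
        "tie_rate": ties / len(shared),
    }


# Made-up scores, only to show the call shape.
alpaca_50k = {"q1": 6, "q2": 7, "q3": 5, "q4": 8}
curated_5k = {"q1": 7, "q2": 7, "q3": 6, "q4": 7}
print(compare_runs(alpaca_50k, curated_5k))
```

Keeping the base model, hyperparameters, and seed identical across the two fine-tuning runs makes the difference attributable to the data rather than to the training setup.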