How many training examples are enough?

Minimum 500, ideal 2000-5000. Diminishing returns above that. The more complex the domain, the more examples needed. For Turkish, 30% more examples are usually needed (more tokens per example).

Prompt Distillation: Transferring the Big Model's Prompt to a Small Model and a 95% Cost Reduction

You can transfer a complex prompt that works on Sonnet 4.6 to Haiku 4.5 via fine-tuning, achieving nearly the same quality at 95% lower cost. This lesson covers the distillation pipeline, eval setup, and break-even analysis.

Şükrü Yusuf KAYA

18 min read

5/14/2026

Advanced

🎓 "First we have the big model do the work, then we teach the small model to talk like it."

That sentence is the heart of distillation. First capture what Sonnet does on 1,000 examples, then fine-tune Haiku to produce the same outputs. The result: 95% cheaper at nearly the same quality.

Distillation: The General Idea

It is a teacher-student relationship:

TEACHER (Sonnet 4.6): expensive, high quality
   ↓
   Answers the data for one task (e.g. 1,000 examples)
   ↓
[(input, output) pairs]
   ↓
STUDENT (Haiku 4.5): cheap, fine-tunable
   ↓
   Learns to mimic the Teacher's outputs

In the end, the Student model delivers roughly the Teacher's performance on that specific task, while being cheaper and faster.

Which Tasks Are Suited to Distillation?

✅ Good candidates

Single-domain tasks (e-commerce classification, customer support routing)
Repetitive tasks (millions of predictions every day)
Specific output formats (JSON, classification labels)
Turkish-heavy workloads (a small model improves a lot on Turkish with fine-tuning)

❌ Poor candidates

General-purpose chatbots
Low-volume usage (the fine-tuning cost never amortizes)
Constantly changing prompts
Cutting-edge reasoning (beyond the small model's capability ceiling)

Distillation Pipeline: 5 Steps

Step 1: Data Collection

Collect examples from production logs or generate synthetic data:

# From production logs (production_logs: an iterable of dicts with these keys)
training_data = []
for log_entry in production_logs:
    training_data.append({
        "input": log_entry["user_message"],
        "teacher_output": log_entry["sonnet_response"],
    })

# Target: 1,000-10,000 examples
print(f"Examples collected: {len(training_data)}")

⚠️ PII check: covered in Module 4.5.

Step 2: Quality Filter

Not every Teacher output is of equal quality. Filter them:

def quality_filter(example):
    """Return True if the example is good enough to train on."""
    output = example["teacher_output"]
    if len(output) < 10:  # too short
        return False
    if "I cannot" in output:  # refusal
        return False
    if has_pii(output):  # PII; has_pii is your own detector (see Module 4.5)
        return False
    return True

filtered = [ex for ex in training_data if quality_filter(ex)]
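The filter calls `has_pii`, which this lesson does not define. As a rough stand-in, assuming only email addresses and phone-like digit runs matter, a regex check could look like the sketch below. The patterns are illustrative assumptions, not the Module 4.5 implementation, and they will miss many kinds of PII.

```python
import re

# Illustrative patterns only (assumptions, not a complete PII detector).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Ten or more digits, allowing spaces, dots, dashes, or parentheses between them.
PHONE_RE = re.compile(r"(?:\+?\d[\s().-]*){10,}")


def has_pii(text: str) -> bool:
    """Return True if the text contains an email address or a phone-like number."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))
```

For real traffic a dedicated PII detection tool is likely a safer choice than hand-written patterns.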

Step 3: Format as JSONL

import json

# OpenAI fine-tuning chat format
with open("training.jsonl", "w", encoding="utf-8") as f:
    for ex in filtered:
        record = {
            "messages": [
                {"role": "system", "content": "You are a customer support agent..."},
                {"role": "user", "content": ex["input"]},
                {"role": "assistant", "content": ex["teacher_output"]},
            ]
        }
        # ensure_ascii=False keeps Turkish characters readable in the file
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
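Before uploading, it can help to read the file back and confirm that every line parses and has the expected message roles. The following is a minimal sketch; `validate_jsonl` is a hypothetical helper name, and the strict system/user/assistant order is an assumption that matches the records written in this step.

```python
import json


def validate_jsonl(path: str) -> int:
    """Check every line of a chat-format JSONL file; return the record count."""
    expected_roles = ["system", "user", "assistant"]
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)  # raises ValueError on malformed JSON
            roles = [m["role"] for m in record["messages"]]
            if roles != expected_roles:
                raise ValueError(f"line {line_no}: unexpected roles {roles}")
            if not all(m["content"].strip() for m in record["messages"]):
                raise ValueError(f"line {line_no}: empty message content")
            count += 1
    return count
```

Calling `validate_jsonl("training.jsonl")` returns the number of records, which should equal the number of filtered examples.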

Step 4: Fine-tune

import openai

# OpenAI fine-tuning (for GPT-5-mini)
with open("training.jsonl", "rb") as f:
    training_file = openai.files.create(file=f, purpose="fine-tune")

job = openai.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-5-mini-2026-01-15",  # snapshot version
    hyperparameters={"n_epochs": 3},
)
# Wait for the job, monitor it, then read the fine-tuned model ID
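The waiting and monitoring step can be sketched as a small polling loop. This is a hedged sketch, not the only way to do it: `wait_for_job` is a hypothetical helper, and it takes the retrieval function as a parameter so it can be tested without network access. With the OpenAI Python SDK you would pass `openai.fine_tuning.jobs.retrieve`, whose job objects expose `status` and `fine_tuned_model`.

```python
import time
from typing import Any, Callable

# Statuses after which the job will not change any more.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(
    retrieve: Callable[[str], Any],
    job_id: str,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a fine-tuning job until it finishes; return the fine-tuned model ID."""
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    if job.status != "succeeded":
        raise RuntimeError(f"fine-tuning job {job_id} ended with status {job.status}")
    return job.fine_tuned_model
```

Usage would look like `model_id = wait_for_job(openai.fine_tuning.jobs.retrieve, job.id)`; the 30-second interval is an arbitrary default.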

Anthropic, Together, and Fireworks also support fine-tuning.

Step 5: Eval

import numpy as np

# Test set (100-200 examples, disjoint from the training set)
test_set = load_test_data()  # placeholder: your own loader

results = []
for ex in test_set:
    student_output = call_finetuned(ex["input"])  # placeholder: calls the fine-tuned model
    teacher_output = ex["teacher_output"]

    # Compare with LLM-as-judge
    score = judge_quality(student_output, teacher_output)  # placeholder: score in [0, 1]
    results.append(score)

avg_quality = np.mean(results)
print(f"Quality vs teacher: {avg_quality:.2f}")  # target >= 0.85
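The eval only means something if the test set never overlaps the training set. One way to guarantee that, sketched here under the assumption that each example is a dict with an "input" key as in Step 1, is to assign examples to the test set by hashing the input. The function names and the 5% fraction are illustrative choices, not part of the lesson.

```python
import hashlib


def is_test_example(user_input: str, test_fraction: float = 0.05) -> bool:
    """Deterministically assign an example to the test set from a hash of its input."""
    digest = hashlib.sha256(user_input.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < test_fraction


def split_examples(
    examples: list[dict], test_fraction: float = 0.05
) -> tuple[list[dict], list[dict]]:
    """Split into (train, test); identical inputs always land on the same side."""
    train, test = [], []
    for ex in examples:
        (test if is_test_example(ex["input"], test_fraction) else train).append(ex)
    return train, test
```

Because the assignment depends only on the input text, duplicates of the same user message cannot leak from training into the test set, and the split stays stable when new data is added later.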

Break-Even Analysis

Does distillation pay off? The math:

Cost

Item	Amount
Generating 5,000 examples from the Teacher	5K × 1,500 tokens × $15/M ≈ $112
Fine-tuning (GPT-5-mini, 5K examples)	5K × 1,500 tokens × $25/M ≈ $187
Inference cost (FT at $0.60/M input vs. $0.40/M standard)	+50%

Total upfront: ~$300

Savings

Scenario	Monthly
Before (Sonnet 4.6): 100K req × 3K tokens × $3/M	$900
After (FT GPT-5-mini): 100K req × 3K tokens × $0.60/M	$180
Monthly savings	$720

Break-even

$300 / $720 per month ≈ 0.42 months (about 12 to 13 days)

You are in profit after roughly the first two weeks. Every month after that nets $720.

Annual savings

$720 × 12 - $300 = $8,340 net in the first year, for just one feature.
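The same arithmetic can be wrapped in a small function so you can plug in your own volumes and prices. This is a sketch: `break_even` and its parameter names are hypothetical, and it assumes a single blended per-token price per model, as the tables in this section do.

```python
def break_even(
    upfront_cost: float,
    monthly_requests: int,
    tokens_per_request: int,
    teacher_price_per_m: float,
    student_price_per_m: float,
) -> dict[str, float]:
    """Compute monthly costs, savings, and break-even time for a distillation project."""
    monthly_tokens_m = monthly_requests * tokens_per_request / 1_000_000
    before = monthly_tokens_m * teacher_price_per_m
    after = monthly_tokens_m * student_price_per_m
    savings = before - after
    if savings <= 0:
        raise ValueError("the student is not cheaper than the teacher")
    return {
        "before": before,
        "after": after,
        "monthly_savings": savings,
        "break_even_months": upfront_cost / savings,
        "first_year_net": savings * 12 - upfront_cost,
        "reduction_pct": 100 * savings / before,
    }


# The lesson's scenario: $300 upfront, 100K requests of 3K tokens, $3/M vs. $0.60/M.
result = break_even(300, 100_000, 3_000, 3.00, 0.60)
print(result)
```

With the lesson's inputs this reproduces the $720 monthly savings, the roughly 0.42-month break-even, and the $8,340 first-year net; the monthly inference bill in this particular scenario drops by 80%.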

Limitations

1. Quality gap

The Student reaches roughly 85-95% of the Teacher's quality. If Sonnet 4.6 scores 95% accuracy, a fine-tuned Haiku may land at 85-90%. Is that acceptable in production? It depends on the use case.

2. Drift

Production data changes over time, and the fine-tuned model goes stale. Plan to retrain about every 6 months.
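A retrain trigger can combine the six-month rule with a quality check on recent traffic. The sketch below is one possible shape, with assumptions stated plainly: `should_retrain` is a hypothetical helper, 180 days approximates six months, the 0.85 threshold reuses the Step 5 target, and `recent_scores` is assumed to hold judge scores sampled from recent production requests.

```python
from datetime import date


def should_retrain(
    trained_on: date,
    today: date,
    recent_scores: list[float],
    max_age_days: int = 180,
    min_quality: float = 0.85,
) -> bool:
    """Flag a retrain when the model is older than max_age_days or quality has dropped."""
    too_old = (today - trained_on).days > max_age_days
    avg = sum(recent_scores) / len(recent_scores) if recent_scores else None
    degraded = avg is not None and avg < min_quality
    return too_old or degraded
```

An empty score list is treated as "no evidence of degradation", so only the age rule applies in that case.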

3. Multilingual

If you support both Turkish and English, build a balanced dataset. Otherwise the model may regress in one of the languages.
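One simple way to balance the dataset is to downsample every language to the size of the smallest one. The sketch assumes each example carries a "lang" field (for instance "tr" or "en"), which is not part of the Step 1 records and would have to be added by a language detector or taken from request metadata.

```python
import random
from collections import defaultdict


def balance_by_language(examples: list[dict], seed: int = 0) -> list[dict]:
    """Downsample so every language has as many examples as the rarest one."""
    by_lang: dict[str, list[dict]] = defaultdict(list)
    for ex in examples:
        by_lang[ex["lang"]].append(ex)  # "lang" is an assumed, hypothetical field
    if not by_lang:
        return []
    target = min(len(group) for group in by_lang.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_lang.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced
```

Downsampling throws data away; if the smaller language has too few examples, generating more Teacher outputs for it may be the better trade-off.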

4. Fine-tuning availability

Anthropic fine-tuning is currently open only to enterprise customers (as of Q2 2026). Without that access you cannot fine-tune an Anthropic model, so a closed-to-closed distillation with an Anthropic student (e.g. GPT-5 → Haiku) is not an option.

Alternatives: If You Cannot Distill

Alt 1: Prompt engineering only

Apply all the techniques from Module 5 and write a prompt tailored to the small model.

Alt 2: Self-hosted distillation

Distill between open-weight models: Llama 3.3 70B (teacher) → Llama 3.1 8B (student).

Alt 3: RAG-augmented small model

Small model + strong RAG = close to big-model performance, with no fine-tuning required.

▶️ Next lesson

6.5: Quality-Monitored Compression. The methodology behind the whole module: how do we measure the quality limit of each compression technique? LLM-as-judge framework, A/B testing, regression detection.

Frequently Asked Questions

For domain-specific tasks, yes (85-95% parity). For edge cases, I recommend a Teacher fallback: if the Student has low confidence, route the request to the Teacher. Module 8's cascade routing pattern uses exactly this logic.
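The fallback idea can be sketched in a few lines. This is not Module 8's implementation: `answer_with_fallback`, the callable signatures, and the 0.8 threshold are all assumptions, and how the Student's confidence is obtained (token log-probabilities, a separate classifier, or something else) is left open here.

```python
from typing import Callable


def answer_with_fallback(
    user_input: str,
    call_student: Callable[[str], tuple[str, float]],
    call_teacher: Callable[[str], str],
    min_confidence: float = 0.8,
) -> tuple[str, str]:
    """Return (answer, source); use the Teacher only when the Student is unsure."""
    answer, confidence = call_student(user_input)
    if confidence >= min_confidence:
        return answer, "student"
    return call_teacher(user_input), "teacher"
```

Logging the returned source lets you track what share of traffic still reaches the Teacher, which feeds directly into the break-even numbers.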
