Synthetic Data: Self-Instruct, Evol-Instruct, OSS-Instruct, MAGPIE (TR Adaptation)

Instruction data is scarce for Turkish (TR); the fix is synthetic generation. This lesson adapts Self-Instruct (Wang et al., 2022), Evol-Instruct (WizardLM), OSS-Instruct (Magicoder), and MAGPIE (2024) to Turkish, and covers the ethics of teacher model selection (GPT-4 ToS), prompt engineering, and automated quality control.

Şükrü Yusuf KAYA

36 min read

5/14/2026

Advanced


🎯 Goal

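Generate 100K high-quality synthetic instructions from 10K Turkish seed instructions. With a local teacher on an RTX 4090 (Llama 3.1 70B, served in QLoRA-style 4-bit quantization) this is feasible in about 24 hours; note that 4-bit 70B weights alone are roughly 35 GB, so a single 24 GB card needs CPU offloading or a lower-bit quantization. Cost: ~$3-5 if you use a cloud teacher; locally, only electricity.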
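A quick back-of-the-envelope check of the 24-hour budget can be sketched as follows. The 70% filter survival rate and the 250 generated tokens per example are illustrative assumptions, not measurements from this lesson; `required_throughput` is a hypothetical helper name.

```python
import math

def required_throughput(target_kept, hours, survival_rate, tokens_per_example):
    """Return (raw examples to generate, examples/s, generated tokens/s)."""
    # Examples lost to the quality filter have to be generated as well
    raw_needed = math.ceil(target_kept / survival_rate)
    seconds = hours * 3600
    examples_per_s = raw_needed / seconds
    tokens_per_s = examples_per_s * tokens_per_example
    return raw_needed, examples_per_s, tokens_per_s

# Assumed inputs: 100K kept examples, 24 h, 70% survival, 250 tokens per example
raw, ex_s, tok_s = required_throughput(
    target_kept=100_000, hours=24, survival_rate=0.70, tokens_per_example=250
)
print(f"raw examples: {raw}, {ex_s:.2f} examples/s, {tok_s:.0f} tokens/s")
```

Under these assumptions the teacher has to sustain roughly 1.65 examples per second, or a bit over 400 generated tokens per second in aggregate, which in practice points to batched generation rather than one request at a time.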

1. Self-Instruct (Wang et al. 2022)

Algorithm:

1. Start from 175 manually written seed instructions
2. Prompt the LLM: "Generate 8 new instructions similar to these"
3. Filter: (a) too similar → discard; (b) badly formatted → discard
4. The LLM generates a response for each instruction
5. Add to the pool and go back to step 2

Numbers from the 2022 paper: 175 seeds → 52K instructions. Stanford's Alpaca dataset (2023) applied the same recipe to the same 175 seeds and also contains 52K examples.

TR adaptation:

import random

seed_instructions_tr = [
    "İstanbul'un kuruluş tarihi nedir?",
    "Bir Python fonksiyonu nasıl yazılır?",
    "Aşağıdaki cümleyi düzelt: 'Yarın hava nasil olacak?'",
    # ... 175 examples (write them by hand and keep the quality high)
]

def generate_batch(teacher, pool, n=8):
    # Show the teacher up to 6 in-context examples drawn from the current pool
    examples = random.sample(pool, min(6, len(pool)))
    numbered = "\n".join(f"{i+1}. {ex}" for i, ex in enumerate(examples))
    # The prompt stays in Turkish so that the teacher answers in Turkish
    prompt = f"""Aşağıda {len(examples)} örnek Türkçe instruction var. Aynı kalitede, farklı konularda {n} yeni instruction üret. Çeşitlilik önemli.

{numbered}

Yeni instruction'lar:"""
    # teacher.generate and parse_instructions are assumed to be defined elsewhere
    output = teacher.generate(prompt, max_new_tokens=600, temperature=0.9)
    return parse_instructions(output)
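`generate_batch` relies on `parse_instructions`, and the algorithm needs a "too similar → discard" filter; the lesson shows neither. A minimal sketch of both follows. The Self-Instruct paper filters with ROUGE-L similarity below 0.7; here `difflib.SequenceMatcher` from the standard library is swapped in as a simpler stand-in, and the helper names `is_novel` and `add_novel` are hypothetical.

```python
import re
from difflib import SequenceMatcher

def parse_instructions(output):
    """Split a numbered list ("1. ...", "2) ...") into instruction strings."""
    instructions = []
    for line in output.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+)$", line)
        if match:
            text = match.group(1).strip()
            if text:
                instructions.append(text)
    return instructions

def is_novel(candidate, pool, threshold=0.7):
    """True if the candidate is not too similar to anything already in the pool."""
    cand = candidate.lower()
    return all(
        SequenceMatcher(None, cand, existing.lower()).ratio() < threshold
        for existing in pool
    )

def add_novel(candidates, pool, threshold=0.7):
    """Append novel candidates to the pool in place; return how many were added."""
    added = 0
    for cand in candidates:
        if is_novel(cand, pool, threshold):
            pool.append(cand)
            added += 1
    return added
```

Each candidate is compared against the whole pool, so the cost grows quadratically with pool size; for 100K instructions an approximate method such as MinHash would be the usual replacement.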

2. Evol-Instruct (WizardLM, Xu et al. 2023)

Evolve existing instructions into harder variants, either in breadth or in depth.

5 evolution operators:

Add Constraints: "Write it in Java" → "Use Java 17+ features, max 50 lines"
Deepening: "What is artificial intelligence?" → "Explain 3 subfields of artificial intelligence and the history of each"
Concretizing: "Write a story" → "A 500-word story about a day in the life of Ayşe, a 35-year-old teacher living in Istanbul"
Increased Reasoning: "X = 2 + 3" → "If X² + 5X = 26, what is X? Solve step by step"
Concrete Examples: "Write an algorithm" → "A Python algorithm for the following problem: ..."

import random

# The template and operator names stay in Turkish because they are sent to the teacher as-is
EVOL_PROMPT_TEMPLATE = """Aşağıdaki instruction'ı {operator} operatörü ile evolved versiyonuna dönüştür:

Original: {instruction}

Evolved (zor versiyon):"""

operators = ["constraint ekle", "derinleştir", "somutlaştır", "akıl yürütmeyi artır", "örnek ekle"]

evolved = []
for ins in original_pool:  # original_pool: list of instruction strings to evolve
    op = random.choice(operators)  # pick one evolution operator at random
    prompt = EVOL_PROMPT_TEMPLATE.format(operator=op, instruction=ins)
    new_ins = teacher.generate(prompt, max_new_tokens=200)
    evolved.append(new_ins.strip())
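The loop above runs a single round and keeps whatever the teacher returns. A sketch of multi-round evolution with a check for failed evolutions is given below. The WizardLM paper includes an elimination step for failed evolutions; the checks here are simplified, illustrative heuristics, not the paper's exact criteria. `generate_fn` is a hypothetical callable that takes a prompt string and returns the teacher's text.

```python
import random

OPERATORS = ["constraint ekle", "derinleştir", "somutlaştır",
             "akıl yürütmeyi artır", "örnek ekle"]
TEMPLATE = (
    "Aşağıdaki instruction'ı {operator} operatörü ile evolved versiyonuna dönüştür:\n\n"
    "Original: {instruction}\n\n"
    "Evolved (zor versiyon):"
)

def is_failed_evolution(original, evolved):
    """Heuristic: empty output, unchanged text, or leaked prompt scaffolding."""
    text = evolved.strip().lower()
    if not text:
        return True
    if text == original.strip().lower():
        return True
    return any(marker in text for marker in ("original:", "evolved ("))

def evolve_pool(pool, generate_fn, rounds=2, seed=0):
    """Evolve every instruction for several rounds; return accepted evolutions."""
    rng = random.Random(seed)  # seeded for reproducible operator choices
    current = list(pool)
    history = []
    for round_idx in range(1, rounds + 1):
        next_pool = []
        for ins in current:
            op = rng.choice(OPERATORS)
            prompt = TEMPLATE.format(operator=op, instruction=ins)
            evolved = generate_fn(prompt).strip()
            if is_failed_evolution(ins, evolved):
                next_pool.append(ins)  # keep the parent for the next round
            else:
                next_pool.append(evolved)
                history.append({"round": round_idx, "operator": op,
                                "parent": ins, "instruction": evolved})
        current = next_pool
    return history
```

Keeping the parent alongside each evolved instruction makes it possible to audit which operator produced which rewrite and to mix difficulty levels later.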

3. MAGPIE (Xu et al. 2024)

A prompt-free variant of Self-Instruct. The insight: when an aligned model (Llama-3-Instruct etc.) is given only the opening of its chat template, the text it continues with replays the kind of instructions it saw during training.

The trick: feed the model the tokens

<|start_header_id|>user<|end_header_id|>\n\n

and watch it generate a user prompt on its own.

def magpie_extract(model, tokenizer, n_samples=1000):
    pool = []
    pre_query_template = "<|start_header_id|>user<|end_header_id|>\n\n"
    # Tokenize the pre-query template once; it is identical for every sample
    inputs = tokenizer(pre_query_template, return_tensors="pt").to("cuda")
    prompt_len = inputs.input_ids.shape[1]

    for _ in range(n_samples):
        # The model completes the empty user turn, i.e. it writes the user prompt itself
        out = model.generate(
            **inputs,
            max_new_tokens=200, temperature=1.0, do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
        # Decode only the newly generated tokens and cut at the end-of-turn marker
        user_prompt = tokenizer.decode(out[0][prompt_len:]).split("<|eot_id|>")[0].strip()

        # Then get the answer with a normal generation pass
        full_prompt = pre_query_template + user_prompt + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        response = model.generate(...)  # arguments elided in the original; decode the output to text before storing it

        pool.append({"instruction": user_prompt, "response": response})
    return pool
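The response call in the listing above is left elided. The string handling around it can be sketched independently of any model: the version below takes a hypothetical `generate_fn(prompt) -> str` that returns only the newly generated text (for example a thin wrapper around a tokenizer and `model.generate`), so the two-step logic can be checked with a stub. `extract_turn`, `build_response_prompt`, and `magpie_pair` are illustrative names, not part of the MAGPIE codebase.

```python
PRE_QUERY = "<|start_header_id|>user<|end_header_id|>\n\n"
EOT = "<|eot_id|>"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"

def extract_turn(generated_text):
    """Keep only the text before the first end-of-turn marker."""
    return generated_text.split(EOT)[0].strip()

def build_response_prompt(user_prompt):
    """Close the user turn and open the assistant turn (Llama 3 chat format)."""
    return PRE_QUERY + user_prompt + EOT + ASSISTANT_HEADER

def magpie_pair(generate_fn):
    """Run the two MAGPIE steps; return None if either step yields empty text."""
    # Step 1: the model writes the user prompt itself
    user_prompt = extract_turn(generate_fn(PRE_QUERY))
    if not user_prompt:
        return None
    # Step 2: ordinary generation of the assistant answer
    response = extract_turn(generate_fn(build_response_prompt(user_prompt)))
    if not response:
        return None
    return {"instruction": user_prompt, "response": response}

# Stub standing in for a real model, only to show the call pattern
def stub_generate(prompt):
    if prompt == PRE_QUERY:
        return "Türkiye'nin başkenti neresidir?" + EOT
    return "Ankara." + EOT

print(magpie_pair(stub_generate))
```

Dropping empty turns at this stage keeps obviously broken samples out of the pool before the quality filter runs.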

MAGPIE's advantage: no manual seeds are needed. Llama-3-70B-Instruct + MAGPIE → a 300K-instruction dataset in 24 hours (8×H100). Quality is high (about 85% keepable after manual filtering).

4. Teacher Model Selection: Ethics & Law

Teacher	License	Can the synthetic dataset be published?
GPT-4 (OpenAI)	ToS forbids training a "competing model" on outputs	no (or private use only)
Claude (Anthropic)	Similar ToS restriction	no
Llama 3.x-Instruct	Llama Community License: derivatives OK (3.1 and later; attribution and naming conditions apply)	YES
Qwen 2.5-Instruct	Apache 2.0 (most sizes; the 3B and 72B checkpoints use separate Qwen licenses)	YES
Gemma 3-Instruct	Gemma Terms of Use: derivatives OK	YES
Mistral-Instruct (Apache)	Apache 2.0	YES
DeepSeek-R1	MIT	YES

The cookbook's rule: do not use GPT-4 or Claude as the teacher for a public dataset. Use Llama-3-70B, Qwen-72B, or DeepSeek-R1 instead; serving them locally in QLoRA-style 4-bit quantization makes this practical.

Bonus: for R1 distillation (Part XII), DeepSeek-R1's reasoning traces were released under the MIT license; the cookbook teaches that recipe.


# === Automated quality control loop ===
from collections import Counter

def filter_synthetic(example):
    instruction = example["instruction"]
    response = example["response"]

    # Heuristic length checks (in characters)
    if len(instruction) < 10: return False
    if len(response) < 20: return False
    if len(instruction) > 800: return False
    if len(response) > 4000: return False

    # Repetition (degenerate loops in the response)
    if has_repetition(response, n=3, max_repeat=5): return False

    # Refusal patterns (the teacher declined to answer); the strings stay in Turkish
    refusals = ["yapamam", "üzgünüm, bunu", "bu konuda yardımcı olamam"]
    if any(r in response.lower() for r in refusals): return False

    # Language check (must be Turkish); is_turkish is assumed to be defined elsewhere
    if not is_turkish(response): return False

    return True

def has_repetition(text, n=3, max_repeat=5):
    """Return True if any word n-gram occurs more than max_repeat times."""
    tokens = text.split()
    # Count every n-gram in one pass, including the last one in the text
    counts = Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
    return any(c > max_repeat for c in counts.values())

# Clean the pool (synthetic_pool: list of {"instruction", "response"} dicts)
clean_pool = [ex for ex in synthetic_pool if filter_synthetic(ex)]
print(f"Survival rate: {len(clean_pool)/max(len(synthetic_pool), 1):.1%}")
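The filter calls `is_turkish` without defining it. A minimal heuristic sketch is shown below, assuming that a cheap check is enough at this stage: it scores the share of words that are common Turkish function words or contain letters that are distinctive for Turkish (ğ, ı, ş). The word list and the 0.15 threshold are illustrative assumptions; a trained language-identification model would be the more robust choice.

```python
import re

# Letters that are rare outside Turkish (ç, ö, ü are shared with other languages)
DISTINCTIVE_LETTERS = set("ğış")
# Small, illustrative list of frequent Turkish function words
COMMON_WORDS = {"ve", "bir", "bu", "için", "ile", "de", "da", "ne",
                "gibi", "çok", "daha", "ama", "değil", "mi", "mı"}

def is_turkish(text, min_ratio=0.15):
    """Heuristic: share of words that look Turkish must reach min_ratio."""
    # Runs of Unicode letters; digits, punctuation, and underscores split words
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return False
    hits = sum(
        1 for w in words
        if w in COMMON_WORDS or any(ch in DISTINCTIVE_LETTERS for ch in w)
    )
    return hits / len(words) >= min_ratio

print(is_turkish("Yapay zeka, bilgisayarların insan gibi düşünmesini sağlayan bir alandır."))
print(is_turkish("Artificial intelligence is the simulation of human intelligence by machines."))
```

Closely related languages such as Azerbaijani will also pass this check, so it only guards against the most common failure, a teacher that drifts into English.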

automated quality filter

✅ Deliverables

1) Write 100 seed instructions and generate 1,000 with Llama-3-8B-Instruct + Self-Instruct.
2) Run the quality filter and keep what passes (about 70%).
3) Use Evol-Instruct to make 200 of the instructions harder.
4) Next lesson: 2.8, Data Mixing Math.
