Reasoning Trace Dataset Generation: Teacher Distillation + Self-Bootstrapping

Generating trace data for reasoning SFT: (a) Teacher distillation: collect traces through API calls to DeepSeek-R1 (MIT license!), Gemini-thinking, or o3; (b) Self-bootstrapping: a small model generates traces and a verifiable filter keeps only the correct ones; (c) Hybrid. Serving a Llama 3.1 70B teacher locally on an RTX 4090 and generating 10K traces (~24 hours).

Şükrü Yusuf KAYA

32 min read

26.06.2026

Advanced

python

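# === Collecting reasoning traces with DeepSeek-R1 ===
# Use the DeepSeek API or a local R1-distill model (8B/14B/32B/70B)
import json
import os
import re

from datasets import load_dataset
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],    # env var name is a placeholder
    base_url="https://api.deepseek.com/v1",
)

def get_r1_trace(problem):
    response = client.chat.completions.create(
        model="deepseek-reasoner",                 # R1
        messages=[
            {"role": "user", "content": problem},
        ],
        max_tokens=8192,
    )
    message = response.choices[0].message
    # The API returns the chain of thought in `reasoning_content` and the
    # final answer in `content`. Rebuild the <think>...</think>answer layout
    # that local R1 models emit, so both sources share one format.
    reasoning = getattr(message, "reasoning_content", None) or ""
    return f"<think>\n{reasoning}\n</think>\n\n{message.content}"

def math_grader(trace, gold):
    # Minimal grader: compare the last number after </think> with the
    # number that follows "####" in the GSM8K reference answer.
    gold_value = gold.split("####")[-1].strip().replace(",", "")
    answer = trace.split("</think>")[-1].replace(",", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer)
    if not numbers:
        return "incorrect"
    return "correct" if float(numbers[-1]) == float(gold_value) else "incorrect"

# Math problem dataset (GSM8K needs the "main" config name)
problems = load_dataset("openai/gsm8k", "main", split="train").select(range(2000))

traces = []
for p in problems:
    trace = get_r1_trace(p["question"])
    # Verifiable filter: math reward
    if math_grader(trace, p["answer"]) == "correct":
        traces.append({
            "question": p["question"],
            "response": trace,
            "gold": p["answer"],
        })

# 2000 problems → ~1700 correct traces (assuming R1 accuracy of ~85%)
# Cost: ~$10-20 (DeepSeek API)

# Save for SFT
with open("r1_traces_gsm8k.jsonl", "w", encoding="utf-8") as f:
    for t in traces:
        f.write(json.dumps(t, ensure_ascii=False) + "\n")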

Collecting traces with DeepSeek-R1
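Before SFT, the saved records usually need to be turned into chat-style examples. The following is a minimal sketch, assuming each record carries the `question` and `response` fields used above and that `response` follows the `<think>...</think>answer` layout. The helper names `split_trace` and `to_chat_example` and the `messages` schema are illustrative assumptions, not part of any specific trainer's API.

```python
import json
import re

# Matches "<think>reasoning</think>answer" across multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)

def split_trace(response: str) -> tuple[str, str]:
    """Split a trace into (reasoning, answer); reasoning is empty if no tags exist."""
    match = THINK_RE.search(response)
    if match is None:
        return "", response.strip()
    return match.group(1).strip(), match.group(2).strip()

def to_chat_example(record: dict) -> dict:
    """Turn one saved trace record into a chat-style SFT example (hypothetical schema)."""
    reasoning, answer = split_trace(record["response"])
    return {
        "messages": [
            {"role": "user", "content": record["question"]},
            {
                "role": "assistant",
                "content": f"<think>\n{reasoning}\n</think>\n\n{answer}",
            },
        ]
    }

# Toy record, made up for illustration.
record = {
    "question": "What is 2 + 3?",
    "response": "<think>2 + 3 = 5</think>The answer is 5.",
    "gold": "#### 5",
}
example = to_chat_example(record)
print(json.dumps(example, indent=2))
```

Keeping the reasoning inside the assistant turn means the fine-tuned model learns to emit its trace before the answer; whether to normalise whitespace this way is a design choice.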

1. Self-Bootstrapping Pattern

Generating traces without a teacher:

1. Sample N generations per problem from a small base model (e.g. Llama 3.1 8B) at a non-zero temperature
2. Filter every generation with a verifiable reward (math correctness, code execution)
3. Add the correct ones to the SFT data
4. Fine-tune on that data, then return to step 1 with the new model
5. Trace quality improves with each iteration (rejection sampling)
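The loop above can be sketched with the model calls abstracted away. This is a minimal sketch under stated assumptions: `generate`, `verify`, and `finetune` are hypothetical callables supplied by the caller, and the `toy_*` functions are stand-ins that only imitate a noisy model on addition problems. Keeping one verified trace per problem is also an assumption; keeping several is equally plausible.

```python
import random
import re

def bootstrap_round(problems, generate, verify, n_samples=4, temperature=0.8):
    """Steps 1-3: sample up to n_samples traces per problem, keep a verified one."""
    kept = []
    for problem in problems:
        for _ in range(n_samples):
            trace = generate(problem["question"], temperature)
            if verify(trace, problem["gold"]):
                kept.append({"question": problem["question"], "response": trace})
                break  # assumption: one verified trace per problem is enough
    return kept

def self_bootstrap(problems, generate, verify, finetune, rounds=3):
    """Steps 4-5: fine-tune on the kept traces, then sample again with the new model."""
    kept, kept_per_round = [], []
    for _ in range(rounds):
        kept = bootstrap_round(problems, generate, verify)
        kept_per_round.append(len(kept))
        generate = finetune(generate, kept)  # hypothetical: returns the updated sampler
    return kept, kept_per_round

# --- Toy stand-ins (not real models) ---
def toy_generate(question, temperature):
    a, b = (int(x) for x in re.findall(r"\d+", question))
    guess = a + b + random.choice([0, 0, 1, -1])  # temperature is ignored in this toy
    return f"<think>{a} + {b} = {guess}</think>{guess}"

def toy_verify(trace, gold):
    return trace.split("</think>")[-1].strip() == str(gold)

def toy_finetune(generate, kept):
    return generate  # no-op placeholder for a real SFT step

random.seed(0)
problems = [{"question": f"What is {i} + {i + 1}?", "gold": 2 * i + 1} for i in range(20)]
final_traces, kept_per_round = self_bootstrap(problems, toy_generate, toy_verify, toy_finetune)
print(kept_per_round)
```

In a real run, `finetune` would train on the kept traces and return a sampler backed by the new checkpoint, so the per-round counts give a rough signal of whether the iterations are helping.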

Advantages of self-bootstrapping: no external API, and no teacher-model licensing concerns. Disadvantage: if the initial model is weak, trace quality is low.
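One possible reading of the hybrid option is to try the small model first and call a teacher only for problems where no self-generated trace passes the filter. The sketch below illustrates that routing under stated assumptions: `hybrid_collect`, `weak_student`, `strong_teacher`, and `exact_match` are hypothetical names, and the two generators are toy stand-ins rather than real models.

```python
import re

def hybrid_collect(problems, student_generate, teacher_generate, verify, n_samples=4):
    """Prefer verified student traces; fall back to one teacher call per problem."""
    traces = []
    for problem in problems:
        question, gold = problem["question"], problem["gold"]
        chosen = None
        for _ in range(n_samples):
            candidate = student_generate(question)
            if verify(candidate, gold):
                chosen = ("student", candidate)
                break
        if chosen is None:
            candidate = teacher_generate(question)
            if verify(candidate, gold):
                chosen = ("teacher", candidate)
        if chosen is not None:  # problems nobody solves are dropped
            source, trace = chosen
            traces.append({"question": question, "response": trace, "source": source})
    return traces

# --- Toy stand-ins (not real models) ---
def weak_student(question):
    a, b = (int(x) for x in re.findall(r"\d+", question))
    answer = a + b if a + b < 10 else 0  # deliberately fails on larger sums
    return f"<think>{a} + {b}</think>{answer}"

def strong_teacher(question):
    a, b = (int(x) for x in re.findall(r"\d+", question))
    return f"<think>{a} + {b} = {a + b}</think>{a + b}"

def exact_match(trace, gold):
    return trace.split("</think>")[-1].strip() == str(gold)

problems = [
    {"question": "What is 2 + 3?", "gold": 5},
    {"question": "What is 40 + 2?", "gold": 42},
]
result = hybrid_collect(problems, weak_student, strong_teacher, exact_match)
print([t["source"] for t in result])  # → ['student', 'teacher']
```

Recording the `source` field makes it possible to check later how much of the dataset depends on the teacher.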

✅ Deliverable