Speculative Decoding in Production: Draft + Target Pairing and Accept-Rate Measurement
Speculative decoding (Leviathan et al. 2023; Chen et al. 2023): a small draft model predicts 4-8 tokens and the target model **verifies** them in a single forward pass. A high accept rate yields 2-3x throughput. Also covers EAGLE-2 (Li et al. 2024) and MEDUSA head training. Llama 3.1 8B target + Llama 3.2 1B draft on an RTX 4090: 175 → 290 tok/s.
Şükrü Yusuf KAYA
30 min read
Advanced
1. Speculative Decoding Mechanism#
Classic decoding: the target generates one token per forward pass. Each pass costs O(N × layers × hidden²), so N tokens require N full forwards.

Speculative decoding:
- Draft model (small, 10-20x smaller than the target): cheaply predicts 4-8 candidate tokens.
- Target model (large): VERIFIES those 4-8 tokens in a single forward pass.
- If accepted: 4-8 tokens come out of one target forward → speedup.
- If rejected: generation resumes from the last token the target accepted; the target's own prediction replaces the first mismatched draft token.
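The propose/verify loop above can be sketched with toy stand-in "models" (hypothetical pure-Python functions, not real LLMs; the draft is simulated to agree with the target ~80% of the time):

```python
import random

random.seed(0)

# Toy stand-ins: each maps a context (list of ints) to a next token.
def target_next(ctx):
    return (sum(ctx) * 31 + len(ctx)) % 100

def draft_next(ctx):
    # The draft matches the target ~80% of the time (simulated accept rate).
    t = target_next(ctx)
    return t if random.random() < 0.8 else (t + 1) % 100

def speculative_step(ctx, k=5):
    """Draft proposes k tokens; the target verifies them in one 'forward'.

    Returns the tokens kept this step (always >= 1: on a mismatch the
    target's own token replaces the first wrong draft token, then we stop).
    """
    proposal, c = [], list(ctx)
    for _ in range(k):
        tok = draft_next(c)
        proposal.append(tok)
        c.append(tok)
    kept, c = [], list(ctx)
    for tok in proposal:
        t = target_next(c)           # target's prediction for this position
        if tok == t:
            kept.append(tok)
            c.append(tok)
        else:
            kept.append(t)           # correct the first mismatch
            break
    return kept

ctx, total, steps = [1, 2, 3], 0, 0
while total < 100:
    kept = speculative_step(ctx)
    ctx += kept
    total += len(kept)
    steps += 1                       # one target forward per step
print(f"{total} tokens in {steps} target forwards "
      f"({total/steps:.1f} tok/forward)")
```

With an 80% per-token agreement rate and 5 draft tokens, the loop emits roughly 3-3.5 tokens per target forward instead of 1.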
Speedup formula:
speedup ≈ E[accepted_tokens] / (1 + ε)
ε = relative draft-model inference overhead (~5%).
At an 80% accept rate: speedup ≈ 3-4x, with the target's forward cost amortized over several tokens per pass.
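The expected-token term can be made concrete. Under the i.i.d.-acceptance model of Leviathan et al. (2023), with per-token accept probability α and k draft tokens, E[tokens per target forward] = (1 − α^(k+1)) / (1 − α); dividing by (1 + ε) gives the speedup from the formula above:

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per target forward (Leviathan et al. 2023):
    k draft tokens each accepted i.i.d. with prob alpha, plus one
    corrected/bonus token."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha, k, eps=0.05):
    # eps = relative draft overhead (~5%, as in the text)
    return expected_tokens(alpha, k) / (1 + eps)

for alpha in (0.65, 0.72, 0.80):
    print(f"alpha={alpha:.2f}, k=5 -> "
          f"E[tokens]={expected_tokens(alpha, 5):.2f}, "
          f"speedup~{speedup(alpha, 5):.2f}x")
```

At α = 0.80 and k = 5 this gives E[tokens] ≈ 3.69 and speedup ≈ 3.5x, consistent with the "3-4x" figure above.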
2. Choosing a Draft Model#
| Target | Recommended draft | Accept rate (Turkish text) |
|---|---|---|
| Llama 3.1 70B | Llama 3.2 3B or 1B | ~75% |
| Llama 3.1 8B | Llama 3.2 1B | ~72% |
| Qwen 2.5 32B | Qwen 2.5 1.5B | ~80% |
| Mixtral 8×7B | Mistral 7B v0.3 | ~65% |
Rule: the draft model must be trained with the same tokenizer as the target; otherwise token boundaries diverge and the accept rate drops.
```python
# === Speculative decoding with vLLM ===
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,       # draft predicts 5 tokens per step
    gpu_memory_utilization=0.92,
)

outputs = llm.generate(
    ["Türkçe metin..."],
    SamplingParams(temperature=0.7, max_tokens=500),
)

# Bench (RTX 4090):
# - Target only (Llama 8B AWQ): 175 tok/s
# - With Llama 3.2 1B draft:    290 tok/s (+66%)
# - With EAGLE-2:               340 tok/s (+94%)
```
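Beyond reading vLLM's logged speculative metrics, the accept rate itself is well-defined math: a draft token x ~ q is accepted with probability min(1, p(x)/q(x)), so the expected accept rate equals Σ min(p, q). A minimal sketch with toy hand-picked distributions (not real model outputs) verifies this empirically:

```python
import random

random.seed(42)

# Toy next-token distributions over a 4-token vocab (illustrative values):
p = [0.5, 0.3, 0.15, 0.05]   # target
q = [0.4, 0.4, 0.15, 0.05]   # draft

def sample(dist):
    r, acc = random.random(), 0.0
    for i, pr in enumerate(dist):
        acc += pr
        if r < acc:
            return i
    return len(dist) - 1

trials, acc_count = 10_000, 0
for _ in range(trials):
    x = sample(q)                               # draft proposes x ~ q
    if random.random() < min(1.0, p[x] / q[x]):
        acc_count += 1                          # target accepts
    # (on reject, speculative sampling resamples from norm(max(p - q, 0)),
    #  which keeps the output distribution exactly equal to p)

theory = sum(min(pi, qi) for pi, qi in zip(p, q))   # = 1 - TV(p, q)
print(f"empirical accept rate: {acc_count/trials:.3f} (theory {theory:.3f})")
```

The takeaway: accept rate = 1 minus the total-variation distance between the draft and target distributions, which is why a well-matched draft (same tokenizer, same family) is the main lever.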
3. EAGLE-2 + MEDUSA — Heads Instead of a Draft Model#
Classic speculative decoding needs a separate draft model: ~1B extra parameters, extra GPU memory, and a separate inference pass.
EAGLE-2 (Li et al. 2024): a small head on top of the target model's hidden states (hidden_state → next-token logits). No draft model needed!
MEDUSA (Cai et al. 2024): adds N extra lm_heads to the target model, each predicting the token at a different future position.
Training: the target model stays frozen; only the extra heads are trained. In the cookbook setup, training MEDUSA heads for Llama 8B takes 2-3 hours on an RTX 4090.
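The MEDUSA idea can be sketched in a few lines of PyTorch. This is a toy-dimension illustration (hidden=64, vocab=1000 are placeholders; real MEDUSA on Llama 3.1 8B uses hidden=4096 and the full vocab, and the paper uses SiLU residual blocks, simplified to ReLU here):

```python
import torch
import torch.nn as nn

hidden, vocab, n_heads = 64, 1000, 4   # toy sizes for illustration

class MedusaHead(nn.Module):
    """One extra head: residual block + lm_head, in the spirit of Cai et al. 2024."""
    def __init__(self, hidden, vocab):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, h):
        # Residual block over the target's hidden state, then project to vocab.
        return self.lm_head(h + torch.relu(self.proj(h)))

heads = nn.ModuleList(MedusaHead(hidden, vocab) for _ in range(n_heads))

# The frozen target produces hidden states; only the heads get gradients.
h = torch.randn(2, 7, hidden)            # (batch, seq, hidden), stand-in
logits = [head(h) for head in heads]     # head i predicts the token at pos +i+1
print([tuple(l.shape) for l in logits])  # n_heads tensors of (2, 7, vocab)
```

Because the target is frozen and each head is just two linear layers, the trainable parameter count is tiny compared to the base model, which is what makes the 2-3 hour single-GPU training budget plausible.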
✅ Deliverables
1. Enable speculative decoding in vLLM (Llama 8B target + 1B draft).
2. Measure the accept rate.
3. Next lesson: 15.9 — Disaggregated Serving (Prefill/Decode).