Speculative Decoding in Production: Draft + Target Pairing and Accept-Rate Measurement
Speculative decoding (Leviathan et al. 2023; Chen et al. 2023): a small draft model predicts 4-8 tokens and the target model **verifies** them in a single forward pass. A high accept rate yields 2-3x throughput. Also covers EAGLE-2 (Li et al. 2024) and MEDUSA head training. Llama 3.1 8B target + Llama 3.2 1B draft on an RTX 4090: 175 → 290 tok/s.
Şükrü Yusuf KAYA
30 min read
Advanced
1. Speculative Decoding Mechanism#
Classic decoding: the target generates one token per forward pass. Each pass costs O(N × layers × hidden²), so N tokens require N full forwards.

Speculative decoding:
- Draft model (small, 10-20x smaller than the target): cheaply predicts 4-8 candidate tokens.
- Target model (large): VERIFIES those 4-8 tokens in a single forward pass.
- If accepted: 4-8 tokens come out of one target forward → speedup.
- If rejected: generation resumes from the last token the target accepted; the target's own prediction replaces the first mismatched draft token.
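The propose/verify loop above can be sketched with toy stand-in "models" (hypothetical pure-Python functions, not real LLMs; the draft is simulated to agree with the target ~80% of the time):

```python
import random

random.seed(0)

# Toy stand-ins: each maps a context (list of ints) to a next token.
def target_next(ctx):
    return (sum(ctx) * 31 + len(ctx)) % 100

def draft_next(ctx):
    # The draft matches the target ~80% of the time (simulated accept rate).
    t = target_next(ctx)
    return t if random.random() < 0.8 else (t + 1) % 100

def speculative_step(ctx, k=5):
    """Draft proposes k tokens; the target verifies them in one 'forward'.

    Returns the tokens kept this step (always >= 1: on a mismatch the
    target's own token replaces the first wrong draft token, then we stop).
    """
    proposal, c = [], list(ctx)
    for _ in range(k):
        tok = draft_next(c)
        proposal.append(tok)
        c.append(tok)
    kept, c = [], list(ctx)
    for tok in proposal:
        t = target_next(c)           # target's prediction for this position
        if tok == t:
            kept.append(tok)
            c.append(tok)
        else:
            kept.append(t)           # correct the first mismatch
            break
    return kept

ctx, total, steps = [1, 2, 3], 0, 0
while total < 100:
    kept = speculative_step(ctx)
    ctx += kept
    total += len(kept)
    steps += 1                       # one target forward per step
print(f"{total} tokens in {steps} target forwards "
      f"({total/steps:.1f} tok/forward)")
```

With an 80% per-token agreement rate and 5 draft tokens, the loop emits roughly 3-3.5 tokens per target forward instead of 1.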
Speedup formula:
speedup ≈ E[accepted_tokens] / (1 + ε)
ε = relative draft-model inference overhead (~5%).
At an 80% accept rate: speedup ≈ 3-4x, with the target's forward cost amortized over several tokens per pass.
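The expected-token term can be made concrete. Under the i.i.d.-acceptance model of Leviathan et al. (2023), with per-token accept probability α and k draft tokens, E[tokens per target forward] = (1 − α^(k+1)) / (1 − α); dividing by (1 + ε) gives the speedup from the formula above:

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per target forward (Leviathan et al. 2023):
    k draft tokens each accepted i.i.d. with prob alpha, plus one
    corrected/bonus token."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha, k, eps=0.05):
    # eps = relative draft overhead (~5%, as in the text)
    return expected_tokens(alpha, k) / (1 + eps)

for alpha in (0.65, 0.72, 0.80):
    print(f"alpha={alpha:.2f}, k=5 -> "
          f"E[tokens]={expected_tokens(alpha, 5):.2f}, "
          f"speedup~{speedup(alpha, 5):.2f}x")
```

At α = 0.80 and k = 5 this gives E[tokens] ≈ 3.69 and speedup ≈ 3.5x, consistent with the "3-4x" figure above.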
2. Choosing a Draft Model#
| Target | Recommended draft | Accept rate (Turkish text) |
|---|---|---|
| Llama 3.1 70B | Llama 3.2 3B or 1B | ~75% |
| Llama 3.1 8B | Llama 3.2 1B | ~72% |
| Qwen 2.5 32B | Qwen 2.5 1.5B | ~80% |
| Mixtral 8×7B | Mistral 7B v0.3 | ~65% |
Rule: the draft model must be trained with the same tokenizer as the target; otherwise token boundaries diverge and the accept rate drops.
```python
# === Speculative decoding with vLLM ===
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,       # draft predicts 5 tokens per step
    gpu_memory_utilization=0.92,
)

outputs = llm.generate(
    ["Türkçe metin..."],
    SamplingParams(temperature=0.7, max_tokens=500),
)

# Bench (RTX 4090):
# - Target only (Llama 8B AWQ): 175 tok/s
# - With Llama 3.2 1B draft:    290 tok/s (+66%)
# - With EAGLE-2:               340 tok/s (+94%)
```
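Beyond reading vLLM's logged speculative metrics, the accept rate itself is well-defined math: a draft token x ~ q is accepted with probability min(1, p(x)/q(x)), so the expected accept rate equals Σ min(p, q). A minimal sketch with toy hand-picked distributions (not real model outputs) verifies this empirically:

```python
import random

random.seed(42)

# Toy next-token distributions over a 4-token vocab (illustrative values):
p = [0.5, 0.3, 0.15, 0.05]   # target
q = [0.4, 0.4, 0.15, 0.05]   # draft

def sample(dist):
    r, acc = random.random(), 0.0
    for i, pr in enumerate(dist):
        acc += pr
        if r < acc:
            return i
    return len(dist) - 1

trials, acc_count = 10_000, 0
for _ in range(trials):
    x = sample(q)                               # draft proposes x ~ q
    if random.random() < min(1.0, p[x] / q[x]):
        acc_count += 1                          # target accepts
    # (on reject, speculative sampling resamples from norm(max(p - q, 0)),
    #  which keeps the output distribution exactly equal to p)

theory = sum(min(pi, qi) for pi, qi in zip(p, q))   # = 1 - TV(p, q)
print(f"empirical accept rate: {acc_count/trials:.3f} (theory {theory:.3f})")
```

The takeaway: accept rate = 1 minus the total-variation distance between the draft and target distributions, which is why a well-matched draft (same tokenizer, same family) is the main lever.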
3. EAGLE-2 + MEDUSA — Heads Instead of a Draft Model#
Classic speculative decoding needs a separate draft model: ~1B extra parameters, extra GPU memory, and a separate inference pass.
EAGLE-2 (Li et al. 2024): a small head on top of the target model's hidden states (hidden_state → next-token logits). No draft model needed!
MEDUSA (Cai et al. 2024): adds N extra lm_heads to the target model, each predicting the token at a different future position.
Training: the target model stays frozen; only the extra heads are trained. In the cookbook setup, training MEDUSA heads for Llama 8B takes 2-3 hours on an RTX 4090.
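The MEDUSA idea can be sketched in a few lines of PyTorch. This is a toy-dimension illustration (hidden=64, vocab=1000 are placeholders; real MEDUSA on Llama 3.1 8B uses hidden=4096 and the full vocab, and the paper uses SiLU residual blocks, simplified to ReLU here):

```python
import torch
import torch.nn as nn

hidden, vocab, n_heads = 64, 1000, 4   # toy sizes for illustration

class MedusaHead(nn.Module):
    """One extra head: residual block + lm_head, in the spirit of Cai et al. 2024."""
    def __init__(self, hidden, vocab):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, h):
        # Residual block over the target's hidden state, then project to vocab.
        return self.lm_head(h + torch.relu(self.proj(h)))

heads = nn.ModuleList(MedusaHead(hidden, vocab) for _ in range(n_heads))

# The frozen target produces hidden states; only the heads get gradients.
h = torch.randn(2, 7, hidden)            # (batch, seq, hidden), stand-in
logits = [head(h) for head in heads]     # head i predicts the token at pos +i+1
print([tuple(l.shape) for l in logits])  # n_heads tensors of (2, 7, vocab)
```

Because the target is frozen and each head is just two linear layers, the trainable parameter count is tiny compared to the base model, which is what makes the 2-3 hour single-GPU training budget plausible.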
✅ Deliverables
1. Enable speculative decoding in vLLM (Llama 8B target + 1B draft).
2. Measure the accept rate.
3. Next lesson: 15.9 — Disaggregated Serving (Prefill/Decode).