vLLM Internals: Continuous Batching + PagedAttention + Prefix Cache

vLLM (Kwon et al. 2023) — production LLM serving'in altın standardı. Continuous batching: yeni request'ler batch'e dinamik eklenir, finished olanlar çıkarılır → GPU idle bitti. PagedAttention: KV-cache'i fixed-size block'larda yönet → fragmentation %0. Prefix cache: common system prompt'lar tekrar hesaplanmaz. RTX 4090'da Llama 3.1 8B serving (175 tok/s batch=1, 920 tok/s batch=16).

Şükrü Yusuf KAYA

38 dakikalık okuma

14.05.2026

İleri

vLLM Internals: Continuous Batching + PagedAttention + Prefix Cache

1. vLLM'in 3 Çekirdek İnovasyonu#

İnovasyon	Klasik HF generate	vLLM
Batching	Static (batch size fixed during decode)	Continuous (dynamic add/remove)
KV-cache memory	Contiguous allocation	Paged (fixed-size blocks)
Prefix sharing	Yok (her request her şeyi tekrar hesaplar)	Prefix cache (common prompt cache'lenir)

Sonuç: Aynı GPU üzerinde 2-24x daha yüksek throughput, 2-5x daha düşük latency.

2. Continuous Batching#

Klasik HF: 8 prompt geldi, hepsini birden 'batch' yap, en uzun cevabı bekleyene kadar diğerleri idle.

vLLM: her prompt iteration'ında scheduler:

Decode step her aktif request için 1 token üret
Finished request'leri batch'ten çıkar
Yeni request varsa batch'e dahil et (prefill phase)
GPU her zaman dolu

t=0:  [req1: prefill, req2: prefill, req3: prefill]
t=1:  [req1: decode, req2: decode, req3: decode]
t=2:  [req1: decode, req2: FINISHED → out, req3: decode, req4: prefill]
t=3:  [req1: decode, req3: decode, req4: decode]
...

3. PagedAttention — KV-Cache Memory Management#

KV-cache size =

batch × seq_len × layers × kv_heads × head_dim × 2 × 2 bytes (bf16)

Llama 3.1 8B, batch=16, seq=2048:

KV = 16 × 2048 × 32 × 8 × 128 × 2 × 2 = 8.5 GB

Klasik: her request için ayrı contiguous tensor → fragmentation: %30-40 boşa giden bellek.

PagedAttention (vLLM):

KV-cache'i 16-token block'lara böl (sayfa boyutu)
Her request için logical → physical block table
Memory page-able, fragmentation %0
16-block birden share edilebilir (prefix cache)

Sonuç: Aynı 24 GB GPU'da klasik 50 request, vLLM 150-200 request paralel.

4. Prefix Cache — Common System Prompt Tasarrufu#

System prompt 500 token. 1000 request hep aynı system prompt'u kullanır.

Klasik: her request 500-token prefill (1000 × 500 = 500K flop)
vLLM prefix cache: 1 kez compute, 1000 kez reuse (500 × 1 = 500 flop, 1000x tasarruf)

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,                # 🔑 KEY
    gpu_memory_utilization=0.92,
)

Cookbook'un kuralı: Sistem prompt'un sabitse veya birkaç template varsa mutlaka aktif et — %30-90 prefill cost azalır.

python

# === vLLM ile Llama 3.1 8B serve ===
# Komut satırı
# vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
#     --host 0.0.0.0 --port 8000 \
#     --max-model-len 8192 \
#     --gpu-memory-utilization 0.92 \
#     --enable-prefix-caching \
#     --quantization awq \          # AWQ quantized version varsa
#     --dtype bfloat16 \
#     --enforce-eager false           # CUDA graph for speed
 
# Python embeddings içinde
from vllm import LLM, SamplingParams
 
llm = LLM(
    model="llama-3.1-8b-int4-awq",                # quantized merged model
    quantization="awq",
    dtype="bfloat16",
    gpu_memory_utilization=0.92,
    max_model_len=8192,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,                   # büyük prompt'ları chunk'la
)
 
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=500)
outputs = llm.generate(
    ["Soru 1...", "Soru 2...", "Soru 3..."],
    sampling,
)
for out in outputs:
    print(out.outputs[0].text)

vLLM serving — production config

5. RTX 4090 + Llama 3.1 8B Throughput#

Config	tok/s (batch=1)	tok/s (batch=16)	tok/s (batch=64)	Memory
HF transformers .generate	35	85	OOM	naïve
vLLM bf16	95	540	1240	%92 GPU
vLLM AWQ int4	175	920	2150	%75 GPU
vLLM FP8	155	1080	2520	%85 GPU
+ prefix cache (sabit prompt)	+%30-90	+%50-90	+%70-90	—

Cookbook'un default'u: AWQ int4 + prefix cache + bf16 fallback.

✅ Teslim

Llama 3.1 8B AWQ'yu vLLM ile serve et. 2) batch=1, 8, 32 throughput karşılaştır. 3) Sonraki ders: 15.2 — LoRA Hot-Swap Multiplexing.

Yorumlar & Soru-Cevap

(0)

Yorum yazmak için giriş yap.

Yorumlar yükleniyor...

İlgili İçerikler

Part 0 — Engineering Foundations

vLLM Internals: Continuous Batching + PagedAttention + Prefix Cache

1. vLLM'in 3 Çekirdek İnovasyonu#

2. Continuous Batching#

3. PagedAttention — KV-Cache Memory Management#

4. Prefix Cache — Common System Prompt Tasarrufu#

5. RTX 4090 + Llama 3.1 8B Throughput#

Yorumlar & Soru-Cevap

İlgili İçerikler

Fine-Tuning Cookbook'a Hoş Geldin: Sistematik, Stage Taksonomisi ve Reproducibility Kontratı

Reproducibility Stack: Seeds, cuDNN Flags ve Deterministic CUDA — 'Sende Niye Çalışıyor Bende Çalışmıyor' Sorununu Bitir

Environment Pinning: uv + pyproject.toml, CUDA Version Matrix ve Container Reçeteleri

Bültenime Abone Olun