
vLLM Internals: Continuous Batching + PagedAttention + Prefix Cache

vLLM (Kwon et al. 2023) is the gold standard of production LLM serving. Continuous batching: requests are added and removed dynamically → no more GPU idle time. PagedAttention: the KV-cache is managed in fixed-size blocks → 0% fragmentation. Prefix cache: common system prompts are not recomputed. Reference setup: Llama 3.1 8B served on an RTX 4090 (175 tok/s at batch=1, 920 tok/s at batch=16).

Şükrü Yusuf KAYA
38 min read
Advanced

1. vLLM's 3 Core Innovations#

| Innovation | Classic HF `generate` | vLLM |
|---|---|---|
| Batching | Static (batch size fixed during decode) | Continuous (dynamic add/remove) |
| KV-cache memory | Contiguous allocation | Paged (fixed-size blocks) |
| Prefix sharing | None (every request recomputes everything) | Prefix cache (common prompts cached once) |

Result: 2-24x higher throughput and 2-5x lower latency on the same GPU.

2. Continuous Batching#

Classic HF: 8 prompts arrive, they are batched together as one static batch, and everyone sits idle until the longest response finishes.
vLLM: at every iteration the scheduler (see the toy loop after the timeline below):
  1. Decode step: generate 1 token for each active request
  2. Remove finished requests from the batch
  3. Add any newly arrived requests to the batch (prefill phase)
  4. The GPU stays full at all times
t=0: [req1: prefill, req2: prefill, req3: prefill]
t=1: [req1: decode, req2: decode, req3: decode]
t=2: [req1: decode, req2: FINISHED → out, req3: decode, req4: prefill]
t=3: [req1: decode, req3: decode, req4: decode]
...
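A toy Python sketch of this loop (illustrative only: the request dicts, `waiting` queue, and `max_batch` cap are invented here; vLLM's real scheduler also gates admission on free KV blocks):

```python
from collections import deque

# Toy continuous-batching loop. Each request tracks how many tokens it still needs.
waiting = deque([{"id": i, "remaining": n} for i, n in enumerate([3, 1, 2, 4])])
active, max_batch = [], 3

t = 0
while waiting or active:
    # 3. Admit new requests into the batch (prefill phase)
    while waiting and len(active) < max_batch:
        active.append(waiting.popleft())
    # 1. Decode step: one token per active request
    for req in active:
        req["remaining"] -= 1
    # 2. Evict finished requests immediately; their slots free up this very step
    finished = [r for r in active if r["remaining"] == 0]
    active = [r for r in active if r["remaining"] > 0]
    print(f"t={t}: active={[r['id'] for r in active]}, done={[r['id'] for r in finished]}")
    t += 1
```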

3. PagedAttention — KV-Cache Memory Management#

KV-cache size =
batch × seq_len × layers × kv_heads × head_dim × 2 (K and V) × 2 bytes (bf16)
Llama 3.1 8B, batch=16, seq=2048:
KV = 16 × 2048 × 32 × 8 × 128 × 2 × 2 ≈ 4.3 GB
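A quick sanity check of this arithmetic in Python (the layer count, KV-head count, and head dim are Llama 3.1 8B's published config):

```python
# KV-cache bytes = batch × seq_len × layers × kv_heads × head_dim × 2 (K and V) × bytes/value
def kv_cache_bytes(batch, seq_len, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    return batch * seq_len * layers * kv_heads * head_dim * 2 * bytes_per_value

print(f"{kv_cache_bytes(16, 2048) / 1e9:.1f} GB")  # 4.3 GB in bf16 (2 bytes/value)
```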
Classic: a separate contiguous tensor per request → fragmentation: 30-40% of memory wasted.
PagedAttention (vLLM), sketched in code after this list:
  • Split the KV-cache into 16-token blocks (the page size)
  • Keep a logical → physical block table per request
  • Memory becomes page-able; fragmentation drops to 0%
  • Blocks can be shared across requests (prefix cache)
Result: on the same 24 GB GPU, the classic approach runs ~50 requests in parallel; vLLM runs 150-200.
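A toy sketch of the logical → physical block-table idea (names like `append_token` are invented for illustration; this is not vLLM's internal API):

```python
BLOCK_SIZE = 16  # tokens per KV block: vLLM's default page size

# Toy allocator: one shared pool of physical blocks, one block table per request.
free_blocks = list(range(1024))  # physical block ids in the global KV pool
block_tables = {}                # request id -> list of physical block ids

def append_token(req_id, token_pos):
    """Allocate a new physical block only when a request crosses a block boundary."""
    table = block_tables.setdefault(req_id, [])
    if token_pos % BLOCK_SIZE == 0:        # new logical block: grab any free physical block
        table.append(free_blocks.pop())
    return table[token_pos // BLOCK_SIZE]  # physical block holding this token's KV

def free_request(req_id):
    """On completion, the request's blocks go straight back to the pool: no fragmentation."""
    free_blocks.extend(block_tables.pop(req_id))

for pos in range(40):  # 40 tokens -> 3 blocks (16 + 16 + 8), no contiguous reservation
    append_token("req1", pos)
print(block_tables["req1"])  # e.g. [1023, 1022, 1021]
free_request("req1")
```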

4. Prefix Cache — Saving Work on Common System Prompts#

Say the system prompt is 500 tokens and 1000 requests all share it.
  • Classic: every request prefills the full 500 tokens (1000 × 500 = 500K prefill tokens)
  • vLLM prefix cache: compute once, reuse 1000 times (500 prefill tokens total, ~1000x savings)
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,  # 🔑 KEY
    gpu_memory_utilization=0.92,
)
```
The Cookbook's rule: if your system prompt is fixed, or you rotate between a few templates, always enable this; it cuts prefill cost by 30-90%.
```python
# === Serving Llama 3.1 8B with vLLM ===
# Command line:
# vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
#     --host 0.0.0.0 --port 8000 \
#     --max-model-len 8192 \
#     --gpu-memory-utilization 0.92 \
#     --enable-prefix-caching \
#     --quantization awq \
#     --dtype bfloat16
# (pass --quantization awq only if an AWQ-quantized checkpoint exists;
#  CUDA graphs are on by default, add --enforce-eager only to disable them)

# The same config from Python:
from vllm import LLM, SamplingParams

llm = LLM(
    model="llama-3.1-8b-int4-awq",  # quantized merged model
    quantization="awq",
    dtype="bfloat16",
    gpu_memory_utilization=0.92,
    max_model_len=8192,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,  # chunk large prompts during prefill
)

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=500)
outputs = llm.generate(
    ["Question 1...", "Question 2...", "Question 3..."],
    sampling,
)
for out in outputs:
    print(out.outputs[0].text)
```
vLLM serving — production config
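Once `vllm serve` is running, it exposes an OpenAI-compatible API on the configured host/port, so any OpenAI client works against it. A minimal sketch, assuming the serve command above and the `openai` Python package:

```python
# Query the vLLM server started with `vllm serve` above (no auth by default,
# so any placeholder api_key works unless --api-key was set).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    temperature=0.7,
    max_tokens=200,
)
print(resp.choices[0].message.content)
```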

5. RTX 4090 + Llama 3.1 8B Throughput#

| Config | tok/s (batch=1) | tok/s (batch=16) | tok/s (batch=64) | Memory |
|---|---|---|---|---|
| HF transformers `.generate` | 35 | 85 | OOM | naïve |
| vLLM bf16 | 95 | 540 | 1240 | 92% GPU |
| vLLM AWQ int4 | 175 | 920 | 2150 | 75% GPU |
| vLLM FP8 | 155 | 1080 | 2520 | 85% GPU |
| + prefix cache (fixed prompt) | +30-90% | +50-90% | +70-90% | |
The Cookbook's default: AWQ int4 + prefix cache, with bf16 fallback.
✅ Deliverables
  1. Serve Llama 3.1 8B AWQ with vLLM.
  2. Compare throughput at batch=1, 8, and 32 (starter script below).
  3. Next lesson: 15.2 — LoRA Hot-Swap Multiplexing.
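A starting point for step 2, reusing the offline `LLM` config from section 4 (the model path is the same local AWQ checkpoint as above; the prompts and token budget are placeholders):

```python
# Rough offline throughput comparison for batch sizes 1, 8, 32.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="llama-3.1-8b-int4-awq", quantization="awq",
          enable_prefix_caching=True, max_model_len=8192)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

for batch in (1, 8, 32):
    prompts = [f"Question {i}: explain PagedAttention briefly." for i in range(batch)]
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch}: {generated / elapsed:.0f} generated tok/s")
```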
