KV Cache + Paged Attention: Inference Serving Optimization — vLLM Paged Attention and Continuous Batching
LLM inference serving optimization: KV cache anatomy (prefill vs decode phases), memory fragmentation problem, paged attention (vLLM 2023 Kwon), continuous batching, dynamic memory allocation. Llama-3 production serving math: throughput, latency trade-offs, multi-tenancy.
Şükrü Yusuf KAYA
70 min read
Advanced 🚀 The hidden hero of inference serving — KV cache + paged attention
You want to serve Llama-3-8B. 1000 concurrent users, long contexts. Naive serving reserves the full context window up front: at 128K that is 16 GB of KV cache per request, so 1000 users = 16 TB of GPU memory. Impossible. The modern answer: KV cache + paged attention (vLLM, 2023). Memory is allocated dynamically, fragmentation is eliminated, and throughput rises 10-30x. In 70 minutes you will have a deep grasp of KV cache anatomy, the paged attention algorithm, continuous batching, and vLLM's benchmarks. The sine qua non of modern LLM serving.
Lesson Map (10 Sections)#
- Prefill vs decode — iki farklı phase
- KV cache anatomy — what gets cached, how much memory
- Memory fragmentation problem — naive serving issue
- Paged attention (Kwon 2023) — virtual memory analogy
- Page tables — implementation detail
- Continuous batching — dynamic request batching
- vLLM architecture — production-grade serving
- Llama-3 serving math — H100 capacity planning
- Comparison — vLLM vs TGI vs SGLang
- Edge cases — prefix caching, swap to CPU
1-3. Prefill vs Decode + KV Cache#
1.1 Two-phase inference#
LLM autoregressive generation:
Prefill (initial):
- Starts with the user prompt (e.g., 1000 tokens)
- The entire prompt goes through one parallel forward pass
- The KV cache is filled for all prompt tokens
- Compute-bound (matrix multiply)
Decode (generation):
- Each subsequent token is generated one at a time
- Previous tokens' K/V come from the KV cache
- Each step computes a new K[i], V[i] and appends them to the cache
- Memory-bound (KV cache'ten read)
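The two phases above can be sketched as a toy loop. No real model here: `project` is a placeholder for a transformer layer's K/V projections; the point is when the cache gets written.

```python
# Toy sketch of prefill vs decode. project() stands in for a real
# layer's K/V projections; what matters is WHEN the cache is filled.

def project(token):
    return ("K", token), ("V", token)   # stand-in K and V vectors

def prefill(prompt_tokens):
    # Prefill: one parallel pass computes K/V for EVERY prompt token.
    kv = [project(t) for t in prompt_tokens]
    return [k for k, _ in kv], [v for _, v in kv]

def decode_step(K_cache, V_cache, new_token):
    # Decode: only the new token's K/V are computed and appended;
    # attention then re-reads the whole cache (memory-bound).
    k, v = project(new_token)
    K_cache.append(k)
    V_cache.append(v)

K, V = prefill(range(1000))   # 1000-token prompt, filled in one pass
for t in range(8):            # 8 generated tokens, one step each
    decode_step(K, V, 1000 + t)
print(len(K))  # 1008: the cache grows by one entry per decoded token
```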
1.2 KV cache memory math#
Llama-3-8B per layer per request:
KV cache size = 2 (K+V) × n_kv_heads × d_head × seq_len × bytes_per_element = 2 × 8 × 128 × 8192 × 2 (bf16) = 32 MB per layer
32 layers × 32 MB = 1 GB per request for 8K context.
128K context: 16 GB per request. 1000 concurrent users: 16 TB — far beyond any single GPU fleet.
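The formula above as a small calculator, with the Llama-3-8B shapes used in this section (32 layers, 8 KV heads, head_dim 128, bf16) as defaults:

```python
# KV cache size per request. Defaults: Llama-3-8B (32 layers,
# 8 KV heads, head_dim 128, bf16 = 2 bytes per element).
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, d_head=128, dtype_bytes=2):
    per_layer = 2 * n_kv_heads * d_head * seq_len * dtype_bytes  # 2 = K and V
    return per_layer * n_layers

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.0f} GiB per request")
```

This reproduces the 1 GB / 16 GB figures quoted in the text for 8K and 128K contexts.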
1.3 Memory fragmentation problem#
Naive serving: a pre-reserved memory block per request.
- Request 1: 8K context reserved, but only used 200 tokens
- Request 2: 8K reserved, used 500
- Total reserved: 16K, actually used: 700
- Wasted: 15.3K tokens' worth of memory
And with variable-length sequences (1K, 2K, 8K, 500), fragmentation explodes.
1.4 Real-world: 60%+ memory waste#
The Kwon 2023 paper reports that naive systems waste 60-80% of KV cache memory on fragmentation: either you reserve too much (waste) or too little (the request fails).
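The waste in the two-request example above is easy to quantify (a toy computation, assuming each request pre-reserves a full 8K-token slot):

```python
# Quantifying reservation waste: each request reserves a full 8K-token
# slot regardless of how many tokens it has actually generated.
reserved_per_request = 8192
used = [200, 500]                    # tokens actually in each request's cache
total_reserved = reserved_per_request * len(used)
total_used = sum(used)
waste = 1 - total_used / total_reserved
print(f"reserved={total_reserved}, used={total_used}, waste={waste:.1%}")
# -> reserved=16384, used=700, waste=95.7%
```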
4-7. Paged Attention + vLLM#
4.1 Inspiration: OS virtual memory#
Operating systems have used paging since the 1960s: split a process's memory into fixed-size pages and allocate them dynamically.
The Kwon 2023 idea: do the same for the KV cache.
4.2 KV cache pages#
- Page size: typically 16 tokens' worth of KV
- Each page: ~32 KB — one K or V block per layer at Llama-3-8B shapes (16 × 8 × 128 × 2 B)
- A pool of free pages lives on the GPU
- Requests claim pages dynamically as they grow
4.3 Page table#
Each request has its own page table:
```
request_id:
  logical blocks : [logical_block_0, logical_block_1, logical_block_2, ...]
  physical pages : [page_ptr_0,      page_ptr_1,      page_ptr_2,      ...]
```
Logical block (sequential) → physical page (potentially non-contiguous).
4.4 Implementation#
Generate token i:

```python
block_idx = i // PAGE_SIZE
intra_block_offset = i % PAGE_SIZE
page_ptr = page_table[request_id][block_idx]
K_cache[page_ptr][intra_block_offset] = new_K
V_cache[page_ptr][intra_block_offset] = new_V
```
Attention compute:
```python
for block_idx in request_blocks:
    page_ptr = page_table[request_id][block_idx]
    # attend to K_cache[page_ptr], V_cache[page_ptr]
```
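Sections 4.2-4.4 can be tied together in a toy allocator. These are illustrative data structures only, not vLLM's actual classes:

```python
PAGE_SIZE = 16  # tokens per page, as in the section above

class PagedKVCache:
    """Toy paged KV allocator: a free-page pool plus a per-request page
    table mapping logical blocks to (possibly non-contiguous) pages."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.page_table = {}    # request_id -> [physical page ids]
        self.kv = {}            # (page, offset) -> (K, V) entry

    def append(self, request_id, token_idx, k, v):
        blocks = self.page_table.setdefault(request_id, [])
        block_idx, offset = divmod(token_idx, PAGE_SIZE)
        if block_idx == len(blocks):          # crossed into a new block:
            if not self.free_pages:           # claim a page from the pool
                raise MemoryError("KV pool exhausted")
            blocks.append(self.free_pages.pop())
        self.kv[(blocks[block_idx], offset)] = (k, v)

    def read(self, request_id):
        # Attention walks the page table in logical block order.
        out = []
        for page in self.page_table[request_id]:
            out += [self.kv[(page, o)] for o in range(PAGE_SIZE)
                    if (page, o) in self.kv]
        return out

cache = PagedKVCache(num_pages=8)
for i in range(40):                  # 40 tokens -> ceil(40/16) = 3 pages
    cache.append("req-1", i, k=i, v=-i)
print(len(cache.page_table["req-1"]), len(cache.read("req-1")))  # 3 40
```

Pages are claimed only when a request actually crosses a 16-token boundary, which is exactly where the memory savings over fixed reservation come from.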
4.5 Continuous batching#
Naive: a static batch (e.g., 32 requests). Slow requests block the whole batch until everyone finishes.
Continuous: requests join and leave dynamically. Completed requests exit the batch and new requests slot in immediately. The GPU is never idle.
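A minimal simulation of the difference (token counts and batch size are made up; one loop iteration = one decode step):

```python
from collections import deque

# Continuous batching: freed slots are refilled every step, so a slow
# request never holds other requests hostage.
MAX_BATCH = 4
waiting = deque((f"req-{i}", need) for i, need in enumerate([3, 1, 5, 2, 4, 1]))
running = {}          # request_id -> tokens still to generate
steps = 0

while waiting or running:
    while waiting and len(running) < MAX_BATCH:   # refill freed slots NOW
        rid, need = waiting.popleft()
        running[rid] = need
    for rid in list(running):                     # one decode step for all
        running[rid] -= 1
        if running[rid] == 0:
            del running[rid]                      # finished: leaves mid-batch
    steps += 1

# Static batching for comparison: the slowest request holds each batch.
static_steps = sum(max(b) for b in ([3, 1, 5, 2], [4, 1]))
print(steps, static_steps)  # 5 9
```

Same workload, same batch width: continuous batching finishes in 5 steps where static batching needs 9.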
4.6 vLLM architecture#
```
┌─────────────────────────────────────────────┐
│ User requests (variable length)             │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│ Continuous batching scheduler               │
│ - Mix prefill + decode requests             │
│ - Manage page table per request             │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│ GPU compute (FlashAttention)                │
│ - Paged attention kernel                    │
│ - Mixed-precision (BF16 forward)            │
└─────────────────────────────────────────────┘
```
4.7 Performance gains#
The Kwon 2023 paper: ~96% memory utilization (vs ~30% for naive serving).
Throughput: 14-24x improvement over naive serving.
Latency: comparable for a single request, dramatically better under multi-tenancy.
8. Llama-3 Production Serving Math#
8.1 H100 80GB capacity (Llama-3-8B)#
Model weights bf16: 16 GB. Remaining: 64 GB for KV cache + activations.
Per request KV cache:
- 8K context: 1 GB → 64 concurrent
- 32K context: 4 GB → 16 concurrent
- 128K context: 16 GB → 4 concurrent
vLLM with paging reaches ~95% memory utilization: in practice ~60 concurrent 8K-context requests (vs ~20 when naive pre-reservation fragments the same memory).
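The capacity numbers above as a small planner. It assumes this section's figures (80 GB H100, 16 GB bf16 weights, ~1 GB of KV per 8K-context request); `utilization` models how much of the free memory paging can actually use:

```python
# Capacity planner for one H100 80GB serving Llama-3-8B, using this
# section's numbers: 16 GB weights, ~1 GB KV per 8K-context request.
GPU_GB, WEIGHTS_GB, KV_GB_PER_8K = 80, 16, 1.0

def max_concurrent(context_tokens, utilization):
    kv_gb = KV_GB_PER_8K * context_tokens / 8192
    return int((GPU_GB - WEIGHTS_GB) * utilization / kv_gb)

for ctx in (8192, 32768, 131072):
    print(ctx, max_concurrent(ctx, utilization=0.95))
```

At 95% utilization this gives ~60 concurrent 8K requests, matching the figure quoted above.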
8.2 Throughput numbers (real benchmarks 2026)#
vLLM 0.5+ on H100 80GB serving Llama-3-8B:
- Single request, 8K context, batch=1: 80 tokens/sec
- 32 concurrent, mixed batch: ~3000 tokens/sec aggregate
- 64 concurrent: ~4500 tokens/sec
8.3 Llama-3-70B serving#
Weights: 140 GB → needs 2 H100 (TP=2) or 1 H200.
KV cache (with GQA: 8 KV heads, head_dim 128, 80 layers): ~2.5 GB per request at 8K context, by the same formula as in section 1.2.
Concurrent: with only ~20 GB left after weights on 2× H100, roughly 8 requests at 8K context; reaching hundreds of concurrent requests means more GPUs, shorter contexts, or a quantized (e.g., FP8) KV cache.
Throughput: ~1500 tokens/sec aggregate.
8.4 Cost economics#
H100 hourly cloud cost (2026): ~$2.5/hour. At ~3000 tokens/sec aggregate that is 10.8M tokens/hour, so $2.5 / 10.8M ≈ $0.23 per 1M tokens.
OpenAI API GPT-4o input: $2.5/1M. Self-host 10x cheaper for high-volume.
Break-even: ~100K requests/day. Below that, use the OpenAI API; above it, self-host.
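The cost arithmetic, spelled out (assumed inputs: $2.5/hour for an H100 and the ~3000 tokens/sec aggregate throughput from section 8.2):

```python
# Cost per token: hourly GPU price divided by hourly token output.
hourly_usd = 2.5
tokens_per_hour = 3000 * 3600            # 10.8M tokens per hour
usd_per_million_tokens = hourly_usd / (tokens_per_hour / 1_000_000)
print(f"${usd_per_million_tokens:.2f} per 1M tokens")  # -> $0.23 per 1M tokens
```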
✅ Lesson 8.4 Summary — KV Cache + Paged Attention
LLM inference: prefill (parallel prompt processing) vs decode (token-by-token generation). The KV cache is large (8K context = 1 GB per request for Llama-3-8B). Naive serving wastes 60%+ of memory. Paged attention (Kwon 2023): inspired by OS virtual memory, with a page table per request and dynamic allocation. Continuous batching: requests join and leave dynamically, so the GPU is never idle. vLLM: production-grade serving, ~95% memory utilization, 14-24x throughput. Self-hosted Llama-3-8B handles ~60 concurrent 8K-context requests on one H100. Cost: ~$0.23 per 1M tokens vs $2.5 for GPT-4o input → ~10x cheaper at scale. In Lesson 8.5 we turn to alternative architectures to attention (RetNet, Mamba, Linear Attention) and attempts to break the quadratic limit.
Next Lesson: Linear Attention + RetNet + Mamba#
Lesson 8.5 (Module 8 capstone): alternatives to quadratic attention — Linear Attention (Katharopoulos 2020), RetNet (Sun 2023), Mamba (Gu & Dao 2023). Sub-quadratic architectures for long context.
Frequently Asked Questions
Q: vLLM vs TGI vs SGLang — which one? vLLM: the most widespread, with the most mature paged attention. TGI (HF text-generation-inference): tight HF ecosystem integration. SGLang: a novel programming model plus aggressive caching. The 2026 mainstream choice: vLLM. Special use cases: SGLang. HF-native deployment: TGI.