
KV Cache + Paged Attention: Inference Serving Optimization — vLLM Paged Attention and Continuous Batching

LLM inference serving optimization: KV cache anatomy (prefill vs decode phases), memory fragmentation problem, paged attention (vLLM 2023 Kwon), continuous batching, dynamic memory allocation. Llama-3 production serving math: throughput, latency trade-offs, multi-tenancy.

Şükrü Yusuf KAYA
70 min read
Advanced
🚀 The hidden hero of inference serving — KV cache + paged attention
You want to serve Llama-3-8B. 1000 concurrent users, each with an 8K context. Naive serving reserves a full KV cache per request: roughly 1 GB each (Section 1.2), so 1000 users need about 1 TB of GPU memory. Impossible on a single GPU. The modern answer is the KV cache plus paged attention (vLLM, 2023): memory is allocated dynamically, fragmentation is eliminated, and throughput improves 14-24x. Seventy minutes from now you will have a deep grasp of KV cache anatomy, the paged attention algorithm, continuous batching, and the vLLM benchmarks. This is the non-negotiable core of modern LLM serving.

Lesson Map (10 Sections)#

  1. Prefill vs decode — two different phases
  2. KV cache anatomy — what gets cached, how much memory
  3. Memory fragmentation problem — naive serving issue
  4. Paged attention (Kwon 2023) — virtual memory analogy
  5. Page tables — implementation detail
  6. Continuous batching — dynamic request batching
  7. vLLM architecture — production-grade serving
  8. Llama-3 serving math — H100 capacity planning
  9. Comparison — vLLM vs TGI vs SGLang
  10. Edge cases — prefix caching, swap to CPU

1-3. Prefill vs Decode + KV Cache#

1.1 Two-phase inference#

LLM generation is autoregressive and runs in two phases (both are sketched in code below):
Prefill (initial):
  • Starts with the user prompt (e.g., 1000 tokens)
  • The entire prompt goes through one forward pass in parallel
  • The KV cache is filled for the prompt
  • Compute-bound (matrix multiplies)
Decode (generation):
  • Each subsequent token is generated one at a time
  • The K/V of earlier tokens come from the KV cache
  • Each step computes a new K[i], V[i] and appends it to the cache
  • Memory-bound (reads from the KV cache)
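A minimal sketch of the two phases in NumPy with a toy single-head attention (the shapes, weight matrices, and generation loop are illustrative assumptions, not the Llama-3 implementation): prefill computes K and V for the whole prompt in one parallel pass and fills the cache; each decode step computes K/V only for the newest token and appends it.

import numpy as np

d = 64  # toy head dimension

def attention(q, K, V):
    # q: (1, d), K/V: (t, d); single head, last position attends to all cached tokens
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def prefill(prompt_x, Wk, Wv, Wq):
    # One parallel pass over the whole prompt: compute-bound.
    K = prompt_x @ Wk            # (T_prompt, d)
    V = prompt_x @ Wv
    out = attention(prompt_x[-1:] @ Wq, K, V)
    return out, K, V             # K, V become the KV cache

def decode_step(x_new, K_cache, V_cache, Wk, Wv, Wq):
    # One token at a time: only the new K/V row is computed, the rest is
    # read from the cache -> memory-bound.
    K_cache = np.concatenate([K_cache, x_new @ Wk], axis=0)
    V_cache = np.concatenate([V_cache, x_new @ Wv], axis=0)
    out = attention(x_new @ Wq, K_cache, V_cache)
    return out, K_cache, V_cache

rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
prompt = rng.standard_normal((1000, d))      # a "1000-token" prompt
out, K, V = prefill(prompt, Wk, Wv, Wq)
for _ in range(5):                           # generate 5 tokens
    # feed the output back in as the next "token" (toy stand-in for sampling + embedding)
    out, K, V = decode_step(out, K, V, Wk, Wv, Wq)
print(K.shape)                               # (1005, 64): the cache grows by one row per step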

1.2 KV cache memory math#

Llama-3-8B per layer per request:
KV cache size = 2 (K+V) × n_kv_heads × d_head × seq_len × bytes_per_element = 2 × 8 × 128 × 8192 × 2 (bf16) = 32 MB per layer
32 layers × 32 MB = 1 GB per request for 8K context.
128K context: 16 GB per request. 1000 concurrent users: 16 TB, which simply does not fit.
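The same arithmetic as a small helper (a sketch; the Llama-3-8B values plugged in are the ones quoted above):

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    # 2 = one K entry and one V entry per token, per KV head, per layer
    return 2 * n_kv_heads * d_head * seq_len * bytes_per_elem * n_layers

# Llama-3-8B: 32 layers, 8 KV heads (GQA), head_dim 128, bf16 (2 bytes)
print(kv_cache_bytes(32, 8, 128, 8_192) / 2**30)     # 1.0  -> ~1 GB for 8K context
print(kv_cache_bytes(32, 8, 128, 131_072) / 2**30)   # 16.0 -> ~16 GB for 128K context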

1.3 Memory fragmentation problem#

Naive serving reserves a fixed memory block for every request:
  • Request 1: 8K context reserved, only 200 tokens used so far
  • Request 2: 8K reserved, 500 used
  • Total reserved: 16K tokens' worth; actually used: 700
  • Wasted: ~15.3K tokens' worth of memory
And with variable-length sequences (1K, 2K, 8K, 500 tokens) arriving together, fragmentation explodes.
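The waste in this example as a few-line calculation (numbers taken from the bullets; "memory" counted in token slots):

reserved = [8_192, 8_192]     # two requests, each pre-allocated for 8K tokens
used     = [200, 500]         # tokens actually stored so far
utilization = sum(used) / sum(reserved)
print(f"{utilization:.1%} used, {1 - utilization:.1%} wasted")   # 4.3% used, 95.7% wasted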

1.4 Real-world: 60%+ memory waste#

The Kwon 2023 paper found that naive systems waste 60-80% of KV-cache memory through fragmentation and over-reservation: reserve a large block and most of it is wasted; reserve a small one and the request fails once the sequence outgrows it.

4-7. Paged Attention + vLLM#

4.1 Inspiration: OS virtual memory#

Operating systems have used paging since the 1960s: split a process's memory into fixed-size pages and allocate them dynamically.
The idea in Kwon 2023: do the same thing for the KV cache.

4.2 KV cache pages#

  • Page size: typically 16 tokens' worth of KV
  • Each page: ~32 KB for one layer's K (or V) slice with Llama-3-8B dimensions; across all 32 layers and both K and V, one 16-token block occupies ~2 MB
  • A pool of free pages lives on the GPU
  • Requests grab pages from the pool dynamically as they grow (sanity-checked right after this list)
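A quick sanity check on the page size, assuming vLLM's default block size of 16 tokens and K and V stored as separate per-layer tensors (that is how the ~32 KB figure comes out):

block_size     = 16     # tokens per page (vLLM default)
n_kv_heads     = 8      # Llama-3-8B, GQA
d_head         = 128
bytes_per_elem = 2      # bf16

# One page of one layer's K cache (the V cache has an identical page):
page_bytes = block_size * n_kv_heads * d_head * bytes_per_elem
print(page_bytes // 1024)             # 32 KB
print(page_bytes * 2 * 32 // 2**20)   # 2 MB for one 16-token block across K+V and 32 layers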

4.3 Page table#

Each request gets its own page table:
request_id: [logical_block_0, logical_block_1, logical_block_2, ...] → [page_ptr_0, page_ptr_1, page_ptr_2, ...]
Logical blocks (sequential) map to physical pages (potentially non-contiguous).

4.4 Implementation#

Generating token i:
    block_idx = i // PAGE_SIZE
    intra_block_offset = i % PAGE_SIZE
    page_ptr = page_table[request_id][block_idx]
    K_cache[page_ptr][intra_block_offset] = new_K
    V_cache[page_ptr][intra_block_offset] = new_V
Attention compute:
    for each block_idx in request:
        page_ptr = page_table[request_id][block_idx]
        attend to K_cache[page_ptr], V_cache[page_ptr]
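The same mechanism as a minimal runnable sketch in pure NumPy, for a single layer and a single KV head (the PagedKVCache class, its method names, and all shapes are illustrative assumptions, not vLLM's actual API or kernels):

import numpy as np

PAGE_SIZE = 16    # tokens per physical page
D_HEAD    = 128

class PagedKVCache:
    # Toy paged KV cache: one layer, one KV head, non-contiguous physical pages.
    def __init__(self, num_pages):
        self.K = np.zeros((num_pages, PAGE_SIZE, D_HEAD), dtype=np.float16)
        self.V = np.zeros((num_pages, PAGE_SIZE, D_HEAD), dtype=np.float16)
        self.free_pages = list(range(num_pages))   # pool of free physical pages
        self.page_table = {}                       # request_id -> [physical page ids]
        self.lengths = {}                          # request_id -> tokens written so far

    def append(self, request_id, k_new, v_new):
        table = self.page_table.setdefault(request_id, [])
        i = self.lengths.get(request_id, 0)
        block_idx, offset = divmod(i, PAGE_SIZE)
        if block_idx == len(table):                # current page is full: allocate a new one
            table.append(self.free_pages.pop())
        page = table[block_idx]
        self.K[page, offset] = k_new
        self.V[page, offset] = v_new
        self.lengths[request_id] = i + 1

    def gather(self, request_id):
        # Return this request's K/V in logical (token) order for the attention computation.
        n = self.lengths[request_id]
        pages = self.page_table[request_id]
        K = np.concatenate([self.K[p] for p in pages])[:n]
        V = np.concatenate([self.V[p] for p in pages])[:n]
        return K, V

cache = PagedKVCache(num_pages=64)
for i in range(40):                                # 40 tokens -> 3 pages (16 + 16 + 8)
    cache.append("req-0", np.ones(D_HEAD), np.ones(D_HEAD))
K, V = cache.gather("req-0")
print(K.shape, cache.page_table["req-0"])          # (40, 128) and the 3 physical page ids

Note that gather materializes a contiguous copy only for clarity; the real paged attention kernel reads each physical page in place, walking the page table inside the kernel.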

4.5 Continuous batching#

Naive: a static batch (e.g., 32 requests) runs until every sequence in it finishes, so slow requests block the whole batch.
Continuous: requests join and leave dynamically. A finished request leaves the batch and a waiting one takes its slot in the very next step, so the GPU never idles.
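A sketch of the scheduling loop (the Request class, the queue, and the one-token-per-step granularity are simplified assumptions; vLLM's real scheduler also interleaves prefill chunks, enforces a per-step token budget, and tracks free KV pages):

from collections import deque

class Request:
    def __init__(self, req_id, max_new_tokens):
        self.req_id = req_id
        self.max_new_tokens = max_new_tokens
        self.generated = 0

    def finished(self):
        return self.generated >= self.max_new_tokens

def serve(waiting, max_batch, steps):
    running = []
    for _ in range(steps):
        # Admit waiting requests as soon as slots (and KV pages) free up.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        if not running:
            break
        # One decode step: every running request produces one token.
        for req in running:
            req.generated += 1
        # Finished requests leave immediately; they never block the batch.
        running = [r for r in running if not r.finished()]
    return running, waiting

queue = deque(Request(i, max_new_tokens=10 * (i + 1)) for i in range(8))
running, waiting = serve(queue, max_batch=4, steps=150)
print(len(running), len(waiting))   # 0 0 -> all 8 requests drained, no slot left idle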

4.6 vLLM architecture#

┌─────────────────────────────────────────────┐
│ User requests (variable length)             │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│ Continuous batching scheduler               │
│ - Mix prefill + decode requests             │
│ - Manage page table per request             │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│ GPU compute (FlashAttention)                │
│ - Paged attention kernel                    │
│ - Mixed-precision (BF16 forward)            │
└─────────────────────────────────────────────┘

4.7 Performance gains#

The Kwon 2023 paper: ~96% memory utilization (vs ~30% naive). Throughput: 14-24x improvement over naive serving. Latency: comparable for a single request, dramatically better under multi-tenant load.

8. Llama-3 Production Serving Math#

8.1 H100 80GB capacity (Llama-3-8B)#

Model weights bf16: 16 GB. Remaining: 64 GB for KV cache + activations.
Per request KV cache:
  • 8K context: 1 GB → 64 concurrent
  • 32K context: 4 GB → 16 concurrent
  • 128K context: 16 GB → 4 concurrent
vLLM with paging reaches ~95% memory utilization → roughly 60 concurrent 8K-context requests (vs ~20 with naive reserved blocks).
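The capacity figures as a calculation (the 0.95 and 0.30 utilization factors are the vLLM-vs-naive numbers quoted above; the per-request sizes come from Section 1.2):

GPU_MEM_GB    = 80      # H100
WEIGHTS_GB    = 16      # Llama-3-8B in bf16
KV_PER_REQ_GB = {8_192: 1.0, 32_768: 4.0, 131_072: 16.0}   # per-request KV cache, Section 1.2

def max_concurrent(context_len, utilization):
    free_gb = GPU_MEM_GB - WEIGHTS_GB          # memory left over for KV cache
    return int(free_gb * utilization / KV_PER_REQ_GB[context_len])

print(max_concurrent(8_192, 0.95))    # ~60 with vLLM paging
print(max_concurrent(8_192, 0.30))    # ~19, the "naive ~20" above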

8.2 Throughput numbers (representative 2026 benchmarks)#

vLLM 0.5+ on H100 80GB serving Llama-3-8B:
  • Single request, 8K context, batch=1: 80 tokens/sec
  • 32 concurrent, mixed batch: ~3000 tokens/sec aggregate
  • 64 concurrent: ~4500 tokens/sec

8.3 Llama-3-70B serving#

Weights in bf16: 140 GB → needs 2× H100 (TP=2) or 1× H200. KV cache (with GQA, 80 layers): 32 MB per layer × 80 ≈ 2.5 GB per request at 8K context. On 2× H100 that leaves only ~20 GB after weights, so roughly 8 concurrent 8K requests fit; higher concurrency in practice means more GPUs (TP=4/8), shorter contexts, or quantized weights. Throughput: ~1500 tokens/sec aggregate.

8.4 Cost economics#

H100 hourly cloud cost (2026): ~$2.5/hr (spot). Throughput 3000 tokens/sec → 10.8M tokens/hour. Cost: $2.5 / 10.8M ≈ $0.23 per 1M tokens.
OpenAI API GPT-4o input: $2.5/1M. Self-hosting is ~10x cheaper at high volume.
Break-even: around 100K users/day. Below that, the OpenAI API wins; above it, self-hosting does.
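The unit-cost arithmetic, spelled out (the spot price and throughput are the assumptions stated above):

gpu_cost_per_hour = 2.5                          # USD, H100 spot (assumption from the text)
tokens_per_second = 3000                         # aggregate decode throughput, ~32 concurrent
tokens_per_hour   = tokens_per_second * 3600     # 10.8M tokens/hour
cost_per_m_tokens = gpu_cost_per_hour / (tokens_per_hour / 1e6)
print(round(cost_per_m_tokens, 2))               # 0.23 USD per 1M tokens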
✅ Lesson 8.4 Summary — KV Cache + Paged Attention
LLM inference has two phases: prefill (parallel prompt processing) and decode (token-by-token generation). The KV cache is large (8K context = 1 GB per request for Llama-3-8B). Naive serving wastes 60%+ of memory. Paged attention (Kwon 2023) is inspired by OS virtual memory: a page table per request and dynamic allocation. Continuous batching lets requests join and leave dynamically, so the GPU never idles. vLLM delivers production-grade serving: ~95% memory utilization, 14-24x throughput. Self-hosted Llama-3-8B handles ~60 concurrent 8K-context requests on one H100. Cost: ~$0.23/1M tokens vs OpenAI's $2.5/1M → ~10x cheaper at scale. In Lesson 8.5 we turn to alternative architectures to attention (RetNet, Mamba, Linear Attention) and attempts to break the quadratic limit.

Next Lesson: Linear Attention + RetNet + Mamba#

Lesson 8.5 (Module 8 capstone): alternatives to quadratic attention: Linear Attention (Katharopoulos 2020), RetNet (Sun 2023), Mamba (Gu & Dao 2023). Sub-quadratic architectures for long context.

Frequently Asked Questions

Which serving stack should I pick: vLLM, TGI, or SGLang?
vLLM: the most widespread, with the most mature paged attention. TGI (Hugging Face text-generation-inference): tight HF ecosystem integration. SGLang: a novel programming model plus caching. The 2026 mainstream choice: vLLM. Special use cases: SGLang. HF-native deployments: TGI.
