KV Cache + Paged Attention: Inference Serving Optimization — vLLM Paged Attention and Continuous Batching
LLM inference serving optimization: KV cache anatomy (prefill vs decode phases), memory fragmentation problem, paged attention (vLLM 2023 Kwon), continuous batching, dynamic memory allocation. Llama-3 production serving math: throughput, latency trade-offs, multi-tenancy.
Şükrü Yusuf KAYA
70 min read
Advanced 🚀 The hidden hero of inference serving — KV cache + paged attention
You want to serve Llama-3-8B. 1000 concurrent users, long contexts. Naive serving reserves the full context window up front: at 128K that is 16 GB of KV cache per request, so 1000 users = 16 TB of GPU memory. Impossible. The modern answer: KV cache + paged attention (vLLM, 2023). Memory is allocated dynamically, fragmentation is eliminated, and throughput rises 10-30x. In 70 minutes you will have a deep grasp of KV cache anatomy, the paged attention algorithm, continuous batching, and vLLM's benchmarks. The sine qua non of modern LLM serving.
Lesson Map (10 Sections)#
- Prefill vs decode — iki farklı phase
- KV cache anatomy — what gets cached, how much memory
- Memory fragmentation problem — naive serving issue
- Paged attention (Kwon 2023) — virtual memory analogy
- Page tables — implementation detail
- Continuous batching — dynamic request batching
- vLLM architecture — production-grade serving
- Llama-3 serving math — H100 capacity planning
- Comparison — vLLM vs TGI vs SGLang
- Edge cases — prefix caching, swap to CPU
1-3. Prefill vs Decode + KV Cache#
1.1 Two-phase inference#
LLM autoregressive generation:
Prefill (initial):
- Starts with the user prompt (e.g., 1000 tokens)
- The entire prompt goes through one parallel forward pass
- The KV cache is filled for all prompt tokens
- Compute-bound (matrix multiply)
Decode (generation):
- Each subsequent token is generated one at a time
- Previous tokens' K/V come from the KV cache
- Each step computes a new K[i], V[i] and appends them to the cache
- Memory-bound (KV cache'ten read)
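The two phases above can be sketched as a toy loop. No real model here: `project` is a placeholder for a transformer layer's K/V projections; the point is when the cache gets written.

```python
# Toy sketch of prefill vs decode. project() stands in for a real
# layer's K/V projections; what matters is WHEN the cache is filled.

def project(token):
    return ("K", token), ("V", token)   # stand-in K and V vectors

def prefill(prompt_tokens):
    # Prefill: one parallel pass computes K/V for EVERY prompt token.
    kv = [project(t) for t in prompt_tokens]
    return [k for k, _ in kv], [v for _, v in kv]

def decode_step(K_cache, V_cache, new_token):
    # Decode: only the new token's K/V are computed and appended;
    # attention then re-reads the whole cache (memory-bound).
    k, v = project(new_token)
    K_cache.append(k)
    V_cache.append(v)

K, V = prefill(range(1000))   # 1000-token prompt, filled in one pass
for t in range(8):            # 8 generated tokens, one step each
    decode_step(K, V, 1000 + t)
print(len(K))  # 1008: the cache grows by one entry per decoded token
```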
1.2 KV cache memory math#
Llama-3-8B per layer per request:
KV cache size = 2 (K+V) × n_kv_heads × d_head × seq_len × bytes_per_element = 2 × 8 × 128 × 8192 × 2 (bf16) = 32 MB per layer
32 layers × 32 MB = 1 GB per request for 8K context.
128K context: 16 GB per request. 1000 concurrent users: 16 TB — far beyond any single GPU fleet.
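The formula above as a small calculator, with the Llama-3-8B shapes used in this section (32 layers, 8 KV heads, head_dim 128, bf16) as defaults:

```python
# KV cache size per request. Defaults: Llama-3-8B (32 layers,
# 8 KV heads, head_dim 128, bf16 = 2 bytes per element).
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, d_head=128, dtype_bytes=2):
    per_layer = 2 * n_kv_heads * d_head * seq_len * dtype_bytes  # 2 = K and V
    return per_layer * n_layers

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.0f} GiB per request")
```

This reproduces the 1 GB / 16 GB figures quoted in the text for 8K and 128K contexts.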
1.3 Memory fragmentation problem#
Naive serving: a pre-reserved memory block per request.
- Request 1: 8K context reserved, but only used 200 tokens
- Request 2: 8K reserved, used 500
- Total reserved: 16K, actually used: 700
- Wasted: 15.3K tokens' worth of memory
And with variable-length sequences (1K, 2K, 8K, 500), fragmentation explodes.
1.4 Real-world: 60%+ memory waste#
The Kwon 2023 paper reports that naive systems waste 60-80% of KV cache memory on fragmentation: either you reserve too much (waste) or too little (the request fails).
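The waste in the two-request example above is easy to quantify (a toy computation, assuming each request pre-reserves a full 8K-token slot):

```python
# Quantifying reservation waste: each request reserves a full 8K-token
# slot regardless of how many tokens it has actually generated.
reserved_per_request = 8192
used = [200, 500]                    # tokens actually in each request's cache
total_reserved = reserved_per_request * len(used)
total_used = sum(used)
waste = 1 - total_used / total_reserved
print(f"reserved={total_reserved}, used={total_used}, waste={waste:.1%}")
# -> reserved=16384, used=700, waste=95.7%
```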
4-7. Paged Attention + vLLM#
4.1 Inspiration: OS virtual memory#
Operating systems have used paging since the 1960s: split a process's memory into fixed-size pages and allocate them dynamically.
The Kwon 2023 idea: do the same for the KV cache.
4.2 KV cache pages#
- Page size: typically 16 tokens' worth of KV
- Each page: ~32 KB — one K or V block per layer at Llama-3-8B shapes (16 × 8 × 128 × 2 B)
- A pool of free pages lives on the GPU
- Requests claim pages dynamically as they grow
4.3 Page table#
Each request has its own page table:
```
request_id:
  logical blocks : [logical_block_0, logical_block_1, logical_block_2, ...]
  physical pages : [page_ptr_0,      page_ptr_1,      page_ptr_2,      ...]
```
Logical block (sequential) → physical page (potentially non-contiguous).
4.4 Implementation#
Generate token i:

```python
block_idx = i // PAGE_SIZE
intra_block_offset = i % PAGE_SIZE
page_ptr = page_table[request_id][block_idx]
K_cache[page_ptr][intra_block_offset] = new_K
V_cache[page_ptr][intra_block_offset] = new_V
```
Attention compute:
```python
for block_idx in request_blocks:
    page_ptr = page_table[request_id][block_idx]
    # attend to K_cache[page_ptr], V_cache[page_ptr]
```
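Sections 4.2-4.4 can be tied together in a toy allocator. These are illustrative data structures only, not vLLM's actual classes:

```python
PAGE_SIZE = 16  # tokens per page, as in the section above

class PagedKVCache:
    """Toy paged KV allocator: a free-page pool plus a per-request page
    table mapping logical blocks to (possibly non-contiguous) pages."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.page_table = {}    # request_id -> [physical page ids]
        self.kv = {}            # (page, offset) -> (K, V) entry

    def append(self, request_id, token_idx, k, v):
        blocks = self.page_table.setdefault(request_id, [])
        block_idx, offset = divmod(token_idx, PAGE_SIZE)
        if block_idx == len(blocks):          # crossed into a new block:
            if not self.free_pages:           # claim a page from the pool
                raise MemoryError("KV pool exhausted")
            blocks.append(self.free_pages.pop())
        self.kv[(blocks[block_idx], offset)] = (k, v)

    def read(self, request_id):
        # Attention walks the page table in logical block order.
        out = []
        for page in self.page_table[request_id]:
            out += [self.kv[(page, o)] for o in range(PAGE_SIZE)
                    if (page, o) in self.kv]
        return out

cache = PagedKVCache(num_pages=8)
for i in range(40):                  # 40 tokens -> ceil(40/16) = 3 pages
    cache.append("req-1", i, k=i, v=-i)
print(len(cache.page_table["req-1"]), len(cache.read("req-1")))  # 3 40
```

Pages are claimed only when a request actually crosses a 16-token boundary, which is exactly where the memory savings over fixed reservation come from.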
4.5 Continuous batching#
Naive: a static batch (e.g., 32 requests). Slow requests block the whole batch until everyone finishes.
Continuous: requests join and leave dynamically. Completed requests exit the batch and new requests slot in immediately. The GPU is never idle.
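A minimal simulation of the difference (token counts and batch size are made up; one loop iteration = one decode step):

```python
from collections import deque

# Continuous batching: freed slots are refilled every step, so a slow
# request never holds other requests hostage.
MAX_BATCH = 4
waiting = deque((f"req-{i}", need) for i, need in enumerate([3, 1, 5, 2, 4, 1]))
running = {}          # request_id -> tokens still to generate
steps = 0

while waiting or running:
    while waiting and len(running) < MAX_BATCH:   # refill freed slots NOW
        rid, need = waiting.popleft()
        running[rid] = need
    for rid in list(running):                     # one decode step for all
        running[rid] -= 1
        if running[rid] == 0:
            del running[rid]                      # finished: leaves mid-batch
    steps += 1

# Static batching for comparison: the slowest request holds each batch.
static_steps = sum(max(b) for b in ([3, 1, 5, 2], [4, 1]))
print(steps, static_steps)  # 5 9
```

Same workload, same batch width: continuous batching finishes in 5 steps where static batching needs 9.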
4.6 vLLM architecture#
```
┌─────────────────────────────────────────────┐
│ User requests (variable length)             │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│ Continuous batching scheduler               │
│ - Mix prefill + decode requests             │
│ - Manage page table per request             │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│ GPU compute (FlashAttention)                │
│ - Paged attention kernel                    │
│ - Mixed-precision (BF16 forward)            │
└─────────────────────────────────────────────┘
```
4.7 Performance gains#
The Kwon 2023 paper: ~96% memory utilization (vs ~30% for naive serving).
Throughput: 14-24x improvement over naive serving.
Latency: comparable for a single request, dramatically better under multi-tenancy.
8. Llama-3 Production Serving Math#
8.1 H100 80GB capacity (Llama-3-8B)#
Model weights bf16: 16 GB. Remaining: 64 GB for KV cache + activations.
Per request KV cache:
- 8K context: 1 GB → 64 concurrent
- 32K context: 4 GB → 16 concurrent
- 128K context: 16 GB → 4 concurrent
vLLM with paging reaches ~95% memory utilization: in practice ~60 concurrent 8K-context requests (vs ~20 when naive pre-reservation fragments the same memory).
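The capacity numbers above as a small planner. It assumes this section's figures (80 GB H100, 16 GB bf16 weights, ~1 GB of KV per 8K-context request); `utilization` models how much of the free memory paging can actually use:

```python
# Capacity planner for one H100 80GB serving Llama-3-8B, using this
# section's numbers: 16 GB weights, ~1 GB KV per 8K-context request.
GPU_GB, WEIGHTS_GB, KV_GB_PER_8K = 80, 16, 1.0

def max_concurrent(context_tokens, utilization):
    kv_gb = KV_GB_PER_8K * context_tokens / 8192
    return int((GPU_GB - WEIGHTS_GB) * utilization / kv_gb)

for ctx in (8192, 32768, 131072):
    print(ctx, max_concurrent(ctx, utilization=0.95))
```

At 95% utilization this gives ~60 concurrent 8K requests, matching the figure quoted above.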
8.2 Throughput numbers (real benchmarks 2026)#
vLLM 0.5+ on H100 80GB serving Llama-3-8B:
- Single request, 8K context, batch=1: 80 tokens/sec
- 32 concurrent, mixed batch: ~3000 tokens/sec aggregate
- 64 concurrent: ~4500 tokens/sec
8.3 Llama-3-70B serving#
Weights: 140 GB → needs 2 H100 (TP=2) or 1 H200.
KV cache (with GQA: 8 KV heads, head_dim 128, 80 layers): ~2.5 GB per request at 8K context, by the same formula as in section 1.2.
Concurrent: with only ~20 GB left after weights on 2× H100, roughly 8 requests at 8K context; reaching hundreds of concurrent requests means more GPUs, shorter contexts, or a quantized (e.g., FP8) KV cache.
Throughput: ~1500 tokens/sec aggregate.
8.4 Cost economics#
H100 hourly cloud cost (2026): ~$2.5/hour. At ~3000 tokens/sec aggregate that is 10.8M tokens/hour, so $2.5 / 10.8M ≈ $0.23 per 1M tokens.
OpenAI API GPT-4o input: $2.5/1M. Self-host 10x cheaper for high-volume.
Break-even: ~100K requests/day. Below that, use the OpenAI API; above it, self-host.
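The cost arithmetic, spelled out (assumed inputs: $2.5/hour for an H100 and the ~3000 tokens/sec aggregate throughput from section 8.2):

```python
# Cost per token: hourly GPU price divided by hourly token output.
hourly_usd = 2.5
tokens_per_hour = 3000 * 3600            # 10.8M tokens per hour
usd_per_million_tokens = hourly_usd / (tokens_per_hour / 1_000_000)
print(f"${usd_per_million_tokens:.2f} per 1M tokens")  # -> $0.23 per 1M tokens
```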
✅ Lesson 8.4 Summary — KV Cache + Paged Attention
LLM inference: prefill (parallel prompt processing) vs decode (token-by-token generation). The KV cache is large (8K context = 1 GB per request for Llama-3-8B). Naive serving wastes 60%+ of memory. Paged attention (Kwon 2023): inspired by OS virtual memory, with a page table per request and dynamic allocation. Continuous batching: requests join and leave dynamically, so the GPU is never idle. vLLM: production-grade serving, ~95% memory utilization, 14-24x throughput. Self-hosted Llama-3-8B handles ~60 concurrent 8K-context requests on one H100. Cost: ~$0.23 per 1M tokens vs $2.5 for GPT-4o input → ~10x cheaper at scale. In Lesson 8.5 we turn to alternative architectures to attention (RetNet, Mamba, Linear Attention) and attempts to break the quadratic limit.
Next Lesson: Linear Attention + RetNet + Mamba#
Lesson 8.5 (Module 8 capstone): alternatives to quadratic attention — Linear Attention (Katharopoulos 2020), RetNet (Sun 2023), Mamba (Gu & Dao 2023). Sub-quadratic architectures for long context.
Frequently Asked Questions
Q: vLLM vs TGI vs SGLang — which one? vLLM: the most widespread, with the most mature paged attention. TGI (HF text-generation-inference): tight HF ecosystem integration. SGLang: a novel programming model plus aggressive caching. The 2026 mainstream choice: vLLM. Special use cases: SGLang. HF-native deployment: TGI.