
PagedAttention (vLLM): Block Table + Copy-on-Write + KV-Cache Fragmentation

Deep anatomy of vLLM's killer feature, PagedAttention: the KV-cache split into 16-token blocks, a logical→physical block table, copy-on-write (prefix sharing), near-zero fragmentation. CUDA implementation snippets and vLLM source reading. Prefix-cache hit rate of 50%+ → throughput +60% on an RTX 4090.

Şükrü Yusuf KAYA
28 min read
Advanced

1. The PagedAttention Concept#

Classic LLM serving:
  • Each request's KV-cache is one contiguous tensor
  • Pre-allocated up to the request's expected max length
  • Fragmentation: 30-40% of memory wasted
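The waste from contiguous pre-allocation is easy to see with a little arithmetic. A minimal sketch, where `max_len` and `generated` are illustrative assumptions, not measurements:

```python
# Contiguous KV-cache pre-allocation: every request reserves slots for its
# maximum length up front, even though most requests finish far earlier.
max_len = 2048      # KV slots reserved per request (assumed serving config)
generated = 200     # tokens a typical request actually produces (assumed)

wasted = 1 - generated / max_len
print(f"{wasted:.0%} of the reserved KV slots are never used")  # → 90%
```

With realistic length distributions the aggregate waste lands in the 30-40% range the bullet above mentions, because some requests do run long while every request pays the full reservation.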
vLLM PagedAttention:
  • The KV-cache is split into 16-token blocks
  • Each request has a block table (logical index → physical block pointer)
  • Block-level allocation → no fragmentation
  • Copy-on-write: with a prefix cache (e.g. the system prompt), multiple requests point to the same blocks
Request A: "You are a helpful assistant. Hello, my name is Ahmet."
Request B: "You are a helpful assistant. What is 2+2?"

Block 0: "You are a helpful" → shared between A & B (ref count 2)
Block 1: "assistant." → shared
Block 2 (A): "Hello, my name is Ahmet."
Block 2 (B): "What is 2+2?"
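The block table and copy-on-write mechanics above can be sketched in a few dozen lines. This is a simplified model, not vLLM's actual implementation (the real `BlockSpaceManager` in `vllm/core/block_manager.py` also tracks the KV tensors themselves); names like `BlockAllocator` and `Sequence` here are illustrative:

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default)

class BlockAllocator:
    """Hands out physical block ids and tracks reference counts."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        blk = self.free.pop()
        self.ref_count[blk] = 1
        return blk

    def share(self, block_table):
        # Prefix sharing: no copy, just bump every block's ref count.
        for blk in block_table:
            self.ref_count[blk] += 1
        return list(block_table)

    def release(self, blk):
        self.ref_count[blk] -= 1
        if self.ref_count[blk] == 0:
            self.free.append(blk)

class Sequence:
    """One request: a logical token stream mapped to physical blocks."""
    def __init__(self, alloc):
        self.alloc = alloc
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def fork(self):
        child = Sequence(self.alloc)
        child.block_table = self.alloc.share(self.block_table)
        child.num_tokens = self.num_tokens
        return child

    def append_token(self):
        slot = self.num_tokens % BLOCK_SIZE
        if slot == 0:
            # Current block is full (or table is empty): grab a fresh block.
            self.block_table.append(self.alloc.allocate())
        else:
            blk = self.block_table[-1]
            if self.alloc.ref_count[blk] > 1:
                # Copy-on-write: the partially filled last block is shared,
                # so copy it before writing (a real impl copies KV tensors).
                new_blk = self.alloc.allocate()
                self.alloc.release(blk)
                self.block_table[-1] = new_blk
        self.num_tokens += 1
```

Walking the Request A/B example through it: after a 20-token shared prefix, A holds two blocks (one full, one with 4 tokens). Forking B shares both at ref count 2; B's first generated token lands in the partial block, triggering copy-on-write for that block only, while the full prefix block stays shared.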
Result: 4-8x more concurrent requests in the same 24 GB.
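Where does 4-8x come from? A back-of-the-envelope capacity comparison, assuming 7B-class model dimensions and an illustrative KV budget (all numbers below are assumptions):

```python
# KV bytes per token: K and V, per layer, per head, fp16 (2 bytes).
layers, kv_heads, head_dim = 32, 32, 128          # 7B-class dims (assumed)
bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # = 0.5 MiB/token

kv_budget = 8 * 1024**3            # ~8 GiB left for KV after weights (assumed)
token_slots = kv_budget // bytes_per_token        # 16384 token slots

max_len, avg_len = 2048, 300       # reserved vs actually used (assumed)
contiguous = token_slots // max_len  # each request reserves max_len up front
paged = token_slots // avg_len       # blocks handed out only as tokens arrive
print(contiguous, paged, round(paged / contiguous, 1))  # → 8 54 6.8
```

Prefix sharing pushes the ratio higher still, since shared system-prompt blocks are counted once rather than per request.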
✅ Deliverables
  1. Read `vllm/core/block_manager.py` in the vLLM source.
  2. Monitor the prefix-cache hit-rate metrics while serving.
  3. Next lesson: 13.6 — torch.compile + Inductor.

