PagedAttention (vLLM): Block Table + Copy-on-Write + KV-Cache Fragmentation
Deep anatomy of vLLM's killer feature, PagedAttention: the KV cache split into 16-token blocks, a logical→physical block table, copy-on-write (prefix sharing), near-zero fragmentation. CUDA implementation snippets and vLLM source reading. Prefix-cache hit rate of 50%+ → throughput +60% on an RTX 4090.
Şükrü Yusuf KAYA
28 min read
1. The PagedAttention Concept
Classic LLM serving:
- Each request's KV cache is one contiguous tensor
- Memory is pre-allocated for the request's expected maximum length
- Fragmentation: 30-40% of memory is wasted (see the back-of-the-envelope sketch below)
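To see where the 30-40% figure comes from, here is a hedged back-of-the-envelope calculation. The model dimensions (32 layers, 32 KV heads, head dim 128, fp16) and the 1300-token actual length are assumed example values, not measurements from the article:

```python
# Back-of-the-envelope sketch (assumed Llama-7B-like dims, fp16):
# KV bytes per token = 2 (K + V) * layers * kv_heads * head_dim * dtype_bytes
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # 512 KiB

max_len = 2048     # contiguous pre-allocation per request
actual_len = 1300  # tokens actually produced (prompt + completion), assumed

allocated = max_len * kv_bytes_per_token
used = actual_len * kv_bytes_per_token
print(f"allocated: {allocated / 2**20:.0f} MiB, "
      f"wasted: {(allocated - used) / allocated:.0%}")  # ~37% internal fragmentation
```

With these assumptions a single request pre-allocates ~1 GiB of KV cache and wastes roughly a third of it, which is exactly the range quoted above.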
vLLM PagedAttention:
- The KV cache is split into 16-token blocks
- Each request has a block table (logical block index → physical block pointer)
- Block-level allocation → virtually no fragmentation (only the last block of each sequence can be partially empty)
- Copy-on-write: with prefix caching (e.g. a shared system prompt), multiple requests point to the same physical blocks (see the sketch after this list)
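A minimal, illustrative Python sketch of the idea, not vLLM's actual code: the class names `BlockAllocator` and `Sequence`, the `fork()`/`append_token()` methods, and the constant `BLOCK_SIZE` are assumptions made for this example. vLLM's real block manager (in `vllm/core/block_manager.py`) additionally handles eviction, swapping to CPU, and prefix hashing.

```python
# Minimal sketch of a block table with ref-counting and copy-on-write.
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids
        self.ref_count = {}                  # physical id -> reference count

    def alloc(self):
        pid = self.free.pop()
        self.ref_count[pid] = 1
        return pid

    def share(self, pid):                    # another sequence reuses this block
        self.ref_count[pid] += 1

    def free_block(self, pid):
        self.ref_count[pid] -= 1
        if self.ref_count[pid] == 0:
            self.free.append(pid)

class Sequence:
    """One request: block_table maps logical block idx -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def fork(self):
        """Prefix sharing: the new sequence points at the same physical blocks."""
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for pid in child.block_table:
            self.allocator.share(pid)
        return child

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:       # last block is full (or none yet)
            self.block_table.append(self.allocator.alloc())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:  # shared, partially filled -> copy-on-write
                new = self.allocator.alloc()
                # (a real implementation would copy the partial KV block on the GPU here)
                self.allocator.free_block(last)
                self.block_table[-1] = new
        self.num_tokens += 1
```

Note that copy-on-write is only ever needed for the last, partially filled block of a sequence: full blocks are never written to again, so they can stay shared forever. The Request A/B example below shows the same idea in terms of tokens.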
Request A: "You are a helpful assistant. Hello, my name is Ahmet."
Request B: "You are a helpful assistant. What is 2+2?"

Block 0: "You are a helpful" → shared between A & B (ref count 2)
Block 1: "assistant." → shared
Block 2 (A): "Hello, my name is Ahmet."
Block 2 (B): "What is 2+2?"
Result: 4-8x more concurrent requests in the same 24 GB.
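As a hedged usage sketch, prefix sharing is exposed in vLLM through the `enable_prefix_caching` engine argument; the model id and memory fraction below are placeholder values, and the argument names should be checked against your installed vLLM version.

```python
# Sketch: enabling automatic prefix caching in vLLM (verify args for your version).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: any HF model id you use
    enable_prefix_caching=True,                # reuse KV blocks of shared prefixes
    gpu_memory_utilization=0.90,
)

system = "You are a helpful assistant. "
prompts = [system + "Hello, my name is Ahmet.",
           system + "What is 2+2?"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```

Both prompts share the same system-prompt prefix, so their first KV blocks are served from the prefix cache rather than being recomputed.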
✅ Deliverables
1) Read `vllm/core/block_manager.py` in the vLLM source code. 2) Monitor prefix-cache hit-rate metrics while serving (a scraping sketch follows below). 3) Next lesson: 13.6 — torch.compile + Inductor.
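A hedged sketch for item 2: the vLLM OpenAI-compatible server exposes Prometheus metrics at `/metrics`, but the exact metric names vary across versions, so the snippet simply filters for anything prefix-cache related. The `localhost:8000` address is an assumption for a locally running server.

```python
# Sketch: scrape the vLLM server's Prometheus endpoint and print
# prefix-cache related metrics (names differ between vLLM versions).
import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
for line in resp.text.splitlines():
    if "prefix_cache" in line and not line.startswith("#"):
        print(line)
```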