PagedAttention (vLLM): Block Table + Copy-on-Write + KV-Cache Fragmentation
Deep anatomy of vLLM's killer feature, PagedAttention: the KV cache split into 16-token blocks, a logical→physical block table, copy-on-write (prefix sharing), near-zero fragmentation. CUDA implementation snippets and vLLM source reading. Prefix-cache hit rate of 50%+ → throughput +60% on an RTX 4090.
Şükrü Yusuf KAYA
28 min read
1. The PagedAttention Concept
Classic LLM serving:
- Each request's KV cache is one contiguous tensor
- Memory is pre-allocated for the request's expected maximum length
- Fragmentation: 30-40% of memory is wasted (see the back-of-the-envelope sketch below)
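To see where the 30-40% figure comes from, here is a hedged back-of-the-envelope calculation. The model dimensions (32 layers, 32 KV heads, head dim 128, fp16) and the 1300-token actual length are assumed example values, not measurements from the article:

```python
# Back-of-the-envelope sketch (assumed Llama-7B-like dims, fp16):
# KV bytes per token = 2 (K + V) * layers * kv_heads * head_dim * dtype_bytes
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # 512 KiB

max_len = 2048     # contiguous pre-allocation per request
actual_len = 1300  # tokens actually produced (prompt + completion), assumed

allocated = max_len * kv_bytes_per_token
used = actual_len * kv_bytes_per_token
print(f"allocated: {allocated / 2**20:.0f} MiB, "
      f"wasted: {(allocated - used) / allocated:.0%}")  # ~37% internal fragmentation
```

With these assumptions a single request pre-allocates ~1 GiB of KV cache and wastes roughly a third of it, which is exactly the range quoted above.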
vLLM PagedAttention:
- The KV cache is split into 16-token blocks
- Each request has a block table (logical block index → physical block pointer)
- Block-level allocation → virtually no fragmentation (only the last block of each sequence can be partially empty)
- Copy-on-write: with prefix caching (e.g. a shared system prompt), multiple requests point to the same physical blocks (see the sketch after this list)
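A minimal, illustrative Python sketch of the idea, not vLLM's actual code: the class names `BlockAllocator` and `Sequence`, the `fork()`/`append_token()` methods, and the constant `BLOCK_SIZE` are assumptions made for this example. vLLM's real block manager (in `vllm/core/block_manager.py`) additionally handles eviction, swapping to CPU, and prefix hashing.

```python
# Minimal sketch of a block table with ref-counting and copy-on-write.
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # physical block ids
        self.ref_count = {}                  # physical id -> reference count

    def alloc(self):
        pid = self.free.pop()
        self.ref_count[pid] = 1
        return pid

    def share(self, pid):                    # another sequence reuses this block
        self.ref_count[pid] += 1

    def free_block(self, pid):
        self.ref_count[pid] -= 1
        if self.ref_count[pid] == 0:
            self.free.append(pid)

class Sequence:
    """One request: block_table maps logical block idx -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def fork(self):
        """Prefix sharing: the new sequence points at the same physical blocks."""
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for pid in child.block_table:
            self.allocator.share(pid)
        return child

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:       # last block is full (or none yet)
            self.block_table.append(self.allocator.alloc())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:  # shared, partially filled -> copy-on-write
                new = self.allocator.alloc()
                # (a real implementation would copy the partial KV block on the GPU here)
                self.allocator.free_block(last)
                self.block_table[-1] = new
        self.num_tokens += 1
```

Note that copy-on-write is only ever needed for the last, partially filled block of a sequence: full blocks are never written to again, so they can stay shared forever. The Request A/B example below shows the same idea in terms of tokens.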
Request A: "You are a helpful assistant. Hello, my name is Ahmet."
Request B: "You are a helpful assistant. What is 2+2?"

Block 0: "You are a helpful" → shared between A & B (ref count 2)
Block 1: "assistant." → shared
Block 2 (A): "Hello, my name is Ahmet."
Block 2 (B): "What is 2+2?"
Result: 4-8x more concurrent requests in the same 24 GB.
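As a hedged usage sketch, prefix sharing is exposed in vLLM through the `enable_prefix_caching` engine argument; the model id and memory fraction below are placeholder values, and the argument names should be checked against your installed vLLM version.

```python
# Sketch: enabling automatic prefix caching in vLLM (verify args for your version).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: any HF model id you use
    enable_prefix_caching=True,                # reuse KV blocks of shared prefixes
    gpu_memory_utilization=0.90,
)

system = "You are a helpful assistant. "
prompts = [system + "Hello, my name is Ahmet.",
           system + "What is 2+2?"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```

Both prompts share the same system-prompt prefix, so their first KV blocks are served from the prefix cache rather than being recomputed.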
✅ Deliverables
1) Read `vllm/core/block_manager.py` in the vLLM source code. 2) Monitor prefix-cache hit-rate metrics while serving (a scraping sketch follows below). 3) Next lesson: 13.6 — torch.compile + Inductor.
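A hedged sketch for item 2: the vLLM OpenAI-compatible server exposes Prometheus metrics at `/metrics`, but the exact metric names vary across versions, so the snippet simply filters for anything prefix-cache related. The `localhost:8000` address is an assumption for a locally running server.

```python
# Sketch: scrape the vLLM server's Prometheus endpoint and print
# prefix-cache related metrics (names differ between vLLM versions).
import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
for line in resp.text.splitlines():
    if "prefix_cache" in line and not line.startswith("#"):
        print(line)
```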