Disaggregated Serving: Prefill/Decode Separation — Mooncake + DistServe

The current trend in LLM serving (2024-2026) is to run prefill (input encoding) and decode (token generation) on different GPUs. Prefill is compute-bound and decode is memory-bound, so separating them is reported to yield a 30-50% throughput gain. This lesson covers the Mooncake (Kimi/Moonshot AI) and DistServe (Peking University / UC San Diego) recipes. On RTX 4090 multi-GPU setups the material is conceptual.

Şükrü Yusuf KAYA

24 min read

5/14/2026

Advanced

1. Why Disaggregation?

LLM inference consists of two distinct workloads:

Phase	Character	Bottleneck
Prefill	Forward pass over all input tokens, builds the KV cache	Compute-bound (parallelizable)
Decode	Generates tokens one at a time, reads the KV cache	Memory-bound (sequential)
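
The compute-bound versus memory-bound split in the table can be sanity-checked with a rough roofline estimate. The sketch below is a simplification, not a profiler: it counts only weight reads and about 2 FLOPs per parameter per token, ignoring attention FLOPs and KV cache traffic. The RTX 4090 figures are approximate assumptions, and all function names are hypothetical.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    Each parameter does ~2 FLOPs (multiply + add) per token, but is read
    from memory only once per pass, no matter how many tokens share it.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """FLOPs/byte above which the GPU is limited by compute, not memory."""
    return peak_flops / mem_bandwidth


def bottleneck(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float) -> str:
    ai = arithmetic_intensity(tokens_per_pass)
    if ai >= ridge_point(peak_flops, mem_bandwidth):
        return "compute-bound"
    return "memory-bound"


# Assumed, approximate RTX 4090 numbers (FP16 tensor throughput, GDDR6X bandwidth).
PEAK_FLOPS = 165e12   # FLOP/s
MEM_BW = 1.0e12       # bytes/s

print("prefill, 2048-token prompt:", bottleneck(2048, PEAK_FLOPS, MEM_BW))
print("decode, batch of 1 token:  ", bottleneck(1, PEAK_FLOPS, MEM_BW))
```

Under these assumptions, a decode step only crosses the ridge point once well over a hundred sequences are batched together, which is why decode GPUs are tuned for large batches while prefill GPUs saturate compute with a single long prompt.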

Classic (co-located): the same GPU runs both prefill and decode, so the bottleneck of one phase stalls the other.

Disaggregated: prefill GPUs handle only prefill and decode GPUs handle only decode, so each phase can be tuned to its own optimum.
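
The interference described above can be illustrated with a toy timeline. This is a minimal sketch under loud assumptions: a single decoding request, a scheduler that runs any arrived prefill before the next decode step, no chunked prefill, and made-up timings. The function name and parameters are hypothetical.

```python
def inter_token_gaps(decode_step_ms, n_tokens, prefill_arrivals_ms, prefill_ms, colocated):
    """Return the time between consecutive tokens of one decoding request.

    If `colocated` is True, prefills that have arrived run on the same GPU
    and block the next decode step. Otherwise they run elsewhere.
    """
    t = 0.0
    gaps = []
    pending = sorted(prefill_arrivals_ms)
    for _ in range(n_tokens):
        start = t
        if colocated:
            # Every prefill that has arrived by now runs before this decode step.
            while pending and pending[0] <= t:
                pending.pop(0)
                t += prefill_ms
        t += decode_step_ms
        gaps.append(t - start)
    return gaps


# Hypothetical timings: 20 ms per decode step, one 400 ms prefill arriving at t=30 ms.
colo = inter_token_gaps(20, 5, [30], 400, colocated=True)
disagg = inter_token_gaps(20, 5, [30], 400, colocated=False)
print("co-located gaps (ms):   ", colo)
print("disaggregated gaps (ms):", disagg)
```

In the co-located run one token gap absorbs the whole prefill, which is the tail-latency spike that separation is meant to remove.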

Results (as cited from the DistServe paper; actual gains depend on workload and SLO targets):

+30-50% throughput at the same GPU count
P99 latency drops by 40-60%

RTX 4090 scenario: you cannot disaggregate on a single GPU. You need 2× 4090 or a cloud multi-GPU setup. The Lab in this Cookbook is conceptual; a real deployment is multi-node.

2. Mooncake (Kimi/Moonshot) + DistServe (PKU/UCSD)

Mooncake (2024):

Separate GPU pools for prefill and decode
GPU-to-GPU KV cache transfer (NVLink/RDMA)
Centralized cache pool (SSD/RAM) for long contexts
Supports Kimi-1.5's 2M context
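
To get a feel for why the KV cache transfer link matters, the sketch below estimates the KV cache size for one request and the time to move it between nodes. The layer, KV head, and head dimension values are the published Llama-3.1-70B configuration (80 layers, 8 KV heads, head dimension 128) with FP16 storage. The 25 GB/s link speed is an assumption for illustration, and the function names are hypothetical.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence.

    The leading 2 accounts for one key tensor plus one value tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def transfer_seconds(n_bytes, link_bytes_per_s):
    """Idealized transfer time, ignoring latency and protocol overhead."""
    return n_bytes / link_bytes_per_s


LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)
ASSUMED_LINK = 25e9  # bytes/s, roughly a 200 Gb/s RDMA link (assumption)

size = kv_cache_bytes(8192, **LLAMA_70B)
print(f"KV cache for 8192 tokens: {size / 2**30:.2f} GiB")
print(f"Transfer time at assumed link speed: {transfer_seconds(size, ASSUMED_LINK) * 1000:.0f} ms")
```

At 320 KiB per token, long prompts produce multi-gigabyte caches, so the transfer cost has to stay small relative to prefill time for disaggregation to pay off.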

DistServe (Peking University / UC San Diego, 2024):

Open-source reference implementation
Goodput optimization (SLO-aware scheduling)
Per-GPU SLA: a P95 latency target drives the routing decision
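
Goodput counts only the requests that meet their latency targets, rather than raw throughput. DistServe frames it as the request rate a system can sustain under an SLO attainment target; the sketch below is a simplified reading that just measures attainment and SLO-compliant requests per second on a recorded trace. The `RequestStats` class, the function names, and the SLO values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    ttft_ms: float  # time to first token (dominated by prefill)
    tpot_ms: float  # mean time per output token (dominated by decode)


def meets_slo(r: RequestStats, ttft_slo_ms: float, tpot_slo_ms: float) -> bool:
    return r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms


def slo_attainment(requests, ttft_slo_ms, tpot_slo_ms) -> float:
    """Fraction of requests that meet both latency targets."""
    if not requests:
        return 0.0
    ok = sum(meets_slo(r, ttft_slo_ms, tpot_slo_ms) for r in requests)
    return ok / len(requests)


def goodput_rps(requests, duration_s, ttft_slo_ms, tpot_slo_ms) -> float:
    """SLO-compliant requests completed per second over the trace."""
    ok = sum(meets_slo(r, ttft_slo_ms, tpot_slo_ms) for r in requests)
    return ok / duration_s


# Made-up trace: two requests meet both SLOs, one misses TTFT, one misses TPOT.
trace = [
    RequestStats(150, 30),
    RequestStats(450, 30),
    RequestStats(150, 80),
    RequestStats(100, 20),
]
print("attainment:", slo_attainment(trace, ttft_slo_ms=200, tpot_slo_ms=50))
print("goodput (req/s):", goodput_rps(trace, duration_s=2.0, ttft_slo_ms=200, tpot_slo_ms=50))
```

Tracking TTFT and TPOT separately is what makes disaggregation tunable: a TTFT miss points at the prefill pool, a TPOT miss at the decode pool.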

Cookbook Lab (multi-GPU scenario):

# Mooncake-style deployment (multi-node, conceptual: command names and flags are illustrative)
# Node 1: prefill workers
mooncake-prefill --model llama-3.1-70b --gpu 0,1,2,3 --kv_transfer rdma

# Node 2: decode workers
mooncake-decode --model llama-3.1-70b --gpu 0,1,2,3 --kv_source node1-rdma
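
Without a multi-GPU machine, the prefill to decode handoff can still be mimicked in a single process. The toy below is only a sketch of the control flow: a thread stands in for each worker pool, a `queue.Queue` stands in for the RDMA/NVLink KV transfer, and the "KV cache" is just a list of tokens. Nothing here calls Mooncake or DistServe, and every name is hypothetical.

```python
import queue
import threading


def prefill_worker(prompts, kv_queue):
    """Process each prompt once and hand its 'KV cache' to the decode side."""
    for req_id, prompt in prompts:
        kv_cache = prompt.split()         # stand-in for the prefill forward pass
        kv_queue.put((req_id, kv_cache))  # stand-in for the KV transfer
    kv_queue.put(None)                    # sentinel: no more requests


def decode_worker(kv_queue, results, max_new_tokens=3):
    """Consume transferred caches and generate tokens one at a time."""
    while True:
        item = kv_queue.get()
        if item is None:
            break
        req_id, kv_cache = item
        generated = []
        for _ in range(max_new_tokens):
            token = f"tok{len(kv_cache)}"  # fake next token, depends on cache length
            kv_cache.append(token)         # decode extends the cache each step
            generated.append(token)
        results[req_id] = generated


def run_pipeline(prompts):
    kv_queue = queue.Queue()
    results = {}
    p = threading.Thread(target=prefill_worker, args=(prompts, kv_queue))
    d = threading.Thread(target=decode_worker, args=(kv_queue, results))
    p.start()
    d.start()
    p.join()
    d.join()
    return results


print(run_pipeline([("a", "hello world"), ("b", "one two three")]))
```

The point of the structure is that the prefill side never waits on token generation: it keeps pushing caches into the queue while the decode side drains them at its own pace.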

✅ Deliverables

1) Read the DistServe paper (for the concepts). 2) Explore the Mooncake repo. 3) Next lesson: 15.10, Edge Inference: ONNX + Jetson + NPU.
