# vLLM Internals and Custom Backend Engineering Training (PagedAttention + Continuous Batching + Speculative Decoding + NVIDIA Dynamo)

> Source: https://sukruyusufkaya.com/en/training/vllm-internals-custom-backend-muhendisligi-egitimi
> Updated: 2026-05-19T18:54:34.515Z
> Level: advanced
> Topics: vllm, pagedattention, continuous batching, speculative decoding, eagle-3, medusa, tensor parallelism, pipeline parallelism, kv cache, awq marlin, gptq, fp8 fp4, nvidia dynamo, disaggregated serving, ray serve, kubernetes vllm, prefill decode separation, custom backend, inference engine, llm serving

**TLDR:** A 3-day advanced training, delivered in Turkish, that covers vLLM, the de facto production LLM inference engine, end to end: its internal architecture, the PagedAttention algorithm, continuous-batching mechanics, speculative decoding (EAGLE-3 + MEDUSA), tensor + pipeline + expert parallelism, AWQ + GPTQ + FP8 + FP4 quantization integration, custom-model integration, and NVIDIA Dynamo disaggregated serving. Includes a Kubernetes + Ray Serve + Prometheus + Langfuse production stack.

## Description

The vLLM Internals and Custom Backend Engineering Training is a 3-day advanced program, taught in Turkish, that covers end to end the internal architecture, algorithmic foundations, and production deployment of vLLM, which became the de facto inference-engine standard over 2024-2026. It is calibrated for ML Engineers, Inference Engineers, ML Platform Engineers, Senior Backend Developers, and SREs.

## Learning Outcomes

- Understand vLLM's 5 main components (LLMEngine, Scheduler, Worker, BlockManager, Sampler) at source-code level.
- Build the PagedAttention algorithm at the mathematical level.
- Use continuous batching + chunked prefill + scheduling policy in production.
- Integrate speculative decoding (EAGLE-3, MEDUSA, ngram) into vLLM.
- Set up multi-GPU + multi-node serving for 70B-671B models with TP + PP + EP.
- Use the AWQ + GPTQ + FP8 + FP4 + KV cache quantization stack effectively.
- Write a custom model, custom layers, and PagedAttention-compatible attention, registered through ModelRegistry.
- Deploy NVIDIA Dynamo + Mooncake disaggregated serving architecture.
- Build a Kubernetes + Helm + Ray Serve + Prometheus + Langfuse production stack.
- Optimize performance with tuning parameters + vllm-bench + Pareto frontier.

<p>vLLM's path to becoming the de facto inference-engine standard of the 2024-2026 period began with the PagedAttention paper that UC Berkeley's Sky Computing Lab presented at SOSP in October 2023. Since then it has grown into a production-grade platform with 30K+ GitHub stars, 2025 incubation under LF AI & Data Foundation, the vLLM v1 redesign (March 2025), the NVIDIA Dynamo collaboration (March 2025), and the Neural Magic + Anyscale + Red Hat ecosystem. In Turkey, training that covers this discipline end to end, from source-code level to production Kubernetes deployment, is virtually nonexistent: existing content either stops at short vLLM tutorials or at OpenAI-compatible server usage demos. This program is designed to fill that gap as Turkey's most comprehensive production-grade vLLM internals reference training.</p>

<p>The program's strategic backbone is the first module, which clarifies vLLM's birth and rise, why it became the inference-engine standard, and the 2026 ecosystem landscape. Milestones: the Kwon 2023 PagedAttention paper (SOSP) from UC Berkeley's Sky Computing Lab; the 2024 spread into production; 2025 incubation under LF AI & Data Foundation; the vLLM v1 redesign (March 2025: sync-to-async architecture transition, 1.7x throughput); the NVIDIA Dynamo collaboration; Red Hat's acquisition of Neural Magic (2024). Inference engine comparison: vLLM vs SGLang (UC Berkeley + Stanford, RadixAttention), vs TensorRT-LLM (NVIDIA-only, fastest NVIDIA-native), vs Hugging Face TGI (simple + production-ready), vs LMDeploy (Shanghai AI Lab, TurboMind kernel). 2026 inference landscape: multi-vendor inference (NVIDIA H100/B200, AMD MI300X/MI355X, AWS Trainium 2, Apple Silicon), the unique requirements of reasoning-model + agent + long-context serving, and open-source vs commercial inference (Anyscale, Together AI, Fireworks).</p>

<p>The second module covers vLLM's internal architectural components at the source-code level. Five main components: LLMEngine (central coordinator, manages request lifecycle), Scheduler (request queueing + scheduling policy, running/waiting/swapped queues), Worker (GPU model executor + ModelRunner), BlockManager (PagedAttention block allocation + physical/logical block table), Sampler (greedy / top-p / temperature / typical-p). v0 → v1 redesign (March 2025): transition from synchronous architecture to decoupled async; separate scheduler + worker processes; 1.7x throughput improvement + breaking changes. Python entry points: vllm.LLM (sync API, offline batch inference), vllm.AsyncLLMEngine (async server use), OpenAI-compatible HTTP server (vllm serve). Without understanding this architecture, custom backends cannot be written.</p>
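<p>As a rough illustration of what the Sampler stage decides for each generated token, the sketch below implements greedy, temperature, and top-p selection over a plain list of logits. It is a pure-Python toy, not vLLM's Sampler: the names softmax, top_p_filter, and sample are hypothetical, and treating temperature 0 as greedy decoding is an assumption made here for convenience.</p>

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Return a token index. Assumption of this sketch: temperature == 0 means greedy."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

<p>Calling sample(logits, temperature=0.0) always returns the arg-max token, while a low top_p restricts sampling to the few tokens that carry most of the probability mass.</p>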

<p>The third module builds vLLM's core innovation, PagedAttention, at the mathematical level. In classical attention serving, the KV cache requires one contiguous memory allocation per sequence; because sequence lengths vary, this causes internal fragmentation (allocated but unused slots) and external fragmentation (memory holes), which together waste 60-80% of KV cache memory. PagedAttention adapts OS virtual-memory paging to LLM serving: the KV cache is divided into fixed-size blocks (default 16 tokens/block), each sequence keeps its own logical block table, and physical GPU memory is allocated dynamically from a free-block pool. Result: memory waste drops to under 4% and throughput increases 2-4x. Advanced features: prefix caching (shared blocks with reference counting, so requests with the same system prompt share KV blocks), memory sharing for beam search + parallel sampling, and block swap (GPU → CPU offloading).</p>
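<p>The paging idea can be sketched with a toy allocator: a free-block pool, a per-sequence logical block table, and reference counts for shared prefix blocks. This is a minimal illustration under simplifying assumptions (only full blocks are shared, no copy-on-write, no swapping); the classes BlockPool and Sequence are hypothetical and are not vLLM's BlockManager.</p>

```python
BLOCK_SIZE = 16  # tokens per KV block, the default quoted in the text


class BlockPool:
    """Toy free-list allocator with reference counts for physical KV blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1  # another sequence now points at this block
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:  # last user gone: return block to the pool
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """Holds a logical block table: position i maps to a physical block id."""

    def __init__(self, pool, shared_prefix=()):
        self.pool = pool
        self.block_table = [pool.share(b) for b in shared_prefix]
        self.num_tokens = len(self.block_table) * BLOCK_SIZE

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens == len(self.block_table) * BLOCK_SIZE:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def free(self):
        for block in self.block_table:
            self.pool.release(block)
        self.block_table = []
        self.num_tokens = 0


pool = BlockPool(num_blocks=8)
system = Sequence(pool)
for _ in range(32):  # a 32-token system prompt fills exactly two blocks
    system.append_token()

a = Sequence(pool, shared_prefix=system.block_table)
b = Sequence(pool, shared_prefix=system.block_table)
for _ in range(5):
    a.append_token()
b.append_token()
```

<p>In this run the two requests reuse the system prompt's two blocks and each allocate one block of their own, so four physical blocks are in use rather than the six that two private copies of the prefix would need.</p>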

<p>The fourth module addresses continuous batching, introduced in the 2022 Orca paper (OSDI) and made production-grade in vLLM. It replaces classical static batching, where the server waits to collect a group of requests and the GPU then idles on finished slots until the longest sequence completes. Iteration-level scheduling: the batch composition is updated at each token-generation step; completed requests leave the batch and waiting requests enter. GPU utilization is 30-50% with static batching and 90%+ with continuous batching. Chunked prefill: prefill (compute-bound, matmul-heavy) and decode (memory-bound, KV cache reads) have different resource patterns, so long prompts are split into chunks that are processed in the same batch as decode requests. This mixed prefill-decode batching keeps the GPU well utilized. Scheduling policy: FCFS + priority + fairness; preemption (KV cache swap to CPU, or recompute).</p>
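<p>The difference between static and iteration-level scheduling can be made concrete with a small simulation that counts decode steps and slot utilization. It is a sketch under strong assumptions: every request needs only decode steps, all requests are queued at time zero, and each slot costs one unit per step. The function names and the workload are made up for illustration, so the resulting percentages are not benchmark figures.</p>

```python
def static_batching(lengths, batch_size):
    """Each batch runs until its longest request finishes. Returns (steps, utilization)."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps, sum(lengths) / (steps * batch_size)


def continuous_batching(lengths, batch_size):
    """Iteration-level scheduling: free slots are refilled from the queue every step."""
    waiting = list(lengths)  # FCFS queue holding remaining decode steps per request
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Requests with one step left finish now; the rest carry over.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps, sum(lengths) / (steps * batch_size)


# Hypothetical workload: two long generations mixed with short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps, static_util = static_batching(lengths, batch_size=4)
cont_steps, cont_util = continuous_batching(lengths, batch_size=4)
print(f"static:     {static_steps} steps, {static_util:.1%} slot utilization")
print(f"continuous: {cont_steps} steps, {cont_util:.1%} slot utilization")
```

<p>On this toy workload static batching needs 200 steps and continuous batching 110, because the second long request starts as soon as a short one frees its slot instead of waiting for the whole first batch.</p>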

<p>The fifth module covers in detail how speculative decoding integrates with vLLM. Speculative decoding (Leviathan 2023) is a two-stage scheme: a small draft model proposes several tokens, and the large target model verifies them in a single parallel pass; the expected speedup follows from the acceptance rate, the number of drafted tokens, and the draft-to-target cost ratio. Modern variants: EAGLE-3 (Li et al. 2025, feature-level draft + tree attention, 4-5x speedup), MEDUSA (Princeton 2024, multi-head self-speculation, no draft model needed), Lookahead decoding (Fu 2024, ngram + n-token prediction). In vLLM: --speculative-config CLI parameter + ngram_speculator + draft model orchestration; choosing between tree-based and sequential speculative decoding. For reasoning-model serving (o3, DeepSeek R1) with 16K-128K thinking traces, speculative decoding provides a 40-60% cost reduction, a critical production advantage.</p>
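<p>The relationship between acceptance rate and speedup can be written down using the analysis in the Leviathan 2023 paper, which assumes each drafted token is accepted independently with probability alpha. The sketch below is a direct transcription of that closed form; the function names and the example numbers (alpha, gamma, cost ratio) are illustrative assumptions, not measurements.</p>

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens produced per target-model pass.

    alpha: probability that a drafted token is accepted (assumed i.i.d.)
    gamma: number of tokens the draft model proposes per pass
    """
    if alpha == 1.0:
        return gamma + 1  # every draft accepted, plus the target's own token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Expected wall-time improvement over plain decoding.

    cost_ratio: time of one draft step divided by time of one target step.
    Each pass costs gamma draft steps plus one target step.
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1)


# Hypothetical operating point: 80% acceptance, 4 drafted tokens, draft 20x cheaper.
print(round(expected_tokens_per_pass(0.8, 4), 4))
print(round(expected_speedup(0.8, 4, 0.05), 4))
```

<p>The second function shows why a larger gamma is not always better: the numerator saturates at 1 / (1 - alpha) while the denominator keeps growing with every extra draft step.</p>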

<p>The sixth module addresses multi-GPU + multi-node distributed serving in scenarios where large models (70B+) don't fit on a single GPU. Tensor Parallelism (TP, Megatron-LM 2019 logic): intra-layer attention + MLP split, all-reduce communication; --tensor-parallel-size 8 is ideal for single-node 8x H100. Pipeline Parallelism (PP, GPipe + 1F1B): inter-layer split, micro-batch pipeline, micro-batch size tuning critical; --pipeline-parallel-size 2 for multi-node. Expert Parallelism (EP): expert distribution for MoE DeepSeek V3 671B (37B active). Data Parallelism (DP): replication, scale-out. Multi-node orchestration with vLLM + Ray Serve; NCCL all-reduce + ring all-reduce; NVLink 5 (B200) + NVSwitch + InfiniBand 400Gb topology; Blackwell GB200 NVL72 rack architecture (72-GPU coherent fabric). 16x H100 minimum for DeepSeek V3 671B (FP8) deployment.</p>
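<p>The intra-layer split used by tensor parallelism can be checked numerically on a tiny MLP: the first weight matrix is split by columns, the second by rows, and an all-reduce sums the per-rank partial outputs. The sketch below uses plain Python lists and ReLU instead of a real activation and real collectives; all function names are hypothetical, and all_reduce_sum only stands in for an NCCL all-reduce.</p>

```python
def matmul(x, w):
    """Naive matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_columns(w, parts):
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def split_rows(w, parts):
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: element-wise sum of every rank's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def tp_mlp(x, w1, w2, tp_size):
    """MLP forward pass with w1 column-parallel and w2 row-parallel."""
    w1_shards = split_columns(w1, tp_size)
    w2_shards = split_rows(w2, tp_size)
    # Each "rank" works only on its own shards; no communication until the end.
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    return all_reduce_sum(partials)


x = [[1, 2], [3, -1]]
w1 = [[1, -1, 2, 0], [0, 1, -1, 3]]
w2 = [[1, 0], [0, 1], [2, -1], [1, 1]]
reference = matmul(relu(matmul(x, w1)), w2)
print(tp_mlp(x, w1, w2, tp_size=2) == reference)
```

<p>Because the activation is element-wise, each rank can apply it to its own slice of the hidden dimension, which is why this column-then-row layout needs only one all-reduce per MLP block.</p>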

<p>The seventh module covers in detail vLLM's quantization support. AWQ (Lin 2023 + Marlin kernel Neural Magic 2024 optimization): 4-bit weight serving 2-3x throughput; --quantization awq_marlin parameter. GPTQ (Frantar 2022 + GPTQModel kernel): act-order desc_act + group_size 128. FP8 (Hopper E4M3 native Tensor Core + Blackwell NVFP4): hardware-native low precision; --quantization fp8 + fp8_e4m3. INT8 W8A8 (SmoothQuant outlier migration). KV cache quantization: --kv-cache-dtype fp8 critical for reasoning model long-trace; KIVI 2-bit experimental support (vLLM 0.7+). Quality + throughput + memory triangle Pareto frontier: concrete benchmark numbers for Llama 3.3 70B FP16 (140GB) vs AWQ-INT4 (35GB) vs FP8 (70GB). Production decision matrix: quality regression budget + cost target.</p>
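<p>The memory figures quoted for Llama 3.3 70B follow from parameter count times bytes per parameter. The sketch below reproduces that arithmetic; it counts weights only and, as a stated simplification, ignores quantization scales and zero-points, the KV cache, and activation memory. The helper name weight_memory_gb is hypothetical.</p>

```python
# Bytes stored per weight for each format (weights only, no scales or zero-points).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params_billion, dtype):
    """Approximate weight footprint in decimal GB for a dense model."""
    return num_params_billion * BYTES_PER_PARAM[dtype]


for dtype in ("fp16", "fp8", "int4"):
    print(f"70B @ {dtype}: {weight_memory_gb(70, dtype):.0f} GB")
```

<p>Real checkpoints come out somewhat larger than these figures, since group-wise formats such as AWQ and GPTQ also store per-group scales.</p>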

<p>The eighth module covers in detail how to integrate a new model architecture that is not built into vLLM. vllm.ModelRegistry API + register_model() call; CausalLM interface implementation (forward() + sample() + get_input_embeddings()); vLLM source code (vllm/model_executor/models/) structure and existing model implementations (Llama, Qwen, Mistral, Gemma, DeepSeek, MoE variants). Weight loading: Hugging Face safetensors → vLLM weight-tensor mapping; sharded weight loading (multi-file safetensors); quantized weight (AWQ / GPTQ / FP8) loading. Custom layer: writing PagedAttention-compatible attention layers; rotary embedding (RoPE + YaRN scaling) + GroupedQueryAttention (GQA) + Multi-head Latent Attention (MLA, DeepSeek V3); MoE expert routing + top-k expert selection layer. Adding a Turkish CPT-trained custom Llama 4 / Qwen3 model to vLLM is demonstrated hands-on.</p>
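<p>One recurring step in weight loading is renaming checkpoint tensors so that separate projections land in a fused parameter. The sketch below illustrates that mapping on parameter names only. It is loosely modeled on the fused QKV and gate/up pattern common in serving engines; the table STACKED_PARAMS, the helper map_weight_name, and the example tensor names are assumptions for illustration, not vLLM's actual loader code.</p>

```python
# (fused parameter, original checkpoint parameter, shard id inside the fused tensor)
# Hypothetical table for a Llama-style layer layout.
STACKED_PARAMS = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]


def map_weight_name(checkpoint_name):
    """Return (target parameter name, shard id); shard id is None for unfused weights."""
    for fused, original, shard_id in STACKED_PARAMS:
        marker = f".{original}."
        if marker in checkpoint_name:
            return checkpoint_name.replace(marker, f".{fused}."), shard_id
    return checkpoint_name, None


for name in (
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.norm.weight",
):
    print(name, "->", map_weight_name(name))
```

<p>A real loader would then copy each tensor into the slice of the fused parameter selected by the shard id, and split that slice further across tensor-parallel ranks.</p>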

<p>The ninth module addresses disaggregated serving, the hottest inference-engineering frontier of 2024-2026. Prefill (compute-bound, matmul-heavy, B200/H200 ideal) and decode (memory-bound, KV cache reads, H100 / consumer GPU sufficient) run on separate GPU pools, which enables optimal use of heterogeneous hardware (B200 prefill + H100 decode). KV cache transfer (from prefill node to decode node): GPU-to-GPU NCCL + RDMA + GPUDirect Storage; InfiniBand 400Gb + NVLink Switch System topology. NVIDIA Dynamo (March 2025 release) is a production-grade disaggregated inference platform with vLLM + SGLang + TensorRT-LLM backend support and a smart router. Academic precursors: Mooncake (Moonshot AI 2024, with KV cache pool), DistServe (UCSD 2024). Trade-off between latency overhead and throughput gain: typical scenarios yield a 2-4x throughput improvement + 30-50% cost reduction.</p>
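<p>The latency overhead of moving the KV cache from a prefill node to a decode node can be estimated with back-of-the-envelope arithmetic. The sketch below assumes a Llama-70B-style configuration (80 layers, 8 KV heads via GQA, head dimension 128, FP16 cache) and an ideal 400 Gb/s link with no protocol overhead; both the configuration and the helper names are assumptions for illustration.</p>

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV cache size for one token: a key and a value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def transfer_ms(prompt_tokens, bytes_per_token, link_gbit_per_s):
    """Ideal transfer time in milliseconds at full line rate."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8
    return prompt_tokens * bytes_per_token / bytes_per_second * 1e3


# Assumed model shape: 80 layers, 8 KV heads, head_dim 128, 2-byte (FP16) cache entries.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2)
print(f"{per_token / 1024:.0f} KiB of KV cache per token")
print(f"{transfer_ms(8192, per_token, 400):.1f} ms to ship an 8192-token prompt")
```

<p>Halving dtype_bytes, as an FP8 KV cache does, halves both numbers, which is one reason KV cache quantization and disaggregated serving are often tuned together.</p>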

<p>The tenth module covers end to end the discipline of taking vLLM to production. Kubernetes deployment: vLLM Helm chart + NVIDIA GPU operator + nvidia-device-plugin + nvidia-container-toolkit; Deployment + Service + Ingress + HPA YAML; PVC for model-weight cache (ReadWriteMany NFS / S3 mount). NVIDIA Dynamo platform deployment. Monitoring: Prometheus metrics endpoint (vllm:request_latency_seconds histogram, vllm:gpu_cache_usage_perc gauge, vllm:request_prompt_tokens, vllm:request_generation_tokens, vllm:e2e_request_latency_seconds); Grafana vLLM dashboard; Langfuse + Phoenix LLM-observability integration + OpenTelemetry GenAI semantic conventions. Autoscaling: HPA (custom metric: GPU utilization + queue depth), KEDA event-driven scaling, Karpenter scale-to-zero + GPU spot instance. Load balancing: round-robin (default), prefix-aware routing (SGLang inspired — same-prefix requests to the same replica), session-sticky routing.</p>
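<p>Prefix-aware routing can be approximated by hashing the leading part of each prompt and using the hash to pick a replica, so that requests sharing a system prompt reach the replica that already holds those KV blocks. The sketch below is a deliberately simple version under stated assumptions: it hashes characters rather than tokens, uses plain modulo instead of consistent hashing, and ignores replica load. The function route and the replica names are hypothetical.</p>

```python
import hashlib


def route(prompt, replicas, prefix_chars=64):
    """Choose a replica from a stable hash of the prompt's first prefix_chars characters."""
    prefix = prompt[:prefix_chars].encode("utf-8")
    digest = hashlib.sha256(prefix).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["vllm-0", "vllm-1", "vllm-2"]  # hypothetical replica names
system_prompt = "You are a support assistant for an online bookstore. " * 2

first = route(system_prompt + "Where is my order?", replicas)
second = route(system_prompt + "Can I return an e-book?", replicas)
print(first == second)
```

<p>With modulo hashing, changing the number of replicas remaps most prefixes; a production router would more likely use consistent hashing or track which replica actually holds which cached blocks.</p>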

<p>The eleventh module addresses in detail the performance front of production vLLM deployment. Tuning parameter details: --max-num-seqs (concurrent request limit, by GPU memory, typically 256-1024), --max-num-batched-tokens (per-step total token budget, 8K-32K), --gpu-memory-utilization (default 0.9, 0.85 for safety margin), --enable-prefix-caching (system prompt cache hit 30-70%), --enable-chunked-prefill (long prompt + streaming friendly), --num-scheduler-steps (multi-step scheduling, reduces batch-decision overhead). Benchmark: vllm-bench serving (online) + offline benchmark tools; realistic workload simulation with ShareGPT + Anthropic + Alpaca trace datasets. TTFT vs throughput vs cost Pareto frontier analysis for production decision-making. Cost optimization: spot instance + scale-to-zero + multi-region failover; reasoning-model serving specific tuning (long-context KV cache). Error diagnosis: OOM (KV cache + activation-memory diagnosis), request stalls (queue-depth analysis), slow tokenization (HF tokenizer profiling).</p>
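<p>Once benchmark runs have produced TTFT, throughput, and cost figures for several configurations, the Pareto frontier is simply the set of configurations that no other configuration beats on every axis. The sketch below computes it for a handful of made-up data points; the configuration names, the field names, and all numbers are hypothetical and do not come from any real benchmark.</p>

```python
def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (
        a["ttft_ms"] <= b["ttft_ms"]
        and a["tok_s"] >= b["tok_s"]
        and a["cost"] <= b["cost"]
    )
    strictly_better = (
        a["ttft_ms"] < b["ttft_ms"]
        or a["tok_s"] > b["tok_s"]
        or a["cost"] < b["cost"]
    )
    return no_worse and strictly_better


def pareto_frontier(configs):
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(other, c) for other in configs)]


# Made-up results: lower TTFT, higher tokens/s, and lower cost are better.
results = [
    {"name": "A", "ttft_ms": 200, "tok_s": 2000, "cost": 1.0},
    {"name": "B", "ttft_ms": 350, "tok_s": 3500, "cost": 0.6},
    {"name": "C", "ttft_ms": 400, "tok_s": 3000, "cost": 0.7},
    {"name": "D", "ttft_ms": 150, "tok_s": 1200, "cost": 1.6},
]
frontier = [c["name"] for c in pareto_frontier(results)]
print(frontier)
```

<p>Here configuration C drops out because B is better on all three axes, while A, B, and D each remain the best choice for some weighting of latency, throughput, and cost.</p>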

<p>In the capstone module, each participant designs an end-to-end vLLM serving stack tailored to their own production scenario: target model (Llama 3.3 70B Instruct / Qwen3 32B / Gemma 3 27B / DeepSeek V3 671B MoE / their CPT-trained custom model), hardware target (single H100 80GB / dual H200 / 8x H100 / 16x B200 cluster / heterogeneous disaggregated B200 prefill + H100 decode), quantization stack selection (AWQ INT4 + FP8 KV cache or FP8 weight + FP8 KV cache or FP4 NVFP4), parallelism strategy (TP 8 single-node or TP 8 × PP 2 multi-node or TP 8 + EP for MoE), serving topology (single-node monolithic or multi-node Ray Serve or disaggregated NVIDIA Dynamo), Kubernetes Helm chart + NVIDIA GPU operator + autoscaling (HPA + KEDA), observability stack (Prometheus + Grafana + Langfuse + Phoenix), benchmark + Pareto frontier + cost analysis, and a 90-day production deployment + scaling roadmap.</p>

<p>By the end of the training, participants have the technical competence to understand vLLM's 5 main components at the source-code level; build the PagedAttention algorithm via the OS-paging analogy; apply continuous batching + chunked prefill + speculative decoding + tensor/pipeline parallelism in production; integrate the AWQ + GPTQ + FP8 + FP4 quantization stack with Marlin + Machete kernels; add a custom model + custom layer via ModelRegistry; build an NVIDIA Dynamo + Mooncake disaggregated serving architecture; deploy a Kubernetes + Helm + Ray Serve + Prometheus + Langfuse production stack; and optimize production performance with tuning parameters + benchmark + cost analysis. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.</p>