vLLM vs SGLang vs TensorRT-LLM — which should I use in production?

General recommendation as of 2026: (1) Default production choice → vLLM (largest community, broadest model support, multi-vendor hardware, open source). (2) Reasoning models, complex prompts, or heavy prefix reuse that benefits from RadixAttention → SGLang (prefix-aware optimization, structured generation). (3) NVIDIA-only, maximum throughput, early model access → TensorRT-LLM. (4) Simple setup, Hugging Face native → TGI. (5) Apple Silicon / CPU edge → llama.cpp. vLLM is the safest and most flexible choice in most Turkish enterprise scenarios. Module 1.2 covers this in detail.

What are the breaking changes in vLLM v1 (March 2025)? Do I need to migrate from v0?

The v1 redesign brought important changes: (1) a transition from a synchronous to an asynchronous architecture, with roughly 1.7x throughput; (2) decoupled scheduler and worker processes; (3) changes to some flags and the API surface (e.g., --enable-prefix-caching is on by default, --use-v2-block-manager was removed); (4) some v0 experimental features, such as multi-step scheduling, were reworked in v1, so check the release notes for their current status. Migration: v1 was first opt-in via VLLM_USE_V1=1 and then became the default. Production recommendation: run a full regression test in staging before moving to v1; it is drop-in compatible in most use cases, but watch for edge cases. Module 2.3 covers this in detail.

Can I serve a 70B model on a single H100 (80GB)?

Not in FP16: 70B parameters × 2 bytes = 140GB of weights, more than the 80GB available. Options: (1) AWQ-INT4 quantization → about 35GB of weights + KV cache; feasible on a single H100, but with a small context window (~8K); (2) FP8 quantization → about 70GB of weights + KV cache, which leaves very little margin; (3) disaggregated serving (prefill on 8x H100, decode on 4x H100), which is no longer a single-GPU or single-node setup; (4) recommended: TP 2 on dual H100 (FP16, 32K context) or a single H100 + AWQ (8K context). Module 6 and the Module 12 capstone build a decision matrix tailored to your scenario.
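The arithmetic behind these options can be sketched in a few lines. This is a back-of-envelope estimate, not a measurement: the layer count, KV-head count, and head dimension are assumptions for a Llama-style 70B model with grouped-query attention, and activation memory and runtime buffers are ignored.

```python
# Back-of-envelope memory budget for a dense 70B model on one 80GB GPU.
# Architecture numbers are ASSUMPTIONS for a Llama-style 70B model (GQA).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128


def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of K and V stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_token_budget(gpu_gb: float, utilization: float, weights_gb: float,
                    per_token_bytes: int) -> int:
    """Tokens of KV cache that fit after the weights are loaded.

    Ignores activations and buffers, so real budgets are smaller.
    """
    free_bytes = (gpu_gb * utilization - weights_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))


per_token_fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)

for label, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    w = weight_gb(70, bits)
    tokens = kv_token_budget(80, 0.90, w, per_token_fp16)
    print(f"{label}: weights {w:.0f} GB, KV budget {tokens} tokens")
```

The token budget is shared by every concurrent sequence, so the usable context per request is the budget divided by the number of requests in flight.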

How do I deploy DeepSeek V3 671B MoE in vLLM?

DeepSeek V3 (671B total, 37B active per token) requires a combination of FP8 quantization, Expert Parallelism (EP), and TP. Minimum: 16x H100 (1.2TB+ VRAM) or 8x B200; cost-optimal: 16x H100 FP8 + EP 8 + TP 2. Multi-node: Ray Serve + InfiniBand. vLLM 0.7+ supports DeepSeek V3 natively, with the MLA (Multi-head Latent Attention) custom layer and MoE routing built in. Workflow: download the deepseek-ai/DeepSeek-V3 weights from Hugging Face, then launch with vllm serve. Modules 6 and 8 cover this in detail.

NVIDIA Dynamo vs Mooncake disaggregated serving — which should I choose?

As of March 2025: (1) NVIDIA Dynamo is a production-grade, NVIDIA-maintained platform optimized for NVIDIA hardware, with vLLM / SGLang / TensorRT-LLM backend support; (2) Mooncake (Moonshot AI, 2024) sits closer to research: its KV cache pool design is novel, but there is less documentation for production deployment. Practical recommendation: NVIDIA GPU cluster at production scale → Dynamo; research or custom prototype → fork Mooncake. Module 9 provides a comparison matrix for both.

How do I integrate my own Turkish CPT-trained model into vLLM?

Two paths: (1) Standard architecture (Llama 3.3 / Qwen3 / Gemma 3 base) → export to Hugging Face safetensors and vllm serve works directly. (2) Custom architecture (modified attention, custom MoE) → register a new model class via vllm.ModelRegistry.register_model(), implement the CausalLM interface, write a PagedAttention-compatible attention layer, and define the weight-loading mapping. Most Turkish CPT models (Cosmos / Trendyol AI / KUIS-AI) are Llama-based, so path 1 is sufficient. Path 2 is needed when the change goes beyond what vLLM's built-in implementations cover, such as a custom MLA, GQA, or RoPE-scaling variant. Module 8 shows the practical details.

I'm getting OOM (out of memory) errors in production — what should I do?

Systematic troubleshooting: (1) reduce --gpu-memory-utilization (0.9 → 0.85 → 0.80); (2) reduce --max-num-seqs (concurrent request limit); (3) reduce --max-model-len (shrink the context window); (4) enable quantization (AWQ-INT4 weights + FP8 KV cache); (5) disable prefix caching with --no-enable-prefix-caching (may relieve cache pressure but reduces throughput); (6) increase --swap-space (CPU offload); (7) increase TP (distribute memory across GPUs). A breakdown of model weights, KV cache, activation memory, and buffers is critical for finding the actual cause. Module 11.3 provides detailed OOM diagnosis.

Does speculative decoding really yield 2-4x throughput? In which scenarios does it work?

It depends on the scenario: (1) long generation (reasoning-model thinking traces of 4K-32K tokens) → 3-5x speedup (highest gain); (2) code generation (deterministic, predictable tokens) → 2-3x speedup; (3) conversational chat (short responses) → 1.3-1.8x speedup; (4) high-temperature sampling → lower gain (the acceptance rate drops). EAGLE-3 (Li et al., 2025) yields the highest gain (4-5x in R1 serving); MEDUSA is moderate; the ngram speculator is simple but gives a 30-50% boost in coding and repetitive tasks. Module 5.3 presents the Pareto frontier for each.

What are the minimum hardware and cost for a Kubernetes vLLM deployment?

Minimum production scenarios (2026 cloud pricing): (1) 7B model FP16: single L40S (48GB), ~$1.5/hour (RunPod); (2) 13B model AWQ: single L40S; (3) 70B model AWQ-INT4 + TP 2: 2x A100 (40GB), ~$3/hour; (4) 70B model FP16 + TP 4: 4x A100 (80GB), ~$6/hour; (5) DeepSeek V3 671B FP8 + EP 8 + TP 2: 16x H100, ~$50/hour. Kubernetes overhead (GPU operator + monitoring stack): about +5%. Scale-to-zero can reduce idle cost. Modules 10 and 11 cover cost optimization in detail.
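To turn the hourly figures above into a monthly budget, a rough sketch follows. The 730 hours per month, the treatment of the 5% Kubernetes overhead as a cost multiplier, and the `active_fraction` parameter (a stand-in for scale-to-zero savings) are all assumptions made for illustration.

```python
# Rough monthly cost from the hourly GPU prices quoted above.
HOURS_PER_MONTH = 730  # ASSUMPTION: average number of hours in a month


def monthly_cost(hourly_usd: float, active_fraction: float = 1.0,
                 k8s_overhead: float = 0.05) -> float:
    """Monthly USD cost.

    active_fraction: share of the month the GPUs are actually running
    (hypothetical knob modelling scale-to-zero).
    k8s_overhead: extra cost for GPU operator + monitoring, applied as a
    simple multiplier (an interpretation of the +5% figure).
    """
    return hourly_usd * HOURS_PER_MONTH * active_fraction * (1 + k8s_overhead)


scenarios = {
    "7B FP16, 1x L40S": 1.5,
    "70B AWQ-INT4, 2x A100 40GB": 3.0,
    "70B FP16, 4x A100 80GB": 6.0,
    "DeepSeek V3 FP8, 16x H100": 50.0,
}

for name, rate in scenarios.items():
    always_on = monthly_cost(rate)
    half_idle = monthly_cost(rate, active_fraction=0.5)
    print(f"{name}: ${always_on:,.0f}/month always-on, "
          f"${half_idle:,.0f}/month at 50% active")
```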

How should I tune vLLM for reasoning-model (R1, o3) serving?

Reasoning models produce thinking traces of 16K-128K tokens, which sets them apart from classical chat. Tuning: (1) --max-model-len 65536 or more; (2) --kv-cache-dtype fp8 (KV cache memory dominates on long traces); (3) --enable-chunked-prefill (for long prompts); (4) increase --max-num-batched-tokens (8192+); (5) reduce --max-num-seqs (each sequence uses a large KV cache); (6) enable speculative decoding (EAGLE-3 with a reasoning model: 4-5x speedup); (7) reuse the system prompt and reasoning instructions via prefix caching. Module 11 covers reasoning-specific tuning in detail.
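As a starting point, the flags listed above can be assembled into a launch command. The sketch below only builds and prints the argument list; the model name is a hypothetical placeholder and the default values are illustrative, not tuned recommendations. Speculative decoding is left out because its configuration format differs between vLLM versions.

```python
import shlex
from typing import List


def reasoning_serve_args(model: str,
                         max_model_len: int = 65536,
                         max_num_seqs: int = 32,
                         max_num_batched_tokens: int = 8192) -> List[str]:
    """Build a `vllm serve` argument list for long-trace reasoning workloads."""
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),        # room for long traces
        "--kv-cache-dtype", "fp8",                    # halve KV cache memory
        "--enable-chunked-prefill",                   # long prompts in chunks
        "--enable-prefix-caching",                    # reuse shared system prompt
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--max-num-seqs", str(max_num_seqs),          # fewer, larger sequences
    ]


# "my-org/my-reasoning-model" is a hypothetical model identifier.
args = reasoning_serve_args("my-org/my-reasoning-model")
print(shlex.join(args))
```

Passing the list to `subprocess.run(args)` would start the server; printing it first makes the configuration easy to review and version.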

What concrete artifacts will I have at the end of the training?

The capstone project produces the following artifacts: (1) a vLLM serving stack tailored to your scenario (Python + Kubernetes Helm chart + Docker Compose); (2) a hardware + model + quantization + parallelism decision document; (3) ModelRegistry registration code for custom model integration (if applicable); (4) Prometheus + Grafana vLLM dashboard config; (5) Langfuse + Phoenix observability integration; (6) HPA + KEDA autoscaling YAML; (7) a vllm-bench benchmark report + Pareto frontier analysis; (8) a 90-day production deployment + scaling roadmap + cost analysis.

Can the training be customized for our enterprise team?

Yes. Beyond the standard 3-day program, we offer customized private-classroom versions for enterprise clients. Module weights and capstone scenarios are tailored to your team's existing hardware infrastructure (H100 cluster, B200 cluster, AMD MI300X, AWS Trainium 2), target models (Llama / Qwen / DeepSeek / your own CPT model), production SLA goals (TTFT, throughput, cost), compliance requirements (KVKK self-hosted, EU AI Act), and existing inference stack (TGI / Ray Serve / Triton Inference Server history).

About this training

A 3-day advanced training, delivered in Turkish, that covers vLLM, the production LLM inference-engine standard, end to end: its internal architecture, the PagedAttention algorithm, continuous-batching mechanics, speculative decoding (EAGLE-3 + MEDUSA), tensor + pipeline + expert parallelism, AWQ + GPTQ + FP8 + FP4 quantization integration, custom-model integration, and NVIDIA Dynamo disaggregated serving. Includes a Kubernetes + Ray Serve + Prometheus + Langfuse production stack.

This training is designed for:
ML Engineers and Inference Engineers deploying inference engines for enterprise LLM products
ML Platform engineers production-serving DeepSeek V3 / Llama 4 / Qwen3 / Gemma 3
Senior backend developers who need to optimize reasoning-model (o3, R1) long-context serving cost
Inference researchers working on NVIDIA Dynamo + disaggregated serving
Teams who want to integrate their own CPT model (Turkish LLM, domain-specific) into vLLM
SREs and Platform Engineers managing production GPU clusters (H100 / B200)

Why this course matters:
The only advanced program in Turkey that addresses vLLM internals + custom backend discipline end to end + production-grade in Turkish.
Full mathematical construction of PagedAttention (Kwon 2023) + continuous batching (Orca 2022) + speculative decoding (EAGLE-3 + MEDUSA).
Hands-on DeepSeek V3 671B MoE production deployment with Tensor + Pipeline + Expert Parallelism.
AWQ + GPTQ + FP8 + FP4 quantization integration + Marlin/Machete kernel optimization.
Custom model + PagedAttention-compatible layer writing with ModelRegistry.
NVIDIA Dynamo (March 2025) + Mooncake + DistServe disaggregated-serving frontier coverage.
Building a Kubernetes + Ray Serve + Prometheus + Langfuse production stack.
Through the capstone project, equips the participant with a vLLM serving stack applicable on their own hardware target.

Learning outcomes by the end of the programme:
Understand vLLM's 5 main components (LLMEngine, Scheduler, Worker, BlockManager, Sampler) at source-code level.
Build the PagedAttention algorithm at the mathematical level.
Use continuous batching + chunked prefill + scheduling policy in production.
Perform speculative decoding (EAGLE-3, MEDUSA, ngram) vLLM integration.
Set up 70B-671B model multi-GPU + multi-node serving with TP + PP + EP.
Skillfully use the AWQ + GPTQ + FP8 + FP4 + KV cache quantization stack.
Write custom model + custom layer + PagedAttention-compatible attention with ModelRegistry.
Deploy NVIDIA Dynamo + Mooncake disaggregated serving architecture.
Build a Kubernetes + Helm + Ray Serve + Prometheus + Langfuse production stack.
Optimize performance with tuning parameters + vllm-bench + Pareto frontier.

Prerequisites and recommended background:
Active Python experience (intermediate to advanced), basic use of PyTorch + CUDA
At least conceptual experience with LLM inference + serving (vLLM / TGI / TensorRT-LLM)
Docker + Kubernetes + Helm chart experience (for production deployment)
Basic GPU + CUDA knowledge (CUDA kernel writing not in training, only usage)
Basic knowledge of linear algebra + transformer architecture
RunPod / Lambda Labs / AWS H100 access before the training (for the capstone)

The only production-grade advanced program in Turkey that addresses vLLM internals + custom backend discipline end to end in Turkish
Mathematical construction of the PagedAttention algorithm via the OS-paging analogy
Deep dive into continuous batching + chunked prefill + iteration-level scheduling
Speculative decoding: EAGLE-3 + MEDUSA + draft model + ngram speculator integration
Production deployment of 70B-671B models with Tensor + Pipeline + Expert Parallelism
AWQ + GPTQ + FP8 + FP4 quantization integration + Marlin/Machete kernel optimization
Custom model adding: ModelRegistry + writing PagedAttention-compatible layers
NVIDIA Dynamo + Mooncake disaggregated serving (March 2025 frontier) implementation


Advanced Level · 3 Days

vLLM Internals and Custom Backend Engineering Training (PagedAttention + Continuous Batching + Speculative Decoding + NVIDIA Dynamo)

About This Course

This training is designed to teach, end to end and in Turkish, the internal architecture, algorithmic foundations, and production deployment discipline of vLLM, which has become the de facto inference-engine standard of the 2024-2026 period. The journey began with the PagedAttention paper that UC Berkeley's Sky Computing Lab presented at SOSP 2023. Since then vLLM has grown into a production-grade platform with 30K+ GitHub stars, incubation under the LF AI & Data Foundation (2025), the vLLM v1 redesign (March 2025), the NVIDIA Dynamo collaboration (March 2025), and the Neural Magic + Anyscale + Red Hat ecosystem. In Turkey, training that covers this discipline from the source-code level through to production Kubernetes deployment is virtually nonexistent: existing content either stops at short vLLM tutorials or at demos of the OpenAI-compatible server. This program is designed to fill that gap as Turkey's most comprehensive production-grade vLLM internals reference training.

The program's strategic backbone is the first module, which explains vLLM's birth and rise, why it became the inference-engine standard, and the 2026 ecosystem landscape. Milestones: the Kwon 2023 PagedAttention paper (SOSP) from UC Berkeley Sky Lab; the 2024 spread into production; LF AI & Data Foundation incubation (2025); the vLLM v1 redesign (March 2025: sync-to-async architecture transition, 1.7x throughput); the NVIDIA Dynamo collaboration; and Red Hat's acquisition of Neural Magic (2024). Inference engine comparison: vLLM vs SGLang (Stanford + UC Berkeley, RadixAttention), vs TensorRT-LLM (NVIDIA-only, fastest NVIDIA-native), vs Hugging Face TGI (simple + production-ready), vs LMDeploy (Shanghai AI Lab, TurboMind kernel). 2026 inference landscape: multi-vendor inference (NVIDIA H100/B200, AMD MI300X/MI355X, AWS Trainium 2, Apple Silicon), the unique requirements of reasoning-model + agent + long-context serving, and open-source vs commercial inference (Anyscale, Together AI, Fireworks).

The second module covers vLLM's internal architectural components at the source-code level. The five main components are LLMEngine (central coordinator; manages the request lifecycle), Scheduler (request queueing + scheduling policy; running/waiting/swapped queues), Worker (GPU model executor + ModelRunner), BlockManager (PagedAttention block allocation + physical/logical block table), and Sampler (greedy / top-p / temperature / typical-p). The v0 → v1 redesign (March 2025) moved from a synchronous architecture to a decoupled asynchronous one with separate scheduler and worker processes, bringing a 1.7x throughput improvement along with breaking changes. Python entry points: vllm.LLM (sync API, offline batch inference), vllm.AsyncLLMEngine (async server use), and the OpenAI-compatible HTTP server (vllm serve). Understanding this architecture is a prerequisite for writing a custom backend.

The third module mathematically builds PagedAttention, vLLM's core innovation. In classical serving, the KV cache requires a contiguous memory allocation per sequence; variable sequence lengths cause internal fragmentation (allocated but unused slots) and external fragmentation (memory holes), which together waste 60-80% of KV cache memory. PagedAttention adapts OS virtual-memory paging to LLM serving: the KV cache is divided into fixed-size blocks (default 16 tokens/block), each sequence keeps its own logical block table, and physical GPU memory is allocated dynamically from a free-block pool. Result: memory waste drops to under 4% and throughput increases 2-4x. Advanced features: prefix caching (shared blocks with reference counting, so requests with the same system prompt share KV blocks), memory sharing for beam search + parallel sampling, and block swap (GPU → CPU offloading).
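The block-table idea can be illustrated with a toy allocator. This is a simplified sketch of the bookkeeping only, not vLLM's BlockManager: the class and function names are made up for illustration, no tensors are involved, and only full blocks are shared so that copy-on-write is not needed.

```python
class BlockAllocator:
    """Pool of fixed-size physical KV blocks with reference counting."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.ref_count = {}                   # block id -> number of users

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        self.ref_count[block] += 1            # another sequence reuses it
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:        # last user gone: back to pool
            del self.ref_count[block]
            self.free.append(block)


class Sequence:
    """Keeps a logical block table: index i -> physical block id."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_tokens(self, n: int) -> None:
        for _ in range(n):
            # A new physical block is needed only when the last one is full.
            if self.num_tokens % self.allocator.block_size == 0:
                self.block_table.append(self.allocator.allocate())
            self.num_tokens += 1

    def wasted_slots(self) -> int:
        return len(self.block_table) * self.allocator.block_size - self.num_tokens

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table, self.num_tokens = [], 0


def fork_with_shared_prefix(parent: Sequence) -> Sequence:
    """Child shares the parent's full blocks (e.g. a common system prompt)."""
    child = Sequence(parent.allocator)
    full = parent.num_tokens // parent.allocator.block_size
    child.block_table = [parent.allocator.share(b) for b in parent.block_table[:full]]
    child.num_tokens = full * parent.allocator.block_size
    return child


allocator = BlockAllocator(num_blocks=8)
a = Sequence(allocator)
a.append_tokens(40)                 # needs 3 blocks of 16 slots
b = fork_with_shared_prefix(a)      # shares the 2 full blocks
b.append_tokens(5)                  # gets 1 private block
print(a.block_table, b.block_table, a.wasted_slots(), len(allocator.free))
```

Waste is bounded by the unused slots in each sequence's last block, which is why fragmentation stays small regardless of sequence length.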

The fourth module addresses continuous batching, introduced in the 2022 Orca paper (OSDI) and made production-grade in vLLM. It replaces classical static batching, where the server waits to assemble a group of requests and the GPU then idles on finished slots until the longest sequence completes. Iteration-level scheduling: the batch composition is updated at every token-generation step; completed requests leave the batch and waiting requests enter. GPU utilization is 30-50% with static batching and 90%+ with continuous batching. Chunked prefill: prefill (compute-bound, matmul-heavy) and decode (memory-bound, KV cache reads) have different resource patterns, so long prompts are split into chunks and processed alongside decode requests; this mixed prefill-decode batching keeps the GPU well utilized. Scheduling policy: FCFS + priority + fairness; preemption (swap the KV cache to CPU, or recompute).
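The difference between the two batching styles shows up even in a toy simulation. The sketch below counts decode steps only and makes simplifying assumptions: every request is already waiting at time zero, prefill is ignored, and each running request produces exactly one token per step. The function names and the workload are illustrative.

```python
from collections import deque
from typing import List


def static_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """Iteration-level scheduling: refill free slots before every step."""
    waiting = deque(output_lengths)
    running: List[int] = []          # remaining tokens per running request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token;
        # requests with no tokens left leave the batch.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


lengths = [2, 8, 2, 8]               # hypothetical output lengths in tokens
batch = 2
total_tokens = sum(lengths)
for name, fn in [("static", static_batching_steps),
                 ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, batch)
    utilization = total_tokens / (steps * batch)
    print(f"{name}: {steps} steps, slot utilization {utilization:.0%}")
```

With mixed short and long outputs, the static scheduler leaves slots empty while waiting for the long requests, whereas the continuous scheduler hands those slots to waiting requests immediately.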

The fifth module covers speculative decoding's integration with vLLM in detail. Speculative decoding (Leviathan 2023) is a two-stage scheme: a small draft model proposes multiple tokens and the large target model verifies them in parallel; the speedup follows from the acceptance rate and the relative cost of the draft model. Modern variants: EAGLE-3 (Li et al., 2025; feature-level draft + tree attention, 4-5x speedup), MEDUSA (Princeton 2024; multi-head self-speculation, no draft model needed), Lookahead decoding (Fu 2024; ngram + n-token prediction). In vLLM: the --speculative-config CLI parameter, the ngram speculator, and draft-model orchestration; choosing between tree-based and sequential speculative decoding. For reasoning-model (o3, DeepSeek R1) serving with 16K-128K-token thinking traces, speculative decoding provides a 40-60% cost reduction, a critical production advantage.
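The relationship between acceptance rate and speedup can be made concrete with the closed-form estimate from the Leviathan 2023 analysis. The sketch assumes each drafted token is accepted independently with probability alpha, that gamma tokens are drafted per step, and that one draft forward pass costs a fraction `draft_cost` of a target forward pass; the example values are illustrative, not benchmark results.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification step.

    alpha: per-token acceptance probability (assumed i.i.d.)
    gamma: number of tokens drafted per step
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding.

    draft_cost: cost of one draft pass relative to one target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost + 1)


# Illustrative settings: higher acceptance (e.g. predictable code) vs
# lower acceptance (e.g. high-temperature sampling).
for alpha in (0.5, 0.7, 0.8, 0.9):
    s = expected_speedup(alpha, gamma=4, draft_cost=0.05)
    print(f"alpha={alpha:.1f}: expected speedup {s:.2f}x")
```

The formula also shows why drafting more tokens is not always better: the numerator saturates as gamma grows, while the draft cost in the denominator keeps increasing.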

The sixth module addresses multi-GPU + multi-node distributed serving in scenarios where large models (70B+) don't fit on a single GPU. Tensor Parallelism (TP, Megatron-LM 2019 logic): intra-layer attention + MLP split, all-reduce communication; --tensor-parallel-size 8 is ideal for single-node 8x H100. Pipeline Parallelism (PP, GPipe + 1F1B): inter-layer split, micro-batch pipeline, micro-batch size tuning critical; --pipeline-parallel-size 2 for multi-node. Expert Parallelism (EP): expert distribution for MoE DeepSeek V3 671B (37B active). Data Parallelism (DP): replication, scale-out. Multi-node orchestration with vLLM + Ray Serve; NCCL all-reduce + ring all-reduce; NVLink 5 (B200) + NVSwitch + InfiniBand 400Gb topology; Blackwell GB200 NVL72 rack architecture (72-GPU coherent fabric). 16x H100 minimum for DeepSeek V3 671B (FP8) deployment.

The seventh module covers in detail vLLM's quantization support. AWQ (Lin 2023 + Marlin kernel Neural Magic 2024 optimization): 4-bit weight serving 2-3x throughput; --quantization awq_marlin parameter. GPTQ (Frantar 2022 + GPTQModel kernel): act-order desc_act + group_size 128. FP8 (Hopper E4M3 native Tensor Core + Blackwell NVFP4): hardware-native low precision; --quantization fp8 + fp8_e4m3. INT8 W8A8 (SmoothQuant outlier migration). KV cache quantization: --kv-cache-dtype fp8 critical for reasoning model long-trace; KIVI 2-bit experimental support (vLLM 0.7+). Quality + throughput + memory triangle Pareto frontier: concrete benchmark numbers for Llama 3.3 70B FP16 (140GB) vs AWQ-INT4 (35GB) vs FP8 (70GB). Production decision matrix: quality regression budget + cost target.

The eighth module covers in detail how to integrate a new model architecture that is not built into vLLM. The vllm.ModelRegistry API and its register_model() method; CausalLM interface implementation (forward() + sample() + get_input_embeddings(); the exact method set varies by vLLM version); the structure of the vLLM source tree (vllm/model_executor/models/) and its existing model implementations (Llama, Qwen, Mistral, Gemma, DeepSeek, MoE variants). Weight loading: Hugging Face safetensors → vLLM weight-tensor mapping; sharded weight loading (multi-file safetensors); quantized weight (AWQ / GPTQ / FP8) loading. Custom layers: writing PagedAttention-compatible attention layers; rotary embedding (RoPE + YaRN scaling) + GroupedQueryAttention (GQA) + Multi-head Latent Attention (MLA, DeepSeek V3); MoE expert routing + top-k expert selection layer. Adding a Turkish CPT-trained custom Llama 4 / Qwen3 model to vLLM is shown in practice.

The ninth module addresses disaggregated serving, the hottest inference-engineering frontier of 2024-2026. Prefill (compute-bound, matmul-heavy; B200/H200 ideal) and decode (memory-bound, KV cache reads; H100 or a consumer GPU sufficient) run on separate GPU pools, which enables optimal use of heterogeneous hardware (B200 prefill + H100 decode). KV cache transfer from the prefill node to the decode node: GPU-to-GPU NCCL + RDMA + GPUDirect Storage; InfiniBand 400Gb + NVLink Switch System topology. NVIDIA Dynamo (released March 2025) is a production-grade disaggregated inference platform with vLLM + SGLang + TensorRT-LLM backend support and a smart router. Academic precursors: Mooncake (Moonshot AI 2024, with a KV cache pool) and DistServe (UCSD 2024). The trade-off is latency overhead vs throughput gain: typical scenarios yield a 2-4x throughput improvement and a 30-50% cost reduction.
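The latency overhead of moving the KV cache between pools can be estimated before any deployment. The sketch below is a lower-bound calculation under stated assumptions: the per-token KV size uses a Llama-style 70B layout (80 layers, 8 KV heads, head dimension 128, FP16), the link is the 400 Gb/s InfiniBand mentioned above, and the `efficiency` factor is a hypothetical allowance for protocol overhead.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of K and V stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_transfer_seconds(prompt_tokens: int, per_token_bytes: int,
                        link_gbit_per_s: float, efficiency: float = 1.0) -> float:
    """Time to ship a prompt's KV cache from a prefill node to a decode node."""
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8 * efficiency
    return prompt_tokens * per_token_bytes / link_bytes_per_s


# ASSUMPTION: Llama-style 70B model with GQA, FP16 KV cache.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

for prompt_tokens in (2_048, 8_192, 32_768):
    t = kv_transfer_seconds(prompt_tokens, per_token, link_gbit_per_s=400,
                            efficiency=0.8)  # hypothetical 80% link efficiency
    size_gb = prompt_tokens * per_token / 1e9
    print(f"{prompt_tokens} tokens: {size_gb:.2f} GB KV, ~{t * 1000:.0f} ms transfer")
```

Comparing this figure with the prefill time for the same prompt indicates whether the transfer is a negligible or a dominant part of time to first token.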

The tenth module covers end to end the discipline of taking vLLM to production. Kubernetes deployment: vLLM Helm chart + NVIDIA GPU operator + nvidia-device-plugin + nvidia-container-toolkit; Deployment + Service + Ingress + HPA YAML; PVC for model-weight cache (ReadWriteMany NFS / S3 mount). NVIDIA Dynamo platform deployment. Monitoring: Prometheus metrics endpoint (vllm:request_latency_seconds histogram, vllm:gpu_cache_usage_perc gauge, vllm:request_prompt_tokens, vllm:request_generation_tokens, vllm:e2e_request_latency_seconds); Grafana vLLM dashboard; Langfuse + Phoenix LLM-observability integration + OpenTelemetry GenAI semantic conventions. Autoscaling: HPA (custom metric: GPU utilization + queue depth), KEDA event-driven scaling, Karpenter scale-to-zero + GPU spot instance. Load balancing: round-robin (default), prefix-aware routing (SGLang inspired — same-prefix requests to the same replica), session-sticky routing.
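Prefix-aware routing, mentioned above as a load-balancing option, can be sketched as hashing the start of the prompt to a replica. This is a deliberately minimal illustration: the function name, the character-based prefix, and the replica names are hypothetical, and production routers typically work on token blocks and also take replica load into account.

```python
import hashlib
from typing import List


def pick_replica(prompt: str, replicas: List[str], prefix_chars: int = 256) -> str:
    """Route requests that share a prompt prefix to the same replica.

    Hashing only the first `prefix_chars` characters means requests with
    the same system prompt land on the replica that already holds the
    matching prefix-cache blocks.
    """
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["vllm-0", "vllm-1", "vllm-2"]          # hypothetical replica names
system_prompt = "You are a helpful assistant. " * 20

first = pick_replica(system_prompt + "Summarize this contract.", replicas)
second = pick_replica(system_prompt + "Translate this email.", replicas)
print(first, second)  # both requests go to the same replica
```

Plain modulo hashing reshuffles most prefixes when the replica count changes; consistent hashing is the usual remedy when autoscaling is enabled.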

The eleventh module addresses in detail the performance front of production vLLM deployment. Tuning parameter details: --max-num-seqs (concurrent request limit, by GPU memory, typically 256-1024), --max-num-batched-tokens (per-step total token budget, 8K-32K), --gpu-memory-utilization (default 0.9, 0.85 for safety margin), --enable-prefix-caching (system prompt cache hit 30-70%), --enable-chunked-prefill (long prompt + streaming friendly), --num-scheduler-steps (multi-step scheduling, reduces batch-decision overhead). Benchmark: vllm-bench serving (online) + offline benchmark tools; realistic workload simulation with ShareGPT + Anthropic + Alpaca trace datasets. TTFT vs throughput vs cost Pareto frontier analysis for production decision-making. Cost optimization: spot instance + scale-to-zero + multi-region failover; reasoning-model serving specific tuning (long-context KV cache). Error diagnosis: OOM (KV cache + activation-memory diagnosis), request stalls (queue-depth analysis), slow tokenization (HF tokenizer profiling).
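The Pareto-frontier step of the benchmark analysis can be expressed compactly. The sketch below keeps every configuration that is not dominated on all three axes (lower TTFT, higher throughput, lower cost); the configuration names and numbers are hypothetical placeholders, not benchmark results.

```python
from typing import List, NamedTuple


class BenchResult(NamedTuple):
    name: str
    ttft_ms: float        # lower is better
    tokens_per_s: float   # higher is better
    usd_per_hour: float   # lower is better


def dominates(a: BenchResult, b: BenchResult) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    no_worse = (a.ttft_ms <= b.ttft_ms
                and a.tokens_per_s >= b.tokens_per_s
                and a.usd_per_hour <= b.usd_per_hour)
    strictly_better = (a.ttft_ms < b.ttft_ms
                       or a.tokens_per_s > b.tokens_per_s
                       or a.usd_per_hour < b.usd_per_hour)
    return no_worse and strictly_better


def pareto_frontier(results: List[BenchResult]) -> List[BenchResult]:
    """Configurations that no other configuration dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# HYPOTHETICAL numbers, for illustration only.
results = [
    BenchResult("tp2-awq", 180.0, 2400.0, 3.0),
    BenchResult("tp4-fp16", 120.0, 3100.0, 6.0),
    BenchResult("tp4-fp16-small-batch", 150.0, 2000.0, 6.0),
    BenchResult("tp2-awq-chunked", 160.0, 2500.0, 3.0),
]

for r in pareto_frontier(results):
    print(r.name)
```

Only frontier configurations are worth debating against the SLA and budget; everything else is strictly worse than some alternative.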

In the capstone module, each participant designs an end-to-end vLLM serving stack tailored to their own production scenario: target model (Llama 3.3 70B Instruct / Qwen3 32B / Gemma 3 27B / DeepSeek V3 671B MoE / their CPT-trained custom model), hardware target (single H100 80GB / dual H200 / 8x H100 / 16x B200 cluster / heterogeneous disaggregated B200 prefill + H100 decode), quantization stack (AWQ INT4 + FP8 KV cache, FP8 weight + FP8 KV cache, or FP4 NVFP4), parallelism strategy (TP 8 single-node, TP 8 × PP 2 multi-node, or TP 8 + EP for MoE), serving topology (single-node monolithic, multi-node Ray Serve, or disaggregated NVIDIA Dynamo), Kubernetes Helm chart + NVIDIA GPU operator + autoscaling (HPA + KEDA), observability stack (Prometheus + Grafana + Langfuse + Phoenix), benchmark + Pareto frontier + cost analysis, and a 90-day production deployment + scaling roadmap.

By the end of the training, participants have the technical competence to understand vLLM's 5 main components at the source-code level; build the PagedAttention algorithm via the OS-paging analogy; apply continuous batching + chunked prefill + speculative decoding + tensor/pipeline parallelism in production; integrate the AWQ + GPTQ + FP8 + FP4 quantization stack with Marlin + Machete kernels; add a custom model + custom layer via ModelRegistry; build an NVIDIA Dynamo + Mooncake disaggregated serving architecture; deploy a Kubernetes + Helm + Ray Serve + Prometheus + Langfuse production stack; and optimize production performance with tuning parameters + benchmark + cost analysis.

The training consists of 3 days, 12 modules, and over 100 hands-on lessons.

Course Curriculum

104 Lessons

Module 1: Strategic Introduction to the vLLM Era - Inference Engine Race from 2023 to 2026 (9 Lessons)

Module 2: vLLM Architecture Anatomy - LLMEngine, Scheduler, Worker, and BlockManager (9 Lessons)

Module 3: PagedAttention Deep Dive - Kwon 2023 Algorithm (9 Lessons)

Module 4: Continuous Batching - From Orca to vLLM's Iteration-Level Scheduling (9 Lessons)

Module 5: Speculative Decoding with vLLM - EAGLE-3, MEDUSA, and Draft Model (9 Lessons)

Module 6: Tensor Parallelism, Pipeline Parallelism, and Multi-GPU/Multi-Node Serving (9 Lessons)

Module 7: Quantization Integration - AWQ, GPTQ, FP8 KV Cache, and Marlin Kernel (9 Lessons)

Module 8: Custom Model Adding - New Architecture Integration (9 Lessons)

Module 9: Disaggregated Serving - Prefill/Decode Separation and NVIDIA Dynamo (9 Lessons)

Module 10: Production Deployment - Kubernetes, NVIDIA Dynamo, and Monitoring (9 Lessons)

Module 11: Performance Tuning - Throughput, Latency, and Cost Optimization (9 Lessons)

Module 12: Capstone - Building a Production vLLM Serving Stack (5 Lessons)

Instructor

Şükrü Yusuf KAYA

AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant

Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.
