About this training
A 3-day advanced training, delivered in Turkish, that covers vLLM, the de facto standard production LLM inference engine, end to end: its internal architecture, the PagedAttention algorithm, continuous-batching mechanics, speculative decoding (EAGLE-3 + MEDUSA), tensor + pipeline + expert parallelism, AWQ + GPTQ + FP8 + FP4 quantization integration, custom-model integration, and disaggregated serving with NVIDIA Dynamo. It also covers a Kubernetes + Ray Serve + Prometheus + Langfuse production stack.
This training is designed for: ML Engineers and Inference Engineers deploying inference engines for enterprise LLM products; ML Platform engineers serving DeepSeek V3 / Llama 4 / Qwen3 / Gemma 3 in production; senior backend developers who need to optimize long-context serving cost for reasoning models (o3, R1); inference researchers working on NVIDIA Dynamo + disaggregated serving; teams who want to integrate their own CPT model (Turkish LLM, domain-specific) into vLLM; and SREs and Platform Engineers managing production GPU clusters (H100 / B200).
Prerequisites and recommended background: active Python experience (intermediate to advanced) with basic use of PyTorch + CUDA; at least conceptual experience with LLM inference + serving (vLLM / TGI / TensorRT-LLM); Docker + Kubernetes + Helm chart experience (for production deployment); basic GPU + CUDA knowledge (the training covers CUDA usage only, not kernel writing); basic knowledge of linear algebra + transformer architecture; and RunPod / Lambda Labs / AWS H100 access arranged before the training (for the capstone).
Key Takeaways
- Understand vLLM's 5 main components (LLMEngine, Scheduler, Worker, BlockManager, Sampler) at source-code level.
- Build the PagedAttention algorithm at the mathematical level.
- Use continuous batching + chunked prefill + scheduling policy in production.
- Integrate speculative decoding (EAGLE-3, MEDUSA, ngram) with vLLM.
- Set up multi-GPU + multi-node serving for 70B-671B models with TP + PP + EP.
- Use the AWQ + GPTQ + FP8 + FP4 + KV cache quantization stack effectively.
- Write custom model + custom layer + PagedAttention-compatible attention with ModelRegistry.
- Deploy NVIDIA Dynamo + Mooncake disaggregated serving architecture.
- Build a Kubernetes + Helm + Ray Serve + Prometheus + Langfuse production stack.
- Optimize performance with tuning parameters + vllm-bench + Pareto frontier.
vLLM Internals and Custom Backend Engineering Training (PagedAttention + Continuous Batching + Speculative Decoding + NVIDIA Dynamo)
About This Course
This training is designed to teach, end to end and in Turkish, the internal architecture, algorithmic foundations, and production deployment discipline of vLLM, which has become the de facto inference-engine standard of the 2024-2026 period. The journey began with the PagedAttention paper that UC Berkeley's Sky Computing Lab presented at SOSP in October 2023. Since then vLLM has grown into a production-grade platform with 30K+ GitHub stars, 2025 incubation under the LF AI & Data Foundation, the vLLM v1 redesign (March 2025), the NVIDIA Dynamo collaboration (March 2025), and the Neural Magic + Anyscale + Red Hat ecosystem. In Turkey, training that covers this discipline from source-code level to production Kubernetes deployment is virtually nonexistent: existing content either stops at short vLLM tutorials or stays at demos of the OpenAI-compatible server. This program is designed to fill that gap as Turkey's most comprehensive production-grade reference training on vLLM internals.
The program's strategic backbone is the first module, which explains vLLM's origin and rise, why it became the inference-engine standard, and the 2026 ecosystem landscape. The timeline covers the Kwon 2023 PagedAttention paper (SOSP) from UC Berkeley Sky Lab; the 2024 spread into production; the 2025 LF AI & Data Foundation incubation; the vLLM v1 redesign (March 2025: sync-to-async architecture transition, 1.7x throughput); the NVIDIA Dynamo collaboration; and Red Hat's acquisition of Neural Magic (announced in 2024). Inference engine comparison: vLLM vs SGLang (UC Berkeley + Stanford, RadixAttention), vs TensorRT-LLM (NVIDIA-only, fastest NVIDIA-native), vs Hugging Face TGI (simple + production-ready), vs LMDeploy (Shanghai AI Lab, TurboMind kernel). 2026 inference landscape: multi-vendor inference (NVIDIA H100/B200, AMD MI300X/MI355X, AWS Trainium 2, Apple Silicon), the distinct requirements of reasoning-model + agent + long-context serving, and open-source vs commercial inference (Anyscale, Together AI, Fireworks).
The second module covers vLLM's internal architectural components at the source-code level. Five main components: LLMEngine (central coordinator, manages request lifecycle), Scheduler (request queueing + scheduling policy, running/waiting/swapped queues), Worker (GPU model executor + ModelRunner), BlockManager (PagedAttention block allocation + physical/logical block table), Sampler (greedy / top-p / temperature / typical-p). v0 → v1 redesign (March 2025): transition from synchronous architecture to decoupled async; separate scheduler + worker processes; 1.7x throughput improvement + breaking changes. Python entry points: vllm.LLM (sync API, offline batch inference), vllm.AsyncLLMEngine (async server use), OpenAI-compatible HTTP server (vllm serve). Without understanding this architecture, custom backends cannot be written.
The third module mathematically builds vLLM's core innovation, PagedAttention. In classical attention serving, the KV cache requires a contiguous memory allocation per sequence; variable sequence lengths cause internal fragmentation (allocated but unused slots) and external fragmentation (memory holes), which together waste 60-80% of KV cache memory. PagedAttention adapts OS virtual-memory paging to LLM serving: the KV cache is divided into fixed-size blocks (default 16 tokens/block), each sequence keeps its own logical block table, and physical GPU memory is allocated dynamically from a free-block pool. Result: KV cache memory waste drops below 4% and throughput increases 2-4x. Advanced features: prefix caching (reference-counted shared blocks, so requests with the same system prompt share KV blocks), memory sharing for beam search + parallel sampling, and block swap (GPU → CPU offloading).
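The block-table idea described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not vLLM's BlockManager: the class `BlockPool` and the helpers `blocks_needed` and `wasted_slots` are hypothetical names introduced only for illustration, and no GPU memory is involved.

```python
BLOCK_SIZE = 16  # tokens per KV block, the default cited in the text


class BlockPool:
    """Toy free-block pool with reference counts (hypothetical, not vLLM's class)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        # Take any free physical block; blocks need not be contiguous.
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # Prefix caching: a second sequence points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block):
        # A block returns to the pool only when no sequence references it.
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


def blocks_needed(num_tokens):
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division


def wasted_slots(num_tokens):
    # Internal fragmentation is bounded by the last, partially filled block.
    return blocks_needed(num_tokens) * BLOCK_SIZE - num_tokens


pool = BlockPool(num_blocks=8)
# Sequence A holds 40 tokens, so its logical block table has 3 entries.
table_a = [pool.allocate() for _ in range(blocks_needed(40))]
# Sequence B shares A's first two full blocks (same 32-token prefix)
# and gets one private block for its own continuation.
table_b = [pool.share(b) for b in table_a[:2]] + [pool.allocate()]
```

With paging, the waste per sequence is at most `BLOCK_SIZE - 1` token slots, regardless of how long the sequence eventually grows, which is the intuition behind the low waste figure quoted above.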
The fourth module addresses continuous batching, introduced in the 2022 Orca paper (OSDI) and made production-grade in vLLM. It replaces classical static batching, where the server waits to fill a request group and the GPU then idles on finished sequences until the longest one completes. Iteration-level scheduling: the batch composition is updated at every token-generation step; completed requests leave the batch and waiting requests enter. GPU utilization is typically 30-50% with static batching and 90%+ with continuous batching. Chunked prefill: prefill (compute-bound, matmul-heavy) and decode (memory-bound, KV cache reads) have different resource profiles, so long prompts are split into chunks and processed alongside decode requests in the same batch, which keeps the GPU well utilized. Scheduling policy: FCFS + priority + fairness; preemption (KV cache swap to CPU, or recompute).
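The difference between static and iteration-level scheduling can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions: each request is reduced to a fixed number of decode steps, prefill is ignored, and the function names and the request lengths are made up for illustration. It is not how vLLM's scheduler is implemented.

```python
def static_batching(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = busy = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        steps += max(batch)   # slots of short requests idle until the longest ends
        busy += sum(batch)    # slot-steps that actually produced a token
    return steps, busy / (steps * batch_size)


def continuous_batching(lengths, batch_size):
    """Iteration-level scheduling: refill free slots before every decode step."""
    waiting = list(lengths)
    running = []
    steps = busy = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        busy += len(running)
        steps += 1
        # Requests with one remaining step finish now and free their slot.
        running = [r - 1 for r in running if r > 1]
    return steps, busy / (steps * batch_size)


# Hypothetical decode lengths (tokens to generate per request).
lengths = [2, 8, 3, 8, 1, 8, 2, 8]
static_steps, static_util = static_batching(lengths, batch_size=4)
cont_steps, cont_util = continuous_batching(lengths, batch_size=4)
print(f"static:     {static_steps} steps, slot utilization {static_util:.2f}")
print(f"continuous: {cont_steps} steps, slot utilization {cont_util:.2f}")
```

On this toy workload the continuous scheduler finishes in fewer steps with higher slot utilization; the size of the gap depends on how much request lengths vary.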
The fifth module covers in detail how speculative decoding integrates with vLLM. Speculative decoding (Leviathan 2023) is a two-stage scheme: a small draft model proposes several tokens and the large target model verifies them in a single parallel pass; the achievable speedup is determined by the acceptance rate. Modern variants: EAGLE-3 (Li et al. 2025, feature-level draft + tree attention, 4-5x speedup), MEDUSA (Princeton 2024, multi-head self-speculation, no draft model needed), Lookahead decoding (Fu 2024, ngram + n-token prediction). In vLLM: the --speculative-config CLI parameter + ngram_speculator + draft model orchestration; choosing between tree-based and sequential speculative decoding. For reasoning-model (o3, DeepSeek R1) serving with 16K-128K thinking traces, speculative decoding can provide a 40-60% cost reduction, which is a critical production advantage.
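The relationship between acceptance rate and speedup can be made concrete with the expected-value formula from the Leviathan et al. analysis. The sketch below assumes a constant per-token acceptance rate `alpha`, `gamma` drafted tokens per step, and a draft-to-target cost ratio `c`; the function names and the example values are illustrative, not measurements.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target-model pass.

    alpha: probability that a drafted token is accepted (assumed constant)
    gamma: number of tokens the draft model proposes per step
    """
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, c):
    """Walltime speedup when one draft-model call costs c target-model calls."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1)


# Hypothetical values: 80% acceptance, 4 drafted tokens, draft at 5% of target cost.
tokens = expected_tokens_per_step(alpha=0.8, gamma=4)
speedup = expected_speedup(alpha=0.8, gamma=4, c=0.05)
print(f"tokens per target pass: {tokens:.2f}, speedup: {speedup:.2f}x")
```

The formula shows why a cheap draft with a high acceptance rate matters more than a long draft: once `alpha ** (gamma + 1)` is small, extra drafted tokens add cost through `gamma * c` without adding many accepted tokens.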
The sixth module addresses multi-GPU + multi-node distributed serving in scenarios where large models (70B+) don't fit on a single GPU. Tensor Parallelism (TP, Megatron-LM 2019 logic): intra-layer attention + MLP split, all-reduce communication; --tensor-parallel-size 8 is ideal for single-node 8x H100. Pipeline Parallelism (PP, GPipe + 1F1B): inter-layer split, micro-batch pipeline, micro-batch size tuning critical; --pipeline-parallel-size 2 for multi-node. Expert Parallelism (EP): expert distribution for MoE DeepSeek V3 671B (37B active). Data Parallelism (DP): replication, scale-out. Multi-node orchestration with vLLM + Ray Serve; NCCL all-reduce + ring all-reduce; NVLink 5 (B200) + NVSwitch + InfiniBand 400Gb topology; Blackwell GB200 NVL72 rack architecture (72-GPU coherent fabric). 16x H100 minimum for DeepSeek V3 671B (FP8) deployment.
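How TP and PP divide a model across ranks can be sketched as follows. This is an illustrative layout under assumptions, not vLLM's actual partitioning code: the helper names are hypothetical, and the example dimensions (80 layers, hidden size 8192) are assumed values for a 70B-class dense model.

```python
def partition_layers(num_layers, pp_size):
    """Pipeline parallelism: split layers into contiguous stages, as evenly as possible."""
    base, extra = divmod(num_layers, pp_size)
    stages, start = [], 0
    for stage in range(pp_size):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def rank_layout(tp_size, pp_size):
    """Map each global rank to (pp_stage, tp_rank); TP ranks are contiguous per stage."""
    return {rank: divmod(rank, tp_size) for rank in range(tp_size * pp_size)}


def shard_columns(dim, tp_size, tp_rank):
    """Tensor parallelism: column range of a weight matrix held by one TP rank."""
    width = dim // tp_size
    return tp_rank * width, (tp_rank + 1) * width


# Two nodes of 8 GPUs each: TP 8 inside a node, PP 2 across nodes.
stages = partition_layers(num_layers=80, pp_size=2)
layout = rank_layout(tp_size=8, pp_size=2)
print(stages)                           # layer ranges per pipeline stage
print(layout[11])                       # (pp_stage, tp_rank) of global rank 11
print(shard_columns(8192, 8, 3))        # columns owned by TP rank 3
```

Keeping the TP ranks of one stage contiguous means the frequent all-reduce traffic stays inside a node on NVLink, while only the less frequent stage-to-stage activations cross the slower inter-node link.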
The seventh module covers in detail vLLM's quantization support. AWQ (Lin 2023 + Marlin kernel Neural Magic 2024 optimization): 4-bit weight serving 2-3x throughput; --quantization awq_marlin parameter. GPTQ (Frantar 2022 + GPTQModel kernel): act-order desc_act + group_size 128. FP8 (Hopper E4M3 native Tensor Core + Blackwell NVFP4): hardware-native low precision; --quantization fp8 + fp8_e4m3. INT8 W8A8 (SmoothQuant outlier migration). KV cache quantization: --kv-cache-dtype fp8 critical for reasoning model long-trace; KIVI 2-bit experimental support (vLLM 0.7+). Quality + throughput + memory triangle Pareto frontier: concrete benchmark numbers for Llama 3.3 70B FP16 (140GB) vs AWQ-INT4 (35GB) vs FP8 (70GB). Production decision matrix: quality regression budget + cost target.
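The footprint figures quoted above follow from simple arithmetic on bits per weight. The sketch below reproduces them and adds the per-group metadata overhead of 4-bit formats; it assumes decimal gigabytes, counts weights only (no KV cache or activations), and the metadata sizes (16-bit scale, 4-bit zero point per group) are assumptions for illustration rather than the exact layout of any specific kernel.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Weight memory in decimal GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def effective_bits(weight_bits, group_size, scale_bits=16, zero_bits=0):
    """Bits per weight once per-group scale / zero-point metadata is included."""
    return weight_bits + (scale_bits + zero_bits) / group_size


fp16_gb = weight_footprint_gb(70, 16)
fp8_gb = weight_footprint_gb(70, 8)
int4_gb = weight_footprint_gb(70, 4)
# 4-bit weights, group_size 128, assumed 16-bit scale + 4-bit zero point per group.
int4_bits = effective_bits(4, group_size=128, scale_bits=16, zero_bits=4)
int4_with_meta_gb = weight_footprint_gb(70, int4_bits)
print(fp16_gb, fp8_gb, int4_gb, round(int4_with_meta_gb, 2))
```

The metadata adds only a few percent at group size 128, which is why the rounded 35GB figure is a reasonable planning number for a 70B model at 4 bits.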
The eighth module covers in detail how to integrate a new model architecture that is not built into vLLM. Topics: the vllm.ModelRegistry API and its register_model() call; the CausalLM interface implementation (forward() + sample() + get_input_embeddings()); the structure of the vLLM source tree (vllm/model_executor/models/) and its existing model implementations (Llama, Qwen, Mistral, Gemma, DeepSeek, MoE variants). Weight loading: mapping Hugging Face safetensors to vLLM weight tensors; sharded weight loading (multi-file safetensors); quantized weight (AWQ / GPTQ / FP8) loading. Custom layers: writing PagedAttention-compatible attention layers; rotary embedding (RoPE + YaRN scaling) + GroupedQueryAttention (GQA) + Multi-head Latent Attention (MLA, DeepSeek V3); MoE expert routing + top-k expert selection layer. The module shows in practice how to add a custom Turkish CPT-trained Llama 4 / Qwen3 model to vLLM.
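The checkpoint-to-engine weight mapping mentioned above often comes down to renaming tensors when separate projections are fused into one parameter. The sketch below is a pure-Python illustration loosely modeled on the fused q/k/v and gate/up projection pattern; the table `STACKED_PARAMS` and the function `map_weight_name` are hypothetical names for this example and are not vLLM's loader API.

```python
# (fused parameter name, checkpoint parameter name, shard id inside the fused tensor)
# Illustrative table; a real integration must match the target model's own layout.
STACKED_PARAMS = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]


def map_weight_name(hf_name):
    """Return (engine-side name, shard id) for one Hugging Face tensor name.

    Tensors that are not part of a fused parameter keep their name and
    get shard id None.
    """
    for fused, source, shard_id in STACKED_PARAMS:
        if f".{source}." in hf_name:
            return hf_name.replace(source, fused), shard_id
    return hf_name, None


for name in [
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
]:
    print(name, "->", map_weight_name(name))
```

A loader built on such a table would copy each checkpoint tensor into the slice of the fused parameter identified by the shard id, which is also the point where tensor-parallel sharding of the weight would be applied.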
The ninth module addresses the disaggregated-serving discipline — the hottest 2024-2026 inference-engineering frontier. Prefill (compute-bound, matmul-heavy, B200/H200 ideal) and decode (memory-bound, KV cache read, H100 / consumer GPU sufficient) run on separate GPU pools — enabling optimal use of heterogeneous hardware (B200 prefill + H100 decode). KV cache transfer (from prefill node to decode node): GPU-to-GPU NCCL + RDMA + GPUDirect Storage; InfiniBand 400Gb + NVLink Switch System topology. NVIDIA Dynamo (March 2025 release) production-grade disaggregated inference platform; vLLM + SGLang + TensorRT-LLM backend support; smart router. Academic precursors: Mooncake (Moonshot AI 2024, with KV cache pool), DistServe (UCSD 2024). Latency overhead vs throughput gain trade-off — typical scenarios yield 2-4x throughput improvement + 30-50% cost reduction.
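The latency overhead of disaggregation is dominated by moving the prompt's KV cache from the prefill node to the decode node, and its size can be estimated directly. The sketch below assumes a GQA model with 80 layers, 8 KV heads, and head dimension 128 (assumed values for a 70B-class model), a 400 Gb/s link running at full line rate, and no protocol overhead; the function names are illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def transfer_seconds(num_tokens, bytes_per_token, link_gbit_per_s):
    # Idealized transfer time: payload size divided by raw link bandwidth.
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    return num_tokens * bytes_per_token / link_bytes_per_s


per_token_fp16 = kv_bytes_per_token(80, 8, 128, dtype_bytes=2)
per_token_fp8 = kv_bytes_per_token(80, 8, 128, dtype_bytes=1)
t_fp16 = transfer_seconds(32_768, per_token_fp16, link_gbit_per_s=400)
t_fp8 = transfer_seconds(32_768, per_token_fp8, link_gbit_per_s=400)
print(f"FP16 KV: {per_token_fp16 / 1024:.0f} KiB/token, 32K prompt in {t_fp16 * 1e3:.0f} ms")
print(f"FP8  KV: {per_token_fp8 / 1024:.0f} KiB/token, 32K prompt in {t_fp8 * 1e3:.0f} ms")
```

Under these assumptions an FP8 KV cache halves the transfer time, which is one reason KV cache quantization and disaggregated serving are usually tuned together.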
The tenth module covers end to end the discipline of taking vLLM to production. Kubernetes deployment: vLLM Helm chart + NVIDIA GPU operator + nvidia-device-plugin + nvidia-container-toolkit; Deployment + Service + Ingress + HPA YAML; PVC for model-weight cache (ReadWriteMany NFS / S3 mount). NVIDIA Dynamo platform deployment. Monitoring: Prometheus metrics endpoint (vllm:request_latency_seconds histogram, vllm:gpu_cache_usage_perc gauge, vllm:request_prompt_tokens, vllm:request_generation_tokens, vllm:e2e_request_latency_seconds); Grafana vLLM dashboard; Langfuse + Phoenix LLM-observability integration + OpenTelemetry GenAI semantic conventions. Autoscaling: HPA (custom metric: GPU utilization + queue depth), KEDA event-driven scaling, Karpenter scale-to-zero + GPU spot instance. Load balancing: round-robin (default), prefix-aware routing (SGLang inspired — same-prefix requests to the same replica), session-sticky routing.
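Prefix-aware routing, as described above, sends requests that share a prefix to the same replica so that the replica's prefix cache can be reused. A minimal sketch follows; the function name, the replica names, and the 256-character prefix window are assumptions for illustration, and a production router would typically hash token blocks and use consistent hashing so that scaling events do not remap every prefix.

```python
import hashlib


def route_by_prefix(prompt, replicas, prefix_chars=256):
    """Pick a replica from a hash of the prompt's leading characters.

    Requests with an identical leading window (for example the same system
    prompt) always land on the same replica.
    """
    digest = hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["vllm-0", "vllm-1", "vllm-2"]            # hypothetical replica names
system_prompt = "You are a helpful assistant. " * 20  # longer than the prefix window
first = route_by_prefix(system_prompt + "What is PagedAttention?", replicas)
second = route_by_prefix(system_prompt + "Explain chunked prefill.", replicas)
print(first, second)  # both requests share the prefix, so they share a replica
```

The trade-off against round-robin is load skew: a very popular prefix concentrates traffic on one replica, so real routers usually combine prefix affinity with a load threshold.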
The eleventh module addresses in detail the performance front of production vLLM deployment. Tuning parameter details: --max-num-seqs (concurrent request limit, by GPU memory, typically 256-1024), --max-num-batched-tokens (per-step total token budget, 8K-32K), --gpu-memory-utilization (default 0.9, 0.85 for safety margin), --enable-prefix-caching (system prompt cache hit 30-70%), --enable-chunked-prefill (long prompt + streaming friendly), --num-scheduler-steps (multi-step scheduling, reduces batch-decision overhead). Benchmark: vllm-bench serving (online) + offline benchmark tools; realistic workload simulation with ShareGPT + Anthropic + Alpaca trace datasets. TTFT vs throughput vs cost Pareto frontier analysis for production decision-making. Cost optimization: spot instance + scale-to-zero + multi-region failover; reasoning-model serving specific tuning (long-context KV cache). Error diagnosis: OOM (KV cache + activation-memory diagnosis), request stalls (queue-depth analysis), slow tokenization (HF tokenizer profiling).
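The Pareto-frontier analysis mentioned above can be sketched as a filter over benchmark runs: a configuration stays on the frontier if no other configuration is at least as good on every metric and strictly better on one. The numbers below are hypothetical placeholders, not benchmark results, and the `Run` type and function names are introduced only for this example.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    ttft_ms: float        # time to first token, lower is better
    tokens_per_s: float   # throughput, higher is better
    usd_per_mtok: float   # cost per million tokens, lower is better


def dominates(a, b):
    """True if run a is no worse than b on every metric and better on at least one."""
    no_worse = (a.ttft_ms <= b.ttft_ms
                and a.tokens_per_s >= b.tokens_per_s
                and a.usd_per_mtok <= b.usd_per_mtok)
    better = (a.ttft_ms < b.ttft_ms
              or a.tokens_per_s > b.tokens_per_s
              or a.usd_per_mtok < b.usd_per_mtok)
    return no_worse and better


def pareto_frontier(runs):
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


# Hypothetical placeholder measurements for four tuning configurations.
runs = [
    Run("A", ttft_ms=200, tokens_per_s=3000, usd_per_mtok=0.50),
    Run("B", ttft_ms=350, tokens_per_s=5000, usd_per_mtok=0.35),
    Run("C", ttft_ms=400, tokens_per_s=4500, usd_per_mtok=0.40),
    Run("D", ttft_ms=150, tokens_per_s=2000, usd_per_mtok=0.80),
]
frontier = pareto_frontier(runs)
print([r.name for r in frontier])
```

Configurations off the frontier can be discarded outright; choosing among the remaining ones is a product decision about how much latency to trade for throughput and cost.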
In the capstone module, each participant designs an end-to-end vLLM serving stack tailored to their own production scenario: target model (Llama 3.3 70B Instruct / Qwen3 32B / Gemma 3 27B / DeepSeek V3 671B MoE / their CPT-trained custom model), hardware target (single H100 80GB / dual H200 / 8x H100 / 16x B200 cluster / heterogeneous disaggregated B200 prefill + H100 decode), quantization stack selection (AWQ INT4 + FP8 KV cache or FP8 weight + FP8 KV cache or FP4 NVFP4), parallelism strategy (TP 8 single-node or TP 8 × PP 2 multi-node or TP 8 + EP for MoE), serving topology (single-node monolithic or multi-node Ray Serve or disaggregated NVIDIA Dynamo), Kubernetes Helm chart + NVIDIA GPU operator + autoscaling (HPA + KEDA), observability stack (Prometheus + Grafana + Langfuse + Phoenix), benchmark + Pareto frontier + cost analysis, 90-day production deployment + scaling roadmap. By the end of the training, participants reach a level of technical competence to understand vLLM's 5 main components at the source-code level; build the PagedAttention algorithm via the OS-paging analogy; apply continuous batching + chunked prefill + speculative decoding + tensor/pipeline parallelism in production; integrate the AWQ + GPTQ + FP8 + FP4 quantization stack with Marlin + Machete kernels; add custom model + custom layer via ModelRegistry; build NVIDIA Dynamo + Mooncake disaggregated serving architecture; deploy a Kubernetes + Helm + Ray Serve + Prometheus + Langfuse production stack; and optimize production performance with tuning parameters + benchmark + cost analysis. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.
Training Methodology
The only production-grade advanced program in Turkey that addresses vLLM internals + custom backend discipline end to end in Turkish
Mathematical construction of the PagedAttention algorithm via the OS-paging analogy
Deep dive into continuous batching + chunked prefill + iteration-level scheduling
Speculative decoding: EAGLE-3 + MEDUSA + draft model + ngram speculator integration
Production deployment of 70B-671B models with Tensor + Pipeline + Expert Parallelism
AWQ + GPTQ + FP8 + FP4 quantization integration + Marlin/Machete kernel optimization
Adding custom models: ModelRegistry + writing PagedAttention-compatible layers
NVIDIA Dynamo + Mooncake disaggregated serving (March 2025 frontier) implementation
Why This Course?
The only advanced, production-grade program in Turkey that covers vLLM internals + custom backend discipline end to end in Turkish.
Full mathematical construction of PagedAttention (Kwon 2023) + continuous batching (Orca 2022) + speculative decoding (EAGLE-3 + MEDUSA).
Hands-on DeepSeek V3 671B MoE production deployment with Tensor + Pipeline + Expert Parallelism.
AWQ + GPTQ + FP8 + FP4 quantization integration + Marlin/Machete kernel optimization.
Writing custom models and PagedAttention-compatible layers with ModelRegistry.
NVIDIA Dynamo (March 2025) + Mooncake + DistServe disaggregated-serving frontier coverage.
Building a Kubernetes + Ray Serve + Prometheus + Langfuse production stack.
Through the capstone project, equips the participant with a vLLM serving stack applicable on their own hardware target.
Course Curriculum
104 Lessons
Instructor

Şükrü Yusuf KAYA
AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant
Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.