vLLM vs SGLang vs TensorRT-LLM: which one should I use in production?

General recommendation as of 2026: (1) Default production choice → vLLM (largest community, broadest model support, multi-vendor hardware, open source). (2) Reasoning models, complex prompts, and workloads that benefit from RadixAttention → SGLang (prefix-aware optimization, structured generation). (3) NVIDIA-only fleets that need maximum throughput and early model access → TensorRT-LLM. (4) Simple setup with native Hugging Face integration → TGI. (5) Apple Silicon or CPU edge deployments → llama.cpp. For most Turkish enterprise scenarios, vLLM is the safest and most flexible choice. Module 1.2 covers this in detail.

What are the breaking changes in vLLM v1 (March 2025)? Do I need to migrate from v0?

The v1 redesign brought significant changes: (1) a move from a synchronous to an asynchronous architecture, with a reported ~1.7x throughput gain; (2) decoupled scheduler and worker processes; (3) changes to some flags and to the API surface (for example, --enable-prefix-caching is on by default and --use-v2-block-manager was removed); (4) some experimental v0 features, such as multi-step scheduling, were reworked as part of the v1 design, so verify whether the related flags still apply to your version. Migration: v1 was first opt-in via VLLM_USE_V1=1 and later became the default. Production recommendation: run a full regression test in staging before moving to v1; most use cases are drop-in compatible, but edge cases need attention. Module 2.3 covers this in detail.

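Can I serve a 70B model on a single H100 (80 GB)?

Not in FP16: the weights alone need about 140 GB, more than the 80 GB available. Options: (1) AWQ INT4 quantization → about 35 GB of weights plus KV cache; feasible on a single H100, but the remaining KV-cache budget limits context length and concurrency (the course assumes roughly 8K context); (2) FP8 quantization → about 70 GB of weights plus KV cache, which leaves very little margin; (3) disaggregated serving (for example, prefill on 8x H100 and decode on 4x H100), which is no longer a single-node setup; (4) recommended: TP 2 across two H100s (FP16, 32K context) or a single H100 with AWQ (8K context). Module 6 and the Module 12 capstone build a decision matrix specific to your scenario.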
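The arithmetic behind these options can be sketched in a few lines. This is a rough estimate, not a vLLM API: the model shape (80 layers, 8 KV heads, head dimension 128, typical of Llama-3-70B-style GQA models), the 0.9 memory utilization, and the 4 GB runtime overhead are all assumptions made for illustration.

```python
GB = 1e9  # decimal gigabytes, matching the 140 / 70 / 35 GB figures above

def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint; ignores quantization scales and zero points."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of K and V stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_token_budget(gpu_gb: float, utilization: float, weights_gb: float,
                    per_token_bytes: int, overhead_gb: float = 4.0) -> int:
    """Tokens of KV cache that fit after weights and an assumed runtime overhead."""
    free_gb = gpu_gb * utilization - weights_gb - overhead_gb
    return max(0, int(free_gb * GB // per_token_bytes))

# Assumed Llama-3-70B-style shape with an FP16 KV cache (2 bytes per element).
PER_TOKEN = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

for label, bits in (("FP16", 16), ("FP8", 8), ("AWQ-INT4", 4)):
    w = weight_gb(70, bits)
    budget = kv_token_budget(80, 0.9, w, PER_TOKEN)
    print(f"{label}: {w:.0f} GB weights, {budget} KV-cache tokens on one 80 GB GPU")
```

The token budget is shared by all concurrent sequences, so a larger context per request directly reduces how many requests can run at once.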

How do I deploy the DeepSeek V3 671B MoE on vLLM?

DeepSeek V3 (671B total parameters, 37B active per token) needs a combination of FP8 quantization, Expert Parallelism (EP), and TP. Minimum: 16x H100 (1.2 TB+ of VRAM) or 8x B200; cost-optimal: 16x H100 with FP8, EP 8, and TP 2. Multi-node: Ray Serve plus InfiniBand. vLLM 0.7+ supports DeepSeek V3 natively, with a custom MLA (Multi-head Latent Attention) layer and built-in MoE routing. Download the deepseek-ai/DeepSeek-V3 weights from Hugging Face, then run vllm serve. Modules 6 and 8 cover this in detail.

NVIDIA Dynamo vs Mooncake for disaggregated serving: which one should I choose?

As of March 2025: (1) NVIDIA Dynamo is production-grade, maintained by NVIDIA, supports vLLM, SGLang, and TensorRT-LLM as backends, and is optimized for NVIDIA hardware, so it is ready for production; (2) Mooncake (Moonshot AI, 2024) comes from an academic and research context, with a KV cache pool design that is novel but less documented for production deployment. Practical recommendation: NVIDIA GPU cluster at production scale → Dynamo; research or a custom prototype → fork Mooncake. Module 9 presents a comparison matrix of the two.

How do I integrate my own Turkish continued-pre-training (CPT) model into vLLM?

There are two paths: (1) If the model uses a standard architecture (a Llama 3.3 / Qwen3 / Gemma 3 base), export Hugging Face safetensors and vllm serve works directly. (2) If it uses a custom architecture (modified attention, custom MoE), register a new model class with vllm.ModelRegistry and register_model(), implement the CausalLM interface, write a PagedAttention-compatible attention layer, and define the weight loading mapping. Most Turkish CPT models (Cosmos / Trendyol AI / KUIS-AI) are Llama-based, so path 1 is enough. Path 2 becomes necessary when you use a custom MLA, GQA, or RoPE scaling variant that vLLM does not already implement. Module 8 shows this in practice.

I am hitting OOM (out of memory) errors in production. What should I do?

Systematic troubleshooting: (1) lower --gpu-memory-utilization (0.9 → 0.85 → 0.80); (2) lower --max-num-seqs (the concurrent request limit); (3) reduce --max-model-len (shrink the context window); (4) enable quantization (AWQ INT4 plus FP8 KV cache); (5) disable prefix caching with --no-enable-prefix-caching (this can relieve cache memory pressure, but throughput drops); (6) increase --swap-space (CPU offload); (7) increase TP (distribute memory across more GPUs). A breakdown of activation memory, KV cache, model weights, and buffers is critical. Module 11.3 presents detailed OOM diagnosis.
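The first three steps of this checklist can be sketched as a "ladder" that relaxes one setting at a time, so each retry changes a single variable. The `ServeConfig` class, the starting values, and the lower bounds (32 sequences, 8192 tokens) are hypothetical choices for illustration; only the three flag names come from the text above.

```python
from dataclasses import dataclass, replace
from typing import Iterator

@dataclass(frozen=True)
class ServeConfig:
    # Hypothetical starting point; tune to your own deployment.
    gpu_memory_utilization: float = 0.90
    max_num_seqs: int = 256
    max_model_len: int = 32768

def oom_ladder(cfg: ServeConfig) -> Iterator[ServeConfig]:
    """Yield progressively more conservative configs, one change per step."""
    for util in (0.85, 0.80):                      # step 1: leave more headroom
        if util < cfg.gpu_memory_utilization:
            cfg = replace(cfg, gpu_memory_utilization=util)
            yield cfg
    while cfg.max_num_seqs > 32:                   # step 2: fewer concurrent requests
        cfg = replace(cfg, max_num_seqs=cfg.max_num_seqs // 2)
        yield cfg
    while cfg.max_model_len > 8192:                # step 3: shorter context window
        cfg = replace(cfg, max_model_len=cfg.max_model_len // 2)
        yield cfg

def to_flags(cfg: ServeConfig) -> list[str]:
    """Render a config as vllm serve flags."""
    return [
        "--gpu-memory-utilization", str(cfg.gpu_memory_utilization),
        "--max-num-seqs", str(cfg.max_num_seqs),
        "--max-model-len", str(cfg.max_model_len),
    ]

steps = list(oom_ladder(ServeConfig()))
for step in steps:
    print(" ".join(to_flags(step)))
```

Changing one knob per retry makes it easier to attribute the fix to KV cache size, concurrency, or context length.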

Does speculative decoding really deliver a 2-4x throughput increase? In which scenarios does it help?

It depends on the scenario: (1) long generation (a reasoning model with a 4K-32K thinking trace) → 3-5x speedup, the highest gain; (2) code generation (deterministic, predictable tokens) → 2-3x speedup; (3) conversational chat (short responses) → 1.3-1.8x speedup; (4) high-temperature sampling → smaller gain, because the acceptance rate drops. EAGLE-3 (Li et al., 2025) gives the highest gain (a reported 4-5x when serving R1); MEDUSA sits in the middle; the ngram speculator is simple but gives a 30-50% boost on coding and repetitive tasks. Module 5.3 presents the Pareto frontier for each.

What is the minimum hardware and cost for a Kubernetes vLLM deployment?

Minimum production scenarios (2026 cloud pricing): (1) 7B model in FP16: one L40S (48 GB), ~$1.5/hour (RunPod); (2) 13B model with AWQ: one L40S; (3) 70B model with AWQ INT4 and TP 2: 2x A100 (40 GB), ~$3/hour; (4) 70B model in FP16 with TP 4: 4x A100 (80 GB), ~$6/hour; (5) DeepSeek V3 671B with FP8, EP 8, and TP 2: 16x H100, ~$50/hour. Kubernetes overhead (GPU operator plus monitoring stack) adds about 5%. Scale-to-zero can reduce idle cost. Modules 10 and 11 cover cost optimization in detail.

How should I tune vLLM for serving reasoning models (R1, o3)?

A reasoning model produces a 16K-128K thinking trace, which differs from classic chat. Tuning: (1) --max-model-len 65536 or more; (2) --kv-cache-dtype fp8 (KV cache memory dominates for long traces); (3) --enable-chunked-prefill (for long prompts); (4) increase --max-num-batched-tokens (8192+); (5) lower --max-num-seqs (each sequence uses a large KV cache); (6) enable speculative decoding (EAGLE-3 with a reasoning model, a reported 4-5x speedup); (7) use prefix caching to reuse the system prompt and reasoning instructions. Module 11 covers reasoning-specific tuning in detail.
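As a rough illustration, the flags listed above can be assembled into a launch command. This sketch only builds and prints the command; the model name is a placeholder, and `max_num_seqs=32` is an assumed value rather than a recommendation from the course. Speculative decoding is left out because its configuration depends on the draft method and vLLM version.

```python
import shlex

def reasoning_serve_args(model: str, max_model_len: int = 65536,
                         max_num_seqs: int = 32,
                         max_num_batched_tokens: int = 8192) -> list[str]:
    """Build a vllm serve command for long thinking traces (values are assumptions)."""
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),          # room for the thinking trace
        "--kv-cache-dtype", "fp8",                      # halve KV cache memory vs FP16
        "--enable-chunked-prefill",                     # interleave long prompts with decode
        "--enable-prefix-caching",                      # reuse shared system prompt blocks
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--max-num-seqs", str(max_num_seqs),            # fewer, larger sequences
    ]

# Placeholder model id; substitute the reasoning model you actually serve.
cmd = reasoning_serve_args("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
print(shlex.join(cmd))
```

Keeping the arguments in a list rather than a string makes it easy to pass them to `subprocess.run` or to render them into a Kubernetes container spec.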

Which concrete artifacts will I have at the end of the training?

The capstone project produces the following artifacts: (1) a vLLM serving stack tailored to your scenario (Python, Kubernetes Helm chart, Docker Compose); (2) a decision document covering hardware, model, quantization, and parallelism; (3) custom model integration code with ModelRegistry registration, if applicable; (4) Prometheus and Grafana vLLM dashboard configuration; (5) Langfuse and Phoenix observability integration; (6) HPA and KEDA autoscaling YAML; (7) a vllm bench benchmark report with Pareto frontier analysis; (8) a 90-day production deployment and scaling roadmap with cost analysis.

Can the training be tailored to our corporate team?

Yes. Alongside the standard 3-day program, we run customized closed-class versions for corporate clients. Module weights and capstone scenarios are adapted to your team's existing hardware infrastructure (H100 cluster, B200 cluster, AMD MI300X, AWS Trainium 2), target models (Llama / Qwen / DeepSeek / your own CPT model), production SLA targets (TTFT, throughput, cost), compliance requirements (KVKK self-hosted, EU AI Act), and current inference stack (prior experience with TGI / Ray Serve / Triton Inference Server).

About this training

A 3-day advanced training in Turkish that covers, end to end, the internal architecture of vLLM (the production LLM inference engine standard), the PagedAttention algorithm, continuous batching mechanics, speculative decoding (EAGLE-3 + MEDUSA), tensor + pipeline + expert parallelism, AWQ + GPTQ + FP8 + FP4 quantization integration, custom model integration, and NVIDIA Dynamo disaggregated serving. The Kubernetes + Ray Serve + Prometheus + Langfuse production stack is included.

This training is designed for:
ML Engineers and Inference Engineers deploying inference engines for enterprise LLM products
ML Platform engineers serving DeepSeek V3 / Llama 4 / Qwen3 / Gemma 3 in production
Senior backend developers who need to optimize long-context serving cost for reasoning models (o3, R1)
Inference researchers investigating NVIDIA Dynamo and disaggregated serving
Teams that want to integrate their own CPT model (Turkish LLM, domain-specific) into vLLM
SREs and Platform Engineers managing production GPU clusters (H100 / B200)

Why this training matters:
The only advanced program in Turkey that covers vLLM internals and custom backend engineering in Turkish, end to end, at production grade.
Full mathematical construction of PagedAttention (Kwon 2023), continuous batching (Orca 2022), and speculative decoding (EAGLE-3 + MEDUSA).
Hands-on production deployment of the DeepSeek V3 671B MoE with Tensor + Pipeline + Expert Parallelism.
AWQ + GPTQ + FP8 + FP4 quantization integration and Marlin/Machete kernel optimization.
Writing a custom model and a PagedAttention-compatible layer with ModelRegistry.
Coverage of the disaggregated serving frontier: NVIDIA Dynamo (March 2025), Mooncake, and DistServe.
Building a Kubernetes + Ray Serve + Prometheus + Langfuse production stack.
A capstone project in which each participant produces a vLLM serving stack that can be applied on their own hardware target.

Outcomes you will gain by the end of the training:
You will be able to understand vLLM's 5 main components (LLMEngine, Scheduler, Worker, BlockManager, Sampler) at source-code level.
You will be able to construct the PagedAttention algorithm at the mathematical level.
You will be able to use continuous batching, chunked prefill, and scheduling policies in production.
You will be able to integrate speculative decoding (EAGLE-3, MEDUSA, ngram) with vLLM.
You will be able to set up multi-GPU and multi-node serving for 70B-671B models with TP + PP + EP.
You will be able to use the AWQ + GPTQ + FP8 + FP4 + KV cache quantization stack proficiently.
You will be able to write a custom model, a custom layer, and PagedAttention-compatible attention with ModelRegistry.
You will be able to deploy an NVIDIA Dynamo + Mooncake disaggregated serving architecture.
You will be able to build a Kubernetes + Helm + Ray Serve + Prometheus + Langfuse production stack.
You will be able to optimize performance with tuning parameters, vllm bench, and Pareto frontier analysis.

Prerequisites and recommended background:
Active Python experience (upper-intermediate level), basic PyTorch + CUDA usage
At least conceptual experience with LLM inference and serving (vLLM / TGI / TensorRT-LLM)
Docker + Kubernetes + Helm chart experience (for production deployment)
Basic GPU + CUDA knowledge (writing CUDA kernels is not part of the training; usage is)
Basic knowledge of linear algebra and the transformer architecture
Access to an H100 on RunPod / Lambda Labs / AWS before the training (for the capstone)

Training Methodology

The only production-grade advanced program in Turkey that covers vLLM internals and custom backend engineering end to end in Turkish
Mathematical construction of the PagedAttention algorithm through the OS paging analogy
In-depth coverage of continuous batching, chunked prefill, and iteration-level scheduling
Speculative decoding: EAGLE-3 + MEDUSA + draft model + ngram speculator integration
Production deployment of 70B-671B models with Tensor + Pipeline + Expert Parallelism
AWQ + GPTQ + FP8 + FP4 quantization integration and Marlin/Machete kernel optimization
Adding custom models: ModelRegistry and writing a PagedAttention-compatible layer
Implementation of NVIDIA Dynamo + Mooncake disaggregated serving (the March 2025 frontier)

Advanced Level | 3 Days

vLLM Internals and Custom Backend Engineering Training (PagedAttention + Continuous Batching + Speculative Decoding + NVIDIA Dynamo)

About the Training

This training is designed to teach in Turkish, end to end, the internal architecture, algorithmic foundations, and production deployment discipline of vLLM, which became the de facto inference engine standard of the 2024-2026 period. The journey began with the PagedAttention paper from UC Berkeley's Sky Computing Lab (posted in September 2023 and presented at SOSP 2023). Through 30K+ GitHub stars, incubation under the LF AI & Data Foundation in 2025, the vLLM v1 redesign (March 2025), the NVIDIA Dynamo collaboration (March 2025), and the Neural Magic, Anyscale, and Red Hat ecosystem, vLLM grew into a production-grade platform. In Turkey there is almost no training that covers this discipline from source-code level through to production Kubernetes deployment; existing material either stops at short vLLM tutorials or stalls at demos of the OpenAI-compatible server. This program is designed to fill that gap as Turkey's most comprehensive production-grade vLLM internals reference training.

The strategic backbone of the program is the first module, which clarifies vLLM's origin and rise, why it became the inference engine standard, and the 2026 ecosystem landscape. Topics include the emergence of the PagedAttention paper (Kwon et al., SOSP 2023) from UC Berkeley's Sky Lab; production adoption in 2024; LF AI & Data Foundation incubation in 2025; the vLLM v1 redesign (March 2025: sync-to-async architecture, 1.7x throughput); the NVIDIA Dynamo collaboration; and the Neural Magic acquisition (by Red Hat, 2024). Inference engine comparison: vLLM vs SGLang (Stanford + UC Berkeley, RadixAttention), vs TensorRT-LLM (NVIDIA-only, the fastest NVIDIA-native option), vs Hugging Face TGI (simple and production-ready), vs LMDeploy (Shanghai AI Lab, TurboMind kernel). The 2026 inference landscape: multi-vendor inference (NVIDIA H100/B200, AMD MI300X/MI355X, AWS Trainium 2, Apple Silicon), the distinct requirements of reasoning-model, agent, and long-context serving, and open-source vs commercial inference (Anyscale, Together AI, Fireworks).

The second module covers vLLM's internal architectural components at source-code level. The five main components: LLMEngine (the central coordinator that manages the request lifecycle), Scheduler (request queueing and scheduling policy, with running/waiting/swapped queues), Worker (GPU model executor plus ModelRunner), BlockManager (PagedAttention block allocation and the logical-to-physical block table), and Sampler (greedy / temperature / top-k / top-p / min-p). The v0 → v1 redesign (March 2025): a move from a synchronous architecture to a decoupled asynchronous one; separate scheduler and worker processes; a 1.7x throughput improvement along with breaking changes. Python entry points: vllm.LLM (sync API, offline batch inference), vllm.AsyncLLMEngine (async server use), and the OpenAI-compatible HTTP server (vllm serve). Without this architectural understanding, a custom backend cannot be written.

The third module constructs PagedAttention, vLLM's core innovation, at the mathematical level. In classic attention serving, the KV cache requires a contiguous memory allocation per sequence; because sequence lengths vary, internal fragmentation (allocated but unused slots) and external fragmentation (memory holes) together caused 60-80% memory waste. PagedAttention adapts the logic of OS virtual memory paging to LLM serving: the KV cache is split into fixed-size blocks (16 tokens per block by default), each sequence keeps its own logical block table, and physical GPU memory is allocated dynamically from a pool of free blocks. The result: memory waste drops to about 4% and throughput rises 2-4x. Advanced features: prefix caching (shared blocks with reference counting, so requests with the same system prompt share the same KV blocks), memory sharing for beam search and parallel sampling, and block swapping (GPU → CPU offloading).
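The block-table idea can be illustrated with a toy allocator. This is a minimal sketch of the bookkeeping only (no tensors, no attention kernel), and the class and function names are hypothetical rather than vLLM's own; it assumes the default block size of 16 tokens mentioned above and shares only full prefix blocks.

```python
class BlockAllocator:
    """Toy pool of fixed-size physical KV blocks with reference counting."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of free physical blocks
        self.refcount = {}                   # physical id -> sequences using it

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        """Reuse an existing block (e.g. a cached system-prompt prefix)."""
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)

def blocks_needed(num_tokens: int, block_size: int) -> int:
    return -(-num_tokens // block_size)  # ceiling division

def build_block_table(alloc: BlockAllocator, num_tokens: int, shared_prefix=()) -> list[int]:
    """Logical block i maps to table[i]; prefix blocks are shared, the rest allocated."""
    table = [alloc.share(b) for b in shared_prefix]
    remaining = blocks_needed(num_tokens, alloc.block_size) - len(table)
    table += [alloc.allocate() for _ in range(remaining)]
    return table

alloc = BlockAllocator(num_blocks=16)
seq_a = build_block_table(alloc, num_tokens=40)                           # 3 blocks
seq_b = build_block_table(alloc, num_tokens=50, shared_prefix=seq_a[:2])  # 2 shared + 2 new
print("A:", seq_a, "B:", seq_b, "free blocks:", len(alloc.free))
```

Only the last block of each sequence can contain unused slots, which is why waste stays small compared with reserving a contiguous region for the maximum sequence length.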

The fourth module covers continuous batching, introduced by the 2022 Orca paper (OSDI) and made production-grade in vLLM, as a replacement for classic static batching (waiting for a group of requests to arrive, then leaving the GPU idle until the longest sequence finishes). Iteration-level scheduling: the batch composition is updated at every token generation step; completed requests leave the batch and waiting requests enter it. GPU utilization is 30-50% with static batching and 90%+ with continuous batching. Chunked prefill: prefill (compute-bound, matmul-heavy) and decode (memory-bound, KV cache reads) follow different patterns; with chunked prefill, a long prompt is processed in chunks in parallel with decode requests. Mixed prefill-decode batching makes optimal use of the GPU. Scheduling policy: FCFS, priority, and fairness; preemption (swapping the KV cache to CPU, or recomputing it).
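A toy simulation makes the difference concrete. The sketch below counts decode steps only (prefill and memory limits are ignored), and the request lengths are made-up numbers chosen for illustration.

```python
from collections import deque

def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes; short requests idle."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Iteration-level scheduling: free slots are refilled before every decode step."""
    waiting = deque(lengths)
    running: list[int] = []   # remaining tokens per running request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())      # admit waiting requests
        steps += 1                                  # one decode step for the batch
        running = [r - 1 for r in running if r > 1]  # drop finished requests
    return steps

# Hypothetical output lengths (tokens to generate) for six requests.
lengths = [2, 8, 3, 7, 1, 6]
print("static:", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With these lengths, static batching needs 21 steps while continuous batching needs 15, because a slot freed by a short request is reused immediately instead of idling.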

The fifth module covers the integration of speculative decoding into vLLM in detail. Speculative decoding (Leviathan 2023) is two-stage decoding: a small draft model predicts several tokens and the large target model verifies them in parallel; the module derives how the acceptance rate determines the speedup. Modern variants: EAGLE-3 (Li et al., 2025; feature-level drafting plus tree attention, a reported 4-5x speedup), MEDUSA (Princeton 2024; multi-head self-speculation, no draft model required), and Lookahead decoding (Fu 2024; ngram plus n-token prediction). In vLLM: the --speculative-config CLI parameter, the ngram speculator, and draft model orchestration; choosing between tree-based and sequential speculative decoding. For reasoning models (o3, DeepSeek R1) that serve 16K-128K thinking traces, speculative decoding provides a 40-60% cost reduction, which is a critical advantage in production.
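The link between acceptance rate and speedup can be written down directly using the analysis in Leviathan et al. (2023): with acceptance rate α and γ drafted tokens per step, the expected number of tokens produced per target-model run is (1 - α^(γ+1)) / (1 - α). The sketch below assumes a constant, independent acceptance rate and a draft model whose cost is a fixed fraction of the target model's; both are idealizations, and the numbers are illustrative rather than measured.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass."""
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft accepted, plus one bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized wall-clock speedup; cost_ratio = draft step cost / target step cost."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)

# Illustrative sweep: higher acceptance (e.g. predictable code) gives higher gain.
for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_speedup(alpha, gamma=4, cost_ratio=0.05), 2))
```

This also shows why high-temperature sampling helps less: a lower α shrinks the numerator while the drafting cost in the denominator stays the same.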

The sixth module covers distributed multi-GPU and multi-node serving for large models (70B+) that do not fit on a single GPU. Tensor Parallelism (TP, following Megatron-LM 2019): attention and MLP are split within each layer, with all-reduce communication; --tensor-parallel-size 8 is ideal for a single node with 8x H100. Pipeline Parallelism (PP, GPipe + 1F1B): layers are split across stages with a micro-batch pipeline, where micro-batch size tuning is critical; --pipeline-parallel-size 2 is used across nodes. Expert Parallelism (EP): expert distribution for MoE models such as DeepSeek V3 671B (37B active). Data Parallelism (DP): replication for scale-out. Multi-node orchestration with vLLM and Ray Serve; NCCL all-reduce and ring all-reduce; NVLink 5 (B200), NVSwitch, and InfiniBand 400 Gb topology; the Blackwell GB200 NVL72 rack architecture (a 72-GPU coherent fabric). A DeepSeek V3 671B (FP8) deployment needs 16x H100 at minimum.
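The Megatron-style split of an MLP can be checked numerically with a toy example. This sketch uses plain Python lists instead of GPU tensors and simulates the "GPUs" with a loop: the first weight matrix is split by columns, the second by rows, and a final sum plays the role of the all-reduce. The tiny matrices are arbitrary illustrative values.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def split_columns(m, parts):
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]

def split_rows(m, parts):
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]

def tp_mlp(x, w1, w2, tp):
    """Column-parallel first layer, row-parallel second layer, then all-reduce (sum)."""
    partials = []
    for w1_shard, w2_shard in zip(split_columns(w1, tp), split_rows(w2, tp)):
        hidden = relu(matmul(x, w1_shard))         # elementwise op needs no communication
        partials.append(matmul(hidden, w2_shard))  # partial output on this shard
    # Simulated all-reduce: elementwise sum of the per-shard partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

x = [[1, -2]]
w1 = [[1, 2, 3, 4], [0, 1, -1, 2]]
w2 = [[1, 0], [0, 1], [1, 1], [2, -1]]
print(tp_mlp(x, w1, w2, tp=2), matmul(relu(matmul(x, w1)), w2))
```

Both expressions print the same result, which is the point of the split: only one all-reduce per MLP block is needed, because the activation is applied to column shards independently.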

The seventh module covers vLLM's quantization support in detail. AWQ (Lin 2023, with the Marlin kernel optimization from Neural Magic, 2024): 4-bit weight serving with 2-3x throughput; the --quantization awq_marlin parameter. GPTQ (Frantar 2022, with the GPTQModel kernel): act-order (desc_act) and group_size 128. FP8 (native E4M3 Tensor Core support on Hopper) and NVFP4 on Blackwell: hardware-native low precision; --quantization fp8 and the fp8_e4m3 format. INT8 W8A8 (SmoothQuant outlier migration). KV cache quantization: --kv-cache-dtype fp8 is critical for long traces from reasoning models; KIVI 2-bit is listed as experimental support (vLLM 0.7+). The quality, throughput, and memory triangle as a Pareto frontier: concrete benchmark numbers for Llama 3.3 70B in FP16 (140 GB) vs AWQ INT4 (35 GB) vs FP8 (70 GB). Production decision matrix: quality regression budget and cost target.
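To show what "4-bit weights with group_size 128" means at the storage level, here is a sketch of plain symmetric round-to-nearest group quantization. It is deliberately not AWQ or GPTQ: AWQ additionally rescales channels based on activation statistics and GPTQ uses second-order error compensation, and neither is reproduced here. The function names are hypothetical.

```python
def quantize_group_int4(weights: list[float], group_size: int = 128):
    """Symmetric round-to-nearest INT4 with one scale per group of weights."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Map the largest magnitude in the group to 7 (symmetric int4 range [-7, 7]).
        scale = max(abs(w) for w in group) / 7 or 1.0  # 1.0 avoids division by zero
        scales.append(scale)
        codes.extend(max(-7, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize(codes: list[int], scales: list[float], group_size: int = 128) -> list[float]:
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

weights = [7.0, -3.0, 1.2, 0.0]
codes, scales = quantize_group_int4(weights, group_size=4)
restored = dequantize(codes, scales, group_size=4)
print(codes, scales, restored)
```

With a 16-bit scale stored per 128 weights, the effective cost is 4 + 16/128 = 4.125 bits per weight, which is why a 70B model lands slightly above the idealized 35 GB.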

The eighth module covers how to integrate a new model architecture that vLLM does not support out of the box. Topics: the vllm.ModelRegistry API and its register_model() method; implementing the CausalLM interface (forward() + sample() + get_input_embeddings()); the structure of the vLLM source tree (vllm/model_executor/models/) and the existing model implementations (Llama, Qwen, Mistral, Gemma, DeepSeek, MoE variants). Weight loading: mapping Hugging Face safetensors to vLLM weight tensors; sharded weight loading (multi-file safetensors); loading quantized weights (AWQ / GPTQ / FP8). Custom layers: writing a PagedAttention-compatible attention layer; rotary embeddings (RoPE + YaRN scaling), Grouped-Query Attention (GQA), and Multi-head Latent Attention (MLA, DeepSeek V3); MoE expert routing and a top-k expert selection layer. Adding a custom Turkish CPT model based on Llama 4 / Qwen3 to vLLM is shown in practice.
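A registration call for an out-of-tree model might look like the sketch below. `ModelRegistry.register_model` and `ModelRegistry.get_supported_archs` are vLLM APIs, but the architecture name and the `my_vllm_models.turkish_llama` module path are hypothetical placeholders, and the model class itself is not shown. The import is guarded so the snippet also runs where vLLM is not installed.

```python
# Hypothetical names: replace with your own architecture and module path.
ARCH_NAME = "TurkishLlamaForCausalLM"
MODEL_PATH = "my_vllm_models.turkish_llama:TurkishLlamaForCausalLM"  # "module:Class"

def register() -> bool:
    """Register the custom architecture if vLLM is available; report whether it ran."""
    try:
        from vllm import ModelRegistry
    except ImportError:
        return False  # vLLM not installed in this environment
    if ARCH_NAME not in ModelRegistry.get_supported_archs():
        # Passing a "module:Class" string defers the import until the model is needed.
        ModelRegistry.register_model(ARCH_NAME, MODEL_PATH)
    return True

print("registered:", register())
```

The registered name has to match the `architectures` entry in the checkpoint's `config.json`, since that is how vLLM selects the model class at load time.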

The ninth module covers disaggregated serving, the hottest frontier of inference engineering in 2024-2026. Prefill (compute-bound, matmul-heavy, ideal on B200/H200) and decode (memory-bound, KV cache reads, where an H100 or a consumer GPU is sufficient) run in separate GPU pools, which allows optimal use of heterogeneous hardware (B200 for prefill, H100 for decode). KV cache transfer from the prefill node to the decode node: GPU-to-GPU NCCL, RDMA, and GPUDirect Storage; InfiniBand 400 Gb and NVLink Switch System topology. NVIDIA Dynamo (released March 2025) is a production-grade disaggregated inference platform with vLLM, SGLang, and TensorRT-LLM backend support and a smart router. Academic precursors: Mooncake (Moonshot AI 2024, built around a KV cache pool) and DistServe (UCSD 2024). The trade-off is latency overhead versus throughput gain: in a typical scenario, a 2-4x throughput improvement and a 30-50% cost reduction.
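The latency overhead side of this trade-off is dominated by moving the prompt's KV cache between pools. A back-of-the-envelope estimate, assuming 327,680 bytes of KV cache per token (an FP16 cache for a Llama-3-70B-style model with 80 layers, 8 KV heads, and head dimension 128) and a link that sustains its full nominal bandwidth, could look like this:

```python
def kv_transfer_seconds(prompt_tokens: int, kv_bytes_per_token: int,
                        link_gbit_per_s: float) -> float:
    """Ideal transfer time for a prompt's KV cache; ignores protocol overhead and latency."""
    total_bytes = prompt_tokens * kv_bytes_per_token
    bytes_per_second = link_gbit_per_s * 1e9 / 8  # gigabits/s -> bytes/s
    return total_bytes / bytes_per_second

# Assumed per-token KV size: 2 (K and V) * 80 layers * 8 heads * 128 dims * 2 bytes.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2

for tokens in (2048, 8192, 32768):
    ms = kv_transfer_seconds(tokens, KV_BYTES_PER_TOKEN, link_gbit_per_s=400) * 1e3
    print(f"{tokens} prompt tokens -> {ms:.1f} ms over a 400 Gb/s link")
```

Real transfers are slower than this ideal figure, but the estimate is useful for checking whether the transfer cost is small relative to the prefill time it is paired with.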

The tenth module covers taking vLLM to production, end to end. Kubernetes deployment: the vLLM Helm chart, NVIDIA GPU operator, nvidia-device-plugin, and nvidia-container-toolkit; Deployment, Service, Ingress, and HPA YAML; a PVC for the model weight cache (ReadWriteMany NFS or an S3 mount). NVIDIA Dynamo platform deployment. Monitoring: the Prometheus metrics endpoint (the vllm:time_to_first_token_seconds histogram, the vllm:gpu_cache_usage_perc gauge, vllm:request_prompt_tokens, vllm:request_generation_tokens, vllm:e2e_request_latency_seconds); a Grafana vLLM dashboard; Langfuse and Phoenix LLM observability integration with OpenTelemetry GenAI semantic conventions. Autoscaling: HPA (custom metrics: GPU utilization and queue depth), KEDA event-driven scaling, Karpenter scale-to-zero with GPU spot instances. Load balancing: round-robin (default), prefix-aware routing (inspired by SGLang, so requests with the same prefix go to the same replica), and session-sticky routing.
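As a small illustration of how these metrics feed a scaling decision, the sketch below parses Prometheus text-format output and applies a simple rule. The sample payload, the `model_name="demo"` label, and both thresholds are made up for the example; in practice HPA or KEDA would evaluate a rule like this from Prometheus rather than from hand-written code. The parser ignores labels and timestamps for brevity.

```python
SAMPLE = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.93
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 12.0
"""

def parse_metrics(text: str) -> dict[str, float]:
    """Parse 'name{labels} value' lines; comment lines and labels are skipped."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_and_labels, value = line.rsplit(" ", 1)
        values[name_and_labels.split("{", 1)[0]] = float(value)
    return values

def should_scale_out(metrics: dict[str, float],
                     cache_threshold: float = 0.9, queue_threshold: float = 8) -> bool:
    """Hypothetical rule: scale when the KV cache is nearly full and requests queue up."""
    return (metrics.get("vllm:gpu_cache_usage_perc", 0.0) > cache_threshold
            and metrics.get("vllm:num_requests_waiting", 0.0) > queue_threshold)

metrics = parse_metrics(SAMPLE)
print(metrics, should_scale_out(metrics))
```

Combining cache usage with queue depth avoids scaling on a full cache alone, which is normal for a well-utilized replica.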

The eleventh module covers the performance side of a production vLLM deployment in detail. Tuning parameters: --max-num-seqs (concurrent request limit, sized to GPU memory, typically 256-1024), --max-num-batched-tokens (total token budget per step, 8K-32K), --gpu-memory-utilization (default 0.9; 0.85 for a safety margin), --enable-prefix-caching (30-70% cache hit rate on system prompts), --enable-chunked-prefill (friendly to long prompts and streaming), and --num-scheduler-steps (multi-step scheduling in the v0 engine, which reduces batch decision overhead). Benchmarking: vllm bench serve (online) and the offline benchmark tools; realistic workload simulation with ShareGPT, Anthropic, and Alpaca trace datasets. Pareto frontier analysis of TTFT vs throughput vs cost supports production decisions. Cost optimization: spot instances, scale-to-zero, and multi-region failover; tuning specific to reasoning model serving (long-context KV cache). Fault diagnosis: OOM (KV cache and activation memory diagnosis), request stalls (queue depth analysis), and slow tokenization (HF tokenizer profiling).
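The Pareto frontier analysis mentioned here reduces to filtering out dominated configurations. The sketch below treats lower TTFT, higher throughput, and lower cost as better; the four benchmark rows are invented numbers for illustration, not measurements.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a["ttft_ms"] <= b["ttft_ms"] and a["tok_s"] >= b["tok_s"]
                and a["usd_h"] <= b["usd_h"])
    better = (a["ttft_ms"] < b["ttft_ms"] or a["tok_s"] > b["tok_s"]
              or a["usd_h"] < b["usd_h"])
    return no_worse and better

def pareto_frontier(runs: list[dict]) -> list[dict]:
    """Keep only the runs that no other run dominates."""
    return [p for p in runs if not any(dominates(q, p) for q in runs)]

# Hypothetical benchmark results: TTFT (ms), throughput (tokens/s), cost ($/hour).
runs = [
    {"name": "tp2-fp16", "ttft_ms": 180, "tok_s": 2400, "usd_h": 6.0},
    {"name": "tp1-awq",  "ttft_ms": 260, "tok_s": 1500, "usd_h": 3.0},
    {"name": "tp2-awq",  "ttft_ms": 200, "tok_s": 2300, "usd_h": 6.0},
    {"name": "tp1-fp8",  "ttft_ms": 240, "tok_s": 1700, "usd_h": 3.0},
]
frontier = [r["name"] for r in pareto_frontier(runs)]
print(frontier)
```

The configurations left on the frontier are the only ones worth comparing against an SLA and a budget; everything else is worse on all axes than some alternative.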

In the capstone module, each participant designs an end-to-end vLLM serving stack for their own production scenario: target model (Llama 3.3 70B Instruct / Qwen3 32B / Gemma 3 27B / DeepSeek V3 671B MoE / their own custom CPT model), hardware target (single H100 80 GB / dual H200 / 8x H100 / 16x B200 cluster / heterogeneous disaggregated setup with B200 prefill and H100 decode), quantization stack (AWQ INT4 + FP8 KV cache, or FP8 weights + FP8 KV cache, or NVFP4), parallelism strategy (TP 8 single-node, or TP 8 × PP 2 multi-node, or TP 8 + EP for MoE), serving topology (single-node monolithic, multi-node Ray Serve, or disaggregated NVIDIA Dynamo), Kubernetes Helm chart with NVIDIA GPU operator and autoscaling (HPA + KEDA), observability stack (Prometheus + Grafana + Langfuse + Phoenix), benchmark with Pareto frontier and cost analysis, and a 90-day production deployment and scaling roadmap. By the end of the training, participants reach a level of technical competence at which they can understand vLLM's 5 main components at source-code level; construct the PagedAttention algorithm through the OS paging analogy; apply continuous batching, chunked prefill, speculative decoding, and tensor/pipeline parallelism in production; integrate the AWQ + GPTQ + FP8 + FP4 quantization stack with the Marlin and Machete kernels; add custom models and custom layers with ModelRegistry; build an NVIDIA Dynamo + Mooncake disaggregated serving architecture; deploy a Kubernetes + Helm + Ray Serve + Prometheus + Langfuse production stack; and optimize production performance with tuning parameters, benchmarks, and cost analysis. The training spans 3 days and 12 modules, with more than 100 hands-on lessons.

Training Curriculum

104 Lessons

Module 1: A Strategic Introduction to the vLLM Era: The Inference Engine Race from 2023 to 2026 (9 lessons)

Module 2: vLLM Architecture Anatomy: LLMEngine, Scheduler, Worker, and BlockManager (9 lessons)

Module 3: PagedAttention in Depth: The Kwon 2023 Algorithm (9 lessons)

Module 4: Continuous Batching: Iteration-Level Scheduling from Orca to vLLM (9 lessons)

Module 5: Speculative Decoding with vLLM: EAGLE-3, MEDUSA, and Draft Models (9 lessons)

Module 6: Tensor Parallelism, Pipeline Parallelism, and Multi-GPU/Multi-Node Serving (9 lessons)

Module 7: Quantization Integration: AWQ, GPTQ, FP8 KV Cache, and the Marlin Kernel (9 lessons)

Module 8: Adding Custom Models: Integrating a New Architecture (9 lessons)

Module 9: Disaggregated Serving: Prefill/Decode Separation and NVIDIA Dynamo (9 lessons)

Module 10: Production Deployment: Kubernetes, NVIDIA Dynamo, and Monitoring (9 lessons)

Module 11: Performance Tuning: Throughput, Latency, and Cost Optimization (9 lessons)

Module 12: Capstone: Building a Production vLLM Serving Stack (5 lessons)

Instructor

Şükrü Yusuf KAYA

AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant

Şükrü Yusuf KAYA is an AI consultant and technology strategist with international experience who leads the integration of artificial intelligence technologies into global business. Active in 6 countries, KAYA bridges the gap between the theoretical limits of technology and practical business needs, managing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. With deep technical expertise in Generative AI and Large Language Models (LLMs), he helps organizations build architectures that shape the future rather than rely on short-term fixes. His approach of turning complex algorithms and advanced systems into concrete business value aligned with corporate growth goals has made him a sought-after solution partner in the industry. Alongside his consulting and project management career, Şükrü Yusuf KAYA is also known as an instructor, guided by the motto "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for professionals ranging from technical teams to senior executives, he focuses on raising organizations' AI literacy and building a culture of sustainable technological transformation.

Categories

AI Engineering
