About this training
A 3-day advanced training, delivered in Turkish, that covers LLM quantization end to end: reducing models to 4-bit, 8-bit, FP8, and FP4 to shrink model size 4-16x and cut inference latency 2-4x. It includes GPTQ, AWQ, SmoothQuant, EXL2, GGUF/IQ-quants, NF4 BitsAndBytes, FP8/FP4 (Hopper/Blackwell), AQLM extreme 2-bit, KIVI/KVQuant KV-cache quantization, QAT, and production serving with vLLM, TensorRT-LLM, llama.cpp, and SGLang.
This training is designed for:
- ML Engineers and Inference Engineers who want to reduce production LLM inference cost 3-10x
- Senior backend developers who want to optimize reasoning-model (o3, R1, Claude Extended Thinking) serving cost via KV-cache quantization
- ML Platform engineers who want to fit open-source LLMs (Llama 3.3, Qwen3, DeepSeek V3, Gemma 3) onto a single H100 or RTX 4090
- AI Engineers who need to deploy to Apple Silicon / CPU / Edge / Mobile
- MLOps engineers who want to learn the Hopper FP8 and Blackwell FP4 native-hardware-optimization discipline
- AI Researchers active in quantization research (following AQLM, QuIP#, BitNet, KIVI)
Why this course matters:
- The only advanced program in Turkey that addresses the LLM-quantization discipline end to end with math + algorithm + production.
- Covers GPTQ, AWQ, SmoothQuant, EXL2, GGUF, AQLM, BitNet, QuIP#, HQQ, and KIVI comparatively and hands-on.
- Ties the Hopper FP8 + Blackwell FP4 native hardware advantage to the 2026 datacenter standards.
- Teaches the KV-cache quantization discipline for reasoning-model long-trace serving end to end.
- Provides quality-recovery recipes with QAT in extreme low-bit scenarios where PTQ is insufficient.
- Covers production deployment on the vLLM + TensorRT-LLM + llama.cpp + SGLang serving stacks.
- Through the capstone project, equips each participant with a quantization pipeline + cost analysis applicable to their own hardware target.
- Together with the RLHF, Reasoning Models, Mech Interp, and CPT trainings, completes a five-training frontier set covering the full arsenal of production LLM engineering.
Learning outcomes by the end of the programme:
- Select the right bit-width across the FP16 → INT8 → INT4 → FP8 → FP4 → AQLM 2-bit spectrum.
- Implement the GPTQ Hessian-approximation and AWQ scaling-factor mechanisms.
- Build W8A8 production serving with SmoothQuant outlier migration.
- Deploy to Apple Silicon / Edge / CPU with GGUF + llama.cpp.
- Skillfully use Hopper FP8 and Blackwell FP4 native Tensor Cores.
- Serve a 70B model on an RTX 4090 (24GB) with AQLM.
- Reduce reasoning-model serving cost by 50-70% with KIVI 2-bit KV cache.
- Recover quality with QAT in extreme low-bit scenarios where PTQ is insufficient.
- Skillfully manage the vLLM, TensorRT-LLM, llama.cpp, and SGLang quantized-serving stacks.
- Design a quantization pipeline that reduces production cost 3-10x and latency 2-4x.
Prerequisites and recommended background:
- Active Python experience (intermediate to advanced), basic use of PyTorch and HuggingFace Transformers
- LLM inference experience (at least conceptual familiarity with vLLM, llama.cpp, TGI, or similar)
- Foundations in linear algebra and numerical methods (matrix operations, Cholesky)
- Basic knowledge of transformer architecture (attention, MLP, residual stream)
- GPU access (RunPod, Lambda Labs, Modal); H100 (80GB) recommended for the capstone, participation possible with RTX 4090 / Apple Silicon
- A Hugging Face account + an LLM-provider (OpenAI/Anthropic/Google) API key before the training
- The only advanced program in Turkey that addresses LLM-quantization discipline end to end with math + algorithm + production stack
- Mathematical construction of GPTQ (Frantar 2022) Hessian approximation + AWQ (Lin 2023) scaling factor
- W8A8 production serving with SmoothQuant + ZeroQuant + LLM.int8 outlier handling
- Edge / CPU / Apple Silicon deployment with GGUF + llama.cpp K-quants + IQ-quants
- Hopper FP8 (E4M3/E5M2) + Blackwell NVFP4/MXFP4 native hardware advantage
- AQLM + QuIP# + BitNet b1.58 + HQQ extreme low-bit (1-2 bit) 2024-2026 frontier
- KV-cache quantization for reasoning-model long-trace serving with KIVI + KVQuant
- QAT pipeline + Gemma 3 QAT (Google 2025) recipe + PTQ + QAT hybrid approach
Advanced LLM Quantization Engineering Training (GPTQ + AWQ + EXL2 + GGUF + FP8 + FP4)
About This Course
This training covers, end to end, the quantization discipline that forms the economic foundation of modern LLM inference: math, algorithms, and production stack. As of 2026, a 70B-parameter LLM in FP16 does not fit on a single H100 (140GB > 80GB). With 4-bit quantization the weights shrink to about 35GB, and with 2-3 bit methods the same model can run on a single RTX 4090 (24GB) at about 10x lower cost. This difference has made quantization one of the priorities of production AI engineering. In Turkey, training that covers the discipline end to end is virtually nonexistent: from Frantar's GPTQ derivation to the mathematical construction of Lin's AWQ scaling factor, from the SmoothQuant outlier-migration formulation to AQLM additive codebooks, from Hopper FP8 Tensor Cores to Blackwell B200 NVFP4 / MXFP4, and from KIVI 2-bit KV cache to reasoning-model long-trace serving. Existing content either stays at shallow tool tutorials or stops at academic-paper summaries. This program is designed to fill that gap as Turkey's most comprehensive production-grade LLM quantization reference training.
The program's strategic backbone is the first module, which clarifies the cost-quality-throughput trade-off across the quantization spectrum (FP32 → BF16/FP16 → FP8 → INT8 → NF4/INT4 → FP4 → AQLM 1-2 bit). The weights of a 70B model take 140GB in FP16, 70GB in INT8, about 35GB in any 4-bit format (INT4, NF4, NVFP4, before scale-factor overhead), and about 17.5GB at 2 bits per weight (AQLM); the difference buys not only memory but also a 2-8x throughput gain. The FP8 (E4M3 + E5M2) native Tensor Cores of Hopper H100/H200 and the NVFP4 + MXFP4 support in the Transformer Engine v2 of Blackwell B200/GB200 form the hardware foundation of the 2024-2026 industry transformation; AMD MI325X/MI355X FP8/FP4, Intel Gaudi 3, and Google TPU v6 (Trillium) and v7 have joined this race as well. The module closes with a decision framework for production cost optimization: the $0.30/M vs $3/M output-token comparison, the quality-regression budget (is a 0.5% MMLU loss tolerable?), and evidence-based answers to which bit-width is the right choice for which scenario.
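The footprint arithmetic behind the spectrum can be sketched in a few lines. This is a weights-only estimate under the assumption of a nominal number of bits per weight; the helper name `weight_memory_gb` is hypothetical, and real formats add some overhead for scales, zero-points, or codebooks.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes).

    Ignores scales, zero-points, codebooks, activations and the KV cache.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Nominal bits per weight (assumption: no per-group metadata overhead).
NOMINAL_BITS = {"FP16/BF16": 16, "FP8/INT8": 8, "4-bit (INT4/NF4/FP4)": 4, "2-bit": 2}

if __name__ == "__main__":
    for name, bits in NOMINAL_BITS.items():
        print(f"{name:>22}: {weight_memory_gb(70e9, bits):6.1f} GB")
```

Any 4-bit format lands at the same nominal size; the formats differ in how scales are stored and in hardware support, not in the basic arithmetic.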
The second module covers the mathematical foundations of quantization: the linear-quantization formula q = round((x - z) / s) and its dequantization x' = s × q + z; the symmetric (zero-point = 0) vs asymmetric (zero-point ≠ 0) trade-off; min-max calibration vs percentile clipping (P99.9); the granularity-selection matrix, from per-tensor (coarsest) through per-channel to per-group (g=128, finest); and outlier handling (SmoothQuant migration, MX-format logic). On the format side, the module clarifies the distinctions between NF4 (NormalFloat 4-bit, Dettmers 2023: an information-theoretically optimal 4-bit data type under the assumption that weights are Gaussian-distributed), FP8 E4M3 (more precise, used for the forward pass) vs E5M2 (wider range, used for backward gradients), MXFP4 (OCP Microscaling 4-bit), and NVFP4 (Blackwell-native, NVIDIA's own block-scaled variant). Without this foundation, modern quantization algorithms (GPTQ, AWQ, SmoothQuant) cannot be understood.
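A minimal sketch of asymmetric linear quantization with min-max calibration, written to match the formulas q = round((x - z) / s) and x' = s × q + z, where z is a real-valued offset. The function names are illustrative, not taken from any library.

```python
def minmax_params(values, bits=8):
    """Min-max calibration: map [min, max] onto the integer grid 0 .. 2^bits - 1."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero = lo  # real-valued offset, as in q = round((x - z) / s)
    return scale, zero


def quantize(values, scale, zero, bits=8):
    """q = round((x - z) / s), clamped to the representable integer range."""
    qmax = 2 ** bits - 1
    return [min(qmax, max(0, round((x - zero) / scale))) for x in values]


def dequantize(codes, scale, zero):
    """x' = s * q + z."""
    return [scale * q + zero for q in codes]


if __name__ == "__main__":
    data = [-1.0, -0.25, 0.0, 0.5, 3.0]
    s, z = minmax_params(data, bits=4)
    codes = quantize(data, s, z, bits=4)
    print(codes, dequantize(codes, s, z))
```

For values inside the calibrated range, the round-trip error is bounded by s / 2, which is why finer granularity (a smaller range per scale) reduces error.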
The third module covers PTQ (post-training quantization), the dominant production approach: calibration-dataset selection (typically 128-512 samples are sufficient, drawn from C4, Wikitext, Pile, or a FineWeb sample; the Turkish FineWeb subset for Turkish-domain calibration), forward-pass tracking, activation-statistics collection, and outlier detection. The emergent magnitude-outlier channel phenomenon reported for 6.7B+ models in the Dettmers 2022 LLM.int8() paper is analyzed in detail. These outlier channels (0.1-1% of all channels) carry the dominant share of model quality; the mixed-precision decision matrix (outlier channels kept in FP16, the rest in INT8) is the foundation of this discipline. The module emphasizes the gap between naive round-to-nearest quantization, which yields 5-15% MMLU loss at 4-bit, and modern GPTQ/AWQ, where the loss falls to 0.3-1%. Tool stack: AutoGPTQ, AutoAWQ, llama.cpp, Hugging Face Optimum, NVIDIA Model Optimizer.
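The outlier-channel idea can be illustrated with a small detector. This is a sketch, not the LLM.int8() implementation: it flags a channel when any calibration activation reaches a magnitude threshold (6.0 is the default threshold reported for LLM.int8(); the function names and the toy data are assumptions).

```python
def outlier_channels(activations, threshold=6.0):
    """Return indices of channels (columns) with any |activation| >= threshold.

    `activations` is a list of token rows, each a list of per-channel values.
    """
    n_channels = len(activations[0])
    return [
        j for j in range(n_channels)
        if any(abs(row[j]) >= threshold for row in activations)
    ]


def precision_plan(activations, threshold=6.0):
    """Map each channel to the precision it would get in a mixed-precision split."""
    hot = set(outlier_channels(activations, threshold))
    return ["fp16" if j in hot else "int8" for j in range(len(activations[0]))]


if __name__ == "__main__":
    calib = [[0.1, 8.0, -0.3], [0.2, -7.5, 0.4]]
    print(outlier_channels(calib))  # channel 1 is the outlier
    print(precision_plan(calib))
```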
The fourth module builds GPTQ (Frantar 2022, ICLR 2023), the first widely adopted modern LLM PTQ algorithm, from its mathematics. Every stage is derived: the lineage from Optimal Brain Surgeon (Hassibi 1993) to Optimal Brain Quantization (Frantar 2022), the Hessian matrix approximation (H ≈ 2 X^T X), layer-by-layer one-shot quantization, error compensation (distributing each quantized weight's error to the remaining weights), inverse-Hessian computation via Cholesky decomposition, block-wise quantization and group size (g=128, g=64), and the effect of the act-order (desc_act) parameter. On the production side, the 4-bit GPTQ pipeline for Llama 3.3 70B, Qwen3 32B, and DeepSeek V3 671B (MoE) is done hands-on with AutoGPTQ + GPTQModel; the module also covers the 2-3x boost in GPTQ inference speed with ExLlamaV2 kernels and the vLLM + Marlin kernel + GPTQ serving integration.
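The error-compensation step can be sketched for a single weight row. This is a simplified, fixed-order illustration under stated assumptions: it uses an explicit inverse of H = 2 X^T X and eliminates one coordinate at a time, instead of the Cholesky-based, blocked, grouped procedure used in practice, and it assumes X has full column rank (no damping). All names are hypothetical.

```python
def invert(m):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        pivot = a[c][c]
        a[c] = [v / pivot for v in a[c]]
        for r in range(n):
            if r != c:
                f = a[r][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return [row[n:] for row in a]


def rtn(w, step):
    """Round-to-nearest baseline on a uniform grid with spacing `step`."""
    return [step * round(v / step) for v in w]


def gptq_row(w, X, step):
    """Quantize one weight row left to right, pushing each error onto later weights."""
    d = len(w)
    H = [[2 * sum(x[i] * x[j] for x in X) for j in range(d)] for i in range(d)]
    Hinv = invert(H)
    w, q = list(w), [0.0] * d
    for i in range(d):
        q[i] = step * round(w[i] / step)
        hii = Hinv[i][i]
        err = (w[i] - q[i]) / hii
        for j in range(i + 1, d):          # compensate the not-yet-quantized weights
            w[j] -= err * Hinv[i][j]
        col, row = [Hinv[r][i] for r in range(d)], Hinv[i][:]
        for r in range(d):                 # remove coordinate i from the inverse Hessian
            for c in range(d):
                Hinv[r][c] -= col[r] * row[c] / hii
    return q


def output_error(w, q, X):
    """Squared layer-output error: sum over samples of ((w - q) . x)^2."""
    return sum(sum((wi - qi) * xi for wi, qi, xi in zip(w, q, x)) ** 2 for x in X)


if __name__ == "__main__":
    X, w = [[1.0, 1.0], [0.0, 1.0]], [0.4, 0.4]
    print(rtn(w, 1.0), gptq_row(w, X, 1.0))
```

In this toy case round-to-nearest sends both weights to 0, while the compensated pass moves the second weight to 1 and halves the output error; with uncorrelated inputs (a diagonal H) the two methods coincide.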
The fifth module analyzes in detail the AWQ algorithm from MIT Han Lab (Lin 2023). AWQ's key insight is that about 1% of weights are salient, carry the dominant share of the model's quality, and are identified by activation magnitude rather than weight magnitude. With a per-channel scaling factor, salient channels are scaled up, quantized, and scaled back down through the activations; this minimizes quantization error on the salient channels. The optimal scale exponent α is found via grid search (a calibration set of 128-256 samples is sufficient). Compared with GPTQ, AWQ is simpler (no Hessian computation), faster (Llama 3.3 70B in 10-30 minutes), and of similar or better quality (especially in reasoning and instruction following). Production: the 4-bit AWQ pipeline for Llama 3.3, Qwen3, and DeepSeek V3 is done hands-on with the AutoAWQ + vLLM + Marlin + TensorRT-LLM stack.
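The scale search can be sketched on a toy layer. This is an illustration under assumptions, not the AutoAWQ implementation: scales are taken as s_j = (mean |x_j|)^α, weights are quantized per output row with a symmetric grid, and the inverse scale is folded back into the weights (numerically equivalent to dividing the activations by s). All names are hypothetical.

```python
def quantize_row(row, bits=4):
    """Symmetric per-row round-to-nearest quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in row)
    step = peak / qmax if peak > 0 else 1.0
    return [step * max(-qmax, min(qmax, round(v / step))) for v in row]


def layer_output(W, x):
    """y = W x for W given as a list of output rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def awq_search(W, X, bits=4, grid=11):
    """Grid-search alpha in [0, 1]; return (best_alpha, {alpha: output error})."""
    n_in = len(W[0])
    mean_abs = [max(sum(abs(x[j]) for x in X) / len(X), 1e-8) for j in range(n_in)]
    reference = [layer_output(W, x) for x in X]
    errors = {}
    for k in range(grid):
        alpha = k / (grid - 1)
        s = [m ** alpha for m in mean_abs]           # alpha = 0 gives s = 1 (plain RTN)
        scaled = [[w * sj for w, sj in zip(row, s)] for row in W]
        w_hat = [[q / sj for q, sj in zip(quantize_row(row, bits), s)] for row in scaled]
        errors[alpha] = sum(
            (a - b) ** 2
            for x, ref in zip(X, reference)
            for a, b in zip(layer_output(w_hat, x), ref)
        )
    return min(errors, key=errors.get), errors


if __name__ == "__main__":
    W = [[0.9, -0.05, 0.3], [-0.4, 0.02, 0.7]]
    X = [[0.1, 20.0, -0.2], [-0.3, 15.0, 0.1]]   # channel 1 has large activations
    best, errs = awq_search(W, X)
    print(best, errs[best], errs[0.0])
```

Because α = 0 is part of the grid, the selected scale can never do worse on the calibration data than plain round-to-nearest.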
The sixth module covers the W8A8 discipline of quantizing activations as well as weights. SmoothQuant (Xiao 2022) migrates outliers from activations to weights via the identity Y = (X · diag(s)^-1) · (diag(s) · W) to ease activation quantization, with the migration strength tuned by the α parameter (0.5-0.85). ZeroQuant (Yao 2022) adds token-wise dynamic quantization. LLM.int8() is a hybrid approach combining 8-bit quantization with outlier handling. W8A8 serving yields a 2-4x throughput increase over FP16; the gain matters most under high concurrency rather than at batch size 1. Production: W8A8 serving with vLLM + LLM Compressor (SmoothQuant), TensorRT-LLM INT8 serving (H100 Tensor Cores), and W4A8 mixed precision (4-bit weights + 8-bit activations) are covered in detail.
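A small numerical check of the migration identity Y = (X · diag(s)^-1) · (diag(s) · W). The sketch assumes the usual per-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1-α) and nonzero channel maxima; the function names and toy matrices are illustrative.

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel scale from activation and weight maxima."""
    n = len(W)
    x_max = [max(abs(row[j]) for row in X) for j in range(n)]
    w_max = [max(abs(v) for v in W[j]) for j in range(n)]
    return [xm ** alpha / wm ** (1 - alpha) for xm, wm in zip(x_max, w_max)]


def smooth(X, W, alpha=0.5):
    """Return (X diag(s)^-1, diag(s) W); X is tokens x channels, W is channels x outputs."""
    s = smooth_scales(X, W, alpha)
    X_s = [[v / sj for v, sj in zip(row, s)] for row in X]
    W_s = [[v * s[j] for v in W[j]] for j in range(len(W))]
    return X_s, W_s


if __name__ == "__main__":
    X = [[1.0, 100.0], [-2.0, 50.0]]   # channel 1 carries an activation outlier
    W = [[0.5, -1.0], [0.25, 1.0]]
    X_s, W_s = smooth(X, W)
    print(matmul(X, W), matmul(X_s, W_s))
    print(max(abs(r[1]) for r in X), max(abs(r[1]) for r in X_s))
```

The product is unchanged, while the outlier channel's activation range drops from 100 to 10 and the corresponding weight row grows by the same factor, which is the trade that α controls.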
The seventh module covers the GGUF format and the K-quants + IQ-quants family of the llama.cpp ecosystem, which is especially critical for Apple Silicon and CPU deployment. Since Georgi Gerganov open-sourced llama.cpp in 2023, the project has passed 70K GitHub stars and, as of 2026, has become the de facto standard for edge LLM serving. Topics include the GGUF format structure (header + metadata + tensor data); the K-quants family (Q4_K_M as the most popular quality/size balance, Q5_K_M for quality priority, Q6_K, and Q8_0 for maximum quality) with its mixed-precision super-block + sub-block structure; the IQ-quants for extreme low-bit (IQ1_S at about 1.6 bits, IQ2_XXS, IQ3_S, all codebook + importance-matrix based); and smart bit allocation with imatrix. The recipe for fitting a 70B model into 24GB VRAM (RTX 4090) or 36GB RAM (Apple Silicon M3 Max) is shown in practice. Mobile (LiteRT, MediaPipe) GGUF deployment, AMD Ryzen AI 9 NPU, and Intel Xeon AMX optimization are also covered.
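To make the "header + metadata + tensor data" structure concrete, here is a minimal sketch that reads only the fixed-size GGUF header. It assumes the version 2/3 layout (little-endian: 4-byte magic `GGUF`, uint32 version, uint64 tensor count, uint64 metadata key-value count) and does not parse the metadata or tensor sections; the function name is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor_count, metadata_kv_count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed header from the first bytes of a GGUF file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    # Synthetic header for illustration; a real file would be read with open(path, "rb").
    fake = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
    print(read_gguf_header(fake))
```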
The eighth module covers in detail the native low-precision floating-point support of modern GPU architectures: the FP8 (E4M3 forward, E5M2 backward) native Tensor Cores of Hopper H100/H200, and the NVFP4 (block scale + sub-block scale) + MXFP4 (OCP Microscaling) support in the Transformer Engine v2 of Blackwell B200/GB200, which together form NVIDIA's 2026 datacenter standards. DeepSeek V3's FP8 training recipe over 14.8 trillion tokens (scale-factor management, loss scaling, 30-40% cost saving vs BF16) is analyzed. The 3-5x FP4 inference throughput increase on Blackwell B200/GB200, the FP4 model-export pipeline with TensorRT Model Optimizer, and the Hugging Face Optimum + NVIDIA TransformerEngine integration are shown in practice. The module closes with a quantization comparison across AMD MI325X/MI355X FP8/FP4, Intel Gaudi 3, and Google TPU v6/v7.
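The precision-versus-range trade-off between E4M3 and E5M2 follows directly from the bit layout. The decoder below is a sketch that handles normal and subnormal values only; it deliberately ignores the NaN/Inf encodings, which differ between the two formats. Biases of 7 (E4M3) and 15 (E5M2) are assumed.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode one FP8 byte (sign | exponent | mantissa); NaN/Inf codes are not handled."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * frac * 2.0 ** (1 - bias)
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


def e4m3(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7)


def e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15)


if __name__ == "__main__":
    print("E4M3 max finite:", e4m3(0x7E))      # 0 1111 110
    print("E5M2 max finite:", e5m2(0x7B))      # 0 11110 11
    print("E4M3 min subnormal:", e4m3(0x01))
    print("E5M2 min subnormal:", e5m2(0x01))
```

E4M3 spends one more bit on the mantissa, so it resolves values more finely but tops out at 448, while E5M2 reaches 57344 with coarser steps.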
The ninth module is dedicated to the frontier extreme-quantization discipline of 2024-2026: AQLM (Egiazarian 2024: Additive Quantization for Language Models, a codebook and vector-quantization based 2-bit method; AQLM 2-bit accuracy surpasses GPTQ 4-bit); QuIP# (Tseng 2024: Quantization with Incoherence Processing, E8 lattice + incoherence rotation); BitNet b1.58 (Microsoft 2024: {-1, 0, 1} ternary weights obtained through native pre-training rather than post-training); and HQQ (Badri 2024: Half-Quadratic Quantization, a calibration-free fast PTQ alternative). The recipe for shrinking a 70B model to 13GB to serve on an RTX 4090 (24GB) is shown in practice. PTQ vs native low-bit pre-training (the BitNet approach) is addressed as the 2026 paradigm shift.
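The additive-codebook idea can be sketched as follows: each group of weights is stored as one index per codebook and reconstructed as the sum of the selected codebook vectors. This is a decoding-only illustration with toy codebooks; learning the codebooks and codes is the hard part and is not shown. The 1x16 configuration (one 16-bit codebook over groups of 8 weights) is used as the bits-per-weight example.

```python
def bits_per_weight(num_codebooks: int, codebook_bits: int, group_size: int) -> float:
    """Index storage per weight, ignoring codebook storage and per-channel scales."""
    return num_codebooks * codebook_bits / group_size


def decode_group(codes, codebooks):
    """Reconstruct one weight group as the sum of one vector from each codebook."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for code, book in zip(codes, codebooks):
        for i, v in enumerate(book[code]):
            out[i] += v
    return out


if __name__ == "__main__":
    # Two toy codebooks with two entries each, over groups of two weights.
    books = [
        [[0.0, 0.0], [1.0, -1.0]],
        [[0.5, 0.25], [9.0, 9.0]],
    ]
    print(decode_group([1, 0], books))
    print(bits_per_weight(1, 16, 8), bits_per_weight(2, 8, 8))
```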
The tenth module focuses on KV-cache quantization, a critical topic for the long thinking traces of modern reasoning models (o3/o4, DeepSeek R1, Claude Extended Thinking, Qwen3). The KV-cache size formula is 2 × layers × KV heads × head dim × context length × bytes per element. For a Llama-style 70B model with grouped-query attention (80 layers, 8 KV heads, head dim 128), a 32K context takes about 10.7GB per sequence in FP16, so three concurrent sequences already need about 32GB. The 16K-128K thinking traces of reasoning models make this memory explode. The module covers in detail the vLLM FP8 KV cache (2x memory savings with minimal quality loss), TensorRT-LLM FP8 KV cache serving, KIVI (Liu 2024: 2-bit KV cache with per-channel/per-token scaling), KVQuant (Hooper 2024: outlier-aware non-uniform quantization), CacheGen, and combinations of prefix caching with KV quantization (reasoning-trace reuse). This discipline can reduce reasoning-model serving cost by 50-70%.
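The size formula is easy to evaluate for a concrete configuration. The sketch below assumes a Llama-style 70B layout (80 layers, head dim 128) and compares grouped-query attention with 8 KV heads against full multi-head attention with 64; these configuration values are assumptions for illustration, and quantization scales are ignored.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem, batch=1):
    """2 (keys and values) x layers x KV heads x head dim x tokens x bytes per element."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem * batch


def gib(n_bytes):
    """Convert bytes to GiB (2^30 bytes)."""
    return n_bytes / 2 ** 30


if __name__ == "__main__":
    ctx = 32 * 1024
    for label, heads in (("GQA, 8 KV heads", 8), ("MHA, 64 heads", 64)):
        for dtype, width in (("FP16", 2), ("FP8", 1), ("2-bit", 0.25)):
            size = kv_cache_bytes(80, heads, 128, ctx, width)
            print(f"{label:>16} {dtype:>5}: {gib(size):6.2f} GiB per sequence")
```

The cache grows linearly with context length and with the number of concurrent sequences, which is why long reasoning traces at high concurrency are dominated by KV memory rather than weights.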
The eleventh module covers QAT (Quantization-Aware Training), which steps in where PTQ is insufficient (extreme low-bit, severe quality regression). Topics include fake quantization (quantize-dequantize in the forward pass), the STE (Straight-Through Estimator) backward gradient, learnable scale + zero-point (LSQ, Esser 2020), QLoRA-aware fine-tuning (4-bit base + LoRA + QAT), and the Gemma 3 QAT (Google 2025) production recipe, which produced INT4 models with quality close to BF16 (1.5% MMLU loss). The hybrid PTQ + QAT recipe (start with PTQ, recover the loss with QAT) is shown in practice, along with the end-to-end QAT pipeline using Hugging Face Optimum + NVIDIA Model Optimizer.
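Fake quantization and the straight-through estimator can be sketched for a single scalar weight. This is a framework-free illustration with a fixed scale and a signed 4-bit range; in a real QAT pipeline the same logic lives inside an autograd function and the scale may be learned. The function names are hypothetical.

```python
def fake_quant(x, scale, qmin=-8, qmax=7):
    """Forward pass: quantize then dequantize, so the loss sees the rounding error."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


def ste_grad(upstream, x, scale, qmin=-8, qmax=7):
    """Backward pass: treat round() as identity; block the gradient where x was clipped."""
    inside = qmin <= x / scale <= qmax
    return upstream if inside else 0.0


def qat_step(w, target, scale, lr=0.1):
    """One SGD step on the loss (fake_quant(w) - target)^2 using the STE gradient."""
    w_q = fake_quant(w, scale)
    upstream = 2.0 * (w_q - target)          # d loss / d w_q
    return w - lr * ste_grad(upstream, w, scale)


if __name__ == "__main__":
    w = 0.26
    for _ in range(5):
        w = qat_step(w, target=0.5, scale=0.1)
        print(round(w, 4), round(fake_quant(w, 0.1), 4))
```

The latent weight keeps its full precision and moves under the STE gradient, while the forward pass always uses its quantized value.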
In the capstone module, each participant designs an end-to-end quantization pipeline tailored to their own production scenario: model selection (Llama 3.3 70B, Qwen3 32B, DeepSeek V3, Gemma 3, Mistral, or their own CPT model), hardware target (RTX 4090 24GB, H100 80GB, B200 192GB, Apple Silicon, AMD MI325X, Intel Xeon CPU), bit-width strategy (4-bit weight + 8-bit activation + FP8 KV cache; or AQLM 2-bit + FP8 KV; or GGUF Q4_K_M on Apple Silicon), evidence-based algorithm selection (GPTQ vs AWQ vs SmoothQuant vs AQLM), serving stack (vLLM, TensorRT-LLM, llama.cpp, SGLang), accuracy-validation framework (MMLU + HumanEval + Turkish MMLU + domain-benchmark regression), cost analysis (hourly GPU cost + token throughput + $/M token), and a 90-day production deployment roadmap.
By the end of the training, participants can analyze the quantization spectrum (FP16 → INT8 → INT4 → FP8 → FP4 → AQLM 2-bit) in terms of compute economics; implement the GPTQ Hessian-approximation and AWQ scaling-factor mechanisms; apply modern techniques such as SmoothQuant outlier migration and KIVI 2-bit KV cache; leverage the Hopper FP8 and Blackwell FP4 native hardware advantages; deploy to edge / CPU / Apple Silicon with GGUF + llama.cpp; evaluate extreme low-bit approaches such as AQLM, QuIP#, and BitNet; recover loss with QAT where PTQ is insufficient; and run quantized production serving on the vLLM / TensorRT-LLM / llama.cpp / SGLang stacks. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.
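The cost-analysis step of the capstone (hourly GPU cost, token throughput, $/M token) reduces to a short calculation. The sketch below uses hypothetical prices and throughputs purely for illustration; real numbers must come from measurements on the participant's own hardware and workload.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def compare(baseline, candidate):
    """Return the cost ratio baseline / candidate for two (hourly_usd, tokens_per_s) pairs."""
    return cost_per_million_tokens(*baseline) / cost_per_million_tokens(*candidate)


if __name__ == "__main__":
    # Hypothetical scenarios: (GPU cost in USD per hour, aggregate tokens per second).
    fp16_two_gpus = (2 * 3.0, 1500.0)
    int4_one_gpu = (3.0, 2500.0)
    print(f"FP16: ${cost_per_million_tokens(*fp16_two_gpus):.2f} per M tokens")
    print(f"INT4: ${cost_per_million_tokens(*int4_one_gpu):.2f} per M tokens")
    print(f"ratio: {compare(fp16_two_gpus, int4_one_gpu):.2f}x")
```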
Course Curriculum
104 Lessons
Instructor

Şükrü Yusuf KAYA
AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant
Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.
Related programs
Professional Software Development with Claude Code Training (4 days, advanced)
A comprehensive, advanced 4-day training program for software professionals seeking enterprise-level mastery of Anthropic's agentic coding platform, Claude Code. Production-grade agent architecture with MCP integrations, Hooks, Sub-agents, Skills, and the Claude Agent SDK.
LLM Alignment Engineering with RLHF, DPO, and GRPO Training (3 days, advanced)
A 3-day advanced Turkish LLM alignment training that covers the RLHF (PPO), DPO, KTO, IPO, SimPO, ORPO, and DeepSeek R1 GRPO algorithms at both math and code level; and teaches reward modeling, Constitutional AI, RLAIF, reasoning-model alignment, and the TRL/Axolotl/LLaMA-Factory/OpenRLHF/verl toolchain at production grade.
Building AI Agents with the Claude Agent SDK Training (4 days, advanced)
A comprehensive, advanced 4-day program for software engineers who want to develop production-grade AI agents with Anthropic's Claude Agent SDK. Tool-use orchestration, MCP server development, multi-agent patterns, prompt caching, and evaluation engineering.