Frequently Asked Questions

Should I choose GPTQ or AWQ in production? What's the real difference?

As of 2026, the general recommendation is AWQ: calibration is faster (10-30 min vs GPTQ's 1-3 hours), the method is simpler (no Hessian computation), and quality is similar or better. GPTQ's advantages: with act-order enabled (desc_act=true) it can be 0.5-1% better on reasoning-heavy tasks, and it has INT3 and INT2 variants, whereas mainstream AWQ tooling targets INT4 only. Practical rule: speed + simplicity → AWQ; reasoning model + maximum quality → GPTQ (desc_act=true). In production, the safest path is to quantize the same model with both algorithms and compare them on your own benchmarks. Modules 4 and 5 cover both with concrete comparison tables.

How can I serve Llama 3.3 70B on an RTX 4090 (24GB)?

Three paths: (1) AWQ INT4 → about 35GB, which does not fit in 24GB; dual RTX 4090s (48GB total) are needed. (2) AQLM 2-bit → about 17.5GB, which fits, but AQLM calibration takes 24-48 hours. (3) GGUF Q2_K (~25GB) with llama.cpp on an Apple Silicon M3 Max (36GB unified memory) is a workable alternative. Practical recommendation: on a single RTX 4090, run a smaller model, such as an 8B model (about 16GB in FP16, roughly 5GB in AWQ INT4) or a 32B model (about 16GB at 4-bit); for 70B, use dual 4090s (AWQ), a single H100 (AWQ), or an Apple Silicon machine with enough unified memory for GGUF Q4 (roughly 40GB of weights). These scenarios are shown practically in Module 9 and the Module 12 capstone.

When are FP8 and FP4 better than INT8 / INT4? Is hardware mandatory?

Native hardware support is mandatory. FP8: Hopper H100/H200 and newer (native Tensor Cores). FP4 (NVFP4/MXFP4): Blackwell B200/GB200 (late 2024+). On Ampere A100, FP8 runs slowly via software emulation; FP4 is impractical without hardware support. On quality: FP8 (E4M3) is marginally better than INT8 (~0.2-0.5% MMLU); FP4 (NVFP4) is better than INT4 (~0.5-1% MMLU). If you have Hopper / Blackwell GPUs, FP8/FP4 should be the default choice: they run natively in hardware and deliver 2-3x higher throughput. Module 8 covers this in detail.

How much quality loss from quantization? Is 0.5-1% MMLU loss acceptable?

It depends on the scenario. General chat / customer service / RAG: 1-2% MMLU loss is tolerable; the cost / latency gain is usually worth it. Reasoning models (o3, R1, Claude Extended Thinking): even a 1% loss on math/code tasks may be too much, so extra caution is needed. Code generation (Copilot-style): a 1% loss on HumanEval can translate into a clearly visible difference in practice, so rigorous validation is needed. Setting the quality threshold: it is critical to measure real user impact via production benchmarks + A/B testing. Modules 11 and 12 cover quality-threshold setting + an accuracy-validation framework in detail.

Can I really reduce reasoning-model serving cost 70% with KV-cache quantization?

Yes. With reasoning models' 32K-128K thinking traces, the KV cache dominates GPU memory. A 70B model + 32K context = 32GB KV cache (FP16); vLLM FP8 KV cache → 16GB (2x savings); KIVI 2-bit → 4GB (8x savings). When the KV cache is the limiting factor, this lets the same GPU serve up to 4-8x more concurrent requests, which raises throughput and lowers $/token cost accordingly. In Anthropic and DeepSeek production experience, a 50-70% reduction in reasoning-model serving cost is possible with this approach. Module 10 covers KIVI + KVQuant + vLLM FP8 KV cache implementation in detail.

Is AQLM 2-bit really better than GPTQ 4-bit?

Yes — thanks to AQLM (Egiazarian 2024) codebook-based vector quantization, 2-bit can yield better MMLU + HumanEval scores than GPTQ INT4 (on Llama 3 70B, +0.5-1.5 points). Trade-off: AQLM calibration takes 24-48 hours (vs GPTQ's 1-3 hours); compute cost is high. Practical: research + extreme memory-constraint scenarios → AQLM; standard production + fast iteration → AWQ INT4 or GPTQ INT4. Module 9 presents the Pareto frontier of the AQLM + QuIP# + BitNet family.

Is BitNet b1.58 (1.58-bit pre-training) real or research hype?

Real — Microsoft Research's 2024 paper showed LLaMA 70B-level quality + 10x lower inference cost with {-1, 0, 1} ternary native pre-training. In 2025-2026, BitNet b1.58 + extensions (T-MAC, BitMoE) are moving toward production. Main challenge: from-scratch pre-training compute (not classic post-training quantization); a significant investment is needed. Practical use: research labs + large-scale pre-training pipelines; still early for medium-scale production teams. PTQ vs native low-bit pre-training is the 2026 paradigm shift — Module 9.2 covers in detail.
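To make the ternary idea concrete, here is a minimal sketch of absmean ternary rounding in the style described for BitNet b1.58: scale by the mean absolute weight, round, and clip to {-1, 0, 1}. The function name and the toy weights are illustrative assumptions. Real BitNet models learn their weights under this constraint during pre-training rather than applying it after the fact.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map weights to {-1, 0, 1} using an absmean scale (illustrative helper)."""
    gamma = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    q = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return q, gamma


toy_weights = [0.8, -0.05, 0.3, -0.6, 0.02]  # hypothetical values
ternary, gamma = ternary_quantize(toy_weights)
print(ternary, round(gamma, 3))

# Three states per weight carry log2(3) bits of information, hence "1.58-bit".
print(round(math.log2(3), 2))
```

Small weights collapse to 0 and large ones to ±1, so a matrix multiply reduces to additions and subtractions plus one scale factor per tensor.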

When is QAT (Quantization-Aware Training) better than PTQ?

QAT is clearly superior in two scenarios: (1) Extreme low-bit (INT3, INT2): PTQ has 5-10% MMLU loss, and QAT reduces it to 1-2%. (2) Severe quality regression (3%+ MMLU loss with PTQ): QAT can recover most of it. Gemma 3 QAT (Google 2025) achieved only 1.5% MMLU loss at INT4. Trade-off: QAT compute cost (1.5-2x full FT), plus training time and data needs. Practical: start with PTQ; if the quality threshold is not met, recover with a hybrid PTQ + QAT approach. Module 11 covers this in detail.

Is Apple Silicon (M3 Max, M4 Max) sufficient for production LLM serving?

Yes in certain scenarios, no in others. Sufficient for: single user / few concurrent (5-20 req/sec), 7B-70B GGUF Q4_K_M, 100-300ms tolerable latency, batch size 1-4. Insufficient for: high-throughput multi-tenant SaaS (1000+ concurrent), reasoning-model long-trace (32K+ context KV-cache memory is insufficient), production scaling (multi-GPU cluster required). Practical: prototype + small business + on-device privacy-aware deployment → Apple Silicon ideal; enterprise scale → H100/B200 cluster. M4 Max with 128GB RAM can run 70B GGUF Q6_K. Module 7.3 covers in detail.

What concrete artifacts will I have at the end of the training?

The following artifacts are produced in the capstone project: (1) a quantization pipeline tailored to your production scenario (Python codebase + YAML config); (2) quantized model checkpoint(s) (GPTQ, AWQ, GGUF, or AQLM); (3) a vLLM / TensorRT-LLM / llama.cpp serving template; (4) a KV-cache quantization config (for reasoning models); (5) an accuracy-validation report (MMLU + HumanEval + domain-benchmark regression); (6) a cost analysis (hourly GPU + $/M token + alternative hardware comparison); (7) a quality threshold + A/B-test framework template; (8) a 90-day production deployment roadmap.

What can I do with the RLHF + Reasoning + Mech Interp + CPT + Quantization five-set?

These five trainings complete the full arsenal of production-grade LLM engineering in Turkey: with CPT you can adapt the model's knowledge base (Turkish or domain), with RLHF/DPO/GRPO align behavior, with Reasoning Models solve complex problems, with Mech Interp audit internal behavior (safety + audit), and with Quantization reduce production cost 3-10x. Together: an enterprise AI team can build a fully independent production-grade LLM product — base-model selection → CPT → SFT/DPO/GRPO → quantization → vLLM serving → audit + steering with mech interp. As of 2026, no other coverage in Turkey provides exactly this competence.

Can the training be customized for our enterprise team?

Yes. Beyond the standard 3-day program, we offer customized private-classroom versions for enterprise clients. Module weights and capstone scenarios are tailored to your team's existing LLM stack (Llama / Qwen / DeepSeek / Claude API / your own CPT model), hardware infrastructure (H100/H200/B200 cluster, AMD MI325X, Apple Silicon, Intel Xeon CPU), serving stack (vLLM, TensorRT-LLM, llama.cpp, SGLang), reasoning-model usage, target latency + throughput SLA, and cost-optimization goal.

About this training

A 3-day advanced Turkish training that covers end to end the discipline of reducing LLMs to 4-bit / 8-bit / FP8 / FP4 — shrinking model size 4-16x + reducing inference latency 2-4x. Includes GPTQ, AWQ, SmoothQuant, EXL2, GGUF/IQ-quants, NF4 BitsAndBytes, FP8/FP4 (Hopper/Blackwell), AQLM extreme 2-bit, KIVI/KVQuant KV-cache quantization, QAT, and vLLM/TensorRT-LLM/llama.cpp/SGLang production serving.




Advanced Level | 3 Days

Advanced LLM Quantization Engineering Training (GPTQ + AWQ + EXL2 + GGUF + FP8 + FP4)


About This Course

This training is designed to address end to end, with math + algorithms + production stack, the quantization discipline that forms the economic foundation of modern LLM inference. As of 2026, serving a 70B-parameter LLM in FP16 does not fit even on a single H100 (140GB > 80GB); in contrast, 4-bit quantization brings the same model down to about 35GB, and extreme 2-bit quantization lets it run on a single RTX 4090 (24GB) at roughly 10x lower cost. This dramatic difference has made quantization one of the priorities of production AI engineering. In Turkey, a training that addresses this discipline end to end (from Frantar's GPTQ derivation to the mathematical construction of Lin's AWQ scaling factor, from the SmoothQuant outlier-migration formulation to AQLM additive codebooks, from Hopper FP8 Tensor Cores to Blackwell B200 NVFP4 / MXFP4, from KIVI 2-bit KV cache to reasoning-model long-trace serving) is virtually nonexistent; existing content either stays at shallow tool tutorials or stops at academic-paper summaries. This program is designed to fill that gap as Turkey's most comprehensive production-grade LLM quantization reference training.

The program's strategic backbone is the first module, which clarifies the cost-quality-throughput trade-off across the quantization spectrum (FP32 → BF16/FP16 → FP8 → INT8 → NF4/INT4 → FP4 → AQLM 1-2 bit). A 70B model's weight footprint is about 140GB in FP16, 70GB in INT8, 35GB in the 4-bit formats (INT4/NF4/NVFP4), and 17.5GB in AQLM 2-bit; this difference yields not only memory savings but also a 2-8x throughput gain. Hopper H100/H200's FP8 (E4M3 + E5M2) native Tensor Cores and Blackwell B200/GB200's NVFP4 + MXFP4 Transformer Engine v2 support form the hardware foundation of the 2024-2026 industry transformation; AMD MI325X/MI355X FP8/FP4, Intel Gaudi 3, and Google TPU v6 (Trillium) and v7 (Ironwood) have joined this race as well. Decision framework: the module gives evidence-based answers on the $0.30/M vs $3/M output-token cost comparison, the quality-regression budget (is 0.5% MMLU loss tolerable?), and which bit-width is the right choice for which scenario.
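The footprint arithmetic can be sketched as parameters × bits per weight / 8. The helper name below is an assumption, and the figure is nominal: it ignores scales, zero-points, codebooks, and any layers kept in higher precision, so real checkpoints are somewhat larger.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Nominal weight storage in GB (1 GB = 1e9 bytes), without quantization metadata."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


NOMINAL_BITS = {"FP16": 16, "INT8": 8, "4-bit": 4, "2-bit": 2}

for name, bits in NOMINAL_BITS.items():
    size = weight_memory_gb(70, bits)
    fits_4090 = "fits" if size < 24 else "does not fit"
    print(f"{name:>6}: {size:6.1f} GB, {fits_4090} in 24 GB before KV cache and activations")
```

The "fits" check compares weights only; KV cache and activation memory still have to come out of the same budget.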

The second module addresses the mathematical foundations of quantization. The linear-quantization formula q = round((x - z) / s); dequantization x' = s × q + z; the symmetric (zero-point = 0) vs asymmetric (zero-point ≠ 0) trade-off; min-max calibration vs percentile clipping (P99.9); the granularity-selection matrix per-tensor (coarsest) → per-channel → per-group (g=128, finest); outlier handling (SmoothQuant migration, MX-format logic). On the format side, NF4 (NormalFloat 4-bit, Dettmers 2023 — information-theoretic optimal 4-bit distribution, assuming weights are Gaussian-distributed), FP8 E4M3 (for the forward pass, more precise) vs E5M2 (for backward gradients, wider range), MXFP4 (OCP Microscaling 4-bit), and NVFP4 (Blackwell-native, NVIDIA's OCP variant) format distinctions are clarified. Without this foundation, modern quantization algorithms (GPTQ, AWQ, SmoothQuant) cannot be understood.
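As an illustration of the linear-quantization formulas above, the following sketch implements q = round((x - z) / s) and x' = s × q + z for the symmetric and asymmetric cases with min-max calibration, plus a per-group wrapper. Function names and the toy values are assumptions, and the input is assumed to be non-constant so the scale is never zero.

```python
def quantize(xs, bits=4, symmetric=True):
    """Min-max calibrated linear quantization; returns (codes, scale s, offset z)."""
    levels = 2 ** bits
    if symmetric:
        qmax = levels // 2 - 1                   # 7 for 4-bit
        s, z = max(abs(x) for x in xs) / qmax, 0.0
        lo, hi = -qmax, qmax
    else:
        z = min(xs)                              # offset expressed in real units
        s = (max(xs) - z) / (levels - 1)
        lo, hi = 0, levels - 1
    q = [min(hi, max(lo, round((x - z) / s))) for x in xs]
    return q, s, z


def dequantize(q, s, z):
    return [s * qi + z for qi in q]


def quantize_per_group(xs, group_size, **kwargs):
    """Finer granularity: one (s, z) pair per group of consecutive values."""
    return [quantize(xs[i:i + group_size], **kwargs) for i in range(0, len(xs), group_size)]


def max_error(xs, bits, symmetric):
    q, s, z = quantize(xs, bits, symmetric)
    return max(abs(a - b) for a, b in zip(xs, dequantize(q, s, z)))


skewed = [0.1, 0.5, 0.9, 1.3]  # all-positive toy tensor (hypothetical values)
print("symmetric  max error:", max_error(skewed, 4, True))
print("asymmetric max error:", max_error(skewed, 4, False))
```

On a one-sided distribution like this, the asymmetric variant spends all 16 levels on the occupied range, which is why its error is lower; for roughly zero-centered weights the symmetric form is usually preferred because it needs no offset.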

The third module addresses PTQ (post-training quantization) — the dominant production approach. Calibration-dataset selection (typically 128-512 samples are sufficient — C4, Wikitext, Pile, FineWeb sample; the Turkish FineWeb subset for Turkish-domain calibration), forward-pass tracking, activation-statistics collection, outlier detection — the emergent magnitude-outlier channel phenomenon discovered in 6.7B+ models in Dettmers 2022 LLM.int8 paper is analyzed in detail. These outlier channels (0.1-1% of all channels) carry the dominant share of model quality; the mixed-precision decision matrix (preserving outlier channels in FP16 + the rest in INT8) is the foundation of this discipline. Naive round-to-nearest quantization yields 5-15% MMLU loss at 4-bit; with modern GPTQ/AWQ it falls to 0.3-1% — this difference is emphasized. Tool stack: AutoGPTQ, AutoAWQ, llama.cpp, Hugging Face Optimum, NVIDIA Model Optimizer.
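The mixed-precision decision described above can be sketched as follows: input channels whose activation magnitude exceeds a threshold stay in floating point, and everything else takes an INT8 round trip. This is a simplified illustration in the spirit of LLM.int8, not its actual implementation: it uses per-tensor absmax scaling, plain Python lists, hypothetical toy values, and it assumes at least one non-outlier channel.

```python
def int8_roundtrip(matrix, qmax=127):
    """Per-tensor absmax INT8 quantize + dequantize (a simplification)."""
    scale = max(abs(v) for row in matrix for v in row) / qmax or 1.0
    return [[round(v / scale) * scale for v in row] for row in matrix]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mixed_precision_matmul(x, w, threshold=6.0):
    """Outlier input channels stay in float; the rest go through INT8."""
    channels = range(len(w))
    outlier = [j for j in channels if max(abs(row[j]) for row in x) > threshold]
    regular = [j for j in channels if j not in outlier]
    x_reg = [[row[j] for j in regular] for row in x]
    w_reg = [w[j] for j in regular]
    y = matmul(int8_roundtrip(x_reg), int8_roundtrip(w_reg))
    if outlier:
        y_out = matmul([[row[j] for j in outlier] for row in x], [w[j] for j in outlier])
        y = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(y, y_out)]
    return y


def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


x = [[0.5, -0.3, 60.0], [0.2, 0.4, -55.0]]  # third channel is the outlier
w = [[0.7], [-0.2], [0.01]]
exact = matmul(x, w)
full_int8_err = max_abs_diff(exact, matmul(int8_roundtrip(x), int8_roundtrip(w)))
mixed_err = max_abs_diff(exact, mixed_precision_matmul(x, w))
print("full INT8 error:", full_int8_err)
print("mixed-precision error:", mixed_err)
```

With a single shared scale, the outlier channel stretches the INT8 grid so far that the small values lose almost all resolution; isolating that one channel restores it.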

The fourth module mathematically builds GPTQ (Frantar 2022, ICLR 2023), the first widely adopted modern LLM PTQ algorithm. Its lineage from Optimal Brain Surgeon (Hassibi 1993) to Optimal Brain Quantization (Frantar 2022), the Hessian matrix approximation (H ≈ 2 X^T X), layer-by-layer one-shot quantization, error compensation (distributing each quantized weight's error to the remaining weights), inverse-Hessian computation via Cholesky decomposition, block-wise quantization and group size (g=128, g=64), and the effect of the act-order (desc_act) parameter: every stage is mathematically derived. On the production side, the 4-bit GPTQ pipeline for Llama 3.3 70B, Qwen3 32B, and DeepSeek V3 671B (MoE) models is done hands-on with AutoGPTQ + GPTQModel; the 2-3x boost in GPTQ inference speed with ExLlamaV2 kernels and the vLLM + Marlin kernel + GPTQ serving integration are covered in detail.
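The error-compensation step can be illustrated on a single weight row. The sketch below quantizes weights one at a time and pushes each rounding error onto the not-yet-quantized weights through the inverse Hessian. It is a toy under stated assumptions: it uses an explicit Gauss-Jordan inverse instead of the Cholesky formulation, a fixed scale, no grouping or act-order, and a hand-picked 2×2 Hessian.

```python
def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix given as lists of floats."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def round_to_nearest(w, scale):
    return [round(v / scale) * scale for v in w]


def quantize_with_compensation(w, hessian, scale):
    """Quantize left to right, spreading each error over the remaining weights."""
    w, n = list(w), len(w)
    hinv = invert(hessian)
    q = [0.0] * n
    for i in range(n):
        q[i] = round(w[i] / scale) * scale
        err = (w[i] - q[i]) / hinv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * hinv[i][j]             # compensate remaining weights
        d, col_i = hinv[i][i], [hinv[r][i] for r in range(n)]
        for r in range(i + 1, n):                # drop weight i from the inverse Hessian
            for c in range(i + 1, n):
                hinv[r][c] -= col_i[r] * col_i[c] / d
    return q


def layer_loss(w, q, hessian):
    """delta^T H delta, proportional to the squared output error of the layer."""
    d = [a - b for a, b in zip(w, q)]
    return sum(d[i] * hessian[i][j] * d[j] for i in range(len(d)) for j in range(len(d)))


w = [0.4, 0.4]
H = [[1.0, 0.9], [0.9, 1.0]]                     # strongly correlated inputs (toy)
loss_rtn = layer_loss(w, round_to_nearest(w, 1.0), H)
loss_comp = layer_loss(w, quantize_with_compensation(w, H, 1.0), H)
print("round-to-nearest loss:", loss_rtn)
print("with compensation:    ", loss_comp)
```

In this example round-to-nearest sends both weights to 0, while compensation rounds the second weight up to 1 so that the two errors partly cancel in the layer output.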

The fifth module analyzes in detail MIT Han Lab's Lin 2023 (NeurIPS 2023) AWQ algorithm. AWQ's key insight: 1% of salient weights carry the dominant share of the entire model's quality and are determined by activation magnitude. With a per-channel scaling factor, salient channels are scaled up → quantized → scaled down; this mechanism minimizes quantization error over the salient channels. The optimal scale α value is determined via grid search (128-256 sample calibration dataset is sufficient). Comparison with GPTQ: AWQ is simpler (no Hessian compute), faster (Llama 3.3 70B in 10-30 minutes), with similar or better quality (especially in reasoning and instruction following). Production: the 4-bit AWQ pipeline for Llama 3.3, Qwen3, DeepSeek V3 is done hands-on with the AutoAWQ + vLLM + Marlin + TensorRT-LLM stack.
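The scale-up, quantize, scale-down mechanism and the grid search over α can be sketched on one output row. This is a simplified illustration, not the AutoAWQ implementation: the scale is s_j = (mean |activation_j|)^α, quantization is per-row symmetric 4-bit, and the weights and calibration activations are hypothetical toy values.

```python
def quant_dequant(vals, bits=4):
    """Symmetric absmax round trip for one row of weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vals) / qmax
    return [round(v / scale) * scale for v in vals]


def output_error(w, w_hat, calib):
    """Squared error of the row's output over the calibration activations."""
    return sum(sum((a - b) * x for a, b, x in zip(w, w_hat, xs)) ** 2 for xs in calib)


def awq_search(w, calib, grid=20):
    """Grid-search alpha in [0, 1]; returns (alpha, error, per-channel scales)."""
    n = len(w)
    act_mag = [sum(abs(xs[j]) for xs in calib) / len(calib) for j in range(n)]
    best = None
    for step in range(grid + 1):
        alpha = step / grid
        s = [m ** alpha for m in act_mag]
        scaled = [wj * sj for wj, sj in zip(w, s)]
        # Dividing the quantized weights by s is equivalent to dividing
        # the activations by s at inference time.
        w_hat = [q / sj for q, sj in zip(quant_dequant(scaled), s)]
        err = output_error(w, w_hat, calib)
        if best is None or err < best[1]:
            best = (alpha, err, s)
    return best


w = [0.9, -0.05, 0.3, 0.12]                       # channel 1: small weight...
calib = [[0.1, 20.0, 0.2, 0.1], [0.2, -18.0, 0.1, 0.3]]  # ...but large activations
alpha, err, scales = awq_search(w, calib)
rtn_err = output_error(w, quant_dequant(w), calib)
print("plain round-to-nearest error:", rtn_err)
print("best alpha:", alpha, "error:", err)
```

Because α = 0 (no scaling) is part of the grid, the search can never do worse than plain round-to-nearest on the calibration data.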

The sixth module addresses the W8A8 discipline of quantizing not only weights but activations too. SmoothQuant (Xiao 2022) — migrating outliers from activations to weights via the identity Y = (X · diag(s)^-1) · (diag(s) · W) to ease activation quantization; migration-strength tuning with the α parameter (0.5-0.85). ZeroQuant (Yao 2022) — token-wise dynamic quantization. LLM.int8 — 8-bit + outlier-handling hybrid approach. W8A8 serving yields a 2-4x throughput increase over FP16 (especially critical not at batch size 1 but in high concurrency). Production: W8A8 serving with vLLM + LLM Compressor (SmoothQuant); TensorRT-LLM INT8 serving (H100 Tensor Cores); W4A8 mixed precision (weight 4-bit + activation 8-bit hybrid) is covered in detail.
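The SmoothQuant identity can be checked numerically. The sketch below uses the per-channel factor s_j = max|X_j|^α / max|W_j|^(1-α), divides the activations by s, multiplies the matching weight rows by s, and confirms that the product is unchanged while the activation range shrinks. Matrices are plain lists with hypothetical toy values; W is laid out as input channels × output channels.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Migrate activation outliers into the weights; returns (x_s, w_s, s)."""
    n_in = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(n_in)]
    w_max = [max(abs(v) for v in w[j]) for j in range(n_in)]
    s = [a ** alpha / m ** (1 - alpha) for a, m in zip(act_max, w_max)]
    x_s = [[v / sj for v, sj in zip(row, s)] for row in x]       # X . diag(s)^-1
    w_s = [[v * sj for v in w[j]] for j, sj in enumerate(s)]     # diag(s) . W
    return x_s, w_s, s


def abs_max(m):
    return max(abs(v) for row in m for v in row)


x = [[1.0, 100.0], [-2.0, 80.0]]     # second channel carries the outliers
w = [[0.5, -0.4], [0.02, 0.01]]
x_s, w_s, s = smooth(x, w, alpha=0.5)

print("activation range before/after:", abs_max(x), abs_max(x_s))
print("weight range before/after:    ", abs_max(w), abs_max(w_s))
print("outputs match:", matmul(x, w), matmul(x_s, w_s))
```

At α = 0.5 the difficulty is split evenly, so both tensors end up with a similar dynamic range; larger α moves more of the outlier magnitude into the weights.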

The seventh module covers the GGUF format and the K-quants + IQ-quants family of the llama.cpp ecosystem, which is especially critical for Apple Silicon and CPU deployment. Since Georgi Gerganov open-sourced llama.cpp in 2023, the project has passed 70K GitHub stars and, as of 2026, has become the de facto standard for edge LLM serving. The GGUF format structure (header + metadata + tensor data), the K-quants family (Q4_K_M most popular quality/size balance, Q5_K_M quality-priority, Q6_K, Q8_0 max quality), the mixed-precision super-block + sub-block structure; IQ-quants extreme low-bit (IQ1_S 1.6-bit, IQ2_XXS, IQ3_S, all codebook + importance-matrix based); smart bit allocation with imatrix. The recipe for fitting a 70B model into 24GB VRAM (RTX 4090) or 36GB RAM (Apple Silicon M3 Max) is shown practically. Mobile (LiteRT, MediaPipe) GGUF deployment, AMD Ryzen AI 9 NPU, and Intel Xeon AMX optimization are also covered.
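As a small illustration of the header part of that structure, the sketch below reads the fixed-size start of a GGUF file: a 4-byte magic, a 32-bit version, and 64-bit tensor and metadata key-value counts, all little-endian. That layout is stated here as an assumption matching GGUF versions 2 and 3; the function name and the synthetic bytes are illustrative, and the metadata and tensor sections that follow are not parsed.

```python
import struct

HEADER_FORMAT = "<4sIQQ"   # magic, version, tensor_count, metadata_kv_count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def parse_gguf_header(data: bytes) -> dict:
    """Decode the fixed-size GGUF header from the first bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}


# Synthetic header for demonstration; a real file would be read with open(path, "rb").
fake_header = struct.pack(HEADER_FORMAT, b"GGUF", 3, 291, 24)
print(parse_gguf_header(fake_header))
```

Reading only these bytes is a cheap way to sanity-check a download before loading it with llama.cpp.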

The eighth module covers in detail modern GPU architectures' native low-precision floating-point support. Hopper H100/H200 FP8 (E4M3 forward, E5M2 backward) native Tensor Cores; Blackwell B200/GB200 NVFP4 (block scale + sub-block scale) + MXFP4 (OCP Microscaling) Transformer Engine v2 — forming NVIDIA's 2026 datacenter standards. DeepSeek V3's FP8 training recipe over 14.8 trillion tokens (scale-factor management, loss scaling, 30-40% cost saving vs BF16) is analyzed. 3-5x FP4 inference throughput increase on Blackwell B200/GB200; the FP4 model-export pipeline with TensorRT Model Optimizer; Hugging Face Optimum + NVIDIA TransformerEngine integration are shown practically. AMD MI325X/MI355X FP8/FP4, Intel Gaudi 3, Google TPU v6/v7 quantization comparison is made.
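The precision-versus-range trade-off between E4M3 and E5M2 follows directly from their bit layouts. The decoder below is a sketch under these assumptions: E4M3 uses bias 7 with no infinities and a single NaN mantissa pattern (the "FN" convention used for inference), and E5M2 uses bias 15 with IEEE-style infinities and NaNs. Function names are illustrative.

```python
import math


def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one 8-bit float with layout [sign | exponent | mantissa]."""
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    all_ones = (1 << exp_bits) - 1
    if exp == all_ones:
        if ieee_specials:
            return sign * math.inf if man == 0 else math.nan
        if man == (1 << man_bits) - 1:
            return math.nan                      # the only NaN pattern in E4M3
    if exp == 0:                                 # subnormal numbers
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)


def e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


def largest_finite(decode):
    return max(v for v in map(decode, range(128)) if math.isfinite(v))


print("E4M3 max:", largest_finite(e4m3), "smallest step near 1.0:", e4m3(0x39) - e4m3(0x38))
print("E5M2 max:", largest_finite(e5m2), "smallest step near 1.0:", e5m2(0x3D) - e5m2(0x3C))
```

E4M3 resolves values near 1.0 in steps of 1/8 but tops out at 448, whereas E5M2 reaches 57344 with steps of 1/4, which matches the forward-pass versus gradient split described above.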

The ninth module is dedicated to the frontier extreme-quantization discipline of 2024-2026. AQLM (Egiazarian 2024 — Additive Quantization for Language Models, codebook + vector-quantization-based 2-bit; AQLM 2-bit accuracy surpasses GPTQ 4-bit); QuIP# (Tseng 2024 — Quantization with Incoherence Processing, E8 lattice + incoherence rotation); BitNet b1.58 (Microsoft 2024 — {-1, 0, 1} ternary native pre-training, not post-training); HQQ (Badri 2024 — Half-Quadratic Quantization, a calibration-free fast PTQ alternative). The recipe for shrinking a 70B model to 13GB to serve on an RTX 4090 (24GB) is shown practically. PTQ vs native low-bit pre-training (the BitNet approach) is addressed as the 2026 paradigm shift.
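The additive-codebook idea behind AQLM can be sketched in a few lines: each group of weights is stored as one index per codebook, and the reconstructed group is the sum of the selected codewords. The code below shows decoding and the resulting index cost per weight. Codebook contents are hypothetical toy values, and the bit count ignores the storage for the codebooks themselves and for per-channel scales.

```python
import math


def bits_per_weight(num_codebooks, codebook_size, group_size):
    """Index bits per weight, excluding codebook and scale storage."""
    return num_codebooks * math.log2(codebook_size) / group_size


def decode_group(codes, codebooks):
    """Reconstruct one weight group as the sum of one codeword per codebook."""
    out = [0.0] * len(codebooks[0][0])
    for code, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[code])]
    return out


# One 16-bit codebook over groups of 8 weights gives 2 bits per weight.
print(bits_per_weight(num_codebooks=1, codebook_size=2 ** 16, group_size=8))
# Two 8-bit codebooks over groups of 8 give the same budget.
print(bits_per_weight(num_codebooks=2, codebook_size=2 ** 8, group_size=8))

# Toy decoding with two tiny codebooks over groups of 4 weights.
toy_codebooks = [
    [[0.5, -0.5, 0.0, 0.25], [-0.25, 0.25, 0.5, 0.0]],
    [[0.125, 0.0, -0.125, 0.0], [0.0, 0.125, 0.0, -0.125]],
]
print(decode_group([1, 0], toy_codebooks))
```

Unlike scalar formats such as INT4, the codewords are learned vectors, so the format can spend its bits on the weight patterns that actually occur.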

The tenth module focuses on KV-cache quantization, a critical topic for the long thinking traces of modern reasoning models (o3/o4, DeepSeek R1, Claude Extended Thinking, Qwen3). The KV-cache size formula per sequence is 2 (keys and values) × layers × KV heads × head dim × context length × bytes per element; a 70B model + 32K context = 32GB KV cache (in FP16). The 16K-128K thinking traces of reasoning models make this memory explode. vLLM FP8 KV cache (2x memory savings + minimal quality loss), TensorRT-LLM FP8 KV cache serving, KIVI (Liu 2024: 2-bit KV cache + per-channel/per-token scaling), KVQuant (Hooper 2024: outlier-aware non-uniform quantization), CacheGen, and combinations of prefix cache + KV quantization (reasoning-trace reuse) are covered in detail. This discipline can reduce reasoning-model serving cost by 50-70%.
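The size formula can be evaluated directly. The shape values below (80 layers, 8 KV heads, head dimension 128) are assumptions modelled on a 70B-class model with grouped-query attention; a model with more KV heads, or a batch of concurrent sequences, scales the result linearly, so absolute figures depend heavily on those choices. Quantization metadata such as scales is ignored.

```python
GIB = 2 ** 30


def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem, batch=1):
    """2 (K and V) x layers x KV heads x head dim x tokens x element size x batch."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem * batch


ASSUMED_SHAPE = dict(layers=80, kv_heads=8, head_dim=128)  # hypothetical 70B-class config
BYTES_PER_ELEM = {"FP16": 2, "FP8": 1, "2-bit": 0.25}

for name, size in BYTES_PER_ELEM.items():
    per_seq = kv_cache_bytes(context_len=32 * 1024, bytes_per_elem=size, **ASSUMED_SHAPE)
    print(f"{name:>5}: {per_seq / GIB:5.2f} GiB per 32K-token sequence")
```

The ratios between formats (2x for FP8, 8x for 2-bit) hold regardless of the assumed shape, and they translate into how many long reasoning traces fit on one GPU at the same time.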

The eleventh module addresses QAT (Quantization-Aware Training) — which steps in for scenarios where PTQ is insufficient (extreme low-bit, severe quality regression). Fake quantization (quantize-dequantize in forward), STE (Straight-Through Estimator) backward gradient, learnable scale + zero-point (LSQ, Esser 2020), QLoRA-aware fine-tuning (4-bit base + LoRA + QAT), the Gemma 3 QAT (Google 2025) production recipe — which produced INT4 models with quality close to BF16 (1.5% MMLU loss). The mixed PTQ + QAT hybrid recipe (start with PTQ, recover loss with QAT) is shown practically. The end-to-end QAT pipeline with Hugging Face Optimum + NVIDIA Model Optimizer is covered.
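Fake quantization and the straight-through estimator can be shown without an autograd framework. In the sketch below the forward pass rounds a latent weight onto the INT4 grid, and the backward pass treats that rounding as the identity inside the representable range and blocks the gradient outside it. Function names, the fixed scale, and the toy values are assumptions; a real pipeline would express the same logic as a custom autograd function.

```python
def fake_quant(x, scale, bits=4):
    """Forward pass: quantize then immediately dequantize."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale


def ste_grad(x, scale, upstream, bits=4):
    """Backward pass: pass the gradient through unchanged unless x was clipped."""
    qmax = 2 ** (bits - 1) - 1
    inside = (-qmax - 1) * scale <= x <= qmax * scale
    return upstream if inside else 0.0


# One-weight toy problem: fit y = w * x with the quantized weight in the loop.
latent_w, scale, lr = 0.26, 0.1, 0.05
x_in, target = 2.0, 1.0                       # the ideal weight is 0.5
for _ in range(20):
    w_q = fake_quant(latent_w, scale)
    grad_out = 2 * (w_q * x_in - target) * x_in      # d(loss)/d(w_q)
    latent_w -= lr * ste_grad(latent_w, scale, grad_out)

print("latent weight:", latent_w, "quantized weight:", fake_quant(latent_w, scale))
```

The latent weight stays in full precision and keeps accumulating small updates, while the loss only ever sees its quantized version; that is what lets training adapt to the grid.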

In the capstone module, each participant designs an end-to-end quantization pipeline tailored to their own production scenario: model selection (Llama 3.3 70B, Qwen3 32B, DeepSeek V3, Gemma 3, Mistral, their own CPT model), hardware target (RTX 4090 24GB, H100 80GB, B200 192GB, Apple Silicon, AMD MI325X, Intel Xeon CPU), bit-width strategy (4-bit weight + 8-bit activation + FP8 KV cache; or AQLM 2-bit + FP8 KV; or GGUF Q4_K_M + Apple Silicon), algorithm selection (GPTQ vs AWQ vs SmoothQuant vs AQLM evidence-based), serving stack (vLLM, TensorRT-LLM, llama.cpp, SGLang), accuracy-validation framework (MMLU + HumanEval + Turkish MMLU + domain-benchmark regression), cost analysis (hourly GPU cost + token throughput + $/M token), 90-day production deployment roadmap. By the end of the training, participants reach a level of technical competence to dissect the quantization spectrum (FP16 → INT8 → INT4 → FP8 → FP4 → AQLM 2-bit) in terms of compute economics; implement GPTQ Hessian-approximation and AWQ scaling-factor mechanisms; apply modern techniques like SmoothQuant outlier migration + KIVI 2-bit KV cache; leverage Hopper FP8 + Blackwell FP4 native hardware advantages; perform edge / CPU / Apple Silicon deployment with GGUF + llama.cpp; evaluate extreme low-bit approaches like AQLM + QuIP# + BitNet; recover loss with QAT in scenarios where PTQ is insufficient; and perform quantized production serving on vLLM / TensorRT-LLM / llama.cpp / SGLang stacks. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.
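The cost-analysis step of the capstone reduces to simple arithmetic once hourly GPU cost and measured throughput are known. The sketch below computes $/M tokens; the hourly price and throughput figures are hypothetical placeholders, not measurements, and the formula assumes the GPU is fully utilized.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD per one million generated tokens at full utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical scenario: same GPU price, throughput before and after quantization.
scenarios = {
    "FP16 baseline": (2.50, 400.0),
    "quantized": (2.50, 1000.0),
}
for name, (hourly, tps) in scenarios.items():
    print(f"{name:>14}: ${cost_per_million_tokens(hourly, tps):.3f} per 1M tokens")
```

In a real analysis the throughput numbers would come from load tests at the target concurrency and context length, since batch size strongly affects tokens per second.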

Training Methodology

The only advanced program in Turkey that addresses LLM-quantization discipline end to end with math + algorithm + production stack

Mathematical construction of GPTQ (Frantar 2022) Hessian approximation + AWQ (Lin 2023) scaling factor

W8A8 production serving with SmoothQuant + ZeroQuant + LLM.int8 outlier handling

Edge / CPU / Apple Silicon deployment with GGUF + llama.cpp K-quants + IQ-quants

Hopper FP8 (E4M3/E5M2) + Blackwell NVFP4/MXFP4 native hardware advantage

AQLM + QuIP# + BitNet b1.58 + HQQ extreme low-bit (1-2 bit) 2024-2026 frontier

KV-cache quantization for reasoning-model long-trace serving with KIVI + KVQuant

QAT pipeline + Gemma 3 QAT (Google 2025) recipe + PTQ + QAT hybrid approach

Who Is This For?

ML Engineers and Inference Engineers who want to reduce production LLM inference cost 3-10x

Senior backend developers who want to optimize reasoning-model (o3, R1, Claude Extended Thinking) serving cost via KV-cache quantization

ML Platform engineers who want to fit open-source LLMs (Llama 3.3, Qwen3, DeepSeek V3, Gemma 3) onto a single H100 or RTX 4090

AI Engineers who need to deploy to Apple Silicon / CPU / Edge / Mobile

MLOps engineers who want to learn the Hopper FP8 and Blackwell FP4 native-hardware-optimization discipline

AI Researchers active in quantization research (following AQLM, QuIP#, BitNet, KIVI)

Why This Course?

The only advanced program in Turkey that addresses LLM-quantization discipline end to end with math + algorithm + production.

Covers GPTQ, AWQ, SmoothQuant, EXL2, GGUF, AQLM, BitNet, QuIP#, HQQ, KIVI comparatively + hands-on.

Ties Hopper FP8 + Blackwell FP4 native hardware advantage to the 2026 datacenter standards.

Teaches the KV-cache quantization discipline for reasoning-model long-trace serving end to end.

Provides quality-recovery recipes with QAT in extreme low-bit scenarios where PTQ is insufficient.

Masters production deployment on the vLLM + TensorRT-LLM + llama.cpp + SGLang serving stacks.

Through the capstone project, equips the participant with a quantization pipeline + cost analysis applicable on their own hardware target.

Together with RLHF + Reasoning Models + Mech Interp + CPT + Quantization, completes a five-training frontier set covering the full arsenal of production LLM engineering.

Learning Outcomes

Select the right bit-width across the FP16 → INT8 → INT4 → FP8 → FP4 → AQLM 2-bit spectrum.

Implement the GPTQ Hessian-approximation and AWQ scaling-factor mechanisms.

Build W8A8 production serving with SmoothQuant outlier migration.

Deploy to Apple Silicon / Edge / CPU with GGUF + llama.cpp.

Skillfully use Hopper FP8 and Blackwell FP4 native Tensor Cores.

Serve a 70B model on an RTX 4090 (24GB) with AQLM.

Reduce reasoning-model serving cost by 50-70% with KIVI 2-bit KV cache.

Recover quality with QAT in extreme low-bit scenarios where PTQ is insufficient.

Skillfully manage the vLLM, TensorRT-LLM, llama.cpp, and SGLang quantized-serving stacks.

Design a quantization pipeline that reduces production cost 3-10x and latency 2-4x.

Requirements

Active Python experience (intermediate to advanced), basic use of PyTorch and HuggingFace Transformers

LLM inference experience (at least conceptual familiarity with vLLM, llama.cpp, TGI, or similar)

Foundations in linear algebra, numerical methods (matrix operations, Cholesky)

Basic knowledge of transformer architecture (attention, MLP, residual stream)

GPU access (RunPod, Lambda Labs, Modal) — H100 (80GB) recommended for the capstone; participation possible with RTX 4090 / Apple Silicon

A Hugging Face account + an LLM-provider (OpenAI/Anthropic/Google) API key before the training

Course Curriculum

104 Lessons

Module 1: Strategic Introduction to LLM Quantization Engineering - The 2026 Landscape (9 Lessons)

Module 2: Quantization Theory - Symmetric, Asymmetric, Per-Channel, and Per-Group (9 Lessons)

Module 3: Post-Training Quantization (PTQ) Foundations - Calibration and the Outlier Problem (9 Lessons)

Module 4: GPTQ (Generative Pre-trained Transformer Quantization) - Frantar 2022 Derivation (9 Lessons)

Module 5: AWQ (Activation-aware Weight Quantization) - Lin 2023 Approach (9 Lessons)

Module 6: SmoothQuant and ZeroQuant - W8A8 Serving with Activation Quantization (9 Lessons)

Module 7: GGUF, llama.cpp, and IQ-Quants - Edge and CPU-Friendly Quantization (9 Lessons)

Module 8: FP8 and FP4 - Hopper H100, Blackwell B200, NVFP4, and MXFP4 (9 Lessons)

Module 9: AQLM and Extreme Quantization - Serving 70B+ Models at 1-2 Bit (9 Lessons)

Module 10: KV Cache Quantization - Critical for Reasoning-Model Long-Trace Serving (9 Lessons)

Module 11: Quantization-Aware Training (QAT) - Going Beyond PTQ Limits (9 Lessons)

Module 12: Capstone - Building a Production-Grade Quantization Pipeline (5 Lessons)

Instructor

Şükrü Yusuf KAYA

AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant

Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.


Live & Interactive Sessions

Project-Based Learning

Industry-Focused Curriculum

Professional Networking
