FP8 Inference: vLLM SmoothQuant + TensorRT-LLM — Production-Ready on RTX 4090
FP8 training is still premature, but FP8 inference is production-grade in 2026: vLLM native FP8 (Llama 3.1+ / Qwen 2.5+), TensorRT-LLM SmoothQuant, and an AWQ-Marlin INT4 vs FP8 comparison. Hands-on: Llama 3.1 8B FP8 conversion and serving on an RTX 4090 (~120 tok/s vs 95 for bf16).
Şükrü Yusuf KAYA
28 min read
Advanced
1. vLLM Native FP8 Support
vLLM 0.6+ supports native FP8 for Llama 3.1/3.2/3.3, Qwen 2.5+, and Gemma 3. SmoothQuant absorbs activation outliers into the weights so activations quantize cleanly.
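For intuition: SmoothQuant rescales each input channel of a linear layer so that activation outliers are migrated into the weights before quantization, while the layer's output stays mathematically unchanged. A minimal numpy sketch of the idea (illustrative only, not the llmcompressor implementation; `alpha` plays the role of `smoothing_strength`):

```python
# Illustrative SmoothQuant sketch: migrate activation outliers into the weights.
import numpy as np

def smooth(X, W, alpha=0.8):
    # X: (tokens, in_features) calibration activations, W: (in_features, out_features)
    act_max = np.abs(X).max(axis=0)           # per-channel activation range
    w_max = np.abs(W).max(axis=1)             # per-channel weight range
    s = act_max**alpha / w_max**(1 - alpha)   # per-channel smoothing scale
    return X / s, W * s[:, None]              # (X/s) @ (s*W) == X @ W

X = np.random.randn(32, 4096)
X[:, 7] *= 50                                 # channel 7 has activation outliers
W = np.random.randn(4096, 4096)
Xs, Ws = smooth(X, W)
print(np.allclose(Xs @ Ws, X @ W))            # True: output preserved, outliers tamed
```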
```python
# 1. Convert Llama 3.1 8B to FP8 (one-shot)
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# SmoothQuant + FP8
oneshot(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dataset="open_platypus",  # calibration data is required; open_platypus is one built-in option
    recipe=[
        SmoothQuantModifier(smoothing_strength=0.8),
        GPTQModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
    ],
    output_dir="llama-3.1-8b-fp8",
    num_calibration_samples=512,
)
```

```bash
# 2. Serve with vLLM
vllm serve llama-3.1-8b-fp8 --quantization fp8
```
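Once the server is up, a quick smoke test against vLLM's OpenAI-compatible endpoint (the port, model name, and prompt below are just the defaults/placeholders for this setup):

```python
# Smoke test against the vLLM OpenAI-compatible server (default port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
resp = client.chat.completions.create(
    model="llama-3.1-8b-fp8",  # must match the model path/name the server was started with
    messages=[{"role": "user", "content": "Explain FP8 quantization in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```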
2. RTX 4090 Inference Throughput Comparison
| Quantization | tok/s (batch=1) | tok/s (batch=16) | tok/s (batch=64) | Size | Quality (PPL) |
|---|---|---|---|---|---|
| bf16 | 95 | 540 | 1240 | 16 GB | 5.93 ref |
| AWQ int4 | 175 | 920 | 2150 | 4.4 GB | 5.99 (+1.0%) |
| GPTQ int4 | 165 | 870 | 2050 | 4.5 GB | 6.04 (+1.9%) |
| FP8 (vLLM) | 155 | 1080 | 2520 | 8 GB | 5.95 (+0.3%) |
Takeaways:
- batch=1 (single user): AWQ int4 is fastest (highly optimized Marlin kernels)
- batch=16+: FP8 is fastest (lower memory-bandwidth pressure, more efficient kernels at scale; see the benchmark sketch below)
- FP8's quality loss is far smaller than INT4's (PPL +0.3% vs +1.0-1.9%)
- FP8 weights are ~2x larger than INT4's (8 GB vs 4.5 GB)
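A rough way to reproduce the batch=1 vs batch=16 comparison with vLLM's offline API (a sketch: the prompt and decode length are illustrative, and absolute tok/s depends on prompt length and sampling settings):

```python
# Rough decode-throughput benchmark for the FP8 checkpoint with vLLM's offline API.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="llama-3.1-8b-fp8", quantization="fp8")
params = SamplingParams(max_tokens=256, ignore_eos=True)  # force a full 256-token decode

for batch in (1, 16, 64):
    prompts = ["Summarize the history of GPU computing."] * batch
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch:>2}: {generated / elapsed:.0f} tok/s")
```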
The cookbook's rule of thumb:
- Single-user / low concurrency → AWQ int4
- High concurrency / batch serving → FP8
✅ Deliverables
1) Convert Llama 3.1 8B to FP8 with llmcompressor.
2) Serve it with vLLM.
3) Compare batch=1 and batch=16 throughput.
4) Next lesson: 10.9 — Calibration Dataset Engineering.