# FP8 Inference: vLLM + TensorRT-LLM SmoothQuant — Production-Ready on RTX 4090

> Source: https://sukruyusufkaya.com/en/learn/fine-tuning-cookbook/ftc-fp8-inference-vllm-smoothquant
> Updated: 2026-05-14T14:42:57.552Z
> Category: Fine-Tuning Cookbook (Model-by-Model)
> Module: Part X — Quantization Engineering

**TLDR:** FP8 training is still premature, but FP8 inference is production-grade in 2026. Covers vLLM's native FP8 path (supported for Llama 3.1+ and Qwen 2.5+), SmoothQuant in TensorRT-LLM, and an AWQ-Marlin INT4 vs FP8 comparison, then walks through converting Llama 3.1 8B to FP8 and serving it on an RTX 4090 (~120 tok/s vs ~95 tok/s in bf16).
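
A minimal sketch of the vLLM FP8 serving path the TLDR refers to, assuming a vLLM build with FP8 support installed; the model id, context length, and sampling settings here are illustrative, not taken from the article.

```python
# Sketch: serve Llama 3.1 8B with vLLM's online FP8 quantization.
# quantization="fp8" quantizes weights to FP8 at load time, roughly halving
# weight memory vs bf16; on an RTX 4090 (Ada) vLLM dispatches to FP8-capable
# kernels where available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model id
    quantization="fp8",       # online FP8 weight quantization
    max_model_len=4096,       # illustrative: keep KV cache within 24 GB VRAM
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain FP8 inference in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same thing from the CLI, for an OpenAI-compatible server: `vllm serve meta-llama/Llama-3.1-8B-Instruct --quantization fp8`.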

