TensorRT-LLM: NVIDIA Native Engine — INT8 SmoothQuant + FP8 + In-Flight Batching
TensorRT-LLM is NVIDIA's LLM-specific TensorRT engine: native CUDA kernels for Hopper/Ada and the fastest inference path (+15-30% throughput vs vLLM). Covers the engine build process, INT8 SmoothQuant, FP8 quantization, and multi-LoRA. Hands-on: Llama 3.1 8B TRT-LLM engine build (~1 h) + inference on an RTX 4090.
Şükrü Yusuf KAYA
28 min read
Advanced
1. TensorRT-LLM — Why and Why Not
Advantages:
- NVIDIA-native, highest throughput (+15-30% vs AWQ on vLLM on an RTX 4090)
- Native FP8 inference + SmoothQuant INT8
- In-flight batching (the equivalent of vLLM's continuous batching)
- Multi-LoRA support
- Integrates with Triton Inference Server
Disadvantages:
- Engine builds are expensive (1-2 hours for an 8B model)
- Engines are model-specific (a separate engine per quantization config)
- The Python API is clunky (verbose compared to vLLM)
- New model support lands later than in vLLM
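The in-flight batching item above can be sketched as a toy scheduler (plain Python, not TRT-LLM's actual scheduler; request IDs and token counts are made up): finished sequences free their batch slot immediately, and queued requests join mid-generation instead of waiting for the whole batch to drain.

```python
from collections import deque

def inflight_batching(requests, max_batch_size):
    """Toy in-flight batching: each request is (id, tokens_to_generate).
    Returns per-step batch occupancy, showing slots being refilled as
    soon as a sequence finishes (no waiting for the full batch)."""
    queue = deque(requests)
    active = {}          # id -> remaining tokens
    timeline = []
    while queue or active:
        # Admit queued requests into free slots (the "in-flight" part).
        while queue and len(active) < max_batch_size:
            rid, n = queue.popleft()
            active[rid] = n
        timeline.append(sorted(active))
        # One decode step: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # leaves immediately, frees its slot
    return timeline

# Three requests, batch size 2: "c" starts as soon as "a" finishes.
steps = inflight_batching([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2)
print(steps)  # [['a', 'b'], ['b', 'c'], ['b', 'c']]
```

With static batching, "c" would have to wait until both "a" and "b" finished; here it occupies "a"'s slot one step later.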
```bash
# === TensorRT-LLM engine build (Llama 3.1 8B FP8) ===

# 1. TRT-LLM container
docker run --gpus all -it --rm \
    -v $PWD:/workspace \
    nvcr.io/nvidia/tensorrt:24.10-py3 bash

# 2. Install
pip install tensorrt-llm

# 3. Convert HF checkpoint -> FP8 TRT-LLM checkpoint (with calibration)
python convert_checkpoint.py \
    --model_dir meta-llama/Meta-Llama-3.1-8B-Instruct \
    --output_dir trt_ckpt \
    --dtype bfloat16 \
    --use_fp8 \
    --calib_dataset c4 \
    --calib_size 512

# 4. Build the engine (1-2 hours on an RTX 4090).
#    FP8 is already baked into the converted checkpoint.
trtllm-build \
    --checkpoint_dir trt_ckpt \
    --output_dir trt_engine \
    --gemm_plugin auto \
    --gpt_attention_plugin auto \
    --max_batch_size 32 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --max_num_tokens 16384 \
    --paged_kv_cache enable

# 5. Inference
python run.py --engine_dir trt_engine \
    --tokenizer_dir meta-llama/Meta-Llama-3.1-8B-Instruct \
    --max_output_len 500 \
    --input_text "What is the capital of Türkiye?"
```
TensorRT-LLM engine build + inference
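The build above takes the FP8 path; TRT-LLM's other quantization route is SmoothQuant INT8. A minimal numeric sketch of SmoothQuant's core idea, in plain Python rather than TRT-LLM code: a per-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1-α) (α = 0.5 here) divides the activations and multiplies the weights, so the matmul result is unchanged but activation outliers shrink into the INT8-friendly range. The channel values below are made-up examples.

```python
def smoothquant_scales(act_absmax, w_absmax, alpha=0.5):
    """Per-channel smoothing factors s_j = max|X_j|^a / max|W_j|^(1-a).
    Activations are divided by s_j, weights multiplied by s_j, so
    X @ W is mathematically unchanged while activation outliers shrink."""
    return [a ** alpha / w ** (1 - alpha)
            for a, w in zip(act_absmax, w_absmax)]

# Channel 2 has an activation outlier (absmax 64 vs weight absmax 1).
act = [4.0, 2.0, 64.0]
w   = [1.0, 1.0, 1.0]
s = smoothquant_scales(act, w)
smoothed_act = [a / sj for a, sj in zip(act, s)]
print(smoothed_act)  # dynamic range drops from 64/2 = 32x to ~5.7x
```

The smaller dynamic range is what makes per-tensor INT8 quantization of the activations viable; the migrated difficulty lands in the weights, which tolerate it better.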
✅ Deliverables
1. Set up the TRT-LLM container.
2. Build the Llama 8B FP8 engine.
3. Benchmark the same model against vLLM with AWQ.
4. Next lesson: 15.6 — llama.cpp + Ollama + Modelfile.