TensorRT-LLM: NVIDIA Native Engine — INT8 SmoothQuant + FP8 + In-Flight Batching

TensorRT-LLM is NVIDIA's LLM-specific TensorRT engine. Its CUDA kernels are native to Hopper/Ada, delivering the fastest inference (15-30% higher throughput than vLLM). This lesson covers the engine build process, INT8 SmoothQuant, FP8 quantization, and multi-LoRA, then builds a Llama 3.1 8B TRT-LLM engine on an RTX 4090 (1-2 hours) and runs inference.

Şükrü Yusuf KAYA

28 min read

26.06.2026

Advanced


1. TensorRT-LLM: Why and Why Not

Advantages:

NVIDIA-native, with the highest throughput (15-30% above AWQ-vLLM on an RTX 4090)
Native FP8 inference plus SmoothQuant INT8
In-flight batching (the equivalent of vLLM's continuous batching)
Multi-LoRA support
Integrates with Triton Inference Server
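
To make the in-flight batching item concrete, here is a toy comparison in plain Python. It is a sketch of the scheduling idea only, not of TensorRT-LLM's actual scheduler: the function names are hypothetical, prefill cost is ignored, and each decode step is assumed to produce one token per running request.

```python
from collections import deque


def static_batching_steps(output_lens, max_batch_size):
    """Decode steps when each batch must fully finish before the next starts."""
    steps = 0
    for i in range(0, len(output_lens), max_batch_size):
        batch = output_lens[i:i + max_batch_size]
        steps += max(batch)  # the whole batch waits for its longest request
    return steps


def inflight_batching_steps(output_lens, max_batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    queue = deque(output_lens)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Fill any free slots from the queue before the next decode step.
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        steps += 1
        # One decode step emits one token per running request;
        # requests that reach zero remaining tokens leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# Hypothetical workload: output lengths in tokens, all positive.
lens = [8, 2, 2, 2, 8, 2, 2, 2]
static = static_batching_steps(lens, max_batch_size=4)
inflight = inflight_batching_steps(lens, max_batch_size=4)
print(static, inflight)
```

On this made-up workload the static version needs 16 decode steps and the in-flight version 10, because short requests stop holding their slots while the long ones finish.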

Disadvantages:

Engine builds are expensive (1-2 hours for an 8B model)
Engines are model-specific (each quantization needs its own engine)
The Python API is awkward (verbose compared to vLLM)
Support for new models arrives later than in vLLM
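
Since each quantization scheme needs its own engine, it helps to see what INT8 SmoothQuant changes before the build. The sketch below follows the smoothing step as described in the SmoothQuant paper: a per-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha) moves activation outliers into the weights while leaving the matmul result unchanged. It assumes pure-Python lists, alpha = 0.5, and nonzero channel maxima. The helper names are hypothetical, and the INT8 rounding itself is not shown.

```python
def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)  # activation range of channel j
        w_max = max(abs(w) for w in W[j])        # weight range of channel j
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def apply_smoothing(X, W, scales):
    """Divide activation columns by s and multiply weight rows by s."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Toy data: activation channel 1 carries outliers, as in real LLM layers.
X = [[0.5, 40.0], [-1.0, -60.0]]   # tokens x channels
W = [[0.2, -0.4], [0.1, 0.3]]      # channels x outputs

scales = smooth_scales(X, W)
X_s, W_s = apply_smoothing(X, W, scales)

before = matmul(X, W)
after = matmul(X_s, W_s)
outlier_before = max(abs(row[1]) for row in X)
outlier_after = max(abs(row[1]) for row in X_s)
print(outlier_before, outlier_after)
```

With alpha = 0.5 the activation and weight ranges of each channel end up equal, so neither side dominates the INT8 quantization error.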

bash

# === TensorRT-LLM Engine build (Llama 3.1 8B FP8) ===
# 1. TRT-LLM container
docker run --gpus all -it --rm \
    -v $PWD:/workspace \
    nvcr.io/nvidia/tensorrt:24.10-py3 bash
 
# 2. Install
pip install tensorrt-llm
 
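# 3. Quantize HF → FP8 TRT-LLM checkpoint
#    FP8 calibration goes through examples/quantization/quantize.py in the
#    TensorRT-LLM repo. Flag names and accepted calibration dataset names
#    vary by release, so check `python quantize.py --help` first.
python quantize.py \
    --model_dir meta-llama/Meta-Llama-3.1-8B-Instruct \
    --output_dir trt_ckpt \
    --dtype bfloat16 \
    --qformat fp8 \
    --calib_dataset c4 \
    --calib_size 512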
 
# 4. Build engine (1-2 hours on an RTX 4090)
#    FP8 is picked up from the quantized checkpoint, so no FP8 flag is passed here.
trtllm-build \
    --checkpoint_dir trt_ckpt \
    --output_dir trt_engine \
    --gemm_plugin auto \
    --gpt_attention_plugin auto \
    --max_batch_size 32 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --max_num_tokens 16384 \
    --paged_kv_cache enable
 
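# 5. Inference (run.py lives in the TensorRT-LLM repo's examples directory)
python run.py --engine_dir trt_engine \
    --tokenizer_dir meta-llama/Meta-Llama-3.1-8B-Instruct \
    --max_output_len 500 \
    --input_text "What is the capital of Türkiye?"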

TensorRT-LLM engine build + inference
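
The build limits above (max_batch_size 32, max_seq_len 8192) are upper bounds, and a quick back-of-the-envelope check shows why paged KV cache matters on a 24 GB card. The sketch below assumes the public Llama 3.1 8B configuration (32 layers, 8 KV heads with grouped-query attention, head dimension 128). The function name is hypothetical, and weights and activation memory are left out.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """KV-cache bytes one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dim 128.
per_token_fp16 = kv_bytes_per_token(32, 8, 128, bytes_per_value=2)
per_token_fp8 = kv_bytes_per_token(32, 8, 128, bytes_per_value=1)

# Worst case: every slot of max_batch_size holds a full max_seq_len sequence.
worst_case_tokens = 32 * 8192

worst_fp16_gib = worst_case_tokens * per_token_fp16 / GIB
worst_fp8_gib = worst_case_tokens * per_token_fp8 / GIB

print(per_token_fp16 // 1024, "KiB per token at FP16")
print(worst_fp16_gib, "GiB worst case at FP16")
print(worst_fp8_gib, "GiB worst case at FP8")
```

The FP16 worst case alone comes to 32 GiB, more than the card holds, so not every request can sit at the maximum length at once. Paged KV cache lets requests share a fixed pool block by block instead of reserving max_seq_len for each one.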

✅ Deliverables

1) Set up the TRT-LLM container. 2) Build the Llama 8B FP8 engine. 3) Compare the same model against vLLM AWQ. 4) Next lesson: 15.6, llama.cpp + Ollama + Modelfile.
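
For the vLLM AWQ comparison in deliverable 3, a small helper keeps the arithmetic consistent across runs. This is a minimal sketch: the function names are hypothetical, and the token counts and timings below are placeholders chosen to show the calculation, not measurements.

```python
def tokens_per_second(generated_tokens, elapsed_seconds):
    """Throughput of one benchmark run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return generated_tokens / elapsed_seconds


def relative_gain_percent(candidate_tps, baseline_tps):
    """How much faster the candidate is than the baseline, in percent."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0


# Placeholder inputs, NOT measured results: replace with your own runs,
# using the same prompts, output lengths, and batch size for both engines.
trt_tps = tokens_per_second(generated_tokens=12_000, elapsed_seconds=10.0)
vllm_tps = tokens_per_second(generated_tokens=10_000, elapsed_seconds=10.0)

gain = relative_gain_percent(trt_tps, vllm_tps)

# Check the result against the 15-30% range this lesson quotes.
in_claimed_range = 15.0 <= gain <= 30.0
print(f"TRT-LLM vs vLLM AWQ: {gain:.1f}% (within 15-30%: {in_claimed_range})")
```

Warm both engines up before timing, and report generated tokens only, so prompt processing does not skew the comparison.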
