TensorRT-LLM: NVIDIA Native Engine — INT8 SmoothQuant + FP8 + In-Flight Batching
TensorRT-LLM is NVIDIA's LLM-specific TensorRT engine: native CUDA kernels for Hopper/Ada and the fastest inference path (+15-30% throughput vs vLLM). Covers the engine build process, INT8 SmoothQuant, FP8 quantization, and multi-LoRA. Hands-on: Llama 3.1 8B TRT-LLM engine build (~1 h) + inference on an RTX 4090.
Şükrü Yusuf KAYA
28 min read
Advanced
1. TensorRT-LLM — Why and Why Not
Advantages:
- NVIDIA-native, highest throughput (+15-30% vs AWQ on vLLM on an RTX 4090)
- Native FP8 inference + SmoothQuant INT8
- In-flight batching (the equivalent of vLLM's continuous batching)
- Multi-LoRA support
- Integrates with Triton Inference Server
Disadvantages:
- Engine builds are expensive (1-2 hours for an 8B model)
- Engines are model-specific (a separate engine per quantization config)
- The Python API is clunky (verbose compared to vLLM)
- New model support lands later than in vLLM
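The in-flight batching item above can be sketched as a toy scheduler (plain Python, not TRT-LLM's actual scheduler; request IDs and token counts are made up): finished sequences free their batch slot immediately, and queued requests join mid-generation instead of waiting for the whole batch to drain.

```python
from collections import deque

def inflight_batching(requests, max_batch_size):
    """Toy in-flight batching: each request is (id, tokens_to_generate).
    Returns per-step batch occupancy, showing slots being refilled as
    soon as a sequence finishes (no waiting for the full batch)."""
    queue = deque(requests)
    active = {}          # id -> remaining tokens
    timeline = []
    while queue or active:
        # Admit queued requests into free slots (the "in-flight" part).
        while queue and len(active) < max_batch_size:
            rid, n = queue.popleft()
            active[rid] = n
        timeline.append(sorted(active))
        # One decode step: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # leaves immediately, frees its slot
    return timeline

# Three requests, batch size 2: "c" starts as soon as "a" finishes.
steps = inflight_batching([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2)
print(steps)  # [['a', 'b'], ['b', 'c'], ['b', 'c']]
```

With static batching, "c" would have to wait until both "a" and "b" finished; here it occupies "a"'s slot one step later.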
```bash
# === TensorRT-LLM engine build (Llama 3.1 8B FP8) ===

# 1. TRT-LLM container
docker run --gpus all -it --rm \
    -v $PWD:/workspace \
    nvcr.io/nvidia/tensorrt:24.10-py3 bash

# 2. Install
pip install tensorrt-llm

# 3. Convert HF checkpoint -> FP8 TRT-LLM checkpoint (with calibration)
python convert_checkpoint.py \
    --model_dir meta-llama/Meta-Llama-3.1-8B-Instruct \
    --output_dir trt_ckpt \
    --dtype bfloat16 \
    --use_fp8 \
    --calib_dataset c4 \
    --calib_size 512

# 4. Build the engine (1-2 hours on an RTX 4090).
#    FP8 is already baked into the converted checkpoint.
trtllm-build \
    --checkpoint_dir trt_ckpt \
    --output_dir trt_engine \
    --gemm_plugin auto \
    --gpt_attention_plugin auto \
    --max_batch_size 32 \
    --max_input_len 4096 \
    --max_seq_len 8192 \
    --max_num_tokens 16384 \
    --paged_kv_cache enable

# 5. Inference
python run.py --engine_dir trt_engine \
    --tokenizer_dir meta-llama/Meta-Llama-3.1-8B-Instruct \
    --max_output_len 500 \
    --input_text "What is the capital of Türkiye?"
```
TensorRT-LLM engine build + inference
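The build above takes the FP8 path; TRT-LLM's other quantization route is SmoothQuant INT8. A minimal numeric sketch of SmoothQuant's core idea, in plain Python rather than TRT-LLM code: a per-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1-α) (α = 0.5 here) divides the activations and multiplies the weights, so the matmul result is unchanged but activation outliers shrink into the INT8-friendly range. The channel values below are made-up examples.

```python
def smoothquant_scales(act_absmax, w_absmax, alpha=0.5):
    """Per-channel smoothing factors s_j = max|X_j|^a / max|W_j|^(1-a).
    Activations are divided by s_j, weights multiplied by s_j, so
    X @ W is mathematically unchanged while activation outliers shrink."""
    return [a ** alpha / w ** (1 - alpha)
            for a, w in zip(act_absmax, w_absmax)]

# Channel 2 has an activation outlier (absmax 64 vs weight absmax 1).
act = [4.0, 2.0, 64.0]
w   = [1.0, 1.0, 1.0]
s = smoothquant_scales(act, w)
smoothed_act = [a / sj for a, sj in zip(act, s)]
print(smoothed_act)  # dynamic range drops from 64/2 = 32x to ~5.7x
```

The smaller dynamic range is what makes per-tensor INT8 quantization of the activations viable; the migrated difficulty lands in the weights, which tolerate it better.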
✅ Deliverables
1. Set up the TRT-LLM container.
2. Build the Llama 8B FP8 engine.
3. Benchmark the same model against vLLM with AWQ.
4. Next lesson: 15.6 — llama.cpp + Ollama + Modelfile.