TGI (HuggingFace Text Generation Inference): Production HF Endpoint Internals
TGI is HuggingFace's production inference server and the engine behind hf.co/inference-endpoints. It is a Rust + Python hybrid with built-in Prometheus metrics and multi-GPU (sharded) serving; compared to vLLM, it batches more aggressively and hard-wires FlashAttention-2. This lesson serves Llama 3.1 8B with the TGI Docker image on an RTX 4090.
Şükrü Yusuf KAYA
22 min read
Advanced

```bash
# === Serving Llama 3.1 8B with TGI (RTX 4090) ===
# Note: --quantize awq expects AWQ-quantized weights; point --model-id at an
# AWQ checkpoint of Llama 3.1 8B Instruct (or drop the flag to serve fp16).
docker run --rm --gpus all -p 8080:80 \
    -v $PWD/data:/data \
    -e HF_TOKEN=$HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:3.0 \
    --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantize awq \
    --max-input-tokens 4096 \
    --max-total-tokens 8192 \
    --max-batch-prefill-tokens 16384 \
    --num-shard 1  # single GPU

# Test
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is the population of Istanbul?", "parameters": {"max_new_tokens": 100}}'

# Metrics (Prometheus format: tgi_request_count, tgi_batch_inference_duration, ...)
curl http://localhost:8080/metrics
```
TGI docker setup
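Beyond the bare /generate route, TGI also exposes a server-sent-events streaming endpoint and an OpenAI-compatible Messages API, so existing OpenAI clients can point at it unchanged. A minimal sketch; the prompt strings are illustrative, and "model": "tgi" is a placeholder value TGI accepts since it serves a single model:

```bash
# Streaming: tokens arrive as SSE chunks while generation runs
curl http://localhost:8080/generate_stream \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Explain KV caching in one sentence.", "parameters": {"max_new_tokens": 64}}'

# OpenAI-compatible Messages API (/v1/chat/completions)
curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"model": "tgi", "messages": [{"role": "user", "content": "What is the population of Istanbul?"}], "max_tokens": 100}'
```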
✅ Deliverables
1) Serve the model with TGI Docker. 2) Set up a Prometheus + Grafana dashboard (a minimal setup sketch follows below). 3) Next lesson: 15.5 (TensorRT-LLM).
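For deliverable 2, a minimal Prometheus scrape job pointed at TGI's /metrics endpoint is enough to get a dashboard started, with Grafana reading from Prometheus. A sketch under assumptions: TGI listens on host port 8080, the stock prom/prometheus and grafana/grafana images are used, and host.docker.internal is mapped via host-gateway (container names and ports are illustrative):

```bash
# Scrape config: poll TGI's /metrics every 5 s
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: tgi
    static_configs:
      - targets: ['host.docker.internal:8080']   # TGI running on the host
EOF

# Prometheus on :9090, Grafana on :3000
docker run -d --name prometheus -p 9090:9090 \
    --add-host=host.docker.internal:host-gateway \
    -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus

docker run -d --name grafana -p 3000:3000 \
    --add-host=host.docker.internal:host-gateway \
    grafana/grafana
# In Grafana, add http://host.docker.internal:9090 as a Prometheus data source,
# then chart tgi_request_count and tgi_batch_inference_duration.
```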