TGI (HuggingFace Text Generation Inference): Production HF Endpoint Internals

TGI is HuggingFace's production inference server and runs underneath hf.co/inference-endpoints. It is a Rust + Python hybrid with Prometheus metrics and multi-GPU support. Compared with vLLM, it batches more aggressively and hard-codes Flash-Attention 2. This section serves Llama 3.1 8B from the TGI Docker image on an RTX 4090.

Şükrü Yusuf KAYA

22 min read

14.05.2026

Advanced

bash

# === Serve Llama 3.1 8B with TGI (RTX 4090) ===
# Note: --quantize awq expects AWQ-quantized weights. If the --model-id repo
# ships unquantized weights, point it at an AWQ checkpoint or drop the flag.
docker run --rm --gpus all -p 8080:80 \
    -v "$PWD/data:/data" \
    -e HF_TOKEN="$HF_TOKEN" \
    ghcr.io/huggingface/text-generation-inference:3.0 \
    --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantize awq \
    --max-input-tokens 4096 \
    --max-total-tokens 8192 \
    --max-batch-prefill-tokens 16384 \
    --num-shard 1                       # single GPU
 
# Test
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is the population of Istanbul?", "parameters": {"max_new_tokens": 100}}'
 
# Metrics
curl http://localhost:8080/metrics
# Prometheus format: tgi_request_count, tgi_batch_inference_duration, ...
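
The three token limits passed to the launcher constrain each other, so a quick arithmetic check before starting the container can save a failed launch. The sketch below is a hypothetical helper (`check_tgi_limits` is not part of TGI). It assumes the launcher requires `--max-input-tokens` to be strictly below `--max-total-tokens` and `--max-batch-prefill-tokens` to be at least `--max-input-tokens`. The "prompts per prefill" figure is only a rough upper bound, not a statement about how TGI actually schedules batches.

```shell
#!/usr/bin/env bash
# Values copied from the docker command in this section.
MAX_INPUT_TOKENS=4096
MAX_TOTAL_TOKENS=8192
MAX_BATCH_PREFILL_TOKENS=16384

# Hypothetical helper, not a TGI command: sanity-check the three limits
# and print the budgets they imply.
check_tgi_limits() {
    local max_input=$1 max_total=$2 max_prefill=$3

    # Assumed constraint: total (prompt + generated) must exceed the prompt limit.
    if [ "$max_input" -ge "$max_total" ]; then
        echo "max-input-tokens must be < max-total-tokens" >&2
        return 1
    fi
    # Assumed constraint: one maximum-length prompt must fit in a prefill pass.
    if [ "$max_input" -gt "$max_prefill" ]; then
        echo "max-batch-prefill-tokens must be >= max-input-tokens" >&2
        return 1
    fi

    # Tokens left for generation when the prompt uses the full input limit.
    echo "max_new_tokens_at_full_prompt=$(( max_total - max_input ))"
    # Rough upper bound on full-length prompts prefilled together.
    echo "full_prompts_per_prefill=$(( max_prefill / max_input ))"
}

check_tgi_limits "$MAX_INPUT_TOKENS" "$MAX_TOTAL_TOKENS" "$MAX_BATCH_PREFILL_TOKENS"
```

With the values used here, a request that fills all 4096 input tokens can still generate up to 4096 new tokens, and at most four such prompts fit into one 16384-token prefill budget. Shorter prompts leave more room on both counts.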