TGI (HuggingFace Text Generation Inference): Production HF Endpoint Internals

TGI is HuggingFace's production inference server and the engine behind hf.co/inference-endpoints. Rust + Python hybrid (Rust router in front of Python model shards), Prometheus metrics, multi-GPU sharding. Compared to vLLM: more aggressive batching and Flash Attention 2 baked in. This lesson: serving Llama 3.1 8B with the TGI Docker image on an RTX 4090.

Şükrü Yusuf KAYA
22 min read
Advanced
```bash
# === Serve Llama 3.1 8B with TGI (RTX 4090) ===
# NOTE: --quantize awq expects an AWQ-quantized checkpoint; with the bf16
# meta-llama weights below, drop the flag or point --model-id at an AWQ repo
# (e.g. hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4).
docker run --rm --gpus all -p 8080:80 \
    -v $PWD/data:/data \
    -e HF_TOKEN=$HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:3.0 \
    --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantize awq \
    --max-input-tokens 4096 \
    --max-total-tokens 8192 \
    --max-batch-prefill-tokens 16384 \
    --num-shard 1  # single GPU

# Test
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is the population of Istanbul?", "parameters": {"max_new_tokens": 100}}'

# Metrics
curl http://localhost:8080/metrics
# Prometheus text format: tgi_request_count, tgi_batch_inference_duration, ...
```
TGI docker setup
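
Beyond /generate, the same server exposes a token-streaming route (/generate_stream, server-sent events) and, since TGI v1.4, an OpenAI-compatible Messages API. A minimal sketch against the container started above:

```bash
# Stream tokens as server-sent events instead of waiting for the full response
curl http://localhost:8080/generate_stream \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is the population of Istanbul?", "parameters": {"max_new_tokens": 100}}'

# OpenAI-compatible Messages API (TGI >= 1.4); "model" is a placeholder here
curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
          "model": "tgi",
          "messages": [{"role": "user", "content": "What is the population of Istanbul?"}],
          "max_tokens": 100
        }'
```
Because of the Messages API, existing OpenAI client code can be pointed at a TGI endpoint without changes.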
✅ Deliverables
  1) Serve the model with the TGI Docker image. 2) Set up a Prometheus + Grafana dashboard (see the sketch below). 3) Next lesson: 15.5, TensorRT-LLM.
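
For deliverable 2, a minimal sketch of wiring Prometheus to TGI's /metrics endpoint, assuming a Linux host and the official prom/prometheus image (job name and scrape interval are illustrative):

```bash
# Minimal Prometheus config that scrapes the TGI /metrics route
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s              # illustrative interval; tune as needed
scrape_configs:
  - job_name: tgi                   # hypothetical job name
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:8080'] # port mapped in the docker run above
EOF

# Run Prometheus beside TGI; --network host lets it reach localhost:8080 (Linux)
docker run --rm --network host \
    -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus
```
Prometheus comes up on http://localhost:9090; add it as a Grafana data source and plot e.g. rate(tgi_request_count[1m]) for request throughput.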
