TGI (HuggingFace Text Generation Inference): Production HF Endpoint Internals
TGI is HuggingFace's production inference server and the engine behind hf.co/inference-endpoints. It is a Rust + Python hybrid with built-in Prometheus metrics and multi-GPU (sharded) serving; compared to vLLM, it batches more aggressively and hard-wires FlashAttention-2. This lesson serves Llama 3.1 8B with the TGI Docker image on an RTX 4090.
Şükrü Yusuf KAYA
22 min read
Advanced

```bash
# === Serving Llama 3.1 8B with TGI (RTX 4090) ===
# Note: --quantize awq expects AWQ-quantized weights; point --model-id at an
# AWQ checkpoint of Llama 3.1 8B Instruct (or drop the flag to serve fp16).
docker run --rm --gpus all -p 8080:80 \
    -v $PWD/data:/data \
    -e HF_TOKEN=$HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:3.0 \
    --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantize awq \
    --max-input-tokens 4096 \
    --max-total-tokens 8192 \
    --max-batch-prefill-tokens 16384 \
    --num-shard 1  # single GPU

# Test
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is the population of Istanbul?", "parameters": {"max_new_tokens": 100}}'

# Metrics (Prometheus format: tgi_request_count, tgi_batch_inference_duration, ...)
curl http://localhost:8080/metrics
```
TGI docker setup
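Beyond the bare /generate route, TGI also exposes a server-sent-events streaming endpoint and an OpenAI-compatible Messages API, so existing OpenAI clients can point at it unchanged. A minimal sketch; the prompt strings are illustrative, and "model": "tgi" is a placeholder value TGI accepts since it serves a single model:

```bash
# Streaming: tokens arrive as SSE chunks while generation runs
curl http://localhost:8080/generate_stream \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Explain KV caching in one sentence.", "parameters": {"max_new_tokens": 64}}'

# OpenAI-compatible Messages API (/v1/chat/completions)
curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"model": "tgi", "messages": [{"role": "user", "content": "What is the population of Istanbul?"}], "max_tokens": 100}'
```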
✅ Deliverables
1) Serve the model with TGI Docker. 2) Set up a Prometheus + Grafana dashboard (a minimal setup sketch follows below). 3) Next lesson: 15.5 (TensorRT-LLM).
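For deliverable 2, a minimal Prometheus scrape job pointed at TGI's /metrics endpoint is enough to get a dashboard started, with Grafana reading from Prometheus. A sketch under assumptions: TGI listens on host port 8080, the stock prom/prometheus and grafana/grafana images are used, and host.docker.internal is mapped via host-gateway (container names and ports are illustrative):

```bash
# Scrape config: poll TGI's /metrics every 5 s
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: tgi
    static_configs:
      - targets: ['host.docker.internal:8080']   # TGI running on the host
EOF

# Prometheus on :9090, Grafana on :3000
docker run -d --name prometheus -p 9090:9090 \
    --add-host=host.docker.internal:host-gateway \
    -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus

docker run -d --name grafana -p 3000:3000 \
    --add-host=host.docker.internal:host-gateway \
    grafana/grafana
# In Grafana, add http://host.docker.internal:9090 as a Prometheus data source,
# then chart tgi_request_count and tgi_batch_inference_duration.
```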