vLLM Production Serving: 10x Throughput with Paged Attention + Continuous Batching
vLLM production deployment: paged attention (Kwon et al., 2023), continuous batching, an OpenAI-compatible API, multi-GPU tensor-parallel serving, and Kubernetes deployment patterns. Llama-3-8B plus a custom Turkish model serving 1000+ concurrent users.
Şükrü Yusuf KAYA
75 min read
Advanced 🚀 vLLM: the de facto standard for production LLM serving
You trained your own Turkish Llama-3 model (Module 14 capstone). Now serve it to 1000 users. Naive PyTorch inference is slow and memory-inefficient. vLLM (Berkeley, 2023) is the solution: paged attention + continuous batching + an OpenAI-compatible API, for 10-30x the throughput of naive serving. Llama-3-8B handles ~60 concurrent users on a single H100. 75 minutes from now you will have a grasp of vLLM's architecture, its deployment patterns, and production troubleshooting.
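Before the section-by-section tour, a toy sketch of the core mechanism (concept only, not vLLM's actual code): the KV cache is split into fixed-size blocks, and a per-sequence block table maps logical blocks to physical ones, so memory is claimed as a sequence grows instead of being reserved for the full max-model-len up front. The block size and pool size below are illustrative.

```python
# Toy paged-attention bookkeeping (concept only, not vLLM internals).
BLOCK_SIZE = 16  # tokens per KV-cache block; illustrative

free_blocks = list(range(1024))          # physical block pool on the GPU
block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

def append_token(seq_id: int, seq_len: int) -> None:
    """Grab a fresh physical block only when the sequence fills its last one."""
    table = block_tables.setdefault(seq_id, [])
    if seq_len % BLOCK_SIZE == 0:  # current block full (or sequence is new)
        table.append(free_blocks.pop())

# Two sequences grow in the same pool; neither pre-reserves worst-case memory.
for step in range(40):
    append_token(seq_id=0, seq_len=step)
    append_token(seq_id=1, seq_len=step)
print(block_tables)  # {0: [1023, 1021, 1019], 1: [1022, 1020, 1018]}
```

Real vLLM additionally shares blocks across sequences with common prefixes and frees them when requests finish; Section 2 points back to Module 8.4 for the full treatment.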
Lesson Map (10 Sections)
- Naive inference problems: why vLLM
- Paged attention recap: from Module 8.4
- Continuous batching: dynamic request management
- OpenAI-compatible API: drop-in replacement
- Multi-GPU tensor parallel: vLLM TP
- Turkish model serving: sukruyusufkaya/llama-3-8b-tr-instruct
- Kubernetes deployment: production patterns
- Monitoring: Prometheus + Grafana
- Cost economics: self-host vs the OpenAI API
- Troubleshooting: common production issues
```python
# vLLM production deployment

# 1. Install
#    pip install vllm

# 2. Start server (OpenAI-compatible API)
#    python -m vllm.entrypoints.openai.api_server \
#        --model sukruyusufkaya/llama-3-8b-tr-instruct \
#        --tensor-parallel-size 1 \
#        --gpu-memory-utilization 0.9 \
#        --max-model-len 8192 \
#        --port 8000

# 3. Use as an OpenAI API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any-key",  # vLLM accepts any key
)

response = client.chat.completions.create(
    model="sukruyusufkaya/llama-3-8b-tr-instruct",
    messages=[
        # "You are a helpful Turkish-speaking assistant."
        {"role": "system", "content": "Sen Türkçe konuşan yardımcı bir asistansın."},
        # "Which are Istanbul's most famous mosques?"
        {"role": "user", "content": "İstanbul'un en ünlü camileri hangileridir?"},
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)

# 4. Multi-GPU (Llama-3-70B)
#    python -m vllm.entrypoints.openai.api_server \
#        --model meta-llama/Meta-Llama-3-70B-Instruct \
#        --tensor-parallel-size 4 \
#        --gpu-memory-utilization 0.9 \
#        --max-model-len 8192

# 5. Quantization (faster, less memory)
#    python -m vllm.entrypoints.openai.api_server \
#        --model TheBloke/Llama-3-8B-Instruct-AWQ \
#        --quantization awq \
#        --gpu-memory-utilization 0.9
```
vLLM production deployment commands
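Continuous batching is easiest to appreciate from the client side: fire many requests at once and watch aggregate throughput scale far better than sequential calls. A minimal sketch, assuming the server from step 2 is running on localhost:8000; the request count, prompt, and token limit are illustrative.

```python
# Concurrency sketch: N simultaneous chat requests, scheduled by vLLM's
# continuous batching. Assumes the step-2 server on localhost:8000.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="any-key")

async def one_request(i: int) -> int:
    response = await client.chat.completions.create(
        model="sukruyusufkaya/llama-3-8b-tr-instruct",
        # "Give a short answer: what is i + i?"
        messages=[{"role": "user", "content": f"Kısa bir cevap ver: {i} + {i} kaç eder?"}],
        max_tokens=50,
    )
    return response.usage.completion_tokens

async def main(n: int = 32) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(n)))
    elapsed = time.perf_counter() - start
    # With continuous batching, total wall time grows sublinearly in n.
    print(f"{n} requests, {sum(tokens)} tokens in {elapsed:.1f}s "
          f"({sum(tokens) / elapsed:.0f} tok/s aggregate)")

asyncio.run(main())
```

Running the same loop sequentially gives a baseline; the gap between the two numbers is the batching win.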
7-10. Kubernetes + Monitoring + Cost
7.1 Kubernetes deployment
```yaml
# vllm-llama-tr.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-tr-instruct
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-3-tr
  template:
    metadata:
      labels:
        app: llama-3-tr
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - sukruyusufkaya/llama-3-8b-tr-instruct
            - --gpu-memory-utilization
            - "0.9"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 64Gi
            requests:
              nvidia.com/gpu: 1
              memory: 32Gi
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
```
7.2 HorizontalPodAutoscaler
GPU replicas should autoscale on serving metrics rather than CPU: request queue length and latency reflect actual LLM load, and reach the HPA through a custom-metrics adapter (e.g. prometheus-adapter).
7.3 Prometheus + Grafana
vLLM exposes a /metrics endpoint:
- Request count, latency, errors
- Token throughput
- GPU utilization, memory
- KV cache utilization
Grafana dashboard: real-time monitoring.
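A quick way to eyeball these without a full Prometheus stack is to fetch the endpoint directly. A minimal sketch, assuming the server runs on localhost:8000; metric names carry a "vllm:" prefix but vary across vLLM versions.

```python
# Peek at vLLM's Prometheus metrics without Prometheus itself.
# Assumes the server on localhost:8000; metric names differ across versions.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text

for line in text.splitlines():
    # Skip the "# HELP" / "# TYPE" comment lines, keep vLLM's own series.
    if line.startswith("vllm:"):
        print(line)
```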
7.4 Cost economics
Self-hosting the Turkish Llama-3-8B:
- Single H100 (spot): ~$1,800/month
- 60 concurrent users, ~5K tokens/user/day
- 1M API calls/month
- Cost per call: $0.0018
vs the OpenAI API (GPT-4o, $10/1M output tokens):
- 1M calls × (1.5K avg input tokens + 0.5K output tokens): the output alone costs $5K; total ≈ $8K/month
Self-host at ~$1.8K vs ~$8K/month for the API: a 4-5x cost saving at this scale.
Break-even sits around 100K users/day: below that the OpenAI API is cheaper, above it self-hosting wins.
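The arithmetic above fits in a few lines. A sketch using only this lesson's figures; the API input-token price is not given here, so only the $5K output side is computed and the ~$8K total is taken from the text.

```python
# Back-of-envelope cost comparison with this lesson's figures
# (assumptions from the text, not live prices).
H100_SPOT_MONTHLY = 1_800            # USD/month, single H100 spot
CALLS_PER_MONTH = 1_000_000
AVG_OUTPUT_TOKENS = 500
API_OUTPUT_PRICE = 10 / 1_000_000    # USD per output token (GPT-4o class)

self_host_per_call = H100_SPOT_MONTHLY / CALLS_PER_MONTH
api_output_monthly = CALLS_PER_MONTH * AVG_OUTPUT_TOKENS * API_OUTPUT_PRICE

print(f"Self-host cost/call:   ${self_host_per_call:.4f}")   # $0.0018
print(f"API output cost/month: ${api_output_monthly:,.0f}")  # $5,000 (+ input -> ~$8K)
```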
7.5 Production troubleshooting
Common issues:
- OOM: lower --gpu-memory-utilization or --max-model-len
- Slow first request: expected warmup, initial KV-cache allocation (see the readiness sketch after this list)
- Random latency spikes: check whether KV-cache blocks are being swapped to CPU memory
- Multi-GPU NCCL timeout: increase --nccl-timeout
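For the warmup issue in particular, a simple client-side guard that mirrors the Kubernetes readinessProbe above: poll /health before routing traffic. A sketch assuming the default endpoint on port 8000; the timeout values are illustrative.

```python
# Wait for vLLM's /health endpoint before sending traffic to the server.
import time

import requests

def wait_until_ready(url: str = "http://localhost:8000/health",
                     timeout_s: int = 300) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                print("vLLM server is ready")
                return
        except requests.ConnectionError:
            pass  # still loading weights / allocating KV cache
        time.sleep(5)
    raise TimeoutError("vLLM server did not become ready in time")

wait_until_ready()
```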
✅ Lesson 16.1 Summary: vLLM Production
vLLM is the de facto standard for production LLM serving: paged attention + continuous batching behind an OpenAI-compatible API. The Turkish Llama-3 serves ~60 concurrent users on a single H100. Kubernetes deployment plus Prometheus monitoring. Self-hosting at ~$1.8K vs ~$8K/month for the API: a 4-5x saving at medium scale. In Lesson 16.2 we move on to quantization and the capstone.
Next Lesson: Quantization + Capstone
Lesson 16.2 (final): GPTQ, AWQ, and GGUF quantization for production deployment, plus the capstone: a Turkish ChatGPT clone (front-end + vLLM backend).
Frequently Asked Questions
vLLM vs TGI vs SGLang?
vLLM: the most popular option, with mature paged attention. TGI: HuggingFace-native, easier to deploy. SGLang: a novel programming model with prefix caching. The 2026 mainstream choice is vLLM; for tight HF integration, TGI.