
vLLM Production Serving: 10x Throughput with Paged Attention + Continuous Batching

vLLM production deployment: paged attention (Kwon 2023), continuous batching, OpenAI-compatible API, multi-GPU tensor-parallel serving, Kubernetes deployment patterns. Serving Llama-3-8B plus a custom Turkish model to 1,000+ concurrent users.

Şükrü Yusuf KAYA
75-minute read
Advanced
🚀 vLLM – the de facto standard for production LLM serving
You trained your own Turkish Llama-3 model (the Module 14 capstone). Now serve it to 1,000 users. Naive PyTorch inference is too slow and memory-inefficient. vLLM (Berkeley, 2023) is the fix: paged attention + continuous batching behind an OpenAI-compatible API, delivering 10-30x the throughput of naive serving. Llama-3-8B handles ~60 concurrent users on a single H100. 75 minutes from now you will understand vLLM's architecture, its deployment patterns, and production troubleshooting.

Lesson Map (10 Sections)#

  1. Naive inference problems – why vLLM
  2. Paged attention recap – from Module 8.4
  3. Continuous batching – dynamic request management
  4. OpenAI-compatible API – drop-in replacement
  5. Multi-GPU tensor parallel – vLLM TP
  6. Turkish model serving – sukruyusufkaya/llama-3-8b-tr-instruct
  7. Kubernetes deployment – production patterns
  8. Monitoring – Prometheus + Grafana
  9. Cost economics – self-host vs OpenAI API
  10. Troubleshooting – common production issues
python
# vLLM production deployment

# 1. Install
#   pip install vllm

# 2. Start the server (OpenAI-compatible API)
#   python -m vllm.entrypoints.openai.api_server \
#     --model sukruyusufkaya/llama-3-8b-tr-instruct \
#     --tensor-parallel-size 1 \
#     --gpu-memory-utilization 0.9 \
#     --max-model-len 8192 \
#     --port 8000

# 3. Use it like the OpenAI API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any-key",  # vLLM accepts any key by default
)

response = client.chat.completions.create(
    model="sukruyusufkaya/llama-3-8b-tr-instruct",
    messages=[
        # "You are a helpful assistant that speaks Turkish."
        {"role": "system", "content": "Sen Türkçe konuşan yardımcı bir asistansın."},
        # "Which are Istanbul's most famous mosques?"
        {"role": "user", "content": "İstanbul'un en ünlü camileri hangileridir?"},
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)

# 4. Multi-GPU (Llama-3-70B)
#   python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Meta-Llama-3-70B-Instruct \
#     --tensor-parallel-size 4 \
#     --gpu-memory-utilization 0.9 \
#     --max-model-len 8192

# 5. Quantization (faster, less memory)
#   python -m vllm.entrypoints.openai.api_server \
#     --model TheBloke/Llama-3-8B-Instruct-AWQ \
#     --quantization awq \
#     --gpu-memory-utilization 0.9
vLLM production deployment commands
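
To sanity-check the "~60 concurrent users on a single H100" claim against your own hardware, here is a minimal load-test sketch; the concurrency level, prompt, and token counts are illustrative assumptions, not measured values:

python
# Fire N concurrent chat requests at the server above and measure throughput.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="any-key")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="sukruyusufkaya/llama-3-8b-tr-instruct",
        # "What is the capital of Turkey?"
        messages=[{"role": "user", "content": "Türkiye'nin başkenti neresidir?"}],
        max_tokens=100,
    )
    return resp.usage.completion_tokens

async def main(concurrency: int = 60) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{concurrency} requests in {elapsed:.1f}s, "
          f"{sum(tokens) / elapsed:.0f} output tokens/s")

asyncio.run(main())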

7-10. Kubernetes + Monitoring + Cost + Troubleshooting#

7.1 Kubernetes deployment#

yaml
# vllm-llama-tr.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-tr-instruct
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-3-tr
  template:
    metadata:
      labels:
        app: llama-3-tr
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - sukruyusufkaya/llama-3-8b-tr-instruct
        - --gpu-memory-utilization
        - "0.9"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 64Gi
          requests:
            nvidia.com/gpu: 1
            memory: 32Gi
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60

7.2 HorizontalPodAutoscaler#

GPU autoscaling keys off serving metrics (pending-request queue length, latency) rather than CPU; a minimal manifest sketch follows below.
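
A minimal HPA sketch, assuming the pending-request queue length has been exported to the Kubernetes custom-metrics API (e.g. via Prometheus Adapter) under the hypothetical name vllm_requests_waiting; the metric name and target value are illustrative, not vLLM defaults:

yaml
# vllm-hpa.yaml - scale on queued requests per pod (illustrative)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-3-tr-instruct
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-tr-instruct
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_requests_waiting   # assumed custom metric via Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "10"            # scale up when >10 queued requests per pod

Note that new pods take minutes to pull the image and load weights, so keep scale-down conservative and rely on the readinessProbe above before routing traffic.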

8.1 Prometheus + Grafana#

vLLM exposes a Prometheus-format /metrics endpoint reporting:
  • Request count, latency, errors
  • Token throughput
  • GPU utilization, memory
  • KV cache utilization
Point a Grafana dashboard at these for real-time monitoring; a quick scrape sketch follows below.
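
To eyeball the numbers without Grafana, scrape the endpoint directly. A minimal sketch; exact metric names vary across vLLM versions, so treat the vllm: prefix filter as an assumption to verify against your build:

python
# Print vLLM's Prometheus metrics from the server started above.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    if line.startswith("vllm:"):  # vLLM-prefixed series; '# HELP/TYPE' lines are skipped
        print(line)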

9.1 Cost economics#

Self-hosting the Turkish Llama-3-8B:
  • Single H100 (spot, ~$2.5/hr): ~$1,800/month
  • 60 concurrent users, ~5K tokens/user/day
  • ~1M API calls/month
  • Cost per call: ~$0.0018
vs the OpenAI API (GPT-4o: $2.5 per 1M input tokens + $10 per 1M output tokens):
  • 1M calls × ~1.5K input + ~0.5K output tokens each ≈ $3.75K + $5K ≈ $8.75K/month
Self-host ~$1,800 vs OpenAI ~$8,750: roughly a 4-5x cost saving at this scale.
Break-even sits around 100K users/day; below that the OpenAI API is cheaper, above it self-hosting wins.
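
The same arithmetic as a small script, so you can plug in your own traffic profile (all prices and token counts are the assumptions from the text above, not live pricing):

python
# Back-of-envelope: self-host vs OpenAI API cost at a given traffic level.
CALLS_PER_MONTH = 1_000_000
IN_TOKENS, OUT_TOKENS = 1_500, 500      # assumed average tokens per call
H100_MONTHLY = 2.5 * 24 * 30            # $2.5/hr spot -> ~$1,800/month
PRICE_IN, PRICE_OUT = 2.5, 10.0         # GPT-4o, $ per 1M tokens (from the text)

api_cost = (CALLS_PER_MONTH * IN_TOKENS / 1e6) * PRICE_IN \
         + (CALLS_PER_MONTH * OUT_TOKENS / 1e6) * PRICE_OUT
print(f"self-host: ${H100_MONTHLY:,.0f}/mo (${H100_MONTHLY / CALLS_PER_MONTH:.4f}/call)")
print(f"OpenAI:    ${api_cost:,.0f}/mo ({api_cost / H100_MONTHLY:.1f}x self-host)")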

10.1 Production troubleshooting#

Common issues:
  • OOM at startup or under load: reduce --gpu-memory-utilization or --max-model-len (see the sketch below)
  • Slow first request: expected warmup, including the initial KV-cache allocation
  • Random latency spikes: check whether the KV cache is being swapped to CPU
  • Multi-GPU NCCL timeout: increase --nccl-timeout
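
For the OOM case, the same knobs exist in vLLM's offline Python API; a minimal sketch with illustrative (not tuned) values:

python
# Trade context length and KV-cache headroom for stability under memory pressure.
from vllm import LLM, SamplingParams

llm = LLM(
    model="sukruyusufkaya/llama-3-8b-tr-instruct",
    gpu_memory_utilization=0.80,  # down from 0.9 if the server OOMs at startup
    max_model_len=4096,           # down from 8192; caps KV cache per sequence
)
# "Hello, how are you?"
out = llm.generate(["Merhaba, nasılsın?"], SamplingParams(max_tokens=50))
print(out[0].outputs[0].text)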
✅ Lesson 16.1 Summary – vLLM Production
vLLM is the de facto standard for production LLM serving: paged attention + continuous batching behind an OpenAI-compatible API. The Turkish Llama-3 serves ~60 concurrent users on a single H100; deploy on Kubernetes with Prometheus monitoring. Self-hosting runs ~$1,800/month vs ~$8.7K/month on the OpenAI API, a 4-5x saving at medium scale. Lesson 16.2 moves on to quantization + the capstone.

Next Lesson: Quantization + Capstone#

Lesson 16.2 (final): GPTQ, AWQ, and GGUF quantization for production deployment, plus the capstone: a Turkish ChatGPT clone (front-end + vLLM backend).

Frequently Asked Questions

vLLM vs TGI vs SGLang: which should you pick?
vLLM: most popular, mature paged attention. TGI: HuggingFace-native, easier to deploy. SGLang: novel programming model with prefix caching. The 2026 mainstream choice is vLLM; for tight HF integration, TGI.

