
vLLM Production Serving: 10x Throughput with Paged Attention + Continuous Batching

vLLM production deployment: paged attention (Kwon 2023), continuous batching, OpenAI-compatible API, multi-GPU tensor-parallel serving, Kubernetes deployment patterns. Serving Llama-3-8B plus a custom Turkish model to 1,000+ concurrent users.

Şükrü Yusuf KAYA
75-minute read
Advanced
🚀 vLLM – the de facto standard for production LLM serving
You trained your own Turkish Llama-3 model (the Module 14 capstone). Now serve it to 1,000 users. Naive PyTorch inference is too slow and memory-inefficient. vLLM (Berkeley, 2023) is the fix: paged attention + continuous batching behind an OpenAI-compatible API, delivering 10-30x the throughput of naive serving. Llama-3-8B handles ~60 concurrent users on a single H100. 75 minutes from now you will understand vLLM's architecture, its deployment patterns, and production troubleshooting.

Lesson Map (10 Sections)#

  1. Naive inference problems – why vLLM
  2. Paged attention recap – from Module 8.4
  3. Continuous batching – dynamic request management
  4. OpenAI-compatible API – drop-in replacement
  5. Multi-GPU tensor parallel – vLLM TP
  6. Turkish model serving – sukruyusufkaya/llama-3-8b-tr-instruct
  7. Kubernetes deployment – production patterns
  8. Monitoring – Prometheus + Grafana
  9. Cost economics – self-host vs OpenAI API
  10. Troubleshooting – common production issues
python
# vLLM production deployment

# 1. Install
#   pip install vllm

# 2. Start the server (OpenAI-compatible API)
#   python -m vllm.entrypoints.openai.api_server \
#     --model sukruyusufkaya/llama-3-8b-tr-instruct \
#     --tensor-parallel-size 1 \
#     --gpu-memory-utilization 0.9 \
#     --max-model-len 8192 \
#     --port 8000

# 3. Use it like the OpenAI API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any-key",  # vLLM accepts any key by default
)

response = client.chat.completions.create(
    model="sukruyusufkaya/llama-3-8b-tr-instruct",
    messages=[
        # "You are a helpful assistant that speaks Turkish."
        {"role": "system", "content": "Sen Türkçe konuşan yardımcı bir asistansın."},
        # "Which are Istanbul's most famous mosques?"
        {"role": "user", "content": "İstanbul'un en ünlü camileri hangileridir?"},
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)

# 4. Multi-GPU (Llama-3-70B)
#   python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Meta-Llama-3-70B-Instruct \
#     --tensor-parallel-size 4 \
#     --gpu-memory-utilization 0.9 \
#     --max-model-len 8192

# 5. Quantization (faster, less memory)
#   python -m vllm.entrypoints.openai.api_server \
#     --model TheBloke/Llama-3-8B-Instruct-AWQ \
#     --quantization awq \
#     --gpu-memory-utilization 0.9
vLLM production deployment commands
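
To sanity-check the "~60 concurrent users on a single H100" claim against your own hardware, here is a minimal load-test sketch; the concurrency level, prompt, and token counts are illustrative assumptions, not measured values:

python
# Fire N concurrent chat requests at the server above and measure throughput.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="any-key")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="sukruyusufkaya/llama-3-8b-tr-instruct",
        # "What is the capital of Turkey?"
        messages=[{"role": "user", "content": "Türkiye'nin başkenti neresidir?"}],
        max_tokens=100,
    )
    return resp.usage.completion_tokens

async def main(concurrency: int = 60) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{concurrency} requests in {elapsed:.1f}s, "
          f"{sum(tokens) / elapsed:.0f} output tokens/s")

asyncio.run(main())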

7-10. Kubernetes + Monitoring + Cost + Troubleshooting#

7.1 Kubernetes deployment#

yaml
# vllm-llama-tr.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-tr-instruct
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-3-tr
  template:
    metadata:
      labels:
        app: llama-3-tr
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - sukruyusufkaya/llama-3-8b-tr-instruct
        - --gpu-memory-utilization
        - "0.9"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 64Gi
          requests:
            nvidia.com/gpu: 1
            memory: 32Gi
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60

7.2 HorizontalPodAutoscaler#

GPU autoscaling keys off serving metrics (pending-request queue length, latency) rather than CPU; a minimal manifest sketch follows below.
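
A minimal HPA sketch, assuming the pending-request queue length has been exported to the Kubernetes custom-metrics API (e.g. via Prometheus Adapter) under the hypothetical name vllm_requests_waiting; the metric name and target value are illustrative, not vLLM defaults:

yaml
# vllm-hpa.yaml - scale on queued requests per pod (illustrative)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-3-tr-instruct
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-tr-instruct
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_requests_waiting   # assumed custom metric via Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "10"            # scale up when >10 queued requests per pod

Note that new pods take minutes to pull the image and load weights, so keep scale-down conservative and rely on the readinessProbe above before routing traffic.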

8.1 Prometheus + Grafana#

vLLM exposes a Prometheus-format /metrics endpoint reporting:
  • Request count, latency, errors
  • Token throughput
  • GPU utilization, memory
  • KV cache utilization
Point a Grafana dashboard at these for real-time monitoring; a quick scrape sketch follows below.
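
To eyeball the numbers without Grafana, scrape the endpoint directly. A minimal sketch; exact metric names vary across vLLM versions, so treat the vllm: prefix filter as an assumption to verify against your build:

python
# Print vLLM's Prometheus metrics from the server started above.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    if line.startswith("vllm:"):  # vLLM-prefixed series; '# HELP/TYPE' lines are skipped
        print(line)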

9.1 Cost economics#

Self-hosting the Turkish Llama-3-8B:
  • Single H100 (spot, ~$2.5/hr): ~$1,800/month
  • 60 concurrent users, ~5K tokens/user/day
  • ~1M API calls/month
  • Cost per call: ~$0.0018
vs the OpenAI API (GPT-4o: $2.5 per 1M input tokens + $10 per 1M output tokens):
  • 1M calls × ~1.5K input + ~0.5K output tokens each ≈ $3.75K + $5K ≈ $8.75K/month
Self-host ~$1,800 vs OpenAI ~$8,750: roughly a 4-5x cost saving at this scale.
Break-even sits around 100K users/day; below that the OpenAI API is cheaper, above it self-hosting wins.
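
The same arithmetic as a small script, so you can plug in your own traffic profile (all prices and token counts are the assumptions from the text above, not live pricing):

python
# Back-of-envelope: self-host vs OpenAI API cost at a given traffic level.
CALLS_PER_MONTH = 1_000_000
IN_TOKENS, OUT_TOKENS = 1_500, 500      # assumed average tokens per call
H100_MONTHLY = 2.5 * 24 * 30            # $2.5/hr spot -> ~$1,800/month
PRICE_IN, PRICE_OUT = 2.5, 10.0         # GPT-4o, $ per 1M tokens (from the text)

api_cost = (CALLS_PER_MONTH * IN_TOKENS / 1e6) * PRICE_IN \
         + (CALLS_PER_MONTH * OUT_TOKENS / 1e6) * PRICE_OUT
print(f"self-host: ${H100_MONTHLY:,.0f}/mo (${H100_MONTHLY / CALLS_PER_MONTH:.4f}/call)")
print(f"OpenAI:    ${api_cost:,.0f}/mo ({api_cost / H100_MONTHLY:.1f}x self-host)")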

10.1 Production troubleshooting#

Common issues:
  • OOM at startup or under load: reduce --gpu-memory-utilization or --max-model-len (see the sketch below)
  • Slow first request: expected warmup, including the initial KV-cache allocation
  • Random latency spikes: check whether the KV cache is being swapped to CPU
  • Multi-GPU NCCL timeout: increase --nccl-timeout
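
For the OOM case, the same knobs exist in vLLM's offline Python API; a minimal sketch with illustrative (not tuned) values:

python
# Trade context length and KV-cache headroom for stability under memory pressure.
from vllm import LLM, SamplingParams

llm = LLM(
    model="sukruyusufkaya/llama-3-8b-tr-instruct",
    gpu_memory_utilization=0.80,  # down from 0.9 if the server OOMs at startup
    max_model_len=4096,           # down from 8192; caps KV cache per sequence
)
# "Hello, how are you?"
out = llm.generate(["Merhaba, nasılsın?"], SamplingParams(max_tokens=50))
print(out[0].outputs[0].text)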
✅ Lesson 16.1 Summary – vLLM Production
vLLM is the de facto standard for production LLM serving: paged attention + continuous batching behind an OpenAI-compatible API. The Turkish Llama-3 serves ~60 concurrent users on a single H100; deploy on Kubernetes with Prometheus monitoring. Self-hosting runs ~$1,800/month vs ~$8.7K/month on the OpenAI API, a 4-5x saving at medium scale. Lesson 16.2 moves on to quantization + the capstone.

Next Lesson: Quantization + Capstone#

Lesson 16.2 (final): GPTQ, AWQ, and GGUF quantization for production deployment, plus the capstone: a Turkish ChatGPT clone (front-end + vLLM backend).

Frequently Asked Questions

vLLM vs TGI vs SGLang: which should you pick?
vLLM: most popular, mature paged attention. TGI: HuggingFace-native, easier to deploy. SGLang: novel programming model with prefix caching. The 2026 mainstream choice is vLLM; for tight HF integration, TGI.

