vLLM Production Serving: 10x Throughput with Paged Attention + Continuous Batching
vLLM production deployment: paged attention (Kwon et al., 2023), continuous batching, an OpenAI-compatible API, multi-GPU tensor-parallel serving, and Kubernetes deployment patterns. Llama-3-8B plus a custom Turkish model serving 1000+ concurrent users.
Şükrü Yusuf KAYA
75 min read
Advanced 🚀 vLLM: the de facto standard for production LLM serving
You trained your own Turkish Llama-3 model (Module 14 capstone). Now serve it to 1000 users. Naive PyTorch inference is slow and memory-inefficient. vLLM (Berkeley, 2023) is the solution: paged attention + continuous batching + an OpenAI-compatible API, for 10-30x the throughput of naive serving. Llama-3-8B handles ~60 concurrent users on a single H100. 75 minutes from now you will have a grasp of vLLM's architecture, its deployment patterns, and production troubleshooting.
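Before the section-by-section tour, a toy sketch of the core mechanism (concept only, not vLLM's actual code): the KV cache is split into fixed-size blocks, and a per-sequence block table maps logical blocks to physical ones, so memory is claimed as a sequence grows instead of being reserved for the full max-model-len up front. The block size and pool size below are illustrative.

```python
# Toy paged-attention bookkeeping (concept only, not vLLM internals).
BLOCK_SIZE = 16  # tokens per KV-cache block; illustrative

free_blocks = list(range(1024))          # physical block pool on the GPU
block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

def append_token(seq_id: int, seq_len: int) -> None:
    """Grab a fresh physical block only when the sequence fills its last one."""
    table = block_tables.setdefault(seq_id, [])
    if seq_len % BLOCK_SIZE == 0:  # current block full (or sequence is new)
        table.append(free_blocks.pop())

# Two sequences grow in the same pool; neither pre-reserves worst-case memory.
for step in range(40):
    append_token(seq_id=0, seq_len=step)
    append_token(seq_id=1, seq_len=step)
print(block_tables)  # {0: [1023, 1021, 1019], 1: [1022, 1020, 1018]}
```

Real vLLM additionally shares blocks across sequences with common prefixes and frees them when requests finish; Section 2 points back to Module 8.4 for the full treatment.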
Lesson Map (10 Sections)
- Naive inference problems: why vLLM
- Paged attention recap: from Module 8.4
- Continuous batching: dynamic request management
- OpenAI-compatible API: drop-in replacement
- Multi-GPU tensor parallel: vLLM TP
- Turkish model serving: sukruyusufkaya/llama-3-8b-tr-instruct
- Kubernetes deployment: production patterns
- Monitoring: Prometheus + Grafana
- Cost economics: self-host vs the OpenAI API
- Troubleshooting: common production issues
```python
# vLLM production deployment

# 1. Install
#    pip install vllm

# 2. Start server (OpenAI-compatible API)
#    python -m vllm.entrypoints.openai.api_server \
#        --model sukruyusufkaya/llama-3-8b-tr-instruct \
#        --tensor-parallel-size 1 \
#        --gpu-memory-utilization 0.9 \
#        --max-model-len 8192 \
#        --port 8000

# 3. Use as an OpenAI API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any-key",  # vLLM accepts any key
)

response = client.chat.completions.create(
    model="sukruyusufkaya/llama-3-8b-tr-instruct",
    messages=[
        # "You are a helpful Turkish-speaking assistant."
        {"role": "system", "content": "Sen Türkçe konuşan yardımcı bir asistansın."},
        # "Which are Istanbul's most famous mosques?"
        {"role": "user", "content": "İstanbul'un en ünlü camileri hangileridir?"},
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)

# 4. Multi-GPU (Llama-3-70B)
#    python -m vllm.entrypoints.openai.api_server \
#        --model meta-llama/Meta-Llama-3-70B-Instruct \
#        --tensor-parallel-size 4 \
#        --gpu-memory-utilization 0.9 \
#        --max-model-len 8192

# 5. Quantization (faster, less memory)
#    python -m vllm.entrypoints.openai.api_server \
#        --model TheBloke/Llama-3-8B-Instruct-AWQ \
#        --quantization awq \
#        --gpu-memory-utilization 0.9
```
vLLM production deployment commands
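Continuous batching is easiest to appreciate from the client side: fire many requests at once and watch aggregate throughput scale far better than sequential calls. A minimal sketch, assuming the server from step 2 is running on localhost:8000; the request count, prompt, and token limit are illustrative.

```python
# Concurrency sketch: N simultaneous chat requests, scheduled by vLLM's
# continuous batching. Assumes the step-2 server on localhost:8000.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="any-key")

async def one_request(i: int) -> int:
    response = await client.chat.completions.create(
        model="sukruyusufkaya/llama-3-8b-tr-instruct",
        # "Give a short answer: what is i + i?"
        messages=[{"role": "user", "content": f"Kısa bir cevap ver: {i} + {i} kaç eder?"}],
        max_tokens=50,
    )
    return response.usage.completion_tokens

async def main(n: int = 32) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(n)))
    elapsed = time.perf_counter() - start
    # With continuous batching, total wall time grows sublinearly in n.
    print(f"{n} requests, {sum(tokens)} tokens in {elapsed:.1f}s "
          f"({sum(tokens) / elapsed:.0f} tok/s aggregate)")

asyncio.run(main())
```

Running the same loop sequentially gives a baseline; the gap between the two numbers is the batching win.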
7-10. Kubernetes + Monitoring + Cost
7.1 Kubernetes deployment
```yaml
# vllm-llama-tr.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-tr-instruct
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-3-tr
  template:
    metadata:
      labels:
        app: llama-3-tr
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - sukruyusufkaya/llama-3-8b-tr-instruct
            - --gpu-memory-utilization
            - "0.9"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 64Gi
            requests:
              nvidia.com/gpu: 1
              memory: 32Gi
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
```
7.2 HorizontalPodAutoscaler
GPU replicas should autoscale on serving metrics rather than CPU: request queue length and latency reflect actual LLM load, and reach the HPA through a custom-metrics adapter (e.g. prometheus-adapter).
7.3 Prometheus + Grafana
vLLM exposes a /metrics endpoint:
- Request count, latency, errors
- Token throughput
- GPU utilization, memory
- KV cache utilization
Grafana dashboard: real-time monitoring.
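A quick way to eyeball these without a full Prometheus stack is to fetch the endpoint directly. A minimal sketch, assuming the server runs on localhost:8000; metric names carry a "vllm:" prefix but vary across vLLM versions.

```python
# Peek at vLLM's Prometheus metrics without Prometheus itself.
# Assumes the server on localhost:8000; metric names differ across versions.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text

for line in text.splitlines():
    # Skip the "# HELP" / "# TYPE" comment lines, keep vLLM's own series.
    if line.startswith("vllm:"):
        print(line)
```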
7.4 Cost economics
Self-hosting the Turkish Llama-3-8B:
- Single H100 (spot): ~$1,800/month
- 60 concurrent users, ~5K tokens/user/day
- 1M API calls/month
- Cost per call: $0.0018
vs the OpenAI API (GPT-4o, $10/1M output tokens):
- 1M calls × (1.5K avg input tokens + 0.5K output tokens): the output alone costs $5K; total ≈ $8K/month
Self-host at ~$1.8K vs ~$8K/month for the API: a 4-5x cost saving at this scale.
Break-even sits around 100K users/day: below that the OpenAI API is cheaper, above it self-hosting wins.
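The arithmetic above fits in a few lines. A sketch using only this lesson's figures; the API input-token price is not given here, so only the $5K output side is computed and the ~$8K total is taken from the text.

```python
# Back-of-envelope cost comparison with this lesson's figures
# (assumptions from the text, not live prices).
H100_SPOT_MONTHLY = 1_800            # USD/month, single H100 spot
CALLS_PER_MONTH = 1_000_000
AVG_OUTPUT_TOKENS = 500
API_OUTPUT_PRICE = 10 / 1_000_000    # USD per output token (GPT-4o class)

self_host_per_call = H100_SPOT_MONTHLY / CALLS_PER_MONTH
api_output_monthly = CALLS_PER_MONTH * AVG_OUTPUT_TOKENS * API_OUTPUT_PRICE

print(f"Self-host cost/call:   ${self_host_per_call:.4f}")   # $0.0018
print(f"API output cost/month: ${api_output_monthly:,.0f}")  # $5,000 (+ input -> ~$8K)
```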
7.5 Production troubleshooting
Common issues:
- OOM: lower --gpu-memory-utilization or --max-model-len
- Slow first request: expected warmup, initial KV-cache allocation (see the readiness sketch after this list)
- Random latency spikes: check whether KV-cache blocks are being swapped to CPU memory
- Multi-GPU NCCL timeout: increase --nccl-timeout
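For the warmup issue in particular, a simple client-side guard that mirrors the Kubernetes readinessProbe above: poll /health before routing traffic. A sketch assuming the default endpoint on port 8000; the timeout values are illustrative.

```python
# Wait for vLLM's /health endpoint before sending traffic to the server.
import time

import requests

def wait_until_ready(url: str = "http://localhost:8000/health",
                     timeout_s: int = 300) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                print("vLLM server is ready")
                return
        except requests.ConnectionError:
            pass  # still loading weights / allocating KV cache
        time.sleep(5)
    raise TimeoutError("vLLM server did not become ready in time")

wait_until_ready()
```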
✅ Lesson 16.1 Summary: vLLM Production
vLLM is the de facto standard for production LLM serving: paged attention + continuous batching behind an OpenAI-compatible API. The Turkish Llama-3 serves ~60 concurrent users on a single H100. Kubernetes deployment plus Prometheus monitoring. Self-hosting at ~$1.8K vs ~$8K/month for the API: a 4-5x saving at medium scale. In Lesson 16.2 we move on to quantization and the capstone.
Next Lesson: Quantization + Capstone
Lesson 16.2 (final): GPTQ, AWQ, and GGUF quantization for production deployment, plus the capstone: a Turkish ChatGPT clone (front-end + vLLM backend).
Frequently Asked Questions
vLLM vs TGI vs SGLang?
vLLM: the most popular option, with mature paged attention. TGI: HuggingFace-native, easier to deploy. SGLang: a novel programming model with prefix caching. The 2026 mainstream choice is vLLM; for tight HF integration, TGI.