LoRA Hot-Swap Lab: Single Base + N Adapters — 50 Customers Served on a Single 4090
vLLM 0.3+'s killer feature: a single base model plus N LoRA adapters with runtime hot-swap. A separate LoRA per customer, all on the same 24 GB. Llama 3.1 8B base (~5 GB AWQ) + 30+ adapters (~40 MB each) → 50 customers served on a single 4090. Includes a QPS-vs-latency curve.
Şükrü Yusuf KAYA
32 min read
Advanced
1. Use Case — Why Hot-Swap?
Scenario: a B2B SaaS startup with 30 customers, each wanting its own LoRA-tuned model. The classic approach:
- 30 GPUs × $7.50/hour
- One base model on each
The hot-swap approach:
- 1 GPU × $0.50/hour
- One base + 30 adapters (loaded onto the GPU from RAM/disk on demand)
- **6,500/gün)
Architecture:
Request → /v1/chat (model="customer-id-42") → vLLM: load adapter customer-42 if not in cache → forward(input, base_w + adapter_42_w) → return response
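The memory claims in the intro can be sanity-checked with quick VRAM arithmetic. A sketch: the 5 GB base and ~40 MB-per-adapter figures come from the article; the 0.92 utilization factor mirrors the vLLM config below, and treating all remaining VRAM as KV cache is a simplification.

```python
# Rough VRAM budget for single-4090 LoRA hot-swap
GPU_VRAM_GB = 24
BASE_AWQ_GB = 5.0       # Llama 3.1 8B, AWQ 4-bit (article's figure)
ADAPTER_MB = 40         # one LoRA adapter (article's figure)
MAX_GPU_LORAS = 8       # adapters resident on GPU at once

adapters_gb = MAX_GPU_LORAS * ADAPTER_MB / 1024
budget_gb = GPU_VRAM_GB * 0.92              # gpu_memory_utilization=0.92
kv_cache_gb = budget_gb - BASE_AWQ_GB - adapters_gb

print(f"adapters on GPU: {adapters_gb:.2f} GB")        # ~0.31 GB
print(f"left for KV cache/activations: {kv_cache_gb:.2f} GB")
```

Eight hot adapters cost only ~0.3 GB, which is why dozens of customers fit: almost all of the 24 GB stays available for the base weights and KV cache.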
```python
# === vLLM LoRA serving ===
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_loras=8,              # max LoRAs resident on the GPU at once
    max_lora_rank=64,         # highest supported adapter rank
    max_cpu_loras=32,         # LoRAs kept warm in CPU RAM
    gpu_memory_utilization=0.92,
)

# Adapter pool — 30 LoRAs on disk
lora_paths = {
    "customer-01": "/loras/cust01",
    "customer-02": "/loras/cust02",
    # ... 30 customers
}

# Request handling
def handle_request(customer_id, prompt):
    lora = LoRARequest(
        lora_name=customer_id,
        # lora_int_id must be a positive, stable integer — in production
        # prefer a fixed mapping; hash() is not stable across processes
        lora_int_id=abs(hash(customer_id)) % 10000 + 1,
        lora_path=lora_paths[customer_id],
    )
    sampling = SamplingParams(temperature=0.7, max_tokens=500)
    output = llm.generate([prompt], sampling, lora_request=lora)
    return output[0].outputs[0].text

# Usage
response = handle_request("customer-01", "Give a chat recommendation in Turkish")
```
vLLM LoRA hot-swap usage
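Under the hood, `max_loras` / `max_cpu_loras` behave like a two-tier cache: hot adapters sit in GPU slots, warm ones in CPU RAM, the rest on disk. A minimal sketch of that eviction logic (my illustration of the idea; vLLM's actual internal policy differs in detail):

```python
from collections import OrderedDict

class AdapterCache:
    """Two-tier LRU: GPU slots (hot) in front of CPU RAM slots (warm).
    Illustrative only — not vLLM's real implementation."""
    def __init__(self, max_gpu=8, max_cpu=32):
        self.gpu = OrderedDict()   # name -> adapter weights (hot)
        self.cpu = OrderedDict()   # name -> adapter weights (warm)
        self.max_gpu, self.max_cpu = max_gpu, max_cpu

    def fetch(self, name):
        if name in self.gpu:                  # hot hit: refresh recency
            self.gpu.move_to_end(name)
        else:                                 # warm hit, or load from disk
            weights = self.cpu.pop(name, None) or f"load {name} from disk"
            self.gpu[name] = weights          # promote to GPU
            if len(self.gpu) > self.max_gpu:  # evict LRU GPU slot -> CPU
                old, w = self.gpu.popitem(last=False)
                self.cpu[old] = w
                if len(self.cpu) > self.max_cpu:
                    self.cpu.popitem(last=False)   # fall back to disk only
        return self.gpu[name]

cache = AdapterCache(max_gpu=2, max_cpu=3)
for cust in ["c1", "c2", "c3", "c1"]:
    cache.fetch(cust)
print(list(cache.gpu))   # ['c3', 'c1'] — c2 was evicted to CPU RAM
```

This is why a hot-swap request for a cold adapter pays a one-time load latency, while repeat customers hit the GPU-resident copy for free.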
2. QPS vs Latency Curve (RTX 4090 + Llama 8B AWQ + 8 active LoRAs)
| QPS | P50 latency | P95 latency | P99 latency | GPU util |
|---|---|---|---|---|
| 1 | 1.2 s | 1.4 s | 1.5 s | 30% |
| 5 | 1.3 s | 1.8 s | 2.1 s | 65% |
| 10 | 1.5 s | 2.4 s | 3.0 s | 88% |
| 20 | 2.2 s | 4.5 s | 7.2 s | 95% |
| 30 | 5.1 s | 12 s | 25 s | 99% (saturated) |
Sweet spot: ~15 QPS, with P95 latency under 3.5 s at ~88% GPU utilization.
The Cookbook's rule: in production, cap your QPS limit at 70% of the saturation point. Saturation ⇒ degradation.
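The 70% rule can be applied mechanically to the benchmark table above. A sketch: the 3.5 s P95 SLO is taken from the sweet-spot criterion, but defining "saturation" as the first QPS level whose P95 breaks the SLO is my assumption.

```python
# (qps, p95_seconds) rows from the benchmark table above
bench = [(1, 1.4), (5, 1.8), (10, 2.4), (20, 4.5), (30, 12.0)]

SLO_P95 = 3.5   # seconds — the article's sweet-spot criterion

# Highest measured QPS still within the P95 SLO
within_slo = max(q for q, p95 in bench if p95 <= SLO_P95)

# Saturation point: first measured QPS whose P95 breaks the SLO (assumption)
saturated = min(q for q, p95 in bench if p95 > SLO_P95)

qps_cap = int(saturated * 0.7)   # production cap per the 70% rule
print(within_slo, saturated, qps_cap)
```

With this definition the cap lands at 14 QPS, consistent with the ~15 QPS sweet spot read off the table.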
✅ Deliverables
1. Train three different LoRAs (e.g. domain-A, domain-B, domain-C) on the Llama 8B base.
2. Set up hot-swap serving with vLLM.
3. Run a QPS load test (e.g. with locust).
4. Next lesson: 15.3 — SGLang RadixAttention.
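Before reaching for locust, the shape of that load test can be sketched with the stdlib alone. Everything here is a placeholder scaffold: `fake_infer` stands in for the real HTTP call to the vLLM server, and the percentile math is deliberately naive.

```python
# Minimal QPS/latency harness (stdlib only) — replace fake_infer with a
# real request to the vLLM server to reproduce the table above
import time, random
from concurrent.futures import ThreadPoolExecutor

def fake_infer(prompt):              # stand-in for the real inference call
    time.sleep(random.uniform(0.01, 0.03))
    return "ok"

def load_test(qps, duration_s=1.0):
    latencies = []
    def one_call(_):
        t0 = time.perf_counter()
        fake_infer("hi")
        latencies.append(time.perf_counter() - t0)
    with ThreadPoolExecutor(max_workers=qps) as pool:
        list(pool.map(one_call, range(int(qps * duration_s))))
    lat = sorted(latencies)
    p50 = lat[len(lat) // 2]              # naive percentiles, fine for a demo
    p95 = lat[int(len(lat) * 0.95) - 1]
    return p50, p95

p50, p95 = load_test(qps=10)
```

locust gives you the same idea with ramp-up schedules and a live dashboard; the point of the exercise is to sweep `qps` and find where P95 breaks your SLO.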