
LoRA Hot-Swap Lab: Single Base + N Adapters — 50 Customers Served on a Single 4090

vLLM 0.3+'s killer feature: single base + N LoRA adapters with runtime hot-swap. A separate LoRA per customer, all on the same 24 GB. Llama 3.1 8B base (~5 GB AWQ) + 30+ adapters (~40 MB each) → 50 customers on a single 4090. Includes the QPS-vs-latency curve.

Şükrü Yusuf KAYA
32 min read
Advanced

1. Use Case: Why Hot-Swap?

Scenario: a B2B SaaS startup. 30 customers, each wanting its own LoRA-tuned model. The classic approach:
  • 30 GPUs × $25/hour = $750/hour
  • A separate base model on each GPU
The hot-swap approach:
  • 1 GPU × $0.50/hour (RTX 4090 cloud) = $0.50/hour
  • One base + 30 adapters (loaded from RAM/disk onto the GPU on demand)
  • **$749.50/hour saved** ($6,500/day)
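The hourly arithmetic above can be checked in a few lines (the per-GPU prices are this scenario's assumptions, not vendor quotes):

```python
# Hypothetical cost sketch for the scenario above (prices assumed).
N_CUSTOMERS = 30
DEDICATED_GPU_PER_HOUR = 25.00   # assumed price of one GPU per customer
RTX4090_PER_HOUR = 0.50          # assumed cloud RTX 4090 price

classic = N_CUSTOMERS * DEDICATED_GPU_PER_HOUR  # one GPU per customer
hot_swap = RTX4090_PER_HOUR                     # one shared GPU for everyone
savings = classic - hot_swap

print(classic, hot_swap, savings)  # 750.0 0.5 749.5
```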
Architecture:
Request → /v1/chat (model="customer-id-42") → vLLM: load adapter customer-42 if not in cache → forward(input, base_w + adapter_42_w) → return response
```python
# === vLLM LoRA serving ===
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_loras=8,           # max LoRAs resident on the GPU at once
    max_lora_rank=64,      # highest supported adapter rank
    max_cpu_loras=32,      # LoRAs kept in CPU RAM (hot-swap pool)
    gpu_memory_utilization=0.92,
)

# Adapter pool: 30 LoRAs on disk
lora_paths = {
    "customer-01": "/loras/cust01",
    "customer-02": "/loras/cust02",
    # ... 30 customers
}

# Stable, unique, positive int IDs. (Python's hash() is salted per process,
# so hashing the name is not stable across restarts and can collide.)
lora_int_ids = {name: i for i, name in enumerate(lora_paths, start=1)}

# Request handling
def handle_request(customer_id, prompt):
    lora = LoRARequest(
        lora_name=customer_id,
        lora_int_id=lora_int_ids[customer_id],
        lora_path=lora_paths[customer_id],
    )
    sampling = SamplingParams(temperature=0.7, max_tokens=500)
    output = llm.generate([prompt], sampling, lora_request=lora)
    return output[0].outputs[0].text

# Usage
response = handle_request("customer-01", "Give me a chat suggestion in Turkish")
```
vLLM LoRA hot-swap usage
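The same pool can be exposed over HTTP through vLLM's OpenAI-compatible server: adapters registered with `--lora-modules` are selected per request via the `model` field. A launch sketch under this lab's assumptions (the adapter names and `/loras/...` paths come from the example above):

```shell
# Serve the base model with LoRA enabled; each --lora-modules entry is name=path
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --max-cpu-loras 32 \
  --lora-modules customer-01=/loras/cust01 customer-02=/loras/cust02

# A client picks an adapter simply by passing its name as the model:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "customer-01", "messages": [{"role": "user", "content": "Hi"}]}'
```

This is a launch-configuration sketch, not a tested deployment; check your vLLM version's CLI reference for the exact flag set.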

2. QPS vs Latency Curve (RTX 4090 + Llama 8B AWQ + 8 active LoRAs)

| QPS | P50 latency | P95 latency | P99 latency | GPU util |
|-----|-------------|-------------|-------------|----------|
| 1   | 1.2 s       | 1.4 s       | 1.5 s       | 30%      |
| 5   | 1.3 s       | 1.8 s       | 2.1 s       | 65%      |
| 10  | 1.5 s       | 2.4 s       | 3.0 s       | 88%      |
| 20  | 2.2 s       | 4.5 s       | 7.2 s       | 95%      |
| 30  | 5.1 s       | 12 s        | 25 s        | 99% (saturated) |
Sweet spot: ~15 QPS, keeping P95 latency under 3.5 s at roughly 88% GPU utilization.
The cookbook's rule: in production, cap your QPS limit at ~70% of the saturation point. Saturation means latency degradation.
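The 70% rule above is one line of arithmetic; a tiny helper makes it explicit (function name is mine, not from any library):

```python
def production_qps_cap(saturation_qps, fraction=0.7):
    """Cookbook rule from the text: cap production QPS at ~70% of saturation."""
    return saturation_qps * fraction

# The table above saturates around 30 QPS, so the cap lands near 21 QPS.
print(production_qps_cap(30))
```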
✅ Deliverables
  1. Train 3 different LoRAs (e.g. domain-A, domain-B, domain-C) on the Llama 8B base.
  2. Set up hot-swap serving with vLLM.
  3. Run a QPS load test (e.g. with locust).
  4. Next lesson: 15.3, SGLang RadixAttention.
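For deliverable 3, locust works well; a stdlib-only sketch is shown here so it runs anywhere. The endpoint, adapter names, and payload are this lab's assumptions; the percentile helper is what produces the P50/P95/P99 columns above:

```python
"""Minimal closed-loop load-test sketch (stdlib only; locust would also work)."""
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def one_request(base_url, customer_id, prompt):
    """POST one chat completion; return wall-clock latency in seconds."""
    body = json.dumps({
        "model": customer_id,  # adapter name selects the LoRA
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    t0 = time.perf_counter()
    urllib.request.urlopen(req).read()
    return time.perf_counter() - t0

if __name__ == "__main__":
    # 100 requests, 10 in flight, round-robin over 30 adapters
    with ThreadPoolExecutor(max_workers=10) as pool:
        lat = list(pool.map(
            lambda i: one_request("http://localhost:8000",
                                  f"customer-{i % 30 + 1:02d}", "Hi"),
            range(100),
        ))
    for p in (50, 95, 99):
        print(f"P{p}: {percentile(lat, p):.2f}s")
```

Raise `max_workers` to push the server along the QPS axis of the table above; the reported percentiles will trace out the same curve.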
