Should I use spot/preemptible GPU instances?

Yes for asynchronous workloads: the discount is 60-80%. No for real-time serving: production goes down if the instance is preempted. Vast.ai community spot is cheap but unstable. Module 11 covers example patterns.

Self-Hosted LLM Real Cost: The Full Conversion Formula from GPU-Hour to $/M Token

When you run Llama 3.3 70B on an H100 at RunPod, what is the real $/M token cost? This lesson covers the GPU-hour × throughput × MFU formula, the effect of vLLM continuous batching, and the volume at which self-hosting becomes cheaper than frontier APIs.

Şükrü Yusuf KAYA

22 min read

5/14/2026

Advanced

⚙️ The engineer's real cost test

"Would it be cheaper to host Llama 3.3 70B myself?" Knowing the GPU hourly price is not enough to answer that. Throughput, MFU, batching, idle ratio: this lesson puts every part of the formula in place.

Self-host $/M token formula: full version

The formula below converts a GPU hourly price into a cost per million tokens:

$/M_token = (GPU_hourly_price × GPU_count × 1,000,000) /
            (tokens_per_second × 3600)

or, with GPU_dollar_per_hour as the total hourly price of all GPUs:

$/M_token = GPU_dollar_per_hour /
            (tok_s × 0.0036)
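The conversion from an hourly GPU price to a per-million-token cost can be sketched in a few lines of Python. The function name and parameters here are illustrative, not part of any library, and the inputs in the demo are hypothetical:

```python
def dollars_per_million_tokens(gpu_price_per_hour: float,
                               tokens_per_second: float,
                               gpu_count: int = 1) -> float:
    """Cost in dollars to generate one million tokens on rented GPUs."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    hourly_cost = gpu_price_per_hour * gpu_count
    # Dollars per token, scaled up to dollars per million tokens.
    return hourly_cost / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: a $2.00/hour GPU producing 100 tokens/second in aggregate.
    print(f"${dollars_per_million_tokens(2.00, 100):.2f} per million tokens")
```

Note that `tokens_per_second` should be the aggregate throughput of the whole deployment, not the speed a single user sees.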

An example

Scenario: Llama 3.3 70B, H100 80GB, RunPod

H100 hourly price: $2.79/hour (RunPod community cloud, May 2026)
vLLM throughput: 80 tokens/second (single user)
→ $/M output token = $2.79 / (80 × 0.0036) = $9.69/M

That is very expensive! Compare it with Sonnet 4.6 ($15/M output).

But wait: batching changes the real calculation.

The miracle of batching: continuous batching

vLLM's continuous batching processes multiple user requests in parallel on the same GPU.

Single user vs batched

Mode	Throughput per request	Total throughput
Single user (1 request)	80 tok/s	80 tok/s
4 parallel users	40 tok/s each	160 tok/s total
8 parallel users	25 tok/s each	200 tok/s total
16 parallel users	12 tok/s each	192 tok/s total
32 parallel users	6 tok/s each	192 tok/s total (plateau)

Sweet spot

For Llama 3.3 70B on an H100, aggregate throughput plateaus at roughly 190-200 tok/s from batch=8 onward. The rest of this lesson uses batch=16 (192 tok/s) as the working point.

New calculation

GPU hour: $2.79
Aggregate throughput: 192 tok/s
$/M = $2.79 / (192 × 0.0036) = $4.04/M

$9.69 → $4.04 is a 58% reduction. That is the direct economic effect of continuous batching.

⚡ The vLLM continuous batching miracle

Same GPU, same model: $9.69/M at batch=1, $4.04/M at batch=16. That is 2.4× cheaper, just from using the right inference engine. vLLM, SGLang, and TensorRT-LLM all do this.
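To see how per-request speed and concurrency combine into a cost figure, here is a minimal sketch that recomputes aggregate throughput and $/M from the per-request numbers in the batching table. The dictionary and function names are illustrative only; the throughput values are the lesson's, not new measurements:

```python
# Per-request decode speed by concurrency, copied from the table in this lesson.
PER_REQUEST_TOK_S = {1: 80, 4: 40, 8: 25, 16: 12, 32: 6}
GPU_PRICE_PER_HOUR = 2.79  # H100 hourly price quoted in this lesson


def cost_per_million(gpu_price_per_hour: float, aggregate_tok_s: float) -> float:
    """Dollars per one million generated tokens at a given aggregate throughput."""
    return gpu_price_per_hour / (aggregate_tok_s * 3600) * 1_000_000


def batching_report(per_request=PER_REQUEST_TOK_S, price=GPU_PRICE_PER_HOUR):
    """Return (concurrency, aggregate tok/s, $/M) rows, one per batch size."""
    rows = []
    for concurrency, tok_s_each in sorted(per_request.items()):
        aggregate = concurrency * tok_s_each  # total tokens/second across all requests
        rows.append((concurrency, aggregate, cost_per_million(price, aggregate)))
    return rows


if __name__ == "__main__":
    for concurrency, aggregate, cost in batching_report():
        print(f"batch={concurrency:>2}  {aggregate:>4} tok/s  ${cost:.2f}/M")
```

The trade-off the rows make visible: cost per token falls as concurrency rises, while each individual user sees slower generation.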

Actual GPU hourly prices (May 2026)

NVIDIA H100 80GB

Provider	$/hour (community)	$/hour (secure)
RunPod	$2.79	$3.89
Modal	n/a	$4.50
Lambda Labs	$2.99	$3.49
Cerebrium	$3.20	$4.20
Vast.ai	$1.85	n/a
AWS (p5.48xlarge)	n/a	$4.40 (per-GPU equivalent)

NVIDIA H200 141GB

RunPod: $3.49 (community), $4.99 (secure)
Lambda Labs: $3.99
Advantage: about 1.76× the memory of an H100 → larger batches, less need for TP (tensor parallelism)

NVIDIA B200 192GB (released Q1 2026, availability still limited)

Lambda Labs: $6.99 (per-GPU equivalent)
New Blackwell architecture
2-3× the throughput of an H100

AMD MI300X 192GB

Hot Aisle: $2.49/hour
TensorWave: $2.79
ROCm support has improved; vLLM runs on it

Budget option: A100 80GB

RunPod: $1.49/hour
Vast.ai: $0.85/hour (community spot)
Llama 3.3 70B fits only when quantized (FP16 weights alone are about 140 GB), and it is slow

Matching model size to GPU

Not every model runs on every GPU. Estimate the VRAM you need:

VRAM requirement ≈ parameter count × bytes per parameter × 1.2 (overhead)

Llama 3.3 70B FP16:
70B × 2 bytes × 1.2 = 168 GB → needs at least 2× H100 80GB (TP=2), and that is tight: 160 GB total leaves about 20 GB after the 140 GB of weights

Llama 3.3 70B INT8 quant:
70B × 1 byte × 1.2 = 84 GB → a squeeze on 1× H100 80GB (weights alone are 70 GB), comfortable on 1× H200 141GB

Llama 3.3 70B INT4 quant (GPTQ/AWQ):
70B × 0.5 byte × 1.2 = 42 GB → a squeeze on 1× A100 40GB (weights alone are 35 GB), roomy on 1× H100 80GB
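The VRAM rule of thumb above can be sketched as a small helper. The names `estimate_vram_gb` and `headroom_gb` are hypothetical, and the 1.2 overhead factor is the lesson's rough allowance, not a measured value; real KV cache needs depend on batch size and context length:

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_vram_gb(params_billion: float, precision: str, overhead: float = 1.2) -> float:
    """Rule-of-thumb VRAM estimate: parameters x bytes per parameter x overhead."""
    return params_billion * BYTES_PER_PARAM[precision] * overhead


def headroom_gb(required_gb: float, gpu_vram_gb: float, gpu_count: int = 1) -> float:
    """Memory left over after loading; a negative value means it does not fit."""
    return gpu_vram_gb * gpu_count - required_gb


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        need = estimate_vram_gb(70, precision)
        weights_only = estimate_vram_gb(70, precision, overhead=1.0)
        print(f"{precision}: ~{need:.0f} GB with overhead, "
              f"{weights_only:.0f} GB weights only, "
              f"headroom on 1x 80 GB GPU: {headroom_gb(need, 80):+.0f} GB")
```

A negative headroom with overhead but a positive one for weights only is the "squeeze" case: the model loads, but little memory is left for the KV cache and therefore for batching.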

Effect of quantization

Precision	VRAM (weights only)	Quality	Throughput
FP32 (full)	280 GB	100% (baseline)	1×
BF16/FP16	140 GB	99.9%	1.5×
INT8	70 GB	99.5%	2×
INT4 (GPTQ/AWQ)	35 GB	98%	2.5×
INT4 + speculative decoding	35 GB	98%	3-4×

Throughput table: real-world measurements

The same Llama 3.3 70B model in different configurations:

Config	Aggregate throughput	$/M output
1× H100, FP16, batch=1	80 tok/s	$9.69
1× H100, FP16, batch=16	192 tok/s	$4.04
1× H100, INT8, batch=16	320 tok/s	$2.42
1× H100, INT4 AWQ, batch=32	480 tok/s	$1.62
1× H200, FP16, batch=24	280 tok/s	$3.47
1× H200, INT8, batch=32	550 tok/s	$1.77
2× H100, FP16 TP=2, batch=32	600 tok/s	$2.58
1× B200, FP16, batch=32	700 tok/s	$2.77
1× B200, INT8, batch=64	1,200 tok/s	$1.62

Optimum: 1× H100 + INT4 + batch=32 → $1.62/M

Groq's price for Llama 3.3 70B: $0.79/M output. Together's price for Llama 3.3 70B: $0.88/M output.

Groq is still about 50% cheaper than self-hosting. Why? Because Groq runs on its own custom chip (the LPU), not on GPUs, so its cost base is different.

Utilization factor: the naked truth

The sneakiest part of the self-host calculation: it assumes the GPU is busy 100% of the time. How busy is it in practice?

Scenario	Real utilization	Effective $/M
Production 24/7, high traffic	85-95%	listed × 1.05-1.18
Production, daytime-heavy	40-60%	listed × 1.7-2.5
Dev/test	5-15%	listed × 7-20

Real-world example

GPU cost: $2.79 × 24 hours × 30 days = $2,009/month (1× H100, always on)
At 50% utilization: effective $5.58 per fully used GPU-hour ($2.79 / 0.5)
$/M @ 192 tok/s, 50% util = $2.79 / (192 × 0.0036 × 0.5) = $8.07/M

Low utilization kills self-hosting.
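The utilization adjustment can be sketched as one extra factor in the denominator. This is a minimal illustration with a hypothetical function name; it assumes the GPU is billed for every hour and only the busy fraction of each hour produces tokens:

```python
def effective_cost_per_million(gpu_price_per_hour: float,
                               peak_tok_s: float,
                               utilization: float) -> float:
    """Dollars per million tokens when the GPU is busy only part of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens actually produced in one billed hour.
    useful_tokens_per_hour = peak_tok_s * 3600 * utilization
    return gpu_price_per_hour / useful_tokens_per_hour * 1_000_000


if __name__ == "__main__":
    listed = effective_cost_per_million(2.79, 192, 1.0)
    for utilization in (0.9, 0.5, 0.1):
        cost = effective_cost_per_million(2.79, 192, utilization)
        print(f"util={utilization:.0%}: ${cost:.2f}/M ({cost / listed:.1f}x listed)")
```

The multiplier over the listed price is simply 1 / utilization, which is where the ranges in the scenario table come from.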

Break-even: self-host vs API

Break-even analysis for Llama 3.3 70B:

API provider	$/M output	Self-host break-even (tokens/day, 1× always-on H100 at $2,009/month)
Groq	$0.79	Never: Groq is cheaper
Together	$0.88	Never: Together is cheaper
Bedrock Llama	$2.65	~25M tokens/day
API average	$1.50	Never on this setup: below the $1.62/M self-host floor at 100% utilization

Compare with Sonnet 4.6

Metric	Sonnet 4.6	Self-host L3.3 70B INT4
Output $/M	$15	$1.62
Quality	Premium	Good (~80% of Sonnet)
Latency	60-120 ms TTFT	depends on your deployment
Operational burden	0	high (monitoring, scaling, GPU failures)

If the quality is acceptable and your workload exceeds 5M tokens/day, self-hosting is economical.
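A break-even volume like the 5M tokens/day figure can be approximated by dividing a fixed monthly GPU bill by the API price. The sketch below assumes an always-on GPU and ignores ops, storage, and standby costs; all function names are illustrative:

```python
def monthly_gpu_cost(gpu_price_per_hour: float, gpu_count: int = 1,
                     hours_per_month: int = 720) -> float:
    """Fixed monthly bill for GPUs that are rented around the clock."""
    return gpu_price_per_hour * gpu_count * hours_per_month


def break_even_tokens_per_day(monthly_fixed_cost: float,
                              api_price_per_million: float,
                              days_per_month: int = 30) -> float:
    """Daily token volume at which the API bill equals the fixed self-host bill."""
    tokens_per_month = monthly_fixed_cost / api_price_per_million * 1_000_000
    return tokens_per_month / days_per_month


def max_tokens_per_day(peak_tok_s: float) -> float:
    """Upper bound on what the deployment can serve in a day at 100% utilization."""
    return peak_tok_s * 86_400


if __name__ == "__main__":
    fixed = monthly_gpu_cost(2.79)  # 1x H100 at the price quoted in this lesson
    volume = break_even_tokens_per_day(fixed, api_price_per_million=15.0)
    print(f"Break-even vs a $15/M API: {volume / 1e6:.1f}M tokens/day")
    print(f"Capacity at 480 tok/s: {max_tokens_per_day(480) / 1e6:.1f}M tokens/day")
```

If the break-even volume exceeds `max_tokens_per_day` for your configuration, the API stays cheaper no matter how much traffic you have.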

Hidden costs of self-hosting

The GPU hourly price alone is not the whole bill. Account for these items too:

1. Operations

GPU monitoring (Prometheus, Grafana setup)
Auto-scaling (KEDA, Karpenter, or custom)
Model deployment pipeline (CI/CD)
Update/rollback procedure
On-call engineer (for when the LLM goes down)

Estimate: 20-40 engineering hours per month × $50/hour = $1,000-2,000/month.

2. Network egress

On Bedrock/Vertex it is free. When self-hosting, every response is outbound network traffic:

AWS: $0.09/GB
Lambda Labs / RunPod: usually free

3. Storage

Model weights: 70B = ~140 GB in FP16. S3/R2 storage.
Logs, traces: ClickHouse or equivalent
Estimate: $50-100/month

4. Idle GPU

If you have reserved a GPU but there is no traffic, you still pay. Auto-scaling to zero is a must.

5. Redundancy (HA)

A single GPU is a single point of failure. In production, run at least 2× GPU: primary + standby.
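The five items above can be rolled into one monthly figure. This is a rough sketch, not a complete cost model: the class name and fields are hypothetical, the defaults are midpoints of the estimates in this section, and it assumes a standby GPU is billed at the same rate as an active one:

```python
from dataclasses import dataclass


@dataclass
class SelfHostBudget:
    gpu_price_per_hour: float
    active_gpus: int = 1
    standby_gpus: int = 0              # HA standby, assumed billed like an active GPU
    ops_hours_per_month: float = 30.0  # midpoint of the 20-40 hour estimate
    ops_rate_per_hour: float = 50.0
    egress_gb_per_month: float = 0.0
    egress_price_per_gb: float = 0.0   # e.g. 0.09 on AWS, often 0 elsewhere
    storage_per_month: float = 75.0    # midpoint of the $50-100 estimate
    hours_per_month: int = 720         # always-on; lower this if you scale to zero

    def breakdown(self) -> dict:
        """Monthly cost per line item, in dollars."""
        gpus = self.active_gpus + self.standby_gpus
        return {
            "gpu": self.gpu_price_per_hour * gpus * self.hours_per_month,
            "ops": self.ops_hours_per_month * self.ops_rate_per_hour,
            "egress": self.egress_gb_per_month * self.egress_price_per_gb,
            "storage": self.storage_per_month,
        }

    def total(self) -> float:
        return sum(self.breakdown().values())


if __name__ == "__main__":
    budget = SelfHostBudget(gpu_price_per_hour=2.79, standby_gpus=1)
    for item, cost in budget.breakdown().items():
        print(f"{item:>8}: ${cost:,.2f}")
    print(f"   total: ${budget.total():,.2f}/month")
```

Dividing `total()` by the monthly token volume gives a fully loaded $/M figure to compare against API prices, instead of the GPU-only number.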

📐 Self-host rule of thumb

Workload < 5M tokens/day: use an API. 5-50M tokens/day: the decision gets harder and depends on quality, budget, and team capacity. > 50M tokens/day of sustained load: consider self-hosting seriously. Module 11 covers detailed break-even analysis, with Lab 11.
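The rule of thumb can be written as a first-pass triage function. The function name and return labels are made up for illustration; the thresholds are the ones stated above and are a starting point, not a substitute for a break-even calculation:

```python
def hosting_recommendation(tokens_per_day: float) -> str:
    """First-pass triage based on sustained daily token volume."""
    if tokens_per_day < 0:
        raise ValueError("tokens_per_day must be non-negative")
    if tokens_per_day < 5_000_000:
        return "api"        # below 5M tokens/day: use an API
    if tokens_per_day <= 50_000_000:
        return "evaluate"   # 5-50M: weigh quality, budget, and team capacity
    return "self-host"      # above 50M sustained: consider self-hosting seriously


if __name__ == "__main__":
    for volume in (1_000_000, 20_000_000, 100_000_000):
        print(f"{volume:>11,} tokens/day -> {hosting_recommendation(volume)}")
```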

Case study: a Turkish fintech's self-host decision

Workload: 150M tokens per month, mostly Turkish customer-support automation. KVKK compliance is mandatory.

Option A: Vertex AI EU + Sonnet 4.6

Input  : 120M × $3/M = $360
Output : 30M × $15/M = $450
TOTAL  : $810/month (KVKK-compliant)

Option B: Self-host Llama 3.3 70B FP16 (TP=2), 2× H100 Frankfurt

GPU cost: 2 × $2.79 × 24 × 30 = $4,018/month
Capacity @ 600 tok/s @ 70% util = 1,089M tokens/month
$/M at that capacity: $4,018 / 1,089 = $3.69/M (GPU only)
Actual load is 150M tokens/month (about 10% of the 1,555M full capacity): $4,018 / 150 = $26.79/M (GPU only)
+ ops: ~$1,500/month
+ network: $50
Monthly total: $5,568/month (B), $810/month (A)

A is cheaper, because the workload is below the self-host break-even.

Conclusion: 150M tokens/month does not justify self-hosting. Stay with Sonnet 4.6 on Vertex EU.

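What if the workload were 1B tokens/month?

A: 1B × $7.5/M blended average = $7,500/month (a more output-heavy mix than Option A's 80/20 split, which blends to $5.40/M)
B: $4,018 GPU + $1,500 ops = $5,518/month

B is now cheaper at that blend. The self-host decision depends on volume.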
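The case study comparison can be reproduced with a short script. This is a sketch under the case study's own assumptions: a fixed self-host bill regardless of traffic, Sonnet 4.6 list prices of $3/M input and $15/M output, and illustrative function names. The crossover volume is sensitive to the blended API price you assume:

```python
def api_monthly_cost(input_m_tokens: float, output_m_tokens: float,
                     input_price: float = 3.0, output_price: float = 15.0) -> float:
    """API bill in dollars; token volumes are in millions of tokens."""
    return input_m_tokens * input_price + output_m_tokens * output_price


def self_host_monthly_cost(gpu_count: int = 2, gpu_price_per_hour: float = 2.79,
                           ops: float = 1500.0, network: float = 50.0,
                           hours_per_month: int = 720) -> float:
    """Fixed self-host bill: always-on GPUs plus ops and network estimates."""
    return gpu_count * gpu_price_per_hour * hours_per_month + ops + network


def crossover_m_tokens(self_host_cost: float, blended_api_price_per_m: float) -> float:
    """Monthly volume (millions of tokens) where the API bill matches self-hosting."""
    return self_host_cost / blended_api_price_per_m


if __name__ == "__main__":
    option_a = api_monthly_cost(120, 30)
    option_b = self_host_monthly_cost()
    print(f"A (API):       ${option_a:,.0f}/month")
    print(f"B (self-host): ${option_b:,.0f}/month")
    for blended in (5.40, 7.50):
        volume = crossover_m_tokens(option_b, blended)
        print(f"Crossover at ${blended:.2f}/M blended: {volume:,.0f}M tokens/month")
```

Before trusting a crossover figure, check that the volume is within what the GPUs can actually serve; otherwise the self-host side needs more GPUs and a higher fixed cost.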

▶️ Next lesson

2.7: Hidden Costs. Tool-use tokens, structured output prefill, reasoning thinking budget, system fingerprint cache misses, the web-search tool, vision detail mode: all the items that are not on the pricing page but inflate your bill.

Frequently Asked Questions

Modal: easiest DX, 15-30% more expensive, strong scale-to-zero. RunPod: cheapest, user-controlled, more engineering effort. Lambda Labs: in between, with stable pricing. Lab 11 runs the same Llama on all three.
