Is Llama 3.3 70B good enough for Turkish?

80% of use cases yes, 20% you'll find errors (especially grammar nuances). Llama 4 Scout is better. Qwen 3 72B is surprisingly good in Turkish. Sonnet 4.6 still safer. Module 9 covers comparative testing with Turkish-specific eval sets.

Open-Weight Inference: Together, Fireworks, Groq, Cerebras, DeepSeek — Frontier Quality at 5% of the Price?

Providers serving open-weight models like Llama 4, Mistral, Qwen 3, DeepSeek V3.5 — Together AI, Fireworks, Groq, Cerebras, Replicate, DeepSeek native. Price comparison, latency/throughput trade-offs, which provider for what.

Şükrü Yusuf KAYA

22 min read

5/14/2026

Intermediate

Open-Weight Inference: Together, Fireworks, Groq, Cerebras, DeepSeek — Frontier'in %5'i Fiyata Aynı Kalite?

🔥 Açık-ağırlık ekonomisi

Llama, Mistral, Qwen, DeepSeek — bu modellerin eğitimi Meta/Mistral/Alibaba/DeepSeek tarafından ödenmiş ve ağırlıklar açık. Servisleyen sağlayıcılar sadece GPU saatini ücretlendiriyor. Bu yüzden frontier modellerden %80-95 daha ucuz.

Open-weight inference sağlayıcı manzarası#

2026'da 8 ana sağlayıcı:

Sağlayıcı	Hız strateji	Fiyat strateji	Model katalogu
Together AI	Normal-fast	Düşük-mid	100+ model
Fireworks AI	Fast	Düşük-mid	50+ model + fine-tune host
Groq	Ultra-fast (LPU)	Düşük	10-15 model
Cerebras	Hyper-fast (WSE-3)	Mid	5-8 model
Replicate	Slow-normal	Pay-per-second	100+ model
DeepSeek	Normal	En düşük	Sadece DeepSeek modelleri
OpenRouter	Aggregator	%5 ek komisyon	200+ model meta
Hyperbolic	Normal	Düşük	30+ model

Her birinin kendine has fiyat dinamikleri var.

Together AI — Geniş katalog, dengeli fiyat#

Çoğu open-weight modeli sunan en geniş katalog.

Model	Input ($/M)	Output ($/M)
Llama 3.3 70B	$0.88	$0.88
Llama 3.1 405B	$3.50	$3.50
Llama 4 Scout	$0.59	$0.79
Llama 4 Maverick	$1.10	$1.40
Mistral Small 3	$0.30	$0.30
Mistral Large 2	$2.00	$6.00
Qwen 3 72B	$0.90	$0.90
DeepSeek V3	$0.49	$1.20

Özellikler#

✅ Tüm modeller OpenAI-compatible API ✅ Fine-tuning sunuyor (LoRA, full FT) ✅ Dedicated endpoints (committed throughput) ✅ Tool use destekli (modele göre) ❌ Latency Cerebras/Groq'tan yüksek

Ne zaman Together?#

Geniş model seçimine ihtiyacın var
Fine-tune host etmek istiyorsun
Dengeli latency/cost

Fireworks AI — Fine-tune odaklı#

Together'a benzer ama fine-tuning + serverless deployment odaklı.

Model	Input ($/M)	Output ($/M)
Llama 3.3 70B	$0.90	$0.90
Llama 3.1 405B	$3.00	$3.00
DeepSeek V3	$0.45	$1.10
Mistral Small 3	$0.20	$0.20

Özellikler#

✅ Tek tıkla fine-tune (LoRA) + deploy ✅ Serverless inference (cold start sorunu hafifletilmiş) ✅ Structured output destekli (JSON mode) ✅ Self-deployed model'leri Fireworks'e taşıyıp host edebilirsin

Ne zaman Fireworks?#

Fine-tune yapacaksan ve operasyonel yük istemiyorsan
Production serverless deployment
Tool use + JSON yoğun iş yükü

Groq — Ultra düşük latency#

Groq, LPU (Language Processing Unit) adlı kendi özel chip'iyle inference yapıyor. Llama 3.3 70B'yi saniyede 500 token üretiyor — endüstri ortalaması ~80.

Model	Input ($/M)	Output ($/M)	Throughput
Llama 3.3 70B	$0.59	$0.79	500 tok/s ⚡
Llama 3.1 8B	$0.05	$0.08	800 tok/s ⚡
Llama 4 Scout	$0.11	$0.34	400 tok/s
Mistral Saba 24B	$0.79	$0.79	350 tok/s
Whisper Large v3	—	—	$0.04 / saat ses

Özellikler#

✅ Düşük latency çoğu agent / chatbot için ideal ✅ Whisper transcription'da en hızlı (real-time) ✅ Cömert ücretsiz tier ✅ OpenAI-compatible API ❌ Model katalogu sınırlı ❌ Context window 8K-128K (modele göre, çok büyük değil) ❌ Fine-tune yok

Ne zaman Groq?#

Streaming UX kritik (kullanıcı bekleyemez)
Real-time transkripsiyon
Yüksek throughput batch'siz iş yükü
Türkçe konuşmalı agent (Llama 3.3 Türkçe iyi + Groq'un hızı)

⚡ Groq'un sihri

Aynı Llama 3.3 70B modelini Together'da 80 tok/s, Groq'da 500 tok/s alıyorsun. 6× hız avantajı, fiyat da %30 daha düşük. Streaming gerekli her yerde Groq'u default seçim olarak düşün.

Cerebras — Hyper-fast, premium fiyat#

Cerebras WSE-3 chip'i (dünyanın en büyük chip'i). Llama 3.3 70B'yi 2200 tok/s üretiyor — Groq'tan 4× daha hızlı.

Model	Input ($/M)	Output ($/M)	Throughput
Llama 3.3 70B	$0.85	$1.20	2200 tok/s ⚡⚡
Llama 4 Scout	$0.65	$0.85	1500 tok/s
Qwen 3 32B	$0.40	$0.80	1800 tok/s

Özellikler#

✅ Endüstrinin en hızlı inference'i (literally hyper-fast) ✅ Çok düşük TTFT (time to first token) ✅ Reasoning model'leri çok hızlı çalıştırıyor ❌ Premium fiyat (Groq'tan %50 pahalı) ❌ Model katalogu çok sınırlı ❌ Bazı feature'lar (tool use, structured output) eksik

Ne zaman Cerebras?#

Reasoning model (Qwen 3 reasoning, Llama 3 reasoning) hızlı çalıştırılacak
"Lightning fast UX" premium feature
Düşük-latency arbitrage opportunities

DeepSeek Native API — En ucuz frontier#

DeepSeek modellerini doğrudan DeepSeek'in API'sinden almak, en ucuz seçenek.

Model	Input	Cached Input	Output
DeepSeek V3.5	$0.27 / M	$0.027 ⭐	$1.10 / M
DeepSeek R1 (reasoning)	$0.55 / M	$0.055	$2.19 / M + thinking

Cache 10× indirim mucizesi#

DeepSeek otomatik prompt cache uygular: cache hit'te %90 indirim (Anthropic'in 0.10× ile aynı oranı).

Özellikler#

✅ Frontier-grade quality (Sonnet 4.6 seviyesinde benchmarklarda) ✅ En ucuz (Anthropic'ten 10-15× ucuz) ✅ Otomatik prompt caching (kullanmasan da çalışır) ❌ Çin merkezli, KVKK/GDPR endişeli ❌ Rate limits sıkı ❌ Tool use henüz olgunlaşmamış

Türkiye için#

KVKK uyumlu üretim ortamına alabileceğin bir model. DeepSeek modelini Together / Fireworks üzerinden kullan (US-based servisler) ve native API'yi dev/test'te kullan.

OpenRouter — Aggregator#

Tüm sağlayıcıları tek API'den çağırmana izin veriyor. Provider routing optimizasyonu sunuyor.

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",  # OpenRouter herhangi bir sağlayıcıya yönlendirir
    messages=[...],
)

Özellikler#

✅ 200+ model tek API ✅ Automatic fallback (bir sağlayıcı yavaşsa diğerine) ✅ Provider preferences (

provider={"order": ["Groq", "Together"]}

) ✅ Usage analytics built-in ❌ %5 ek komisyon ❌ Cache feature'ları sağlayıcı-spesifik

Ne zaman OpenRouter?#

Çok-sağlayıcı strategy ile çalışıyorsun
Provider failover'a güveniyorsun
Test/keşif fazında — sonra direct API'ye geç tasarruf için

Replicate — Pay-per-second#

Replicate, pricing'i GPU saniye bazında ödüyor:

GPU saniye fiyatı (A100 80GB): $0.0014 / saniye
Llama 3.3 70B inference: ~100ms/token → 5 saniye/cevap
Maliyet/cevap: ~$0.007

Özellikler#

✅ İmage / video gen modelleri çok güçlü (Flux, SDXL) ✅ Pay-per-use (cold start var ama net) ✅ Custom container deploy ❌ Token-bazlı pricing'ten daha karmaşık tahminler

Ne zaman Replicate?#

İmage/video generation
Custom open-source model deploy
Sporadic workload (pay-as-you-go avantaj)

Final karşılaştırma — Aynı Llama 3.3 70B, 5 sağlayıcı#

Aynı modeli farklı sağlayıcılarda kullanmanın fiyat ve hız tablosu:

Sağlayıcı	Input ($/M)	Output ($/M)	Throughput	TTFT
Together AI	$0.88	$0.88	80 tok/s	300ms
Fireworks AI	$0.90	$0.90	100 tok/s	250ms
Groq	$0.59	$0.79	500 tok/s	100ms
Cerebras	$0.85	$1.20	2200 tok/s	80ms
Hyperbolic	$0.40	$0.40	60 tok/s	400ms
OpenRouter	~$0.65	~$0.80	sağlayıcıya bağlı	sağlayıcıya bağlı

Karar matrisi#

Önceliğin	Tercih
En ucuz	Hyperbolic, DeepSeek native
En hızlı	Cerebras > Groq
Geniş katalog	Together AI, OpenRouter
Fine-tune	Fireworks AI
Real-time UX	Groq
Reasoning model	Cerebras

📈 Açık-ağırlık tezi

2026'nın sonunda, üretim iş yüklerinin %50'sinin açık-ağırlık modellere kayması bekleniyor. Frontier kalitesinin %95'i, fiyatın %5'i — bu denklem işliyor. Lab 11'de bunun ne kadarını self-host edebileceğini ölçeceğiz.

▶️ Sıradaki ders

2.5 — AWS Bedrock, Azure OpenAI, Vertex AI: Enterprise Fiyat Manzarası. Büyük cloud sağlayıcılarındaki LLM fiyat farkları, committed throughput, region pricing, ve enterprise compliance/security premium'u.

Frequently Asked Questions

At high volume yes — we'll do break-even calculations in Module 11. At low volume no: you pay for GPU hours, sit idle. Store vs build. Self-hosting becomes economic above ~5M tokens/day.

Yorumlar & Soru-Cevap

(0)

Yorum yazmak için giriş yap.

Yorumlar yükleniyor...

Open-Weight Inference: Together, Fireworks, Groq, Cerebras, DeepSeek — Frontier Quality at 5% of the Price?

Open-weight inference sağlayıcı manzarası#

Together AI — Geniş katalog, dengeli fiyat#

Özellikler#

Ne zaman Together?#

Fireworks AI — Fine-tune odaklı#

Özellikler#

Ne zaman Fireworks?#

Groq — Ultra düşük latency#

Özellikler#

Ne zaman Groq?#

Cerebras — Hyper-fast, premium fiyat#

Özellikler#

Ne zaman Cerebras?#

DeepSeek Native API — En ucuz frontier#

Cache 10× indirim mucizesi#

Özellikler#

Türkiye için#

OpenRouter — Aggregator#

Özellikler#

Ne zaman OpenRouter?#

Replicate — Pay-per-second#

Özellikler#

Ne zaman Replicate?#

Final karşılaştırma — Aynı Llama 3.3 70B, 5 sağlayıcı#

Karar matrisi#

Frequently Asked Questions

Is self-hosting open-weight models cheaper?

Is Llama 3.3 70B good enough for Turkish?

Yorumlar & Soru-Cevap

Related Content

The AI Cost Explosion: Why Token Prices Fell 96% from 2022 to 2026 — Yet Bills Grew 40×

Unit Economics Vocabulary: COGS, Gross Margin, $/User, Contribution Margin — 9 Financial Concepts Every AI Engineer Must Know

Workshop Toolkit: A Quick Tour of the 11 Tools We'll Use Throughout the Course

Subscribe to Newsletter