Where do scores come from?

Scores are drawn from MMLU-TR, TruthfulQA-TR, Belebele, and the Artificial Analysis benchmark set, combined with an internal Q1 2026 calibration.

How do Türkiye-local models compare?

Turkish-native models (Trendyol LLM, Cosmos, KanguruLLM) score lower on general benchmarks but excel in Turkish token efficiency and domain-specific quality.

Which model is best for Turkish?

It depends on the goal. For general capability: Claude 4.7 Opus or GPT-5. For production price/performance: Claude Sonnet or GPT-5-mini. Among Turkish-native models: Trendyol LLM 7B v3.

Why is DeepSeek cheap?

DeepSeek is China-based and operates on very low margins; it releases its pretrained models for free and offers them cheaply via API. Exercise caution regarding cross-border data transfer under KVKK Art. 9.

What is token efficiency?

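Token efficiency describes how compactly a model's tokenizer encodes Turkish. Turkish text typically uses 20-40% more tokens than the equivalent English text. Turkish-native models narrow this gap (efficiency of roughly 0.95-0.98).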
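As a rough illustration of the arithmetic, the sketch below assumes that efficiency is the ratio of English tokens to Turkish tokens for the same content. The page does not state the exact formula, so this definition and the function names `token_overhead` and `token_efficiency` are hypothetical.

```python
def token_overhead(tr_tokens: int, en_tokens: int) -> float:
    """Fraction of extra tokens the Turkish text needs relative to English."""
    if tr_tokens <= 0 or en_tokens <= 0:
        raise ValueError("token counts must be positive")
    return tr_tokens / en_tokens - 1.0


def token_efficiency(tr_tokens: int, en_tokens: int) -> float:
    """Assumed definition: English tokens divided by Turkish tokens.

    1.0 means Turkish costs no more tokens than English; lower is worse.
    """
    if tr_tokens <= 0 or en_tokens <= 0:
        raise ValueError("token counts must be positive")
    return en_tokens / tr_tokens


if __name__ == "__main__":
    # Made-up counts: the same passage tokenized in both languages.
    print(round(token_overhead(130, 100), 2))    # 30% more tokens in Turkish
    print(round(token_efficiency(130, 100), 3))  # general-purpose tokenizer
    print(round(token_efficiency(104, 100), 3))  # Turkish-aware tokenizer
```

Under this assumed definition, a 20-40% token overhead corresponds to an efficiency of about 0.71-0.83, while an efficiency of 0.95-0.98 corresponds to only about 2-5% extra tokens.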

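How is the use-case score calculated?

Each use case assigns a weight to each scoring dimension, and the score combines the weighted dimensions; cost carries a negative weight, so higher cost lowers the score. For example, customer-support uses trGeneral 0.3, contentQuality 0.25, truthfulTr 0.15, cost -0.2.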
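A minimal sketch of such a weighted score, using the customer-support weights quoted above. It assumes that every dimension is normalized to the 0-1 range and that the score is a plain weighted sum. The page gives these weights only as an example, so the real tool may use more dimensions or a different normalization. The function name `use_case_score` and the sample model values are hypothetical.

```python
# Weights quoted in the FAQ for the customer-support use case.
CUSTOMER_SUPPORT_WEIGHTS = {
    "trGeneral": 0.3,
    "contentQuality": 0.25,
    "truthfulTr": 0.15,
    "cost": -0.2,  # negative: a more expensive model is penalized
}


def use_case_score(dimensions: dict, weights: dict) -> float:
    """Weighted sum of dimension scores (assumed to be normalized to 0-1)."""
    missing = [name for name in weights if name not in dimensions]
    if missing:
        raise KeyError(f"missing dimensions: {missing}")
    return sum(weights[name] * dimensions[name] for name in weights)


if __name__ == "__main__":
    # Made-up dimension values for two hypothetical models.
    model_a = {"trGeneral": 0.8, "contentQuality": 0.7, "truthfulTr": 0.6, "cost": 0.5}
    model_b = dict(model_a, cost=0.1)  # same quality, cheaper

    print(round(use_case_score(model_a, CUSTOMER_SUPPORT_WEIGHTS), 3))
    print(round(use_case_score(model_b, CUSTOMER_SUPPORT_WEIGHTS), 3))
```

With identical quality scores, the cheaper model ranks higher because the cost term subtracts less from the total.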

Turkish LLM Performance Comparator

Compares 16+ LLMs on Turkish benchmarks, with use-case scores, domain results (banking, legal, health), cost, and region.

Definition

Turkish LLM Benchmark: standard evaluation sets that measure LLM performance in Turkish, including MMLU-TR, TruthfulQA-TR, Reasoning-TR, sectoral domain tests, and token efficiency measurements. Also known as: TR-MMLU, Turkish LLM eval, TR benchmark, Cosmos, Trendyol LLM.

References

MMLU — Measuring Massive Multitask Language Understanding, Hendrycks et al.
TruthfulQA: Measuring How Models Mimic Human Falsehoods, Lin et al.
Belebele Multilingual Reading Comprehension, Meta
Artificial Analysis — TR Benchmark, Artificial Analysis