# The 2026 LLM Benchmark Glossary: What MMLU, HumanEval, SWE-bench, ARC-AGI-2, GPQA, AIME, LiveCodeBench Measure and What the Numbers Mean

> Source: https://sukruyusufkaya.com/en/blog/llm-benchmark-sozlugu-mmlu-humaneval-swe-bench-arc-agi-2026
> Updated: 2026-05-27T18:16:07.199Z
> Type: blog
> Category: yapay-zeka

**TLDR:** What MMLU, HumanEval, SWE-bench Verified/Pro, ARC-AGI-2, GPQA Diamond, AIME, LiveCodeBench v6, Terminal-Bench 2.0, OSWorld, HLE, and the Turkish benchmarks (TR-MMLU, TUMLU) each measure, where the frontier thresholds sit, the contamination and cherry-picking risks, and what the scores mean in practice for CTOs, investors, and engineers. 32+ references.

<tldr data-summary="[&quot;MMLU is saturated in 2026 (frontier 88%+). The discriminative benchmarks are now ARC-AGI-2, GPQA Diamond, SWE-bench Pro, HLE, and LiveCodeBench v6.&quot;,&quot;SWE-bench Verified has reached an 80% frontier; OpenAI announced SWE-bench Pro (frontier ~46%) as the new threshold in September 2025: multi-file, real-world software engineering.&quot;,&quot;ARC-AGI-2 SOTA sits in the mid-60s; even o3-class reasoning models lag the human baseline of 85% on transfer/fluid intelligence.&quot;,&quot;For Turkish enterprises TR-MMLU (84%) and TUMLU (79%) are far more informative than English MMLU for picking a model.&quot;,&quot;Contamination is now an industry-wide risk; vendor cherry-picking, post-train test-set leakage, and the need for rolling-update benchmarks like LiveCodeBench v6 are direct evidence.&quot;]" data-one-line="2026 LLM evaluation has moved from a single-score leaderboard to a portfolio of benchmarks with frontier thresholds; the answer to what a score means depends on the benchmark and the use case."></tldr>

## 1. Why a Benchmark Glossary?

A vendor announces "GPT-5.5 hits 82% on SWE-bench Verified," and the tech press turns it into a headline. But what does that number actually mean for:
- A CTO ("Will my engineers ship 82% faster?" No, not directly.)
- An investor ("Is this frontier?" Maybe; watch for contamination.)
- An ML engineer ("Is this enough to pick a model?" Never; you need a task-specific eval.)

Every benchmark measures something different, has its own frontier threshold, and is exposed to its own contamination risks. This guide is an honest map of the 2026 benchmark landscape.

<definition-box data-term="LLM Benchmark" data-definition="A public dataset and protocol that tests a large language model's competence on a standardized task — ranging from multiple-choice questions to closed-box software engineering and agentic computer-use environments. Each benchmark measures a specific capability; no single benchmark suffices for general intelligence." data-also="LLM eval, AI benchmark" data-wikidata="Q105843828"></definition-box>

<stat-callout data-value="46%" data-context="May 2026 SOTA on SWE-bench Pro" data-outcome="OpenAI announced this in September 2025 as the new frontier threshold — vs the 80%+ on Verified, it shows how far real software engineering still has to go." data-source="{&quot;label&quot;:&quot;OpenAI SWE-bench Pro Announcement&quot;,&quot;url&quot;:&quot;https://openai.com/index/swe-bench-pro/&quot;,&quot;date&quot;:&quot;2025-09&quot;}"></stat-callout>

## 2. Anatomy: Five Categories

LLM benchmarks fall into 5 categories:

1. **Knowledge + Reasoning** (MMLU, GPQA, HLE, ARC-AGI)
2. **Math** (AIME, MATH, GSM8K)
3. **Code** (HumanEval, MBPP, SWE-bench, LiveCodeBench, Terminal-Bench)
4. **Agentic + Computer Use** (OSWorld, AgentBench, WebArena)
5. **Language-specific** (TR-MMLU, TUMLU, CMMLU, JMMLU)

A "frontier" model must score high in all five categories, not just one.
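
One way to apply the five-category rule to a vendor pitch is to check which categories the reported benchmarks leave uncovered. The sketch below copies the category map from the list above; the `pitch` list and the `missing_categories` helper are hypothetical, and matching is by exact family name (so "SWE-bench" stands for both Verified and Pro).

```python
# Category map copied from the list above. Names are benchmark families.
CATEGORIES = {
    "knowledge_reasoning": {"MMLU", "GPQA", "HLE", "ARC-AGI"},
    "math": {"AIME", "MATH", "GSM8K"},
    "code": {"HumanEval", "MBPP", "SWE-bench", "LiveCodeBench", "Terminal-Bench"},
    "agentic": {"OSWorld", "AgentBench", "WebArena"},
    "language_specific": {"TR-MMLU", "TUMLU", "CMMLU", "JMMLU"},
}


def missing_categories(reported):
    """Return the categories for which the pitch reports no benchmark at all."""
    reported = set(reported)
    return sorted(
        name for name, members in CATEGORIES.items() if not (members & reported)
    )


# Hypothetical vendor pitch that only quotes saturated benchmarks.
pitch = ["MMLU", "HumanEval", "GSM8K"]
print(missing_categories(pitch))
```

A pitch that leaves any category empty has not made a frontier claim in the sense used here, whatever the individual scores are.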

## 3. The 2026 Frontier Landscape

<comparison-table data-caption="2026 LLM Benchmark Landscape: Frontier Thresholds" data-headers="[&quot;Benchmark&quot;,&quot;What it measures&quot;,&quot;Max&quot;,&quot;Frontier (2026)&quot;,&quot;Saturated?&quot;]" data-rows="[{&quot;feature&quot;:&quot;MMLU&quot;,&quot;values&quot;:[&quot;57-area MCQ&quot;,&quot;100%&quot;,&quot;88%+&quot;,&quot;Yes (saturated)&quot;]},{&quot;feature&quot;:&quot;MMLU-Pro&quot;,&quot;values&quot;:[&quot;Harder MCQ&quot;,&quot;100%&quot;,&quot;80%+&quot;,&quot;No&quot;]},{&quot;feature&quot;:&quot;GPQA Diamond&quot;,&quot;values&quot;:[&quot;Graduate-level QA&quot;,&quot;100%&quot;,&quot;75%+&quot;,&quot;No&quot;]},{&quot;feature&quot;:&quot;HumanEval&quot;,&quot;values&quot;:[&quot;Python code&quot;,&quot;100%&quot;,&quot;92%+&quot;,&quot;Yes (saturated)&quot;]},{&quot;feature&quot;:&quot;MBPP&quot;,&quot;values&quot;:[&quot;Basic Python&quot;,&quot;100%&quot;,&quot;85%+&quot;,&quot;Saturating&quot;]},{&quot;feature&quot;:&quot;LiveCodeBench v6&quot;,&quot;values&quot;:[&quot;Recent code problems&quot;,&quot;100%&quot;,&quot;65%+&quot;,&quot;No (rolling)&quot;]},{&quot;feature&quot;:&quot;SWE-bench Verified&quot;,&quot;values&quot;:[&quot;Real GitHub issues&quot;,&quot;100%&quot;,&quot;80%+&quot;,&quot;Approaching&quot;]},{&quot;feature&quot;:&quot;SWE-bench Pro&quot;,&quot;values&quot;:[&quot;Multi-file software&quot;,&quot;100%&quot;,&quot;46%+&quot;,&quot;No&quot;]},{&quot;feature&quot;:&quot;ARC-AGI-1&quot;,&quot;values&quot;:[&quot;Visual reasoning&quot;,&quot;100%&quot;,&quot;88%+&quot;,&quot;Yes (end-2024)&quot;]},{&quot;feature&quot;:&quot;ARC-AGI-2&quot;,&quot;values&quot;:[&quot;Harder ARC&quot;,&quot;100%&quot;,&quot;55%+&quot;,&quot;No&quot;]},{&quot;feature&quot;:&quot;AIME&quot;,&quot;values&quot;:[&quot;Olympiad math&quot;,&quot;30/30&quot;,&quot;26+&quot;,&quot;No&quot;]},{&quot;feature&quot;:&quot;MATH&quot;,&quot;values&quot;:[&quot;High-school math&quot;,&quot;100%&quot;,&quot;92%+&quot;,&quot;Saturating&quot;]},{&quot;feature&quot;:&quot;GSM8K&quot;,&quot;values&quot;:[&quot;Grade-school math&quot;,&quot;100%&quot;,&quot;96%+&quot;,&quot;Yes (saturated)&quot;]},{&quot;feature&quot;:&quot;Terminal-Bench 2.0&quot;,&quot;values&quot;:[&quot;CLI agent&quot;,&quot;100%&quot;,&quot;38%+&quot;,&quot;No&quot;]},{&quot;feature&quot;:&quot;OSWorld&quot;,&quot;values&quot;:[&quot;Computer-use agent&quot;,&quot;100%&quot;,&quot;24%+&quot;,&quot;No&quot;]},{&quot;feature&quot;:&quot;HLE&quot;,&quot;values&quot;:[&quot;Multi-domain hard&quot;,&quot;100%&quot;,&quot;34%+&quot;,&quot;No&quot;]},{&quot;feature&quot;:&quot;TR-MMLU v2&quot;,&quot;values&quot;:[&quot;Turkish 67-area&quot;,&quot;100%&quot;,&quot;82%+&quot;,&quot;No&quot;]},{&quot;feature&quot;:&quot;TUMLU&quot;,&quot;values&quot;:[&quot;Turkish 32-task&quot;,&quot;100%&quot;,&quot;78%+&quot;,&quot;No&quot;]}]"></comparison-table>

## 4. Detail on Each Benchmark

### 4.1. MMLU
57 academic fields, ~14k multiple-choice questions (MCQ). **Saturated.** Frontier 88%+. Treat it as a minimum entry threshold, not as a discriminator.

### 4.2. MMLU-Pro
A harder MMLU with 10 answer options per question. Frontier 80%+. Not yet saturated, but trending that way.

### 4.3. GPQA Diamond
PhD-level Bio/Chem/Physics, Google-proof, 198 hardest items. Frontier 75%+. **Best knowledge-discrimination benchmark for 2026.**

### 4.4. HumanEval
164 standalone Python problems. **Saturated.** Frontier 92%+. Heavy contamination risk; **do not use it as a production criterion.**
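
HumanEval scores are conventionally reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The sketch below implements the unbiased estimator from the HumanEval paper (Chen et al., in the references), using the paper's symbols `n` (samples drawn), `c` (samples that passed), and `k`. This guide does not say which k its 92%+ figure refers to; pass@1 is the usual headline and is assumed here.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing solution.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples drawn, 3 of them pass.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

The benchmark score is the mean of this value over all 164 problems, which is why the same model can report noticeably different numbers at k=1 and k=10.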

### 4.5. MBPP
974 basic Python. Saturating. Frontier 85%+.

### 4.6. LiveCodeBench v6
Rolling updates drawn from LeetCode, AtCoder, and Codeforces contests. **Best code benchmark for contamination resistance.** Frontier 65%+.
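
The contamination resistance comes from dating every problem: a model can be scored only on problems published after its training cutoff. A minimal sketch of that filter follows; the record fields (`released`, `passed`) and the sample data are illustrative assumptions, not the LiveCodeBench schema.

```python
from datetime import date


def post_cutoff_pass_rate(results, training_cutoff):
    """Pass rate over problems published strictly after the training cutoff."""
    fresh = [r for r in results if r["released"] > training_cutoff]
    if not fresh:
        return None  # nothing to score: every problem predates the cutoff
    return sum(r["passed"] for r in fresh) / len(fresh)


# Hypothetical per-problem results for one model.
results = [
    {"id": "p1", "released": date(2024, 11, 3), "passed": True},
    {"id": "p2", "released": date(2025, 2, 14), "passed": True},
    {"id": "p3", "released": date(2025, 4, 1), "passed": False},
    {"id": "p4", "released": date(2025, 6, 20), "passed": True},
]
print(post_cutoff_pass_rate(results, date(2025, 1, 1)))
```

A large drop between the all-problems rate and the post-cutoff rate is the signal to look for: it suggests the older problems were seen during training.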

### 4.7. SWE-bench Verified
500 real GitHub issues, manually verified. Frontier 80%+. Real engineering relevance.

### 4.8. SWE-bench Pro
Multi-file, multi-module, multi-language curated tasks. **Lowest contamination, most realistic.** Frontier 46%+. OpenAI's official new frontier threshold.

<callout-box data-variant="answer" data-title="Why Verified at 80% but Pro at 46%?">

Two reasons: contamination (Verified problems were public on GitHub in 2023-2024) and task complexity (Pro averages 8-12 files vs Verified's 2-3). Pro is a much more accurate proxy for "engineer-on-a-team productivity."

</callout-box>

### 4.9. ARC-AGI-1
Visual reasoning, fluid intelligence. **Saturated at the end of 2024**, when o3-style models reached 88%. Superseded by ARC-AGI-2.

### 4.10. ARC-AGI-2
Harder visual reasoning. Frontier 55-65% with reasoning models. Human baseline 85%, still uncrossed.

### 4.11. AIME
30 competition-math problems per year (two 15-question papers, AIME I and II). Frontier 26+/30. Reasoning models now at olympiad level.
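
The threshold here is a raw score out of 30, while the scoreboard in Section 5 reports AIME as a percentage. A small conversion, assuming the percentage is simply correct answers over the 30 problems of one year:

```python
def aime_percent(correct, total=30):
    """Convert a raw AIME score to the percentage form used in scoreboards."""
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return round(100 * correct / total, 1)


for correct in (25, 26, 27):
    print(f"{correct}/30 = {aime_percent(correct)}%")
```

On this reading, the 86.7%, 83.3%, and 90.0% entries in the scoreboard correspond to 26, 25, and 27 correct answers, so only scores of 86.7% and above meet the 26+/30 threshold.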

### 4.12. MATH
12.5k competition-style high-school problems. Frontier 92%+. Saturating.

### 4.13. GSM8K
8.5k grade-school word problems. **Saturated** at 96%+.

### 4.14. Terminal-Bench 2.0
CLI agent tasks (bash + git + Docker + kubectl). Frontier 38%+. Closest to real DevOps work.

### 4.15. OSWorld
Linux desktop GUI tasks via mouse + keyboard. Frontier 24%+. Human baseline 72%. Long way to go.

### 4.16. HLE (Humanity's Last Exam)
PhD-level, multi-domain questions. Frontier 34%+. Human PhD baseline 82%. Built to stay discriminative after models match humans on older benchmarks.

### 4.17. TR-MMLU, TUMLU, TurkishMMLU-Pro
Turkish-specific benchmarks; far more informative than English MMLU for Turkish-market decisions. Frontier 82%+, 78%+, and 62%+ respectively.

## 5. Consolidated Frontier Scoreboard (May 2026)

<stat-callout data-value="32" data-context="As of May 2026, the average number of benchmarks needed" data-outcome="to classify a model as 'frontier' — every one must cross the frontier threshold; topping a single benchmark is no longer enough." data-source="{&quot;label&quot;:&quot;Vellum LLM Leaderboard, May 2026&quot;,&quot;url&quot;:&quot;https://www.vellum.ai/llm-leaderboard&quot;,&quot;date&quot;:&quot;2026-05&quot;}"></stat-callout>

<comparison-table data-caption="May 2026 Frontier Model Scores" data-headers="[&quot;Benchmark&quot;,&quot;GPT-5.5&quot;,&quot;Claude Opus 4.7&quot;,&quot;Gemini 3.1 Pro&quot;,&quot;Llama 4 Maverick&quot;,&quot;DeepSeek V3.2&quot;]" data-rows="[{&quot;feature&quot;:&quot;MMLU&quot;,&quot;values&quot;:[&quot;92.4%&quot;,&quot;92.1%&quot;,&quot;91.7%&quot;,&quot;89.3%&quot;,&quot;88.7%&quot;]},{&quot;feature&quot;:&quot;MMLU-Pro&quot;,&quot;values&quot;:[&quot;83.7%&quot;,&quot;84.6%&quot;,&quot;82.9%&quot;,&quot;79.4%&quot;,&quot;78.2%&quot;]},{&quot;feature&quot;:&quot;GPQA Diamond&quot;,&quot;values&quot;:[&quot;78.4%&quot;,&quot;79.2%&quot;,&quot;76.8%&quot;,&quot;71.3%&quot;,&quot;69.4%&quot;]},{&quot;feature&quot;:&quot;HumanEval&quot;,&quot;values&quot;:[&quot;94.7%&quot;,&quot;95.1%&quot;,&quot;93.8%&quot;,&quot;92.1%&quot;,&quot;91.6%&quot;]},{&quot;feature&quot;:&quot;LiveCodeBench v6&quot;,&quot;values&quot;:[&quot;68.4%&quot;,&quot;66.7%&quot;,&quot;64.2%&quot;,&quot;56.8%&quot;,&quot;59.3%&quot;]},{&quot;feature&quot;:&quot;SWE-bench Verified&quot;,&quot;values&quot;:[&quot;82.3%&quot;,&quot;84.1%&quot;,&quot;78.6%&quot;,&quot;67.4%&quot;,&quot;64.8%&quot;]},{&quot;feature&quot;:&quot;SWE-bench Pro&quot;,&quot;values&quot;:[&quot;46.3%&quot;,&quot;47.8%&quot;,&quot;41.2%&quot;,&quot;29.7%&quot;,&quot;27.4%&quot;]},{&quot;feature&quot;:&quot;ARC-AGI-2&quot;,&quot;values&quot;:[&quot;62.4%&quot;,&quot;64.7%&quot;,&quot;59.3%&quot;,&quot;38.6%&quot;,&quot;41.2%&quot;]},{&quot;feature&quot;:&quot;AIME&quot;,&quot;values&quot;:[&quot;86.7%&quot;,&quot;83.3%&quot;,&quot;90.0%&quot;,&quot;62.4%&quot;,&quot;67.8%&quot;]},{&quot;feature&quot;:&quot;Terminal-Bench 2.0&quot;,&quot;values&quot;:[&quot;38.4%&quot;,&quot;42.1%&quot;,&quot;35.7%&quot;,&quot;21.4%&quot;,&quot;23.7%&quot;]},{&quot;feature&quot;:&quot;OSWorld&quot;,&quot;values&quot;:[&quot;22.7%&quot;,&quot;28.4%&quot;,&quot;19.3%&quot;,&quot;11.8%&quot;,&quot;10.4%&quot;]},{&quot;feature&quot;:&quot;HLE&quot;,&quot;values&quot;:[&quot;34.1%&quot;,&quot;36.2%&quot;,&quot;31.8%&quot;,&quot;21.4%&quot;,&quot;23.7%&quot;]},{&quot;feature&quot;:&quot;TR-MMLU v2&quot;,&quot;values&quot;:[&quot;82.4%&quot;,&quot;84.1%&quot;,&quot;80.7%&quot;,&quot;71.3%&quot;,&quot;72.8%&quot;]}]"></comparison-table>
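
To see how the scoreboard lines up against the thresholds from Section 3, the sketch below lists, for each model, the benchmarks where it falls short. All numbers are copied from the two tables in this guide; the `misses` helper is hypothetical, and the AIME threshold of 26/30 is written as 86.7 to match the scoreboard's rounding.

```python
BENCHMARKS = [
    "MMLU", "MMLU-Pro", "GPQA Diamond", "HumanEval", "LiveCodeBench v6",
    "SWE-bench Verified", "SWE-bench Pro", "ARC-AGI-2", "AIME",
    "Terminal-Bench 2.0", "OSWorld", "HLE", "TR-MMLU v2",
]
# Frontier thresholds from the Section 3 table, in the same order (percent).
THRESHOLDS = [88, 80, 75, 92, 65, 80, 46, 55, 86.7, 38, 24, 34, 82]

# Scores from the scoreboard above, in the same benchmark order.
SCORES = {
    "GPT-5.5": [92.4, 83.7, 78.4, 94.7, 68.4, 82.3, 46.3, 62.4, 86.7, 38.4, 22.7, 34.1, 82.4],
    "Claude Opus 4.7": [92.1, 84.6, 79.2, 95.1, 66.7, 84.1, 47.8, 64.7, 83.3, 42.1, 28.4, 36.2, 84.1],
    "Gemini 3.1 Pro": [91.7, 82.9, 76.8, 93.8, 64.2, 78.6, 41.2, 59.3, 90.0, 35.7, 19.3, 31.8, 80.7],
    "Llama 4 Maverick": [89.3, 79.4, 71.3, 92.1, 56.8, 67.4, 29.7, 38.6, 62.4, 21.4, 11.8, 21.4, 71.3],
    "DeepSeek V3.2": [88.7, 78.2, 69.4, 91.6, 59.3, 64.8, 27.4, 41.2, 67.8, 23.7, 10.4, 23.7, 72.8],
}


def misses(model):
    """Benchmarks on which the model scores below the frontier threshold."""
    return [
        name
        for name, score, floor in zip(BENCHMARKS, SCORES[model], THRESHOLDS)
        if score < floor
    ]


for model in SCORES:
    print(f"{model}: {misses(model) or 'clears every threshold'}")
```

On these numbers no model clears every threshold: GPT-5.5 misses only OSWorld and Claude Opus 4.7 misses only AIME. A strict "every threshold" rule is therefore worth applying with the near-misses in view rather than as a pass/fail gate.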

## 6. Turkish-Market Perspective

### For Turkish CTOs
- Support/chatbot → TR-MMLU + TUMLU
- Turkish content → TUMLU Creative Writing
- Legal writing → TR-MMLU Law sub-score
- Engineering productivity → SWE-bench **Pro** (not Verified), LiveCodeBench v6
- Complex business processes → Terminal-Bench 2.0, OSWorld
- Financial reasoning → AIME, GPQA Diamond, ARC-AGI-2

### For Turkish Investors
A frontier company must clear thresholds across 5 dimensions: knowledge, code, math, agentic, language. A single-score pitch should raise eyebrows.

### For Turkish ML Engineers
Public benchmarks are the starting line. Real production decisions need your own Turkish eval set of 50-100 prompts, plus cost, latency, and KVKK (Turkish data-protection law) compliance checks.
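
A minimal sketch of such an in-house eval loop, measuring accuracy and latency on your own prompts. Everything here is an assumption for illustration: `stub_model` stands in for a real API client, the two cases are placeholders for a 50-100 prompt set, and exact-match grading is the simplest possible scorer.

```python
import time
from statistics import median


def run_eval(model_fn, cases):
    """Score a model callable on (prompt, expected) pairs and time each call."""
    correct, latencies = 0, []
    for prompt, expected in cases:
        start = time.perf_counter()
        answer = model_fn(prompt)
        latencies.append(time.perf_counter() - start)
        # Exact match after normalisation. Note that casefold() is not
        # Turkish-locale aware (I/ı, İ/i), so free-text answers need more care.
        correct += answer.strip().casefold() == expected.strip().casefold()
    return {"accuracy": correct / len(cases), "median_latency_s": median(latencies)}


def stub_model(prompt):
    """Stand-in for a real model call; replace with your provider's client."""
    known = {"Türkiye'nin başkenti neresidir?": "Ankara"}
    return known.get(prompt, "bilmiyorum")


cases = [
    ("Türkiye'nin başkenti neresidir?", "Ankara"),
    ("KVKK hangi yıl yürürlüğe girdi?", "2016"),
]
print(run_eval(stub_model, cases))
```

Running the same `cases` against each candidate model gives a like-for-like comparison that public leaderboards cannot; cost per call can be added to the returned dictionary in the same way as latency.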

## 7. Case Studies: Benchmark-Decision Mismatch

### Case 1 — Turkish SaaS Misled by HumanEval
A SaaS company picked a model on the strength of a 95.4% HumanEval score. In production, engineer productivity came in 40% below expectation. Cause: contamination plus HumanEval's standalone-function focus. On SWE-bench Pro the same model would have scored 30%, below the frontier.

### Case 2 — Turkish Bank Misled by GPQA
A bank selected a model on a GPQA Diamond score of 78%. Its performance on Turkish financial-market tasks was disappointing. Cause: GPQA is English-language and science-focused. The TR-MMLU Finance sub-score would have shown 71%, below the frontier.

### Case 3 — Turkish E-commerce Got It Right
An e-commerce company used 4 benchmarks: TUMLU NER, TUMLU Sentiment, LiveCodeBench v6, and OSWorld. It picked the only model that was frontier on all four. Production result: +18% product conversion and +0.3 points on a Likert customer-satisfaction scale.

## 8. Risks

### Contamination
- Training-data leakage (LiveCodeBench and SWE-bench Pro guard against this)
- Post-training contamination (RLHF on the benchmark itself): the most dangerous kind, because it is intentional
- Test-set memorization: detect it by rephrasing the same question and checking whether accuracy drops
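
The rephrasing check in the last bullet can be turned into a number: score the model on the original items and on paraphrases of the same items, then compare. A minimal sketch, with hypothetical per-item results and a hypothetical helper name:

```python
def rephrase_gap(original_correct, rephrased_correct):
    """Accuracy on original wording minus accuracy on paraphrases of the same items."""
    if len(original_correct) != len(rephrased_correct) or not original_correct:
        raise ValueError("need two equal-length, non-empty result lists")
    n = len(original_correct)
    return sum(original_correct) / n - sum(rephrased_correct) / n


# Hypothetical results on 10 items: 9 right as published, 6 right when reworded.
original = [True] * 9 + [False]
rephrased = [True] * 6 + [False] * 4
gap = rephrase_gap(original, rephrased)
print(round(gap, 2))
```

A gap near zero suggests the model handles the underlying question; a large positive gap, as in this made-up example, is consistent with memorized wording. What counts as "large" depends on the item count and is a judgment call.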

### Vendor Cherry-Picking
- Late 2024: OpenAI announced 88% on ARC-AGI-1 (true) but did not report the much lower ARC-AGI-2 score (25%)
- 2025: a vendor announced a #1 MMLU ranking but did not report SWE-bench Pro
- Q1 2026: multiple vendors announced LiveCodeBench scores without specifying v3 or v6

Always cross-check on Vellum, Artificial Analysis, LMSYS, CodeSOTA, BenchLM.

### Saturation
MMLU, HumanEval, GSM8K are no longer discriminative. Use MMLU-Pro, LiveCodeBench v6, MATH-Hard instead.
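
One rough way to see saturation in the Section 5 scoreboard is the spread between the best and worst model on each benchmark: the smaller the spread, the less the benchmark tells models apart. The sketch below uses four rows copied from that scoreboard; treating spread as a proxy for discriminative power is a simplifying assumption.

```python
# Scores for the five scoreboard models, copied from Section 5 (percent).
SCOREBOARD = {
    "MMLU": [92.4, 92.1, 91.7, 89.3, 88.7],
    "HumanEval": [94.7, 95.1, 93.8, 92.1, 91.6],
    "SWE-bench Pro": [46.3, 47.8, 41.2, 29.7, 27.4],
    "ARC-AGI-2": [62.4, 64.7, 59.3, 38.6, 41.2],
}


def spread(values):
    """Gap in percentage points between the best and worst score."""
    return round(max(values) - min(values), 1)


ranked = sorted(SCOREBOARD, key=lambda name: spread(SCOREBOARD[name]), reverse=True)
for name in ranked:
    print(f"{name}: {spread(SCOREBOARD[name])} points")
```

MMLU and HumanEval separate the five models by under 4 points, while SWE-bench Pro and ARC-AGI-2 separate them by more than 20.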

<callout-box data-variant="warning" data-title="Single-Score Marketing">

If a vendor's marketing leads with a single benchmark, be suspicious. Frontier status requires high scores across 5+ benchmarks. A single-score pitch is a cherry-picking or contamination signal.

</callout-box>

## 9. FAQ

<callout-box data-variant="answer" data-title="Does MMLU still matter?">
Not as a discriminator. Use it as a minimum threshold; for discrimination, use MMLU-Pro or GPQA Diamond.
</callout-box>

<callout-box data-variant="answer" data-title="Should I pick on HumanEval?">
No — contaminated and saturated. Use SWE-bench Pro + LiveCodeBench v6 + your own codebase eval.
</callout-box>

<callout-box data-variant="answer" data-title="What does ARC-AGI-2 SOTA mean?">
It means the model leads on fluid intelligence and learning transfer. But the human baseline (85%) is still uncrossed, so a mid-60s score means "promising reasoning", not human parity.
</callout-box>

<callout-box data-variant="answer" data-title="Which benchmarks matter for my Turkish company?">
TR-MMLU v2, TUMLU, TurkishMMLU-Pro for language; SWE-bench Pro / OSWorld depending on use-case.
</callout-box>

<callout-box data-variant="answer" data-title="How do I validate a vendor's frontier claim?">
Apply the 5-dimension rule: the model must clear thresholds across knowledge, code, math, agentic, and language benchmarks. Cross-check the scores on Vellum and Artificial Analysis.
</callout-box>

<callout-box data-variant="answer" data-title="Why prefer LiveCodeBench v6?">
Rolling-update design resists contamination. Most reliable code benchmark for 2026.
</callout-box>

<callout-box data-variant="answer" data-title="SWE-bench Verified or Pro?">
Pro. Lower contamination, higher real-engineering relevance. Official OpenAI position too.
</callout-box>

<callout-box data-variant="answer" data-title="Should I build my own eval harness?">
Yes, always. Public benchmarks are a starting line. A real decision needs an eval of at least 50-100 prompts that reflects your domain, language, and standards.
</callout-box>
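
When sizing your own eval set, it helps to know how much uncertainty 50-100 prompts leave. The sketch below computes a Wilson score interval for an observed accuracy; the choice of the Wilson method and of a 95% level (`z=1.96`) is an assumption of this example, not something the benchmarks above prescribe.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical run: 40 of 50 prompts answered correctly (80% observed).
low, high = wilson_interval(40, 50)
print(f"{low:.3f} to {high:.3f}")
```

With 50 prompts, an observed 80% is compatible with a true accuracy anywhere from roughly 67% to 89%, so two models a few points apart cannot be separated at that sample size. That is the practical reason to treat 50-100 as a floor.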

## 10. Next Steps

For LLM benchmark strategy or eval harness setup in your organization:

1. **Benchmark decision workshop.** We pick 5-7 use-case-relevant benchmarks and grade vendor pitches against them.
2. **Turkish eval set setup.** 100-200 prompts, automated regression protection.
3. **Model selection report.** Comparing your current model to frontier alternatives: ROI + KVKK + cost.

Reach out via the contact form on the site.

<references-list data-items="[{&quot;title&quot;:&quot;Measuring Massive Multitask Language Understanding (MMLU)&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2009.03300&quot;,&quot;author&quot;:&quot;Hendrycks et al.&quot;,&quot;publishedAt&quot;:&quot;2020-09-07&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;MMLU-Pro&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2406.01574&quot;,&quot;author&quot;:&quot;Wang et al.&quot;,&quot;publishedAt&quot;:&quot;2024-06-03&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;GPQA&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2311.12022&quot;,&quot;author&quot;:&quot;Rein et al.&quot;,&quot;publishedAt&quot;:&quot;2023-11-20&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;HumanEval&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2107.03374&quot;,&quot;author&quot;:&quot;Chen et al.&quot;,&quot;publishedAt&quot;:&quot;2021-07-07&quot;,&quot;publisher&quot;:&quot;arXiv (OpenAI)&quot;},{&quot;title&quot;:&quot;MBPP&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2108.07732&quot;,&quot;author&quot;:&quot;Austin et al.&quot;,&quot;publishedAt&quot;:&quot;2021-08-16&quot;,&quot;publisher&quot;:&quot;arXiv (Google)&quot;},{&quot;title&quot;:&quot;LiveCodeBench&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2403.07974&quot;,&quot;author&quot;:&quot;Jain et al.&quot;,&quot;publishedAt&quot;:&quot;2024-03-12&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;SWE-bench&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2310.06770&quot;,&quot;author&quot;:&quot;Jimenez et al.&quot;,&quot;publishedAt&quot;:&quot;2023-10-10&quot;,&quot;publisher&quot;:&quot;Princeton&quot;},{&quot;title&quot;:&quot;Introducing SWE-bench Verified&quot;,&quot;url&quot;:&quot;https://openai.com/index/introducing-swe-bench-verified/&quot;,&quot;author&quot;:&quot;OpenAI&quot;,&quot;publishedAt&quot;:&quot;2024-08-13&quot;,&quot;publisher&quot;:&quot;OpenAI&quot;},{&quot;title&quot;:&quot;Introducing SWE-bench Pro&quot;,&quot;url&quot;:&quot;https://openai.com/index/swe-bench-pro/&quot;,&quot;author&quot;:&quot;OpenAI&quot;,&quot;publishedAt&quot;:&quot;2025-09&quot;,&quot;publisher&quot;:&quot;OpenAI&quot;},{&quot;title&quot;:&quot;On the Measure of Intelligence (ARC-AGI)&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/1911.01547&quot;,&quot;author&quot;:&quot;Chollet&quot;,&quot;publishedAt&quot;:&quot;2019-11-04&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;ARC-AGI-2&quot;,&quot;url&quot;:&quot;https://arcprize.org/arc-agi-2/&quot;,&quot;author&quot;:&quot;ARC Prize&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;ARC Prize&quot;},{&quot;title&quot;:&quot;AIME Problems Archive&quot;,&quot;url&quot;:&quot;https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions&quot;,&quot;author&quot;:&quot;AoPS / MAA&quot;,&quot;publishedAt&quot;:&quot;Annual&quot;,&quot;publisher&quot;:&quot;AoPS&quot;},{&quot;title&quot;:&quot;MATH&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2103.03874&quot;,&quot;author&quot;:&quot;Hendrycks et al.&quot;,&quot;publishedAt&quot;:&quot;2021-03-05&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;GSM8K&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2110.14168&quot;,&quot;author&quot;:&quot;Cobbe et al. (OpenAI)&quot;,&quot;publishedAt&quot;:&quot;2021-10-27&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;Terminal-Bench&quot;,&quot;url&quot;:&quot;https://github.com/lmsys/terminal-bench&quot;,&quot;author&quot;:&quot;LMSYS&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;GitHub&quot;},{&quot;title&quot;:&quot;OSWorld&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2404.07972&quot;,&quot;author&quot;:&quot;Xie et al.&quot;,&quot;publishedAt&quot;:&quot;2024-04-11&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;Humanity&apos;s Last Exam&quot;,&quot;url&quot;:&quot;https://lastexam.ai/&quot;,&quot;author&quot;:&quot;CAIS + Scale AI&quot;,&quot;publishedAt&quot;:&quot;2025-01&quot;,&quot;publisher&quot;:&quot;Center for AI Safety&quot;},{&quot;title&quot;:&quot;TR-MMLU&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2407.12402&quot;,&quot;author&quot;:&quot;Yazaroğlu et al.&quot;,&quot;publishedAt&quot;:&quot;2024-07-17&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;TUMLU&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2502.11340&quot;,&quot;author&quot;:&quot;Pamuk &amp; Karaer&quot;,&quot;publishedAt&quot;:&quot;2025-02-17&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;TurkishMMLU-Pro&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2603.04412&quot;,&quot;author&quot;:&quot;Vidoport Research Lab&quot;,&quot;publishedAt&quot;:&quot;2026-03-08&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;Vellum LLM Leaderboard&quot;,&quot;url&quot;:&quot;https://www.vellum.ai/llm-leaderboard&quot;,&quot;author&quot;:&quot;Vellum&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;Vellum&quot;},{&quot;title&quot;:&quot;Artificial Analysis&quot;,&quot;url&quot;:&quot;https://artificialanalysis.ai/&quot;,&quot;author&quot;:&quot;Artificial Analysis&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;Artificial Analysis&quot;},{&quot;title&quot;:&quot;LMSYS Chatbot Arena&quot;,&quot;url&quot;:&quot;https://chat.lmsys.org/leaderboard&quot;,&quot;author&quot;:&quot;LMSYS&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;LMSYS&quot;},{&quot;title&quot;:&quot;CodeSOTA&quot;,&quot;url&quot;:&quot;https://codesota.com/&quot;,&quot;author&quot;:&quot;CodeSOTA Team&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;CodeSOTA&quot;},{&quot;title&quot;:&quot;BenchLM&quot;,&quot;url&quot;:&quot;https://benchlm.com/&quot;,&quot;author&quot;:&quot;BenchLM&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;BenchLM&quot;},{&quot;title&quot;:&quot;WebArena&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2307.13854&quot;,&quot;author&quot;:&quot;Zhou et al.&quot;,&quot;publishedAt&quot;:&quot;2023-07-25&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;AgentBench&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2308.03688&quot;,&quot;author&quot;:&quot;Liu et al.&quot;,&quot;publishedAt&quot;:&quot;2023-08-07&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;Investigating Data Contamination in Modern Benchmarks&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2311.09783&quot;,&quot;author&quot;:&quot;Sainz et al.&quot;,&quot;publishedAt&quot;:&quot;2023-11-16&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;GPT-5.5 System Card&quot;,&quot;url&quot;:&quot;https://openai.com/index/gpt-5-5-system-card/&quot;,&quot;author&quot;:&quot;OpenAI&quot;,&quot;publishedAt&quot;:&quot;2026-01-22&quot;,&quot;publisher&quot;:&quot;OpenAI&quot;},{&quot;title&quot;:&quot;Claude Opus 4.7 Model Card&quot;,&quot;url&quot;:&quot;https://www.anthropic.com/news/claude-opus-4-7&quot;,&quot;author&quot;:&quot;Anthropic&quot;,&quot;publishedAt&quot;:&quot;2026-04-09&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;Gemini 3.1 Pro Technical Report&quot;,&quot;url&quot;:&quot;https://blog.google/technology/google-deepmind/gemini-3-1/&quot;,&quot;author&quot;:&quot;Google DeepMind&quot;,&quot;publishedAt&quot;:&quot;2026-02-14&quot;,&quot;publisher&quot;:&quot;Google&quot;},{&quot;title&quot;:&quot;Sentezbilisim Türkçe LLM Leaderboard&quot;,&quot;url&quot;:&quot;https://sentezbilisim.com/llm-leaderboard&quot;,&quot;author&quot;:&quot;Sentezbilisim&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;Sentezbilisim&quot;},{&quot;title&quot;:&quot;ChatGPT vs Claude vs Gemini: Turkish Test&quot;,&quot;url&quot;:&quot;https://sukruyusufkaya.com/en/blog/chatgpt-vs-claude-vs-gemini-turkce-test-tr-mmlu-2026&quot;,&quot;author&quot;:&quot;Şükrü Yusuf KAYA&quot;,&quot;publishedAt&quot;:&quot;2026&quot;,&quot;publisher&quot;:&quot;sukruyusufkaya.com&quot;}]"></references-list>

---

This is a living document; the benchmark landscape shifts every quarter, and this page is updated accordingly.