Code Eval: HumanEval + MBPP + BigCodeBench + LiveCodeBench + SWE-Bench-Lite

The standard benchmark suite for code LLMs: HumanEval (164 Python problems), MBPP (974 Python problems), BigCodeBench (1,140 tasks calling into 139 libraries), LiveCodeBench (data-leak resistant), and SWE-Bench-Lite (300 real GitHub issue fixes). Covers the pass@1 vs. pass@10 metrics, sandboxed code execution, and running the benchmarks on an RTX 4090.

Şükrü Yusuf KAYA

26 min read

14.05.2026

Advanced


1. Code Benchmark Table

Benchmark	Size	Type	Notes
HumanEval	164	function-level Python	classic; risk of data leakage
HumanEval-X (multilingual)	164 × 6	6 languages	EN + ZH + ...
MBPP (Mostly Basic Python Problems)	974	basic algorithmic	good baseline
BigCodeBench	1140	real-world library calls (139 libraries)	most realistic
LiveCodeBench	400+	LeetCode-style, refreshed monthly	data-leak resistant
SWE-Bench-Lite	300	real GitHub issue → PR	hardest; agentic
RepoBench	27K	code completion	repo-level
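
Function-level benchmarks such as HumanEval are usually scored with pass@k, which is where the pass@1 vs. pass@10 distinction comes from. The sketch below assumes the commonly used unbiased estimator: for each problem, draw n samples, count the c that pass the tests, and compute 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` input format are illustrative choices, not part of any benchmark's official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a passing one.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples fail), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three problems, 10 samples each.
    demo = [(10, 3, ), (10, 0), (10, 10)]
    print("pass@1 :", mean_pass_at_k(demo, k=1))
    print("pass@10:", mean_pass_at_k(demo, k=10))
```

With n = k = 1 this reduces to the plain fraction of problems solved on the first attempt, so pass@1 from a single greedy sample and pass@10 from ten sampled completions are not directly comparable numbers.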