Code Eval: HumanEval + MBPP + BigCodeBench + LiveCodeBench + SWE-Bench-Lite
A standard benchmark suite for code LLMs: HumanEval (164 Python problems), MBPP (974 Python problems), BigCodeBench (1,140 tasks across 139 libraries), LiveCodeBench (data-leak resistant), and SWE-Bench-Lite (300 real GitHub issues). Covers pass@1 vs. pass@10 and the code-execution sandbox. Benchmarks run on an RTX 4090.
Şükrü Yusuf KAYA
26 min read
## 1. Code Benchmark Table
| Benchmark | Size | Type | Notes |
|---|---|---|---|
| HumanEval | 164 | function-level Python | classic; data-leak risk |
| HumanEval-X (multilingual) | 164 × 6 | 6 languages | EN + ZH + ... |
| MBPP (Mostly Basic Python Problems) | 974 | basic algorithmic | good baseline |
| BigCodeBench | 1,140 | real-world library calls (139 libs) | most realistic |
| LiveCodeBench | 400+ | LeetCode-style, refreshed monthly | data-leak resistant |
| SWE-Bench-Lite | 300 | real GitHub issue → PR | hardest; agentic |
| RepoBench | 27K | code completion | repo-level |
The cookbook's standard eval suite: HumanEval + MBPP (sanity checks), BigCodeBench + LiveCodeBench (real-world), and SWE-Bench-Lite (agentic).
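All of these benchmarks execute model-generated code against test cases, which should never happen in the host process. Below is a minimal sketch of a timeout-guarded runner; `run_candidate` is a hypothetical helper name, and a subprocess alone is not a real sandbox (use containers or gVisor for untrusted model output):

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, test: str, timeout: float = 5.0) -> bool:
    """Run a candidate solution plus its test in a child Python process,
    killed after `timeout` seconds. Returns True if the tests pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test + "\n")
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        # Exit code 0 means every assert in the test block passed.
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # infinite loops count as failures
    finally:
        os.unlink(path)
```

A correct solution returns `True`, a wrong or hanging one returns `False`; the per-problem results then feed directly into the pass@k computation.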
Pass@k: the probability that at least one of k sampled completions passes the tests. Cookbook defaults: pass@1 (greedy decoding) and pass@10 (temperature 0.8).

✅ Deliverables
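In practice pass@k is not measured by literally drawing k samples; the convention is to draw n ≥ k samples, count the c correct ones, and apply the unbiased estimator 1 − C(n−c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled completions of which
    c passed the tests, estimate the probability that at least one of
    k samples would be correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain at least one correct completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 10 samples and c = 5 correct, pass@1 is 0.5, while pass@10 is 1.0 because every 10-sample draw includes a correct completion.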
1. Run HumanEval with lm-eval-harness.
2. Compare scores before and after fine-tuning.
3. Next lesson: 8.8 — Code-LLM Safety.
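The first deliverable can be launched with a command along these lines (a sketch, not a verified invocation: the model id is a placeholder, and flag names such as `--confirm_run_unsafe_code` depend on your lm-eval-harness version, so check `lm_eval --help`):

```shell
# HumanEval executes generated code, so lm-eval-harness requires
# explicit opt-in before it will run it.
HF_ALLOW_CODE_EVAL=1 lm_eval \
  --model hf \
  --model_args pretrained=<your-model-id> \
  --tasks humaneval \
  --batch_size 8 \
  --confirm_run_unsafe_code
```

Run it once against the base checkpoint and once against the fine-tuned one, keeping every other flag identical so the pass@1 numbers are comparable.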