Code Eval: HumanEval + MBPP + BigCodeBench + LiveCodeBench + SWE-Bench-Lite
A standard benchmark suite for code LLMs: HumanEval (164 Python problems), MBPP (974 Python problems), BigCodeBench (1,140 tasks across 139 libraries), LiveCodeBench (data-leak resistant), and SWE-Bench-Lite (300 real GitHub issues). Covers pass@1 vs. pass@10 and the code-execution sandbox. Benchmarks run on an RTX 4090.
Şükrü Yusuf KAYA
26 min read
## 1. Code Benchmark Table
| Benchmark | Size | Type | Notes |
|---|---|---|---|
| HumanEval | 164 | function-level Python | classic; data-leak risk |
| HumanEval-X (multilingual) | 164 × 6 | 6 languages | EN + ZH + ... |
| MBPP (Mostly Basic Python Problems) | 974 | basic algorithmic | good baseline |
| BigCodeBench | 1,140 | real-world library calls (139 libs) | most realistic |
| LiveCodeBench | 400+ | LeetCode-style, refreshed monthly | data-leak resistant |
| SWE-Bench-Lite | 300 | real GitHub issue → PR | hardest; agentic |
| RepoBench | 27K | code completion | repo-level |
The cookbook's standard eval suite: HumanEval + MBPP (sanity checks), BigCodeBench + LiveCodeBench (real-world), and SWE-Bench-Lite (agentic).
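All of these benchmarks execute model-generated code against test cases, which should never happen in the host process. Below is a minimal sketch of a timeout-guarded runner; `run_candidate` is a hypothetical helper name, and a subprocess alone is not a real sandbox (use containers or gVisor for untrusted model output):

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, test: str, timeout: float = 5.0) -> bool:
    """Run a candidate solution plus its test in a child Python process,
    killed after `timeout` seconds. Returns True if the tests pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test + "\n")
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        # Exit code 0 means every assert in the test block passed.
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # infinite loops count as failures
    finally:
        os.unlink(path)
```

A correct solution returns `True`, a wrong or hanging one returns `False`; the per-problem results then feed directly into the pass@k computation.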
Pass@k: the probability that at least one of k sampled completions passes the tests. Cookbook defaults: pass@1 (greedy decoding) and pass@10 (temperature 0.8).

✅ Deliverables
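In practice pass@k is not measured by literally drawing k samples; the convention is to draw n ≥ k samples, count the c correct ones, and apply the unbiased estimator 1 − C(n−c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled completions of which
    c passed the tests, estimate the probability that at least one of
    k samples would be correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain at least one correct completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 10 samples and c = 5 correct, pass@1 is 0.5, while pass@10 is 1.0 because every 10-sample draw includes a correct completion.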
1. Run HumanEval with lm-eval-harness.
2. Compare scores before and after fine-tuning.
3. Next lesson: 8.8 — Code-LLM Safety.
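The first deliverable can be launched with a command along these lines (a sketch, not a verified invocation: the model id is a placeholder, and flag names such as `--confirm_run_unsafe_code` depend on your lm-eval-harness version, so check `lm_eval --help`):

```shell
# HumanEval executes generated code, so lm-eval-harness requires
# explicit opt-in before it will run it.
HF_ALLOW_CODE_EVAL=1 lm_eval \
  --model hf \
  --model_args pretrained=<your-model-id> \
  --tasks humaneval \
  --batch_size 8 \
  --confirm_run_unsafe_code
```

Run it once against the base checkpoint and once against the fine-tuned one, keeping every other flag identical so the pass@1 numbers are comparable.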