What MMLU, HumanEval, and GSM8K Actually Measure
A quick guide to the benchmarks we use: what they test, how they’re run, and what to take away when comparing models.
We display scores for several standard benchmarks. Here’s what each one is about and how to interpret the numbers.
MMLU (Massive Multitask Language Understanding)
MMLU asks four-option multiple-choice questions across 57 subjects (STEM, humanities, social sciences, and more) and reports the fraction answered correctly. It measures breadth of knowledge and general reasoning. A high MMLU score usually means the model is strong on general knowledge; it says little about coding ability or long-form writing.
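Scoring is straightforwardly accuracy over answer letters. A minimal sketch (the answer keys below are invented, not real MMLU data):

```python
# Hypothetical illustration of MMLU-style scoring: compare the letter the
# model picked against the reference key for each question, then report
# the fraction correct.
gold_answers = ["B", "D", "A", "C"]    # reference answer keys (invented)
model_answers = ["B", "D", "C", "C"]   # letters the model chose (invented)

correct = sum(g == m for g, m in zip(gold_answers, model_answers))
accuracy = correct / len(gold_answers)
print(f"MMLU-style accuracy: {accuracy:.0%}")  # 3 of 4 correct -> 75%
```

Real harnesses differ in how they extract the model's letter choice (log-probabilities vs. generated text), which is one reason published MMLU numbers for the same model can vary by a few points.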
HumanEval
HumanEval tests code generation: the model receives a Python function signature and docstring and must produce an implementation that passes the problem's unit tests. It consists of 164 hand-written programming problems and is the standard answer to "can this model write correct code?" Pair it with user reviews if your use case is real-world development.
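Grading is all-or-nothing per sample: a completion counts only if every test passes. Results are then reported as pass@k, the chance that at least one of k samples passes. A sketch below, with an invented toy task (not from the real benchmark); the pass@k estimator is the standard unbiased one from the HumanEval paper:

```python
from math import comb

# Toy HumanEval-style task (invented): the model gets a signature and
# docstring and must fill in the body.
def candidate_solution(nums):
    """Return the sum of the even numbers in nums."""
    return sum(x for x in nums if x % 2 == 0)

# All-or-nothing grading: the sample passes only if every assert holds.
assert candidate_solution([1, 2, 3, 4]) == 6
assert candidate_solution([]) == 0

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples per problem, c passing."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3 -> 30% chance one sample passes
```

pass@1 (one shot, no retries) is the number most leaderboards quote, and it's the hardest setting: pass@10 or pass@100 scores are always at least as high.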
GSM8K (Grade School Math 8K)
GSM8K is a set of roughly 8,500 grade-school math word problems that require multi-step arithmetic. It probes logical reasoning and the ability to follow a chain of steps. A strong GSM8K score suggests the model can handle structured reasoning; for harder, competition-level math we also track the MATH benchmark.
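To make "multi-step arithmetic" concrete, here is an invented GSM8K-flavoured problem decomposed the way a chain-of-thought answer would be (the problem is our own, not from the benchmark):

```python
# "A baker makes 3 trays of 12 muffins, sells 20, and boxes the rest
#  in packs of 4. How many full packs does she have?"
muffins_baked = 3 * 12               # step 1: 36 muffins in total
muffins_left = muffins_baked - 20    # step 2: 16 muffins remain
packs = muffins_left // 4            # step 3: 16 / 4 = 4 full packs
print(packs)  # 4
```

Each step depends on the previous one, so a single arithmetic slip anywhere in the chain yields a wrong final answer; that's what makes GSM8K a reasoning test rather than a calculator test.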
Using benchmarks on this site
On each model page you'll see scores and percentiles for these and other tests. Use them to compare relative strengths (e.g. "best at code" vs. "best at math"), and pair that with our benchmark descriptions and user ratings for a full picture.