What MMLU, HumanEval, and GSM8K Actually Measure
A quick guide to the benchmarks we use: what they test, how they’re run, and what to take away when comparing models.
We display scores for several standard benchmarks. Here’s what each one is about and how to interpret the numbers.
MMLU (Massive Multitask Language Understanding)
MMLU asks four-option multiple-choice questions across 57 subjects (STEM, humanities, social sciences, and more) and reports the fraction answered correctly. It measures breadth of knowledge and general reasoning. A high MMLU score usually means the model is strong on general knowledge; it says little about coding ability or long-form writing.
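Scoring is straightforwardly accuracy over answer letters. A minimal sketch (the answer keys below are invented, not real MMLU data):

```python
# Hypothetical illustration of MMLU-style scoring: compare the letter the
# model picked against the reference key for each question, then report
# the fraction correct.
gold_answers = ["B", "D", "A", "C"]    # reference answer keys (invented)
model_answers = ["B", "D", "C", "C"]   # letters the model chose (invented)

correct = sum(g == m for g, m in zip(gold_answers, model_answers))
accuracy = correct / len(gold_answers)
print(f"MMLU-style accuracy: {accuracy:.0%}")  # 3 of 4 correct -> 75%
```

Real harnesses differ in how they extract the model's letter choice (log-probabilities vs. generated text), which is one reason published MMLU numbers for the same model can vary by a few points.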
HumanEval
HumanEval tests code generation: the model receives a Python function signature and docstring and must produce an implementation that passes the problem's unit tests. It consists of 164 hand-written programming problems and is the standard answer to "can this model write correct code?" Pair it with user reviews if your use case is real-world development.
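Grading is all-or-nothing per sample: a completion counts only if every test passes. Results are then reported as pass@k, the chance that at least one of k samples passes. A sketch below, with an invented toy task (not from the real benchmark); the pass@k estimator is the standard unbiased one from the HumanEval paper:

```python
from math import comb

# Toy HumanEval-style task (invented): the model gets a signature and
# docstring and must fill in the body.
def candidate_solution(nums):
    """Return the sum of the even numbers in nums."""
    return sum(x for x in nums if x % 2 == 0)

# All-or-nothing grading: the sample passes only if every assert holds.
assert candidate_solution([1, 2, 3, 4]) == 6
assert candidate_solution([]) == 0

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples per problem, c passing."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3 -> 30% chance one sample passes
```

pass@1 (one shot, no retries) is the number most leaderboards quote, and it's the hardest setting: pass@10 or pass@100 scores are always at least as high.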
GSM8K (Grade School Math 8K)
GSM8K is a set of roughly 8,500 grade-school math word problems that require multi-step arithmetic. It probes logical reasoning and the ability to follow a chain of steps. A strong GSM8K score suggests the model can handle structured reasoning; for harder, competition-level math we also track the MATH benchmark.
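To make "multi-step arithmetic" concrete, here is an invented GSM8K-flavoured problem decomposed the way a chain-of-thought answer would be (the problem is our own, not from the benchmark):

```python
# "A baker makes 3 trays of 12 muffins, sells 20, and boxes the rest
#  in packs of 4. How many full packs does she have?"
muffins_baked = 3 * 12               # step 1: 36 muffins in total
muffins_left = muffins_baked - 20    # step 2: 16 muffins remain
packs = muffins_left // 4            # step 3: 16 / 4 = 4 full packs
print(packs)  # 4
```

Each step depends on the previous one, so a single arithmetic slip anywhere in the chain yields a wrong final answer; that's what makes GSM8K a reasoning test rather than a calculator test.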
Using benchmarks on this site
On each model page you'll see scores and percentiles for these and other tests. Use them to compare relative strengths (e.g. "best at code" vs. "best at math"), and pair that with our benchmark descriptions and user ratings for a full picture.