MBPP
Mostly Basic Python Problems: 974 crowd-sourced Python programming problems testing basic programming competence.
About this test
- What it measures: basic Python programming ability and the correctness of generated code.
- How it was administered: models generate a function implementation for each problem; each implementation is run against that problem's 3 automated test cases, and pass@1 is reported.
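As a rough illustration of the scoring procedure described above, the sketch below computes pass@1 for a handful of problems. It assumes each problem's tests are plain `assert` statements and that one solution is generated per problem. The helper names (`passes_all_tests`, `pass_at_1`) and the two sample problems are hypothetical and are not taken from the benchmark's own harness.

```python
def passes_all_tests(candidate_source: str, test_asserts: list[str]) -> bool:
    """Return True if the candidate code runs and every assert holds."""
    namespace: dict = {}
    try:
        # Define the generated function in a fresh namespace.
        exec(candidate_source, namespace)
        # Run each test; a failing assert raises AssertionError.
        for test in test_asserts:
            exec(test, namespace)
    except Exception:
        return False
    return True


def pass_at_1(samples: list[tuple[str, list[str]]]) -> float:
    """Percentage of problems whose single generated solution passes all tests."""
    if not samples:
        return 0.0
    solved = sum(passes_all_tests(src, tests) for src, tests in samples)
    return 100.0 * solved / len(samples)


# Hypothetical problems: (generated solution, three assert-style tests).
samples = [
    (
        "def add(a, b):\n    return a + b\n",
        ["assert add(1, 2) == 3", "assert add(0, 0) == 0", "assert add(-1, 1) == 0"],
    ),
    (
        # Deliberately wrong: subtracts instead of multiplying.
        "def mul(a, b):\n    return a - b\n",
        ["assert mul(2, 3) == 6", "assert mul(0, 5) == 0", "assert mul(1, 1) == 1"],
    ),
]

score = pass_at_1(samples)
print(f"pass@1: {score:.1f}")  # prints "pass@1: 50.0"
```

Calling `exec` on model output is unsafe outside a toy setting. A real evaluation would presumably run each solution in an isolated process with a timeout.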
Model rankings
Models ranked by score on this benchmark. Higher is better.
| Rank | Model | Provider | Score | Percentile | Tags |
|---|---|---|---|---|---|
| 1 | | Alibaba | 90.2 | p98 | Code Assistant, Open Weight, Medium |
| 2 | | Mistral AI | 88.0 | p97 | Reasoning, Small, Code Assistant, Proprietary |
| 3 |