MBPP

Name: MBPP Benchmark Results
Creator: BAUS.AI

Mostly Basic Python Problems (MBPP): 974 crowd-sourced Python programming problems that test basic programming competence.

What it measures: Basic Python programming ability and code generation correctness.
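How it was administered: Each model generates a function implementation for every problem; the implementation is run against 3 automated test cases per problem, and pass@1 (the share of problems solved by the first generated solution) is reported.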
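
Illustration: the scoring procedure described above can be sketched roughly as follows. This is a minimal sketch, not the harness used for these results. It assumes each problem supplies its test cases as plain `assert` statements, and the names `passes_all_tests`, `pass_at_1`, and the two sample problems are hypothetical. A real harness would also sandbox the generated code and enforce timeouts, which this sketch omits.

```python
def passes_all_tests(candidate_src: str, test_asserts: list[str]) -> bool:
    """Return True if the generated code satisfies every assert statement."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the generated function
        for test in test_asserts:
            exec(test, namespace)        # each test is a single `assert ...` line
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a miss.
        return False
    return True


def pass_at_1(results: list[bool]) -> float:
    """Percentage of problems solved by the single sampled completion."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Hypothetical problems in the MBPP style: one completion, three tests each.
problems = [
    {
        "completion": "def add(a, b):\n    return a + b\n",
        "tests": [
            "assert add(1, 2) == 3",
            "assert add(-1, 1) == 0",
            "assert add(0, 0) == 0",
        ],
    },
    {
        # Deliberately wrong implementation, so this problem is not solved.
        "completion": "def is_even(n):\n    return n % 2 == 1\n",
        "tests": [
            "assert is_even(2)",
            "assert not is_even(3)",
            "assert is_even(0)",
        ],
    },
]

results = [passes_all_tests(p["completion"], p["tests"]) for p in problems]
score = pass_at_1(results)
print(f"pass@1 = {score:.1f}")  # pass@1 = 50.0
```

A problem counts as solved only if all of its tests pass, so one failing assertion is enough to mark the completion as incorrect.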

Model rankings

Models ranked by score on this benchmark. Higher is better.

Rank	Model	Provider	Score	Percentile	Tags
1	Qwen 2.5 Coder 32B	Alibaba	90.2	p98	Code Assistant, Open Weight, Medium
2	Codestral	Mistral AI	88.0	p97	Reasoning, Small, Code Assistant, Proprietary
3