BigBench Hard
BigBench Hard is a subset of 23 challenging tasks from the Beyond the Imitation Game Benchmark (BIG-bench).
About this test
- What it measures: diverse reasoning across logic, linguistics, knowledge, and multi-step tasks.
- How it was administered: multiple task formats, scored with various metrics (exact match, F1, etc.), in few-shot or zero-shot settings.
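The two metrics named above can be illustrated with a minimal sketch. This is not the benchmark's official scoring code: the function names `normalize`, `exact_match`, and `token_f1` are hypothetical, and the normalization (lowercasing and whitespace splitting) is a simplifying assumption that real harnesses usually extend.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (assumed normalization, not official)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match suits tasks with a single short answer (for example a multiple-choice label), while token-level F1 gives partial credit when a free-form answer overlaps the reference.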
Model rankings
Models ranked by score on this benchmark. Higher is better.
| Rank | Model | Provider | Score | Percentile | Tags |
|---|---|---|---|---|---|
| 1 | | OpenAI | 90.5 | p99 | Text Generation, Reasoning, Proprietary |
| 2 | | DeepSeek | 89.0 | p99 | Text Generation, Reasoning, Open Weight, Large |