BigBench Hard

Name: BigBench Hard Benchmark Results
Creator: BAUS.AI

BigBench Hard is a subset of 23 challenging tasks from the Beyond the Imitation Game Benchmark (BIG-bench).

What it measures: Diverse reasoning across logic, linguistics, world knowledge, and multi-step problem solving.
How it was administered: Task formats vary; each task is scored with a metric suited to it (exact match, F1, etc.), under few-shot or zero-shot prompting.
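
As an illustration of the metrics named above, here is a minimal sketch of exact match and token-level F1 scoring. It is not the scoring code used for this leaderboard, which the page does not describe. The function names (`normalize`, `exact_match`, `token_f1`, `benchmark_score`), the lowercase-and-whitespace normalization, and the averaging into a 0-100 score are all assumptions made for the example.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplified normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over whitespace tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return 1.0 if pred_tokens == ref_tokens else 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def benchmark_score(predictions, references, metric=exact_match) -> float:
    """Average a per-example metric and scale to 0-100 (hypothetical aggregation)."""
    scores = [metric(p, r) for p, r in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)
```

For example, `benchmark_score(["yes", "no"], ["yes", "yes"])` averages one hit and one miss under exact match. Real evaluation harnesses often normalize more aggressively (stripping punctuation or extracting a final answer from a longer response), so scores from this sketch would not necessarily match published numbers.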

Model rankings

Models ranked by score on this benchmark. Higher is better.

Rank	Model	Provider	Score	Percentile	Tags
1	GPT-o1	OpenAI	90.5	p99	Text Generation, Reasoning, Proprietary
2	DeepSeek R1	DeepSeek	89.0	p99	Text Generation, Reasoning, Open Weight, Large