BENCHMARK LEADERBOARD

Model Evaluation Leaderboards

Track model rankings across standardized evaluations. Compare MMLU, HumanEval, and GPQA performance to identify the capabilities your product needs.

Rank	Model	Provider	Score	Change
1	GPT-5.5 Pro	OpenAI	96.5%	—
2	Claude 4.8 Opus	Anthropic	95.4%	—
3	Grok 4.20	xAI	94.7%	—
4	DeepSeek V4 Pro	DeepSeek	94.5%	—
5	GPT-5.5	OpenAI	94.2%	—
6	Claude 4.7 Opus	Anthropic	94.1%	—
7	Gemini 3.1 Pro	Google	92.8%	—
8	Claude 4.5 Opus	Anthropic	92.8%	—
9	Claude 4.6 Opus	Anthropic	92.5%	—
10	Grok 4.3	xAI	92.4%	↑ +1.5%
11	GPT-5	OpenAI	92.1%	—
12	OpenAI o1	OpenAI	91.8%	—
13	Llama 4 Maverick	Meta	91.5%	—
14	DeepSeek R1	DeepSeek	90.8%	—
15	Gemini 3.5 Flash	Google	90.5%	—
16	Claude Opus 4	Anthropic	90.5%	—
17	Claude 4.6 Sonnet	Anthropic	90.4%	↑ +0.8%
18	Gemini 2.5 Pro	Google	89.9%	—
19	Claude 4.5 Sonnet	Anthropic	89.5%	—
20	GPT-4o	OpenAI	88.7%	↑ +1.2%
21	Claude 3.5 Sonnet	Anthropic	88.7%	—
22	DeepSeek V4 Flash	DeepSeek	88.0%	—
23	Claude Sonnet 4	Anthropic	88.0%	—
24	OpenAI o3-mini	OpenAI	87.9%	—
25	Llama 4 Scout	Meta	87.2%	—
26	Gemini 3.1 Flash	Google	86.8%	—
27	Mistral Large 3	Mistral	86.8%	—
28	Claude 3 Opus	Anthropic	86.8%	—
29	GPT-5 Mini	OpenAI	86.5%	—
30	GPT-4 Turbo	OpenAI	86.4%	—
31	Llama 3.3 70B Instruct	Meta	86.2%	—
32	Claude 4.5 Haiku	Anthropic	84.8%	—
33	GPT-4o Mini	OpenAI	82.0%	—
34	Mistral Small 3	Mistral	81.2%	—
35	Claude 3 Sonnet	Anthropic	79.0%	—
36	Command R+	Cohere	75.7%	—
37	Claude 3.5 Haiku	Anthropic	75.2%	—
38	Claude 3 Haiku	Anthropic	75.2%	—
39	Llama 3.2 11B Vision	Meta	73.0%	—
40	Command R	Cohere	71.0%	—

About MMLU

Massive Multitask Language Understanding (MMLU) measures general knowledge across 57 academic subjects, from elementary math to professional law. It is one of the most widely used benchmarks for evaluating general knowledge and reasoning.

MMLU consists of multiple-choice questions covering humanities, social sciences, STEM, and other professional contexts. It tests both world knowledge and problem-solving capability. Strong performance indicates a highly versatile model capable of handling diverse tasks without specialized fine-tuning.
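Because MMLU is multiple choice, scoring reduces to comparing a predicted answer letter with the correct one. The sketch below is an illustration only: the `mmlu_accuracy` function, the record layout, and the sample records are hypothetical, and the source does not say whether the leaderboard averages over questions or over subjects, so both are computed.

```python
from collections import defaultdict


def mmlu_accuracy(records):
    """Score multiple-choice answers.

    records: iterable of (subject, predicted_letter, correct_letter).
    Returns (per-subject accuracy, micro average, macro average).
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, correct in records:
        hit = predicted.strip().upper() == correct.strip().upper()
        per_subject[subject][0] += int(hit)
        per_subject[subject][1] += 1
    if not per_subject:
        raise ValueError("no records to score")

    subject_acc = {s: c / n for s, (c, n) in per_subject.items()}
    total_correct = sum(c for c, _ in per_subject.values())
    total = sum(n for _, n in per_subject.values())
    micro = total_correct / total                        # every question weighted equally
    macro = sum(subject_acc.values()) / len(subject_acc)  # every subject weighted equally
    return subject_acc, micro, macro


# Hypothetical graded answers, not real benchmark data.
records = [
    ("elementary_mathematics", "B", "B"),
    ("elementary_mathematics", "C", "A"),
    ("professional_law", "d", "D"),
    ("professional_law", "A", "A"),
    ("professional_law", "B", "C"),
    ("professional_law", "C", "C"),
]
subject_acc, micro, macro = mmlu_accuracy(records)
```

The two averages diverge when subjects have different numbers of questions, which is worth checking before comparing scores from different sources.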

What do these benchmarks mean?

What is the MMLU benchmark?

What is the HumanEval benchmark?

HumanEval is a programming benchmark created by OpenAI to evaluate coding ability. It measures how accurately models generate functionally correct Python code from docstring specifications.

HumanEval consists of 164 hand-written programming problems. Models are evaluated with the pass@1 metric: the generated code is executed against automated unit tests and must pass on the first attempt. High scores indicate functional correctness and syntactic precision on short, self-contained programming tasks.
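The pass@1 metric is the k = 1 case of pass@k. A common way to estimate pass@k is to draw n samples per problem, count the c samples that pass the unit tests, and compute 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator. The function name and the sample counts are illustrative, and the source does not state how many samples per problem this leaderboard uses.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed all unit tests,
    k: attempt budget being scored.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (n, c) counts for three problems; the benchmark score
# is the mean of the per-problem estimates.
results = [(10, 3), (10, 0), (10, 10)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

With k = 1 the estimate for each problem is simply c / n, so a single greedy sample per problem (n = 1) yields either 0 or 1.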

What is the MATH benchmark?

MATH measures mathematical problem-solving skills across seven subject areas of competition-level mathematics. It requires models to perform complex, multi-step symbolic reasoning rather than simple calculation.

MATH is exceptionally difficult for standard models. It covers algebra, precalculus, probability, geometry, and number theory. Unlike multiple-choice benchmarks, MATH requires the model to generate a final answer, such as an expression or a number, which tests its chain-of-thought planning and logical precision.
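Because answers are generated rather than selected, grading needs an answer-extraction step. MATH reference solutions mark the final answer with `\boxed{...}`, so a minimal grader can pull out the last boxed expression and compare it with the reference. The sketch below makes that assumption. Its helper names are hypothetical, and its normalization is deliberately simplistic compared with the equivalence checks real graders use.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # brace depth, so nested groups like \frac{1}{2} survive
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces


def normalize(answer):
    """Very rough normalization: drop spaces and a trailing period."""
    return answer.replace(" ", "").rstrip(".")


def is_correct(model_output, reference):
    predicted = extract_boxed(model_output)
    return predicted is not None and normalize(predicted) == normalize(reference)


# Hypothetical model output.
output = r"Adding the two cases gives \boxed{\frac{1}{2}}."
answer = extract_boxed(output)
```

String matching of this kind marks mathematically equivalent forms (for example `0.5` versus `\frac{1}{2}`) as wrong, which is one reason reported MATH scores can differ between evaluation harnesses.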

What is the MT-Bench benchmark?

MT-Bench is a multi-turn conversation benchmark. It evaluates how well models maintain coherence, sound reasoning, and adherence to instructions across successive dialogue turns.

MT-Bench tests eight categories of tasks, including coding, math, roleplay, and writing. A strong model such as GPT-4 acts as a judge and grades each response on a scale from 1 to 10. High scores indicate strong instruction following and retention of conversational context across turns.
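Once the judge has assigned 1 to 10 grades, they have to be aggregated into a single number. The sketch below shows one plausible aggregation: a plain mean overall, per category, and per turn. The record layout, the `aggregate_scores` name, and the sample grades are hypothetical, and the judge call itself is out of scope here.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(judgments):
    """Average judge grades overall, per category, and per turn.

    judgments: list of dicts with keys 'category', 'turn', 'score'.
    """
    by_category = defaultdict(list)
    by_turn = defaultdict(list)
    for j in judgments:
        if not 1 <= j["score"] <= 10:
            raise ValueError(f"score out of range: {j['score']}")
        by_category[j["category"]].append(j["score"])
        by_turn[j["turn"]].append(j["score"])
    return {
        "overall": mean(j["score"] for j in judgments),
        "by_category": {c: mean(s) for c, s in by_category.items()},
        "by_turn": {t: mean(s) for t, s in by_turn.items()},
    }


# Hypothetical judge grades for two categories, two turns each.
judgments = [
    {"category": "coding", "turn": 1, "score": 8},
    {"category": "coding", "turn": 2, "score": 6},
    {"category": "writing", "turn": 1, "score": 9},
    {"category": "writing", "turn": 2, "score": 7},
]
summary = aggregate_scores(judgments)
```

Reporting the per-turn means alongside the overall score makes it visible when a model degrades on follow-up turns.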

What is the GPQA benchmark?

GPQA (Graduate-Level Google-Proof Q&A Benchmark) tests advanced scientific understanding using questions written by PhD-level experts. The questions are specifically designed to be difficult to answer with a search engine.

GPQA contains physics, biology, and chemistry questions that even human experts find challenging. Non-experts with access to Google search score only around 34%. To do well, a model must exhibit advanced abstract reasoning and scientific understanding, which makes GPQA a key benchmark for expert-level capability.

What is the HellaSwag benchmark?

HellaSwag evaluates common-sense reasoning and situational prediction. It tests whether a model can accurately determine the most likely next event in a described physical scenario.

HellaSwag is constructed with adversarial filtering, which keeps scenarios that are easy for humans (who score ~95%) but difficult for language models. It requires deep contextual understanding of everyday physics, human intent, and spatial logic.
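Choosing the most likely next event is often implemented by scoring each candidate ending with the model's log-likelihood and picking the highest, optionally normalized by length so that longer endings are not penalized. The sketch below assumes that approach. The `pick_ending` function and the log-likelihood values are hypothetical stand-ins for real model outputs, and the source does not say which scoring variant this leaderboard uses.

```python
def pick_ending(endings, logprobs, normalize=True):
    """Return the index of the highest-scoring candidate ending.

    endings: candidate continuation strings.
    logprobs: total log-likelihood the model assigned to each ending.
    normalize: divide by character length before comparing.
    """
    if len(endings) != len(logprobs) or not endings:
        raise ValueError("need one log-likelihood per ending")
    scores = [
        lp / max(len(e), 1) if normalize else lp
        for e, lp in zip(endings, logprobs)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical candidates for the context "The ball reaches the edge and".
endings = ["falls.", "rolls off the table and lands on the floor."]
logprobs = [-6.0, -20.0]  # made-up values, not real model output

raw_choice = pick_ending(endings, logprobs, normalize=False)
norm_choice = pick_ending(endings, logprobs, normalize=True)
```

In this made-up case the raw score favors the short ending while the normalized score favors the longer one, which shows that the normalization choice can change a reported accuracy.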