BENCHMARK LEADERBOARD

Model Evaluation Leaderboards

Track model rankings across standardized evaluations. Compare MMLU, HumanEval, and GPQA performance to identify the capabilities that suit your product.

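| Rank | Model | Provider | Score | Change |
|------|-------|----------|-------|--------|
| 1 | GPT-5.5 Pro | OpenAI | 96.5% | |
| 2 | Claude 4.8 Opus | Anthropic | 95.4% | |
| 3 | Grok 4.20 | xAI | 94.7% | |
| 4 | DeepSeek V4 Pro | DeepSeek | 94.5% | |
| 5 | GPT-5.5 | OpenAI | 94.2% | |
| 6 | Claude 4.7 Opus | Anthropic | 94.1% | |
| 7 | Gemini 3.1 Pro | Google | 92.8% | |
| 8 | Claude 4.5 Opus | Anthropic | 92.8% | |
| 9 | Claude 4.6 Opus | Anthropic | 92.5% | |
| 10 | Grok 4.3 | xAI | 92.4% | ↑ +1.5% |
| 11 | GPT-5 | OpenAI | 92.1% | |
| 12 | OpenAI o1 | OpenAI | 91.8% | |
| 13 | Llama 4 Maverick | Meta | 91.5% | |
| 14 | DeepSeek R1 | DeepSeek | 90.8% | |
| 15 | Gemini 3.5 Flash | Google | 90.5% | |
| 16 | Claude Opus 4 | Anthropic | 90.5% | |
| 17 | Claude 4.6 Sonnet | Anthropic | 90.4% | ↑ +0.8% |
| 18 | Gemini 2.5 Pro | Google | 89.9% | |
| 19 | Claude 4.5 Sonnet | Anthropic | 89.5% | |
| 20 | GPT-4o | OpenAI | 88.7% | ↑ +1.2% |
| 21 | Claude 3.5 Sonnet | Anthropic | 88.7% | |
| 22 | DeepSeek V4 Flash | DeepSeek | 88.0% | |
| 23 | Claude Sonnet 4 | Anthropic | 88.0% | |
| 24 | OpenAI o3-mini | OpenAI | 87.9% | |
| 25 | Llama 4 Scout | Meta | 87.2% | |
| 26 | Gemini 3.1 Flash | Google | 86.8% | |
| 27 | Mistral Large 3 | Mistral | 86.8% | |
| 28 | Claude 3 Opus | Anthropic | 86.8% | |
| 29 | GPT-5 Mini | OpenAI | 86.5% | |
| 30 | GPT-4 Turbo | OpenAI | 86.4% | |
| 31 | Llama 3.3 70B Instruct | Meta | 86.2% | |
| 32 | Claude 4.5 Haiku | Anthropic | 84.8% | |
| 33 | GPT-4o Mini | OpenAI | 82.0% | |
| 34 | Mistral Small 3 | Mistral | 81.2% | |
| 35 | Claude 3 Sonnet | Anthropic | 79.0% | |
| 36 | Command R+ | Cohere | 75.7% | |
| 37 | Claude 3.5 Haiku | Anthropic | 75.2% | |
| 38 | Claude 3 Haiku | Anthropic | 75.2% | |
| 39 | Llama 3.2 11B Vision | Meta | 73.0% | |
| 40 | Command R | Cohere | 71.0% | |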

About MMLU

Massive Multitask Language Understanding (MMLU) measures general knowledge across 57 academic subjects, from elementary math to professional law. It is one of the most widely reported benchmarks for general knowledge and reasoning.

MMLU consists of multiple-choice questions covering humanities, social sciences, STEM, and other professional contexts. It tests both world knowledge and problem-solving capability. Strong performance indicates a highly versatile model capable of handling diverse tasks without specialized fine-tuning.
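Because every MMLU item is multiple choice, scoring reduces to counting matching answer letters. The sketch below is a minimal illustration, not this leaderboard's scoring code: the `records` format and the sample rows are hypothetical, and the page does not say whether its score pools all questions (micro average) or averages the per-subject accuracies (macro average), so both are shown.

```python
from collections import defaultdict

def mmlu_accuracy(records):
    """Return (micro, macro) accuracy.

    records: iterable of (subject, predicted_choice, correct_choice).
    This input format is an assumption made for the example.
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, correct in records:
        per_subject[subject][0] += int(predicted == correct)
        per_subject[subject][1] += 1

    total_correct = sum(c for c, _ in per_subject.values())
    total = sum(t for _, t in per_subject.values())
    micro = total_correct / total  # every question weighted equally
    macro = sum(c / t for c, t in per_subject.values()) / len(per_subject)  # every subject weighted equally
    return micro, macro

# Made-up answers for three subjects, purely illustrative.
sample = [
    ("college_physics", "B", "B"),
    ("college_physics", "A", "C"),
    ("professional_law", "D", "D"),
    ("professional_law", "D", "D"),
    ("professional_law", "A", "D"),
    ("elementary_mathematics", "C", "C"),
]
micro, macro = mmlu_accuracy(sample)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

The two averages diverge whenever subjects have different question counts, which is one reason the same model can show slightly different MMLU numbers on different leaderboards.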

What do these benchmarks mean?

What is the HumanEval benchmark?

HumanEval is a programming benchmark created by OpenAI to evaluate coding ability. It measures how accurately a model generates working Python functions from a function signature and docstring.

HumanEval consists of 164 hand-written programming problems. Models are scored with the pass@1 metric: the generated code is executed against automated unit tests, and a problem counts as solved only if a single sampled solution passes them all. High scores suggest stronger practical coding ability and syntactic precision, although short standalone functions are only a partial proxy for real software engineering or agentic coding work.
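The pass@1 metric is the k = 1 case of pass@k. A minimal sketch of the unbiased estimator commonly used for it is below: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function names and the sample counts are illustrative assumptions; the page does not state how many samples per problem were drawn for the scores above.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given n generated samples of which c passed the unit tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failing samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# With one sample per problem, pass@1 is simply the fraction solved.
one_shot = [(1, 1), (1, 0), (1, 1), (1, 0)]
print(benchmark_pass_at_k(one_shot, k=1))

# With 10 samples and 3 passing, pass@1 equals c / n.
print(pass_at_k(10, 3, 1))
```

Drawing more than one sample per problem and applying the estimator gives a lower-variance pass@1 than a single greedy run, at the cost of more generation.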

What is the MATH benchmark?

MATH measures mathematical problem-solving across seven subject areas drawn from high-school competition mathematics. It requires models to perform complex, multi-step symbolic reasoning rather than simple calculation.

MATH is difficult even for strong models. Its problems span prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. Unlike multiple-choice benchmarks, MATH requires the model to generate a final expression or numerical answer, which tests multi-step planning and logical precision.
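Since MATH answers are free-form, a grader has to pull the final answer out of the generated text before comparing it. The sketch below assumes the model wraps its final answer in `\boxed{...}`, the convention used by the dataset's reference solutions, and compares it to the reference after stripping spaces. The helper names are hypothetical, and practical graders apply much more normalization (equivalent fractions, units, LaTeX variants) than this exact-match check.

```python
def last_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed

def is_correct(model_output: str, reference: str) -> bool:
    """Exact match on the boxed answer, ignoring spaces (a simplification)."""
    predicted = last_boxed(model_output)
    if predicted is None:
        return False
    return predicted.replace(" ", "") == reference.replace(" ", "")

output = r"Adding the cases gives $\frac{1}{4} + \frac{1}{2}$, so the answer is $\boxed{\frac{3}{4}}$."
print(last_boxed(output))
print(is_correct(output, r"\frac{3}{4}"))
```

Brace counting matters here because answers such as fractions contain nested braces that a simple non-greedy regular expression would truncate.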

What is the MT-Bench benchmark?

MT-Bench is a multi-turn conversation benchmark. It evaluates how well models stay coherent, reason consistently, and follow instructions across successive dialogue turns.

MT-Bench tests eight categories of tasks, including coding, math, roleplay, and writing. A strong model such as GPT-4 acts as the judge and grades each response on a scale from 1 to 10. High performance indicates good instruction following and retention of conversational context from one turn to the next.
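Judge grades on the 1 to 10 scale are typically averaged into a single score, and breaking them out by category and by turn shows where a model loses points. The sketch below is a minimal illustration under assumed inputs: the `(category, turn, score)` tuple format and the sample grades are made up, and it does not call any judge model.

```python
from collections import defaultdict
from statistics import mean

def mt_bench_summary(judgments):
    """Aggregate judge scores.

    judgments: iterable of (category, turn, score), score in 1..10.
    This input format is an assumption made for the example.
    """
    by_category = defaultdict(list)
    by_turn = defaultdict(list)
    all_scores = []
    for category, turn, score in judgments:
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range: {score}")
        by_category[category].append(score)
        by_turn[turn].append(score)
        all_scores.append(score)
    return {
        "overall": mean(all_scores),
        "by_category": {k: mean(v) for k, v in by_category.items()},
        "by_turn": {k: mean(v) for k, v in by_turn.items()},
    }

# Made-up grades for three categories, two turns each.
sample_judgments = [
    ("coding", 1, 8), ("coding", 2, 6),
    ("math", 1, 9), ("math", 2, 5),
    ("writing", 1, 10), ("writing", 2, 10),
]
summary = mt_bench_summary(sample_judgments)
print(summary)
```

A gap between the first-turn and second-turn averages, as in this sample, is the signal that a model struggles to carry context or instructions into the follow-up.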

What is the GPQA benchmark?

GPQA (Graduate-Level Google-Proof Q&A Benchmark) tests advanced scientific understanding using questions written by PhD-level experts. The questions are specifically designed to be difficult to answer with a search engine.

GPQA contains physics, biology, and chemistry questions that even human experts find challenging. Non-experts with access to Google search score only around 34%, so a model needs advanced abstract reasoning and scientific understanding to do well. This makes GPQA a key benchmark for expert-level capability.

What is the HellaSwag benchmark?

HellaSwag evaluates common-sense reasoning and situational prediction. It tests whether a model can accurately determine the most likely next event in a described physical scenario.

HellaSwag was built with adversarial filtering, which selects scenarios that are easy for humans (who score ~95%) but difficult for language models. It requires contextual understanding of everyday physics, human intent, and spatial logic.