Model Evaluation Leaderboards
Track model rankings across standardized evaluations. Compare MMLU, HumanEval, and GPQA performance to identify the models whose capabilities suit your product.
| Rank | Model | Provider | Score | Change |
|---|---|---|---|---|
| 1 | GPT-5.5 Pro | OpenAI | 96.5% | — |
| 2 | Claude 4.8 Opus | Anthropic | 95.4% | — |
| 3 | Grok 4.20 | xAI | 94.7% | — |
| 4 | DeepSeek V4 Pro | DeepSeek | 94.5% | — |
| 5 | GPT-5.5 | OpenAI | 94.2% | — |
| 6 | Claude 4.7 Opus | Anthropic | 94.1% | — |
| 7 | Gemini 3.1 Pro | Google | 92.8% | — |
| 8 | Claude 4.5 Opus | Anthropic | 92.8% | — |
| 9 | Claude 4.6 Opus | Anthropic | 92.5% | — |
| 10 | Grok 4.3 | xAI | 92.4% | ↑ +1.5% |
| 11 | GPT-5 | OpenAI | 92.1% | — |
| 12 | OpenAI o1 | OpenAI | 91.8% | — |
| 13 | Llama 4 Maverick | Meta | 91.5% | — |
| 14 | DeepSeek R1 | DeepSeek | 90.8% | — |
| 15 | Gemini 3.5 Flash | Google | 90.5% | — |
| 16 | Claude Opus 4 | Anthropic | 90.5% | — |
| 17 | Claude 4.6 Sonnet | Anthropic | 90.4% | ↑ +0.8% |
| 18 | Gemini 2.5 Pro | Google | 89.9% | — |
| 19 | Claude 4.5 Sonnet | Anthropic | 89.5% | — |
| 20 | GPT-4o | OpenAI | 88.7% | ↑ +1.2% |
| 21 | Claude 3.5 Sonnet | Anthropic | 88.7% | — |
| 22 | DeepSeek V4 Flash | DeepSeek | 88.0% | — |
| 23 | Claude Sonnet 4 | Anthropic | 88.0% | — |
| 24 | OpenAI o3-mini | OpenAI | 87.9% | — |
| 25 | Llama 4 Scout | Meta | 87.2% | — |
| 26 | Gemini 3.1 Flash | Google | 86.8% | — |
| 27 | Mistral Large 3 | Mistral | 86.8% | — |
| 28 | Claude 3 Opus | Anthropic | 86.8% | — |
| 29 | GPT-5 Mini | OpenAI | 86.5% | — |
| 30 | GPT-4 Turbo | OpenAI | 86.4% | — |
| 31 | Llama 3.3 70B Instruct | Meta | 86.2% | — |
| 32 | Claude 4.5 Haiku | Anthropic | 84.8% | — |
| 33 | GPT-4o Mini | OpenAI | 82.0% | — |
| 34 | Mistral Small 3 | Mistral | 81.2% | — |
| 35 | Claude 3 Sonnet | Anthropic | 79.0% | — |
| 36 | Command R+ | Cohere | 75.7% | — |
| 37 | Claude 3.5 Haiku | Anthropic | 75.2% | — |
| 38 | Claude 3 Haiku | Anthropic | 75.2% | — |
| 39 | Llama 3.2 11B Vision | Meta | 73.0% | — |
| 40 | Command R | Cohere | 71.0% | — |
About MMLU
Massive Multitask Language Understanding (MMLU) measures general knowledge across 57 academic subjects, from elementary math to professional law. It is one of the most widely reported benchmarks for general knowledge and reasoning.
What do these benchmarks mean?
What is the MMLU benchmark?
MMLU consists of multiple-choice questions covering humanities, social sciences, STEM, and other professional contexts. It tests both world knowledge and problem-solving capability. Strong performance indicates a highly versatile model capable of handling diverse tasks without specialized fine-tuning.
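Because MMLU is multiple-choice, scoring reduces to counting matching answer letters. The sketch below is illustrative only: the record layout `(subject, predicted_letter, correct_letter)` and the function name `mmlu_accuracy` are assumptions, not part of any official harness. It reports both the overall accuracy and the unweighted mean of per-subject accuracies, since leaderboards differ in which of the two they publish.

```python
from collections import defaultdict


def mmlu_accuracy(records):
    """Score multiple-choice answers.

    records: iterable of (subject, predicted_letter, correct_letter).
    This tuple layout is a hypothetical format chosen for illustration.
    Returns (per_subject_accuracy, micro_average, macro_average).
    """
    counts = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, correct in records:
        is_match = predicted.strip().upper() == correct.strip().upper()
        counts[subject][0] += int(is_match)
        counts[subject][1] += 1

    per_subject = {s: c / t for s, (c, t) in counts.items()}
    total = sum(t for _, t in counts.values())
    total_correct = sum(c for c, _ in counts.values())

    # Micro average: every question weighs the same.
    micro = total_correct / total if total else 0.0
    # Macro average: every subject weighs the same.
    macro = sum(per_subject.values()) / len(per_subject) if per_subject else 0.0
    return per_subject, micro, macro


if __name__ == "__main__":
    demo = [("law", "A", "A"), ("law", "B", "C"), ("math", "d", "D")]
    print(mmlu_accuracy(demo))
```

The two averages can diverge when subjects have very different question counts, so it is worth checking which one a given score refers to before comparing models.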
What is the HumanEval benchmark?
HumanEval is a programming benchmark created by OpenAI to evaluate coding abilities. It measures the accuracy of models in generating functional Python code blocks based on docstring instructions.
HumanEval consists of 164 hand-written programming problems. Models are scored with the pass@1 metric: the generated code is executed against automated unit tests, and a problem counts as solved only if the first attempt passes all of them. High scores indicate strong function-level code generation and syntactic precision, which software engineering and agentic coding workloads build on.
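The pass@1 idea generalizes to pass@k, which is commonly estimated from `n` generated samples per problem, of which `c` pass the tests, as 1 - C(n-c, k) / C(n, k). The following is a minimal sketch of that estimator, assuming the per-problem counts have already been obtained by running the unit tests; the function names and the `(n, c)` input layout are illustrative choices, not a specific harness API.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: number of samples generated, c: samples that passed all tests,
    k: number of attempts allowed. Computes 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Three hypothetical problems, 10 samples each.
    print(benchmark_pass_at_k([(10, 10), (10, 3), (10, 0)], k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, which matches the "first try" reading of pass@1.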
What is the MATH benchmark?
MATH measures mathematical problem-solving skills across seven subjects drawn from high-school competition mathematics. It requires models to perform complex, multi-step symbolic reasoning rather than simple calculation.
MATH is exceptionally difficult for standard models. It covers prealgebra, algebra, intermediate algebra, precalculus, counting and probability, geometry, and number theory. Unlike multiple-choice benchmarks, MATH requires generating a final expression or numerical answer, testing a model's chain-of-thought planning and logical precision.
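Since answers are generated rather than selected, grading needs a step that pulls the final answer out of the model's text. The sketch below assumes the final answer is wrapped in a LaTeX `\boxed{...}` command and compares it to the reference after light normalization. Both that convention and the helper names are assumptions for illustration. Real graders usually apply more thorough equivalence checks (for example, treating `0.5` and `\frac{1}{2}` as equal), which this sketch does not attempt.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are matched by depth so nested commands such as
    \\boxed{\\frac{1}{2}} are returned whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces


def normalize(answer):
    """Light normalization only: drop spaces and a trailing period."""
    return answer.replace(" ", "").rstrip(".")


def is_correct(model_output, reference):
    """Exact match on the normalized final answer."""
    predicted = extract_boxed(model_output)
    return predicted is not None and normalize(predicted) == normalize(reference)


if __name__ == "__main__":
    print(is_correct(r"Thus the probability is \boxed{\frac{1}{2}}.", r"\frac{1}{2}"))
```

Exact string matching is deliberately strict here; it undercounts answers that are mathematically equivalent but written differently.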
What is the MT-Bench benchmark?
MT-Bench is a multi-turn conversation benchmark. It evaluates how well models maintain coherence, sound reasoning, and adherence to instructions across successive dialogue turns.
MT-Bench tests eight categories of tasks including coding, math, roleplay, and writing. A strong model such as GPT-4 is used as a judge to grade the responses on a scale from 1 to 10. High performance indicates strong instruction following and retention of conversational context from one turn to the next.
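Once the judge has produced its 1 to 10 grades, the benchmark score is an aggregate of those grades. The sketch below assumes the grades are already available as `(category, turn, score)` tuples, a hypothetical layout chosen for illustration, and makes no call to any judge model. It reports the overall mean alongside per-category and per-turn means.

```python
from collections import defaultdict
from statistics import mean


def summarize_judge_scores(ratings):
    """Aggregate judge grades.

    ratings: iterable of (category, turn, score) with score in 1..10.
    The tuple layout is an assumption made for this example.
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("ratings must not be empty")

    by_category = defaultdict(list)
    by_turn = defaultdict(list)
    for category, turn, score in ratings:
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range: {score}")
        by_category[category].append(score)
        by_turn[turn].append(score)

    return {
        "overall": mean(score for _, _, score in ratings),
        "per_category": {c: mean(v) for c, v in by_category.items()},
        "per_turn": {t: mean(v) for t, v in by_turn.items()},
    }


if __name__ == "__main__":
    demo = [("coding", 1, 8), ("coding", 2, 6), ("writing", 1, 9), ("writing", 2, 9)]
    print(summarize_judge_scores(demo))
```

Comparing the per-turn means is a quick way to see whether a model degrades on follow-up turns relative to its first response.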
What is the GPQA benchmark?
GPQA (Graduate-Level Google-Proof Q&A Benchmark) tests advanced scientific understanding using questions written by PhD-level experts. The questions are specifically written to be difficult to answer via search engines.
GPQA contains physics, biology, and chemistry questions that even human experts find challenging. Non-experts with access to Google search score only around 34%, so a model needs advanced abstract reasoning and scientific understanding to do well. This makes GPQA a key benchmark for expert-level capability.
What is the HellaSwag benchmark?
HellaSwag evaluates common-sense reasoning and situational prediction. It tests whether a model can accurately determine the most likely next event in a described physical scenario.
HellaSwag is designed using adversarial filtering to find scenarios that are easy for humans (who score ~95%) but difficult for language models. It requires deep contextual understanding of everyday physics, human intent, and spatial logic.
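One common way to score a next-event task like this is to have the model assign a log-probability to each candidate ending and pick the highest one. The sketch below assumes the per-token log-probabilities for each ending have already been obtained from a model; the function name `pick_ending` and the input layout are illustrative, and the text above does not state which scoring rule this leaderboard uses. Length normalization is optional because longer endings otherwise accumulate more negative totals.

```python
def pick_ending(token_logprobs_per_ending, normalize=True):
    """Return the index of the most likely ending.

    token_logprobs_per_ending: one list of per-token log-probabilities
    per candidate ending (hypothetical input layout).
    normalize: if True, compare mean log-probability per token instead
    of the raw sum, so longer endings are not penalized for length.
    """
    if not token_logprobs_per_ending:
        raise ValueError("at least one candidate ending is required")

    scores = []
    for logprobs in token_logprobs_per_ending:
        if not logprobs:
            raise ValueError("each ending needs at least one token")
        total = sum(logprobs)
        scores.append(total / len(logprobs) if normalize else total)

    # Index of the best-scoring candidate; ties go to the first one.
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    candidates = [[-1.0, -1.0, -1.0, -1.0], [-1.5, -1.5]]
    print(pick_ending(candidates, normalize=True))
    print(pick_ending(candidates, normalize=False))
```

The two calls in the usage example disagree on purpose: the choice of normalization can change which ending wins, which is one reason scores from different evaluation setups are not always directly comparable.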