MMLU and MMLU-Pro Benchmark Leaderboard 2026: What They Measure, Current Rankings, and Limitations
TokenMix Research Lab
MMLU Benchmark Leaderboard 2026: Scores, Rankings, and Why MMLU-Pro Is Replacing the Original
The MMLU leaderboard has been the default way to compare LLM intelligence since 2021. That era is ending. In April 2026, every frontier model scores between 89% and 92% on standard MMLU, making the benchmark nearly useless for differentiation. MMLU-Pro, with harder questions and 10 answer choices instead of 4, is rapidly becoming the replacement standard.
Current MMLU scores: [GPT-5.4](https://tokenmix.ai/blog/gpt-5-api-pricing) leads at approximately 92%, followed by Claude Opus 4.6 at 91%, Gemini 3.1 Pro at 90%, and [DeepSeek V4](https://tokenmix.ai/blog/deepseek-api-pricing) at 89%. On MMLU-Pro, the spread widens meaningfully: GPT-5.4 at ~78%, Claude Opus at ~76%, Gemini Pro at ~75%, DeepSeek V4 at ~74%.
This guide covers what MMLU and MMLU-Pro actually measure, the current leaderboard standings, why MMLU-Pro scores are more useful for model selection, known limitations of both benchmarks, and a practical framework for using MMLU data to choose the right model. All benchmark data tracked and verified by TokenMix.ai.
Table of Contents
- What Does the MMLU Benchmark Measure
- MMLU vs MMLU-Pro: What Changed and Why
- MMLU Leaderboard: Current Scores and Rankings (April 2026)
- MMLU-Pro Leaderboard: The New Standard
- Complete MMLU Benchmark Comparison Table
- Why MMLU Scores Are Saturated and What That Means
- Known Limitations of the MMLU Benchmark
- How to Use MMLU Scores to Choose a Model
- MMLU Scores vs Real-World Performance: What the Data Shows
- Decision Guide: When MMLU Matters and When It Does Not
- FAQ
---
What Does the MMLU Benchmark Measure
MMLU stands for Massive Multitask Language Understanding. It is a multiple-choice benchmark spanning 57 academic subjects, from abstract algebra to world religions. Each question has 4 answer choices. The benchmark was introduced by Dan Hendrycks et al. in 2021 to measure the breadth and depth of a language model's knowledge.
**The 57 subjects fall into four broad categories:**
| Category | Example Subjects | Question Count |
|----------|-----------------|----------------|
| STEM | Physics, Chemistry, Computer Science, Math | ~4,000 |
| Humanities | Philosophy, History, Law | ~3,500 |
| Social Sciences | Economics, Psychology, Sociology | ~3,000 |
| Other | Professional Medicine, Accounting, Business | ~3,500 |
Total: approximately 14,000 questions.
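If you want to inspect the questions yourself, the benchmark is easy to pull down locally. A minimal sketch, assuming the `cais/mmlu` mirror on the Hugging Face Hub (the config name and field names may differ in other mirrors):

```python
# Minimal sketch for inspecting MMLU locally.
# Assumes the `cais/mmlu` dataset on the Hugging Face Hub; field names may vary by mirror.
from collections import Counter
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")  # roughly 14,000 questions
print(len(mmlu))

example = mmlu[0]
print(example["question"], example["choices"], example["answer"])  # 4 choices, 0-3 answer index

# Questions per subject, to see how the 57 subjects are distributed
per_subject = Counter(row["subject"] for row in mmlu)
for subject, count in per_subject.most_common(5):
    print(subject, count)
```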
**What MMLU actually tests:** Factual recall, conceptual understanding, and basic reasoning across a wide range of domains. It answers the question: "How much does this model know about the world?"
**What MMLU does not test:** Complex multi-step reasoning, code generation, creative writing, instruction following, conversational ability, or any task that requires generating long-form outputs. MMLU is a pure knowledge breadth test in multiple-choice format.
This is important context for interpreting the MMLU leaderboard. A model scoring 92% on MMLU knows more facts across more domains than a model scoring 85%. But it does not necessarily write better code, follow instructions more reliably, or produce better analysis.
---
MMLU vs MMLU-Pro: What Changed and Why
MMLU-Pro was introduced because standard MMLU stopped working as a differentiator. When every top model clusters between 89% and 92%, the benchmark provides no actionable signal.
**Key differences between MMLU and MMLU-Pro:**
| Dimension | MMLU (Original) | MMLU-Pro |
|-----------|----------------|----------|
| Answer choices | 4 per question | 10 per question |
| Random guess baseline | 25% | 10% |
| Question difficulty | Undergraduate to early graduate | Advanced graduate to expert |
| Reasoning required | Often surface-level recall | Multi-step chain-of-thought |
| Score range (frontier models) | 89-92% | 74-78% |
| Discriminative power | Low (3-point spread at top) | Medium (4-point spread at top) |
| Contamination risk | High (public since 2021) | Lower (newer, less exposure) |
**Why the 10-choice format matters:** With 4 choices, a model that is only 60% sure of the answer can still get it right by eliminating 2 obviously wrong options and guessing between the remaining 2. With 10 choices, that same model has much lower odds of guessing correctly. The format change alone drops scores by 10-15 points and widens the gap between strong and weak models.
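A quick back-of-the-envelope calculation (our illustration, not part of the official benchmark) shows how much the guess floor shifts for a model that genuinely knows only part of the material:

```python
# Expected accuracy when a model truly knows a fraction `p_known` of answers
# and guesses uniformly among the remaining choices otherwise.
def expected_accuracy(p_known: float, n_choices: int, n_eliminated: int = 0) -> float:
    remaining = n_choices - n_eliminated
    return p_known + (1 - p_known) * (1 / remaining)

# A model that genuinely knows 60% of answers and can eliminate two obvious distractors:
print(expected_accuracy(0.60, 4, 2))    # ~0.80 on a 4-choice MMLU question
print(expected_accuracy(0.60, 10, 2))   # ~0.65 on a 10-choice MMLU-Pro question
```

The same underlying knowledge produces a roughly 15-point lower score once the guess floor drops, which is why MMLU-Pro separates models that standard MMLU lumps together.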
**Why harder questions matter:** Standard MMLU includes many questions that any well-trained model can answer through pattern matching. MMLU-Pro requires genuine multi-step reasoning. A question might require combining knowledge from two domains, performing a calculation, or following a chain of logical deductions.
The result: MMLU-Pro scores are more meaningful. A 4-point gap on MMLU-Pro represents a larger real capability difference than a 4-point gap on standard MMLU.
TokenMix.ai now displays both MMLU and MMLU-Pro scores on its leaderboard, with MMLU-Pro highlighted as the recommended metric for model comparison.
---
MMLU Leaderboard: Current Scores and Rankings (April 2026)
Standard MMLU scores for all major models, sourced from official reports and independent evaluations. Data verified by TokenMix.ai as of April 2026.
**Frontier tier (MMLU 88%+):**
| Rank | Model | MMLU Score | Release Date |
|------|-------|-----------|--------------|
| 1 | GPT-5.4 | ~92% | March 2026 |
| 2 | Claude Opus 4.6 | ~91% | February 2026 |
| 3 | Gemini 3.1 Pro | ~90% | March 2026 |
| 4 | DeepSeek V4 | ~89% | January 2026 |
| 5 | Llama 4 Maverick | ~88% | March 2026 |
**Mid-tier (MMLU 75-87%):**
| Rank | Model | MMLU Score |
|------|-------|-----------|
| 6 | Mistral Large 3 | ~85% |
| 7 | Qwen 3 Max | ~84% |
| 8 | Gemini 3.1 Flash | ~79% |
| 9 | GPT-5.4 Nano | ~77% |
| 10 | Mercury 2 | ~75% |
**Key observation:** The top 5 models are separated by only 4 percentage points. At this level of saturation, MMLU rank differences between frontier models are not statistically reliable for model selection. The confidence interval on these scores is roughly plus or minus 1 percentage point, meaning GPT-5.4 at 92% and Claude Opus at 91% are effectively tied.
---
MMLU-Pro Leaderboard: The New Standard
MMLU-Pro scores provide significantly better differentiation. Here are the current standings.
**MMLU-Pro leaderboard (April 2026):**
| Rank | Model | MMLU-Pro Score | Delta from MMLU |
|------|-------|---------------|-----------------|
| 1 | GPT-5.4 | ~78% | -14 points |
| 2 | Claude Opus 4.6 | ~76% | -15 points |
| 3 | Gemini 3.1 Pro | ~75% | -15 points |
| 4 | DeepSeek V4 | ~74% | -15 points |
| 5 | Llama 4 Maverick | ~72% | -16 points |
| 6 | Mistral Large 3 | ~68% | -17 points |
| 7 | Qwen 3 Max | ~67% | -17 points |
**The MMLU-to-MMLU-Pro drop is itself informative.** Models that drop fewer points from MMLU to MMLU-Pro (like GPT-5.4, dropping 14 points) demonstrate stronger genuine reasoning ability rather than reliance on surface-level pattern matching. Models with larger drops may have been disproportionately optimized for standard MMLU's 4-choice format.
The 4-point spread at the top of MMLU-Pro (78% vs 74%) is more meaningful than the 3-point spread at the top of standard MMLU (92% vs 89%) because MMLU-Pro's harder questions and 10-choice format reduce the impact of lucky guessing and test-taking heuristics.
---
Complete MMLU Benchmark Comparison Table
Full comparison of MMLU and MMLU-Pro scores alongside key context for model selection. Data from TokenMix.ai's model tracking platform.
| Model | MMLU | MMLU-Pro | Input $/1M Tokens | Output $/1M Tokens | Context Window | Best For |
|-------|------|----------|-------------------|-------------------|----------------|----------|
| GPT-5.4 | ~92% | ~78% | $5.00 | $15.00 | 256K | General knowledge, instruction following |
| Claude Opus 4.6 | ~91% | ~76% | $15.00 | $75.00 | 1M | Deep analysis, coding, long-context |
| Gemini 3.1 Pro | ~90% | ~75% | $3.50 | $10.50 | 2M | Multimodal, long-context, cost efficiency |
| DeepSeek V4 | ~89% | ~74% | $2.00 | $8.00 | 128K | Coding, math, cost-sensitive workloads |
| Llama 4 Maverick | ~88% | ~72% | $0.20 | $0.60 | 128K | Self-hosted, open-weight applications |
| Mistral Large 3 | ~85% | ~68% | $2.00 | $6.00 | 128K | European data residency, multilingual |
| Qwen 3 Max | ~84% | ~67% | $1.60 | $6.40 | 128K | Chinese language, Asian market applications |
**Observation:** MMLU-Pro scores correlate with pricing tier. Higher capability costs more. But the correlation is not perfect. DeepSeek V4 scores within 4 points of GPT-5.4 on MMLU-Pro while charging 60% less per input token and roughly half as much per output token. Gemini 3.1 Pro is 3 points behind GPT-5.4 at 30% lower cost.
TokenMix.ai's unified API lets you switch between any of these models with a single parameter change, making it practical to route different query types to different models based on their MMLU-Pro-relevant strengths.
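As an illustration of that routing pattern, here is a minimal sketch that assumes an OpenAI-compatible chat completions endpoint; the base URL and model identifiers below are placeholders, so check the TokenMix.ai documentation for the exact values:

```python
# Illustrative sketch only: assumes an OpenAI-compatible endpoint and
# hypothetical model identifiers; consult the TokenMix.ai docs for actual values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenmix.ai/v1",  # assumed base URL
    api_key="YOUR_TOKENMIX_KEY",
)

ROUTES = {
    "knowledge": "gpt-5.4",          # highest MMLU-Pro score in the table above
    "coding": "deepseek-v4",         # strong coding benchmarks at lower cost
    "long_context": "gemini-3.1-pro",
}

def ask(task_type: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=ROUTES[task_type],     # the single parameter that changes per route
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("knowledge", "In what year was the Treaty of Westphalia signed?"))
```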
---
Why MMLU Scores Are Saturated and What That Means
MMLU saturation is not just a theoretical concern. It has practical consequences for model evaluation and selection.
**The saturation math:** Standard MMLU has approximately 14,000 questions. At 92% accuracy, GPT-5.4 gets roughly 12,880 correct and 1,120 wrong. At 89%, DeepSeek V4 gets 12,460 correct and 1,540 wrong. The difference is about 420 questions out of 14,000.
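A few lines of arithmetic reproduce those counts (rounded to whole questions):

```python
# Reproducing the saturation arithmetic above with illustrative rounding.
n_questions = 14_000
for model, score in [("GPT-5.4", 0.92), ("DeepSeek V4", 0.89)]:
    correct = round(n_questions * score)
    print(f"{model}: {correct} correct, {n_questions - correct} wrong")

# Gap between the two models, expressed in absolute questions
print("difference:", round(n_questions * (0.92 - 0.89)), "questions out of", n_questions)
```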
But here is the problem: many of the questions GPT-5.4 gets right that DeepSeek V4 gets wrong are in obscure edge cases of specific academic domains. These are questions about niche legal precedents, rare historical events, or specialized medical terminology. For 95% of real-world applications, both models have more than sufficient knowledge.
**What saturation means for you:**
1. **Stop using standard MMLU as a primary ranking metric.** The differences between top models are within noise margins.
2. **Use MMLU-Pro instead.** The wider score spread and harder questions provide better signal.
3. **Look at domain-specific subscores.** If you care about medical knowledge, check MMLU medical subscores specifically. The aggregate score hides domain-level variation.
4. **Weight other benchmarks more heavily.** For coding, use SWE-bench. For reasoning, use GPQA. MMLU is a breadth metric. Depth metrics are more useful for most applications (a weighting sketch follows this list).
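As referenced in point 4, one hedged way to combine breadth and depth benchmarks into a single score is a weighted composite. The weights and the non-MMLU numbers below are placeholders you would replace with your own priorities and verified scores:

```python
# Illustration only: weights and non-MMLU scores are assumptions, not TokenMix recommendations.
# Combines breadth (MMLU-Pro) with depth benchmarks, weighted here for a coding-heavy workload.
WEIGHTS = {"mmlu_pro": 0.2, "swe_bench": 0.5, "gpqa": 0.3}

candidates = {
    # Scores normalized to 0-1. MMLU-Pro values come from the leaderboard above;
    # the SWE-bench and GPQA values are placeholders to fill in yourself.
    "gpt-5.4":         {"mmlu_pro": 0.78, "swe_bench": 0.70, "gpqa": 0.72},
    "claude-opus-4.6": {"mmlu_pro": 0.76, "swe_bench": 0.75, "gpqa": 0.70},
}

def composite(scores: dict) -> float:
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

for model, scores in sorted(candidates.items(), key=lambda kv: -composite(kv[1])):
    print(f"{model}: {composite(scores):.3f}")
```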
---
Known Limitations of the MMLU Benchmark
Beyond saturation, MMLU has structural limitations that every developer should understand.
**1. Multiple-choice format masks generation quality.** Real-world LLM usage involves generating text, not picking from predefined options. A model can be excellent at identifying the correct answer from a list while being mediocre at generating that answer from scratch. MMLU does not measure generation.
**2. Contamination risk is high.** MMLU questions have been public since 2021. While reputable labs take steps to avoid training on benchmark data, the widespread availability of MMLU questions means some degree of contamination is likely for models trained on large web crawls. This inflates scores in ways that do not reflect genuine capability.
**3. English-centric bias.** MMLU is exclusively English-language. It tells you nothing about a model's performance in Chinese, Japanese, Arabic, or any other language. For multilingual applications, MMLU scores are incomplete.
**4. Static knowledge, dynamic world.** MMLU questions are fixed. They test knowledge as of the benchmark's creation date. Models trained on more recent data do not get credit for knowing newer information, and models that memorized older data may score well on outdated questions.
**5. Aggregate score hides subject-level variance.** A model scoring 90% overall might score 98% on history and 72% on abstract algebra. If your application relies on math, the aggregate score is misleading.
**What TokenMix.ai does differently:** Our leaderboard breaks down MMLU scores by subject category (STEM, Humanities, Social Sciences, Other) in addition to the aggregate, giving you domain-specific signal for model selection.
---
How to Use MMLU Scores to Choose a Model
Despite its limitations, MMLU data is still useful when applied correctly. Here is a practical framework.
**Step 1: Use MMLU-Pro as the baseline, not standard MMLU.** For any model scoring 85%+ on standard MMLU, MMLU-Pro is the more informative metric. Only use standard MMLU to filter out models below the competence threshold.
**Step 2: Check subject-level scores for your domain.** If you are building a medical application, the aggregate MMLU score matters less than the clinical medicine, anatomy, and professional medicine subscores. TokenMix.ai provides subject-level breakdowns for all tracked models.
**Step 3: Cross-reference with task-specific benchmarks.** MMLU tells you about knowledge breadth. Pair it with:
- SWE-bench for coding ability
- GPQA for deep reasoning
- LMArena for user preference
- LiveCodeBench for code generation
**Step 4: Factor in cost per MMLU-Pro point.** This is an underused metric. Divide the model's output cost per million tokens by its MMLU-Pro score to get a rough "intelligence per dollar" ratio.
| Model | MMLU-Pro | Output $/1M | Cost per MMLU-Pro Point |
|-------|----------|-------------|------------------------|
| DeepSeek V4 | ~74% | $8.00 | $0.108 |
| Gemini 3.1 Pro | ~75% | $10.50 | $0.140 |
| GPT-5.4 | ~78% | $15.00 | $0.192 |
| Claude Opus 4.6 | ~76% | $75.00 | $0.987 |
DeepSeek V4 delivers the best MMLU-Pro performance per dollar. [Claude Opus 4.6](https://tokenmix.ai/blog/anthropic-api-pricing) is the most expensive per point but compensates with strengths on other benchmarks (coding, reasoning, long-context) that MMLU does not capture.
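The ratios in the table are straightforward to reproduce, using the prices and scores from the comparison table above:

```python
# Reproducing the "cost per MMLU-Pro point" ratios from the table above.
models = {
    "DeepSeek V4":     {"mmlu_pro": 74, "output_per_1m": 8.00},
    "Gemini 3.1 Pro":  {"mmlu_pro": 75, "output_per_1m": 10.50},
    "GPT-5.4":         {"mmlu_pro": 78, "output_per_1m": 15.00},
    "Claude Opus 4.6": {"mmlu_pro": 76, "output_per_1m": 75.00},
}

for name, m in models.items():
    ratio = m["output_per_1m"] / m["mmlu_pro"]
    print(f"{name}: ${ratio:.3f} per MMLU-Pro point")
```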
**Step 5: Run your own evaluation.** Take 100 questions representative of your actual use case. Test your top 2-3 candidates. The model that wins on your data is the right model, regardless of what any leaderboard says.
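A minimal harness for that kind of bring-your-own-questions evaluation, again assuming an OpenAI-compatible endpoint and hypothetical model identifiers, with deliberately naive exact-match grading you would adapt to your task format:

```python
# Sketch of a custom evaluation run. Endpoint, model IDs, and file name are assumptions;
# grading is naive substring matching, so adapt it to your actual answer format.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.tokenmix.ai/v1", api_key="YOUR_TOKENMIX_KEY")
CANDIDATES = ["gpt-5.4", "claude-opus-4.6", "deepseek-v4"]  # hypothetical model IDs

def evaluate(model: str, cases: list[dict]) -> float:
    correct = 0
    for case in cases:  # each case: {"prompt": ..., "expected": ...}
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        ).choices[0].message.content
        correct += case["expected"].strip().lower() in reply.lower()
    return correct / len(cases)

with open("my_eval_set.json") as f:   # your ~100 representative questions
    cases = json.load(f)

for model in CANDIDATES:
    print(model, f"{evaluate(model, cases):.1%}")
```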
---
MMLU Scores vs Real-World Performance: What the Data Shows
TokenMix.ai has tracked both MMLU benchmark scores and real-world usage data across thousands of API calls. Here are the patterns.
**Where MMLU predicts well:** Factual Q&A applications, trivia systems, educational tools, and any use case where the primary task is retrieving and presenting factual knowledge. Models with higher MMLU scores consistently perform better on these tasks in production.
**Where MMLU fails to predict:** Coding assistance, creative writing, multi-turn conversation, instruction following, and document analysis. For these tasks, MMLU score differences of 1-3 points have zero predictive value. A model scoring 89% on MMLU can outperform a model scoring 92% on code generation by a wide margin.
**The correlation breakdown by task type:**
| Task Type | MMLU Score Correlation | Better Benchmark |
|-----------|----------------------|------------------|
| Factual Q&A | High | MMLU-Pro |
| Knowledge chatbot | Medium-High | MMLU-Pro + LMArena |
| Code generation | Low | SWE-bench, LiveCodeBench |
| Creative writing | Very Low | LMArena |
| Data analysis | Medium | GPQA, MATH |
| Document summarization | Low | LMArena, custom eval |
The practical takeaway: MMLU is a necessary but not sufficient metric. Use it to eliminate clearly underqualified models (below 80% for knowledge-heavy tasks), but never use it as the sole selection criterion.
---
Decision Guide: When MMLU Matters and When It Does Not
| Your Use Case | Does MMLU Score Matter? | Recommended Model (April 2026) | Why |
|--------------|------------------------|-------------------------------|-----|
| General knowledge Q&A bot | Yes, strongly | GPT-5.4 | Highest MMLU and MMLU-Pro scores |
| Medical/legal knowledge system | Yes, check subscores | GPT-5.4 or Claude Opus 4.6 | Verify domain-specific MMLU subscores |
| AI coding assistant | No, use SWE-bench | Claude Opus 4.6 or DeepSeek V4 | MMLU does not predict coding ability |
| Cost-sensitive knowledge tasks | Yes, use cost-per-point | DeepSeek V4 | Best MMLU-Pro per dollar ratio |
| Multilingual application | No, MMLU is English-only | Gemini 3.1 Pro | Strongest multilingual performance |
| Educational content generation | Moderate | GPT-5.4 | Knowledge breadth matters, but generation quality matters more |
| Enterprise document processing | Low | Test 2-3 candidates directly | MMLU does not predict document task performance |
---
FAQ
What is the MMLU benchmark and what does it stand for?
MMLU stands for Massive Multitask Language Understanding. It is a benchmark consisting of approximately 14,000 multiple-choice questions across 57 academic subjects, designed to measure the breadth of a language model's knowledge. Introduced in 2021 by Dan Hendrycks and colleagues, it has been the most widely cited AI benchmark for general intelligence comparison.
What is the highest MMLU score in 2026?
As of April 2026, GPT-5.4 holds the highest MMLU score at approximately 92%. Claude Opus 4.6 follows at roughly 91%, Gemini 3.1 Pro at 90%, and DeepSeek V4 at 89%. However, these scores are so close together that the differences are not statistically meaningful for most model selection decisions. MMLU-Pro scores provide better differentiation.
What is the difference between MMLU and MMLU-Pro?
MMLU-Pro is a harder version of MMLU with 10 answer choices instead of 4, more difficult questions requiring multi-step reasoning, and a lower random guess baseline (10% vs 25%). Frontier model scores drop 14-16 points from MMLU to MMLU-Pro. The wider score spread on MMLU-Pro makes it more useful for comparing top models. TokenMix.ai recommends using MMLU-Pro as the primary knowledge benchmark.
Why are MMLU scores becoming less useful?
MMLU scores are saturated at the top. All frontier models score between 89% and 92%, a range too narrow for reliable differentiation given measurement uncertainty. Additionally, the benchmark has high contamination risk (questions publicly available since 2021), is English-only, and uses a multiple-choice format that does not reflect real-world generation tasks.
How should I use MMLU scores when choosing an AI model?
Use MMLU as a minimum qualification filter, not a ranking tool. Any model scoring 85%+ on standard MMLU has sufficient knowledge breadth for most tasks. For comparing top models, switch to MMLU-Pro scores. For coding tasks, ignore MMLU entirely and use SWE-bench. Always cross-reference with task-specific benchmarks and run your own evaluation on representative data from your actual use case.
Where can I find the latest MMLU leaderboard data?
TokenMix.ai maintains an up-to-date leaderboard with both MMLU and MMLU-Pro scores, broken down by subject category, at tokenmix.ai/leaderboard. Papers With Code tracks academic submissions. ArtificialAnalysis provides independent benchmark verification. For human preference data that complements MMLU, LMArena (Chatbot Arena) publishes real-time Elo ratings.
---
*Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: [TokenMix.ai Leaderboard](https://tokenmix.ai/leaderboard), [MMLU Benchmark (Hendrycks et al.)](https://arxiv.org/abs/2009.03300), [ArtificialAnalysis](https://artificialanalysis.ai), [Papers With Code](https://paperswithcode.com/dataset/mmlu)*