Research Background
Historically, AI progress was tracked on narrower benchmarks: ImageNet (identifying objects such as cats and dogs in images) or GLUE (basic sentence-level language understanding).
In 2020, researchers introduced MMLU (Massive Multitask Language Understanding), a test covering 57 subjects from math to law. It was designed to be "too hard" for AI.
By 2024, GPT-4 and Gemini Ultra were reporting scores at or near 90%, effectively saturating the benchmark.
Core Technical Explanation
The collapse of these benchmarks is an instance of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."
Contamination
Because models are trained on the entire internet, they have often "seen" the test questions during training. This is like a student memorizing the answer key rather than learning the subject.
Researchers try to detect this by checking for "n-gram overlap" between the test set and the training set, but models can memorize concepts even without exact text matches.
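As a rough illustration, this kind of contamination check reduces to set intersection over word n-grams. The sketch below is a minimal version of the idea, assuming a word-level 13-gram window and a tiny in-memory corpus for illustration; real pipelines run over tokenized, deduplicated training shards and use their own thresholds.

```python
# Minimal sketch of an n-gram-overlap contamination check.
# Assumptions for illustration: word-level 13-grams, a small in-memory corpus.
def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_item: str, training_docs: list, n: int = 13) -> bool:
    """Flag a test item if it shares any long n-gram with a training document."""
    item_grams = ngrams(test_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

# Usage: a scraped page that quotes the benchmark question verbatim is flagged.
question = "Which of the following best describes the role of the judiciary in a common law system?"
training_docs = [
    "Forum post: Which of the following best describes the role of the judiciary "
    "in a common law system? My professor asked this on the midterm.",
    "Unrelated article about cooking pasta at high altitude.",
]
print(is_contaminated(question, training_docs))  # True
```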
What the Data Shows
Standardized tests have lost their discriminating power.
| Model | MMLU Score | GSM8K (Math) | HumanEval (Code) | Real-World "Vibe" |
|---|---|---|---|---|
| GPT-4 | 86.4% | 92.0% | 67.0% | Excellent |
| Model X (Fine-Tuned) | 86.5% | 93.0% | 68.0% | Poor |
Limitations & Open Problems
1. Metric Gaming: Companies now optimize specifically for the leaderboard, so a rising score may reflect benchmark-specific tuning rather than a genuine gain in capability.
2. LMSYS Chatbot Arena: The industry has shifted to Elo-style ratings based on blind human preference (A/B testing). While effective, this approach is hard to reproduce, slow, and expensive: you cannot simply "compute" the score; you must wait for thousands of humans to vote. (A minimal sketch of the Elo update follows this list.)
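To make the Elo point concrete, here is a minimal sketch of how pairwise preference votes turn into ratings, assuming a standard Elo update with a K-factor of 32, a 1000-point starting rating, and a made-up vote log. It illustrates the rating mechanism only, not Chatbot Arena's exact methodology.

```python
# Minimal sketch of Elo-style ratings from blind pairwise votes.
# Assumptions for illustration: K = 32, starting rating 1000, toy vote log.
def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed outcome of one A/B vote."""
    gain = k * (1.0 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

# Toy vote log: (winner, loser) pairs from blind human comparisons.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
ratings = {m: 1000.0 for m in ("model_a", "model_b", "model_c")}
for winner, loser in votes:
    update(ratings, winner, loser)
print(ratings)
```

The practical drawback is visible in the loop: ratings only become meaningful after many thousands of such votes, which is why this style of evaluation is slow and expensive compared with scoring a fixed test set.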
Why This Matters
If we cannot measure intelligence, we cannot regulate it. "Safety" thresholds (e.g., "if the model scores above X, do not release it") are meaningless if the test is broken. We need private, dynamic evaluation sets that are never exposed to the public internet.
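One way to read "dynamic" here is that test items are generated fresh at evaluation time, so they cannot have been scraped into any training corpus. The sketch below is a toy version of that idea, assuming a simple arithmetic template and a stand-in model callable; the function names and the template are hypothetical illustrations, not a proposed benchmark.

```python
# Toy sketch of a dynamic evaluation set: items are generated at test time,
# so they cannot appear verbatim in any training corpus.
import random

def make_item(rng: random.Random):
    """Generate one fresh question/answer pair from a template."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return f"What is {a} * {b}?", a * b

def evaluate(answer_fn, n_items: int = 100, seed: int = 0) -> float:
    """Score a model callable on freshly generated items; returns accuracy."""
    rng = random.Random(seed)  # rotating, private seeds keep the set unseen
    correct = 0
    for _ in range(n_items):
        question, answer = make_item(rng)
        if answer_fn(question) == answer:
            correct += 1
    return correct / n_items

# Usage with a stand-in "model" that parses the operands and multiplies them.
def toy_model(question: str) -> int:
    a, _, b = question.removeprefix("What is ").rstrip("?").split()
    return int(a) * int(b)

print(evaluate(toy_model))  # 1.0 for the stand-in; a real model would be called via its API
```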
---
Verified by Global AI News Editorial Board. Sources: HELM (Stanford), Hugging Face Open LLM Leaderboard