ARC-AGI-3 Launches: Humans Score 100%, AI Scores Below 1%
ARC Prize launched ARC-AGI-3 on March 25, calling it the world's only benchmark that current AI cannot crack. The result is striking: humans score 100%, while the best frontier models, including top versions of GPT-5.4 and Grok, score below 0.3%.
What Makes It Different
Previous ARC-AGI benchmarks were eventually saturated by AI reasoning systems, which grew powerful enough to generalize across standard public/private test splits. ARC-AGI-3 addresses this directly. The public set contains only 25 demonstration games (down sharply from prior versions) and is explicitly no longer called a "training set." Over 100 additional games make up the private evaluation set.
The benchmark places agents in interactive, game-like environments with no instructions provided. To score, a model must explore the environment, build a world model, perceive patterns, and adapt its strategy on the fly, capabilities that go well beyond pattern matching or retrieval.
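To make the explore/model/adapt loop concrete, here is a minimal sketch in Python. It is not the ARC-AGI-3 interface; the `ToyGridEnv` class and its `observe`/`step` methods are hypothetical stand-ins for an environment that gives no instructions, only observations and a score signal. The agent records the transitions it sees (a crude "world model") and prefers actions whose known outcomes previously yielded reward, falling back to random exploration otherwise.

```python
import random

class ToyGridEnv:
    """Hypothetical stand-in for an instruction-free interactive environment.
    The agent only ever sees observations and a reward signal."""
    def __init__(self):
        self.pos = 0
        self.goal = 3  # hidden from the agent

    def observe(self):
        return self.pos

    def step(self, action):
        # action is -1 or +1; position is clamped to [0, goal]
        self.pos = max(0, min(self.goal, self.pos + action))
        return self.pos, (1.0 if self.pos == self.goal else 0.0)

def explore_and_adapt(env, steps=200, seed=0):
    """Explore, build a transition model, and adapt the policy on the fly."""
    rng = random.Random(seed)
    model = {}        # (state, action) -> observed next state
    rewarded = set()  # states that produced reward
    for _ in range(steps):
        s = env.observe()
        # exploit: pick an action whose remembered outcome was rewarded
        good = [a for a in (-1, 1) if model.get((s, a)) in rewarded]
        a = rng.choice(good) if good else rng.choice([-1, 1])
        s2, r = env.step(a)
        model[(s, a)] = s2  # update the world model
        if r > 0:
            rewarded.add(s2)
    return model, rewarded
```

The point of the sketch is the shape of the problem, not the toy environment: nothing tells the agent what actions mean or where reward lives, so everything it exploits must first be discovered through interaction.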
Why It Matters
Co-founders François Chollet (creator of Keras) and Mike Knoop (Zapier) argue that most AI benchmarks test what models already know, not how well they learn. ARC-AGI-3 is designed to measure the latter, and current scores reveal a massive gap.
The benchmark is available now on Kaggle with over $2 million in prizes for open-source breakthroughs. ARC Prize's position is clear: until a system closes that gap, AGI remains out of reach.