Ai2 Releases MolmoWeb: An Open-Source Web Agent That Outperforms GPT-4o-Based Agents at Browser Navigation
The Allen Institute for AI (Ai2) has released MolmoWeb, a fully open visual web agent that autonomously navigates browsers by interpreting screenshots — the same visual interface humans see — rather than relying on HTML or accessibility trees.
What It Does
MolmoWeb takes a natural-language task and a live webpage, then clicks, types, scrolls, and navigates to complete it. The system runs a simple loop: look at the screen, decide what to do, act. It comes in two sizes — 4B and 8B parameters — and is fully self-hostable, locally or on cloud services.
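To make that loop concrete, here is a minimal sketch assuming a Playwright-driven browser. The propose_action() call and the click/type/scroll/stop action schema are hypothetical placeholders for an inference call into a self-hosted model endpoint, not MolmoWeb's documented interface:

```python
from playwright.sync_api import sync_playwright

def propose_action(screenshot_png: bytes, task: str) -> dict:
    """Hypothetical stand-in for a MolmoWeb inference call: map the
    current screenshot plus the task to a single grounded action."""
    raise NotImplementedError("wire this to your self-hosted model endpoint")

def run_task(url: str, task: str, max_steps: int = 25) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        for _ in range(max_steps):
            shot = page.screenshot()             # look at the screen
            action = propose_action(shot, task)  # decide what to do
            if action["type"] == "click":        # act on the page
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            elif action["type"] == "scroll":
                page.mouse.wheel(0, action["dy"])
            elif action["type"] == "stop":       # model signals completion
                break
        browser.close()
```

Because the agent only consumes pixels and emits coordinate-level actions, the same loop works on any site the browser can render, with no site-specific HTML parsing.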
Why It Matters
The results are striking for models of this size. MolmoWeb-8B scores 78.2% on WebVoyager and outperforms much larger proprietary agents built on GPT-4o across all four benchmarks tested (WebVoyager, Online-Mind2Web, DeepShop, WebTailBench). With four parallel rollouts at test time, pass@4 on WebVoyager reaches 94.7%.
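For readers unfamiliar with the metric: pass@4 counts a task as solved if at least one of four independent rollouts succeeds. Below is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021) that benchmarks commonly use; the article does not specify Ai2's exact evaluation code, and the sample numbers are illustrative only:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    rollouts, drawn from n attempts with c successes, succeeds."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: a task with 2 successes in 8 recorded rollouts.
print(pass_at_k(n=8, c=2, k=4))  # ~0.786
# Averaging pass_at_k over all benchmark tasks yields the reported pass@4.
```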
Critically, it was trained without distilling from proprietary vision models — all training data comes from synthetic trajectories and human demonstrations, packaged as the open MolmoWebMix dataset (36K human task trajectories, 1.1K websites).
Full Stack Open
Unlike most web agent releases, Ai2 is shipping the full stack around the models: evaluation tools, the MolmoWebMix training dataset, and, soon, the training code. The model was announced March 24, 2026, and has already drawn significant developer attention as the first truly open foundation for web agents, analogous to what OLMo was for language models.