The Allen Institute for AI (Ai2) on Tuesday released MolmoWeb, an open-source visual web agent that can navigate a browser and complete tasks on a user's behalf, with fully open weights, training data, and evaluation tools included.

How It Works

MolmoWeb operates in a simple loop: look at the screen, decide what to do, act. Given a task instruction and a live webpage, the model interprets a screenshot, reasons step-by-step in plain English, then executes browser actions: clicking, typing, scrolling, switching tabs, or filling forms. Unlike agents that rely on HTML or accessibility trees, MolmoWeb works purely from screenshots, the same visual interface humans use.
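In code, that loop is compact. Here is a minimal sketch of the observe-reason-act cycle; the `Action` format, `model.decide`, and `browser` interfaces are assumptions for illustration, not MolmoWeb's actual API.

```python
# Minimal sketch of the screenshot -> reason -> act loop described above.
# Action format, model.decide, and the browser interface are all assumed
# for illustration; MolmoWeb's real interfaces may differ.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll", "switch_tab", "done"
    x: int = 0         # screen coordinates for pointer actions
    y: int = 0
    text: str = ""     # text payload for typing actions

def run_episode(model, browser, task: str, max_steps: int = 30):
    for _ in range(max_steps):
        screenshot = browser.screenshot()                 # 1. look at the screen
        thought, action = model.decide(task, screenshot)  # 2. reason in plain English, pick an action
        if action.kind == "done":
            return thought                                # model judges the task complete
        browser.execute(action)                           # 3. act: click/type/scroll/...
    return None                                           # step budget exhausted
```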

The agent is available in two sizes, 4B and 8B parameters, built on Ai2's Molmo 2 multimodal model family. It's designed for self-hosted deployment, locally or on cloud infrastructure, with no external API calls required.
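As a rough sense of what self-hosting could look like, here is a hedged sketch using the generic Hugging Face transformers multimodal pattern. The repo id `allenai/MolmoWeb-8B` is an assumed name, not a confirmed checkpoint, and the model's custom processor may expose a different call than the generic one shown.

```python
# Hypothetical local-inference sketch. The repo id is an assumption; check
# Ai2's release for the real checkpoint names. Everything runs locally,
# with no external API calls.
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

MODEL_ID = "allenai/MolmoWeb-8B"  # assumed name; a 4B variant would load the same way

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto"
)

# Generic transformers multimodal call; the model's bundled processor
# may differ in method name or arguments.
screenshot = Image.open("page.png")
inputs = processor(
    images=screenshot,
    text="Task: add this item to the cart. Next action?",
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```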

Benchmark Results

Ai2 claims MolmoWeb sets a new open-weight SOTA across four major web-agent benchmarks: WebVoyager, Online-Mind2Web, DeepShop, and WebTailBench. It beats OpenAI's Computer Use Agent (CUA) on three of the four. With four parallel inference attempts at test time, it outperforms single-attempt results from agents powered by GPT-5 and Gemini CU Preview.
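The parallel-attempt number reflects a simple best-of-n scheme: run several independent rollouts and count the task solved if any one succeeds. A sketch, reusing the hypothetical `run_episode` from earlier and an assumed success checker:

```python
# Best-of-n test-time scaling sketch: n independent attempts in parallel,
# task counts as solved if any attempt succeeds. run_episode and the
# task_succeeded() check are assumed helpers, not Ai2's evaluation code.
from concurrent.futures import ThreadPoolExecutor

def best_of_n(task: str, make_browser, model, n: int = 4) -> bool:
    def attempt(_):
        browser = make_browser()  # each attempt gets a fresh browser session
        result = run_episode(model, browser, task)
        return result is not None and browser.task_succeeded()  # assumed checker

    with ThreadPoolExecutor(max_workers=n) as pool:
        return any(pool.map(attempt, range(n)))
```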

The training data is also fully open. MolmoWebMix combines 150K+ web trajectories (including 30K+ human demonstrations collected via a custom Chrome extension) with 7M GUI grounding examples and 2.2M screenshot QA examples.
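The announcement doesn't spell out the record schema, but trajectory datasets of this kind typically pair each screenshot with the agent's reasoning and the action taken. A purely illustrative shape:

```python
# A plausible shape for one MolmoWebMix trajectory record. This schema is
# an illustration of common trajectory-data layouts, not Ai2's published format.
from dataclasses import dataclass, field

@dataclass
class Step:
    screenshot_path: str   # rendered page at this step
    reasoning: str         # plain-English chain of thought
    action: dict           # e.g. {"kind": "click", "x": 412, "y": 88}

@dataclass
class Trajectory:
    task: str              # natural-language instruction
    source: str            # e.g. "human_demo" (Chrome extension) or model-generated
    steps: list[Step] = field(default_factory=list)
```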

Why It Matters

Most capable web agents today are proprietary, with undisclosed training methods. Ai2 positions MolmoWeb as the open foundation the community needs, comparable to what OLMo was for language models. Training code is coming soon.