Positronic Robotics has launched PhAIL (Physical AI Leaderboard), the first benchmark designed to evaluate AI-driven robots using real industrial metrics rather than academic success rates.

Factory Floor, Not Lab Floor

Unlike traditional robotics benchmarks that measure task completion in controlled settings, PhAIL scores models on units per hour and mean time between failures, the same metrics factories use to evaluate human workers. The initial tests focus on bin-to-bin picking, a repetitive logistics task performed thousands of times daily in real warehouses.
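
To make the two metrics concrete, here is a minimal sketch of how units per hour and mean time between failures are conventionally computed from a set of trial logs. It is an illustration only: the article does not give PhAIL's exact formulas or log format, so the `Trial` record, its field names, and the sample numbers below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record; PhAIL's real telemetry schema is not described in the article.
    duration_s: float  # operating time of the trial, in seconds
    units_picked: int  # items successfully moved from bin to bin
    failures: int      # interruptions that required intervention


def units_per_hour(trials):
    """Total units picked divided by total operating hours."""
    total_hours = sum(t.duration_s for t in trials) / 3600.0
    if total_hours == 0:
        raise ValueError("no operating time recorded")
    return sum(t.units_picked for t in trials) / total_hours


def mean_time_between_failures_s(trials):
    """Total operating seconds divided by the number of failures."""
    total_failures = sum(t.failures for t in trials)
    if total_failures == 0:
        return float("inf")  # no failures observed in the sample
    return sum(t.duration_s for t in trials) / total_failures


# Made-up sample data: two 30-minute trials.
trials = [Trial(1800, 120, 2), Trial(1800, 90, 1)]
print(units_per_hour(trials))                # 210.0
print(mean_time_between_failures_s(trials))  # 1200.0
```

In this made-up sample, 210 units over one hour of operation gives 210 units per hour, and three failures over 3,600 seconds gives a mean of 1,200 seconds between failures. Pooling time and counts across trials, rather than averaging per-trial rates, keeps short trials from being overweighted.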

Each model runs on standardized hardware (a DROID-style Franka arm with Robotiq gripper), with every trial recorded alongside full telemetry data. The results are published openly on phail.ai.

AI Still Falls Short

Early results paint a sobering picture. Models from NVIDIA, Hugging Face, and other developers were tested against human operators and teleoperated robots. Across the board, current foundation models trail humans in both speed and reliability on this fundamental picking task.

"Physical AI needs to prove itself there first, and PhAIL is how we measure whether it can," said Sergey Arkhangelskiy, Positronic's founder.

Consortium Model

PhAIL is structured as a consortium rather than a proprietary platform. Cloud provider Nebius and data company Toloka are among the initial partners. The team plans to expand beyond picking tasks in Q2 2026, adding new robotic embodiments to reflect broader real-world deployments.

The benchmark arrives at a critical moment: global VC investment in AI hit $239 billion last quarter, with physical AI and robotics among the fastest-growing categories. PhAIL may help investors and operators separate genuine capability from hype.