Harvard Study Says OpenAI o1 Beat ER Doctors on Initial Triage Diagnoses
A team from Harvard Medical School and Beth Israel Deaconess Medical Center reports in a new paper published in Science that a large language model outperformed two attending physicians in a real-world emergency triage comparison.
What the study found
In the experiment highlighted by the paper, researchers reviewed 76 emergency room cases and compared diagnoses from two physicians with outputs from OpenAI's o1 model at several decision points. According to reporting on the study, o1 produced the exact or a very close diagnosis in 67.1% of initial triage cases, versus 55% and 50% for the two physicians. The gap narrowed as more patient information became available, but the model remained at or above the physician baselines reported for those stages.
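To put those percentages in concrete terms, the short Python sketch below converts each reported rate into an approximate number of cases out of 76. It assumes all 76 cases were scored at the initial triage stage and that the published rates are rounded; the labels physician_a and physician_b are placeholders for illustration, not names from the study, and the resulting counts are back-of-envelope estimates rather than figures taken from the paper.

```python
TOTAL_CASES = 76  # cases reviewed, per the reporting above

# Reported rates of exact or very close diagnoses at initial triage.
# "physician_a" and "physician_b" are placeholder labels (assumption).
reported_rates = {"o1": 0.671, "physician_a": 0.55, "physician_b": 0.50}


def implied_cases(rate, total=TOTAL_CASES):
    """Nearest whole number of cases consistent with a reported rate."""
    return round(rate * total)


implied = {name: implied_cases(rate) for name, rate in reported_rates.items()}

# Gap between the model and each physician, in percentage points.
gaps = {
    name: round((reported_rates["o1"] - rate) * 100, 1)
    for name, rate in reported_rates.items()
    if name != "o1"
}

for name, count in implied.items():
    print(f"{name}: about {count} of {TOTAL_CASES} cases")
for name, gap in gaps.items():
    print(f"o1 vs {name}: {gap} percentage points")
```

Under those assumptions, the model's lead at initial triage works out to roughly 9 to 13 cases out of 76, which helps explain why the authors frame the result as a reason for further testing rather than a settled conclusion.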
The broader study evaluated clinical reasoning tasks beyond triage, including management planning. Independent coverage from The Guardian said the model also scored well on longer-form care planning exercises, but the headline result is the early-stage triage performance, where clinicians have the least information and the least time.
Why it matters
The result is notable because it moves beyond benchmark exams and into messy patient-chart reasoning. But it is not a green light for autonomous diagnosis. Both the authors and outside experts said the work supports prospective clinical testing, not unsupervised deployment in hospitals.
That makes the study look less like a replacement story and more like evidence that frontier models may be approaching a useful second-opinion role in acute care settings.