In a Harvard study, AI provided more accurate emergency room diagnoses than two human doctors.

New research examines how large language models perform across a variety of medical scenarios, including real-world emergency room cases. In those cases, at least one model appeared more accurate than human doctors.

The study was published this week in Science and was conducted by a team of physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers ran a variety of experiments to measure how OpenAI’s models compared to human doctors.

In one experiment, the researchers focused on 76 patients who presented to the Beth Israel emergency department, comparing diagnoses provided by two primary care physicians with those generated by OpenAI’s o1 and 4o models. The diagnoses were then graded by two attending physicians who did not know which came from humans and which came from AI.

“At each diagnostic touchpoint, o1 performed nominally better than or equivalent to 2 primary care physicians and 4o,” the study said. “The differences were particularly noticeable at the first diagnostic encounter (initial ER triage), where there was the least information available about the patient and the greatest urgency to make the right decision.”

A press release from Harvard Medical School about the study emphasized that the researchers did “no preprocessing of the data”: the AI models were given the same information that was available in the electronic medical record at the time of each diagnosis.

With that information, the o1 model provided a “correct or very close diagnosis” in 67% of triaged cases, while one physician achieved an accurate or close diagnosis in 55% of cases and the other in 50%.

“We tested our AI model against almost every benchmark, and it outperformed both previous models and physician baselines,” Arjun Manrai, director of Harvard Medical School’s AI Lab and one of the study’s lead authors, said in a press release.

To be clear, the study did not claim that AI is ready to make actual life-or-death decisions in emergency rooms. Instead, the authors wrote that the results show “an urgent need for prospective trials to evaluate these technologies in real-world patient care settings.”

Additionally, the researchers noted that they only studied how the models perform when given text-based information, and that “existing research suggests that current foundation models have more limited reasoning for non-textual inputs.”

Beth Israel physician Adam Rodman, also one of the study’s lead authors, told the Guardian that “there is currently no formal framework for accountability” for AI diagnoses, and that patients still “want humans to guide them through life and death decisions and difficult treatment decisions.”

In a post about the study, emergency physician Kristen Panthagani said this was “an interesting AI study that has garnered some very exaggerated headlines,” especially because it compared AI diagnoses to those of internists rather than emergency room doctors.

“If you want to compare AI tools to the clinical capabilities of doctors, you first have to compare them to doctors who actually practice that specialty,” Panthagani said. “I wouldn’t be surprised if an LLM could beat a dermatologist on the neurosurgery board exam, [but] that’s not a particularly helpful thing to know.”

She added, “As an emergency room doctor seeing a patient for the first time, my primary goal is not to guess the final diagnosis. My main goal is to find out if you have a condition that could lead to death.”

This post and headline have been updated to reflect the fact that the study’s diagnoses came from primary care physicians and to include commentary from Kristen Panthagani.
