Patient Daily | Apr 16, 2026

Study finds generative AI models struggle with clinical reasoning in diagnosis

Generative artificial intelligence models, while increasingly used in health care, continue to face challenges in clinical reasoning during the diagnostic process, according to a study released by Mass General Brigham researchers from the MESH Incubator on Apr. 13.

The findings are significant because they highlight the limitations of large language models (LLMs) in navigating complex medical decision-making. While these AI systems are often praised for their accuracy once all information is available, their performance during the earlier stages of diagnosis remains a concern for clinicians and developers alike.

Researchers evaluated 21 different LLMs on a series of clinical scenarios. Although every tested model arrived at the correct final diagnosis more than 90% of the time when given complete patient data, all struggled to generate appropriate differential diagnoses early in the diagnostic process. The study introduced PrIME-LLM, a new measure of model competency across stages of care: proposing potential diagnoses, ordering tests, reaching a final diagnosis, and managing treatment. This metric revealed imbalances, with some models excelling at certain tasks while underperforming at others.

Lead author Arya Rao said: "By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor. These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information."

The study compared general-purpose LLMs, including the latest versions of ChatGPT alongside DeepSeek, Claude, Gemini, and Grok, on 29 published clinical cases. Most models grew more accurate as additional laboratory results or imaging data became available, and newer versions generally outperformed older ones. Even so, none generated an appropriate differential diagnosis more than 80% of the time.

Study author Marc Succi of the MESH Incubator said: "We want to help separate the hype from the reality of these tools as they apply to health care. Our results reinforce that large language models in healthcare continue to require a 'human in the loop' and very close oversight."
