Patient Daily | Apr 16, 2026

Study finds generative AI models struggle with clinical reasoning in diagnosis

Generative artificial intelligence models, while increasingly used in health care, continue to face challenges in clinical reasoning during the diagnostic process, according to a study released by Mass General Brigham researchers from the MESH Incubator on Apr. 13.

The findings are significant because they highlight the limitations of large language models (LLMs) in navigating complex medical decision-making. While these AI systems are often praised for their accuracy once all information is available, their performance during the earlier stages of diagnosis remains a concern for clinicians and developers alike.

Researchers evaluated 21 different LLMs on a series of clinical scenarios. Although every tested model arrived at the correct final diagnosis more than 90% of the time when given complete patient data, all struggled to generate appropriate differential diagnoses early in the diagnostic process. The study introduced PrIME-LLM, a new measure of model competency across stages of care: proposing potential diagnoses, ordering tests, reaching a final diagnosis, and managing treatment. This metric revealed imbalances, with some models excelling at certain tasks while underperforming at others.

Lead author Arya Rao said: "By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor. These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information."

The study compared general-purpose LLMs, including the latest versions of ChatGPT alongside DeepSeek, Claude, Gemini, and Grok, on 29 published clinical cases. Most models grew more accurate as additional laboratory results or imaging data became available, and newer versions generally outperformed older ones. Even so, none generated an appropriate differential diagnosis more than 80% of the time.

Study author Marc Succi of the MESH Incubator said: "We want to help separate the hype from the reality of these tools as they apply to health care. Our results reinforce that large language models in healthcare continue to require a 'human in the loop' and very close oversight."
