A study published Apr. 14 in JAMA Network Open reports that advanced artificial intelligence models, while often able to reach a final diagnosis, continue to struggle in key areas of clinical reasoning such as managing uncertainty and developing differential diagnoses.
The finding matters because large language models (LLMs) are increasingly being marketed for use in medical diagnostics and patient care. The study raises concerns about the reliability of these systems for unsupervised clinical decision-making, especially given their widespread adoption and the complexity of real-world medical cases.
Researchers evaluated the performance of 21 LLMs from leading developers including OpenAI, Anthropic, DeepSeek, Google DeepMind, and xAI. The assessment involved presenting each model with 29 standardized clinical vignettes from the January 2025 update of the Merck Sharp & Dohme Manual. Each vignette included detailed case information such as physical exam findings and laboratory results. The LLMs were tested across five domains: diagnostic testing, differential diagnosis, final diagnosis, management decisions, and miscellaneous reasoning tasks.
The results showed that while most LLMs performed well when making a final diagnosis or suggesting management steps, they consistently struggled to generate appropriate differential diagnoses and to determine which diagnostic tests to order next. The highest-performing models, such as Grok 4 by xAI, achieved greater overall accuracy but still displayed notable weaknesses outside of making final diagnoses.
To better evaluate longitudinal reasoning ability across domains, the researchers developed a new metric called PrIME-LLM (Proportional Index of Medical Evaluation for LLMs). This score revealed more pronounced differences between high-performing "reasoning-optimized" models, such as GPT-5 and Claude 4.5 Opus, and models not specifically optimized for clinical reasoning tasks.
Despite improvements in multimodal capabilities—including interpreting images like electrocardiograms or CT scans—the study found that off-the-shelf LLMs are not yet suitable for independent use in patient-facing settings without human oversight. According to the authors: "Overall, the PrIME-LLM framework provides an independent, extensible, and reproducible benchmark for tracking progress and guiding safe integration into healthcare practice. However, the findings also suggest that off-the-shelf LLMs are not yet ready for unsupervised patient-facing clinical decision-making."