A study led by Milan Toma, Ph.D., associate professor at New York Institute of Technology College of Osteopathic Medicine, found that large language models (LLMs) such as GPT-5 and Gemini 3 Pro may not be reliable for medical diagnosis, according to a Mar. 16 report. The research tested five advanced multimodal LLMs on their ability to interpret a CT brain scan with clear intracranial pathology.
The findings are significant as artificial intelligence continues to play a growing role in healthcare. While specialized AI algorithms are already assisting physicians by analyzing medical images and prioritizing urgent cases, the reliability of general-use AI platforms for clinical tasks remains uncertain.
In the study, each AI model was given the same CT brain scan and asked to identify key diagnostic features. All five models correctly recognized the image as a CT brain scan, and four identified an ischemic stroke near the left middle cerebral artery. However, one model made a critical error by misclassifying the stroke as a hemorrhage on the opposite side of the brain—a mistake that could have serious consequences in real-world settings due to differing treatments for these conditions. Even among those that reached the correct diagnosis, there were notable differences in their explanations regarding timing, alternative diagnoses, and affected regions.
The researchers also had each model grade the others' diagnostic explanations. This cross-evaluation revealed further inconsistencies: some models graded more harshly than others, and one interpreted the findings as chronic abnormalities rather than an acute stroke.
Toma said: "Our research highlights a critical distinction in the AI landscape. Most successful medical AI tools are task-specific algorithms, trained on large datasets of labeled medical images and validated for very specific diagnostic tasks. However, large language models are not optimized for diagnostics; they are built for linguistics and conversation. Accordingly, they generate explanations that sound authoritative, even when their underlying interpretation is wrong or inconsistent."
The authors conclude that while LLMs may assist with documentation or patient communication in healthcare settings, expert oversight remains essential for any diagnostic interpretation they provide.