Patient Daily | Mar 30, 2026

Study finds AI-generated X-rays can fool radiologists and chatbots

A new study published in the journal Radiology found that realistic artificial intelligence (AI)-generated X-rays were difficult for both radiologists and leading multimodal models to distinguish from authentic scans, according to a statement released on Mar. 25.

The findings highlight growing concerns about the potential misuse of deepfake technology in clinical imaging, as advances in generative AI make it easier to fabricate convincing medical images. The study points out that large language models such as GPT-4o and GPT-5 can now create anatomically plausible radiographs from plain-language prompts, lowering the technical barrier to producing fake medical images.

Researchers evaluated how well both human experts and AI systems could distinguish real radiographs from synthetic ones. Seventeen radiologists from six countries participated, representing a range of experience levels and specialties including musculoskeletal imaging, thoracic imaging, nuclear medicine, interventional radiology, general radiology, and body imaging. They assessed two sets of images: one containing chest, extremity, and spine X-rays generated by GPT-4o mixed with real scans, and another containing chest X-rays created by RoentGen, an organ-specific diffusion model, paired with authentic images.

The results showed that even experienced professionals struggled with the task. In a phase of the study in which radiologists were told that some images were synthetic, they achieved an average accuracy of 74.8% in detecting fakes, with a pooled sensitivity of 69.1% and a specificity of 80.4%. Clues radiologists commonly cited included uniform noise or graininess, unnaturally smooth bone edges or soft-tissue textures, symmetric vertebral alignment without typical anatomical irregularities, and unusually clean fracture lines.
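To see how those figures fit together, here is a minimal sketch in Python. The counts below are hypothetical, chosen only to approximate the reported rates; they are not the study's raw numbers.

    # Hypothetical counts chosen to roughly match the reported rates; these
    # are not the study's raw data. On a balanced set (equal numbers of real
    # and synthetic images), accuracy is the average of sensitivity and
    # specificity.
    synthetic_total = 100   # synthetic images shown (the "positives")
    real_total = 100        # authentic images shown (the "negatives")

    true_positives = 69     # synthetic images correctly flagged as fake
    true_negatives = 80     # authentic images correctly called real

    sensitivity = true_positives / synthetic_total   # 0.69, vs. reported 69.1%
    specificity = true_negatives / real_total        # 0.80, vs. reported 80.4%
    accuracy = (true_positives + true_negatives) / (synthetic_total + real_total)

    print(f"sensitivity={sensitivity:.1%}  specificity={specificity:.1%}  "
          f"accuracy={accuracy:.1%}")   # lands near the reported 74.8%

On a balanced set, accuracy is simply the average of sensitivity and specificity, which is why the three reported figures are mutually consistent.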

AI models performed similarly to, or slightly better than, the humans depending on the dataset: GPT-4o achieved up to 85.1% accuracy on its own generated images, while other tested models such as Llama 4 Maverick and Gemini 2.5 Pro performed substantially worse.

The authors noted several limitations, including balanced datasets that do not reflect real-world conditions and the exclusion of some obviously flawed synthetic images from testing. They cautioned that detection may be even harder outside controlled research settings, where synthetic images would likely be rare among a large volume of genuine ones.
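That caution can be made concrete with a short back-of-the-envelope calculation. In the sketch below, only the pooled sensitivity and specificity come from the study; the prevalence values are hypothetical.

    # Illustrative base-rate arithmetic, not a result from the study. It uses
    # the pooled sensitivity (69.1%) and specificity (80.4%) reported for the
    # radiologists; the prevalence values are hypothetical.
    sensitivity = 0.691   # chance a synthetic image is flagged as fake
    specificity = 0.804   # chance an authentic image is called real

    for prevalence in (0.5, 0.01, 0.001):  # balanced study vs. rare-fake settings
        true_alarms = sensitivity * prevalence               # fakes correctly flagged
        false_alarms = (1 - specificity) * (1 - prevalence)  # real images mislabeled
        ppv = true_alarms / (true_alarms + false_alarms)     # P(fake | flagged)
        print(f"prevalence {prevalence:>6.1%} -> "
              f"{ppv:.1%} of flagged images are actually fake")

Under these assumptions, at a one-in-a-thousand prevalence fewer than 1% of images flagged as fake would actually be synthetic, meaning nearly all alarms would be false ones.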

"In sum," researchers said in their report summary,"the moderate performance of radiologists and LLMs in identifying synthetic radiographs... underscores the potential for malicious use." They concluded that clinician education programs as well as mandatory watermarking or automated deepfake detection are needed "to prevent this novelty from becoming a systemic threat."
