Study finds half of chatbot medical answers are problematic or inaccurate

A study published Apr. 15 in BMJ Open reports that five popular chatbots provided a substantial amount of inaccurate and incomplete medical information, with half of their responses to evidence-based questions rated as "somewhat" or "highly" problematic.

The findings raise concerns about the risks of misinformation when people use generative artificial intelligence chatbots for health and medical queries. The researchers warn that continued deployment of these tools without public education and oversight could amplify the spread of false or misleading health advice.

The study evaluated Gemini (Google), DeepSeek (High-Flyer), Meta AI (Meta), ChatGPT (OpenAI), and Grok (xAI) by prompting each with questions related to cancer, vaccines, stem cells, nutrition, and athletic performance. These prompts were designed to resemble common online health inquiries as well as known misinformation tropes. Responses were assessed for accuracy, completeness, readability, and whether they presented a false balance between scientific consensus and non-scientific claims.

Results showed that 50% of the chatbot answers were problematic: 30% somewhat so and 20% highly so. Open-ended prompts led to more highly problematic responses than closed prompts. Among the chatbots tested, Grok produced the most highly problematic answers while Gemini had the fewest. The best performance was seen in vaccine- and cancer-related topics; performance was poorest for stem cells, athletic performance, and nutrition.

The researchers noted that all chatbot responses tended to be delivered confidently but rarely included caveats or disclaimers. Reference quality was low across all models due to hallucinated citations or incomplete reference lists. Readability scores indicated responses were generally difficult to understand without advanced education.

Although only five chatbots were studied, and commercial AI is evolving rapidly, the authors say their results expose key weaknesses: "Our findings regarding scientific accuracy, reference quality, and response readability highlight important behavioral limitations and the need to re-evaluate how AI chatbots are deployed in public-facing health and medical communication." They further explain that chatbots neither access real-time data nor reason through evidence as humans do: "This behavioral limitation means that chatbots can reproduce authoritative-sounding but potentially flawed responses." They conclude by calling for increased public education, professional training, and regulatory oversight as AI-powered tools become more widely used in healthcare.
