Artificial intelligence-powered chatbots answered everyday health-related questions from general users with about 76% accuracy, raising concerns about their trustworthiness in real-world client-facing applications, according to a study led by Penn State researchers on May 28.
The research team sought to understand how average individuals use AI for health-related concerns and how accurately AI responds to common medical queries. The findings suggest that AI tools may be more effective when used by trained physicians rather than patients, particularly in specialized areas such as neurology and dermatology. The results will be presented at the 2026 Association for Computing Machinery Fairness, Accountability and Transparency conference in Montreal from June 25-28.
To evaluate the accuracy and potential harm of large language model (LLM) responses for typical internet users, the researchers organized an AI competition called Diagnose-a-thon at Penn State. Thirty-four participants—including faculty, staff, undergraduates, and graduate students—submitted 212 prompts along with AI-generated responses to both real and hypothetical health concerns written from patient and doctor perspectives. Participants could choose between ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro or Llama3-8b for generating responses.
Lead author Bonam Mingole said, "One of the strengths of our study is we're essentially trying to replicate real-world usage of LLMs by telling participants to choose the LLM of their choice and use it as they would on a normal day. This type of participatory research is so important for understanding how the public uses AI in their daily life." Nine board-certified physicians then evaluated these responses using a six-point scale measuring accuracy and potential harm.
The results showed that, overall, 76.2% of LLM-generated answers were accurate. Specialties such as obstetrics/gynecology and otolaryngology saw higher validity scores and lower harm scores than internal medicine, neurology or dermatology. More specific prompts, especially those between 60 and 250 characters, tended to produce more accurate outputs.
Study co-author Jennifer Kraschnewski said, "We're entering a new age of healthcare, and AI is a significant part of it... There's a real opportunity for healthcare to transform... so that clinicians like myself can use them to improve patient care." However, researchers noted that error rates still exceeded 20%, about double those seen among human physicians, a level that could pose risks if the tools are used directly by patients without clinical oversight.