Patient Daily | Dec 8, 2025

Human expertise remains vital over AI for trustworthy medical systematic reviews

A new study published in Scientific Reports has found that human researchers continue to outperform large language models (LLMs) when preparing systematic literature reviews. The research compared the abilities of six different LLMs to those of human experts across several key tasks involved in producing a systematic review.

LLMs, such as GPT-4 and BERT, have been increasingly used in healthcare, education, and research for tasks like data annotation, content summarization, and report drafting. Since the launch of OpenAI’s ChatGPT in 2022, these AI systems have received significant attention for their versatility in generating text and analyzing information.

In this study, LLMs were tested on literature searches, article screening and selection, data extraction and analysis, and final manuscript drafting. Their results were measured against an original systematic review completed by human researchers on the same topic. The entire process was carried out in two rounds to assess whether LLM performance improved over time.

During the literature search and selection phase, Gemini was the most effective LLM but still only identified 13 out of 18 articles included by humans. Researchers noted that many LLMs struggle with access to scientific databases and often rely on training datasets that may lack sufficient original research articles.

Despite these challenges, LLMs demonstrated faster initial article extraction than humans. This suggests they could be useful for preliminary screening if supervised by experienced researchers.

For data extraction and analysis, DeepSeek performed best among the tested models with a 93% accuracy rate for correct entries from selected articles. However, three other LLMs required complex prompts and multiple uploads to achieve results comparable to humans, indicating lower efficiency.

When it came to drafting the final manuscript, none of the LLMs met expectations. The generated articles were well-structured but lacked depth and did not fully adhere to established templates for systematic reviews. Because the drafts nevertheless appear polished, they could mislead non-expert readers.

The authors emphasized: "Since systematic reviews and meta-analyses are considered the gold standard in evidence-based medicine, a critical evaluation of published literature by human experts is essential to guide clinical practice effectively."

The study also observed that modern LLMs cannot yet independently produce comprehensive medical systematic reviews without specialized prompting strategies. However, the improvements seen between test rounds suggest they can provide valuable support under supervision, and guided prompting methods may further enhance their performance on specific review tasks.

However, the authors caution that their findings are based on a single medical domain review as a reference point. They recommend future studies involving multiple systematic reviews across various fields to better assess generalizability.
