Ian Birkby, CEO at News-Medical | News-Medical
Patient Daily | Feb 23, 2026

AI-driven DeepRare system shows promise for faster rare disease diagnosis

A new artificial intelligence-based system, DeepRare, has been developed to help diagnose rare diseases more efficiently by integrating clinical, genetic, and phenotypic data. The framework aims to reduce the time patients spend searching for a diagnosis and offers clinicians transparent reasoning linked to supporting evidence.

DeepRare was described in a recent study published in Nature. The tool uses large language models (LLMs) and is structured with three main components: a central host powered by an LLM and memory bank, specialized agent servers for analytical tasks, and multiple data sources that provide diagnostic evidence from medical knowledge bases and scientific literature. DeepSeek-V3 serves as the default LLM for the central host.

The system processes various patient inputs such as genomic test results, free-text clinical descriptions, and Human Phenotype Ontology (HPO) terms. It coordinates agent servers to retrieve relevant evidence tailored to each patient’s data, generates initial diagnostic hypotheses, then performs self-reflection phases to validate or refute these hypotheses through further searches. If no hypothesis meets set criteria, the cycle repeats until a resolution is found. The output is a ranked list of candidate rare diseases with traceable reasoning chains linking each inference to its supporting evidence.
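The hypothesize-then-verify cycle described above can be illustrated with a minimal Python sketch. It is not DeepRare's actual code: the function names (`diagnose`, `toy_propose`, `toy_verify`), the confidence threshold, and the cap on rounds are all assumptions made for illustration, and the toy stubs stand in for the LLM host and agent servers.

```python
from typing import Callable, Dict, List, Tuple

# (disease name, confidence score, supporting evidence snippets)
Candidate = Tuple[str, float, List[str]]


def diagnose(
    patient: Dict,
    propose: Callable[[Dict, List[str]], List[str]],
    verify: Callable[[Dict, str], Tuple[float, List[str]]],
    threshold: float = 0.5,   # assumed acceptance criterion
    max_rounds: int = 3,      # assumed cap so the loop always terminates
    top_n: int = 5,
) -> List[Candidate]:
    """Propose hypotheses, check each one, and repeat if none is accepted."""
    evidence: List[str] = []
    accepted: List[Candidate] = []
    for _ in range(max_rounds):
        accepted = []
        for disease in propose(patient, evidence):
            score, support = verify(patient, disease)
            evidence.extend(support)  # keep evidence for the next round
            if score >= threshold:
                accepted.append((disease, score, support))
        if accepted:
            break
    # Rank the surviving candidates, each still paired with its evidence.
    accepted.sort(key=lambda candidate: candidate[1], reverse=True)
    return accepted[:top_n]


# Toy stand-ins (hypothetical) for the LLM host and the agent servers.
def toy_propose(patient: Dict, evidence: List[str]) -> List[str]:
    return ["Disease A"] if not evidence else ["Disease B", "Disease C"]


def toy_verify(patient: Dict, disease: str) -> Tuple[float, List[str]]:
    scores = {"Disease A": 0.2, "Disease B": 0.9, "Disease C": 0.6}
    return scores[disease], [f"note about {disease}"]


result = diagnose({"hpo_terms": ["HP:0001250"]}, toy_propose, toy_verify)
print([disease for disease, _, _ in result])  # ['Disease B', 'Disease C']
```

In this toy run the first round's only hypothesis is rejected, so the loop runs a second time using the evidence gathered so far and returns two ranked candidates with their supporting notes.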

Researchers evaluated DeepRare against several other tools, including general-purpose LLMs such as Claude-3.7-Sonnet and GPT-4o; reasoning-enhanced variants; medical-specific LLMs such as MMedS-Llama 3; bioinformatics tools such as PubCaseFinder; and other agentic systems, including MDAgents. Testing involved 6,401 clinical cases across nearly 3,000 diseases, drawn from public datasets such as the Deciphering Developmental Disorders study and RareBench as well as in-house datasets from Xinhua and Hunan hospitals in China.

For each case, DeepRare generated five ranked predictions. Performance was measured using Recall@K, the proportion of cases in which the correct diagnosis appears within the top K predictions. Recall@1 therefore measures how often the correct diagnosis is ranked first.
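As a rough illustration of how a Recall@K score is computed, the short Python sketch below counts the cases whose true diagnosis falls within the top K ranked predictions. The function name `recall_at_k` and the four toy cases are made up for this example and are not drawn from the study's data.

```python
from typing import List, Sequence


def recall_at_k(
    ranked_predictions: Sequence[Sequence[str]],
    true_diagnoses: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose true diagnosis is among the top-k predictions."""
    if len(ranked_predictions) != len(true_diagnoses):
        raise ValueError("each case needs exactly one true diagnosis")
    if not true_diagnoses:
        raise ValueError("at least one case is required")
    hits = sum(
        truth in predictions[:k]
        for predictions, truth in zip(ranked_predictions, true_diagnoses)
    )
    return hits / len(true_diagnoses)


# Toy data (hypothetical): one ranked list per case, best guess first.
predictions: List[List[str]] = [
    ["Marfan syndrome", "Loeys-Dietz syndrome", "Ehlers-Danlos syndrome"],
    ["Fabry disease", "Gaucher disease", "Pompe disease"],
    ["Rett syndrome", "Angelman syndrome", "Pitt-Hopkins syndrome"],
    ["Wilson disease", "Hemochromatosis", "Alpha-1 antitrypsin deficiency"],
]
truths = [
    "Marfan syndrome",        # ranked 1st
    "Gaucher disease",        # ranked 2nd
    "Pitt-Hopkins syndrome",  # ranked 3rd
    "Niemann-Pick disease",   # not predicted at all
]

print(recall_at_k(predictions, truths, 1))  # 0.25
print(recall_at_k(predictions, truths, 3))  # 0.75
```

Only one of the four toy cases has the correct diagnosis in first place, so Recall@1 is 25%, while three of the four have it somewhere in the top three, giving a Recall@3 of 75%.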

In analyses based on HPO terms alone, DeepRare achieved a Recall@1 of 57.18%, outperforming its closest competitor by nearly 24 percentage points. Its performance remained strong across different body systems and both common and rare disease types.

When compared directly with five expert rare disease specialists using identical HPO inputs—where clinicians could use search engines but not AI-based tools—DeepRare showed higher accuracy rates: "DeepRare achieved Recall@1 and Recall@5 rates of 64.4% and 78.5%, respectively, compared with specialists’ average Recall@1 of 54.6% and Recall@5 of 65.6%. These results suggest the system outperformed human experts under standardized benchmarking conditions."

Adding genetic data further improved outcomes: "Recall@1 increased from 33.3% to 63.6% in the Hunan dataset and from 39.9% to 69.1% in the Xinhua dataset." Compared with Exomiser—a tool that also integrates genetic information—DeepRare performed better on both datasets.

Testing different LLMs as central hosts revealed minimal impact on overall performance, indicating robustness in DeepRare’s architecture.

The authors noted that while retrospective evaluations show promise for DeepRare’s ability to generate accurate diagnoses supported by transparent reasoning chains—sometimes exceeding clinician performance—these findings are based on controlled studies rather than real-world deployment scenarios.

They concluded that future research may explore extending this approach toward treatment selection or prognosis prediction along with prospective validation in clinical settings.
