A research team funded by the National Institutes of Health (NIH) has developed a machine learning model called Merlin that can analyze 3D abdominal computed tomography (CT) scans for a range of clinical diagnostic tasks. The tool was designed to identify anatomical features, predict disease onset years in advance, and perform other complex assessments.
The team trained Merlin using a large dataset from Stanford University School of Medicine, which included more than 15,000 3D abdominal CT scans linked with radiology reports and nearly one million diagnostic codes. This collection is considered the largest of its kind for abdominal CT data.
Computed tomography is widely used in early medical evaluations, but interpreting these images often requires significant time from radiologists and may involve additional tests. With a shortage of physicians in the United States, this process can be delayed further.
"With Merlin, you could potentially go beyond traditional radiology and jump straight from imaging to a possible diagnosis. And that's just one potential use," said co-first author Louis Blankemeier, Ph.D., who conducted this work while at Stanford University.
Merlin belongs to a class of models known as foundation models. These are trained on large-scale datasets without specific labels and are capable of handling diverse information types.
The researchers evaluated Merlin across six categories covering more than 750 tasks related to diagnostics, prognostics, and quality assessment. They tested it on over 50,000 new CT scans from four hospitals to see how well its conclusions matched those made by human experts.
"Merlin tackled some tasks, such as predicting diagnosis codes, head-on, while other more complicated tasks, such as drafting radiology reports from scratch or identifying and outlining organs in a 3D space, called for additional training," said co-first author Ashwin Kumar, also a graduate student at Stanford University.
Specialized state-of-the-art models were used as benchmarks for comparison. Averaged across 692 diagnostic codes, Merlin correctly predicted which scan was associated with each code over 81% of the time, outperforming two other model variants. For certain codes, accuracy rose to 90%.
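One way to read a figure like "correctly predicted which scan was associated with each code over 81% of the time" is as a pairwise ranking accuracy: given one scan that carries a code and one that does not, how often does the model score the right one higher? The Python sketch below illustrates that reading, along with the averaging across codes. It is an assumption about how such a number might be computed, not the team's evaluation code, and the function names, code labels, and scores are hypothetical toy values.

```python
from itertools import product


def pairwise_accuracy(scores, labels):
    """Fraction of (positive, negative) scan pairs in which the scan that
    carries the code gets the higher score. Ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative scan")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def macro_average(scores_by_code, labels_by_code):
    """Unweighted mean of the per-code pairwise accuracies."""
    accs = [
        pairwise_accuracy(scores_by_code[code], labels_by_code[code])
        for code in scores_by_code
    ]
    return sum(accs) / len(accs)


# Hypothetical toy data: two made-up codes, four scans each.
# A label of 1 means the scan carries the code; scores are model outputs.
toy_scores = {
    "code_A": [0.9, 0.2, 0.7, 0.4],
    "code_B": [0.3, 0.8, 0.6, 0.1],
}
toy_labels = {
    "code_A": [1, 0, 1, 0],
    "code_B": [0, 1, 0, 1],
}

print(macro_average(toy_scores, toy_labels))  # 0.75
```

In this toy case the model ranks every pair correctly for one code and only half the pairs for the other, so the average comes out to 0.75. An average across 692 codes can likewise hide wide variation between individual codes.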
In another test area, Merlin was asked to predict chronic diseases such as diabetes or heart disease based only on CT scans from healthy patients. The model correctly identified individuals at higher risk of developing disease within five years about 75% of the time, compared with 68% for another model. According to Blankemeier, these results suggest that Merlin can detect subtle features not easily visible to humans and could help discover new biomarkers for disease.
Researchers also challenged Merlin with CT scans of the chest, a region it had not seen during training, and found it performed as well as or better than specialist models trained solely on chest images.
Despite being built as an all-purpose tool rather than for any single task, Merlin matched or outperformed specialist systems across all areas tested. The authors attribute this performance to both its design and its comprehensive training data, which allowed it to connect visual patterns with written information effectively.
Looking ahead, the team believes their approach could soon meet regulatory requirements for simpler applications while they continue working to improve Merlin’s capabilities in complex areas like report writing. They encourage others in the field to fine-tune the model on their own datasets to suit their needs.
"Our model and the data will provide the community a robust backbone to build upon," said senior author Akshay Chaudhari, Ph.D., professor at Stanford University. "From here, the sky's the limit."
This project received support through multiple NIH grants provided by NIBIB (R01EB002524; P41EB027060), MIDRC (contract 75N92020C00021), NHLBI (R01HL167974; R01HL169345), and NIAMS (R01AR077604; R01AR079431).