Researchers at the Arc Institute have developed an artificial intelligence model called Evo 2, trained on roughly 9 trillion DNA base pairs. The study, published in Nature, describes how Evo 2 uses a hybrid computational design to analyze long-range dependencies within genomes, spanning organisms from bacteria to humans.
Evo 2’s architecture, known as “StripedHyena 2,” allows it to process up to one million nucleotides at a time. This large context window lets the model examine complex interactions between distant parts of the genome that influence gene function—an area where traditional reductionist biology has struggled.
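Back-of-envelope arithmetic shows why a hybrid design matters at this scale. The figures below are illustrative only, not taken from the paper: dense self-attention scores every pair of positions, so its cost grows with the square of the sequence length, while an FFT-based long-convolution operator of the kind used in Hyena-style layers grows roughly as L log L.

```python
import math

L = 1_000_000  # nucleotide-level context window reported for Evo 2

# Dense attention builds an L x L score matrix (per head, per layer).
attention_ops = L ** 2

# An FFT-based long convolution scales as L * log2(L), up to constants.
conv_ops = L * math.log2(L)

print(f"dense attention : ~{attention_ops:.1e} ops")
print(f"long convolution: ~{conv_ops:.1e} ops")
print(f"ratio           : ~{attention_ops / conv_ops:,.0f}x")
```

At one million tokens the quadratic term is four to five orders of magnitude larger, which is the basic pressure that pushes long-context genomic models away from attention-only architectures.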
The researchers evaluated Evo 2 in two main areas: predicting whether specific genetic variants can cause disease or loss of function, and generating new synthetic DNA sequences. According to the study, “Encouragingly, when analyzing the breast cancer-linked BRCA1 gene, the model’s internal representations could be used to train a classifier that outperformed the base model’s zero-shot predictions (Area Under the Receiver Operating Characteristic [AUROC] = 0.95).” The model also outperformed others in predicting effects of complex mutations like insertions and deletions.
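The embeddings-plus-classifier evaluation above can be sketched in miniature. Everything here is synthetic and hypothetical: the random vectors stand in for Evo 2's internal representations of benign and pathogenic variants, the "classifier" is a minimal projection onto the difference of class means, and AUROC is computed from the rank-sum identity rather than a library call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-variant model embeddings:
# 200 benign and 200 pathogenic variants, 32-dim features,
# with the pathogenic class shifted along every dimension.
benign = rng.normal(0.0, 1.0, size=(200, 32))
pathogenic = rng.normal(0.0, 1.0, size=(200, 32)) + 0.8

X = np.vstack([benign, pathogenic])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Minimal linear classifier: score = projection onto the mean difference.
w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
scores = X @ w

def auroc(y_true, y_score):
    """AUROC via the rank-sum (Mann-Whitney U) identity."""
    order = y_score.argsort()
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

print(f"AUROC = {auroc(y, scores):.3f}")
```

The actual study trained a supervised classifier on real Evo 2 representations of BRCA1 variants; this sketch only shows the shape of that evaluation, not its data or model.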
Evo 2 was able to generate complete mitochondrial genomes, as well as chromosome-scale sequences resembling bacterial genomes and yeast chromosomes, that preserved natural biological structures in simulations. In experiments with mouse embryonic stem cells, DNA sequences designed with Evo 2 produced targeted chromatin accessibility patterns, including ones that encoded messages in Morse code. These designs were validated with chromatin accessibility assays, yielding AUROC values between 0.92 and 0.95.
Interpretability tools indicated that certain artificial neurons within Evo 2 learned to identify biological features without explicit supervision. For example, “Evo 2 generated candidate regulatory regions that showed a statistically significant enrichment of transcription factor motifs (P = 3.6 × 10⁻⁷), confirming the model was capturing biologically meaningful regulatory patterns rather than producing random sequences.”
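A motif-enrichment P value like the one quoted is typically obtained from a one-sided test on a 2×2 contingency table (motif present or absent, in generated versus control regions). The counts below are invented for illustration and are not the paper's data; the one-sided Fisher test is implemented directly as a hypergeometric tail sum so the sketch needs only the standard library.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P[X >= k] for a hypergeometric draw: population M,
    n total 'successes' (regions containing the motif), N draws."""
    upper = min(n, N)
    total = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return total / comb(M, N)

# Hypothetical counts: of 500 generated regions, 60 contain the motif;
# of 500 shuffled-control regions, only 20 do.
generated_with_motif = 60
control_with_motif = 20
n_generated = n_control = 500

M = n_generated + n_control                    # all regions
n = generated_with_motif + control_with_motif  # all motif-containing regions
p = hypergeom_sf(generated_with_motif, M, n, n_generated)

print(f"one-sided enrichment P = {p:.2e}")
```

A smaller P value here means the motif count in generated regions is unlikely under random assignment, which is the sense in which the quoted enrichment argues against the model producing random sequences.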
The dataset used for training—OpenGenome2—included about 8.8 trillion nucleotides from bacteria, archaea, eukaryotes, and bacteriophages but excluded viruses infecting eukaryotic hosts for biosafety reasons.
To encourage further research and application development, all Evo 2 model parameters, training code, and data have been released as open-source materials.
“Evo 2 represents a paradigm shift from analyzing isolated biological components to modeling the holistic complexity of genomes,” according to the research team. Its advancements suggest potential for programmable biology beyond current analytical approaches.