A team of researchers announced on April 13 that they had developed a new approach for generating large-scale datasets to train artificial intelligence models in protein engineering. The method, called Sequence Display, enables the creation of more than 10 million data points in a single experiment, giving AI systems the information they need to predict beneficial changes in proteins.
Protein engineering involves modifying amino acids within proteins to optimize their function. However, the vast number of possible combinations makes it impossible to test every option experimentally. For example, with 20 standard amino acids available at each position, a protein just 50 amino acids long has about 1.13 × 10^65 possible sequences (20^50), a number far beyond what is feasible for laboratory testing.
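The size of that search space can be checked with a few lines of Python. This is only an illustrative calculation, assuming the standard 20-letter amino acid alphabet; the function name `sequence_space` is made up for this sketch.

```python
AMINO_ACIDS = 20          # assumption: the 20 standard amino acids
PROTEIN_LENGTH = 50       # length used in the example above
DATA_POINTS = 10_000_000  # "more than 10 million data points" per experiment


def sequence_space(length: int, alphabet: int = AMINO_ACIDS) -> int:
    """Number of distinct sequences of the given length (alphabet ** length)."""
    return alphabet ** length


total = sequence_space(PROTEIN_LENGTH)
print(f"{total:.2e}")  # 1.13e+65

# Fraction of the full space that 10 million measurements would cover.
coverage = DATA_POINTS / total
print(f"{coverage:.1e}")
```

Even a 10-million-variant experiment samples a vanishingly small share of the possible sequences, which is why the measurements are used to train a model rather than to search exhaustively.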
The research team, which included collaborators from Johns Hopkins University and Microsoft, published its findings in Nature Biotechnology. Linqi Cheng, a Rice University graduate student and first author on the study, said: "We were able to develop an activity-based barcoding system that records the activity of individual protein variants and generates the kind of dataset needed to train a machine learning model." Cheng added: "Then the model was able to predict mutations that significantly improved the activity of the protein we were studying."
As a proof of concept, the researchers focused on a small CRISPR-Cas protein that is valued for its compact size but limited in the range of DNA sequences it can target. By mutating the DNA encoding this Cas9 protein and attaching unique barcodes that respond to activity levels, they could use next-generation sequencing to track which variants performed best.
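The general idea of reading out variant activity from barcoded sequencing reads can be sketched as follows. This is not the team's actual pipeline, which the article does not describe in detail. It is a toy illustration that assumes each read has already been decoded into a hypothetical `(variant_id, is_active)` pair, and it scores each variant by the fraction of its reads marked active.

```python
from collections import Counter


def rank_variants(reads):
    """Rank variants by the fraction of their reads flagged as active.

    `reads` is an iterable of (variant_id, is_active) pairs. Both fields are
    hypothetical stand-ins for information decoded from sequencing barcodes.
    """
    total = Counter()
    active = Counter()
    for variant_id, is_active in reads:
        total[variant_id] += 1
        if is_active:
            active[variant_id] += 1
    scores = {v: active[v] / total[v] for v in total}
    # Highest active fraction first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up reads for three variants, for illustration only.
reads = [
    ("A", True), ("A", True), ("A", False),
    ("B", False), ("B", True), ("B", False),
    ("C", False),
]
ranking = rank_variants(reads)
for variant_id, score in ranking:
    print(variant_id, round(score, 2))
```

Scores of this kind, collected across millions of variants, are the sort of labeled data a machine learning model could be trained on.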
Cheng said: "The AI is not replacing the experiment here. It instead depends on the experiment." He explained that Sequence Display provides foundational data while AI models help identify promising candidates among millions of possibilities.
Hanze Xiao, who led the project and is also a Cancer Prevention and Research Institute Scholar, said: "What this approach provides is a practical framework for integrating AI with protein engineering." Xiao continued: "Rather than relying on machine learning as a stand-alone solution, we couple it with an experimental platform that generates high-quality training data. This synergy enables more efficient discovery of advanced research tools and next-generation therapeutic proteins." The work received support from several organizations including SynthX Seed Award programs and national research foundations.