
AI models at IBM and DeepMind are pushing DNA toward a GPT era

admin, Database Expert
February 11, 2026
8 min read
#Artificial Intelligence #Life sciences

The human genome may be about to get its GPT moment. Artificial intelligence is changing how scientists read DNA, with new models scanning long genetic sequences to link patterns in the code to biological behavior, from gene regulation to disease risk. IBM researchers say these approaches could reshape drug discovery over time, in ways that echo how AI has altered modern software development.

The stakes are enormous. Google DeepMind recently published its AlphaGenome model, which takes up to one million DNA base pairs as input and predicts thousands of molecular properties across diverse biological processes, including chromatin accessibility, transcription factor binding and splice junction coordinates. In a study published in Nature, the DeepMind researchers found that the model outperformed existing tools in 22 of 24 variant effect prediction tasks, marking what they describe as a fundamental shift in how scientists can interrogate the regulatory code embedded in non-coding DNA.

For pharmaceutical companies and biotech firms, the promise is immense: faster identification of disease-causing mutations, more precise drug targeting and the ability to design experiments guided by computational predictions rather than brute-force screening. IBM Research has been developing its own suite of biomedical foundation models to tackle complementary challenges in drug discovery, with applications ranging from antibody design to small-molecule property prediction, part of a broader industry movement to apply large-scale AI to biological data.

AlphaGenome distinguishes itself from earlier genomic models by learning from multiple types of biological measurements simultaneously. “What I found most novel about AlphaGenome was its multimodal nature,” Mark Gerstein, the Albert L. Williams Professor of Biomedical Informatics at Yale University, who was not involved in the research, told IBM Think in an interview.
“The fact that it is trained on data from many different genomic modalities—for instance, RNA-seq, ATAC-seq and Hi-C—and predicts effects across these modalities is particularly notable.”

Gerstein said AlphaGenome stands out because it tries to predict multiple genomic signals simultaneously and treats them as connected rather than independent. Changes in chromatin state upstream, for example, can shape gene expression downstream—and models have long recognized those links. What’s new, in his view, is the scale at which AlphaGenome tries to fold those relationships directly into sequence-to-function prediction.

He also highlighted how much DNA the model can “see” in one pass. The window, he said, is unusually large, on the order of a megabase. It’s a span big enough to capture regulatory effects that can sit far from the genes they influence.

The human genome contains roughly three billion base pairs, but only about two percent of them encode proteins. The remaining 98 percent orchestrates when, where and how much of each protein gets made. Small variations in this regulatory machinery can profoundly alter an organism’s response to its environment or susceptibility to disease. But until recently, deciphering exactly how these sequences work at the molecular level has remained one of biology’s most stubborn puzzles.

Before AlphaGenome, researchers often had to make a compromise: scan a long region of DNA, but lose fine detail, or zoom in tightly and miss the long-range signals that matter in regulation. In a blog post announcing the model, DeepMind said it designed AlphaGenome to avoid that choice. The company described an architecture that uses convolutional layers to capture short DNA motifs, transformers to share information across the entire sequence and additional layers to translate those detected patterns into predictions across multiple biological readouts.

DeepMind also emphasized how quickly the system can be trained. The company said training took about four hours and used roughly half the compute budget of its earlier Enformer model, an efficiency gain it highlighted as notable given AlphaGenome’s expanded scope.

AlphaGenome arrives in the wake of AlphaFold, the protein-structure system that helped convince the world that AI could tackle parts of biology once thought too complex to model directly. But DNA is a different kind of challenge. A change in sequence does not simply alter a static structure. It can ripple through regulation, shifting when a gene turns on, how much RNA gets made, how much protein is produced and how a cell reacts to signals from its environment.

Most genomics tools are built to handle that complexity in slices: one method to find protein-coding regions, another to interpret variants, another to estimate disease risk and another to support clinical decisions.
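AlphaGenome’s actual implementation is not public here, but the general pattern DeepMind describes can be illustrated with a minimal NumPy sketch: convolutional filters that pick up short motifs, a self-attention step so every position can see the whole window, and a final projection to several predicted readout “tracks.” All function names, filter counts and dimensions below are illustrative, not taken from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix."""
    m = np.zeros((len(seq), 4))
    m[np.arange(len(seq)), [BASES.index(b) for b in seq]] = 1.0
    return m

def conv_motifs(x, kernels):
    """Motif detection: slide (width, 4) filters over the sequence.
    x: (L, 4), kernels: (n_filters, width, 4) -> (L - width + 1, n_filters)."""
    k = kernels.shape[1]
    windows = np.stack([x[i:i + k] for i in range(len(x) - k + 1)])
    return np.maximum(np.einsum("lkc,fkc->lf", windows, kernels), 0.0)  # ReLU

def self_attention(h, Wq, Wk, Wv):
    """One attention head: every position attends to the full window,
    the mechanism that lets long-range regulatory context be shared."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(q.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

seq = "ACGTGCGTACGTTAGCACGT" * 5          # 100 bp toy "genome window"
x = one_hot(seq)                           # (100, 4)
kernels = rng.normal(size=(8, 6, 4))       # 8 filters of width 6
h = conv_motifs(x, kernels)                # (95, 8) motif activations
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
ctx = self_attention(h, Wq, Wk, Wv)        # (95, 8) context-mixed features
W_out = rng.normal(size=(8, 3))
tracks = ctx @ W_out                       # (95, 3): three mock readout tracks
print(x.shape, h.shape, tracks.shape)
```

A real system stacks many such layers, pools to multiple resolutions and trains the weights on experimental data; the sketch only shows how the convolution and attention stages fit together.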
AlphaGenome is designed to bring many of those steps into a single framework, rather than forcing researchers to stitch together separate models.

AlphaGenome is trained on an enormous archive of molecular biology experiments generated over decades of research, many of them produced by publicly funded consortia. DeepMind has described using large public datasets that measure how sequence and variation relate to signals such as RNA output and transcription factor binding in human and mouse cells. By learning from these experimental patterns, the model claims to identify not only the stretches of DNA that encode genes, but also the regulatory sequences that control when genes turn on, where they turn on and how strongly.

In DeepMind’s description, when researchers give the system a DNA sequence up to one million base pairs long, AlphaGenome can predict gene-related features across different cell types, including signals related to transcription and aspects of RNA processing, and how those outputs change when the sequence is altered.
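That last step, scoring how outputs change when the sequence is altered, follows a simple pattern: predict readouts for the reference sequence, predict them again for the altered sequence, and difference the two. The sketch below uses a stand-in `mock_model` that just scores base composition; it is a placeholder for a trained sequence-to-function model, and only the differencing workflow is the point.

```python
def mock_model(seq):
    """Placeholder for a trained sequence-to-function model: returns toy
    per-track scores derived from GC content, nothing biologically real."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return {"expression": gc, "accessibility": 1.0 - gc}

def variant_effect(ref_seq, pos, alt_base):
    """Score a single-nucleotide variant by re-running the model on the
    altered sequence and differencing each predicted track (alt - ref)."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_pred, alt_pred = mock_model(ref_seq), mock_model(alt_seq)
    return {track: alt_pred[track] - ref_pred[track] for track in ref_pred}

ref = "ATGCGTACCGGTATATAGCC"               # 20 bp toy reference
effect = variant_effect(ref, pos=4, alt_base="A")  # G -> A at position 4
print(effect)                              # per-track deltas, alt minus ref
```

With a real model the dictionary would hold thousands of tracks across cell types, but the in-silico mutagenesis loop, edit the sequence, re-predict, compare, is the same shape.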

DeepMind is building a single system meant to read regulatory DNA as a unified code. IBM’s approach centers on decomposing biological questions into well-defined tasks, with models optimized for the mathematical and biological structure of each domain.

“Our work on Biomedical Foundation Models (BMFM) takes a more practical, modular approach,” said Michal Rosen-Zvi, Director of AI for Healthcare and Life Sciences at IBM Research, in an interview with IBM Think. “We decompose complex biological questions into well-defined components and identify the mathematical and algorithmic innovations required for the specific tasks at hand.”

Based on this analysis, IBM develops specialized models tailored to distinct domains, including RNA transcriptomics, DNA sequence analysis, and small-molecule and protein representation, according to Rosen-Zvi. “Each model is designed to optimally capture the modalities most relevant to its domain, whether that is primary sequence, two-dimensional structure, three-dimensional conformation or, in the case of our RNA models, mathematical representations that more faithfully capture whole‑genome expression at the cellular level,” she said.

Rosen-Zvi said IBM’s DNA work tries to avoid treating the genome as a single “standard” sequence. “Importantly, in our DNA models we explicitly incorporate population-level variation, training not only on reference sequences but also on SNPs and other mutable sites,” she said. That design, Rosen-Zvi explained, lets the models learn evolutionary and functional signals that a single reference genome can’t capture—signals that might otherwise require training on many thousands of whole genomes to approximate.

Rosen-Zvi framed biomedical foundation models as tools that are both powerful and workable in practice.
“Overall, the BMFM approach emphasizes efficient training and inference and is particularly well suited to problems where the underlying biology spans multiple layers of information, abstraction and observation,” she said. In her view, that’s exactly the terrain scientists have to cross when they try to explain disease, pinpoint drug targets, propose mechanisms of action, generate candidate compounds and predict which ones are worth pursuing.

IBM has been focusing its recent modeling work on two areas of drug development that tend to consume time and money: biologics and small molecules. She pointed to IBM’s MAMMAL, which is designed to predict antibody-antigen binding strength. She also highlighted IBM’s MMELON, which she said has performed well at predicting the therapeutic properties of small-molecule candidates, an early readout that can help teams decide what’s worth pursuing before lab work begins.

A new IBM paper, co-authored with the Cleveland Clinic, offers a clearer look at how MMELON works. It describes a “multi-view” method for representing molecules, which IBM Research has presented in the paper as a case for domain-specific foundation models in biomedicine. The project grew out of IBM’s Discovery Accelerator Partnership with the Cleveland Clinic, a collaboration the two organizations have described as using AI and quantum computing to speed biomedical discovery.

IBM Research is also plugged into a much bigger data-building effort. It recently joined LIGAND-AI, a consortium announced in January 2026 that aims to generate open, high-quality datasets of protein-ligand interactions. The project announcement said the consortium, led by Pfizer and the Structural Genomics Consortium, includes 18 partners across nine countries. Organizers said the initiative has a budget of more than 60 million euros and will probe thousands of proteins tied to both existing treatments and major unmet needs, including rare diseases, neurological conditions and cancer.
The Structural Genomics Consortium said the project plans to generate billions of data points using complementary screening technologies, creating a resource that researchers worldwide can use to train and benchmark AI systems that predict molecular interactions.

The market for AI in biotechnology is expanding rapidly. Precedence Research projects continued double-digit growth globally, with estimates pointing to a market exceeding USD 25 billion by the mid-2030s, according to a January 2026 analysis by Ardigen. The US market alone was approximately USD 2.1 billion in 2025, with growth driven primarily by adoption in drug discovery, genomics and precision medicine, the analysis stated.
