I have spent a significant portion of my career building genome-scale metabolic models. I built them for Halomonas elongata during my PhD, trying to understand how a halophilic bacterium partitions carbon between growth and bioplastic production. I built them at the Great Lakes Bioenergy Research Center for Novosphingobium, trying to optimize the design-build-test-learn cycle for biofuel synthesis. I have lived inside the stoichiometric matrices, the flux balance equations, the gene-protein-reaction mappings, and the painful gap between what these models promise and what they actually deliver.
And I am now convinced that genome-scale metabolic models are about to undergo the most significant transformation since their invention. Not because the models themselves are changing, although they are. But because the computational ecosystem around them is being rebuilt from the ground up by foundation models, AI agents, and a fundamentally new approach to biological data. The question is not whether GEMs will remain relevant. It is what they become when they are no longer standalone tools but components in a larger intelligence system.
This post is my attempt to think through that question seriously, drawing on both the technical realities I work with daily and the trajectory I see forming across the field.
What GEMs Actually Are (And What They Are Not)
A genome-scale metabolic model is a mathematical representation of all known metabolic reactions in an organism, derived from its genome annotation. The genome tells you which enzymes the organism can produce. The enzymes tell you which reactions are possible. The reactions define a network of metabolite transformations. Flux balance analysis (FBA) then solves a linear programming problem over that network: given an objective function (typically maximizing growth or a target product) and a set of constraints (nutrient uptake limits, thermodynamic feasibility), what distribution of metabolic fluxes is optimal?
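The linear program above can be made concrete with a toy network. This is a minimal sketch using scipy's LP solver; the three-reaction network (uptake, conversion, biomass drain) and all numbers are invented for illustration, standing in for a reconstruction with thousands of reactions:

```python
# Minimal FBA sketch: maximize c.v subject to S v = 0, lb <= v <= ub.
# The 2-metabolite, 3-reaction network here is illustrative only.
import numpy as np
from scipy.optimize import linprog

# Stoichiometric matrix S (rows = metabolites A, B; columns = reactions).
#            uptake  A->B  biomass
S = np.array([[ 1,   -1,    0],   # metabolite A: produced by uptake, consumed by A->B
              [ 0,    1,   -1]])  # metabolite B: produced by A->B, drained by biomass

bounds = [(0, 10),      # substrate uptake capped at 10 mmol/gDW/h
          (0, 1000),    # internal conversion, effectively unbounded
          (0, 1000)]    # biomass drain (the objective reaction)

c = np.array([0, 0, -1])  # linprog minimizes, so negate to maximize biomass

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
print(res.x)      # optimal flux distribution
print(-res.fun)   # maximal biomass flux, limited here by the uptake bound
```

The LP structure is identical to what COBRApy solves for real models; only the dimensions change.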
This framework is powerful because it requires relatively little data. You need a genome sequence, a reasonably curated set of metabolic reactions (from databases like KEGG, MetaCyc, or BiGG), and some experimentally measured boundary conditions (growth media composition, oxygen availability). You do not need kinetic rate constants for every enzyme. You do not need protein concentrations. You do not need to know the cell's regulatory logic. You can predict growth rates, metabolic byproducts, gene essentiality, and optimal engineering targets from stoichiometry and mass balance alone.
This is also, of course, the fundamental limitation. FBA solves for a steady-state optimum under the assumption that the cell is trying to maximize some objective. In reality, cells are not solving linear programs. They are executing a regulatory program that was shaped by evolution, and that program does not always maximize growth. It hedges bets, maintains futile cycles, expresses genes that are not needed yet, and allocates resources to stress responses even when no stress is present. The gap between the FBA optimum and actual cellular behavior is the gap between thermodynamic feasibility and biological reality.
A genome-scale metabolic model tells you what a cell could do, given its biochemical parts list. It does not tell you what the cell actually does, because that depends on regulation, environment, and evolutionary history, none of which stoichiometry captures.
The Current State: What Works, What Breaks
What Works
GEMs are remarkably effective for a specific class of problems.
- Predicting essential genes: FBA can identify which gene knockouts are lethal by checking whether any feasible flux distribution can sustain growth without the knocked-out reaction. This works well for central metabolism and has been validated extensively across organisms.
- Predicting growth phenotypes on different carbon sources: given a curated model and correct media composition, FBA growth predictions correlate reasonably well with experimental measurements.
- Identifying metabolic engineering targets: by introducing heterologous pathways and optimizing for product flux, GEMs can suggest gene knockouts, overexpressions, and cofactor balancing strategies. This is the workhorse application in synthetic biology and metabolic engineering, and it is the application I use most in my own work at GLBRC.
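The essentiality screen described above reduces to re-solving the same LP with one reaction's bounds pinned to zero. A sketch on an invented toy network; the reaction names and the 1% growth cutoff are illustrative choices, not fixed conventions:

```python
# In-silico essentiality sketch: knock out each reaction by forcing its
# flux to zero and test whether any feasible distribution still grows.
import numpy as np
from scipy.optimize import linprog

S = np.array([[1, -1,  0],        # toy network: uptake -> A -> B -> biomass
              [0,  1, -1]])
base_bounds = [(0, 10), (0, 1000), (0, 1000)]
c = np.array([0, 0, -1])          # maximize biomass (column 2)

def max_growth(bounds):
    res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
    return -res.fun if res.success else 0.0

wild_type = max_growth(base_bounds)
for j, name in enumerate(["uptake", "A->B", "biomass"]):
    ko = list(base_bounds)
    ko[j] = (0, 0)                          # the knockout: pin flux to zero
    growth = max_growth(ko)
    essential = growth < 0.01 * wild_type   # illustrative essentiality cutoff
    print(f"{name}: growth {growth:.1f}, essential={essential}")
```

In this linear toy network every reaction is essential; in a real model, redundant pathways make the outcome far less obvious, which is exactly why the screen is useful.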
What Breaks
GEMs break in predictable ways, and understanding these failure modes is essential for knowing where AI can help.
The annotation bottleneck. Every GEM depends on the quality of the genome annotation, and genome annotation is far from perfect. For well-studied organisms like E. coli, the metabolic annotation is mature and heavily curated. For non-model organisms, which is where most of the interesting biology and engineering opportunities live, a significant fraction of genes have no functional annotation, and many annotations are transferred by sequence homology with uncertain accuracy. I experienced this firsthand when building models for Halomonas and Novosphingobium: the annotation tools confidently assigned functions that, upon closer inspection, were based on distant homologs in entirely different metabolic contexts.
The regulation gap. FBA ignores gene regulation entirely. It assumes all annotated enzymes are available and active. In reality, many enzymes are only expressed under specific conditions, regulated by transcription factors, allosteric effectors, and post-translational modifications that the stoichiometric model knows nothing about. This is why FBA predictions for secondary metabolism, stress responses, and condition-specific phenotypes are often poor: the model says the pathway is available, but the cell has turned it off.
The kinetics gap. Even when the right enzymes are expressed, their activity depends on kinetic parameters (Km, kcat, Ki) and metabolite concentrations that FBA does not consider. Two reactions that are thermodynamically feasible and stoichiometrically balanced may have vastly different rates, and the slow one may be the bottleneck that determines the actual flux. Enzyme-constrained models (ecModels) partially address this by incorporating protein abundance data and turnover numbers, but they require quantitative proteomics data that is expensive and often unavailable.
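The enzyme-constrained idea amounts to one extra inequality on the same LP: each catalyzed flux consumes protein in proportion to MW/kcat, and the total must fit a fixed protein budget. A sketch on the same kind of toy network, with invented cost and budget numbers:

```python
# Enzyme-constrained FBA sketch (GECKO/sMOMENT style): add the constraint
# sum_i (MW_i / kcat_i) * v_i <= protein budget. Numbers are illustrative.
import numpy as np
from scipy.optimize import linprog

S = np.array([[1, -1,  0],
              [0,  1, -1]])
bounds = [(0, 10), (0, 1000), (0, 1000)]
c = np.array([0, 0, -1])            # maximize biomass

# Protein cost per unit flux (MW/kcat) for each reaction; the exchange
# and biomass pseudo-reactions carry no enzyme cost.
cost = np.array([[0.0, 0.5, 0.0]])  # g enzyme per (mmol/gDW/h)
budget = np.array([2.5])            # g enzyme per gDW available

res = linprog(c, A_eq=S, b_eq=np.zeros(2),
              A_ub=cost, b_ub=budget, bounds=bounds, method="highs")
print(-res.fun)   # growth is now capped by the protein pool, not by uptake
```

With these numbers the internal flux is protein-limited to 5, so growth drops below the uptake-limited optimum; that shift from substrate-limited to proteome-limited regimes is the behavior ecModels are built to capture.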
The community gap. Most organisms do not live in isolation. They live in communities where metabolic exchange between species shapes the behavior of every member. Community-level metabolic modeling exists, but it is computationally expensive, data-hungry, and fraught with assumptions about how species interact. My metagenomics work at GLBRC, building 492 metagenome-assembled genomes from wastewater treatment samples, was partly motivated by the need to understand these community-level metabolic interactions. The genomes are the starting point. The interactions are the frontier.
Foundation Models as Intelligence Layers
Here is where I believe the field is about to shift. The traditional approach to improving GEMs has been incremental: better annotations, more curated reactions, additional constraint types (thermodynamic, enzyme-constrained, regulatory). Each improvement requires significant manual effort and domain expertise. The progress is real but slow.
Foundation models offer a different strategy. Instead of improving the model by hand, use a pretrained model that has already learned biological relationships at scale to fill in the gaps that manual curation cannot reach.
Annotation Enhancement
The most immediate application is functional annotation. Protein language models like ESM-2 and ESM Cambrian have learned representations of protein sequence space that encode structural and functional information. A protein whose function is "unknown" in the database may have an ESM-2 embedding that clusters tightly with well-characterized enzymes, revealing its likely function without any new experimental data. This is not hypothetical. Research groups are already using protein embeddings to fill annotation gaps in metabolic reconstructions, and the results are significantly more accurate than traditional sequence homology methods because the embeddings capture higher-order patterns that pairwise alignment misses.
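A minimal sketch of how embedding-based annotation transfer works: assign an uncharacterized protein the function of its nearest characterized neighbor by cosine similarity. The random vectors and enzyme names here are stand-ins; in practice the embeddings would come from a model such as ESM-2:

```python
# Annotation transfer in embedding space: nearest-neighbor lookup by
# cosine similarity. Embeddings are random placeholders for real ones.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 8-dim embeddings for characterized enzymes.
reference = {
    "malate_dehydrogenase": rng.normal(size=8),
    "citrate_synthase":     rng.normal(size=8),
    "aconitase":            rng.normal(size=8),
}
# "Unknown" protein: near malate_dehydrogenase plus a little noise.
query = reference["malate_dehydrogenase"] + 0.05 * rng.normal(size=8)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = {name: cosine(query, emb) for name, emb in reference.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Real pipelines replace the single nearest neighbor with k-NN voting or a classifier over the embedding, and calibrate a similarity threshold below which the transfer is refused rather than guessed.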
What this means for GEMs: the annotation bottleneck does not disappear, but it narrows substantially. A model built from foundation-model-enhanced annotations will have fewer gaps, fewer incorrect assignments, and better coverage of non-model organisms. For someone like me, working on microbes that have been studied by a handful of labs, this is transformative.
Regulatory Prediction
The regulation gap is harder, but foundation models are beginning to address it. Models trained on gene expression data across thousands of conditions (like scGPT for single-cell data, or models trained on bulk transcriptomic compendia) learn context-dependent relationships between genes. Given a set of environmental conditions, they can predict which genes are likely to be expressed, providing a condition-specific filter that transforms a static GEM into a context-aware model.
The architecture I envision is layered. The GEM provides the structural scaffold: the set of all possible metabolic reactions. The foundation model provides the regulatory intelligence: which of those reactions are active under specific conditions. FBA then operates on the intersection, the set of reactions that are both structurally possible and regulatorily active. This layered approach preserves the mechanistic interpretability of the GEM while importing the learned biological knowledge of the foundation model.
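That layered intersection can be sketched in a few lines. The reaction identifiers and the hard-coded predictor below are placeholders for a real GEM and a real foundation-model query:

```python
# Layered architecture sketch: the GEM supplies the structural reaction
# set, a (mocked) foundation model supplies per-condition activity calls,
# and FBA would then run on the intersection.

gem_reactions = {"PGI", "PFK", "FBA", "TPI", "GAPD", "NITRO_FIX"}

def predicted_active(reaction, condition):
    # Stand-in for a foundation-model call: here, a hard-coded rule that
    # nitrogen fixation is repressed when ammonium is available.
    if reaction == "NITRO_FIX" and condition["NH4"] > 0:
        return False
    return True

condition = {"NH4": 5.0, "O2": 0.2}
active = {r for r in gem_reactions if predicted_active(r, condition)}
print(sorted(active))
# FBA would now run with the inactive reactions' bounds set to zero,
# turning the static GEM into a condition-specific model.
```

The design choice worth noting: the learned component only gates reactions on and off, so every flux prediction remains traceable to the mechanistic stoichiometry underneath it.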
Kinetic Parameter Prediction
Kinetic parameters (Km, kcat) are among the most data-scarce quantities in biology. The BRENDA database contains kinetic measurements for a small fraction of known enzymes, often measured under non-physiological conditions. Foundation models trained on enzyme sequences, structures, and available kinetic data are beginning to predict these parameters directly from sequence. The predictions are noisy, but they are better than the alternative, which is having no kinetic information at all.
When kinetic predictions become reliable enough to constrain enzyme-constrained FBA models, we will be able to build condition-specific, kinetically informed metabolic models for organisms that have never been characterized kinetically. That is a step change in the resolution and accuracy of metabolic predictions.
The foundation model does not replace the genome-scale model. It becomes the intelligence layer that fills the gaps the genome-scale model has always had: annotation uncertainty, regulatory context, and kinetic parameters.
AI Agents as Operators
This is the part of the trajectory that is furthest from current practice and, I think, the most important to think about now.
Today, running a GEM-based analysis requires a skilled computational biologist. You need to know how to reconstruct the model, curate it, set up the FBA problem, interpret the results, identify inconsistencies, refine the model, and iterate. This process takes weeks to months per organism and requires deep domain expertise. It is the bottleneck that limits how widely GEMs are used.
AI agents can change this. An agent is a system that takes a goal ("build a metabolic model for this organism and identify engineering targets for product X"), decomposes it into subtasks, executes those subtasks using available tools, evaluates the results, and iterates. The tools an agent would use for GEM-based work already exist as software packages: genome annotation (Prokka, PGAP), model reconstruction (CarveMe, ModelSEED, KBase), gap filling (GapFind/GapFill), FBA (COBRApy), and flux variability analysis. What does not yet exist is the orchestration layer that connects these tools into an autonomous workflow that can handle the decision-making currently done by a human.
What an Agent-Operated GEM Workflow Looks Like
Imagine this: you give an AI agent a genome sequence and a target molecule. The agent annotates the genome using multiple tools, compares the annotations, resolves conflicts using a foundation model's functional predictions. It reconstructs a draft metabolic model, runs gap-filling to ensure the model can produce biomass, and validates the model against any available experimental data (growth rates, substrate uptake rates, product titers). If validation fails, it diagnoses the likely cause: missing reactions, incorrect gene annotations, inappropriate boundary conditions. It adjusts and reruns. Once validated, it runs OptKnock or similar algorithms to identify engineering targets, ranks them by predicted yield improvement, and generates a report with confidence intervals and caveats.
This is not science fiction. Every individual step in this workflow exists as a software tool. What is new is the idea that an AI agent could orchestrate the entire process, making the judgment calls that currently require a PhD-level computational biologist. The agent does not need to be perfect. It needs to be good enough to handle routine cases and flag the non-routine ones for human review.
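The orchestration loop itself is ordinary control flow. Here is a sketch in which stub functions stand in for the real tools (Prokka, CarveMe, gap-filling, FBA-based validation); nothing external is actually called, and the names and return values are invented:

```python
# Agent-orchestrated reconstruction loop, sketched with stub tools.
# The point is the control flow: build, validate, diagnose, adjust,
# and iterate until validation passes or the agent escalates.

def annotate(genome):            # stub for Prokka + embedding-based rescue
    return {"genes": 1000, "unannotated": 300}

def reconstruct(annotation):     # stub for CarveMe / ModelSEED
    return {"reactions": 800, "gaps": 3}

def gap_fill(model):             # stub for a GapFill run
    return dict(model, gaps=max(0, model["gaps"] - 2))

def validate(model):             # stub: compare FBA growth to experiment
    return model["gaps"] == 0

def build_model(genome, max_rounds=5):
    model = reconstruct(annotate(genome))
    for _ in range(max_rounds):
        if validate(model):
            return model, "validated"
        model = gap_fill(model)  # the agent's diagnose-and-adjust step
    return model, "needs human review"

model, status = build_model("contigs.fasta")
print(status, model["gaps"])
```

The escalation branch is the important part: the agent does not need to resolve every case, only to know when it has failed and hand the model to a human with its diagnostic trail attached.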
The implications are significant. If building a validated GEM goes from months of expert time to days of automated processing, the number of organisms with high-quality metabolic models increases by orders of magnitude. And if those models are coupled with foundation-model-enhanced annotations and kinetic predictions, their accuracy improves simultaneously. More models, better models, faster.
The Lab-in-the-Loop
The agent concept extends beyond computation. In the design-build-test-learn cycle that drives metabolic engineering, the "test" step generates experimental data that feeds back into the model. An agent that can not only run FBA but also design the next experiment, specify the measurements needed to resolve a model ambiguity, and interpret the results when they arrive, becomes an autonomous scientific collaborator.
This is the lab-in-the-loop paradigm. The agent designs an experiment based on model predictions. The wet lab executes it. The data comes back. The agent updates the model, revises its engineering strategy, and designs the next experiment. Each cycle refines both the model and the engineering solution. The human's role shifts from executing the cycle to supervising the agent, providing biological judgment on edge cases, and deciding which directions to pursue.
The Data Bottlenecks Nobody Talks About
The vision I have described above, foundation models as intelligence layers, agents as operators, lab-in-the-loop design cycles, is technically plausible. But it depends on data, and the data situation in metabolic biology is worse than most people realize.
Bottleneck 1: Kinetic Data Is Catastrophically Sparse
BRENDA, the most comprehensive database of enzyme kinetic parameters, contains measurements for roughly 90,000 enzyme-substrate pairs. That sounds like a lot. It is not. The number of enzyme-substrate pairs that exist in nature is estimated in the hundreds of millions. Coverage is below 0.1%. Worse, the measurements that do exist are often inconsistent, measured under different conditions, in different organisms, using different assay methods. A foundation model trained on this data will learn the patterns that are there, but the sparsity means that predictions for poorly characterized enzymes will have high uncertainty.
The path forward requires new high-throughput experimental methods for measuring kinetic parameters at scale (microfluidics-based enzyme assays, cell-free systems), combined with standardized reporting formats that make the data machine-readable from the moment of measurement.
Bottleneck 2: Fluxomics Is Still Expensive and Rare
The gold standard for validating a metabolic model's flux predictions is 13C metabolic flux analysis (13C-MFA), which uses isotope labeling and mass spectrometry to measure actual intracellular fluxes. This is expensive, technically demanding, and produces data for a relatively small number of central metabolic reactions. Most GEMs are validated against growth rates and a handful of extracellular metabolite measurements, which is a weak test: many different flux distributions can produce the same growth rate.
Until fluxomics becomes cheaper and more comprehensive, model validation will remain a bottleneck. Foundation models can help by predicting fluxes from more readily available data (transcriptomics, proteomics), but those predictions need to be validated against actual flux measurements, which brings us back to the data gap.
Bottleneck 3: Non-Model Organisms Are Data Deserts
The organisms with the best metabolic models (E. coli, S. cerevisiae, human cell lines) are also the organisms with the most experimental data. This creates a circular problem: models are best where data is abundant, but the organisms where new engineering and discovery opportunities are richest are often the ones with the least data.
I have felt this acutely in my own work. Building a model for Novosphingobium at GLBRC meant working with an organism that has a genome sequence, a handful of published transcriptomics studies, and almost no kinetic or fluxomic data. The model I built is useful for generating hypotheses, but its predictive accuracy is limited by the data it was trained on. Foundation models can partially fill this gap by transferring knowledge from well-studied organisms, but transfer learning across phylogenetically distant organisms is still an open problem.
Bottleneck 4: Negative Results Are Not Published
This is perhaps the most insidious data bottleneck. The metabolic engineering literature is heavily biased toward positive results: strains that produced more, knockouts that improved yield, strategies that worked. The strategies that did not work, the knockouts that were lethal, the overexpressions that caused growth defects, are largely unreported. An AI system trained on this literature will inherit that bias, overestimating the success rate of engineering interventions and underestimating the space of things that fail.
The solution is cultural as much as technical: the field needs to value and publish negative results, and databases need to systematically capture failed engineering strategies alongside successful ones. A foundation model trained on both successes and failures will be more useful than one trained on successes alone, for the same reason that a chess engine that learns from lost games as well as won games develops better strategy.
Novel Opportunities for Groundbreaking Discovery and Engineering
Despite the bottlenecks, the convergence of GEMs with foundation models and AI agents opens several opportunities that were not possible even two years ago. These are the areas where I think the most significant advances will come in the next five to ten years.
Automated Model Reconstruction for the Entire Tree of Life
There are over 500,000 prokaryotic genome sequences in public databases. Fewer than 200 have curated genome-scale metabolic models. The gap is not conceptual. It is logistical: building and curating a model takes months of expert time. AI agents that can automate reconstruction, gap-filling, and validation could narrow this gap by orders of magnitude, producing draft-quality models for every sequenced organism. Even imperfect models at this scale would reveal metabolic capabilities, pathway distributions, and engineering opportunities across the tree of life that are currently invisible.
Community Metabolic Intelligence
Microbial communities are the frontier of metabolic engineering. The most efficient biomanufacturing systems in nature are not single organisms. They are consortia where different species perform different metabolic steps, exchange intermediates, and collectively produce outcomes that no single species could achieve alone. Combining community-level GEMs with metagenomics data and foundation model predictions of interspecies metabolic exchange could enable the rational design of synthetic microbial consortia for biomanufacturing, bioremediation, and agriculture. My metagenomics work at GLBRC, where we built 492 MAGs from wastewater treatment samples, is a small step toward the data foundation this requires.
Enzyme Discovery Through Embedding Space Exploration
Foundation models learn embedding spaces where proteins with similar functions cluster together. Exploring the neighborhoods of known enzymes in these embedding spaces can reveal novel enzymes with desired catalytic activities, even if those enzymes have low sequence similarity to anything in current databases. For metabolic engineering, this means the search space for heterologous enzymes to introduce into chassis organisms expands enormously. Instead of searching the literature for known pathways, you search the embedding space for proteins with the right functional signature, including proteins from uncultured organisms represented only by metagenomic sequences.
Multimodal Metabolic Digital Twins
The ultimate goal is a metabolic digital twin: a computational model that faithfully represents the metabolic state of a living cell under any condition. Today's GEMs are a rough approximation. The path to a true digital twin requires integrating stoichiometric models (what reactions are possible), kinetic models (how fast they run), regulatory models (which are active), and spatial models (where in the cell they occur) into a single, coherent framework. Foundation models can serve as the glue, learning the relationships between these layers from multimodal data (genomics + transcriptomics + proteomics + metabolomics + fluxomics) and predicting the behavior of the integrated system under new conditions. This is years away, but every piece of the architecture, the GEM, the foundation model, the multimodal data infrastructure, is being built right now.
Bioproduct Discovery from Uncharacterized Metabolism
Between 30% and 50% of genes in most microbial genomes have no known function. This means that between 30% and 50% of the metabolic potential of life on Earth is invisible to current models. Foundation models that can predict the function of these genes, even approximately, open the door to discovering entirely new metabolic pathways, novel natural products, and biochemical transformations that no one has described. Combined with GEM-based analysis of how these pathways integrate with the rest of the cell's metabolism, this could unlock a wave of bioproduct discovery that rivals the natural product discovery era of the mid-twentieth century.
How to Take the Field Forward
What I believe needs to happen
- Standardize the interface between GEMs and foundation models. Right now, every research group that combines metabolic modeling with ML builds a one-off integration. The field needs standard data formats, APIs, and interchange protocols that allow GEMs to consume foundation model predictions (functional annotations, kinetic parameters, regulatory states) as first-class inputs. SBML and COBRApy need to evolve to support foundation-model-derived constraints natively.
- Build the kinetic data infrastructure. The sparsity of enzyme kinetic data is the single biggest bottleneck for predictive metabolic modeling. We need high-throughput experimental platforms for kinetic measurement, standardized reporting formats (building on STRENDA guidelines), and community databases that are designed from the start to be training data for ML models, not just repositories for human browsing.
- Invest in negative results. Create incentives and infrastructure for publishing and curating failed metabolic engineering experiments. Every failed knockout, every overexpression that reduced yield, every pathway that did not function in a heterologous host is a data point that makes models better. The field cannot afford to throw this data away.
- Train the next generation of metabolic modelers in AI. The people who build GEMs need to understand foundation models, and the people who build foundation models need to understand GEMs. Right now, these are largely separate communities. The most impactful work in the next decade will come from people who can operate across both.
- Build agent frameworks for metabolic engineering. The software infrastructure for AI agents that can operate metabolic modeling tools, design experiments, and interpret results does not yet exist in a mature form. Building it is both an engineering project and a scientific one: the agent needs not just tool-use capability but biological judgment. This is where domain expertise and AI capability need to merge.
- Focus on non-model organisms. The organisms with the greatest untapped metabolic potential are the ones we know the least about. Every investment in data generation, model building, and foundation model training for non-model organisms has outsized returns, because we are moving from near-zero knowledge toward first-order understanding. The discoveries are waiting in the data deserts.
A Personal Note
I write this from the perspective of someone who has built genome-scale models by hand, reaction by reaction, annotation by annotation. I have experienced the frustration of a gap-filling algorithm that cannot find a solution because the annotation missed a key transporter. I have experienced the satisfaction of a model that correctly predicts a growth phenotype that the wet-lab team later validates. I have experienced the humility of a model that is confidently wrong because the regulation it ignores is the thing that matters most.
The AI tools I have described in this post will not replace that experience. They will accelerate it. The biologist who understands what a metabolic model means, what its assumptions are, where its predictions can be trusted and where they cannot, will be more valuable in the age of AI, not less. The tools will get faster. The judgment will not automate.
What excites me most about this moment is not the technology itself but what it makes possible to ask. With automated reconstruction, foundation-model-enhanced annotation, and agent-operated analysis, we can for the first time ask metabolic questions at the scale of the biosphere. Not one organism at a time, but thousands. Not one condition, but every condition the organism has ever encountered. Not one engineering strategy, but the entire combinatorial space.
The biology is waiting. The tools are arriving. The question is whether we build the bridges between them fast enough to use them well.