A single bacterial cell, something far too small to see without a microscope, performs feats of information processing that would humble most software systems. It senses its environment through hundreds of receptor proteins. It integrates signals from nutrient gradients, temperature shifts, osmotic stress, and chemical toxins simultaneously. It regulates the expression of thousands of genes in real time, adjusting metabolic flux through dozens of interlinked pathways. It makes decisions: swim toward food, express a resistance gene, enter a dormant state, divide. All of this happens without a central processor, without a clock, without source code. The intelligence is distributed across a molecular network that has been optimized by evolution over billions of years.
Now consider what it takes to model that cell computationally. We have the genome sequence. We have transcriptomic snapshots. We have proteomic measurements. We have metabolic network reconstructions. We have kinetic parameters for some of the enzymes. And yet, with all of this data, we cannot predict with confidence what that cell will do when you change a single nutrient in its growth medium.
This gap between what biology knows and what we can compute is the central problem of computational biology. And it is harder than most people think.
What We Mean by Biological Intelligence
I want to be precise about language here. When I say biological intelligence, I do not mean consciousness or cognition in the human sense. I mean something more specific and more useful: the ability of a biological system to process information from its environment, maintain internal representations of relevant variables, and generate adaptive responses that increase its fitness.
By this definition, intelligence is everywhere in biology. A bacterium performing chemotaxis is intelligent. A gene regulatory network that implements a switch between two metabolic states is intelligent. An immune system that remembers a pathogen it encountered years ago and mounts a faster response the second time is intelligent. Evolution itself is intelligent: it searches the space of possible genotypes, evaluates fitness in the environment, and concentrates probability on solutions that work.
What all of these systems share is that they encode knowledge. Not as explicit rules written in a language we can read, but as structure: the arrangement of nucleotides in a genome, the topology of a regulatory network, the three-dimensional fold of a protein, the spatial organization of a tissue. Biological intelligence is embodied intelligence. The knowledge is inseparable from the physical system that implements it.
The genome does not contain a description of the organism. It contains a program that, when executed in the right biochemical context, constructs the organism. That distinction changes everything about how we must approach computational modeling.
Why the Encoding Problem Is So Hard
The Representation Gap
The most fundamental challenge is representational. Biology's "data format" is physical chemistry: molecular shapes, binding affinities, reaction kinetics, diffusion coefficients, membrane potentials, mechanical forces. Our digital representations are matrices of numbers: gene expression counts, protein abundance measurements, metabolite concentrations. Every measurement we take is a lossy compression of a rich, continuous, spatiotemporal physical reality into a sparse, discrete snapshot of numbers.
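A toy illustration of that lossiness, with entirely invented numbers: two cells whose transcript dynamics differ completely can produce identical single-timepoint measurements. The snapshot is a faithful measurement and a destructive compression at the same time.

```python
import numpy as np

# Two hypothetical cells with very different expression dynamics:
# one oscillating transcript level, one constant.
t = np.linspace(0, 10, 1001)          # time in arbitrary units
oscillating = 50 + 40 * np.sin(2 * np.pi * t)   # bursty expression
constant = np.full_like(t, 50.0)                # steady expression

# A single-timepoint measurement -- the "snapshot" most omics data gives us.
i = np.argmin(np.abs(t - 0.0))        # sample both cells at t = 0
snap_osc, snap_const = oscillating[i], constant[i]

# Both snapshots read 50: identical data, radically different biology.
# The temporal structure never makes it into the digital representation.
```

The point is not that snapshots are useless, only that the compression is silent: nothing in the measured value tells you what was thrown away.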
When we build a genome-scale metabolic model, we represent the cell's metabolism as a stoichiometric matrix and a set of flux constraints. This captures the topology of the network and the mass balance relationships between metabolites. It does not capture enzyme kinetics, allosteric regulation, compartmentalization, metabolite channeling, protein-protein interactions, or the stochastic fluctuations that dominate behavior in single cells. The model is useful. It is also profoundly incomplete. And the information that is missing is not noise. It is biology.
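To make that representational choice concrete, here is a minimal flux balance analysis sketch: a three-reaction toy network (invented for illustration, not a real reconstruction) reduced to a stoichiometric matrix, flux bounds, and a linear objective. Notice how much of the paragraph's list, kinetics, allostery, compartments, stochasticity, simply has no place in this formalism.

```python
import numpy as np
from scipy.optimize import linprog

# Toy flux balance analysis over a three-reaction chain:
#   R1: (uptake) -> A,   R2: A -> B,   R3: B -> (biomass)
# Rows are metabolites (A, B); columns are reactions (R1, R2, R3).
S = np.array([
    [1, -1,  0],   # A: produced by R1, consumed by R2
    [0,  1, -1],   # B: produced by R2, consumed by R3
])

bounds = [(0, 10), (0, None), (0, None)]  # uptake capped at 10 flux units
c = [0, 0, -1]  # linprog minimizes, so negate to maximize biomass flux R3

# Steady state (S v = 0) plus flux bounds is ALL the model knows.
# No enzyme kinetics, no regulation, no noise survive this representation.
res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds)
biomass_flux = res.x[2]   # optimal biomass production, limited by uptake
```

The optimum here is fully determined by the uptake bound, which is exactly the kind of prediction this class of model is good at, and exactly the kind of answer that says nothing about dynamics.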
The Scale Problem
Biology operates across scales in a way that digital systems do not. A single nucleotide change in a gene can alter a protein's binding affinity, which changes a regulatory interaction, which shifts the expression of a metabolic pathway, which alters the cell's growth rate, which changes the competitive dynamics of a microbial community, which affects an ecosystem-level process like carbon cycling. The causal chain spans roughly ten orders of magnitude in space and even more in time, from angstroms to meters, from nanoseconds to years.
No computational framework can simulate all of these scales simultaneously with full fidelity. We are forced to choose: model one scale in detail and treat other scales as boundary conditions or parameters. This means that every model is, by construction, an approximation that works within its chosen scale and breaks at the interfaces. Connecting models across scales, so that molecular details inform cellular behavior and cellular behavior informs population dynamics, is one of the great unsolved problems in computational biology.
The Context Problem
The same gene does different things in different cells, at different times, under different conditions. Context is not a confound in biology. It is the phenomenon. A gene's function is not a fixed property of its sequence. It is an emergent property of its sequence in the context of a particular regulatory network, in a particular cellular state, in a particular environment.
This is why single-gene knockout experiments so often produce unexpected results. The cell is not a bag of independent parts. It is a system, and perturbing one part changes the behavior of every other part that interacts with it, directly or indirectly. Encoding this context-dependence into digital models requires moving beyond lists of genes and functions toward network-level and systems-level representations, which in turn require far more data than we typically have.
The Interpretability Problem
Suppose you train a deep neural network on millions of protein sequences and it learns to predict protein function with impressive accuracy. You now have a model that encodes something about the relationship between sequence and function. But what does it know? The knowledge is distributed across millions of parameters in a high-dimensional weight space. You cannot open the model and read off a biological principle the way you can read a textbook.
This is the interpretability problem, and it is especially acute in computational biology because the purpose of biological modeling is not just prediction. It is understanding. A model that predicts correctly but cannot explain why is useful for engineering (designing proteins, optimizing metabolic pathways), but it does not advance scientific knowledge in the way that a mechanistic model does. The tension between predictive power and mechanistic interpretability is one of the defining tensions in the field right now.
What AI Changes About This Problem
The recent wave of biological AI, foundation models trained on massive datasets of sequences, structures, expression profiles, and perturbation responses, changes the encoding problem in a specific and important way. These models do not solve the problem of translating biological intelligence into digital representations. But they offer a new strategy: instead of encoding biological knowledge explicitly as rules, equations, and constraints, let the model discover representations implicitly from data.
ESM-2 learned protein structure from sequences alone, without ever being told what structure is. scGPT learned cell state representations from single-cell transcriptomic data, without ever being told what a cell type is. AlphaFold2 learned the physics of protein folding from known structures and co-evolutionary information, without solving the Schrödinger equation.
These models succeed because biological data has structure, and neural networks are very good at finding structure in high-dimensional data. The protein sequence space is not random. It is shaped by evolution, by physical chemistry, by functional constraint. The models find those constraints because the constraints are in the data, waiting to be extracted by a sufficiently powerful learning algorithm.
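A minimal sketch of that idea, using a fabricated "family" rather than real sequences: when positions are constrained (here, crudely, held fixed, standing in for evolutionary conservation), the sequences occupy a lower-dimensional subspace than unconstrained ones, and even plain SVD can see it. Everything below, alphabet, lengths, conservation pattern, is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
AA, L, N = 20, 20, 1000   # alphabet size, sequence length, number of sequences

# Toy "protein family": the first 10 positions are conserved, the rest free.
# Real constraints come from evolution and chemistry; this caricature only
# shows that constraint leaves a detectable low-dimensional signature.
consensus = rng.integers(0, AA, size=L)
family = np.tile(consensus, (N, 1))
family[:, 10:] = rng.integers(0, AA, size=(N, 10))

random_seqs = rng.integers(0, AA, size=(N, L))   # unconstrained control

def onehot(seqs):
    """One-hot encode integer sequences into an (N, L*AA) matrix."""
    return np.eye(AA)[seqs].reshape(len(seqs), -1)

def dims_for_variance(X, frac=0.9):
    """Number of principal components needed to explain `frac` of variance."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, frac) + 1)

# The constrained family concentrates its variance in far fewer dimensions.
dims_family = dims_for_variance(onehot(family))
dims_random = dims_for_variance(onehot(random_seqs))
```

A neural network is, in this framing, a far more flexible version of the same move: it finds the low-dimensional structure that constraint imprints on the data.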
But there is a subtle danger here. The fact that a model can learn representations from data does not mean those representations are complete, or correct, or generalizable beyond the training distribution. A language model trained on published protein sequences will encode the biases of what has been studied and what has been deposited in databases. It will know a great deal about well-studied organisms and well-characterized protein families, and very little about the vast majority of biological diversity that has never been sequenced, purified, or characterized.
What This Demands of Computational Biologists
The encoding challenge is not going to be solved by computer scientists working alone. It requires people who understand what the biology means, who can evaluate whether a digital representation captures the right relationships and misses the right details, and who can connect the outputs of computational models to experiments that test their predictions.
That is why I believe the most important skill for a computational biologist in 2026 is not proficiency in any specific tool or language. It is the ability to think clearly about representation: what information is preserved when you transform a biological system into a digital model, what information is lost, and whether what is lost matters for the question you are trying to answer.
When I build a genome-scale metabolic model, I am making an explicit set of representational choices: which reactions to include, which constraints to impose, how to handle uncertainty in gene annotations. When I train a transformer model on protein sequences, I am making a different set of choices: how to tokenize the sequence, what embedding dimension to use, how to structure the attention mechanism. In both cases, the representation is not a neutral container for biological facts. It is a lens that determines what the model can see and what it is blind to.
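The tokenization choice mentioned above can be made concrete. The vocabulary, special tokens, and truncation length below are all invented for illustration, and each one is a representational decision: which residues get distinct identities, what happens to non-standard residues, and how much of a long sequence the model is even allowed to see.

```python
# A minimal character-level tokenizer for protein sequences.
# Every choice here is a representational commitment, not a neutral default.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
vocab["<unk>"] = len(vocab)   # non-standard residues collapse into one id
vocab["<pad>"] = len(vocab)   # padding token for fixed-length batches

def tokenize(seq, max_len=8):
    """Map a sequence to integer ids, truncating or padding to max_len."""
    ids = [vocab.get(aa, vocab["<unk>"]) for aa in seq[:max_len]]
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return ids

tokens = tokenize("MKTAYIAKQR")   # the last two residues are silently dropped
```

With `max_len=8`, any two sequences that agree on their first eight residues become indistinguishable to the model, a small, deliberate version of the blindness every representation imposes.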
Getting the representation right is the hard part. The algorithms, increasingly, will take care of themselves.
The defining challenge of computational biology is not building bigger models or generating more data. It is deciding what to represent, how to represent it, and knowing what you have left out.