A child puts her hand near a flame. She feels heat. She pulls away. The next time she sees a flame, she does not reach for it. Something has changed inside her that was not there before. She has a model of the world now, a small one, barely articulate, but functional: flames are hot, hot things hurt, do not touch.
That is a learning system.
It is also, in its essential logic, the same thing as a neural network trained on a million protein sequences to predict which ones will fold into stable structures. The complexity is different. The mechanism is different. The substrate is different. But the underlying architecture of the process is identical: experience comes in, internal representations change, and future behavior improves.
This essay is an attempt to think clearly about what that architecture actually is, where the key ideas came from, and why understanding it matters, not just for computer scientists, but for anyone who wants to understand how intelligence works, whether it runs on carbon or silicon.
The Core Logic: Experience, Representation, Prediction
Strip away the jargon and every learning system has three components.
First, there is experience: data, observations, encounters with the world. For a child, this is sensory input. For a machine learning model, this is a training dataset. For a bacterium developing antibiotic resistance, this is selective pressure from the environment. The form varies, but the function is the same: the system receives information from outside itself.
Second, there is representation: some internal structure that changes in response to experience. In a brain, these are synaptic connections. In a neural network, these are weights. In a population of bacteria, this is the distribution of genetic variants. The representation is what the system "knows," encoded not as explicit statements but as patterns of connection, activation, or frequency.
Third, there is prediction: the system uses its representation to anticipate what will happen next and act accordingly. The child predicts that the flame will burn and avoids it. The neural network predicts that a protein sequence will fold into a particular structure. The resistant bacterium survives an antibiotic dose that kills its neighbors. Prediction is where learning meets the world. It is the test.
A system that changes its internal state in response to experience, in a way that improves its future performance, is learning. That is the entire definition. Everything else is implementation detail.
The Hypotheses That Built Modern Learning Systems
The learning systems we use today, deep neural networks, large language models, biological foundation models, did not appear from nothing. They are the product of a handful of powerful hypotheses, each of which was controversial when proposed and is now so deeply embedded in our technology that we forget it was ever an idea at all.
The Neuron Hypothesis (McCulloch and Pitts, 1943)
The first formal model of an artificial neuron proposed that a network of simple binary units, each of which fires or does not fire based on the weighted sum of its inputs, could in principle compute any logical function. This was a radical claim. It said that intelligence, or at least computation, could emerge from the collective behavior of very simple components. You did not need a central controller. You did not need explicit rules. You needed a network.
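The McCulloch-Pitts unit is simple enough to write in a few lines. The sketch below is illustrative (the weights and thresholds are hand-chosen, not learned): a binary unit fires if the weighted sum of its inputs reaches a threshold, and single units already compute basic logical functions, the building blocks of any circuit.

```python
# A McCulloch-Pitts unit: fire (1) if the weighted sum of binary inputs
# reaches the threshold, otherwise stay silent (0).
def mp_neuron(inputs, weights, threshold):
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

# Hand-chosen weights and thresholds make single units compute
# elementary logical functions.
AND = lambda a, b: mp_neuron([a, b], [1, 1], threshold=2)
OR  = lambda a, b: mp_neuron([a, b], [1, 1], threshold=1)
NOT = lambda a:    mp_neuron([a],    [-1],   threshold=0)
```

Networks of such gates can compose any logical function, which is the core of the 1943 claim.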
This hypothesis is so foundational that it is easy to miss how audacious it was. It proposed that the architecture of the brain, a network of neurons, was not just biological plumbing. It was a computational architecture. And if it was a computational architecture, it could be abstracted, formalized, and eventually rebuilt in a different substrate.
The Learning Rule Hypothesis (Hebb, 1949)
Donald Hebb proposed a simple principle: neurons that fire together wire together. If two neurons are repeatedly active at the same time, the connection between them strengthens. This was the first concrete proposal for how a network could learn from experience, not by being programmed, but by having its connections shaped by the patterns it encounters.
Hebb's rule is the ancestor of every weight update rule in modern deep learning. When you train a neural network using gradient descent, you are adjusting connection strengths based on how well the network's predictions match reality. The mechanism is more sophisticated (backpropagation, not simple co-activation), but the principle is the same: the network's structure is shaped by the data it sees.
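Hebb's principle can be sketched directly. The learning rate and input patterns below are illustrative, not from Hebb; the point is only that a connection strengthens in proportion to the co-activation of the units it joins, with no external teacher.

```python
# Hebb's rule: strengthen each weight in proportion to the
# co-activation of its presynaptic and postsynaptic units.
def hebbian_update(w, pre, post, lr=0.1):
    return [wi + lr * pre_i * post for wi, pre_i in zip(w, pre)]

w = [0.0, 0.0]
# Repeatedly presenting a pattern where input 0 is active alongside
# the output strengthens only that connection; the silent input's
# weight never changes.
for _ in range(5):
    w = hebbian_update(w, pre=[1, 0], post=1)
```

After five presentations the first weight has grown and the second has not: the network's structure now reflects the statistics of what it saw.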
The Perceptron and Its Limits (Rosenblatt, 1958; Minsky and Papert, 1969)
Frank Rosenblatt built the Perceptron, a physical machine that could learn to classify inputs by adjusting weights. It worked. It learned to distinguish different patterns from labeled examples. The excitement was enormous. Then Minsky and Papert published a mathematical analysis showing that single-layer perceptrons could not learn certain basic functions, like XOR (exclusive or). The result was correct but its implications were overstated: it killed funding for neural network research for over a decade.
The lesson here is not about the mathematics. It is about the sociology of science. A correct but narrow result, applied too broadly, can delay progress for a generation. The limitation Minsky and Papert identified was real, but the solution (adding more layers, creating deep networks) was already conceptually available. It just took 20 years for the field to recover.
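Both the perceptron rule and its limit fit in a few lines. In this sketch (learning rate and epoch count are illustrative), the rule nudges the weights whenever a prediction is wrong; it converges on OR, which is linearly separable, and can never get all four XOR cases right, which is exactly the limitation Minsky and Papert proved.

```python
# The perceptron learning rule: adjust weights and bias whenever the
# prediction misses the target.
def train_perceptron(data, epochs=20, lr=1.0):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:
            pred = int(w[0] * x[0] + w[1] * x[1] + b > 0)
            err = target - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

OR_data  = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```

No setting of two weights and a bias draws a line separating XOR's classes, so training on XOR_data never finds a correct classifier, no matter how long it runs.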
The Backpropagation Breakthrough (Rumelhart, Hinton, Williams, 1986)
Backpropagation solved the credit assignment problem: in a deep network with many layers, how do you know which weights to change and by how much? The answer is the chain rule from calculus, applied systematically from the output layer back through every hidden layer. This allowed deep networks to learn, and it reopened the field.
What makes backpropagation remarkable is not its mathematical complexity (it is an application of the chain rule, which every calculus student learns). What is remarkable is its generality. The same algorithm trains networks for image classification, language generation, protein structure prediction, and drug discovery. It works not because it captures something specific about any one domain, but because it captures something universal about how to adjust a parameterized function to better fit data.
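The chain-rule logic can be sketched on the smallest possible deep network, one with two weights in series; the architecture and values here are illustrative. The backward pass multiplies local derivatives from the output back toward the input, and its answer can be checked against a finite-difference estimate of the same gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass of a tiny 1-1-1 network: input -> hidden sigmoid -> output sigmoid.
def loss(w1, w2, x, target):
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    return 0.5 * (y - target) ** 2

# Backward pass: the chain rule, applied from the output layer back to w1.
def grad_w1(w1, w2, x, target):
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    dy = (y - target) * y * (1 - y)   # through the output unit
    dh = dy * w2 * h * (1 - h)        # through the hidden unit
    return dh * x                     # through the first weight

# Sanity check: the analytic gradient should match a central
# finite-difference estimate of the same derivative.
w1, w2, x, t, eps = 0.5, -0.3, 1.0, 1.0, 1e-6
numeric = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
```

The same backward sweep, repeated layer by layer, is all that scales this from two weights to billions.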
The Representation Learning Hypothesis (Bengio, 2009; the Deep Learning Era)
The deep learning revolution rests on a specific hypothesis: given enough data and enough layers, a neural network can learn not just the mapping from input to output, but the right representation of the input for the task at hand. You do not need to hand-engineer features. You do not need to tell the network what to look for. The network discovers, through training, which features of the data are informative and which are noise.
This hypothesis is what makes foundation models like ESM-2 and AlphaFold2 possible. ESM-2 was not told what protein structure is. It was trained on millions of protein sequences and learned, purely from patterns of co-occurrence and conservation, a representation of protein sequence space that encodes structural and functional information. The biology was in the data. The model found it.
The Scaling Hypothesis (Kaplan et al., 2020; the Large Model Era)
The most recent and perhaps most consequential hypothesis is that the performance of neural networks improves predictably as you simultaneously increase three things: the amount of data, the size of the model, and the amount of compute. This is not a vague claim. It is an empirical observation backed by power laws: doubling the model size produces a measurable, predictable improvement in performance.
The scaling hypothesis is what drove the creation of GPT-3, GPT-4, and their successors. It is what drives Biohub's investment in 10,000 GPUs for biological AI. It is the reason that a relatively straightforward architecture (the transformer) trained at sufficient scale can produce behavior that looks remarkably like understanding. Whether it is understanding is a philosophical question. That it is useful is an empirical fact.
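The power-law form can be sketched numerically. In Kaplan et al.'s formulation, test loss falls as a power law in parameter count N; the constants below are roughly the values reported for model-size scaling and are used here only for illustration.

```python
# Scaling-law form for model size: L(N) = (N_c / N) ** alpha.
# The constants approximate the fitted values reported by
# Kaplan et al. (2020) and are illustrative, not authoritative.
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

# Doubling the parameter count shrinks the predicted loss by the same
# fixed factor, 2 ** -alpha, no matter where you start on the curve.
ratio = predicted_loss(2e9) / predicted_loss(1e9)
```

That constant ratio is what makes the improvement "predictable": you can budget compute for a target loss before training a single model.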
What All Learning Systems Share
If you step back far enough, every learning system, biological or artificial, shares a common structure. There is an environment that generates data. There is an agent that receives that data and maintains an internal model. There is a feedback signal that tells the agent how well its model predicts the environment. And there is an update rule that adjusts the model in the direction of better predictions.
In supervised learning, the feedback signal is explicit: here is the right answer, how far off were you? In reinforcement learning, the feedback signal is sparse: you get a reward or a punishment, but you have to figure out which of your actions led to it. In unsupervised learning, there is no external feedback at all: the system learns by finding structure in the data itself, by predicting what comes next, by compressing, by reconstructing.
What unifies all of these is optimization. Every learning system is, at its mathematical core, a system that adjusts its parameters to minimize some measure of error or maximize some measure of fit. The differences are in what is being optimized, how the error is defined, and what constraints the system operates under. But the logic is the same: experience in, error computed, parameters adjusted, predictions improved.
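That shared loop can be written generically; the model, data, and learning rate here are illustrative. A single parameter is fit to data generated by y = 3x by repeatedly computing the error gradient and adjusting in the direction of better predictions.

```python
# The generic learning loop: experience in, error computed,
# parameters adjusted, predictions improved.
data = [(x, 3.0 * x) for x in range(1, 6)]  # experience: (input, target) pairs
w, lr = 0.0, 0.01                           # representation: one parameter

for _ in range(200):
    # Error of the current model on the experience (gradient of mean
    # squared error with respect to w)...
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    # ...drives the parameter update toward better predictions.
    w -= lr * grad
```

After a few hundred updates w has converged to 3, the setting that minimizes the error; every system in this essay is an elaboration of this loop.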
Why This Matters Beyond Computer Science
Understanding what a learning system is, at this level of abstraction, matters for everyone, not just machine learning researchers. It matters because learning systems are increasingly making decisions that affect human lives: which patients get screened for cancer, which drug candidates advance to trials, which genetic variants are flagged as pathogenic. If you do not understand the logic of these systems, you cannot evaluate their outputs, question their assumptions, or recognize their failures.
It also matters because the most important learning systems of the next decade will be hybrid: biological data, computational models, and human judgment working together. The scientist who understands both how a cell learns (through evolution, gene regulation, and adaptation) and how an AI system learns (through optimization, representation, and generalization) will be uniquely positioned to build the tools that connect the two.
That is the intersection I am working toward, and the reason I think about learning systems not as a computer science topic or a biology topic, but as a question about how intelligence works, in all its forms.
The question is not whether machines can learn. They demonstrably can. The question is whether we can build learning systems that learn the right things, for the right reasons, in ways we can understand and trust.