Computational Biology · ML Architectures · Opinion

The Architecture of Biological Reasoning

Every algorithm we build for biology encodes a worldview about what biology is. Tracing the lineage from dynamic programming through gradient boosting to graph neural networks reveals not just what we have learned to compute, but what we have learned to assume.

Every computational tool in biology is a frozen argument. Behind the algorithm sits an assumption about what the relevant biological structure is, what counts as signal, and what counts as noise. Hidden Markov models assume biology is a sequence of states. Convolutional neural networks assume biology is a translation-invariant pattern. Graph neural networks assume biology is a network of interacting parts. Each architecture is, in this sense, a philosophical position about life, dressed in mathematics.

This is not a complaint. It is the necessary condition of any computational science. The interesting question is not whether our algorithms encode worldviews. They do. The question is which worldview a given algorithm encodes, what it captures well, what it misses systematically, and what comes after when its assumptions break down.

The history of computational algorithms in biology is, in this sense, a history of assumption-making. Tracing it is useful not because the past explains the present, but because each generation of tools was built to address what the previous generation got wrong, and each carried its own blind spots that the next had to expose. We are, today, in a moment where the dominant architectures are powerful enough that their blind spots are easy to overlook. That should make us more interested in their lineage, not less.


The First Algorithms: Sequence as the Atomic Unit

Computational biology as a discipline begins, more or less, with the recognition that biological sequences could be treated as strings, and that string manipulation algorithms from computer science could be repurposed to ask biological questions. The early algorithmic moves were not subtle. They were foundational.

1970 — Needleman and Wunsch

Dynamic Programming Enters Biology

Saul Needleman and Christian Wunsch published an algorithm for global pairwise sequence alignment in 1970 that did something quietly radical: it treated the alignment problem as an optimization problem with a recursive structure, and solved it using dynamic programming. The algorithm guarantees an optimal alignment under a defined scoring scheme. This is, in retrospect, the first widely adopted computational biology algorithm, and its influence is hard to overstate.

The worldview encoded was specific: biology is a string, evolution is a sequence of edits, and similarity is a quantifiable distance under a defined cost model. Smith and Waterman extended this to local alignment in 1981, allowing the identification of conserved regions within otherwise divergent sequences. The basic algorithmic move, recursive computation of optimal substructures, became the template for an entire generation of bioinformatics methods.
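The recursive structure is easiest to see in code. What follows is a minimal sketch of the global alignment score, not the original 1970 formulation: the function name and the scoring values (match +1, mismatch -1, linear gap -1) are illustrative assumptions, and real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b (illustrative scoring)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against nothing but gaps
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against nothing but gaps
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,        # gap in b
                score[i][j - 1] + gap,        # gap in a
            )
    return score[-1][-1]
```

Under these assumed scores, needleman_wunsch_score("ACGT", "ACT") returns 2: three matches and one gap. Every cell depends only on its three already-computed neighbors, which is the "optimal substructure" the text refers to, and the nested loops are where the quadratic cost comes from.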

1990 — BLAST

Heuristics Trade Optimality for Tractability

By 1990, sequence databases had grown large enough that the quadratic cost of dynamic programming alignment, paid again for every sequence in the database, was a practical bottleneck. The solution, BLAST (Basic Local Alignment Search Tool), made an explicit trade: give up the guarantee of optimality in exchange for speed. BLAST searches for short exact matches (seeds) and extends them, using statistical theory to estimate the significance of matches found. The result is approximate alignment that runs orders of magnitude faster than dynamic programming and is good enough for most biological purposes.

BLAST is a turning point because it openly accepts an approximation. The algorithm cannot prove its alignment is optimal. It can only say, with quantified confidence, that the alignment is unlikely to have arisen by chance. The shift from "find the best alignment" to "find alignments that are statistically significant" is one of the first explicit moves toward probabilistic reasoning in mainstream bioinformatics.
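The seed-and-extend idea can be sketched in a few lines. This is a toy version under loud assumptions: the function name is hypothetical, seeds are exact k-mers, and extension simply continues while characters agree. BLAST itself scores its extensions and attaches significance statistics to each hit, none of which is modeled here.

```python
from collections import defaultdict

def seed_and_extend(query, subject, k=3):
    """Toy seed-and-extend: exact k-mer seeds, extended while characters match."""
    # Index every k-mer start position in the subject sequence.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)

    hits = set()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            qs, ss = i, j                  # start of the match in query / subject
            qe, se = i + k, j + k          # end of the match (exclusive)
            # Extend left while the preceding characters agree.
            while qs > 0 and ss > 0 and query[qs - 1] == subject[ss - 1]:
                qs, ss = qs - 1, ss - 1
            # Extend right while the following characters agree.
            while qe < len(query) and se < len(subject) and query[qe] == subject[se]:
                qe, se = qe + 1, se + 1
            hits.add((qs, ss, qe - qs))    # (query start, subject start, length)
    # Longest matches first.
    return sorted(hits, key=lambda h: -h[2])
```

For the query "TTGATTACAGG" against the subject "CCCGATTACATT", the top hit is (2, 3, 7), the shared "GATTACA". The point of the design is that only positions sharing a seed are ever examined, so most of the quadratic table is never touched, and nothing guarantees the best alignment is among the hits.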

1990s — Hidden Markov Models

Probability as Architecture

Hidden Markov models were developed as a statistical framework in the late 1960s and became a workhorse of speech recognition in the 1970s and 1980s, but their adoption in biology in the 1990s, particularly through the HMMER package and the Pfam database, represented a deeper philosophical shift. An HMM does not just compute a similarity score. It models a sequence as the output of a probabilistic generative process: hidden states emit observable symbols according to defined probabilities, and the process moves from state to state according to transition probabilities.

For protein family classification and gene structure prediction, this was a powerful framing. A protein family is not just a collection of similar sequences. It is a stochastic process that generates sequences with characteristic patterns of conservation, insertion, and deletion. The HMM captures that process. The introduction of profile HMMs by Krogh and colleagues in 1994 made it possible to train these models on family alignments and use them to detect remote homologs that pure sequence comparison would miss.

The conceptual contribution of HMMs to biology was the normalization of probabilistic generative modeling as a research tool. Biologists who used HMMER were, whether they articulated it this way or not, treating biology as a stochastic process to be inferred, not a deterministic structure to be measured.
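To make "a stochastic process to be inferred" concrete, here is a minimal sketch of the forward algorithm, which computes the probability of an observed sequence by summing over every possible hidden path. The two-state model and all of its probabilities are invented for illustration; a profile HMM of the kind HMMER uses has a richer topology with match, insert, and delete states.

```python
def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """P(observed sequence) under an HMM, summed over all hidden state paths."""
    # alpha[s] = P(observations so far, current hidden state is s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical two-state model: AT-rich versus GC-rich stretches of DNA.
states = ("at_rich", "gc_rich")
start_p = {"at_rich": 0.5, "gc_rich": 0.5}
trans_p = {
    "at_rich": {"at_rich": 0.9, "gc_rich": 0.1},
    "gc_rich": {"at_rich": 0.1, "gc_rich": 0.9},
}
emit_p = {
    "at_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "gc_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
```

With these made-up parameters, forward_likelihood("A", states, start_p, trans_p, emit_p) is 0.25, and "AAAA" scores higher than "ACAC" because the model prefers runs of similar composition. The output is a likelihood under a generative model, not a similarity score, which is exactly the shift the HMM era introduced.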


From Sequences to Vectors: The Statistical Learning Era

The late 1990s and early 2000s saw biology adopt a new family of computational tools, this time from the rapidly developing field of statistical machine learning. The shift was not just technical. It involved a different framing of biological problems.

The previous generation of methods had treated biological objects as structured: sequences had a directional order, alignments had a defined structure, HMMs had a state topology. Statistical learning methods, by contrast, typically operate on vector representations. Whatever the original biological structure, the data is converted into a feature vector, and the algorithm operates on the vector space. This was a significant simplification and, simultaneously, a significant loss of information.

Support Vector Machines and the Kernel Trick

Support vector machines, introduced by Vapnik and colleagues and adapted to biological problems through the late 1990s, became the dominant classifier in computational biology for nearly a decade. The reason was practical: with appropriate kernel choices, SVMs could handle the high-dimensional, low-sample-size data that characterizes most biological problems. Gene expression classification, protein function prediction, and splice site detection all became standard SVM applications.

The kernel trick, the ability to operate in a high-dimensional feature space without explicitly computing the mapping, was particularly important for biology. String kernels, spectrum kernels, and gappy pair kernels allowed sequences to be compared in ways that captured sequence-level patterns without requiring explicit alignment. This was the first generation of methods that could, in principle, learn relevant features from biological data rather than requiring those features to be hand-engineered.
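As one concrete case, the spectrum kernel compares two sequences by the k-mers they share. The sketch below is a direct, unoptimized version with a hypothetical function name; practical implementations use suffix structures to make it fast, and in an SVM workflow the resulting values would fill the kernel matrix.

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """k-spectrum kernel: inner product of the two sequences' k-mer count vectors."""
    counts_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    counts_y = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    # Only k-mers present in both sequences contribute to the sum.
    return sum(count * counts_y[kmer] for kmer, count in counts_x.items())
```

spectrum_kernel("ACGT", "ACGT", k=2) is 3, and spectrum_kernel("ACGT", "TTTT", k=2) is 0. The feature space has one dimension per possible k-mer, but the function never builds that vector, which is the kernel trick in miniature, and no alignment is computed at any point.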

Random Forests and Gradient Boosting: The Tabular Workhorses

For tabular biological data, particularly the kind of data that comes out of clinical studies, GWAS analyses, and high-throughput screening, two ensemble methods came to dominate. Random forests, introduced by Breiman in 2001, build many decision trees on bootstrap samples and average their predictions. Gradient boosting, formalized by Friedman in the same year, builds trees sequentially, each one correcting the errors of its predecessors. XGBoost, introduced by Chen and Guestrin in 2016, made gradient boosting practical at scale through algorithmic optimizations and engineering choices, and it became the de facto standard for tabular machine learning across many fields, including biology.
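"Each one correcting the errors of its predecessors" has a simple reading under squared-error loss: every new tree is fit to the current residuals. The sketch below assumes one-dimensional inputs with at least two distinct values and uses depth-one trees (stumps). The function names and defaults are illustrative, and none of XGBoost's regularization or second-order machinery is included.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of 1-D inputs under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:                 # candidate thresholds
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boosted stumps for squared error: each round fits the current residuals."""
    base = sum(ys) / len(ys)                       # start from the mean prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)
```

On a step-shaped toy target (xs = 1..8, ys = 0 for the first four points and 1 for the rest), the boosted model moves from the constant prediction 0.5 to within a few hundredths of the true values after 50 rounds. The small learning rate is the design choice that matters: each stump contributes only a fraction of its correction, trading speed for stability.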

The dominance of these tree-based methods in biological applications is worth dwelling on. They remain, even in 2026, the strongest baseline for many biological prediction tasks. The reasons are revealing. Tree-based methods handle mixed feature types (continuous, categorical, ordinal) natively. They are insensitive to monotonic rescaling of features and tolerant of outliers in the inputs, and implementations such as XGBoost handle missing values natively. They produce interpretable feature importance estimates that can be sanity-checked against biological knowledge. They do not require massive datasets to perform well. For a clinical or biological practitioner trying to extract usable signal from a few hundred or few thousand patient records, XGBoost is often a more useful tool than a deep neural network.

What tree-based methods do not capture well is structure. They treat features as exchangeable: the order of features in the input does not matter, the relationships between features are learned only through interactions in the tree splits, and any spatial, temporal, or relational structure in the data must be encoded in the feature engineering. For a tabular dataset, this is not a limitation. For sequence data, image data, or network data, it discards the most informative property of the input.

The persistence of XGBoost as a benchmark in biological prediction is not an embarrassment for deep learning. It is a reminder that for many biological problems, the structure of the data is tabular, and the architectural priors of deep learning provide no advantage over methods explicitly designed for that case.


The Deep Learning Inflection: Architectures That Encode Structure

The deep learning revolution in biology, which arrived later and more unevenly than in computer vision or natural language processing, is best understood not as a generic method change but as a series of architectural decisions, each of which made specific assumptions about biological structure.

Convolutional Neural Networks and the Sequence Locality Prior

The first widely successful deep learning architecture in biology was the convolutional neural network, applied to genomic sequences. DeepBind and DeepSEA, both published in 2015, demonstrated that CNNs trained on large genomic datasets could predict transcription factor binding and chromatin accessibility from sequence alone, with accuracy exceeding the previous generation of motif-based methods.

The architectural prior of the CNN is translation invariance: the same pattern detector should fire wherever in the sequence the relevant pattern appears. (Strictly, the convolution itself is translation-equivariant, and pooling over positions is what makes the final output approximately invariant.) For transcription factor binding sites, which can occur anywhere along a promoter region, this is a biologically appropriate assumption. For other biological problems, it is less clearly correct. Position matters in many biological contexts, and pure CNNs handle absolute position poorly.
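A single convolutional filter scanning a DNA string shows the prior at work. In this sketch the filter weights are set by hand to reward the motif "TATA", purely as an assumption for illustration. In a trained CNN the weights are learned, and the per-position dictionaries stand in for a dot product with a one-hot encoding.

```python
def scan(sequence, motif_filter):
    """Slide one filter along a DNA string; return the score at each offset."""
    width = len(motif_filter)
    return [
        sum(motif_filter[k].get(sequence[i + k], 0.0) for k in range(width))
        for i in range(len(sequence) - width + 1)
    ]

# Hypothetical hand-set filter; each position rewards one base of "TATA".
tata_filter = [{"T": 1.0}, {"A": 1.0}, {"T": 1.0}, {"A": 1.0}]
```

scan("GGTATAGG", tata_filter) peaks at offset 2 with a score of 4.0. Prepending a base shifts the whole score profile by one position without changing any value, and taking the maximum over positions gives the same answer wherever the motif sits. That combination is the sense in which "the same pattern detector fires anywhere", and it is also why absolute position is invisible to the pooled output.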

The success of CNNs in regulatory genomics revealed something underappreciated: the field had been hand-crafting position weight matrices and motif models for decades, and a generic architecture with the right inductive bias and enough data could learn equivalent or better representations from raw sequence. This was, in retrospect, a warning shot for the entire feature-engineering tradition in bioinformatics.

Recurrent Networks and the Long-Range Dependency Problem

Recurrent neural networks, particularly LSTMs and GRUs, were applied to biological sequences with mixed results. The architectural prior of an RNN is that information flows along the sequence in order, with each position's representation depending on the history of preceding positions. For some biological problems (notably protein secondary structure prediction), this captured useful structure. For others, the limitations of RNNs in handling long-range dependencies became a serious bottleneck.

Biology is full of long-range dependencies. A regulatory element 50 kilobases from a gene can control its expression. Two amino acid residues separated by hundreds of positions in primary sequence can be in physical contact in three-dimensional structure. RNNs in practice struggled to learn these dependencies despite their architectural ability to represent them, because the relevant gradient signal degrades over long sequences during training.

Transformers and the Attention Revolution

The introduction of the transformer architecture by Vaswani and colleagues in 2017 was originally motivated by problems in machine translation. Its application to biology was rapid and transformative. The key architectural feature, self-attention, allows every position in a sequence to directly attend to every other position, with attention weights learned from data. Long-range dependencies, the Achilles' heel of RNNs, become a first-class capability.
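The mechanism fits in a short sketch. To keep it self-contained, this version assumes a single head and identity projections (queries, keys, and values are all the raw input vectors). A real transformer learns separate projection matrices, uses many heads, and adds positional information.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with identity projections (Q = K = V = x)."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # One score per position in the sequence, however far away it is.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, weights

# Toy input: positions 0 and 3 carry the same vector, positions 1 and 2 another.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```

Calling self_attention(tokens) gives position 0 the same attention weight on position 3 as on itself, and a smaller weight on its immediate neighbor at position 1. Distance along the sequence plays no role in the score, which is why long-range dependencies stop being a special case. The cost is that every pair of positions is scored, so work grows quadratically with sequence length.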

For protein sequences, this was a critical match. The work that culminated in AlphaFold2 in 2021, together with the ESM family of protein language models from 2019 onward, demonstrated that transformer architectures trained on the evolutionary record contained in protein sequence databases could learn representations that implicitly capture much of what determines protein structure. The two lines of work differ in supervision: AlphaFold2 was trained on experimentally determined structures, while the ESM language models were given no explicit structural information during pretraining. They learned structural relationships from co-evolution patterns that become visible only when attention can span the full sequence.

The deeper implication was that the transformer architecture matched something fundamental about biological sequences. Biological function depends on relationships between distant elements of a sequence in ways that no architecture before transformers had captured well. Whether this is an architectural coincidence or a reflection of how biological information is actually organized is a question worth taking seriously.


Graph Neural Networks: Biology Is Not a Sequence

The most consequential recent architectural development in computational biology, in my view, is the rise of graph neural networks. Their importance is not just technical. It is conceptual.

A graph neural network operates on data structured as a graph, with nodes representing entities and edges representing relationships. Information is propagated between connected nodes through learned message-passing functions, allowing the network to compute representations that depend on the local graph neighborhood of each node. The architectural prior is that the relationships between entities matter as much as the entities themselves.
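One round of message passing can be sketched without any framework. The version below assumes an undirected graph, mean aggregation over neighbors, and a fixed 50/50 mix in place of the learned update function. All three are simplifications chosen for illustration, and the function name is hypothetical.

```python
def message_passing_step(features, edges):
    """One round of mean-aggregation message passing on an undirected graph."""
    neighbors = {node: [] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    updated = {}
    for node, h in features.items():
        messages = [features[n] for n in neighbors[node]]
        if messages:
            # Average the neighbors' feature vectors, column by column.
            aggregated = [sum(col) / len(messages) for col in zip(*messages)]
        else:
            aggregated = [0.0] * len(h)
        # A fixed 50/50 mix stands in for a learned update function.
        updated[node] = [0.5 * a + 0.5 * b for a, b in zip(h, aggregated)]
    return updated
```

On a three-node chain A-B-C where only A starts with a nonzero feature, C is still zero after one round and becomes 0.125 after two. Each round extends every node's view by one hop, so the depth of the network sets how far information can travel across the graph.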

For biology, this is not a stretched analogy. It is the most direct architectural fit available. Biology is a network. Proteins interact with proteins. Metabolites participate in reactions. Genes regulate genes. Tissues are composed of cells in spatial relationships. The discrete-token, sequence-ordered representation that worked for CNNs and transformers is, for many biological problems, a flattening of an inherently relational structure.

Where Graph Architectures Fit

Drug discovery has been an early adopter of graph neural networks because molecules are, structurally, small graphs of atoms connected by bonds. Models like the message passing neural network framework developed by Gilmer and colleagues, and its successors in the chemical informatics community, have demonstrated that graph-based representations of molecules outperform fingerprint-based representations on many property prediction tasks. The graph captures the connectivity that determines chemistry. Fingerprints discard it.

Protein-protein interaction prediction, drug-target interaction modeling, and metabolic network analysis are all natural applications of graph neural networks. More recently, single-cell analysis has begun to incorporate graph methods: cells are represented as nodes in a similarity graph, and information propagation through this graph is used to denoise expression measurements, infer trajectories, and identify rare cell states. The neighborhood structure of the graph, in these applications, is itself a model of the underlying biology.

What Graph Architectures Still Cannot Do Well

Honesty requires acknowledging where graph neural networks struggle. They are sensitive to the construction of the underlying graph: changing how edges are defined can substantially change model behavior. Many graph architectures suffer from over-smoothing, where representations of distant nodes become indistinguishable as the depth of message passing increases. They are computationally expensive on large graphs and do not always benefit from the same kind of large-scale pretraining that has driven transformer success.

Most importantly, graph neural networks inherit the limitations of the graph used to instantiate them. A protein-protein interaction network built from low-quality experimental data will produce models that learn the experimental biases as much as the biology. The architecture is only as good as the structural representation it operates on, and constructing that representation is not automated. It is a research problem in its own right.


How Each Architecture Encodes a Different Biology

It is worth stepping back to make explicit what has been implicit throughout this history. Each architectural family encodes a different worldview about what biology is. The choice of architecture is, in this sense, a scientific commitment, not just an engineering preference.

Architecture: Implicit biological worldview

Dynamic programming: Biology is a string. Evolution is a sequence of edits. Similarity is a quantifiable cost.
HMMs: Biology is a stochastic generative process with hidden states. Observation reveals the process imperfectly.
SVMs and kernels: Biology can be embedded in a feature space where similar things are close. The choice of kernel is the choice of biological similarity.
Tree ensembles: Biology is tabular. Predictive signal lives in feature interactions discoverable by recursive partitioning.
CNNs: Biology has translation-invariant local patterns. The same motif means the same thing wherever it appears.
Transformers: Biology has long-range dependencies. Every position can in principle attend to every other position.
Graph neural networks: Biology is relational. The connections between entities are as informative as the entities themselves.

None of these worldviews is wrong. Each is right about some part of biology and silent or wrong about others. The history of progress in computational biology has been, in significant part, the history of recognizing which architectural prior matches which biological problem and developing the methodological discipline to choose appropriately rather than defaulting to whatever architecture is currently fashionable.


Where Stochasticity Re-enters: The Modern Synthesis

The architectures discussed so far are all, by default, deterministic. A given input produces a given output. The probabilistic generative framing that HMMs introduced did not propagate into the subsequent generations of architectures except in subtle ways. CNNs, transformers, and graph neural networks output deterministic predictions, with optional uncertainty estimates added afterward through techniques like dropout-based Bayesian approximation, deep ensembles, or conformal prediction.

This is, in my view, the most important place where the history of biological algorithms intersects with the long history of stochastic thinking in biology, and it is the frontier where the next generation of methods will be defined.

Generative Models as Stochastic Architectures

Variational autoencoders, normalizing flows, diffusion models, and the score-based generative models that underlie much of modern deep generative modeling all share a common feature: they are explicitly probabilistic. They model a distribution over the data, not just a mapping from input to output. For biological applications where the outcome of interest is itself a distribution (single-cell expression states, protein conformational ensembles, metabolic flux distributions), this matches the underlying biology in a way that deterministic architectures do not.

Variational autoencoders applied to single-cell data, like scVI, do not just produce point estimates of cellular state. They model the full distribution of expression patterns conditional on latent variables, allowing principled uncertainty quantification, missing data imputation, and integration of datasets from different experimental conditions. Diffusion models applied to protein structure generation and design, like RFdiffusion (built on RoseTTAFold) and similar approaches, sample many plausible structures rather than returning a single optimal prediction, a distributional view of structure that single-structure predictors do not offer.

Stochasticity in the Architecture Itself

The deeper integration of stochastic thinking with deep learning is the move toward architectures that have stochasticity built in, not as a regularization technique or an uncertainty estimation method, but as a structural feature. Bayesian neural networks treat the weights themselves as distributions, with predictions being distributions over outputs marginalized over weight uncertainty. Gaussian process regression, deep kernel learning, and various forms of energy-based models all share the property that the model output is intrinsically a distribution, not a point estimate.

For biology, where the underlying systems are stochastic at the molecular level, this is more than a technical preference. It is an architectural alignment between the model's mathematical structure and the biology's physical structure. A deterministic neural network predicting gene expression is producing a point estimate of a quantity that is, in the underlying biology, a probability distribution. The model is computing something the biology does not compute: a single answer where the biology produces a distribution.

The next major architectural shift in biological AI is not a new network type. It is the integration of stochastic generative modeling with the relational and sequence-aware architectures we already have, so that the model output is structurally aligned with the distributional nature of biological systems.

Concrete Implications for Practice

The combination of stochastic and architectural thinking has practical consequences that are starting to play out across computational biology subfields.

In strain engineering and metabolic modeling, the integration of constraint-based methods with deep generative models is beginning to allow predictions over distributions of feasible flux states rather than single optimal solutions, an architectural move that aligns the model's output with the biological reality of metabolic heterogeneity within a cell population.

In single-cell analysis, the dominant tools are increasingly built on probabilistic graphical models that combine the relational structure of cell-to-cell similarity graphs with stochastic models of transcriptional dynamics, producing analyses that respect both the network structure of cellular relationships and the stochastic nature of gene expression.

In protein design, generative models that produce ensembles of candidate sequences with calibrated probability scores are replacing optimization-based approaches that produce single best designs, reflecting the recognition that protein function is a property of an ensemble of conformations and that designing for an ensemble is different from designing for a single structure.


Philosophy: What Are We Actually Doing?

Stepping back from the technical history, it is worth asking what we are doing, in the most basic sense, when we apply these algorithms to biological problems.

One way to read this history is as progressive sophistication: we kept building better tools to answer the same questions. That reading is partly true and mostly misleading. The deeper truth is that each generation of tools made certain biological questions answerable that had previously been unanswerable, and in doing so, redefined what counted as a worthwhile biological question.

Before BLAST, the question "what is the closest homolog of this protein in this large database" was practically unanswerable for most researchers. The tool changed what biologists could ask. After AlphaFold2, the question "what is the structure of this protein" went, for many proteins, from a multi-year experimental project to an inference call that finishes in minutes. The tool did not just answer an existing question better. It collapsed the time and effort cost of the question to nearly zero, which means the next questions can be larger and more interconnected.

This is the deepest pattern in the history of computational biology: tools do not just answer questions. They expand the space of questions that are economically and intellectually feasible to ask. The architectures we build are not neutral instruments for accessing pre-existing biological truths. They shape what biological truths are even thinkable.

The Lurking Question: Are We Modeling Biology or Modeling Datasets?

The most uncomfortable question to sit with, after surveying this history, is whether the models we build are modeling the biology we care about or are instead modeling the artifacts and biases of the datasets we have access to. A model trained on protein sequences is implicitly trained on the evolutionary, taxonomic, and experimental biases of the sequence databases. A model trained on single-cell expression data is trained on the biases of the cell types, tissues, and conditions that have been profiled.

For some questions, the dataset bias is small enough relative to the biological signal that the distinction does not matter much in practice. For others, the dataset bias is the dominant signal, and the model is mostly learning the structure of how data was collected, with biology as a secondary signal layered on top. Telling the difference is hard, and the field has historically been bad at it. Reproducibility crises in computational biology, models that do not generalize to new contexts, and predictions that look biological but track collection methodology are all symptoms of the same underlying issue.

Stochastic and generative architectures offer a partial solution here. By forcing models to output distributions rather than point estimates, they make the variance of the prediction explicit. A model that has high uncertainty on out-of-distribution inputs is admitting something honest about what it has learned. A deterministic model that produces confident, wrong predictions is silently passing dataset bias off as biological knowledge. The architectural choice has epistemic consequences.
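A small stand-in shows what "admitting uncertainty" looks like numerically. The sketch below uses a bootstrap ensemble of straight-line fits in place of the deep ensembles mentioned earlier. The data, the function names, and the choice of a linear model are all assumptions made for illustration, and nothing here guarantees that a deep network's uncertainty behaves as well.

```python
import random
import statistics

def fit_line(points):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b

def bootstrap_ensemble(points, n_models=200, seed=0):
    """Fit one line per bootstrap resample of the training points."""
    rng = random.Random(seed)
    return [fit_line([rng.choice(points) for _ in points]) for _ in range(n_models)]

def predict(ensemble, x):
    """Mean and spread of the ensemble's predictions at x."""
    preds = [a + b * x for a, b in ensemble]
    return statistics.mean(preds), statistics.stdev(preds)

# Made-up training data: y is roughly 2x, observed only for x between 0 and 9.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.0, 0.2]
points = [(float(x), 2.0 * x + e) for x, e in zip(range(10), noise)]
ensemble = bootstrap_ensemble(points)
```

Comparing predict(ensemble, 4.5) with predict(ensemble, 100.0), the spread at x = 100 is much larger than the spread inside the training range, because small disagreements about the slope are magnified far from the data. A single fitted line would return an equally confident-looking number at both points, which is the failure mode the paragraph above describes.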


What Comes Next

Six predictions about the next decade of computational architecture in biology

  • Probabilistic by default. Within five years, deterministic point-estimate predictions will be the exception rather than the norm for serious biological prediction tasks. Calibrated uncertainty will be a standard output, not an optional addition. The field will look back at deterministic predictions in high-stakes biology the same way it now looks back at p-value-only statistics.
  • Hybrid architectures. The clean lineage of architecture-per-problem will give way to hybrid models that combine the inductive biases of multiple architectures: graph backbones with attention readouts, sequence transformers with structural graph constraints, generative models conditioned on mechanistic priors. The era of choosing one architecture is ending.
  • Mechanistic priors return. The pendulum that swung hard toward learning everything from data will swing partway back. Models that incorporate mechanistic biological priors (mass balance, thermodynamic constraints, conservation laws, evolutionary structure) will outperform pure black-box models in the regimes where data is limited or out-of-distribution generalization matters, which is most of biology.
  • Foundation models become substrate, not products. The current wave of biological foundation models will be remembered the way we now remember BLAST and HMMER: as utilities. The next generation of important tools will be built on top of them, using them as feature extractors and starting points rather than as end products.
  • Interpretability becomes architectural, not post-hoc. The current approach of training a black-box model and explaining it afterward with feature attribution methods will be displaced by architectures that are interpretable by construction. Concept bottleneck models, neurosymbolic approaches, and structured latent variable models will move from research curiosities to practical tools.
  • The bottleneck shifts from compute to data quality. The improvements over the next decade will be limited less by architectural innovation and more by the quality, diversity, and structure of biological training data. Curated benchmarks, standardized data collection protocols, and infrastructure for high-quality biological data generation will matter more than the next architecture.

Closing Thought

The architectures we choose are bets about what matters in biology. They are bets that biology is sequential, or relational, or stochastic, or all of these in different ways at different scales. The history of computational biology, traced this way, is a history of those bets being placed, tested, and revised.

What strikes me, looking at this trajectory, is that the architectures that have lasted are the ones that respected biology's structure rather than imposing computational convenience on it. Dynamic programming respected the sequential structure of biological strings. HMMs respected their stochastic generative origin. Graph neural networks respected the relational structure of biological systems. Transformers, perhaps surprisingly, respected the long-range dependency structure that biological sequences carry. The architectures that did not last, or that occupied narrow niches, were typically ones that flattened biological structure into computational tractability.

The next chapter, the integration of architectural sophistication with explicit stochastic modeling, is not a departure from this pattern. It is its continuation. Biology is structured. Biology is relational. Biology is also, at its deepest level, stochastic. A complete computational biology requires architectures that can hold all three at once.

We are not there yet. But the lineage is clear, and so is the direction.


Key References

[01] Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology. 1970;48(3):443-453. The founding paper of computational sequence analysis, introducing dynamic programming to biology.
[02] Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology. 1981;147(1):195-197. The local alignment algorithm that complemented Needleman-Wunsch and remains foundational.
[03] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215(3):403-410. The BLAST paper, marking the explicit acceptance of approximation in mainstream bioinformatics.
[04] Krogh A, Brown M, Mian IS, Sjolander K, Haussler D. Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology. 1994;235(5):1501-1531. The paper that introduced profile HMMs and made probabilistic generative modeling a standard tool in computational biology.
[05] Breiman L. Random forests. Machine Learning. 2001;45(1):5-32. The paper introducing random forests, foundational for tabular biological prediction.
[06] Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of Statistics. 2001;29(5):1189-1232. The formalization of gradient boosting that underlies XGBoost, LightGBM, and most modern tabular ML.
[07] Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016:785-794. The XGBoost paper, which made gradient boosting practical at scale and established it as the dominant tabular ML method, including in biology.
[08] Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology. 2015;33(8):831-838. The DeepBind paper, an early demonstration that CNNs trained on raw sequence could outperform hand-crafted motif models.
[09] Vaswani A, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30. The transformer paper, originally for translation, that became the architectural backbone of modern protein language models and many biological foundation models.
[10] Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583-589. The AlphaFold2 paper, the most consequential demonstration of what attention-based architectures can do when applied to a problem with sufficient training data and the right inductive biases.
[11] Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Neural message passing for quantum chemistry. International Conference on Machine Learning. 2017. The unifying framework for graph neural networks applied to molecular property prediction, foundational for graph-based drug discovery.
[12] Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nature Methods. 2018;15(12):1053-1058. The scVI paper, demonstrating deep generative models for single-cell data and establishing probabilistic deep learning as a standard tool in single-cell genomics.
[13] Lin Z, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123-1130. The ESM-2 paper, establishing protein language models as substrate for downstream biological prediction tasks.
[14] Watson JL, et al. De novo design of protein structure and function with RFdiffusion. Nature. 2023;620:1089-1100. A landmark application of diffusion-based generative modeling to protein design, illustrating the integration of stochastic generative architectures into structural biology.
[15] Bronstein MM, Bruna J, Cohen T, Velickovic P. Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv:2104.13478. 2021. A unifying treatment of how architectural priors (CNNs, GNNs, transformers, equivariant networks) all instantiate different geometric assumptions, directly relevant to thinking about which architectures match which biological structures.

Blaise Manga Enuh, PhD

Computational biologist and bioinformatics engineer at the Great Lakes Bioenergy Research Center. I build ML models, bioinformatics pipelines, and scientific software tools at the intersection of microbial biology and machine learning.
