There is a moment in every computational biologist's career when you realize that the script you wrote for your own analysis needs to become something other people can use. This moment is where most scientific software projects begin. It is also where most of them fail.
The failure is rarely about the science. The biology is sound, the algorithm works, the results are correct. The failure is about everything else: the wet-lab scientist who cannot install your dependencies, the collaborator who gets a different result on a different operating system, the reviewer who asks to see your code and finds a 2,000-line Jupyter notebook with no comments and six global variables named temp.
I have lived through this failure. I have also, slowly and painfully, learned how to avoid it. This post is my attempt to lay out what I wish someone had told me when I started: the mathematical foundations you actually need, the algorithms that appear again and again in computational biology, the software engineering patterns that separate fragile scripts from robust tools, and the philosophy of design that determines whether your software lives or dies.
Part I: The Mathematics You Actually Need
Most computational biologists learn mathematics in fragments, picking up pieces as they need them for specific analyses. This works until it does not. The moment you start building software tools rather than running individual analyses, you need a coherent mathematical foundation. Not because you will derive proofs on a whiteboard, but because mathematical fluency determines whether you can evaluate whether an algorithm is appropriate for your data, understand what its assumptions are, and anticipate where it will break.
Linear Algebra: The Language of High-Dimensional Biology
Every omics dataset is a matrix. Gene expression profiles, protein abundances, metagenomic read counts, metabolic flux distributions: they are all matrices where rows represent biological entities and columns represent samples or conditions. If you cannot think in terms of matrices, you cannot build tools for this data.
The operations that matter most: matrix multiplication (the backbone of regression, PCA, and neural networks), eigendecomposition (what PCA actually does under the hood), singular value decomposition (how you reduce dimensionality while preserving variance), and matrix inversion (central to solving systems of linear equations in metabolic modeling). When you build a tool that performs dimensionality reduction, you are not calling a library function in a black box. You should understand that PCA finds the directions of maximum variance by computing the eigenvectors of the covariance matrix, and that this works well when relationships are linear and breaks when they are not.
Practical fluency means: you can look at a 20,000-gene by 500-sample expression matrix and immediately think about its rank, its condition number, and whether the operations you plan to perform on it will be numerically stable.
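To make the eigendecomposition claim concrete, here is a minimal PCA sketch in numpy; the function name and two-component default are illustrative, and in production you would use the SVD route (or sklearn.decomposition.PCA) for numerical stability:

```python
import numpy as np

def pca_via_eigendecomposition(X, n_components=2):
    """Project samples onto the top principal components.

    X: (n_samples, n_features) matrix. Illustrative sketch, not a
    production implementation (which would use the SVD instead).
    """
    X_centered = X - X.mean(axis=0)
    # PCA is literally this: eigenvectors of the covariance matrix
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh exploits symmetry
    # eigh returns eigenvalues in ascending order; take the largest first
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return X_centered @ components, eigvals[order]
```

The directions of maximum variance fall out of the eigendecomposition directly, which is exactly why PCA assumes linear structure: an eigenvector is a linear combination of features.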
Probability and Statistics: Quantifying Uncertainty
Biology is noisy. Every measurement you take has error, every sample you collect has batch effects, every experiment you run has confounders. The mathematical framework for reasoning about uncertainty is probability and statistics, and you need more of it than most computational biology curricula provide.
The concepts that come up constantly: probability distributions (knowing when your data is normally distributed, Poisson-distributed, or negative binomial matters for choosing the right test), hypothesis testing (understanding p-values, multiple testing correction, and why Bonferroni is conservative while Benjamini-Hochberg controls the false discovery rate), Bayesian inference (which lets you incorporate prior knowledge and is increasingly central to modern genomic analysis), and maximum likelihood estimation (the mathematical principle behind most model fitting you will ever do).
When you build a tool that reports statistical significance, you are making a mathematical claim. If you do not understand the assumptions behind that claim, your tool will produce wrong answers on edge cases and neither you nor your users will know it.
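To make the contrast between the two corrections concrete, here is a minimal Benjamini-Hochberg sketch; in practice you would call statsmodels.stats.multitest.multipletests, and the function below is purely illustrative:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected under BH-FDR control.

    Illustrative sketch; use statsmodels.stats.multitest.multipletests
    in real code.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # BH rule: find the largest k with p_(k) <= (k / m) * alpha
    thresholds = (np.arange(1, m + 1) / m) * alpha
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])
        reject[order[: k + 1]] = True  # reject everything up to rank k
    return reject
```

On p-values [0.01, 0.02, 0.03, 0.5] at alpha = 0.05, BH rejects the first three, while Bonferroni (threshold 0.05/4 = 0.0125) rejects only the first. That is the conservatism the text describes.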
Calculus and Optimization: Finding the Best Answer
Every time you fit a model to data, you are solving an optimization problem. Least squares regression minimizes the sum of squared residuals. Maximum likelihood estimation maximizes a likelihood function. Neural network training minimizes a loss function using gradient descent. These are all calculus problems.
The concepts you need: derivatives and gradients (how a function changes as its inputs change), the chain rule (the mathematical foundation of backpropagation in neural networks), convexity (whether your optimization problem has a single global minimum or multiple local minima), and numerical methods for optimization (gradient descent, Newton's method, the Levenberg-Marquardt algorithm for nonlinear least squares).
In my kinetic modeling tool, the core computational engine is nonlinear least squares optimization: fitting growth models (Monod, logistic, Gompertz) to experimental time-series data by minimizing the difference between predicted and observed values. Understanding the mathematics of that optimization, not just calling scipy.optimize.curve_fit, is what allowed me to handle the edge cases: initial parameter estimation, convergence failures, and model selection using information criteria like AIC and BIC.
Differential Equations: Modeling Dynamic Systems
Biology is not static. Cells grow, populations change, metabolites accumulate and deplete, gene expression fluctuates over time. Ordinary differential equations (ODEs) are the mathematical language for describing how systems change. Michaelis-Menten kinetics, Lotka-Volterra competition models, SIR epidemiological models, and flux balance analysis in metabolic modeling all rest on systems of differential equations.
If you are building tools for dynamic biological systems, you need to understand: how to formulate a system of ODEs from a biological mechanism, numerical integration methods (Euler, Runge-Kutta, stiff solvers for metabolic systems), sensitivity analysis (how model outputs change as parameters vary), and parameter estimation (fitting ODE models to time-series data, which is a nonlinear optimization problem wrapped around a numerical integration problem).
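As a minimal illustration of numerical integration, here is a fixed-step fourth-order Runge-Kutta integrator applied to logistic growth. This is a sketch: real tools would use an adaptive or stiff solver such as scipy.integrate.solve_ivp rather than hand-rolled RK4.

```python
import math

def rk4_integrate(f, y0, t0, t1, n_steps):
    """Fixed-step RK4; adequate for non-stiff systems like this one.

    For stiff metabolic models, reach for an implicit solver instead
    (e.g. scipy.integrate.solve_ivp with method="BDF" or "LSODA").
    """
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y += (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

# Mechanism to ODE: logistic growth, dN/dt = r * N * (1 - N / K)
r, K, N0 = 0.5, 1.0, 0.01
N_final = rk4_integrate(lambda t, N: r * N * (1 - N / K), N0, 0.0, 20.0, 2000)

# The logistic ODE has a closed-form solution we can check against
N_exact = K / (1 + ((K - N0) / N0) * math.exp(-r * 20.0))
```

The closed-form check is the point: when an analytic solution exists for a special case, it is the gold standard for validating your integrator.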
Graph Theory: Biology as Networks
Metabolic networks, protein interaction networks, gene regulatory networks, phylogenetic trees: biology is full of graph structures. Graph theory provides the vocabulary and algorithms for working with them. Shortest paths, connected components, network centrality measures, graph clustering, and tree traversal algorithms appear constantly in bioinformatics.
When you build a pathway analysis tool, you are operating on a directed graph where nodes are metabolites and edges are enzymatic reactions. When you build a phylogenetic analysis pipeline, you are constructing and manipulating tree data structures. Understanding graph algorithms is not optional.
The difference between a computational biologist who uses tools and one who builds them is not the amount of mathematics they know. It is whether they understand the mathematics well enough to know when it applies and when it does not.
Part II: Algorithms That Appear Again and Again
Computational biology has a canon of algorithms that appear across domains, from sequence analysis to metabolic modeling to single-cell genomics. If you are building scientific software, you will encounter these repeatedly, and understanding them at a level deeper than "I called the function" is what separates tools that work from tools that work correctly.
Sequence Alignment: Dynamic Programming in Action
The Smith-Waterman (local) and Needleman-Wunsch (global) alignment algorithms are the foundational dynamic programming problems in bioinformatics. They work by breaking the alignment problem into overlapping subproblems and building up the optimal solution from smaller solutions. Every sequence search tool you use, from BLAST to minimap2, is built on variants of this idea. Understanding dynamic programming means understanding that alignment is not a heuristic. It is an exact optimization with a scoring matrix, gap penalties, and a traceback that produces the mathematically optimal alignment given those parameters.
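A score-only Needleman-Wunsch sketch makes the recurrence explicit; the scoring values are arbitrary, and a full implementation would also record a traceback matrix to recover the alignment itself:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score via dynamic programming (score only)."""
    n, m = len(a), len(b)
    # dp[i][j] = optimal score for aligning a[:i] against b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                           dp[i - 1][j] + gap,    # gap in b
                           dp[i][j - 1] + gap)    # gap in a
    return dp[n][m]
```

Each cell depends only on three previously solved subproblems, which is the overlapping-subproblem structure the text describes, and the final cell is the exact optimum for the given scoring scheme.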
Hidden Markov Models: Sequence Annotation and Beyond
HMMs are the workhorse of biological sequence annotation. Gene finding (GeneMark, Glimmer), protein domain identification (HMMER, Pfam), and multiple sequence alignment (profile HMMs) all use hidden Markov models. The core idea is elegant: you model a biological sequence as being generated by a series of hidden states (coding region, intron, intergenic, etc.), each of which emits observed symbols (nucleotides, amino acids) with specific probabilities. The Viterbi algorithm finds the most likely sequence of hidden states. The forward-backward algorithm computes the probability of being in each state at each position. The Baum-Welch algorithm trains the model parameters from data.
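The Viterbi recursion fits in a few lines. The sketch below uses a toy two-state GC-content model; all the probabilities are invented for illustration, and real annotation HMMs have far richer state spaces:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path, computed in log space to avoid underflow."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for arriving in s at position t
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Traceback from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy model: an AT-rich state and a GC-rich state, sticky transitions
STATES = ("AT", "GC")
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {"AT": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "GC": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}
```

The same table-filling skeleton, with max replaced by sum, gives the forward algorithm; that structural kinship is why the three HMM algorithms are usually taught together.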
Expectation-Maximization: Handling Missing Data
The EM algorithm appears whenever you have incomplete data or latent variables, which in biology is nearly always. Clustering algorithms like Gaussian mixture models use EM. Phylogenetic methods use EM. Haplotype phasing uses EM. Imputation of missing genotypes uses EM. The algorithm alternates between an E-step (estimating the expected values of the missing data given current parameters) and an M-step (updating the parameters to maximize the likelihood given the completed data). Understanding EM means understanding why your clustering algorithm sometimes converges to different solutions depending on initialization, and why running it multiple times with different starting points is not a hack but a mathematically justified strategy.
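The E-step/M-step alternation can be sketched for a two-component 1D Gaussian mixture. Deterministic percentile initialization keeps this sketch reproducible; real implementations use random restarts, which is exactly the initialization sensitivity described above:

```python
import numpy as np

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1D Gaussian mixture; returns sorted means.

    Illustrative sketch. Percentile initialization is used here for
    determinism; production code (e.g. sklearn GaussianMixture) restarts
    from multiple random initializations.
    """
    mu = np.percentile(x, [25, 75])
    sigma = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.stack([
            w[k] / (sigma[k] * np.sqrt(2 * np.pi))
            * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
            for k in range(2)
        ])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate parameters from the soft assignments
        Nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / Nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
        w = Nk / len(x)
    return np.sort(mu)
```

Each iteration is guaranteed not to decrease the likelihood, but only to a local optimum, which is why the multiple-restart strategy is principled rather than a hack.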
Graph Algorithms for Biological Networks
De Bruijn graph assembly (the basis of most short-read genome assemblers), flux balance analysis (linear programming on metabolic networks), shortest path algorithms for metabolic pathway analysis, and community detection in protein interaction networks all rely on graph algorithms. If you are building tools that operate on any kind of biological network, you need fluency with breadth-first search, depth-first search, Dijkstra's algorithm, minimum spanning trees, and linear programming.
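As one concrete example, here is Dijkstra's algorithm over a toy reaction graph; the metabolite names and per-step costs are invented purely for illustration:

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source in a weighted directed graph.

    graph: {node: [(neighbor, weight), ...]}. In a pathway search the
    weight might encode a cost per enzymatic step (an illustrative choice).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy pathway: two routes from glucose to pyruvate with different costs
TOY_PATHWAY = {
    "glucose": [("g6p", 1.0)],
    "g6p": [("f6p", 1.0), ("6pg", 2.0)],
    "f6p": [("pyruvate", 3.0)],
    "6pg": [("pyruvate", 1.0)],
}
```

With unit weights this degenerates to breadth-first search, which is often all a fewest-steps pathway query needs.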
Dimensionality Reduction: Making Sense of High-Dimensional Data
PCA, t-SNE, UMAP, and diffusion maps are the primary tools for visualizing and analyzing high-dimensional biological data. Each makes different mathematical assumptions and preserves different properties of the data. PCA preserves global linear structure. t-SNE preserves local neighborhood relationships but distorts global distances. UMAP attempts to preserve both local and global structure through topological methods. Diffusion maps capture the dynamics of a data-generating process. When you build a tool that includes dimensionality reduction, choosing the right method and understanding its limitations is a scientific decision, not a default setting.
Part III: The Philosophy of Scientific Software Design
Before writing a single line of code, there is a more fundamental question: what is this software for, and who is it for?
Most scientific software fails not because the algorithm is wrong, but because the developer never seriously considered the user. I have watched brilliant tools die because they assumed the user had a terminal, a conda environment, and 30 minutes to debug an installation error. The reality is that the biologist who needs your tool the most is usually the one least equipped to install it.
Design for the Person Who Is Not You
When I built my kinetic modeling tool, the first version was a Python script. It worked perfectly for me. It was useless for our wet-lab collaborators. The second version was a Streamlit web application with a point-and-click interface, automated parameter estimation, and an AI-generated summary of the results. The science did not change between versions. The audience changed, and therefore the design changed.
This is the most important lesson in scientific software: the tool's value is determined not by its computational sophistication but by the number of correct decisions it enables. A simple tool that 100 scientists use correctly is more valuable than a powerful tool that 3 scientists use and 2 of them get wrong.
Separate the What from the How
The single most important principle in software architecture is the separation of concerns. Your scientific logic (the algorithm, the model, the statistical test) should be completely independent of your interface (the command line, the web app, the API). This means structuring your code so that the core computation lives in functions or classes that take inputs and return outputs, with no knowledge of how those inputs arrive or how those outputs are displayed.
In practice, this means: your growth curve fitting function should accept a numpy array of time points and a numpy array of OD values and return fitted parameters and goodness-of-fit statistics. It should not know or care whether those arrays came from a CSV upload in a Streamlit app, a command-line argument, or a database query. This separation is what allows you to build a command-line tool, a web application, and a Python library from the same codebase.
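A minimal sketch of this separation, using a linear fit as a dependency-light stand-in for the growth-model fit; FitResult and run_from_csv are hypothetical names, not the tool's actual API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FitResult:
    """Plain result object: knows nothing about CSVs, CLIs, or web apps."""
    slope: float
    intercept: float
    r_squared: float

def fit_linear(t: np.ndarray, y: np.ndarray) -> FitResult:
    """Core scientific logic: arrays in, results out, no I/O anywhere."""
    slope, intercept = np.polyfit(t, y, 1)
    pred = slope * t + intercept
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return FitResult(float(slope), float(intercept), float(1 - ss_res / ss_tot))

def run_from_csv(path: str) -> FitResult:
    """Thin I/O adapter: its only job is getting arrays to the core."""
    data = np.loadtxt(path, delimiter=",", skiprows=1)
    return fit_linear(data[:, 0], data[:, 1])
```

A Streamlit upload handler and a command-line entry point would both be adapters of the same shape as run_from_csv, each a few lines, all calling the same untouched core.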
Abstraction and Stacking Abstractions
Abstraction is the practice of hiding complexity behind a simpler interface. Every function you write is an abstraction: it takes inputs, does something complex internally, and returns outputs. The caller does not need to know how it works, only what it does and what contract it fulfills.
In scientific software, abstractions stack on top of each other. At the lowest level, you have numerical routines: a function that integrates an ODE, a function that computes a matrix decomposition, a function that evaluates a log-likelihood. On top of those, you build model-level abstractions: a GrowthModel class that takes a model type and experimental data, calls the appropriate numerical routine, and returns fitted parameters. On top of that, you build workflow-level abstractions: a function that takes a batch of experimental files, fits all of them, compares models, selects the best fit, and generates a report.
Good scientific software has clean abstraction layers where each level hides the complexity of the level below. Bad scientific software has everything in one layer, where numerical details, business logic, file I/O, and display code are all tangled together in a single function.
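The model-level layer described above might look like the following; this GrowthModel is a hypothetical sketch that only dispatches curve functions, but it shows the shape of the abstraction sitting on top of the numerical routines:

```python
import numpy as np

class GrowthModel:
    """Model-level abstraction (illustrative): maps a model name to its
    curve function and hides that lookup from the workflow layer above."""

    _curves = {
        "logistic": lambda t, K, r, N0: K / (1 + ((K - N0) / N0) * np.exp(-r * t)),
        "gompertz": lambda t, K, r, N0: K * np.exp(np.log(N0 / K) * np.exp(-r * t)),
    }

    def __init__(self, model_type: str):
        if model_type not in self._curves:
            raise ValueError(f"unknown model: {model_type}")
        self.curve = self._curves[model_type]

    def predict(self, t, params):
        """Evaluate the chosen curve; a fuller version would also fit."""
        return self.curve(np.asarray(t, dtype=float), *params)
```

The workflow layer above this never touches model equations; it just asks for "logistic" or "gompertz" by name, which is what lets you add a new model without touching the batch-processing code.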
Part IV: Software Engineering Patterns for Scientific Code
Scientific code exists in a strange middle ground. It is not throwaway scripting (the results matter too much). It is not enterprise software (the development team is usually one to three people). It needs its own set of engineering practices, adapted from industry software engineering but calibrated for the realities of scientific work.
Version Control Is Not Optional
If you are building scientific software and you are not using Git, stop reading this article and go set up a repository. Version control is the foundation of reproducibility. It lets you track every change to your codebase, revert mistakes, branch for experiments, and collaborate without overwriting each other's work. Every scientific software tool should live in a Git repository from day one.
Testing: The Discipline That Saves You
Write tests. Write them before you think you need them. Unit tests verify that individual functions produce correct outputs for known inputs. Integration tests verify that components work together. Regression tests verify that changes to the code do not break existing functionality.
In scientific software, testing has a special importance: your code produces quantitative results that people will use to make decisions. A function that returns a p-value of 0.04 instead of 0.06 changes a scientific conclusion. The only way to have confidence that your code is correct is to test it against known solutions, edge cases, and adversarial inputs.
```python
# Example: testing a growth model fitting function
import numpy as np

def test_logistic_fit_recovers_known_parameters():
    """Given synthetic data from a logistic model with known parameters,
    verify that the fitting function recovers those parameters
    within acceptable tolerance."""
    known_K = 1.0    # carrying capacity
    known_r = 0.5    # growth rate
    known_N0 = 0.01  # initial population
    t = np.linspace(0, 20, 100)
    y_true = known_K / (1 + ((known_K - known_N0) / known_N0) * np.exp(-known_r * t))
    rng = np.random.default_rng(42)  # seed the noise so the test is deterministic
    y_noisy = y_true + rng.normal(0, 0.02, len(t))
    result = fit_logistic(t, y_noisy)
    assert abs(result.K - known_K) < 0.1
    assert abs(result.r - known_r) < 0.05
    assert abs(result.N0 - known_N0) < 0.05
```
Documentation: Write It for the Person Six Months from Now
That person is probably you. Docstrings for every public function. A README that explains what the tool does, how to install it, and how to run it in under two minutes. Type hints that make your function signatures self-documenting. Comments that explain why, not what. The code itself should explain what. The comments explain the reasoning behind non-obvious decisions.
Multiprocessing and Parallelism
Biological datasets are large and analyses are computationally intensive. Knowing when and how to parallelize is essential.
Python's multiprocessing module spawns separate processes that bypass the Global Interpreter Lock (GIL) and run on separate CPU cores. This is the right choice for CPU-bound work like fitting models to thousands of samples independently. The concurrent.futures module provides a higher-level interface with ProcessPoolExecutor for parallelism and ThreadPoolExecutor for I/O-bound tasks like downloading files or making API calls.
The key principle: parallelize at the right level of granularity. Do not parallelize individual numpy operations (numpy already uses optimized BLAS routines that may be multithreaded). Do parallelize independent tasks: fitting growth curves to hundreds of wells across a stack of 96-well plates, running QC on 50 sequencing samples, aligning reads from 30 different libraries.
```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def fit_single_well(well_data, model_type="logistic"):
    """Fit a growth model to a single well's time-series data."""
    t, y = well_data["time"], well_data["od"]
    return fit_model(t, y, model=model_type)

def fit_plate(plate_data, model_type="logistic", max_workers=8):
    """Fit growth models to all wells in parallel."""
    fit_fn = partial(fit_single_well, model_type=model_type)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(fit_fn, plate_data))
    return results
```
Pipeline Design: Composing Computational Steps
A bioinformatics pipeline is a directed acyclic graph (DAG) of computational steps, where the output of one step feeds into the input of the next. Good pipeline design follows several principles: each step should be an independent, testable unit; intermediate results should be saved to disk so that failures do not require rerunning the entire pipeline; steps should be idempotent (running them twice produces the same result); and the pipeline should be described declaratively (what needs to happen) rather than imperatively (the specific sequence of commands).
Tools like Snakemake, Nextflow, and Prefect formalize these principles. But even before adopting a workflow manager, you can structure your own code this way: each analysis step is a function that reads input files and writes output files, with a configuration file that specifies parameters and a master script that orchestrates the steps.
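One way to sketch the checkpoint-on-disk idea before adopting a workflow manager; the step decorator here is a hypothetical pattern, not a library API:

```python
from pathlib import Path
import json

def step(output: Path):
    """Decorator: make a pipeline step idempotent by skipping it when
    its output file already exists (checkpoint-on-disk, sketched)."""
    def wrap(fn):
        def run(*args, **kwargs):
            if output.exists():
                return json.loads(output.read_text())  # reuse checkpoint
            result = fn(*args, **kwargs)
            output.write_text(json.dumps(result))      # persist for reruns
            return result
        return run
    return wrap
```

A failed pipeline rerun then resumes from the last completed step for free, because completed steps short-circuit on their existing output files. Real workflow managers add the missing pieces: dependency tracking between steps, staleness detection, and cluster execution.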
Part V: From Script to Product
The final transition, the one most computational biologists never make, is from a working analysis to a product that others can use. This requires thinking about packaging, deployment, and the user experience.
Packaging and Distribution
If your tool requires more than pip install your-tool or conda install your-tool to set up, you have already lost most of your potential users. Proper packaging means: a pyproject.toml or setup.py with pinned dependency versions, a virtual environment or container (Docker/Singularity) for reproducibility, and a one-command installation process. If you are distributing a web application, containerization with Docker is the standard: your users should be able to run docker pull and docker run and have a working application.
The Interface Layer
For command-line tools, use a proper argument parser (argparse, click, or typer) with help text, sensible defaults, and input validation. For web applications, Streamlit and Gradio have dramatically lowered the barrier to building interactive scientific interfaces. For programmatic use, a clean Python API with type hints and docstrings.
The most important design decision is: what should the default behavior be? Defaults should be safe, sensible, and cover the most common use case. A user who runs your tool with no optional arguments should get a reasonable result. Power users can override defaults. Everyone else should not have to.
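A sketch of safe defaults with argparse; the tool name and flags below are invented for illustration, not a real CLI:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI sketch: every optional argument has a safe, sensible default,
    so a bare invocation does the most common thing correctly."""
    p = argparse.ArgumentParser(
        prog="growthfit",
        description="Fit growth models to plate-reader time series.",
    )
    p.add_argument("input", help="CSV with time and OD columns")
    p.add_argument("--model", choices=["logistic", "gompertz", "monod"],
                   default="logistic", help="growth model (default: logistic)")
    p.add_argument("--out", default="results.csv",
                   help="output path (default: results.csv)")
    p.add_argument("--workers", type=int, default=1,
                   help="parallel workers (default: 1, the safe choice)")
    return p
```

Note the shape: the only required argument is the input file; everything else is a default a power user can override. The choices= constraint also gives input validation and a helpful error message for free.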
AI as the Interpretation Layer
One of the most underexplored opportunities in scientific software is using large language models as an automated interpretation layer. In my kinetic modeling tool, after the analytical workflow completes, the structured results are programmatically passed to a large language model (Mistral via the Hugging Face API) that generates a preliminary scientific interpretation of the data. This closes the loop between computation and insight, and it means a wet-lab scientist can go from raw data to an AI-assisted summary without writing a single line of code.
This is not a replacement for expert interpretation. It is an accelerant. The AI-generated summary gives the user a starting point, a set of observations and potential explanations that they can then evaluate with their biological expertise. The key is structuring the output of your analytical pipeline in a way that is machine-readable and semantically rich, so that the language model has the context it needs to produce useful interpretation.
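A sketch of what a "machine-readable and semantically rich" handoff can look like; the schema and prompt wording here are hypothetical, and the actual API call to the model is deliberately omitted:

```python
import json

def build_interpretation_prompt(fit_results: dict) -> str:
    """Serialize structured pipeline output into an interpretation prompt.

    Illustrative sketch: the schema is invented. The point is that the
    model receives named quantities with units and context, not a raw
    table it must guess the meaning of.
    """
    payload = json.dumps(fit_results, indent=2)
    return (
        "You are assisting with microbial growth kinetics analysis.\n"
        "Given the fitted parameters below, summarize growth behavior, "
        "flag poor fits (r_squared < 0.95), and suggest follow-up "
        "experiments.\n\n" + payload
    )
```

The returned string would then be sent to whatever model endpoint the tool uses; keeping prompt construction as its own pure function means it is testable without ever touching the network.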
The best scientific software does not just compute an answer. It helps the scientist understand what the answer means and what to do next.
Where This Is All Going
Principles I keep returning to
- Start with the user, not the algorithm. Ask who will use this tool and what decision they need to make before you write a single line of code.
- Understand the math one level deeper than you need. If you are calling scipy.optimize.minimize, know what the Hessian is and why the algorithm might fail.
- Separate concerns relentlessly. Your science should not know about your interface. Your interface should not know about your storage. Keep the layers clean.
- Test against known solutions. Synthetic data with known parameters is your best friend. If your fitting function cannot recover known parameters from clean synthetic data, it will not produce correct results on real noisy data.
- Parallelize at the right level. Independent samples, independent wells, independent genes. Not inner loops of numpy operations.
- Default to reproducibility. Pin your dependencies. Set random seeds. Save intermediate results. Document your pipeline. Make it possible for someone else to get the same answer from the same data.
- Ship it. A tool that is 80% polished and in the hands of users is infinitely more valuable than a tool that is 100% perfect and on your laptop.
Building scientific software is a discipline that sits at the intersection of mathematics, computer science, and domain expertise. It is not taught well in most PhD programs. It is learned by building things, breaking them, rebuilding them, and paying attention to what works. The tools I have built are better than the ones I built three years ago, and the ones I will build next year will be better still.
The most important thing is to start. Pick a problem, build a tool, give it to someone who is not you, and watch what happens. That feedback loop, between your computational thinking and someone else's biological need, is where the best scientific software comes from.