Build Reproducible and Scalable Computational Biology Systems

April 10, 2024

Many recent state-of-the-art computational biology developments depend on transformer models. To understand this landscape, we collaborated with Nick Wisniewski, an expert data scientist and computational biologist. In this post, we discuss high-level trends at the intersection of biology and AI, new (and old) technical challenges in building reproducible and scalable systems for AI-driven computational biology, and how frameworks like Metaflow can help address them, using Geneformer as an example.

Transforming bioinformatics

Generative AI algorithms aren’t just revolutionizing chatbots and search. Bioinformaticians are increasingly leveraging the capabilities of transformer architectures to model proteins, DNA sequences, and more, similar to how LLMs use transformers in language tasks.

Major investments are being made in companies building computational biology systems. The Economist reports that artificial intelligence is taking over drug development, and the journal Nature has published several papers on generative AI algorithms for foundational biological research.

To illustrate the trend, here’s a small sample of papers about transformer models applied to modeling RNA and DNA sequences:

Date	Model & code	Paper
August 2021	DNABERT	DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
October 2021	Enformer	Effective gene expression prediction from sequence by integrating long-range interactions
March 2022	Geneformer	Transfer learning enables predictions in network biology
January 2023	Nucleotide Transformer	The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
June 2023	DNABERT-2	DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome
June 2023	HyenaDNA	HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
September 2023	Genomic pre-trained network (GPN)	DNA language models are powerful predictors of genome-wide variant effects
February 2024	Evo	Sequence modeling and design from molecular to genome scale with Evo
February 2024	scGPT	scGPT: toward building a foundation model for single-cell multi-omics using generative AI
March 2024	Caduceus	Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

A sample of impactful DNA-based transformer models

As in natural language processing (NLP), foundation models, trained on extensive datasets across many contexts, provide a versatile and reusable base for applications in biology. In NLP, LLMs are trained on large corpora of text; in biology, the same model types are trained on a variety of data types such as RNA, DNA, and protein sequences. In both cases, the primary goal of foundation models is to transfer general knowledge learned in the broad context to more fine-grained, task-specific contexts.

Why build foundation models for biology?

Transfer learning is crucial for biology, where understanding a global system outcome requires discovering patterns that emerge across interacting sub-systems, each level of the system exhibiting significant complexity.

Bioinformatics advances enable a deeper characterization of the cellular context through multi-omic measurements and CRISPR-based perturbation screens. Multi-omic measurements encompass the genome, transcriptome, proteome, metabolome, morpholome, and more, revealing patterns in how genetic information flows through complex biological systems.

With CRISPR/Cas9’s ability to edit DNA at specific locations, researchers can systematically knock out, knock down, or knock in genes to identify causal effects on cellular phenotype. Consequently, there is a growing need for advanced algorithms that can integrate and synthesize these new forms of information. The transformer architecture has become a popular solution for multi-omic biology.

These systems show promise for significantly improving the speed and cost-effectiveness of developing therapeutics. Foundation models can perform in silico screening to identify novel drug targets, elucidate their network biology, and reduce the size of perturbation screens through active learning. Moreover, in silico methods enable the exploration of combinatorial gene perturbations, which can have dramatic non-additive effects. Such explorations are practically impossible in wet labs due to the vast number of combinations and contexts.
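To get a feel for that scale, here is a back-of-the-envelope count of gene combinations. It is only a sketch: the figure of roughly 20,000 human protein-coding genes is an assumed round number, and the count ignores cell types, doses, and other contexts, which would multiply it further.

```python
from math import comb

# Assumption: ~20,000 protein-coding genes, a commonly cited round number.
N_GENES = 20_000


def n_combinations(n_genes: int, k: int) -> int:
    """Number of distinct unordered sets of k genes perturbed together."""
    return comb(n_genes, k)


pairs = n_combinations(N_GENES, 2)
triples = n_combinations(N_GENES, 3)

print(f"pairwise perturbations: {pairs:,}")
print(f"three-way perturbations: {triples:,}")
```

Under this assumption, pairs alone run to roughly two hundred million experiments and triples to more than a trillion, which is why a model that can rank candidates before any wet-lab work is attractive.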

Hurdles in AI development and deployment for computational biologists

Despite the promise, many challenges remain in developing and deploying robust AI systems for computational biologists. The rest of this article highlights some primary challenges and how modern ML/AI infrastructure can address them:

Scalable compute: Researchers may not always have readily available hardware accelerators for cutting-edge AI workflows. Writing code to run on accelerators presents its own set of software engineering challenges.
Consistent environments: Experiments and workflows contained in notebooks are often hard to reproduce. A major reason for this is a lack of rigorous dependency management.
Automated workflows: When workflows can run in a shared production environment instead of with a researcher’s laptop in the loop, they are more reproducible. Setting up such systems is often beyond the scope of a scientist's job, and many forgo the benefits of workflow automation unless it is easy to do.
Collaborative science: Reproducibility at the scale of the scientific community requires tools for sharing and observing each other’s work. Scientific software should be easy to collaborate on and build on top of.

What is Metaflow and how can it help?

Metaflow is a Python framework that makes it easy to build not only models, but production-quality systems powered by ML and AI. It was originally developed at Netflix to address their diverse internal ML and AI use cases. After seeing the success of the framework internally - it powers major parts of Netflix today - Netflix open-sourced Metaflow in 2019.

While Netflix’s business is far from computational biology, the primary challenges of ML and AI listed above apply to Netflix as well, and they motivated the development of Metaflow. Because these needs are universal, Metaflow has been adopted by leading bioinformatics, pharmaceutical, and other healthcare-related companies since it was open-sourced.

Stemming from Netflix’s strong engineering culture and decades of experience in building robust cloud infrastructure, Metaflow’s key value proposition is to provide a smooth, developer-friendly path from experimentation and research to business-critical production, acknowledging the fact that most researchers are not systems engineers. Researchers should be allowed to focus on research and modeling, while the framework helps them produce production-quality, reproducible software.

The need to elevate the level of software quality in research projects has become more acute in computational biology with the advent of large-scale AI models which are compute-hungry and often costly and complex to operate. They are also much more capable and versatile than earlier models and algorithms in bioinformatics, making it possible to apply them in novel, more advanced use cases - when supported by proper frameworks.

In the following sections, we’ll take each concern - scalable compute, consistent environments, automated workflows, and reproducible science - and show how you can use Metaflow to help solve them. We’ll then conclude with a case study of Geneformer, a foundation model announced in the journal Nature last year.

Scalable compute: mo parameters, mo problems

Let’s start with one of the most pressing issues in the era of AI: the rapidly increasing need for (cost-efficient) compute power. Metaflow provides a host of benefits for developing on GPUs in cases like training or fine-tuning foundation models, including packaging GPU dependencies and orchestrating distributed training.

Compute matters more than ever for computational biology, as models based on the transformer architecture are getting bigger at an increasing rate.

The largest transformer models by parameter count in NLP (green) and biology (purple)

The chart indicates that the bitter lesson will likely continue to apply to progress in building foundation models for computational biology. As models, datasets, and compute needs grow, workloads demand special accelerators with more compute power, more on-device memory, and faster interconnect between processors.
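As a rough illustration of why on-device memory becomes a bottleneck, the sketch below estimates the memory needed just to hold training state. It assumes a common rule of thumb of about 16 bytes per parameter for mixed-precision training with the Adam optimizer; real numbers depend on the framework, precision, and sharding strategy, and activations are left out entirely. The three model sizes are illustrative.

```python
# Rule-of-thumb assumption: mixed-precision training with Adam keeps about
# 16 bytes per parameter (fp16 weights and gradients, plus fp32 master
# weights and two optimizer moments). Activations are NOT included.
BYTES_PER_PARAM = 16


def training_state_gb(n_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate memory for weights, gradients, and optimizer state, in GB."""
    return n_params * bytes_per_param / 1e9


# Illustrative parameter counts at three orders of magnitude.
for n_params in (10_000_000, 2_500_000_000, 7_000_000_000):
    print(f"{n_params:>13,} params -> ~{training_state_gb(n_params):,.2f} GB")
```

Under these assumptions, a 7-billion-parameter model needs more training-state memory than a single 80 GB accelerator offers, which is one reason fast interconnects and multi-device training matter.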

Finding sufficient quantities of such accelerators is hard unless your organization is a preferred partner of cloud providers or has extensive experience investing in a computing platform. Once you have access to them, using hardware accelerators effectively requires a complex stack of environment dependencies and new SDKs, which can be another learning curve that blocks focus on core biology research.

Date	Model	Number of parameters	Pre-train GPUs	Pre-train GPU type	Pre-train wall clock
June 2023	HyenaDNA	1.6 million	1	Nvidia A100-40GB	80 minutes
March 2022	Geneformer	10 million	12	Nvidia V100-32GB	3 days
September 2023	Genomic pre-trained network (GPN)	86 million	4	Nvidia A100-80GB	4 days
August 2021	DNABERT	90 million	8	Nvidia 2080Ti	25 days
January 2023	Nucleotide Transformer	2.5 billion	128	Nvidia A100-80GB	28 days
February 2024	Evo	7 billion	64	Nvidia H100	?

Consider that a relatively small transformer like Geneformer, a model with 10 million parameters, was trained on 12 V100 32GB GPUs for 3 days. Although V100s are a few generations old and widely available in the cloud, it remains difficult for research scientists without extensive HPC backgrounds to access dozens of them and run a gang-scheduled job, a requirement for distributed training.
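To compare the pre-training budgets in the table, one simple measure is GPU-hours: the number of GPUs multiplied by the wall-clock time. The sketch below uses only the figures from the table above. It is a coarse comparison, since the GPU types differ widely in speed, and Evo is left as unknown because its wall-clock time is not listed.

```python
# (model, number of GPUs, pre-training wall clock in hours), copied from the
# table above. Evo's wall clock is not listed there, so it stays None.
PRETRAIN_RUNS = [
    ("HyenaDNA", 1, 80 / 60),
    ("Geneformer", 12, 3 * 24),
    ("Genomic pre-trained network (GPN)", 4, 4 * 24),
    ("DNABERT", 8, 25 * 24),
    ("Nucleotide Transformer", 128, 28 * 24),
    ("Evo", 64, None),
]


def gpu_hours(n_gpus, wall_clock_hours):
    """GPU-hours = GPUs x wall-clock hours; None when the duration is unknown."""
    if wall_clock_hours is None:
        return None
    return n_gpus * wall_clock_hours


budget = {name: gpu_hours(n, hours) for name, n, hours in PRETRAIN_RUNS}

for name, total in budget.items():
    shown = "unknown" if total is None else f"{total:,.0f} GPU-hours"
    print(f"{name}: {shown}")
```

Even the "relatively small" Geneformer run works out to several hundred GPU-hours on this measure, and the largest listed run is about a hundred times that.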

A key value proposition of Metaflow is to grant researchers easy access to compute resources of various kinds. Simply by declaring the resources they need, say @resources(gpu=4), researchers can scale their workloads horizontally and vertically. For instance, you can train large models on instances with multiple GPUs, perform distributed training over multiple instances, or run batch inference or hyperparameter search over thousands of tasks.

Metaflow leverages computational resources in your cloud account or on-premise through services like AWS Batch and Kubernetes. While it is possible to deploy a basic version of open-source Metaflow manually, companies with business-critical projects often choose to use Metaflow with Outerbounds, which removes the need to dedicate engineering resources to setting up and operating the infrastructure stack, grants access to compute in multiple clouds, and provides access to cloud GPUs directly from NVIDIA, amongst many other features - all in a secure and cost-optimized manner.

Consistent environments: write code that prioritizes reproducibility

Sharing code in notebooks is common practice for AI researchers and bioinformaticians. Environments like RStudio and Jupyter notebooks provide an excellent scaffolding for interleaving code with human-readable notes, allowing rapid development and exploration of data.

Nevertheless, scientists who want to reproduce or adapt notebooks shared with a research paper often need significant software development skills to get them running as intended. The study Computational reproducibility of Jupyter notebooks from biomedical publications (Samuel and Mietchen, 2023) finds:

“Out of 27,271 Jupyter notebooks, 1,203 ran through without errors, 879 (3%) that produced identical results.”

A conclusion from this work mirrors a common observation in production-oriented ML/AI teams:

“The large majority of these notebooks could not be executed automatically, mostly due to issues with the documentation of dependencies.”

Despite their shortcomings (and partly thanks to them), notebooks are great for quick experimentation and ad-hoc analysis. When they need reproducible, scalable, and maintainable software, teams need to adopt tools that help them go beyond scrappy notebooks, such as Metaflow.

To enable consistent and reproducible environments, Metaflow manages environments automatically so that code runs consistently on different computers - locally or remotely - including composing Docker images, conda packages, and PyPI packages in a few lines of code. You can use @pypi, for example, to specify what packages you need at the step level, like pandas below:

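from metaflow import FlowSpec, step, pypi, card

class PandasFlow(FlowSpec):

    # Pin the Python and pandas versions for this step only.
    # Run with: python <this file> --environment=pypi run
    @pypi(python='3.10.11', packages={'pandas': '2.1.1'})
    @card
    @step
    def start(self):
        # Import inside the step so pandas is only needed in the pinned environment.
        import pandas as pd
        data = {
            "x": list(range(10)),
            "y": [x**2 for x in range(10)]
        }
        # Assigning to self stores the DataFrame as a versioned artifact of the run.
        self.df = pd.DataFrame(data)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    PandasFlow()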

Versioning environments in this way helps other researchers run the code without starting off debugging dependency issues. A small up-front cost of refactoring notebook code as a structured workflow pays huge long-run dividends in developer efficiency.

Consider the time it takes to determine why code behavior changes when you don’t pin dependencies and a critical package, such as Hugging Face transformers, has changed. If you aren’t convinced that is a problem over the long run, look at how often this package needs to fix core tokenizers. Declaring versions and tracking them with workflow steps will save many debugging headaches when workflow code must be maintained for months or years, as it allows comparing new and old sets of dependencies and avoids surprise updates that break existing code.
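Comparing old and new sets of dependencies can be as simple as diffing two dictionaries of pins. The sketch below uses only the Python standard library and is not part of Metaflow. The helper names `installed_versions` and `diff_pins` are hypothetical, and the version numbers other than pandas 2.1.1 are made up for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_pins(old, new):
    """Return {package: (old_version, new_version)} for every pin that differs."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


# Hypothetical pin sets recorded with two runs of the same workflow.
pins_january = {"pandas": "2.1.1", "transformers": "4.35.0"}
pins_june = {"pandas": "2.1.1", "transformers": "4.41.0", "datasets": "2.19.0"}

print(diff_pins(pins_january, pins_june))
```

When a result changes between two runs, a diff like this narrows the search to the packages that actually moved, rather than leaving you to guess what the environment looked like months ago.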

The best part is that Metaflow integrates well with notebooks, so you can continue using them for explorations, ad-hoc analysis, and for other use cases where notebooks shine, alongside more production-ready workflows. When it comes to boosting the day-to-day productivity of researchers and other data-oriented developers, it is hard to beat the combination of a modern development environment, notebooks, and workflows written as proper software artifacts.

Automated workflows: get your laptop out of the loop

Writing software for scientific experimentation is often an interactive, iterative pursuit. However, if the goal is to write widely reproducible and foundational software, the code cannot depend on manual debugging or a scientist running a job from their laptop or workstation. Instead, the goal is to produce software that can run automatically, with minimal human intervention, and which other systems can confidently depend on.

This idea has been a pillar of robust software development for decades. Indeed, as we’ve argued before, MLOps as a category is a new spin on DevOps. These tools can greatly increase researchers' access to good software design practices. A primary job of MLOps tools is to enable practitioners such as bioinformaticians to write production-quality software without also requiring the skills of a professional software engineer.

Metaflow makes it straightforward for scientists to automate workflows. They can deploy workflows on a production-grade workflow orchestrator such as AWS Step Functions or Argo Workflows,

python genomic_data_curation_flow.py argo-workflows create

and have the flow run automatically on a schedule,

from metaflow import FlowSpec, schedule

@schedule(hourly=True)
class GenomicsDataCuration(FlowSpec):

    ...

or react to events originating from external systems in real-time,

from metaflow import FlowSpec, trigger

@trigger(event='data_updated')
class GenomicsDataCuration(FlowSpec):

    ...

or compose advanced systems where flows trigger each other:

from metaflow import FlowSpec, trigger_on_finish

@trigger_on_finish(flow='GenomicsDataLoader')
class GenomicsDataCuration(FlowSpec):

    ...

Metaflow enables the best of both worlds: you can experiment quickly locally - similar to notebooks - and move to fully automated, highly available, SLA-guaranteed production with a single command.

Reproducible science: multiplayer and open-source

Finally, paying attention to software tooling and the communication patterns it induces accelerates the productivity of research teams. As developers move from experimental and analytics environments to building production workflows, Metaflow provides teams with a rich set of tools, such as namespace isolation, tagging, and deployment branches. This enables teams to build on workflows without stepping on each other’s toes while maintaining observability of who did what and when, empowering researchers to experiment safely.

Metaflow is also open-source, so you can get all the benefits described in this article and check how it happens under the hood when needed. Moreover, many organizations build their ML platform on top of Metaflow, so there is a community of experts sharing plugins and examples of how you can customize and extend Metaflow to your needs.

Case study: Geneformer

A notable example is Geneformer (Theodoris et al., 2023), pre-trained on a corpus of 30 million single-cell transcriptomes from a wide range of human tissues. Geneformer excels at context-aware tasks that drive the speed of innovation in drug discovery, including batch integration, cell type annotation, disease classification, in silico reprogramming and differentiation, in silico perturbation, and in silico target discovery.

The Geneformer model and Genecorpus dataset are shared on Hugging Face as open-source artifacts, significantly boosting the reach and engagement of the work. In the metaflow-geneformer repository on GitHub, you can find a refactored version of a cell classification fine-tuning task from the original examples.

This workflow shows a general pattern of fine-tuning a model for each category in a dataset, in this case, splitting a single-cell RNA dataset by the organ the cell came from and fine-tuning a downstream model to predict the disease states of that cell:

Fine-tuning Geneformer for each unique organ in the dataset in parallel, reducing time to finish the job

Data is loaded into scalable object storage, such as Amazon S3.
Data is split and loaded quickly from storage to compute instances, such as GPU instances in the cloud.
For each split, modelers can experiment with any Python code, such as Hugging Face transformers.
For each split, a model can be trained or fine-tuned in parallel.
Each resulting model is put in cloud object storage with a few lines of code.
Metaflow creates a natural versioning scheme to keep models organized and track them over time.
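The steps above follow a split, fan-out, and collect pattern. The sketch below illustrates that shape with the Python standard library only, using a thread pool as a stand-in for parallel cloud tasks. It is not the code from the metaflow-geneformer repository and it does not use Metaflow. The function names, the `run_id` key scheme, and the toy records are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_by_organ(cells):
    """Group cell records by the organ they came from."""
    splits = defaultdict(list)
    for cell in cells:
        splits[cell["organ"]].append(cell)
    return dict(splits)


def fine_tune(organ, cells):
    """Placeholder for a real fine-tuning job; it only summarizes its split."""
    diseases = sorted({cell["disease"] for cell in cells})
    return {"organ": organ, "n_cells": len(cells), "labels": diseases}


def run_fan_out(cells, run_id):
    """Fine-tune one model per organ in parallel and key results by run and organ."""
    splits = split_by_organ(cells)
    with ThreadPoolExecutor() as pool:
        futures = {organ: pool.submit(fine_tune, organ, rows)
                   for organ, rows in splits.items()}
        return {f"{run_id}/{organ}": future.result()
                for organ, future in futures.items()}


# Toy records standing in for single-cell data; all values are made up.
toy_cells = [
    {"organ": "heart", "disease": "healthy"},
    {"organ": "heart", "disease": "cardiomyopathy"},
    {"organ": "lung", "disease": "healthy"},
]
models = run_fan_out(toy_cells, run_id="run-1")
print(sorted(models))
```

Keying each result by run and organ mirrors the idea of a natural versioning scheme: every model can be traced back to the run that produced it and the split it was trained on.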

The effects of robust AI infrastructure multiply the benefits of increasingly good open-source biology foundation models. Better software infrastructure drives faster innovation in biology modeling, and better models drive faster innovation in biology research and development.

Get started today!

To learn more about Metaflow, check out the documentation, play with Metaflow in our browser-based sandbox environment, and join thousands of AI and ML developers at the Metaflow community Slack.

When you are ready to deploy Metaflow securely in your cloud environment, you can get started quickly for free - no engineering experience required!
