Software-Defined Biology is Here

May 22, 2024

Who designs your molecules, a human or a computer? Gone are the days of software as a mere follow-on to scientific processes. What was once an artisanal pastime is now a programmable and scalable endeavor — and in many ways, software is the differentiating factor [1]. But we’re not just talking about the growing popularity of AI-enabled drug discovery; we’re talking about a seismic shift in the way we design, track, and optimize scientific experiments.

Enter software-defined biology: the codification of scientific experimentation and decision-making into software. In other words, the largest shift in the way drug discovery [2] companies operate in several decades. Not since the drug discovery industry entered the era of biology (moving away from an exclusive focus on chemistry) [3] have we seen such a radical shift in the way these companies operate.

Granted, this purported shift is not without its skeptics. Some may say “we’ve had software in drug discovery for decades.” But software-defined biology goes well beyond traditional bioinformatics and computational chemistry, instead embracing a new role for software: as a core part of the experimental process. Others may insist that “software is but a minor part of drug discovery” — and while that may be true today, we’re seeing a huge amount of progress in the field [4] signaling that we’re just at the beginning of this shift [5].

In this post I will endeavor to address the naysayers by showing how software-defined biology is fundamentally different from how the industry has historically thought about the use of software. We’ll explore how this shift is unfolding, what it means for software-driven companies, and the implications for the future of drug discovery.

What do we mean by “Software-Defined Biology”?

We previously defined software-defined biology as “the codification of scientific experimentation and decision-making into software.” But what does that word salad really mean?

Historically, science has been full of manual and ad-hoc processes. Information and protocols are recorded in an unstructured manner — whether it’s in an electronic lab notebook (ELN), semi-structured spreadsheets, or slide decks. It is nearly impossible to trace all the steps taken to get to a given decision, since so much happens inside a scientist’s head. Progress is driven by scientist intuition, and as a result, it is incredibly challenging to replicate or systematize.

By contrast, software-defined biology involves defining data models in code, capturing structured data and metadata, and introducing automated systems (such as machine learning models) for decision-making. This approach shifts the question from “what are the best molecules to make?” to “how can we build a system that makes the best molecules?”

Let’s take the example of a simple luciferase assay. We want to use this experiment to screen some potential drug candidates to see if they have activity against our gene of interest. In a traditional drug discovery format, the steps might be:

  1. Describe the experiment in an ELN
  2. Run the experiment
  3. Record the results in Excel
  4. Store the Excel file in an SDMS or cloud file storage

This method is simple to set up and get started with. An experienced drug discoverer will be able to look at the results and make informed decisions about what follow-up experiment(s) to run. However, it’s unlikely that a company would run only one luciferase experiment. Rather, that experiment signals the beginning of a larger screening campaign.

As scientists start to run more and more luciferase experiments, they will likely start to encounter issues. First off, the manual processes take the same amount of time for every experiment — so the amount of time spent on these experiments scales linearly with the number of experiments executed. Additionally, at higher throughput it becomes difficult or impossible to compare results across experiments. This forces scientists to make decisions based on incomplete or inaccurate data.

By contrast, in a software-defined biology company, the steps would be:

  1. Define a schema in code that represents the experiment
  2. Design the experiment in a tool that captures structured metadata
  3. Run the experiment
  4. Automatically push the results into a data warehouse
  5. Analyze the results in a dedicated tool that allows for comparison across experiments
  6. (Optionally) Retrain ML models based on the results
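As a concrete illustration of step 1, a result schema might start life as a plain Python dataclass. Everything here is invented for the sketch — `LuciferaseResult` and its fields are hypothetical, not part of any real library:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class LuciferaseResult:
    """One well's readout from a luciferase reporter assay (illustrative schema)."""
    experiment_id: str      # links back to the experiment design record
    compound_id: str        # candidate molecule being screened
    concentration_nM: float # dose tested
    luminescence: float     # raw plate-reader signal
    # Normalized vs. control; filled in later by an analysis pipeline.
    normalized_activity: Optional[float] = None
    run_date: date = field(default_factory=date.today)

# Because the shape of a "result" is now explicit, downstream code can
# validate rows at load time instead of discovering surprises in a spreadsheet.
row = LuciferaseResult("EXP-001", "CMPD-42", 100.0, 15230.0)
```

In practice a company would likely reach for a schema library or an ORM rather than raw dataclasses, but the core idea is the same: the shape of the data lives in version-controlled code.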

While this approach requires more upfront work, there are clear long-term benefits. Defining results schemas in code allows us to leverage software best practices surrounding version control and migration. We’ve all experienced the pain of trying to update a field in our LIMS and being unsure about whether this will break downstream pipelines. Moreover, capturing and structuring experimental data in a warehouse allows scientists to ask questions over all their data, not just the most recent experiment.
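To make the "questions over all their data" point tangible, here is a toy sketch using Python's built-in sqlite3 as a stand-in for a real warehouse; the table and column names are invented for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE assay_results (
        experiment_id TEXT, compound_id TEXT, normalized_activity REAL
    )
""")
conn.executemany(
    "INSERT INTO assay_results VALUES (?, ?, ?)",
    [
        ("EXP-001", "CMPD-42", 0.81),
        ("EXP-002", "CMPD-42", 0.77),  # same compound, later experiment
        ("EXP-002", "CMPD-07", 0.12),
    ],
)

# One query spans every experiment ever run, not just the latest spreadsheet.
rows = conn.execute("""
    SELECT compound_id,
           AVG(normalized_activity) AS mean_activity,
           COUNT(*)                 AS n_experiments
    FROM assay_results
    GROUP BY compound_id
    ORDER BY mean_activity DESC
""").fetchall()
```

The same aggregation is effectively impossible when each experiment lives in its own Excel file with its own ad-hoc layout.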

Finally, it’s impossible to ignore the impact and promise of using ML models in drug discovery. Human scientists are no longer the only way to interrogate results and propose new hypotheses — we can build pieces of software (ML models) to help. Unlike software tools that came before, newer ML models can provide a “reasoning” layer which allows for the creation of new ways of working such as “lab in the loop”.
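The "lab in the loop" pattern can be caricatured in a few lines of Python. Everything below is invented for illustration: `run_assay` simulates the wet lab, and the "model" is just a nearest-neighbor lookup standing in for a real ML model:

```python
# Hypothetical ground truth that only the "wet lab" can observe.
def run_assay(x: float) -> float:
    return 1.0 - (x - 0.6) ** 2  # activity peaks at x = 0.6

candidates = [i / 10 for i in range(11)]  # molecule "designs" to choose from
measured: dict[float, float] = {}         # structured results accumulate here

def predict(x: float) -> float:
    # Stand-in "model": score of the nearest already-measured candidate,
    # or a neutral prior if nothing has been measured yet.
    if not measured:
        return 0.5
    nearest = min(measured, key=lambda m: abs(m - x))
    return measured[nearest]

# Lab in the loop: the model proposes an experiment, the lab returns a
# result, and the result immediately updates the model for the next round.
for _ in range(5):
    untested = [c for c in candidates if c not in measured]
    pick = max(untested, key=predict)  # model proposes the next experiment
    measured[pick] = run_assay(pick)   # lab result feeds back in

best = max(measured, key=measured.get)
```

A real system would swap the lookup for a trained model and the simulator for instrument integrations, but the propose-measure-retrain loop is the essence of the pattern.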

Why now?

Apart from the recent advances in the ML space, many of the technical changes here are not groundbreaking — we’ve had ORMs and data warehouses for decades! The mindset shift, however, is crucial. Now that we’re seeing increased success with ML models, companies are clamoring to figure out how to apply AI/ML to their problems. There is finally a strong enough forcing function for drug discovery companies to value their data [6].

In parallel to the ML advances over the past decade, there has been an explosion of new modalities [7] that are often complex and composed of multiple parts. For example, mRNA therapeutics companies don’t just worry about the RNA payload — the delivery vehicle is just as critical. Optimizing multiple parts of a drug together leads to an exponential increase in complexity. The best way to manage this is through code, since code is composable and lets you model the different pieces and how they fit together.

What’s next

We’re still in the early stages of the era of software-defined biology. Over the past few years, we’ve seen a growing number of biotech companies embrace this way of working. As these companies prove out their successes in the clinic [8], more and more companies will follow in their footsteps.

Up until now, these companies have had to build everything from scratch. That’s changing. The rise of specialized tools and platforms designed for software-defined biology is making it easier for new companies to adopt these practices. Still, many tools built for the previous patterns of doing science are a poor fit for these modern companies.

Reach out if you want to talk more about software-defined biology or have questions about how Sphinx provides the infrastructure a software-defined biology company needs.


[1] There’s been a lot of debate about what a “techbio” company is and whether it is truly distinct from a biotech company. These debates focus mostly on the different business models. I want to focus instead on the operational differences — and how scientists think about them. For the purposes of this post, I’ll focus on the drug discovery space, but these concepts apply across the bioeconomy.

[2] This is a larger scientific trend that is manifesting in many fields in the broader life sciences field and beyond, but for the purposes of this post, we’ll focus on drug discovery.

[3] See the steady growth in the number of biologics approvals in recent years.

[4] AlphaFold 3, OpenCRISPR, and Evo are just a few examples of the exciting advances in the past couple of months.

[5] For context, after the first monoclonal antibody was discovered, it took another decade for the second one to get approved.

[6] And not just experimental data, but also proposed hypotheses that might never make it into the clinic. As Aviv Regev puts it: “They’re like, ‘This is never going to be a drug.’ Maybe it won’t. But it will make an algorithm that will be good for all drugs.”

[7] Most of the top 10 pharma products are “new” modalities.

[8] There are some early suggestions that these types of companies have increased rates of Phase 1 success.
