Daphne Koller has a knack for using technology to improve the human condition. She’s won some of computing’s highest awards and been at the center of a few of Silicon Valley’s efforts to improve lives—one example being Coursera, the global online learning platform she cofounded. A recipient of the MacArthur Foundation’s “genius grant” and one of Time magazine’s Most Influential People, Koller is a leading authority on machine learning. Her current mission, as the founder and CEO of drug discovery and development company insitro, is to harness the power of machine learning to create better medicines for patients in need.
Koller recently spoke with McKinsey’s Lydia The at insitro’s headquarters in South San Francisco. Excerpts of their conversation follow.
Lydia The: Let’s start with a big-picture question: How do you think AI and ML [machine learning] can change drug discovery? How would you describe the opportunity?
Daphne Koller: Drug discovery in the past 50 years is a tale of glass half full and glass half empty. On the half-full side, we have transformative medicines that have made a very big difference to patients. On the half-empty side is the so-called Eroom’s Law, the reverse of Moore’s Law, where the cost of drug discovery has grown exponentially year on year without an increase in new drug output.
Why is that? It’s because there are multiple places in the drug discovery process where we need to make significant decisions. If we’re lucky, one choice gets us to a good outcome; the rest lead to dead ends. Every successful drug has to bear on its back the cost of all the failures. That means for many diseases there’s just no drug—because either it’s never been a priority and no one’s working on it, or people are working on it but haven’t figured out a path to create an effective medicine. Disease-modifying medicines are few and far between, and cures are almost nonexistent.
With AI, we’re able to use large amounts of data to build “compasses” that allow us to know, when we get to these forks in the road, which path will most likely lead to success. We aspire to create a much more engineered process, with a higher success rate. The aim is to go faster from identifying the genetics of a disease, or of a group of patients, to developing a disease-modifying intervention—so that maybe, when we get to 2035, there will be a lot more treatments to help patients live a long and healthy life.
Lydia The: How exactly is insitro tackling that? What’s the first angle you’re taking?
Daphne Koller: Our focus is “de-convoluting” the biology of human disease. Often, clinicians tackle disease without really understanding what the disease even is. Disease is often defined by coarse-grained symptomatic manifestations, some of which use classifications that date back 50 years or more. These are typically filtered through a subjective lens of both the patient and the clinician, so we end up with a mishmash that really doesn’t speak to the underlying biological causes of the disease.
At insitro, we collect high-content data to help us understand underlying biological processes that correspond to disease. Some of those data sets come from patients. For example, we collect imaging data, such as MRI and histopathology; various molecular measurements; and other data that allow us to identify, via machine learning, subtle patterns to disentangle distinct patient subsets.
At the same time, we generate in our lab large amounts of human-derived cells, called induced pluripotent stem cells. These are human cells that have been reverted to stem cell status, from which we then create neurons or hepatocytes that carry the genetics of the person they came from. We can further introduce into those cells genetic variations that we know are likely to cause disease. Then we can measure those cells and interrogate—with microscopy or RNA sequencing—what disease looks like at the cellular level. This system gives us a rapid approach for testing therapeutic interventions that could potentially work in humans.
‘Let machines loose’
Lydia The: What role do AI and ML play in this process? In other words, what is AI doing that a scientist or researcher can’t do?
Daphne Koller: I’ll give you a couple of examples. In our recent work studying a fatty liver disease, we were able to identify—using ML—patterns within the liver tissue that correspond to known genetic drivers of disease. Human pathologists couldn’t see those patterns because they don’t even know what to look for. We found that if we let ML loose on the samples—if we let machines have an unfettered, unanchored look at the data—they’re able to identify disease-causing and disease-modifying associations that a human just can’t see.
Another example is our work on tuberous sclerosis complex, which is a rare but not ultra-rare disease: there are 50,000 patients diagnosed with it in the United States, a million worldwide, and we believe it’s underdiagnosed. We created an in vitro cellular disease model by introducing the genetic variant that causes the disease into our cellular systems via CRISPR, and then we were able to phenotype those cells using different methods—including some live cell imaging via our proprietary ML-enabled microscope. We were able to demonstrate reversions that had never been identified before. We’re now assessing those as potential novel drug targets.
Ensuring data integrity
Lydia The: Good data is critical to AI and ML. A question in my mind about AI and ML in drug discovery is, “Why now?” What makes you believe that we can now get enough data that’s fit for purpose for AI applications? And how do you ensure you have good data coming out of your models?
Daphne Koller: One of the things that led me to come back to this field after a bit of a digression into online education at Coursera is that I felt like now is a time when we can really make a difference in applying machine learning to biomedical data. When I was at Stanford, a large data set was 200 samples. You felt lucky if you had 500. Now we’re in a world where there’s an unbelievable ability to both access and generate data that is fit for purpose for machine learning.
On human data, one of the earliest efforts is the UK Biobank, which has been able to create deep phenotypic data from 500,000 individuals—measuring everything from whole-body and brain imaging to blood and urine biomarkers to predisposing factors, as well as longitudinal outcomes. It has limitations, of course, not least of which is that its composition is very Eurocentric, but—both in the UK and elsewhere—others are building on this effort and creating additional cohorts, making it more diverse. These data sets are only going to get bigger and more useful due to the growing availability of electronic health records.
Separately, in the last decade or so, there have been major advancements in life science tools like CRISPR—best known as a therapeutic modality, but at least as powerful as a research tool—and in measurement technology, with things like super-resolution microscopy, single-cell RNA sequencing, and single-cell proteomics. All of these enable the creation of incredibly large data sets that allow us to interrogate, in very fine detail, the underlying biology of disease.
Lydia The: How do you think about the balance between using publicly available data—which exists but isn’t always clean or easy to use and doesn’t necessarily confer competitive advantage—and generating your own data?
Daphne Koller: Many people believe that by simply collecting a bunch of data haphazardly from different places and creating a sufficiently big pile, you’ll have something that is fit for purpose for machine learning. That’s very rarely the case, especially because some of that data comes from small experiments that were each done in a different way, in a different assay, under different conditions, with different definitions of what success looks like. That strategy is very dangerous, especially when mistakes are made at the early stage and you only discover them five years later in a very expensive clinical trial.
There are high-quality data sets available to everyone—not many, but they do exist. As I mentioned, the UK Biobank is an example. So we’ve onboarded data sets that we think add value to machine learning. There are novel methods that we can apply to these data sets that give us unique differentiated insights.
Now, is that a permanent competitive advantage? Probably not, because there are a lot of smart people out there. But it certainly gives us a head start.
Our bigger advantage is that we generate, in-house, complementary forms of data that align with what’s available publicly but allow us to do experiments. Our data set allows us to intervene and assess causality of a disease variant or of an intervention, so it’s a huge amplifier to what is available in public data sets. Using both, the whole is considerably greater than the sum of the parts.
A spirit of collaboration
Lydia The: Something I think about a lot is talent and culture. At insitro, you’re mainly looking for two different types of talent: people with a biology background and people with a computer science background. Everyone talks about how hard it is to avoid silos and how challenging it can be to get everyone working together and understanding one another. How do you make sure that you don’t create second-class citizens and that your employees truly see one another as equals?
Daphne Koller: That’s one of my favorite topics. One of the things I’m most proud of is the way in which insitro has brought together people with diverse backgrounds: machine learning scientists, software engineers, automation engineers, stem cell scientists, discovery biologists, drug hunters, and more.
We’ve laid out behavioral norms from the very beginning to make sure that there’s collaboration and team spirit within our company. We expect employees to engage with one another openly, which means you are willing to ask “naive” questions and accept “naive” ideas from people outside your discipline; constructively, which means you seek to make the outcome better rather than trying to be the smartest person in the room; and with respect for what everyone brings to the table.
This spirit of collaboration permeates insitro and is something that every new “insitrocyte” comments on when I have my 60-day meeting with them. And we’ve consistently found that it gives rise to not only better solutions but also better problems. Questions that we would never have thought to tackle just emerge when people with different backgrounds come together and say, “What is it that we’re really trying to do here?”
Daphne Koller: I find that there’s an interesting bimodal distribution of opinions regarding the role that machine learning could play in drug discovery. Just four to five years ago, skeptics thought AI was going to be completely useless. More recently, there’s been a greater recognition of the value, but there are still people who think it will be a point solution, like combinatorial chemistry—something that will help in a narrow niche but won’t have broader impact.
On the other side, there are starry-eyed true believers who say, “This is going to be artificial general intelligence!” and who believe they’re going to “get every drug approved within six months!” Biology is really hard, and we need to be very careful when intervening in something as precious as human life. If people tell you that they’re going to have “100 drugs in the clinic in three years,” or make predictions along those lines, that too is wrong, in a different way.
Where I think we’ll be in 15 years is that machine learning will have driven an absolutely critical paradigm shift in how we discover and develop medicines. It’s going to be more akin to computers as a tool than to combinatorial chemistry as a tool, in the sense that it will touch every single facet of how we discover and develop medicines, and accelerate and improve every single one of them.