“Where do I think the next amazing revolution is going to come? … There’s no question that digital biology is going to be it. For the very first time in our history, in human history, biology has the opportunity to be engineering, not science.” —Jensen Huang, NVIDIA CEO
Aviv Regev is one of the leading life scientists of our time. In this conversation, we cover the ongoing revolution in digital biology that has been enabled by deep new knowledge of cells, proteins, and genes, and by the use of generative A.I.
Transcript with audio and external links
Eric Topol (00:05):
Hello, it's Eric Topol with Ground Truths, and with me today I've really got the pleasure of welcoming Aviv Regev, who is the Executive Vice President of Research and Early Development at Genentech, who was a leader at the Broad Institute for 14 years, and whom I view as one of the leading life scientists in the world. So Aviv, thanks so much for joining.
Aviv Regev (00:33):
Thank you for having me and for the very kind introduction.
The Human Cell Atlas
Eric Topol (00:36):
Well, there is no question in my view that that is the truth, and I wanted to have a chance to visit a few of the principal areas that you have been nurturing over many years. First of all, the Human Cell Atlas (HCA), the approximately 37 trillion cells in our body, affected a little by size and gender and whatnot. You founded the Human Cell Atlas, and maybe you can give us a little background on what you were thinking, forward thinking of course, when you and your colleagues initiated that big, big project.
Aviv Regev (01:18):
Thanks. Co-founded together with my very good friend and colleague, Sarah Teichmann, who was at the Sanger and just moved to Cambridge. I think our community, which was still small at the time, really had the vision that has been playing out in the last several years, which is a huge gratification: that if we had a systematic map of the cells of the body, we would be able both to understand biology better and to provide insight that would be meaningful in trying to diagnose and treat disease. The basic idea behind that was that cells are the basic unit of life. They're often the first level at which you understand disease, as well as health, and in the human body, with its very large number of individual cells, 37.2 trillion give or take, those cells have many different characteristics.
(02:16):
Even though biologists had been spending decades, even centuries, trying to characterize cells, they still had a haphazard view of them. The advancing technology at the time – it was mostly single cell genomics, and the beginnings also of spatial genomics – suggested that now there would be a systematic way, a shared way of doing it across all cells in the human body, rather than in ways that were niche and bespoke and as a result didn't unify together. I will also say, and if you go back to our old white paper you will see some of it, that we had this feeling, because many of us were computational scientists by training, including both myself and Sarah Teichmann, that having a map like this, an atlas as we call it, a dataset of this magnitude and scale, would really allow us to build a model to understand cells. Today, we call them foundational models or foundation models. We knew that machine learning is hungry for these kinds of data, and that once you give it to machine learning, you get amazing things in return. We didn't know exactly what those things would be, and that has been playing out in front of our eyes as well in the last couple of years.
Spatial Omics
Eric Topol (03:30):
Well, that gets us to the topic you touched on, the second area I wanted to get into, which is extraordinary: spatial omics, which is related to the ability to do single-cell sequencing of cells and nuclei, and not just RNA but also DNA, methylation, and chromatin. I mean, this is incredible, that you can track the evolution of cancer, that the old phrase we would use, that a tumor is heterogeneous, is obsolete, because you can map every cell. I mean, this is just changing insights about so many mechanisms of health and disease, so this is one of the hottest areas of all of life science. It's an outgrowth of knowing about cells. How do you summarize this whole era of spatial omics?
Aviv Regev (04:26):
Yeah, so there's a beautiful sentence in In Search of Lost Time by Marcel Proust that I'm going to mess up in paraphrasing, but it is roughly that going on new journeys is not about actually going somewhere physically but about looking with new eyes, and I butchered the quote completely. [See below for the actual quote.] I think that is actually what single cell and then spatial genomics, or spatial omics more broadly, has given us. It's the ability to look at the same phenomena that we looked at all along, be it cancer or animal development or homeostasis in the lung or the way our brain works, but with new eyes, and these new eyes are not just seeing more of something we've seen before, but actually seeing things that we couldn't have realized were there before. It starts with finding cells we didn't know existed, but it's also the processes that these cells undergo, the causal mechanisms that actually control them, and, especially in the case of spatial genomics, the ways in which cells come together.
(05:43):
And so we often like to think about the cell because it's the unit of life, but in a multicellular organism we just as much have to think about tissues, and after that organs and systems and so on. In a tissue, you have this amazing orchestration of the interactions between different kinds of cells, and this happens in space and in time. As we're able to look at this, we see that in biology structure is often tightly associated with function. Just as the structure of a protein relates to the function of the protein, the way in which things are structured in a tissue, which cells are next to each other, what molecules they are expressing, how they are physically interacting, really tells us how they conduct the business of the tissue. When the tissue functions well, it is this multicellular circuit that performs this amazing thing known as homeostasis.
(06:36):
Everything changes and yet the tissue stays the same and functions, and in disease, of course, when these connections break or are not made in the right way, you end up with pathology, which is something that even historically we have always looked at at the level of the tissue. So now we can see it in a much better way, and as we see it in a better way, we resolve things better. Yes, we can understand better the mechanisms that underlie resistance to therapeutics. We can follow a temporal process like cancer as it unfortunately evolves. We can understand how autoimmune disease plays out, with many cells that are actually bent out of shape in their interactions. We can also follow magnificent things like how we start from a single cell, the fertilized egg, and become a 37.2 trillion cell marvel. These are all things that this ability to look in a different way allows us to do.
Eric Topol (07:34):
It's just extraordinary. I wrote at Ground Truths about this. I gave all the examples at that time, and now there are about 50 more in the cardiovascular arena, such as learning from single-cell analysis of the pineal gland the explanation of why people with heart failure have sleep disturbances. I mean, that's just one of so many new insights now; it's really just so remarkable. Now we get to the current revolution, and I wanted to read to you a quote that I have.
Digital Biology
Aviv Regev (08:16):
I should have prepared mine. I did it off the top of my head.
Eric Topol (08:20):
It's actually from Jensen Huang at NVIDIA about digital biology [at the top of the transcript] and how it changes the world, and how you're changing the world with AI and lab in the loop and all these things going on in the three years that you've been at Genentech. So maybe you can tell us about this revolution of AI and how you're embracing it to have AI get into positive feedback loops as to what experiment to do next from all the data that is generated.
Aviv Regev (08:55):
Yeah, so Jensen and NVIDIA are actually great partners for us at Genentech, so it's fun to contemplate any quote that comes from there. I'll actually say this has been in the making since the early 2010s. I like to reflect on 2012 because I think it was a remarkable year for what we're seeing right now in biology, specifically in biology and medicine. In 2012, we had the beginnings of really robust protocols for single cell genomics, the first generation of those; we had CRISPR happen as a method to actually edit cells, so we had the ability to manipulate systems in a much better way than we had before; and deep learning happened in the same year as well. Wasn't that a nice year? But sometimes people only realize the magnitude of a year like that years later. I think people realized the impact of deep learning first, then single cells, and then CRISPR.
(09:49):
So maybe in a slightly different order, but now we're really living through what that promise can deliver for us. It's still the early days of the delivery, but we are really seeing it. The thing to realize is that many, many of the problems that we try to solve in biomedicine are bigger than we would ever be able to perform experiments for or collect data on. Even if we had the genomes of all the people in the world, all billions and billions of them, that's just a smidge compared to all of the ways in which their common variants could combine in the next person. Even if we can perturb and perturb and perturb, we cannot do all of the combinations of perturbations even in one cell type, let alone the many different cell types that are out there. And even if we searched all the small molecules that are out there, there are 10 to the 60th that have drug-like properties; we can't assess all of them, even computationally, we can't assess numbers like that.
(10:52):
And so we have to somehow find a way around problems that are as big as that, and this is where the lab in the loop idea comes in and why AI is so material. AI is great at taking worlds, universes like that, that appear extremely big nominally, in basic numbers, but in fact have a lot of structure and constraint in them, so you can reduce them, and in this reduced latent space they actually become doable. You can search them, you can compute on them, you can do all sorts of things on them, and you can predict things that you wouldn't actually do in the real world. Biology is exceptionally good at lab sciences, where you actually have the ability to manipulate, and in biology in particular, you can manipulate at the causes because you have genetics. So when you put these two worlds together, you can actually go after these problems that appear too big but that are so important to understanding the causes of disease or devising the next drug.
(11:51):
You can iterate. So you start, say, with an experimental system or with all the data that you have already, say from an initiative like the Human Cell Atlas, and from this you generate your original model of how you think the world works. This you do with machine learning applied to previous data. Based on this model, you make predictions; those predictions suggest the next set of experiments, and you can ask the model to make the most optimized set of predictions for what you're trying to learn. Instead of just stopping there, and that's a critical point, you go back and you actually do an experiment, and you set up your experiments to be scaled like that, to be big rather than small. Sometimes it means you actually have to compromise on the quality of any individual part of the experiment, but you more than make up for that with quantity.
The A.I. Lab-in-the-Loop
(12:38):
So now you generate the next data, from which you can tell how well your algorithm actually predicted. Maybe the model didn't predict so well, but you know that because you have lab results, and you have more data in order to repeat the loop: train the model again, fit it again, make the next set of predictions, and iterate like this until you're satisfied. Not that you've tried all options, because that's not achievable, but that you can predict all the interesting options. That is really the basis of the idea, and it applies whether you're solving a general basic question in biology, or you're interested in understanding the mechanism of a disease, or you're trying to develop a therapeutic like a small molecule or a large molecule or a cell therapy. In all of these contexts, you can apply this virtuous loop, but to apply it, you have to change how you do things. You need algorithms that solve problems that are a little different from the ones they solved before, and you need lab experiments that are conducted differently than they were conducted before, and that's actually what we're trying to do.
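To make the shape of that loop concrete, here is a minimal sketch of a lab-in-the-loop cycle written as an active-learning program. Everything in it is an assumption for illustration: the random-forest model, the pick-by-uncertainty acquisition rule, and the `run_experiment` stand-in for the wet lab are placeholders, not Genentech's actual pipeline.

```python
# A toy "lab in the loop": train on what you have, predict over a space too
# big to measure, choose the most informative batch, measure it, repeat.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def run_experiment(candidates):
    # Stand-in for the scaled wet-lab step; here just a noisy toy function.
    return candidates.sum(axis=1) + rng.normal(0, 0.1, len(candidates))

# Start from data you already have (e.g., an atlas-scale dataset).
X = rng.uniform(0, 1, size=(50, 5))
y = run_experiment(X)

pool = rng.uniform(0, 1, size=(100_000, 5))  # the space too big to test

for round_ in range(5):
    # 1. (Re)train the model on everything measured so far.
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

    # 2. Predict over the pool; use disagreement across trees as uncertainty.
    per_tree = np.stack([tree.predict(pool) for tree in model.estimators_])
    mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)

    # 3. Ask the model for the batch it most wants measured
    #    (high predicted value plus high uncertainty).
    batch = np.argsort(mean + std)[-96:]  # e.g., one 96-well plate

    # 4. Go back to the lab: measure the batch, grow the training data.
    X = np.vstack([X, pool[batch]])
    y = np.concatenate([y, run_experiment(pool[batch])])
    pool = np.delete(pool, batch, axis=0)

    print(f"round {round_}: {len(y)} measurements, best so far {y.max():.3f}")
```

The point of the sketch is the control flow, not the model: each pass retrains on the lab results of the previous pass, which is the "repeat the loop, train the model again" step described above.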
Eric Topol (13:39):
Now I did find the quote, I just want to read it so we have it: “biology has the opportunity to be engineering, not science. When something becomes engineering, not science, it becomes exponentially improving. It can compound on the benefits of previous years.” Which is kind of a nice summary of what you just described. Now as we go forward, you mentioned the origin of deep learning back at the same time as CRISPR and so many things happening, and this convergence continues: transformer models, obviously one that's very well known, AlphaFold, AlphaFold2. But you work especially in antibodies, and if I remember correctly from one of your presentations, there are 20 to the 32nd power possible antibody sequences, something like that, so it's right up there with the 10 to the 60th small molecules. How do transformer models enhance your work, your discovery efforts?
Aviv Regev (14:46):
And not just in antibodies; I'll give you three brief examples. So absolutely in antibodies: it's an example where you have a very large space and you can treat it as a language, and transformers are one component of it. There are other related and unrelated models that you would use. For example, diffusion-based models are very useful. They're the kind people are used to when they use DALL-E or Midjourney and so on to make these weird pictures; think about that, but not as a picture: now you're thinking about a three-dimensional object, which is actually an antibody, a molecule. You also mentioned AlphaFold and AlphaFold2, which are great advances with some components related to transformers and some otherwise, but those were done as general purpose machines for proteins, and antibodies are not general purpose proteins. They're antibodies, and therapeutic antibodies are even further constrained.
(15:37):
Antibodies also really thrive on diversity, especially as therapeutics and also in our body. Many of these first models that were done for protein structure really focused on using conservation as an evolutionary signal, comparison across species, in order to learn the model that predicts the structure, but antibodies have these regions, of course, that don't repeat ever. They're special, they're diverse, and so you need to do a lot of things in the process in order to make the model fit in the best possible way. And then again, this loop really comes in. You have data from many, many historical antibodies. You use that to train the model. You use that model in order to make particular predictions for antibodies that you either want to generate de novo or that you want to optimize for particular properties. You make those actually in the lab, and in this way gradually your models become better and better at this task with antibodies.
(16:36):
I do want to say this is not just about antibodies. So for example, we develop cancer vaccines. These are personalized vaccines, and there is a component in making a personalized cancer vaccine, which is choosing which antigens you would actually encode into the vaccine, and transformers today play a crucial role in predicting what are good neoantigens that will get presented to the immune system. You sometimes want to generate a regulatory sequence, because you want to generate a better AAV-like molecule or to engineer something in a cell therapy, so you want to put in a cis-regulatory sequence that controls gene expression. Actually, personally for me, this was the first project where I used a transformer, which we started years ago and published a couple of years ago, where we learned a general model that can predict expression in a particular system. Literally, you throw a sequence at that model now and it will predict how much expression it would drive. So these models are very powerful. They are not the be-all and end-all of all the problems that we have, but they are fantastically useful, especially for molecular therapeutics.
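For readers who want to picture that cis-regulatory example, here is a minimal sketch, under stated assumptions, of a transformer that takes a DNA sequence and regresses the expression it would drive. The architecture, sizes, and base-to-token encoding are illustrative choices, not the published model.

```python
# A toy sequence-to-expression transformer: tokenize bases, self-attend,
# pool, and regress a single expression value.
import torch
import torch.nn as nn

class ExpressionPredictor(nn.Module):
    def __init__(self, seq_len=200, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(4, d_model)  # A, C, G, T -> vectors
        self.pos = nn.Parameter(torch.zeros(seq_len, d_model))  # positions
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 1)  # scalar expression level

    def forward(self, tokens):  # tokens: (batch, seq_len) integer bases
        h = self.embed(tokens) + self.pos  # add positional information
        h = self.encoder(h)                # self-attention over the bases
        return self.head(h.mean(dim=1)).squeeze(-1)  # pool, then regress

# "Throw a sequence at the model": encode the bases and predict.
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}
seq = "ACGT" * 50  # a toy 200-bp regulatory sequence
tokens = torch.tensor([[BASE[b] for b in seq]])

model = ExpressionPredictor()
print(model(tokens).item())  # untrained, so meaningless until fit to data
```

In practice such a model would be trained on measured sequence-expression pairs (for instance from a massively parallel reporter assay) before its predictions mean anything; the sketch only shows the input-to-output shape of the idea.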
Good Trouble: Hallucinations
Eric Topol (17:48):
Well, one thing that has been an outgrowth of this is to actually take advantage of the hallucinations or confabulations of molecules, for example, the work of David Baker, whom I'm sure you know well, at the University of Washington's Institute for Protein Design. We are now seeing molecules, antibodies, proteins that don't exist in nature, and all the things that are dubbed bad in GPT-4 and ChatGPT may actually help in discovery in life science and biomedicine. Can you comment about that?
Aviv Regev (18:29):
Yeah, I think much more broadly about hallucinations, and what you want to think about is that something like constrained hallucination is how we're creative, right? Often people talk about hallucinations and they shudder at it. It sounds to them insane, because if you think about, say, a large language model as a search tool and it starts inventing papers that don't exist, you might be like, I don't like that. But in reality, if it invents something meaningful that doesn't exist, I love that. So that constrained hallucination, I'm just using that colloquially, is a great property if it's constrained and harnessed in the right way. That's creativity, and creativity is very material for what we do. So yes, absolutely, in what we call the de novo domain, making new things that don't exist. This generative process is the heart of drug discovery. We make molecules that didn't exist before.
(19:22):
They have to be imagined out of something. They can't just be a thing that was there already, and that's true for many different kinds of therapeutic molecules and for other purposes as well, but of course they still have to function in an effective way in the real world. So that's where you want them to be constrained in some way, and that's what you want out of the model. I also want to say one of the areas that personally, and I think for the field as a whole, I find the most exciting and still underused is the capacity of these models to hallucinate for us, or help us with the creative endeavor of identifying the causes of processes, which is very different from the generative process of making molecules. Thinking about the web of interactions that exists inside a cell and between cells and drives disease processes is very hard for us: reasoning through it, collecting all the bits of information, and filling in the blanks. That filling in of the blanks is our creativity; that's what generates the next hypothesis for us. I'm very excited about that process and about that prospect, and I think that's where the hallucination of models might end up proving to be particularly impressive.
A.I. Accelerated Drug Discovery
Eric Topol (20:35):
Yeah. Now obviously the field of using AI to accelerate drug discovery is extremely hot, just as we were talking about with spatial omics. Do you think that is warranted? I mean, you've made a big bet on that, you and your folks there at Genentech, of course, and so many others; it's a very crowded space, with so many big pharma companies partnering on AI. What do you see about this acceleration? Is it really going to pay off? Is it going to bear fruit? We've already seen some drugs, of course, that are outgrowths, like baricitinib in the pandemic, and others, but what are your expectations? I know you're not one to get into any hyperbole, so I'm really curious as to what you think is the future path.
Aviv Regev (21:33):
So definitely my hypothesis is that this will be highly, highly impactful. I think it has the potential to be as impactful as molecular biology was for drug discovery in the 1970s and 1980s. We still live that impact; we now take it for granted. But of course that's a hypothesis. I also believe that this is a long game and a deep investment, meaning decorating what you currently do with some additions from right and left is not going to be enough. This lab in the loop requires deep work, working at the heart of how you do science, not as an add-on or in addition to or yet another variant on what has become a pretty established approach to how things are done. That is where I think the main distinction would be, and that requires both the length of the investment, the effort to invest, and also the willingness to really go all in and all out.
(22:36):
And that takes time. The real risk is the hype. The enthusiasm now, compared to say 2020, is actually risky for us, because people get very enthusiastic and then it doesn't pay off immediately. No, these iterations of a lab in the loop take time and effort and a lot of changes, and at first algorithms often fail before they succeed. You have to iterate them. So that is actually one of the biggest risks, that people would be like, but I tried it, it didn't work, this was just some over-hyped thing, I'm walking away and doing it the old way. That's where we actually have to keep at it, but also keep our expectations high in magnitude. I think it will actually deliver, but with the understanding that it's a long investment, and that unless you do it deeply, it's not going to deliver the goods.
Eric Topol (23:32):
I think this point warrants emphasis, because the success we've already seen has not necessarily been in discovery and preliminary validation of new molecules, but rather in data mining and repurposing, which is a much easier, quicker route to go. But also there are so many nodes along the path whereby AI can make a difference, even in clinical trials, in synthetic efforts to project how a clinical trial will turn out, and in being able to do toxicity screens without preclinical animal work. There are just so many aspects of this that AI is suited to rev up, but the one that you're working on, of course, is the kind of main agenda, and I think you framed it so carefully that we have to be patient here, that it has a chance to be so transformative. Now, you touched on the parallels to things like DALL-E and Midjourney and large language models. A lot of our listeners will be thinking only of ChatGPT or GPT-4 or others. This is what you work on, the language of life. This is not the text of having a conversation with a chatbot. Do you think that as we go forward, we have to rename these models, because they're known today as language models? Or do you think that, hey, you know what, this is another language. This is a language that life science and biomedicine works with. How do you frame it all?
Large Non-Human Language Models
Aviv Regev (25:18):
First of all, they absolutely can remain large language models because these are languages, and that's not even a new insight. People have treated biological sequences, for example, in the past too, using language models. The language models were just not as great as the ones that we have right now and the data that were available to train models in the past were not as amazing as what we have right now. So often these are really the shifts. We also actually should pay respect to human language. Human language encodes a tremendous amount of our current scientific knowledge and even language models of human language are tremendously important for this scientific endeavor that I've just described. On top of them come language models of non-human language such as the language of DNA or the language of protein sequences, which are also tremendously important as well as many other generative models, representation learning, and other approaches for machine learning that are material for handling the different kinds of data and questions that we have.
(26:25):
It is not a single thing. What large language models and especially ChatGPT did, and this is an enormous favor for which I am very grateful, is that I think they actually convinced people of the power. That conviction is extremely important when you're solving a difficult problem. If you feel that there's a way to get there, you're going to behave differently than if you're like, nothing will ever come out of it. When people experienced ChatGPT in their daily lives, in basic things, doing things that felt to them so human, this feeling overrode all the intellectual part of things. It's better than the thinking, and then they're like, in that case, this could actually play out in my other things as well. That, I think, was actually materially important and was a substantial moment, and we could really feel it. I could feel it in my interactions with people, before and after, how their thinking shifted. Even though we were on this journey from before.
Aviv Regev (27:30):
We were. It felt different.
Eric Topol (27:32):
Right, the awareness of hundreds of millions of people suddenly at the end of November 2022. And you, of course, had gone to Genentech a couple of years before that, and you already knew this was on the move and you were redesigning the research at Genentech.
Aviv Regev (27:55):
Yes, we changed things well before, but it definitely helps; how people embrace and engage feels different because they've seen something like that demonstrated in front of them in a way that felt very personal, that wasn't about work. It's also about work, but it's about everything. That was very material, actually, and I am very grateful for that, as well as for the tool itself and the many other things that it allows us to do. But as you said, we had by then been well on our way, and it was actually a fun moment for that reason as well.
Eric Topol (28:32):
So one of the things I'm curious about is that we don't think about the humans enough; we're talking about the models and the automation, but you undoubtedly have a large team of computer scientists and life scientists. How do you get them to interact? They're, of course, in many respects in different orbits, and the more they interact, the more synergy will come out of that. What is your recipe for fostering their crosstalk?
Aviv Regev (29:09):
Yeah, this is a fantastic question. I think the future is in figuring out the human question, always, above all, and usually when I draw the loop, like on a slide, we always put the people in the center of that loop. It's very material to us, and I will highlight a few points. One crucial thing that we've done is that we made sure that we have enough critical mass across the board, and it played out in different ways. For example, we built a new computational organization, gRED Computational Sciences, out of what was before many different parts rather than one consolidated whole. Within that we also built a very strong AI and machine learning team, which we didn't have as much before. So some of it was new people that we didn't have before, but some of it was also giving it its own identity.
(29:56):
So it is just as much, not more, but also not less, just as much of a pillar, just as much of a driver as our biology is, as our chemistry and molecule making are, as our clinical work is. This equal footing is essential and extremely important. The second important point is you really have to think about how you do your projects. For example, when we acquired Prescient, at the time they were three people, a tiny, tiny company, and they became our machine learning for drug discovery team. It's not tiny anymore. But when we acquired them, we also invested in our antibody engineering, so that we could do antibody engineering in a lab in the loop, which is not how we did it before, which meant we invested in our experiments in a different way. We built a department for cell and tissue genomics so we can conduct biology experiments in a different way as well.
(30:46):
So we changed our experiments, not just our computation. The third point, which I think is really material: I often say, when I'm asked, that everyone should feel very comfortable talking with an accent. We don't expect our computational scientists to start behaving like they were biology trained in a typical way all along, or chemists trained in a typical way all along, and by the same token, we don't expect our biologists to just wholeheartedly embrace one way of thinking and completely relinquish another, not at all. To the contrary, we actually think all these accents are a huge strength, because the computer scientist thinks about biology or about chemistry or about medical work differently than a medical doctor or a chemist or a biologist would, because a biologist thinks about a model differently, and sometimes that is the moment of brilliance that defines the problem and the model in the most impactful way.
(31:48):
We want all of that, and that requires both this equal footing and this willingness to think beyond your domain, not just hand things over, but actually also be there in this other area where you're not the expert. Being the odd one out, talking with an accent, can actually be super beneficial. Plus it's a lot of fun. We're all scientists, we all love learning new things. So those are some of the features of how we try to build that world, and you kind of do it in the same way. You iterate, you try it out, you see how it works, and you change things. It's not all fixed and set in stone, because no one actually wrote a recipe, or at least I haven't found that cookbook yet. You kind of invent it as you go.
Eric Topol (32:28):
That's terrific. Well, there's so much excitement in this convergence of life science and the digital biology we've been talking about. Have I missed anything? We covered the Human Cell Atlas, spatial omics, the lab in the loop. Is there anything that I didn't touch on that you find important?
Aviv Regev (32:49):
There's something we didn't mention, and it's the reason I come to work every day, and everyone I work with here, and I actually think also the people of the Human Cell Atlas: we didn't really talk about the patients.
(33:00):
There's so much, and I think you and I share this perspective, there's so much trepidation around some of these new methods, and we understand why; we all saw that technology sometimes can play out in ways that have really unintended consequences. But there's also so much hope for patients. This is what drives people to do this work every day, this really difficult work that tends not to work out much more frequently than it works out, now that we're trying to move that needle in a substantial way. It's the patients, and that gives this human side to all of it. I think it's really important to remember. It also makes us very responsible. We look at things very responsibly when we do this work, but it also gives us this feeling in our hearts that is really unbeatable, that you're doing it for something good.
Eric Topol (33:52):
I think that emphasis couldn't be more appropriate. One of the things I think about all the time is that, because we're moving into this, if you will, hyper-accelerated phase of discovery over the years ahead, with this just unparalleled convergence of tools to work with, somebody could be cured of a condition, or somebody with an autoimmune disease could have tolerogenicity promoted so they wouldn't have the autoimmune disease, if they could just sit tight and wait a few years before this comes, as opposed to just missing out because it takes time to get this all to gel. So I'm glad you brought that up, Aviv, because I do think that's what it's all about, and that's why we're cheering for your work and that of so many others, to get it done, to get across the goal line, because there are these 10,000 diseases out there and there are so many unmet needs across them, where we don't have treatments that are very effective, or they have all sorts of horrible side effects, and we don't have cures. And we've got all the things now, as we've mentioned in this conversation, whether it's genome editing or the ability to process data at massive scale in a way that never could have been conceived some years ago. Let's hope that we help the patients. And, go ahead.
Aviv Regev (35:25):
I found the Proust quote, if you want it recorded correctly.
Eric Topol (35:29):
Yeah, good.
Aviv Regev (35:30):
It's much longer than what I did. It says, “the only true voyage, the only bath in the Fountain of Youth would be not to visit strange lands but to possess other eyes, to see the universe through the eyes of another, of a hundred others, to see the hundred universes that each of them sees, that each of them is; and this we do, with great artists; with artists like these we do fly from star to star.”—Marcel Proust
Eric Topol (35:57):
I love that and what a wonderful way to close our conversation today. Aviv, I look forward to more conversations with you. You are an unbelievable gem. Thanks so much for joining today.
Aviv Regev (36:10):
Thank you so much.
*************************************
Thanks for listening to or reading this Ground Truths podcast.
Please share if you found it of interest.
The Ground Truths newsletters and podcasts are all free, open-access, without ads.
Voluntary paid subscriptions all go to support Scripps Research. Many thanks for that—they greatly helped fund our summer internship programs for 2023 and 2024.
Note: you can select preferences to receive emails about newsletters, podcasts, or all; I don’t want to bother you with an email for content that you’re not interested in.
Comments are welcome from all subscribers.