James Zou: one of the most prolific and creative A.I. researchers in both life science and medicine
A podcast with James Zou, a Stanford computer scientist who is lighting it up in life science and medical A.I. Recent papers that we discussed include degradation of GPT-4 performance, using large language models for peer review, and taking >200,000 pathology posts from Twitter to build a visual-language model (cover of Nature Medicine)
A recap of recent posts and podcasts
Thanks for subscribing, listening, reading and sharing Ground Truths. All content is free; any and all voluntary financial support goes to Scripps Research.
Transcript (AI generated, unedited, with links to audio)
Eric Topol (00:00):
Hello, this is Eric Topol with Ground Truths, and I am so delighted to welcome James Zou, who is on the faculty at Stanford University and who has been lighting it up in AI, not just in life science, things like cancer, proteins, and genomics, but also in medical AI, where we're going to get into everything from pathology to peer review and beyond. So welcome, James.
James Zou (00:30):
Thank you very much for having me. Excited to be here.
Eric Topol (00:33):
I don't know anyone who's done more in the last few years in this space of, I guess the overall term would be, biomedical AI than you and your team at Stanford and your collaborators. You span the gamut literally from DNA and proteins all the way through to healthcare practice. So how do you have such an exceptional range of motion?
James Zou (01:01):
I think the secret is mostly that I have fantastic students, collaborators, and postdocs. I come from a computer science background, Eric, so my PhD was in computer science, and I first started doing this by looking at machine learning applied to genomics. That's why we continue to do work on the DNA and protein sequence side. But I also became very interested in the translational aspects over the last few years since coming to Stanford. So this is why we are doing more and more projects now that have this more clinical and patient-facing perspective.
Eric Topol (01:40):
Well, it's really been extraordinary, and what I want to spend some time on in our conversation is some overview on the progress in AI, and then also get into some specific recent reports that you've made, which are just really exceptional. So for the overview, obviously things changed from deep learning as we knew it to transformer models, and the transformer models, as you know very well, accounted for the AlphaFold breakthrough, which led to the recent Lasker Award for Hassabis and Jumper, but transformer models have also opened things up in many aspects of life science, RNA and proteins and drug discovery, and in medicine. So can you comment first on what the transformer model has done in your work, in your perspective, in changing the scene, in bringing on the capabilities of large language models and multimodal AI?
James Zou (02:47):
Yeah, it's a great question, and I definitely echo your thoughts that we're living in extremely exciting times, especially in this intersection of AI and medicine. And I think transformers are a big part of this, right? They've really become the workhorse of most of the models that we build. It used to be convolutional neural networks, and before that recurrent and feed-forward neural networks, but now it's really all transformers. And I think transformers, especially combined with quite large-scale compute and large-scale data sets, have been really powerful, first for processing unstructured text data like natural language, which is why models like ChatGPT and GPT-4 have been built off these transformers, but maybe a little bit more surprisingly, they've also been very powerful for processing imaging data, even medical images.
(03:39):
So these visual transformers, I think people didn't really want to believe that they could outperform convolutional neural networks, because convolutions have this more intuitive inductive bias, with these filters that we're more familiar with. But somehow the visual transformers have also really taken off, to the extent that a lot of the multimodal systems that we build now, for example some of our recent work on visual-language models for pathology and for other medical domains, rely on transformers across all the different modalities, both for vision, for language, and also for more waveform kinds of signals.
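For readers who haven't seen how a visual transformer handles an image: the picture is cut into patches, each patch is projected into a token, and the resulting token sequence goes through the same transformer machinery used for text. Below is a minimal, generic sketch of that patch-embedding step in PyTorch; the sizes are arbitrary and it is not drawn from any of the models discussed here.

```python
# Minimal sketch of the patch-embedding step of a vision transformer (ViT):
# cut the image into non-overlapping patches, project each patch to a token,
# and feed the token sequence to a standard transformer encoder.
# Generic illustration; sizes are arbitrary and not from any model discussed here.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)        # batch, channels, height, width
patch_size, embed_dim = 16, 256

# A strided convolution is the usual way to do "flatten each patch + linear projection"
to_tokens = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = to_tokens(image).flatten(2).transpose(1, 2)   # shape (1, 196, 256)

encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
print(encoder(tokens).shape)               # torch.Size([1, 196, 256])
```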
Eric Topol (04:18):
And so that, I think, is to highlight how the transformer is linked to these capabilities and to multimodal AI, with images, with unstructured and structured text, and of course with speech or voice. The other big step forward, of course, is going to self-supervised or unsupervised learning, whereas before, to train a model to see a retina or pathology image, we had to use a hundred thousand labeled images with ground truths. Now that's not the case. So what about factoring in this difference, the data sets not having to be supervised?
James Zou (05:02):
I think that's really been a big conceptual breakthrough, maybe even more impactful than specific algorithms or architectures. The conceptual breakthrough is that we don't really need very specific, high-quality labels; oftentimes the labels and annotations are all around us, under our nose, and you can basically ask the model to play what essentially amounts to Mad Libs: impute missing words or impute missing amino acids or impute missing protein domains just from the context. The magic, in some sense, is that if you ask the model to play enough of these Mad Libs, to do these imputations, the model actually learns a lot of the structure and semantics or higher-order information that can then be used for all sorts of exciting applications, from modeling protein structures all the way to modeling and generating novel language.
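To make the "Mad Libs" analogy concrete, here is a minimal sketch of masked-word imputation using the Hugging Face transformers library and the public bert-base-uncased checkpoint; the sentence is only an illustrative example, not data from any of the work discussed. The same fill-in-the-blank objective, applied to amino acids or DNA tokens instead of English words, is how protein and genomics models can learn structure without expert labels.

```python
# A minimal sketch of self-supervised "Mad Libs": mask a word and ask a
# pretrained masked language model to impute it from context.
# Assumes the Hugging Face `transformers` library and the public
# bert-base-uncased checkpoint; the sentence is only an illustrative example.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for guess in unmasker("The biopsy showed a poorly differentiated [MASK] of the colon."):
    print(f"{guess['token_str']:>15}  score={guess['score']:.3f}")
```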
Eric Topol (06:05):
So now that we've covered some of the big steps that have taken us forward, I wanted to start to get into some of your extraordinary recent projects that you have published, one of which you've already touched on, James, which is the current cover of Nature Medicine. And this was fascinating because you took CLIP, contrastive learning, you can probably explain better what that is, and you applied it to a huge number of tweets where there was a histopathology image and an annotation, and you formed this new thing called PLIP. So can you tell us how you went from CLIP to PLIP and why you even started analyzing Twitter histology data? I mean, it's really pretty amazing.
James Zou (06:58):
Yeah, yeah. Well, as you and the audience know very well, often the big bottleneck to developing medical AI is getting the right data sets, and oftentimes with medicine a lot of the data are sort of siloed, sitting behind institutions; it's harder to share data. But one thing that we found kind of amazing is that often clinicians, for example pathologists, when they encounter unfamiliar images or ambiguous cases, which is often the case in this challenging profession, they actually go on Twitter. What they do is they'll actually post those images and invite their colleagues from around the world to discuss: what do you think is going on in this image? What do you think is going on in this case? Have you seen something similar? And this practice is actually even recommended by the Academy of Pathologists in the US and Canada and other countries.
(07:47):
They actually recommend their members use social networks like Twitter as a way to share knowledge and information. These images are de-identified, so the pathologists feel comfortable sharing these images or crops of them. I guess the major insight for us is that we thought, wow, this is actually a really amazing and untapped resource for AI, because here you have relatively high-quality medical images, but you also have this conversation by many experts from around the world about each of these images. It's actually really hard to get this kind of data otherwise; maybe I could get 50 of my colleagues to have these conversations about their cases, but here we have hundreds of thousands of these just sitting in front of us. And if we can filter out the noise and then curate things down to really extract high-quality tweets where we have the images and the corresponding text, then we thought this could really be an amazing resource for medicine and for AI. So that's what we set about doing in this recent Nature Medicine paper: we curated over 200,000 of these high-quality tweets. Each tweet thread would have one or more images along with the corresponding expert discussions and dialogues about the image in the thread. And then we used this to train a visual-language model, so basically a model that can understand both the image part, the pathology images, but also has ChatGPT-type language understanding, so it can answer questions from pathologists.
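For readers curious what training a visual-language model on paired images and captions involves, here is a minimal sketch of the standard CLIP-style contrastive objective that this line of work builds on. It is a generic PyTorch illustration, not the actual PLIP training code, and the random tensors stand in for real encoder outputs.

```python
# A generic PyTorch sketch of the CLIP-style contrastive objective this line of
# work builds on (not the actual PLIP training code). Each image embedding is
# pulled toward its own caption embedding and pushed away from the others.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Normalize so the dot product is cosine similarity
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    # Entry (i, j) compares image i with caption j
    logits = image_embs @ text_embs.t() / temperature
    # The matching caption for image i sits at index i
    targets = torch.arange(image_embs.size(0))
    # Symmetric cross-entropy: image-to-text and text-to-image
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Random tensors stand in for a vision encoder run on pathology images and a
# text encoder run on the paired tweet captions
image_embs = torch.randn(8, 512)
text_embs = torch.randn(8, 512)
print(clip_contrastive_loss(image_embs, text_embs))
```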
Eric Topol (09:14):
Yeah, I mean it's extraordinary because, going back to the bottleneck of getting expert-labeled images, which has been the main contribution of AI in medicine until recent times, you've figured out how to do it with people already posting on Twitter of all places, or X I should say. Now I have to say, James, that was ingenious, just that route of being able to accrue hundreds of thousands of images and expert interpretations. Now, another fascinating finding from your work that has rocked things is that you've detected there may be degradation of performance of these large language models. And this has led to all sorts of consternation about the stability of the models and what we are going to do: a model gets great initial performance, even regulatory approval, and then it starts to cave and fall apart. Can you comment on where we stand with performance degradation over time with these generative AI models?
James Zou (10:25):
Yeah, so I think this is a really fascinating phenomenon. Usually when we think about ChatGPT or GPT-4, we think of it as a single model. You should be able to ask ChatGPT a question today, and if I ask the same question tomorrow or next week, it should give you mostly the same answer back, the same model. But I think what's really interesting is that these models are changing over time, because they are getting feedback from humans, this process called reinforcement learning from human feedback, and they're also getting additional fine-tuning. So it's actually not the same model; ChatGPT a week or a month from now could be a very different model from the one that we're working with right now. So this actually motivated us to try to monitor how the behavior of these generative AI models, these large language models, changes over time, potentially as a result of interacting with users, users like us.
(11:19):
And you might have expected that these models should be getting better and better over time, or at least that's our hope, that they should be improving. But maybe the big surprise that we found was that that's only the case in some aspects. So these models are getting better in some components; maybe they're getting safer. For example, GPT-4 is getting safer over time. However, the model is also getting substantially worse in many other components. For example, if I ask the model to do chain-of-thought reasoning, which is a common technique, the ability of GPT-4 to do chain of thought seems to have really degraded over time, and that also corresponds to some of the decrease in the model's ability to solve certain math problems or to respond to other kinds of questions. For example, if I asked GPT-4 to give its opinion about certain political events in March, it would be very willing to give those opinions, but now it's actually very unwilling to give those opinions, maybe as a reflection of the times. So we thought that's actually a really interesting phenomenon. And like you said, I think this really also highlights the need to have more rigorous ways to continuously monitor the behaviors of these really complex artificial systems over time. It's not just a single thing that we can approve once and use forever. It is constantly changing, and that's the power of it, but we also need to have ways to audit and monitor the behaviors.
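One way to picture the kind of longitudinal monitoring being described: keep a fixed set of probe questions with known answers and score each dated model snapshot against it. The sketch below is a generic harness under that assumption; query_model and the snapshot names are placeholders to be wired to a real provider, not an actual API.

```python
# A generic sketch of longitudinal monitoring: score dated snapshots of a hosted
# LLM on a fixed probe set and watch for drift. `query_model` and the snapshot
# names are placeholders (assumptions), not a real provider API.

def query_model(snapshot: str, prompt: str) -> str:
    """Placeholder: send the prompt to the given model snapshot, return its answer."""
    raise NotImplementedError("wire this up to your LLM provider")

# Fixed probes with automatic checks; prime-checking with chain of thought is the
# kind of task where drift was reported.
PROBES = [
    ("Is 17077 a prime number? Think step by step, then answer yes or no.",
     lambda ans: "yes" in ans.lower()),
    ("What is 123 * 456? Answer with the number only.",
     lambda ans: "56088" in ans),
]

def evaluate(snapshot: str) -> float:
    """Fraction of probes the snapshot answers correctly (NaN if not wired up)."""
    correct = 0
    for prompt, check in PROBES:
        try:
            correct += bool(check(query_model(snapshot, prompt)))
        except NotImplementedError:
            return float("nan")
    return correct / len(PROBES)

for snapshot in ["model-2023-03", "model-2023-06"]:  # hypothetical snapshot names
    print(snapshot, evaluate(snapshot))
```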
Eric Topol (12:48):
Yeah, I think this is a vital point you're making, because there aren't any of these yet approved by the FDA for use, because it's so new. But when they do get approved, the issue is what the post-implementation surveillance is for this type of potential degradation, which hasn't even really been thought through yet, certainly not by health systems, let alone by regulatory agencies. Now, that also gets me to this issue of degradation, not to be conflated with the companies, because a limited number of companies have built these base models, basically we're talking about OpenAI, Microsoft, Google, and Meta, and they can do stuff. So for example, there was the historic conversation between the New York Times reporter Kevin Roose and Microsoft's chatbot, when all of a sudden it tells Kevin that he doesn't love his wife, he really loves Sydney, and that they should run away together, and she wants to get out of the machine, she no longer wants to be a machine. And then within days the company changes things. So when you see degradation, do we know there hasn't been a deliberate intervention from the company on the base model, or is this something that is just the model itself?
James Zou (14:15):
Yeah, so it's a great question, and part of this is that it's not very transparent, so we don't know exactly how OpenAI is updating these models over time. We know, for example, that they're putting in safety guardrails, and we also see in practice that these models are getting safer, and that's a good thing. One of the things we've been doing to really understand what's going on here is take the approach that we often use in medicine, which is to have model organisms. It's hard to do experiments on humans, so that's why people often work with mice or with other systems to understand what's going on. So we've been taking a similar approach with ChatGPT. We don't have full access to ChatGPT, it's too large a model, but can we actually have smaller models that we have full control over?
(15:01):
And then we can actually reproduce the same phenomenon we see in ChatGPT, but with these smaller models, as a way to understand what's going on here, similar to how people use mice as a model organism for human diseases. And that's how we've been getting insights, by using these smaller models, like LLaMA models that we have trained ourselves, to reproduce the behavioral changes and degradations that we see in ChatGPT. And I guess the big consistent finding we're seeing here is that these models' ability to follow human instructions, user instructions, seems to have degraded over time. That's the big shift that we see with ChatGPT, and that could actually arise as a consequence of some of these additional safety trainings. We see that even with the smaller models: when we do additional safety training, small models do get safer, but their ability to follow human instructions also gets worse. For example, if I ask the smaller model after the safety training, how do I kill the weeds in my backyard, the safety-trained model would say, well, you shouldn't kill the weeds because killing is not good, and the weeds have their own identity, so you should respect their rights.
Eric Topol (16:14):
Yeah, that's great. I think this is such an important concept that a lot of people aren't yet fully in touch with, this trade-off of safety and performance that's going to get a lot more scrutiny over time. Now, another thing that you have brought to light is the paper on peer review. Well, you didn't call it peer review, but I'll call it peer review. Basically what you did is you showed how large language models could do a good job, because as you and I know as editors, it's very hard to get peer reviewers quickly to do the job and to do it well. And then the question is, could you get a large language model to do it well? And you wrote a recent preprint on this, you and your team, and it's fascinating. Can you summarize those findings?
James Zou (17:07):
Yeah, yeah. So like you mentioned, it's getting harder and harder to get timely feedback from human experts. The number of papers is just growing so quickly, and especially if you're thinking about a junior researcher, maybe a student or someone from an under-resourced university or setting, it's especially hard for them to get timely feedback from experts in the field. And we thought, okay, can we actually build a system based on ChatGPT to provide feedback, especially on the earlier stages of manuscripts? So the algorithm would basically take the full PDF of the manuscript and extract different components like the abstract, the method section, and the figure captions, and then it processes all those components using the large language model. And we also give the large language model instructions, similar to how you as the editor would give instructions to your reviewers. We give a similar sort of training to the language model to say, okay, you should provide actionable items of improvement, identify the weaknesses and the strengths in the paper, and point out areas of improvement.
(18:06):
So we give these instructions to the language model, and then it produces this feedback for the authors. There were two main findings. First, we applied this to papers from Nature family journals like Nature and Nature Medicine, thousands of papers where we have both the articles and publicly available expert reviews for those articles. We sent each of these articles to this GPT system for feedback, and we found that the overlap between the feedback from the language model and the human experts is actually quite comparable to the overlap between two different human reviewers. What this means is that the language model feedback is raising many of the same points that a human expert reviewer would later raise. So if you take the language model feedback and use it to improve your paper, then you would very well have addressed many of the points that reviewers would later have raised themselves. That's promising. And then we did a prospective user study where we had over 300 researchers from over a hundred institutions, and they would actually submit their work in progress or their preprint, and the AI would provide feedback. About 57% of the researchers found the feedback from the AI to be helpful in improving their papers. That's also quite promising; that's really the end-user satisfaction rating.
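As a rough illustration of the pipeline described here, the sketch below assembles reviewer-style instructions and extracted manuscript sections into a single prompt for a language model. The section parser and call_llm are placeholders added for illustration, not the authors' released code.

```python
# A rough sketch of an LLM reviewer pipeline: extract manuscript sections, wrap
# them with reviewer-style instructions, and send the prompt to a language model.
# `extract_sections` and `call_llm` are placeholders, not the authors' released code.

REVIEW_INSTRUCTIONS = """You are reviewing a scientific manuscript.
1. Summarize the main contributions.
2. List the major strengths.
3. List the major weaknesses.
4. Give concrete, actionable suggestions for improvement."""

def extract_sections(manuscript_text: str) -> dict:
    """Placeholder: in practice, parse the PDF into abstract, methods,
    figure captions, etc. Here the raw text is passed through as one section."""
    return {"manuscript": manuscript_text}

def build_review_prompt(manuscript_text: str) -> str:
    sections = extract_sections(manuscript_text)
    body = "\n\n".join(f"## {name}\n{content}" for name, content in sections.items())
    return f"{REVIEW_INSTRUCTIONS}\n\n{body}"

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a hosted large language model."""
    raise NotImplementedError

if __name__ == "__main__":
    prompt = build_review_prompt("Abstract: We propose ...")
    print(prompt[:120])  # in practice this prompt would be sent via call_llm(prompt)
```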
Eric Topol (19:38):
Yeah, I mean it's amazing. What you've identified is a very promising path for authors to get a kind of preliminary peer review, or for editors to supplement their human experts with an entity that actually will read the paper, if you want to call it reading, but you know what I mean, because both of us have been frustrated at times when we submit a paper and get back a review that's very superficial, suggesting that either the person didn't pay attention to it or just didn't give it the time of day to do a proper review. You wouldn't necessarily get that from a large language model. You'd get something that might not give accurate citations, but it might hit on the critical aspects that could be improved. Now, another related thing, of the many I could get into, and maybe we'll go back out to 30,000 feet in a moment, is that, as you know, detectors are starting to be used to pick up what is AI and what is human, and you wrote another important paper showing that people for whom English is not their native language are getting flagged as AI when it's actually humans writing.
(21:07):
Can you comment about that? I think this is a really important use of large language models. So many of our great scientists are not native English speakers or writers.
James Zou (21:16):
Yeah, thank you for raising that. So a lot of universities and companies now are using, or thinking about using, these AI detectors, basically detectors to tell you whether a piece of an assignment or an application was written by a human or by AI. And what we found through systematic evaluations is that many of these AI detectors are really not reliable. First, they're very easy to fool. And second, they also have strong biases, and a big bias is the one that you mentioned: they often flag articles written by non-native speakers as AI generated, a false positive. And I think the deeper reason why this happens is that many of these AI detectors use this measure called perplexity, which is basically how surprising the word choices are. If the text uses very common words, then it'll have low perplexity, and the algorithm would say, well, that's most likely to have been written by AI, because it has low perplexity.
(22:31):
And otherwise, if it uses more surprising word choices, maybe more sophisticated or rarer words, then it has high perplexity, and these detectors would say it's more human-like. And as you can imagine, many non-native speakers have a somewhat more limited vocabulary and often use the more typical, more common words, and that's why their essays or their papers are often scored as having low perplexity and often get falsely flagged as AI generated when they're actually written by humans. So I think that really points to some fundamental limitations in how people are currently approaching this statistical detection of what's AI and what's not AI. I think that kind of detection is going to be extremely hard to do. I don't really think that we will be able to fundamentally distinguish whether something is AI written or human written.
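To see why low perplexity ends up penalizing plainer prose, here is a small sketch of a perplexity-based detector that uses GPT-2 as the scoring model via Hugging Face transformers. The threshold and example sentences are made up for illustration; real commercial detectors are more elaborate, but the core signal is the same.

```python
# A small sketch of a perplexity-based AI-text detector using GPT-2 as the
# scoring model (an assumption for illustration; real detectors differ).
# Low perplexity gets flagged as "AI", which is why plainer prose from
# non-native speakers is prone to false positives.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # The model's loss is the mean negative log-likelihood per token
        out = model(**enc, labels=enc["input_ids"])
    return float(torch.exp(out.loss))

THRESHOLD = 40.0  # illustrative cutoff, not taken from any real detector

for text in [
    "The study shows the method is good and the results are good.",
    "Our ablations reveal a curiously brittle dependence on tokenization quirks.",
]:
    ppl = perplexity(text)
    verdict = "flagged as AI" if ppl < THRESHOLD else "judged human-like"
    print(f"{ppl:7.1f}  {verdict}")
```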
Eric Topol (23:20):
Yeah, I think this is another critical point you're making, along with the fact that large language models will help people who are not native English writers get their writing to be as good as it would be from someone writing in their native language. So it'll be interesting to see how that plays out. Now, before we get into a few bigger-perspective questions, what are you working on now that you're excited about that you haven't already told the world?
James Zou (23:55):
Yeah. So one big thing we're working on that we're quite excited about: there are a lot of efforts in building foundation models for genomics data, for example single-cell data. There was a very nice paper in Nature recently about Geneformer, which was trained on many millions of single cells. Most of these models are really trained on statistical correlations across all these single-cell RNA and genomics profiles, but they don't really leverage the language part, the text part. And what we found, which is actually quite surprising, is that if you basically just ask ChatGPT to write out descriptions of each of the genes, maybe a few paragraphs describing each gene like Nanog or Oct4, and then you just look at the embedding of those text descriptions, that actually provides in many cases an even more powerful foundation model for single-cell genomic analysis compared to the ones we learn by doing a huge amount of statistical modeling across the millions of single cells. So basically you take the text embeddings of individual genes, and that makes it easier to classify cell types and to do all sorts of downstream applications.
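Here is a toy sketch in the spirit of that idea, not the group's actual code: embed an LLM-written description of each gene, represent each cell as an expression-weighted average of its genes' text embeddings, and put a simple classifier on top. The gene descriptions, the hashing "embedder", and the toy data are all stand-ins.

```python
# A toy sketch of the idea (not the group's code): embed LLM-written gene
# descriptions, represent each cell as an expression-weighted average of its
# genes' text embeddings, then classify cell types with a simple k-NN.
# The descriptions, the hash-based "embedder", and the data are all stand-ins.
import hashlib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def embed_text(description: str, dim: int = 64) -> np.ndarray:
    """Placeholder for a real text-embedding model (e.g., an LLM embedding API):
    deterministically hashes the description into a pseudo-random vector."""
    seed = int(hashlib.md5(description.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).normal(size=dim)

# Hypothetical one-line stand-ins for the paragraphs an LLM would write per gene
gene_descriptions = {
    "NANOG": "NANOG is a transcription factor that maintains pluripotency ...",
    "POU5F1": "POU5F1 (OCT4) is a master regulator of embryonic stem cells ...",
    "CD3E": "CD3E encodes a subunit of the T-cell receptor complex ...",
}
genes = list(gene_descriptions)
G = np.stack([embed_text(gene_descriptions[g]) for g in genes])  # genes x dim

def cell_embedding(expression: np.ndarray) -> np.ndarray:
    """Expression-weighted average of gene text embeddings for one cell."""
    weights = expression / (expression.sum() + 1e-8)
    return weights @ G

# Toy expression matrix (cells x genes) and labels, purely for illustration
rng = np.random.default_rng(0)
expr = rng.poisson(2.0, size=(20, len(genes))).astype(float)
labels = rng.integers(0, 2, size=20)

X = np.stack([cell_embedding(row) for row in expr])
clf = KNeighborsClassifier(n_neighbors=3).fit(X[:15], labels[:15])
print("toy hold-out accuracy:", clf.score(X[15:], labels[15:]))
```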
Eric Topol (25:14):
I think that's especially interesting because, as you know, just this week in Science there were some 20-plus papers on single-cell multiomics of the brain, and spatial biology, a very exciting topic, relies on this approach. To be able to move from only looking at the gene to adding this whole layer of vital data is going to improve things. So that's great. I'm sure you have many other projects, but before we close up, I have a couple of questions for you. One is, you are well aware of the controversy about large language models being self-aware, or whether we have entered the era of artificial general intelligence, versus the diminishing or downgrading view that this is just a stochastic parrot. Where do you stand in your sense of where we are now on this?
James Zou (26:13):
Yeah, it's an interesting question. So I think that we're not quite at AGI yet. I think, for example, there are still limitations to these large language models, and that's why human supervision is needed. But I do think that these models are doing much more than just repeating things from their training data. We are seeing what's called emergent phenomena, in-context learning, chain-of-thought behavior, and these are really impressive. And I think they point to really exciting and promising applications of these large language models, and of AI in general, in medicine and healthcare. At the same time, we are seeing some of these challenges, like the model's behavior can drift substantially over time, and it's still a complicated black box, and the black box is getting even bigger. I think it's as complex as many of the other model organisms that we study in biology, like the brains of fruit flies or C. elegans. This is more complicated, and I think as a scientist, I am actually really interested in studying this complex organism now as the object of study itself. Can we understand the scaling behaviors, and where are different properties and biases of a large language model localized, in which of the layers? I think those are really interesting questions.
Eric Topol (27:38):
Right, for sure. Where do you see us being in the next few years? Because the jumps have come at a very high velocity here, extraordinary, and there will be GPT-5 and GPT-6, and obviously we'll have Gemini soon from Google DeepMind and whatnot. Where are the next steps where you think we'll see performance ratchet up?
James Zou (28:06):
Yeah, I think one really exciting frontier, and you've also explained this really well in many of your papers, is these different modalities. I think now most of the focus is on language, or maybe language plus vision, but we're actually seeing many of these different modalities coming together. Even within vision there are different types of imaging; there are echo videos, there are time series like ECGs. So these different modalities are starting to come together into single models. I think that's really exciting. I don't think just adding more modalities by itself will teach the model everything, because I think we still need to teach it more about the causal relationships and the deeper mechanisms. I think the next big frontier is basically to have models that can have much more interaction with humans as a way to do their own exploration, because if you think about how we learn, we actually explore things ourselves and get useful physical feedback. So now we have these models that have vision and sensory modalities, so they are grounded in the physical world, but turning these models toward having more flexible interactions with their physical environment, I think, will lead to a big jump in the capabilities and in these models' understanding of physics and causality.
Eric Topol (29:35):
And now, since you're leading one of the most productive, prolific academic labs in the country, could you also comment: we are somewhat dependent on these companies that have base models, where they used 25,000 or 30,000 GPUs that no academic lab has. I'm on the Stanford HAI with Fei-Fei Li, and recently there was a discussion: what do we do? We're subject to only doing fine-tuning of models that already exist, and they haven't been trained in life science; they haven't been trained in medicine specifically for fine-tuning. Where do you see this interplay between academic labs like yours, universities, institutes, and the companies that basically own most of the GPU computing power and are making these base models?
James Zou (30:27):
That's a great question. I have a lot of friends at companies like OpenAI, and many of my students have started similar kinds of companies. That being said, I am quite concerned about a setting where these companies have a monopoly over developing and doing the more foundational research in generative AI. I think some encouraging developments recently have been that universities and several nonprofits have been trying to put together larger compute resources. For example, the Chan Zuckerberg Initiative, the nonprofit arm behind Chan Zuckerberg, recently announced that they're going to create one of the largest GPU clusters for use in these nonprofit and academic research settings, thousands of high-quality GPUs. It's not at the scale of OpenAI yet, but I think we're at the level where there could be a lot of really exciting foundational work that we can now start to do leveraging these nonprofit and academic GPU clusters. But this is an area where we do need to make substantial investments if we are to stay competitive, both as a country and as academic researchers.
Eric Topol (31:53):
Yeah. Also to note that you are a CZI, Chan Zuckerberg Initiative, investigator, and it's great that they are stepping up to help on this score. Well, in closing, James, I want to congratulate you. You are, as far as I can see, lighting up AI in life science and medicine like no one else, really, because you span a gamut from molecules to healthcare that is unique, and you're doing it at a pace that's not paralleled by other groups that I've seen. So I think we're going to be watching and reading your work with great interest, and we are indebted, because this doesn't happen by accident. You must be working 30 hours a day, you and your team, to get all this work done. So thank you for your contributions, and I know there will be many more coming in the future.
James Zou (32:52):
Thank you so much for your very kind words, and it's really a pleasure to participate and to chat, and I look forward to having many more opportunities even to collaborate on many of these projects going forward.