
What could Alphafold 4 look like? (Sergey Ovchinnikov, Ep #3)

2 hours listening time

(This was released a few days ago, but it occurred to me that ICLR attendees had better things to do than watch a podcast, so I’m sending it out now instead!)

  1. Introduction

  2. Timestamps

  3. Transcript

Watch on Youtube, Apple Podcasts, or Spotify!

Introduction

To those in the protein design space, Dr. Sergey Ovchinnikov is a very, very well-recognized name.

A recent MIT professor (circa early 2024), he has played a part in a staggering number of recent innovations in the field: ColabFold, RFDiffusion, Bindcraft, automated design of soluble proxies of membrane proteins, elucidating what protein language models are learning, conformational sampling via Alphafold2, and many more. And even beyond the research that has come from his lab in the last few years, the co-evolution work he did during his PhD/fellowship also laid some of the groundwork for the original Alphafold paper, being cited twice in it.

As a result, Sergey’s work has gained a reputation for being worth reading. But nobody had ever interviewed him before, which was shocking for someone so pivotal to the field.

So, obviously, I wanted to be the first one to do it. After an initial call, I took a train down to Boston, booked a studio, and chatted with him for a few hours, asking every question I could think of. We talk about his own journey into biology research, some issues he has with Alphafold3, what Alphafold4-and-beyond models may look like, what research he’d want to spend a hundred million dollars on, and lots more. Take a look at the timestamps to get an overview!

Final note: I’m extremely grateful to Asimov Press for helping fund the travel + studio time required for this episode! They are a non-profit publisher dedicated to thoughtful writing on biology and metascience, such as articles on synthetic blood and interviews with plant geneticists. I myself have published with them twice! I highly recommend checking out their essays at asimov.press, or reaching out to editors@asimov.com if you’re interested in contributing.

Timestamps

[00:00:00] Highlight clips

[00:01:10] Introduction + Sergey's background and how he got into the field

[00:18:14] Is conservation all you need?

[00:23:26] Ambiguous vs non-ambiguous regions in proteins

[00:24:59] What will AlphaFold 4/5/6 look like?

[00:36:19] Diffusion vs. inversion for protein design

[00:44:52] A problem with Alphafold3

[00:53:41] MSA vs. single sequence models

[01:06:52] How Sergey picks research problems

[01:21:06] What are DNA models like Evo learning?

[01:29:11] The problem with train/test splits in biology

[01:49:07] What Sergey would do with $100 million

Transcript

[00:00:00] Highlight clips

My big goal in life has always been to come up with a unified model of protein evolution that accounts for all these different effects. And so what may appear to be creativity is just trying to tackle every part of the problem…

But I think one thing that maybe computer scientists don't quite realize yet is that all of biology is related. Every biological data point, there's no IID, every sample is related to another sample out there. And so if you do like a random train-test split, you might actually have overlaps…

And in some ways, that's essentially what Alphafold is doing. Alphafold will say, I'm going to make a guess, that's like zero recycles, and then you iterate and you sort of move around. But maybe you do many, many independent seeds. And I think that's actually what some of these models like o1 and o3 are doing, like they have many, many independent starting points and they explore. And so I think in some ways, I guess we could say we've been already doing that for a while in the protein world. And they're kind of catching up…

I've never sort of sat down and said, okay, this direction is probably the most meaningful thing to do. It's more just like, okay, this is like a puzzle and there's no solution here. I'm just trying to figure out what's going on here.

[00:01:10] Introduction + Sergey's background and how he got into the field

Abhi: Today I'm going to be talking to Dr. Sergey Ovchinnikov, a recent biology professor at MIT. Sergey is an easily recognizable name to those in my own field, as he is one of the undisputed greats in the world of machine learning assisted protein engineering. His prior research includes ways to make protein folding more accessible, models that can generate de novo protein binders with massive success rates, and methods to help scientists learn what protein language models are actually learning about protein folding.

Today we'll be talking about issues with existing protein models, what future protein models may look like, Sergey's own journey in this field, and lots more. Thank you for coming to the show, Sergey.

Sergey: Yeah, thank you so much for inviting me to be here. Excited to talk to you about those topics.

Abhi: And so just to start this off briefly, I'd love for you to give us an overview of what your historical research focus has been and what types of problems you're most curious about.

Sergey: Well, I would say my journey started in trying to understand the relationship between species, how different organisms are related to each other, but more specifically trying to compare their DNA sequences, their protein sequences to each other. One of the things that we encountered early on is that sometimes you might think two things are highly related to each other based on the similarity of their protein sequences, but in fact, it's because of convergent evolution. There might be similar selection pressures in two different organisms that make their protein sequence look very similar, when in fact it's essentially just convergent; there's similar selection going on. To be able to separate that, you really need to understand the underlying protein structure, and also understand the protein function for these protein sequences.

My journey started off actually in phylogenetics, and then I transitioned during my PhD to try to say, okay, I need to learn everything about protein structure in order to sort of go back and maybe correct the signal so we could do more proper phylogenetic analysis. In some ways, I'm still trying to do that. That's one of the areas we're actually pursuing. We're thinking about how do we do better phylogenetics? But along the way, we've done a few side projects, you could say, like getting into protein design, but that's all sort of related to building models and evaluating these models for this general goal of understanding protein evolution and evolution in general.

Abhi: It's interesting that your original background concerned phylogenetics, and you're still really curious about phylogenetics, even though you're probably most known as being the protein design guy. Do you plan to do that many pure phylogenetics projects in the future? Or is that kind of on the back burner for now?

Sergey: We do have a few projects actually going on in the group where we're trying to actually do better phylogenetic reconstruction. But even for the reverse problem... so one thing I've always thought of is structure getting in the way of phylogeny. But now we're also beginning to believe that, for example, closely related species could actually get in the way of extracting coherent coevolution signals from multiple sequence alignments. And so it's also becoming an issue where maybe for bacteria, it turns out it maybe is not a big issue because most bacteria are highly diverged and you can almost think of it as a star tree; they're all equally separate from each other. But when you start to deal with eukaryotic organisms, like even fungi, you may have random mutations that are propagated and might mislead some of the calculations in terms of structure. So in some ways, not just purely trying to understand phylogeny, but it turns out phylogeny might actually become an important thing to think about when doing even studies more on the structural side or even protein design side.

Abhi: And for people who aren't super aware of how Sergey's work translates to the current state of the art in the field, there's a pretty clear direct line between the co-evolutionary work you did in your PhD and how a model like Alphafold2 actually works. In many ways, you were dramatically ahead of the curve. I'm curious, during the mid-2010s when you wrote those papers, was it clear to even you that this work would be particularly useful for the problem of protein folding, or did it feel very much like a pure phylogenetics problem that had no relationship to any translatable research?

Sergey: Well, I think in the early 2010s or so, other folks besides myself had been thinking about: can we somehow extract covariance signal and could that be used to predict structure? So I think that's always been on the radar of people: this covariance signal will be useful for structure prediction.

At the time, I was mostly thinking from the perspective of using that signal to subtract it out and do better phylogeny. That was my initial goal. But one thing we found out is that actually the signal is useful, especially if you start to look at metagenomic sequences. What I mean by that, one of the interesting things people have found was that when you start to ask the question, okay, where is coevolution useful in terms of being able to predict contacts, to be able to predict structure? Often those structures were already solved; somebody's already determined that structure because if there's a lot of proteins for that protein family, very likely that somebody's already predicted the structure of one of those sequences or actually determined the structure experimentally. The only things that were left unsolved were membrane proteins, just because those were really, really hard to crystallize. But with metagenomics, what happened was that protein families that had only a few sequences suddenly had huge numbers of sequences. And so now these coevolution methods suddenly became more and more relevant for those kinds of protein families.

Abhi: I remember when we first talked a few weeks ago, you said coevolution for the problem of protein structure prediction was one of a few other parallel directions that were going on. I think you named two others. I'd love to hear you recapitulate that, because I think that was a really interesting story.

Sergey: I think maybe I brought up the fact that... well, maybe I talked about different groups working on this. Is that what you're referring to? Maybe that's what I was... okay. Yeah. So essentially, the idea of extracting coevolution from multiple sequence alignments has been around for a while. It actually turns out this was pursued in three different fields in parallel: people in the physics world worked with things like Potts models and Ising models, asking how to transfer those to this problem. People in the computer science world were thinking in terms of Markov random fields and Boltzmann machines, asking how to use those. And then people in, I guess you could say, the more computational biology field were thinking along similar lines, like mutual information, and applying those kinds of approaches.

I think the part that I found remarkable or interesting looking back is that often these people didn't cite each other because they didn't know about each other. It turns out, if you actually look at the math, the math is almost identical, but they just never sort of talked to each other because they used completely different terminology for the same concepts.

Abhi: But the underlying data that you used was all identical? The data was used...

Sergey: Even the algorithms were the same. Okay. It's just that I think in computer science, people call it Markov random fields. And then in physics, people call it Potts models. And then other places were just calling them coevolution models. But it turned out a lot of the math was actually identical. They just used different symbols. Like one field will use W to represent coevolution, another would use J. And it's like, okay. But if you look at it, it's the same equation. They're just using different symbols, different words for the same thing. But I think on my side, I worked more, I guess, with the folks coming from the computer science side. So I worked with Ti-ti from CMU in Pittsburgh. And so I came at it from the Markov random field perspective. And I think in our papers we always call it a Markov random field, but then I realized sometimes people who do Potts models get a little confused because they're like, Hey, is this something different? But it's like, no, it's the same thing. So now I have to say both terms when I refer to it.
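For readers trying to map the terminology onto each other: one standard way to write the shared model (a generic textbook form, not quoted from any particular paper) is as a pairwise energy over an aligned sequence of length L,

```latex
E(s_1, \dots, s_L) = -\sum_{i} h_i(s_i) - \sum_{i<j} J_{ij}(s_i, s_j)
```

Physics papers typically write the pairwise couplings as J, Markov random field treatments often write W, and earlier computational biology work approximated the same signal with statistics like mutual information; in every case, fitting the couplings to a multiple sequence alignment and ranking position pairs by coupling strength is what yields the coevolution (contact) signal.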

Abhi: Do you think the field has generally consolidated into the pure computer science direction or are there still computational biologists and physics people who are pursuing their own parallel paths?

Sergey: I think they've all sort of... I think past 2011 or so, when things suddenly started to work, I think all these groups sort of became aware of each other. Some folks actually started collaborating together. And so there's been... so now I think they're all kind of aware of each other, but I'd say pre-2011, it was kind of multiple parallel efforts of maybe people not recognizing that they're working on the same thing.

Abhi: And speaking of your own computer science background, there's relatively little out there about your own backstory and the lore of Sergey. I'd love to hear about why you decided to study biology, what made you focus on phylogeny, and what led to the eventual pivot closer to computer science applied to the whole subject.

Sergey: Let's see. Where, how far should I go back here? We could maybe even start in college?

Abhi: Maybe even high school? If you think that's a good place to start.

Sergey: So I started off back actually when I was still in high school. I was on robotics teams. So I used to program robots. There's this thing called US FIRST. It's essentially a competition, I think it's international now, where high school students build robots and compete with each other. And I was the one usually involved in programming these robots. But the reason why this sort of led to biology is that, to me, when I started to think about biology from the perspective of code, it kind of all clicked. And so what I mean by that is: you could think of all organisms as having some kind of code, or you could think of all organisms maybe as robots to some extent. There's some code that essentially codes for how they act, how they develop and so on. But to me it kind of felt like we didn't really have a good understanding of the compiler or the compiled code. We have all this code, we don't know the syntax. We knew roughly where proteins start and end, but not much beyond that at the time. And so to me, really, that's why I got really excited about biology, because I was like, Hey, this is like unknown code. We have no idea what it's doing. What if we start to compare these codes to each other?

And so, like right now when we think about GitHub, you can probably look at some project and see everybody who's cloned this project and who's modified it. You could check to see which parts were modified and which parts were not modified. And that would quickly tell you, okay, these parts that were probably not modified are important for this overall project. And parts that were heavily modified, those parts are probably not as important. Or maybe they were modified and they've gained new function, I guess you could say. And so in some ways, when you're comparing different source codes to each other, to me it felt like, oh, you could compare a bunch of genomes to each other and figure out what...

Abhi: And this naturally leads to phylogenetics.

Sergey: Exactly, exactly. So you start to compare all these genomes to each other, I guess you could say, reconstruct the GitHub history of all genomes. But then that lets you start to understand the syntax and so on. And so that's sort of what got me initially. So then when I went to college, I was like, you know what, I'm going to learn about biology. At one point I was... I think I moved a little bit into history because I was like, maybe I'll do history of science because I wasn't sure if I was good enough to do science yet at that time. But then eventually I transferred and started doing more biology. But over time, I joined a couple of labs, like one lab that worked on milkweeds, another lab that worked on various arachnids. And I was actually participating in extracting the DNA, sequencing those guys, and then getting all the sequences. But once we started getting sequences back, I think my advisor at the time quickly realized that I had some computational skills. So like, okay, maybe you could help assemble some of these sequences. So I worked on genome assembly and building some algorithms to be able to do that at the time. And that sort of, I guess you say, brought me back into computation because initially I was like, I'm going to do biology, and then I'm now back to using these algorithms that I worked on in the past.

Abhi: And from there, you also did your PhD work in phylogenetics.

Sergey: Oh, yeah. Maybe I should clarify. So for my undergrad education, I joined a couple of different labs that were working on phylogenetics. But then for my PhD, that's where, during my undergrad, I started to realize we can't really do correct phylogeny without understanding structure, without understanding covariance patterns. Because one of the things that's interesting in phylogenetics is when you build, when you compare a bunch of sequences to each other, there are certain sites that have high entropy and mislead phylogeny. And what I mean by high entropy, there are certain positions that just change rapidly. There are certain organisms that also evolve rapidly. And those organisms that evolve rapidly, they would appear just by chance to be highly related to each other. There's this process called long branch attraction in phylogeny where highly evolving species suddenly start to get grouped together.

And so to get around this problem, what folks sometimes do is say, okay, let's remove positions that are not consistent with each other. Because if there are multiple positions that are kind of consistent, then you would say, okay, this is probably due to phylogeny. Positions that are inconsistent, those are probably just random, and so we should just remove those. But it turns out these metrics of looking for these self-consistencies between positions are actually very, very similar to coevolution. These are sites that are covarying with each other, but it's not clear: is this covariance due to phylogeny or is this covariance due to coevolution? And the only way to tell is to say, can we look at the protein structure? Like if those two positions are consistent and they're close together on a structure, that's a strong indicator that maybe there's some coevolution going on and that could mislead your phylogeny signal. Yeah. And so it sort of turns out these signals are completely entangled. And so that's when it was like, okay, I'm going to get my PhD, I'm going to learn everything about structure so I can disentangle this effect. And that was, I guess you could say, my journey into that space.
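To make the entanglement concrete, here is a minimal, self-contained Python sketch (a toy alignment invented for illustration, not data or code from Sergey's work) that computes per-column entropy and the mutual information between two columns. The covariance statistic by itself cannot say whether two columns covary because of a structural contact or because of shared ancestry, which is exactly the ambiguity described above.

```python
import numpy as np
from collections import Counter

# Toy alignment (hypothetical sequences, not real data): one sequence per row.
msa = ["ACDEF", "MCDEW", "ASDQF", "MSDQW"]

def column(i):
    return [seq[i] for seq in msa]

def entropy(col):
    # Shannon entropy of a single alignment column, in bits.
    n = len(col)
    p = np.array([c / n for c in Counter(col).values()])
    return float(-np.sum(p * np.log2(p)))

def mutual_information(col_i, col_j):
    # Mutual information between two columns: high when the columns covary.
    n = len(col_i)
    pi, pj = Counter(col_i), Counter(col_j)
    mi = 0.0
    for (a, b), c in Counter(zip(col_i, col_j)).items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

for i in range(len(msa[0])):
    print(f"column {i}: entropy = {entropy(column(i)):.2f} bits")

# Columns 1 and 3 covary perfectly here (1 bit of MI), but the statistic alone
# cannot say whether that reflects a structural contact (coevolution) or two
# fast-evolving sites inherited together along the same branches (phylogeny).
print(f"MI(columns 1, 3) = {mutual_information(column(1), column(3)):.2f} bits")
```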

Abhi: And then you graduate from your PhD and during your postdoc you continue this exact same line of work.

Sergey: Yep. Yep. I guess now that I sort of, I felt like as an undergrad, I understood phylogeny. Then as a grad student, I understood structure and I was like, okay, now I'm going to combine the two and finally try to resolve this problem. And so then I went on to become a fellow at Harvard University. And there I actually started to say, okay, how do I actually combine these things? How do I build a unified model that understands conservation, coevolution, phylogeny? And that was my work during, I guess you could say during my fellowship. But during that time, large models started coming out. Like, for example, folks started training giant, giant language models for proteins. Things like initial versions of Alphafold started coming out. And there I was thinking, okay, is it possible these models have already learned to do that? Like, did we kind of get scooped without realizing? Like, are these models learning phylogeny? Are these models learning coevolution? And so in some ways, my work kind of partially pivoted towards: let's fully understand what these models are actually learning. We really need to dig into them because that would tell us, one, did we get scooped? And two, do we still need to work on this problem?

Abhi: I'm curious. I think you're often... whenever people think of Sergey Ovchinnikov, they often think of deeply creative papers, papers that you wouldn't really expect to come from anyone else. Do you think there's an aspect to... do you think a lot of your quote-unquote alpha as a researcher comes from your background in phylogenetics, and that people who work at Isomorphic and EvoScale could stand to learn a little bit more about phylogeny?

Sergey: I don't know if that's where it's coming from, but I guess for me, I guess my big goal in life has always been to come up with a unified model of protein evolution that accounts for all these different effects. And so what may appear to be creativity is just trying to tackle every part of the problem. Like for example, we're trying to extract evolution signal, but then we also need to think about alignments of sequences, right? So for example, maybe we're extracting the wrong coevolution signal because sequences are misaligned. And so we venture into the alignment problem. But then once you start thinking about alignments, then you're like, well, how do you know you got the right alignment? Yeah. Well there, that's where it's like, well, maybe a structure prediction model could tell you that the alignment's correct or not. Right. And so... I guess what you could say, what may look like creativity, it's all just trying to solve this unified model problem, I guess. That would be one way to put it.

Abhi: How often do you return back to your background of phylogeny when you're looking at these problems? Are you often thinking from a phylogenetic lens, or are you often thinking from a pure machine learning researcher lens?

Sergey: I guess I'm always thinking from the, I guess you could say, phylogeny or protein evolution lens, would be one way to put it. I'm always coming back to thinking about how we solve problems in service of constructing this unified model that I keep talking about.

[00:18:14] Is conservation all you need?

Abhi: And, like, one of the axioms you have in your head when you're working on these problems is that you can learn almost everything you need from conservation. Is that a fair way to put it? I do notice you have this side interest in molecular dynamics. I sometimes see you post papers in that realm on your Twitter. But you don't actually ever seem to publish in that area. Do you think molecular dynamics will actually become really important in the future when Alphafold 4 or 5 or 6 comes out? Or do you think for the moment conservation is the most important thing?

Sergey: Let's see. I guess when I say conservation, what I mean is like there's certain, I guess you could say there are certain very important positions that are highly conserved for purposes of function. So maybe to step back a little bit, we have, I guess you say there's lots of sequences out there and every sequence has some amount of selection going on. And some sequences are maybe in one particular organism, and that organism needs to do some function. And maybe a group of organisms, they all have the same function and so they have a highly conserved position. But it's not because there's coevolution there, it's just that that position is super, super important for that group of organisms.

But it could get misleading from the perspective of phylogeny, meaning like maybe you have some random mutation that happened early on in the tree of life and now because of this doubling effect, like you have one speciation event and now you have another speciation event, and every single time, I guess you say half the organisms now have that random mutation. And that could be a little bit misleading from the perspective of saying, is this really conserved? Or is this just a signal that gets propagated and propagated? And so sort of decoupling those effects.

But maybe coming back to your question of molecular dynamics, it's one of those things where we do... when we're interpreting models, we're thinking about how are these models working? Like, for example, language models. If we go step back to language models for a second, are they learning conservation, like learning each group of sequences have different levels of conservation just because they belong in a certain space of sequences? Are they working because we have coevolution going on and different groups of sequences have different coevolution? Or are they somehow internally in the model like solving the protein folding problem, or maybe doing molecular dynamics? Like is each layer sort of learning at different steps of folding or physics?

And when these first models started coming out, it was a little unclear which of those would be true. Like, is it, is each layer sort of folding up the protein, or is it sort of picking up on all these statistics of evolution, I guess you could say? And so that's one of the problems we're trying to separate. But that being said, of course, if we want the model to learn physics, then maybe we need to get into molecular dynamics in terms of trying to get these models to reason over molecules, reason over interactions of atoms. Yeah. I'm not sure if that's what you're asking or getting to, but that's...

Abhi: I guess the broader strokes of the question is: do you think we'll ever escape the well of evolution? Because you published this paper last year about how this pattern, of a model learning co-evolutionary statistics, continues even if you don't have a multiple sequence alignment. Even if you have a pure language model that has only seen single sequences and no explicit conservation signal, it is still learning folding via evolutionary statistics. And we obviously care more about the scope of all possible proteins rather than just proteins that are near existing, evolved ones. Is this a problem in your mind, that we need to find some way to move beyond the well of evolution? Or do you think for now it's actually completely fine?

Sergey: I think it depends on what you're trying to do and what your claims are. And what I mean by that is, I mean, I guess from the pure engineering perspective, there's probably nothing wrong with saying, Hey, let me grab a piece of evolution here, piece of evolution here, and sort of stick it together and make some kind of Frankenstein protein. And I think protein language models actually seem to be really good at that. They've learned different parts of proteins, which motifs tend to be the same across many different proteins. And you can imagine using such a model to sort of stitch different things together.

But then of course for somebody who's maybe coming more from a first principles point of view, they're like, Hey, this is kind of cheating. We want to be able to understand why these sequences code for these things and why are we able to stitch these things in a certain way. And maybe that would allow us to move into space that nature has never explored.

But then again, there are people that argue also that maybe any protein seems to be a combination of fragments. And so maybe nature's already explored all possible fragment space, in which case, having a model learn fragments is not a terrible thing. You just need to be able to sample these recombinations of things.

Abhi: Which do you think is true?

[00:23:26] Ambiguous vs non-ambiguous regions in proteins

Sergey: I think there's definitely space still left to explore. The way I like to think about it is that any given protein sequence is sort of composed of ambiguous and non-ambiguous regions. And so what I mean by that is: there are certain regions where there's a sequence that always codes for a helix. There's a sequence that always codes for a particular turn. And there are parts of the sequence that are these ambiguous motifs that you have actually no idea what they're coding for. It could be a helix, it could be a beta strand, it could be a loop, it could be a break there or so on.

And the only way you'd know is in the context of the full protein. The tertiary structure folds up and you're like, okay. It's almost like this region is some kind of chameleon sequence that could sort of adapt to different things depending on its context. And I think any natural protein is sort of a combination of these two things. There are certain parts that you say, I want this part to be rigid, and so you probably want to have this non-ambiguous sequence. And then there are some regions that are, you say, you know what, maybe this part needs to be flexible. And so maybe I'll put in some ambiguity here that maybe could respond upon a ligand coming close.

And what that means is that now of course if we try to predict these proteins, things that are made up of purely non-ambiguous sequences, Alphafold, ESMFold, like all these fold methods can predict them really well because I think they've already learned all these sort of non-ambiguous sequences, motifs that all these proteins share. But then for regions that are more ambiguous, that maybe have to do with function, these models, unless there's evolutionary information associated with them, are unable to predict it.

[00:24:59] What will AlphaFold 4/5/6 look like?

Abhi: Do you think the future looks something like you have one model to predict the non-ambiguous parts and then like a more physics-based model to predict the ambiguous parts? Or yeah, like, I'd love to get... I guess the broader question here is: what do you think Alphafold 4 or 5, 6 looks like? Do you think it goes something in that direction or somewhere else?

Sergey: Probably. I would say that if you construct a protein purely from these, I would argue, non-ambiguous sequences... and people who do de novo design often completely exploit these non-ambiguous sequences. You can actually even do this experiment: take any little stretch of such a design and throw it into Alphafold, and it'll predict it without the context of the rest of the protein. And I would say it's a very simple search problem. You're essentially just stitching fragments together, and that's why it's really, really easy to predict these proteins.

But of course now once you start moving into this more ambiguous space, it becomes a really large search problem. And there you sort of do need to maybe start to say, okay, maybe we need to put physics in here. Maybe one way to put it would be like, there's this global search problem where you do huge fold exploration, and then there's more like a local search. Like once you roughly know what the fold is, you sort of fix all the little details. And I think Alphafold has learned to take a few steps along some kind of energy function that it learned. But it has a really hard time doing a kind of global search.

Abhi: Like I remember there was this figure in one of your papers where Alphafold uses the MSA to figure out roughly where it is on the energy landscape. Yep. And then does some local energy minimization from there on out.

Sergey: Exactly. So I guess our current hypothesis is that multiple sequence alignments sort of give you the global... or, I guess you could say, let you skip the global search, and you're just focusing on the local search. And sometimes you can mess with the multiple sequence alignment. Like people have found, if you subsample the MSA, sometimes turn on dropout, enable sort of random masking, you can sometimes explore other parts of this starting space. And in some cases, what we tried to do is say, well, what if we just give it a template structure? So this was work with James Roney, and we were like, oh, we could give it some starting point and try to almost move around this space and see how well it's actually able to tell which parts are correct and which parts are incorrect.

But coming back to your question of what the next version of Alphafold looks like, I do think it's probably going to be some version where some global search exploration gets baked into the model. And so what I mean by that is: you can imagine every instance of Alphafold as sort of a few moves of a game, but it's not the full game that we're playing. And there are probably many, many starting points. And so having some way to sort of guide or move into this global space is going to become quite important as we move into spaces where we say, how about proteins where we don't have evolutionary information and so on. Or if you want to design proteins that look completely different.

Abhi: By many different steps, could I mentally equate that to inference time compute? Or are you referring to something else?

Sergey: I guess two things here, because I want to step back a little bit. There have been researchers that have shown that you can run Alphafold with thousands and thousands of seeds, and you could think of each seed as seeding some independent MCMC trajectory, some search. And so you have a bunch of random starting points. And for example, even for Alphafold 3 on antibody-antigen complexes, where there's no evolutionary information, they found they have to go up to like a thousand seeds, and even after a thousand you can continue adding more and more seeds. And so those are kind of random starting points. And so if you're lucky, one of those things will get to the right answer.

But one could imagine doing something, say, well, what if we have a smarter way of seeding, like some smart seeds? Is there some way to bypass? So I could almost imagine some kind of a model that sort of says, okay, here's where you need to explore or here's some hypothesis where to explore. And these are some of the directions my group is currently exploring, saying like, could we somehow seed or do some smarter seeding of the space?
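As a rough, runnable illustration of the "many independent starting points" idea in this part of the conversation, here is a hedged sketch. The predictor below is a placeholder stand-in (not ColabFold's or AlphaFold's real API); the loop just subsamples the MSA differently for each seed and keeps the most confident result, which is the shape of the strategy being described.

```python
import random
from dataclasses import dataclass

# Placeholder stand-in for an AlphaFold-style predictor; in practice this would
# be a real structure prediction run with the given seed and subsampled MSA.
@dataclass
class Prediction:
    seed: int
    plddt: float  # model confidence

def predict_structure(msa, seed):
    random.seed(seed)
    return Prediction(seed=seed, plddt=random.uniform(40, 95))  # dummy score

def subsample_msa(msa, seed, max_depth=32):
    # Keep the query sequence plus a random subset of the rest; different
    # subsamples can land the model in different basins of its landscape.
    rng = random.Random(seed)
    rest = rng.sample(msa[1:], k=min(max_depth - 1, len(msa) - 1))
    return [msa[0]] + rest

def best_of_n_seeds(msa, n_seeds=1000):
    # Each seed acts as an independent starting point; keep the most confident.
    runs = [predict_structure(subsample_msa(msa, s), seed=s) for s in range(n_seeds)]
    return max(runs, key=lambda p: p.plddt)

toy_msa = ["MKTAYIAK"] + [f"MKTAYIA{aa}" for aa in "VLECQRSTG"]
print(best_of_n_seeds(toy_msa, n_seeds=50))
```

A "smarter seeding" scheme of the kind Sergey describes would replace the random subsampling with something that proposes specific hypotheses about where to explore, rather than relying on luck.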

Abhi: Going back to what your lab is working on, a theme I've seen in a lot of your lab's work, and in the work of people who have collaborated with you, is a belief that the existing base models are actually really powerful, and that if you use them in interesting ways, you get a lot of value out of them. Do you think there is that much value in pre-training from scratch, or do you think a lot more interesting work could be done with the existing models? Like you have Bindcraft, which, for people who haven't heard of it, basically allows you to create de novo protein binders at ridiculous success rates, far higher than anything else that had come out before.

Sergey: Well, I guess coming back to a little bit, the earlier point that I was trying to make is: we know that Alphafold is highly limited to a certain space. Like there's, I guess you could say this idealized, non-ambiguous space. And so the question is, well, one question is: do we even need to move into this ambiguous space? Or alternatively say, well actually, you know, if we're happy with this space and there's a lot of things to accomplish here, why not just limit ourselves there? Why even explore to other things? And if we take that philosophy, then I would say, well actually current tools are fine. We could just invert Alphafold. That's what we do in Bindcraft. You can essentially use ESMFold to sort of score these sequences. We think all these models have learned this sort of non-ambiguous, idealized, I guess you say low contact order space. And let's just fully exploit that and just design within that space.

And there's nothing wrong with that. It's just a little bit less satisfying from the perspective of saying, Hey, what if we want to move into something more complicated? And I guess some people could argue that, and we argue this, is that, well, if you start to move into more complicated function... so what I mean is: if you're binding to something, we're not too much worried about, Hey, does the binder change upon binding? Is there some flexibility to the binder? The fact that it's like a rock or doesn't unfold is fine. So binding seems to be a relatively good application of these highly idealized structures.

But then if we start to move into, say, now if we want to design enzymes, now maybe we have to start to understand this sort of ambiguity. But maybe there's some sort of middle ground where you say, I'm going to try to make it as rigid as possible, but maybe destabilize a few little things. And so then becomes like a local search problem. Maybe that's one thing Alphafold can do. And so there's been sort of these combinations of methods that have been coming out, like from David Baker's group, it's like, well, we could restrict some spots based on what we believe nature has optimized for. We're going to keep that fixed and then we redesign everything else. And maybe we don't fully understand the part that we kept fixed, but we also just say that we know it moves somehow, we just have to keep it all in that same spot. But everything else we can redesign and make it rigid. And then that creates like new enzymes.

Abhi: If it turns out that the ambiguous space is just kind of intractable for humanity as a whole to deal with, and we're stuck with the rigid binder world, do you think that's a big loss for anyone? Is it a big deal or is it kind of fine?

Sergey: Well, I guess it depends why you're getting into this problem. I think there are some people who go into protein design, they're coming more from like a, I guess you could say, more of a biophysics background. They're like, Hey, I want to understand this problem. And often people like to quote Feynman saying, Hey, what I cannot create, I don't understand. And so the idea there is: you want to be able to, if you could say, Hey, my model can make something and it works in lab, then I fully understand this problem. And so there you kind of want to have the model work for the reason that you understand, like the mechanism.

I mean, I guess now we're in this weird space where now we could actually make things and we still don't understand it. So it's like, what's going on here? So it's like, how do we check understanding anymore? And I think for those folks that want to understand things on more, I guess you could say on the biophysical side, they do want to sort of make it work for the right reason, I guess you could say.

Abhi: But for the people who are most interested in, I want to create a protein for a specific functionality. I don't care if I understand how it works or not. Do you think the non-ambiguous space is perfectly fine?

Sergey: I think for the most part it is. Okay. Depending on the application. So I would say, like for binding, I think we don't necessarily need to worry about this ambiguous space.

Abhi: For enzymes it's a little bit more complicated.

Sergey: For enzymes, that's... I think when, like for example, I think some of the dreams people have is to design molecular motors, for example. And then there it's like, well, this is where maybe we can't get away with this hack. Or folks say, well, maybe upon binding you have a conformational change.

I mean, there has been work where folks have done these LOCKR kind of proteins where they... it's like you have one helix flipping out and the ambiguity is mostly in that loop. So you essentially have one helix and then you have a loop. And that loop could potentially have a lot of ambiguity in there, and it could get displaced by another helix coming in. And so I guess there are these hybrid things where you sort of combine things that are ambiguous and non-ambiguous. And so maybe there's still a lot to push in these hybrid sequences.

Abhi: If you look at the problem of at least rigid binder design, do you mentally consider that problem solved or are there still edge cases where you think there's a lot of work to be pushed on?

Sergey: I would say that it doesn't work for everything. And part of it has to do with the fact that... so one thing we find is like if you take a target that maybe has a little bit like a hydrophobic patch somewhere, interestingly, if you just run Alphafold with any random sequence, Alphafold always puts any random sequence near that patch. And then, of course, once it goes there, it's not really confident. And as you optimize, that sequence becomes more and more confident. But then there are some targets, like you give Alphafold random sequences and every iteration it just places in a different location. Alphafold doesn't see a clear signal anywhere where something can go. And I think those are regions that are more hydrophilic. So for example, if the surface is completely hydrophilic, essentially there, unless you already have like a perfect sequence on the other side... by perfect sequence, what I mean is like you have maybe the correct hydrogen bond patterns that can maybe detect one part on the surface... it's really, really hard to optimize. It's like you almost need the right answer before you even try to design or optimize.

This is where we think maybe methods like diffusion or flow could be useful, because you can target it, say, Hey, let's just explore here. While with hallucination, you're constantly... it's almost like you're hoping that it already knows where to go before you even start. And that becomes a bit of a limitation when you want to restrict yourself to a certain spot.

[00:36:19] Diffusion vs. inversion for protein design

Abhi: That makes sense. And actually, kind of on that point, I think the last time we talked, you had a lot of really interesting things to say about the value of diffusion versus hallucination or masking. I'd love to get your pitch as to why diffusion is the way forward.

Sergey: Well, I guess I wouldn't say diffusion is necessarily the way forward, but it is definitely a step in that direction. So maybe just to step back a little bit: ultimately the protein design problem is to find a sequence that folds into one structure and no other structure. But not just that; you also want the conformational landscape, or I guess you could say the folding landscape, to be smooth enough that you can actually get there. Because you could imagine a sequence that has very low free energy, but there's a huge barrier you have to go over to actually be able to fold into that structure.

And the reason why we like things like Alphafold or inverting Alphafold is because we say, well, actually during design we're at every single step of design testing for that condition: does it fold into that structure and no other structure?

The problem with a method like diffusion is that you're sort of coming up with the structure first; the sequence is not there yet. But even if you do joint sequence and structure optimization, at the end of the day you're still only evaluating that sequence against the one structure that you're diffusing. And you're never asking the question: does that sequence fold into something else?

I mean, of course you can run it thousands and thousands of times and then after the fact, run Alphafold on all those sequences and check for that condition. But if the goal is to, if at the end of the day you're going to be using Alphafold, why not just use it as the oracle itself?

So, I guess to describe to people who are not familiar: you can either generate a bunch of sequences and then check for this sort of, I guess you could say inverse folding property of making sure the sequence only folds into one structure and no other, or you could say, well, let me just evaluate that condition at every step of design.

And in the past, people sort of abandoned this idea of inverting structure prediction models, just because it was really, really hard to optimize. Like if you try to backprop through Alphafold, it gets very, very unstable once you start to only work with a single sequence. And it takes too many steps. And so I guess you could say diffusion was sort of invented to try to, or at least not necessarily invented, but was introduced into this field in order to get around that problem of sort of instability of back propagating through structure prediction models. But I think now we're sort of getting to a point where I would say, actually, you know what, with some of these tricks like relaxing the sequence representation, maybe removing some of the recycles from the model, we could actually do... like it's not as expensive as it used to be. And now we could actually maybe move back into this kind of approach.

Abhi: Like the inversion approach?

Sergey: Yep. Yep, yep.
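For readers who haven't seen the hallucination/inversion idea before, here is a minimal runnable sketch of its shape, with one loudly flagged assumption: a toy differentiable score stands in for the real oracle. In practice the score would be AlphaFold or ESMFold confidence (pLDDT/pAE) and the gradient would come from backpropagating through the network, as discussed above; the point here is only the loop structure of keeping a relaxed, continuous sequence representation, scoring it, taking gradient steps, and discretizing at the end.

```python
import numpy as np

# Toy stand-in for the design oracle: a quadratic score over per-position
# amino-acid probabilities. NOT AlphaFold; just differentiable so the loop runs.
L, A = 10, 20                                  # sequence length, alphabet size
rng = np.random.default_rng(0)
target_profile = rng.dirichlet(np.ones(A), size=L)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = rng.normal(size=(L, A))               # relaxed (continuous) sequence
lr = 1.0
for step in range(200):
    probs = softmax(logits)                    # soft sequence fed to the "oracle"
    score = -np.sum((probs - target_profile) ** 2)    # toy confidence to maximize
    grad_probs = -2.0 * (probs - target_profile)      # d(score)/d(probs)
    # Chain rule through the softmax, then a gradient-ascent step on the logits.
    grad_logits = probs * (grad_probs - np.sum(grad_probs * probs, axis=-1, keepdims=True))
    logits += lr * grad_logits

design = softmax(logits).argmax(axis=-1)       # discretize only at the very end
print("toy design (amino-acid indices):", design)
```

The tricks mentioned above, relaxing the sequence representation and reducing recycles, live in the two places this sketch glosses over: how the soft sequence is represented and how expensive each oracle call is.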

Abhi: Okay. So actually you... I may have misinterpreted your original viewpoint. You don't think diffusion is actually particularly... I guess like do you have a strong stance one way or the other as to whether inversion versus diffusion will ultimately win as the primary protein design method?

Sergey: Well, I guess what I was trying to get at is that I think ultimately it's testing for the same thing... even when you run diffusion, at the end of the day you still use Alphafold as your oracle to check. That's true. And so one could argue, well, if your end goal is to pass the Alphafold test, why not just use Alphafold throughout the whole process? And in some cases... people usually didn't want to do that just because Alphafold was really, really difficult to work with, in terms of inverting it and propagating the signal through it. But I think the reason why hallucination has been coming back recently is because people have finally figured out, hey, actually there is a way to use it as the oracle and use it during optimization.

But that being said, there are still some problems, like what I described a little earlier, where let's say you want to design a binder to a particular location, and you give Alphafold some starting sequence, and that sequence just doesn't go to that spot. And you could add some hotspots. You can say, Hey, I really want to go there, but unless just by chance Alphafold predicts that protein to be close in that region, there are no gradients to push that sequence in that spot. Yeah. And so there's just no way to tell Alphafold, here's where I want to explore. You're almost hoping just by chance it appears there and then you optimize there. In the case with diffusion, since you're working explicitly in structure space, you could say, Hey, I have my structure. I'm going to initialize my noise here. And now you're kind of forced to explore there.

Abhi: You can guide the diffusion process like the tiniest bit.

Sergey: Yep. Yep. So you, I guess, could directly steer the structural components. And that, I think, could be powerful in that context.

Abhi: I saw that you had this, like, this Boltz Inverse Design thing that came out recently. Do you think there's going to be that big of a step up from Bindcraft? As in, in some sense, Bindcraft is using Alphafold2 as an oracle, and this new approach is using Boltz as an oracle. Do you think there's that big of a step up in actual improvement in terms of generating binders? Because I remember I watched a lecture by Martin a few months ago, and he said initial tests with Boltz actually showed that it wasn't that much more successful than Alphafold2. Does that also continue to be empirically the case?

Sergey: That's a great question. We haven't really done any kind of benchmarking in terms of protein-protein interactions, in terms of whether it's able to predict things better. The reason why we... so Yilun in my group has been working on this... is we're thinking, how about other problems where the thing we're binding to isn't a protein? So let's say that, instead of a protein, the target might be DNA or RNA or a small molecule, or maybe a protein with a modification. And so I guess the thinking behind what we call Boltz Design was not necessarily that it could do better than Alphafold2, but more that it could now address other classes of problems that Alphafold2 cannot do. So anything to do with small molecules or anything non-protein.

But just coming back to the protein-protein interaction problem, I guess my immediate guess would be that it wouldn't do much better. But I don't know. I guess that's something worth exploring.

Abhi: Yeah, it was surprising to me. Because instinctively I would expect, oh, Alphafold 3 has seen more molecular interactions. Thus, it probably learns to do protein-protein interactions better, and it seems like that hypothesis hasn't really proven itself out. Is there good intuition as to why?

Sergey: Well, I mean, at the end of the day, if you look at Alphafold2 versus Alphafold 3, and then all the different implementations of Alphafold 3, the core parts of the model are the same. Like you still have the Evoformer, and then you have the Pairformer, which removes parts of the MSA. And the only thing that really got replaced was the structure module, which got replaced with the diffusion module. But to me it seems like both of those things are kind of doing the same thing. Okay. It's just that one is a bit more deterministic than the other. In one, you start with noise and then try to satisfy some constraints coming from the pair features. In the other, you start with everything at 0, 0, 0 and try to match those constraints.

And if I was to guess, I would say the DeepMind team probably thought that by separating out the first, more expensive part as its own component and then doing a lot of diffusion afterwards, like many, many iterations, you would get many different solutions. But in reality what happens often is the first part just gives you back the same constraints. And so then you always sample the same structure. And so I think there were great plans there, ideas there that might have improved things. But ultimately I think the structure prediction part maybe didn't get helped too much by these updates. But that being said, all small molecule, non-protein things definitely improved.

I mean I might take a little part of that back because I think in the Alphafold 3 paper they did show for like, for example, antibodies, you can do far more recycles... I mean not, sorry, far more different random seeds and you could find the right solution. So maybe there are some improvements there.

[00:44:52] A problem with Alphafold3

Abhi: On the topic of Alphafold 3, I remember you talking at one point, and I might completely misremember this, about some quibbles you had with how the recycling process is done in Alphafold 3. I'd love to hear you expand on that.

Sergey: Alright. So yeah, so I guess for those of you who are not familiar: Alphafold2, you make a prediction and then you take that structure and you feed it back into the model and you say, make another prediction. You sort of keep cycling this over and over and over. And each step, the structure kind of improves and improves.

And one thing we've noticed with Alphafold2 is if you try to do like homooligomer predictions, if you look at recycle zero, all the copies are literally overlapping on each other completely, overlaid on each other. And then each recycle kind of pulls the copies apart so they're no longer clashing with each other. So what we think is happening is the Pairformer, or I guess you say the Evoformer in the context of Alphafold2, sort of learned some constraints, but these constraints are physically not possible. Like you try to actually realize that structure and you realize that you create a bunch of clashes. And I think one of the cool things about Alphafold2 is like you see that there are these errors in your structure and you feed those errors back into the model and you refine those constraints and you keep refining it.

One thing people have noticed with Alphafold 3 is that you start to see these clashing problems. Like if you try to predict like a homooligomer structure, these copies are all overlaid on each other. So there's no feedback in Alphafold 3 back from making an attempt to predict the structure and then refining the constraints.

Abhi: I thought Alphafold 3 does have a recycling component.

Sergey: So it has recycling, but it's only within the MSA and Pairformer.

Abhi: Gotcha.

Sergey: Like there's no... the structure that gets predicted doesn't get fed back into the model.

Abhi: So there's almost like an original sin problem going on a little bit where it has no chance to try and modify what it predicts.

Sergey: I mean yes, exactly. Exactly. So it's like you try to realize those constraints, but there's never feedback that you failed to realize these constraints. And I think the reason they did this was well-meaning because I think they were hoping, Hey, let's do this expensive... this whole cycling or this recycling one time and then run diffusion thousands and thousands of times. Yeah.

Abhi: The hope is that diffusion does what you're asking for, right? Like it realizes the constraints early on and then refines it. Why do you think... because in my head, I've always mentally thought of recycling as like a pseudo-diffusion process. How come the actual diffusion process doesn't seem to do what recycling is doing?

Sergey: I suspect the reason is because the diffusion part doesn't have a chance to sort of re-modify how it's interpreting the coevolution signal. And so what I mean by that is: we think the MSA module and then the Evoformer or the Pairformer is taking the evolutionary data and trying to clean it up, trying to disentangle some of the ambiguity in the constraints. But once you've disambiguated all those constraints, you're now kind of stuck trying to satisfy those final disambiguated constraints. But what if the disambiguation didn't work out correctly? Like you didn't correctly disambiguate these things. And the only way you would know is to try to embed those constraints in a 3D structure. And I guess if one was to go back to the diffusion part and let it reprocess the MSA at each iteration, then maybe that would work. But then you're back to what you call the pseudo-diffusion part of Alphafold2. Yeah. Because that's essentially what Alphafold2 is doing. You're doing this iterating process: you're fixing the constraints, refining, I guess you could say, the potentials, and then rediffusing the structure, and you're iterating on that over and over and over.

So yeah, I would say one simple fix to all this is to say, Hey, let's just feed the structure back in. My guess is the reason they didn't do this was because of those hopes I talked about earlier. But that would be, I think, an obvious thing to try: bring back this iterative refinement step. And that would be one way to maybe fix that problem.
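A minimal sketch of the data-flow difference being described, using trivial stand-in functions rather than the real AlphaFold 2/3 components (the function names and return values here are invented for illustration); the only point is where the predicted structure does, or does not, feed back into the trunk.

```python
# Trivial stand-ins for the real components; only the data flow matters here.
def trunk(msa, prev_structure=None, prev_pair=None):
    # Stand-in for the Evoformer/Pairformer: produces pairwise constraints,
    # optionally conditioned on a previous attempt.
    return {"constraints_from": (msa, prev_structure, prev_pair)}

def realize(pair_features, seed=0):
    # Stand-in for the structure module / diffusion head: turns constraints
    # into an actual 3D structure.
    return {"structure_from": pair_features, "seed": seed}

def predict_af2_style(msa, n_recycles=3):
    # AlphaFold2-style recycling: the predicted structure is fed back into the
    # trunk, so constraints that turn out to be unrealizable (e.g. clashing
    # copies in a homooligomer) can be corrected on the next pass.
    structure = None
    for _ in range(n_recycles + 1):
        pair = trunk(msa, prev_structure=structure)
        structure = realize(pair)
    return structure

def predict_af3_style(msa, n_recycles=3, n_samples=5):
    # AlphaFold3-style, as described above: recycling stays inside the trunk;
    # the diffused structures are never fed back, so the constraints are frozen
    # before any attempt to realize them in 3D.
    pair = None
    for _ in range(n_recycles + 1):
        pair = trunk(msa, prev_pair=pair)
    return [realize(pair, seed=s) for s in range(n_samples)]
```

The "simple fix" Sergey mentions would amount to giving the second function the same prev_structure feedback loop as the first.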

Abhi: Has any of the Alphafold... I haven't looked too closely at any of the Alphafold 3 replications. Have any of them tried to poke at this problem or they haven't really touched it?

Sergey: They, from what I've seen, they haven't, no. Okay. And I think part of the reason I think everybody right now is just focusing on saying, let's reproduce the baseline. I'm sure that some of these groups are already thinking about how to improve on top of this, but I think right now the goal is to reproduce the baseline.

Abhi: That makes sense. Um, and kind of on this note of replications, we seem to be entering... the field at large seems to be entering something similar to the GPT-2 moment in biology, or, phrased differently, a period where the best models are being built by private companies and are, for the most part, never open-sourced. I'm curious to hear your thoughts on this as someone who has built probably the most popular open-source implementation of, or plugin to, Alphafold2. Where do you think the trend of open-source protein models and biology models is going?

Sergey: Yeah, I think it's definitely sad to see companies now not wanting to release their code. Because I always felt there was a synergy between industry who have a lot of resources like GPUs and building models, but often having no idea what the models are doing or how to interpret them. And then academic groups taking these models and trying to interpret them. And so there's this synergy: somebody builds a model, a bunch of people are trying to interpret it, figure out what's wrong with the model, what is it learning, what is it not learning, and then feeding back into the industry and the industry could try to fix those models.

I'm guessing maybe industry has reached a point where they're like, Hey, we've now... it works too good. We can make billions of dollars on this. And so they have less incentive to release these models for academic research. And yeah, so it's definitely... I mean, I'm personally not too happy that they're not releasing these. Yeah.

Abhi: Do you... like in the NLP world, people thought things would go the exact same way, but you had entities like Meta release the Llama models. Do you think there'll be some entity like that in the biology world where some either very strong academic group or some very caring for-profit institution releases good open source models? Or do you think that's also unlikely?

Sergey: Well, I mean, like if you look at what happened with Alphafold 3, there have indeed been groups; I think ByteDance, for example, released Protenix, and I think they've now also made it available for both commercial and non-commercial use. Chai released their model. And HelixFold; I think there have been a few groups, I guess similar to the whole DeepSeek kind of thing, where they're releasing everything for people to be able to use these kinds of things. And so, yeah, I think there are always going to be some groups out there that are just releasing.

I'm not sure how much it'll continue. I think sometimes what people do is they... industry, I'm guessing, is they would release some early versions of things just because it's really good for PR and so on. But then once they get things working much better, they might not want to release it anymore. But then maybe some other group catches up and releases everything. So it's kind of a... so I think there are going to be cycles where people try to... they think they have something cool and they're trying to hide it, and then somebody else will catch up and eventually get there.

Abhi: Kind of on this note, at least amongst NLP frontier AI labs, there's some level of homogeneity amongst all the approaches they're pursuing. For the most part, everyone's doing autoregressive models, but there's also like one group doing diffusion language models because they think that's how you get better language generation. It seems like for the most part, almost everyone in the frontier biology labs is pursuing the exact same approach, from my perspective. Do you see strong levels of heterogeneity in the underlying details, or would you agree that they all seem to have a very similar thesis?

Sergey: I think sometimes it looks like everybody's doing the same thing because everybody's just trying to reproduce the baseline. But once the baseline is produced, then people start to explore more. So, I mean, it's true that once something comes out, like somebody says, Hey, we have a working solution, everybody's trying to catch up to it. But once people have caught up to it, then people say, Hey, you know, how can we improve it? How can we change it? But every year you always have somebody else say, Hey, there's another way of doing it. And so there are always these shifts. And I guess in our world we do see, like people are doing some kind of MCMC hallucination and then people switch to diffusion, and now people are moving into flow and now hallucination's coming back. And I'm guessing there's going to be a sort of... but there's always like some group that shows something and then everybody tries to catch up to it. Yeah. So I don't think there's that issue per se.

[00:53:41] MSA vs. single sequence models

Abhi: Yeah. It does feel like EvoScale is the most unique amongst everyone by eschewing MSA. Do you think getting rid of MSA is going to be a good long-term solution to not having to deal... because the MSA track is annoying to deal with in practice. But do you think it's so essential that it's very hard to escape from needing it?

Sergey: I guess our argument is that even the models that do work on single sequences, they are learning something akin to an MSA.

Abhi: Mm-hmm.

Sergey: Not necessarily memorizing MSA per se, but the statistics that you would have for a given protein family, it's learning these things. So. And I guess it's sort of a debate between... I like to think of it as sort of a debate of storage versus lookup. Like you could have a model that just memorized all the statistics, or you have a model that's much smaller and can retrieve some information. And then using that information, recompute some of the statistics.

And in some ways you could sort of see parallels to that in the OpenAI field, where they're often thinking like, Hey, instead of making these models bigger, let's just give it access to Mathematica or give it access to, I think in their case it would be Bing, instead of Google. But it's like giving it access to search. And MSA is actually sort of, I guess you could say essentially a version of search. Like you search for a few things and based on that information you compute the statistics.

And so I guess the question is: what makes more sense? Do you make models larger and larger, or do you make smaller and smaller models, but give it access to information? And I personally prefer the latter saying, well, could we just make smaller models that have access to information?

But that being said, I think folks that do evolutionary scaling kind of stuff, I think Alex Rives and so on, they do believe that as you scale, eventually, maybe at some point, it has some kind of emergent property. Like right now, to me, it seems like it's mostly learning conservation, coevolution, maybe a bit better coevolution in terms of maybe being able to share statistics between protein families that we previously were not sharing. But I guess the hope is at some point it snaps out of that behavior and says, Hey, now to make it even better, you need to learn physics. Yeah. And so that, I think that's... I think it's worth exploring. I mean, it's going to be really expensive. But if we don't believe in emergent behavior, then it's the question: why bother scaling? Why not just give it access to a bunch of sequences and let it search for that information itself?

Abhi: And like one alternative thought. Maybe instead of providing MSA during train time, you provide it during inference time. Like, I've seen a few papers try to do like a RAG approach when it comes to actually providing MSAs. Do you think that approach has much steam in it or there's like some reason it doesn't really work that well in practice?

Sergey: Could you explain the process again?

Abhi: Yeah. Like the idea is like you provide the sequence to the model and the model can refer to like a retrieval augmented generation database which contains embeddings of homologous or aligned sequences. And it can just feed that information in and that's how it computes things.

Sergey: Yeah. Yeah. I mean that seems like one way to do it. I mean, I guess you could always... I mean, to step back a little bit, one of the issues with MSAs is that, one, you often don't actually know what sequences to include. Like you get a bunch of sequences, but some of those sequences might not belong. Like maybe they have different conformations or maybe they're completely from a different protein family. In some cases it's just really hard to find these sequences because maybe they're less than 20 or 10% identical to the query sequence. And so some of the typical methods people use might not be able to find these sequences. And then finally there are problems of just aligning the sequences.

And so I guess the retrieval methods could be useful because they're able to potentially say, well, instead of doing sequence alignments to find those sequences, maybe there's some barcode I can use to find those guys. Yeah.
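A rough illustration of that "barcode" retrieval idea, where `embed` stands in for any per-sequence encoder (for example a mean-pooled language-model embedding); none of these names refer to a real tool's API:

```python
import numpy as np

def retrieve_homologs(query, database, embed, k=256):
    """Return the k database sequences whose embeddings are closest to the query's."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    db = np.stack([embed(s) for s in database])
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    scores = db @ q                  # cosine similarity of each database entry to the query
    top = np.argsort(-scores)[:k]
    return [database[i] for i in top], scores[top]
```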

Abhi: But you're probably trading off on something and you're losing... You're using a heuristic, really. You're unable to do the actual alignment and you're trading off on some level of accuracy.

Sergey: Yeah, yeah, yeah. I mean, I think if you try to actually take every sequence and compare it to every sequence, it's just going to take forever. It's going to... like even making one multiple sequence alignment on hundreds of CPUs would probably take multiple days to do a search against all the sequences out there. And there are all these heuristics like, Hey, let's try to find... like for BLAST, the way it works is you chop things up into small words and you say, do any of the words match? And MMseqs2 does this to a similar extent. And then ungapped alignments, then gapped alignments and so on. But you're probably losing a lot of things. And so you could always find more sequences by turning off all this pre-filtering. And so in some ways, one could also argue that, well, maybe there are just issues in making multiple sequence alignments, and so having a model that just stores evolutionary statistics is better... Yeah. Or alternatively, maybe there are other ways to retrieve sequences that don't require explicitly aligning things or looking for matching words.
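As a toy version of the word-matching prefilter being described (real tools like BLAST and MMseqs2 use scored and spaced words plus ungapped and gapped extension, so this is only the core idea):

```python
def kmer_set(seq, k=3):
    """All exact k-letter 'words' in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def word_prefilter(query, database, k=3):
    """Keep only database sequences sharing at least one word with the query."""
    qwords = kmer_set(query, k)
    return [s for s in database if qwords & kmer_set(s, k)]

# Survivors would then go through proper alignment; anything filtered out here
# is a potentially lost homolog, which is exactly the trade-off being discussed.
```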

Abhi: It feels like, I imagine what some people at EvoScale think about is if you try to do the alignment yourself, you can't include every possible alignment. You have to make some sort of cutoff as to how many you include. And like, how exactly do you do the alignment? And it feels better to have the model decide what is important versus what isn't important. I guess this is a question more for your phylogenetics background. Do you think the way people handle alignments with models like Alphafold2 is actually pretty sophisticated, and there's not that much further optimization that a zero-priors ESM approach could really improve upon? Or do you imagine there's still a ways to go to improve things?

Sergey: I think there are still ways to improve things. The reason I'm saying that is actually in the most recent CASP, people have found the best-working approach is essentially to create multiple sequence alignments in thousands and thousands of different ways. Like people try lots of different methods, different E-value cutoffs, different coverage cutoffs. And you could always make better predictions by sort of sampling lots of different kinds of MSA generation methods, or also varying what you include and don't include.

And in some ways we've also found this before. One of the projects we worked on before, we said, well, what if you make MSA generation differentiable? Like let's say you have a collection of sequences but you don't know how they align to each other. Could we pass them through some kind of differentiable Smith-Waterman module and then try to figure out the best way to align them to maximize the pLDDT in Alphafold? And we did actually find that for proteins that have very few sequences, sometimes just realigning one sequence made a huge difference. And so there's still room to improve there.

In some cases it worked for the wrong reason. And what I mean by it worked for the wrong reason is it turns out... this is work from Sam Petti, who's at Tufts now, but she looked at some of these alignments that we generated, like these multiple sequence alignments, like, Hey, we got a better multiple sequence alignment. Was it actually better? And it turns out in some cases, the sequence got misaligned and that improved Alphafold, because when it was aligned, that sequence was influencing the covariance statistics. And so if you misalign it, it sort of gets averaged out. And so in some cases, some sequences just do not belong, and by realigning them, we actually removed them. And so it's like in some cases... so I guess you could say the MSA problem is still unsolved.

Abhi: Yeah. Like, I think one question that I have tried to figure out myself, and I've kind of come up empty-handed on, is: is there a principled way to improve the MSA? And it seems like the answer is no, really. Like the CASP result with AFsample, where you're just rerunning things like a billion times with different MSA subsets. It's kind of under-discussed in the paper as to what makes one MSA better than another MSA. Is there actually a principled approach to decide which one is better, other than just looking at the confidence metrics? Or is there some phylogenetic theory that you could rely upon to show why this one MSA is better than another?

Sergey: Yeah. Yeah. Well. Yeah, I guess it's an unanswered question, like why does it get better? I guess one hypothesis is like, well maybe each MSA is essentially initializing this global search. Coming back to the earlier topic, like each MSA is sort of perturbing it enough to start somewhere else, and now you found the solution. And it's not really the fact that the MSA is any better. You just added a little bit of noise and now you're starting somewhere else. Yeah. And so maybe if you just use the same MSA with lots of random seeds, maybe we would have gotten the same answer.

That being said, other researchers like Hannah Wayment-Steele found that actually if you cluster the multiple sequence alignments, you can look at different clusters, and different clusters seem to have maybe different evolutionary signals, and that could push Alphafold to do one thing versus another thing. And I guess originally I would've thought that wouldn't be the case because with Alphafold, you have attention between the sequences and so you're like, well, Alphafold should be able to figure out which sequence to include and not to include. But in practice it looks like you can get a better signal by essentially subsampling the MSA closer and closer to the query.

That being said, I think other researchers have shown that actually sometimes you can just make random MSAs and you can still explore that. It doesn't necessarily mean that one invalidates the other. It just turns out there are lots of ways to get Alphafold to predict alternative structures. Yeah. Like you can maybe get it to predict the alternative structure because you're just initializing in a different search space, or because there's some bias in one place or another. And so there are a lot of ways to sort of push Alphafold to do different things. And whether or not they're doing it for the right reason or wrong reason is sort of still a debated question, I guess you could say.
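For illustration, the subsampling experiments being discussed boil down to something like the sketch below: split the MSA into subsets (here by identity to the query, though clustering is another option) and run the predictor once per subset. The `predict` callable is a stand-in for an AlphaFold2/ColabFold run, not a real API:

```python
import numpy as np

def seq_identity(a, b):
    """Fraction of identical non-gap columns between two rows of the same alignment."""
    pairs = [(x, y) for x, y in zip(a, b) if x != '-' and y != '-']
    return sum(x == y for x, y in pairs) / max(len(pairs), 1)

def subsample_by_identity(query, msa, cutoffs=(0.9, 0.7, 0.5, 0.3)):
    """Nested MSA subsets containing only sequences above each identity-to-query cutoff."""
    ids = np.array([seq_identity(query, s) for s in msa])
    return {c: [s for s, i in zip(msa, ids) if i >= c] for c in cutoffs}

def predict_ensemble(predict, query, msa):
    """Run the predictor once per subset; diverging outputs hint at alternative conformations."""
    return {c: predict(query, subset) for c, subset in subsample_by_identity(query, msa).items()}
```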

Abhi: I think kind of related to this point, I have been... a lot of your papers kind of poke at the problem of mechanistic interpretability when it comes to these base models. But as far as I can tell, you haven't clearly tried to train an SAE on ESM or anything like that. Do you think there's a lot of steam in these approaches being applied to protein models or there's some gap that you're unsure about and that's why you haven't tried exploring it yet?

Sergey: I mean, we have looked more specifically at different attention layers and seen different attention layers maybe learn different things. I guess the reason we never really... I mean there have been recent papers, I think a couple papers where they look at these sparse autoencoders applied to the... not our papers, but other folks have done this now.

Abhi: Yeah. There are people from Reticular, I think. I think also from MIT, they started a company around figuring out which layers of ESM correspond to the specific secondary structure.

Sergey: Yeah. I guess, I mean, it's definitely an interesting sort of thing you could look at to see like, is there parts... I mean, the reason why we personally haven't done that was because we're like, well, the attention layer already tells us there are different layers learning different things. And you sort of see secondary structures popping up and so on. So that, I guess for us, the attention layers seemed to be already interpretable from that perspective. But that being said, maybe there are certain features that are not necessarily in the attention themselves. That may also be... I guess it would be interesting to compare these, I guess you could say sparse autoencoder features to the corresponding, I guess you could say attention heads. And see are they picking up something different or not.

I mean, I guess to step back a little bit, folks in the past, like Martin Weigt for example, have looked at sort of taking these Potts models and decomposing them and finding that different, I guess you could say, components of the Potts model have some connection to different features of the protein. So I guess one thing that might be fun to explore might be to say, well, what if we just take the categorical Jacobian from a Potts model, I mean from a language model, and decompose that and see do these correspond to similar kinds of features that SAE models are picking up on?

And I think, I think there's still room to explore. That being said, I think that one exciting thing about these sparse autoencoders is that perhaps if you could figure out exactly in the model what is being activated to make a specific prediction, then that could be some kind of steering approach. Like you could say, well, now I know if this is activated, it means that this is going to be an enzyme or something. Yeah. Or this is going to be a secondary structure. And so then maybe during hallucination, if you want to use that approach, or maybe you could somehow enable those things. And so if you could figure out those connections. And so, so I guess what I'm saying is that there might be actually some benefits to it that are beyond what we're working on that we're not considering at the moment.

[01:06:52] How Sergey picks research problems

Abhi: And kind of on that note, this is leaving the realm of science a little bit, not entirely, but if you were a PhD student today, what do you think you would be working on? It can be either in this field or some other field.

Sergey: Let's see. It's a great question. I haven't really thought about it too much. I think for me, I've always... yeah, I don't really have a good answer for that for you. I'll have to think about it some more.

Abhi: Well, you can come back to that question. I guess to give some hypotheses, it actually feels like when you were describing your original personality back in high school, it feels like the type of mind that would've gone to physics much more than biology. Do you think... and it turns out you studied phylogenetics, which is kind of a field that lacks a lot of translation to the real world. Do you think it's good for someone to focus on something that is very theory-heavy and has relatively little immediate application, in the hope that you can eventually convert that into actually useful research during grad school or further research? Or is it better to immediately start with the immediately applicable stuff straight up?

Sergey: It's a good question. So I think for me, I've always done things that I felt were more fun or more curiosity-driven research, I guess. I've never sort of sat down and said, okay, this direction is probably the most meaningful thing to do. It's more just like, okay, this is like a puzzle and there's no solution here. I'm just trying to figure out what's going on here. And I think when you go down the path of saying, okay, what's going to be the most... I don't know, going to make me the most money or something... I mean, I guess people are driven in different ways. For me that's not really what I think about when thinking of a new topic to work on. For me it's like, is there some unsolved problem here? And like, why does this work the way it works? And sort of... I think I just like to solve puzzles is maybe where...

Abhi: So it's not even necessarily like, I want to have this grand scientific impact. It's just like I get nerd sniped by something and I spend years working on that.

Sergey: I think so. Yeah. Yeah, yeah. Exactly. Exactly. I mean, sometimes you're lucky and you happen to get curious about a topic that actually turns out to be really popular and takes over. I mean, when I was moving into this whole coevolution field, I didn't think that this would be some hot topic. So maybe I got pretty lucky in picking an area that happened to have exploded recently. I mean there's definitely like a chance or like a lottery thing here. Yeah, like, for example, when I was moving into coevolution, a lot of people actually told me like, don't do this. This didn't work because people have been exploring this since the nineties and seventies and so on. It never quite worked. And so they said this is the wrong area to be in. But I still believed I could figure it out. And so... but it's also those areas, people didn't think it was going to work at the time. And so... I mean like right now, if I was to give somebody advice, I would say, well, if it's an area everybody thinks is going to work, there's probably going to be a bunch of people doing it because everybody else thinks probably the same way this is going to work. And so it's almost more fun to say, okay, here's something that no one thinks is going to work.

Abhi: Okay. That's good advice. So, not field specific, just pick something that doesn't seem to have billions of dollars worth of attention being focused on it.

Sergey: Yeah. Well I'm not sure if it's always good advice. Like, maybe ultimately you want to pick something that's not completely crazy, but... yeah. But I would say whoever's listening to this, don't listen to me because I'm probably giving really wrong advice here. You probably want to pick something that will secure your future somehow. But at least for me, I've never... every time I pick a problem, I always think like, it's more like, let's explore it. Find something cool and see what we find. And sometimes you get some cool things that come out of these things.

Abhi: Do you think... I know you mentioned you had this passing interest in pursuing history when you were in college. What would an alternative Sergey look like? If you had not touched biology at all, what field do you think you would have focused on?

Sergey: It's a good question. I mean, I did find that I enjoyed history, in terms of understanding what happened in the past. Maybe this is also why I was interested in phylogeny because that's also history. I'm just sort of making this connection now, potentially. But one thing I found is that when I went to college and I started taking some of these history courses, I quickly realized I didn't have the right sort of, uh, I guess you could say background, because at least my colleagues were all like, they all learned Latin at some point. I was like, and I don't know that. So I quickly realized this is not something I was pre-trained to be able to handle. And so, but I really enjoy just learning about the history of science because like what people have done before, how people decide to work on DNA and so on and so on. That's kind of... and so, but I think it's kind of fun to see it from that perspective. I don't think I would be able to get into it now because I probably have to go back and learn Latin again, but...

Abhi: Well, at least for history of science, you probably don't need to learn Latin. Do you think you're an active consumer of, like the Genentech book that came out that explained the history of recombinant insulin? Do books like that particularly appeal to you and you enjoy reading about how scientists in the eighties and nineties really worked on things?

Sergey: Yeah. Yeah. I definitely find it really exciting just thinking about... because I think one thing that I found quite fascinating as you look at all these different stories of how people made breakthroughs, often it wasn't because somebody actually had a very specific thing they wanted to test. Often it's like somebody does something completely random and they see some interesting unexplained signal and they start to pursue it, and they get a really cool discovery. And so I think a lot of the most important science has been sort of made that way. Like people sort of see a random signal and they pursue it. And that of course sort of brings up the question of like, how do you actually get this research funded? Right? It's like, if you don't have a direct question and you just want to explore in the cloud and find something... but that being said, there is funding for basic research, is what I'm trying to say. And like for us, often what I do is we try to propose a general area to explore. And every time we start projects, we start exploring areas, but then if we see something interesting along the way, sometimes we change direction to that area.

Abhi: When I think of your lab, I consider it much more from the outside an applied biology lab. In your head, do you still consider yourself a basic research scientist?

Sergey: Yeah, I would say we're more... I like to think we're more basic. In terms of thinking about more theoretical problems, big picture problems. We do collaborate with a lot of people who do have specific applications. So for example, some people want to do protein design and we say, Hey, maybe some of these tools or some of these recent hacks we found could be useful for protein design. And then we collaborate with people like that. But yeah, I like to think that we're more on the basic sciences. Yeah.

Abhi: Do you have a... when it comes back to these... you pursue problems that you often just feel innately interested in, and that just makes you want to pursue them for years on end. Do you think you have a good sense of taste for what makes for a good problem and what doesn't make for a good problem? I remember reading this article a while back about types of problems you should focus on during your PhD and what makes for something good to focus on for the next five to six years. What's your own internal sense for what those types of problems are?

Sergey: That's a good question. So I'm, disclaimer, I'm a new PI, so I haven't seen... I only have a sample size of one being myself. So I don't know if these are good answers or not. It's... I mean, I guess one could say that... I mean, I think it's good to start with some problem, regardless if it's that interesting or not. Because I think you sometimes just need to get yourself familiar with the field. Right? So like, you start with something, say, Hey, I don't know, protein docking. Right? And so regardless if you're interested in that problem or not, but as you start to explore the different tools, you start to realize different limitations of these tools. And then something you might come across, something interesting that you pursue further. And so, at least the way I've currently been doing this with my current students, we'll see, we'll see five years from now, if it's actually a good idea or not, is to say, Hey, let's just start working on something. Like, here are some interesting areas that I find interesting, but with the intent that at some point we'll see something else. And sort of thinking back now, when I look at all the things that we have published, I don't think there's been a single thing where this is what we thought we'll be publishing five years from now. It's always like we start with something and we're like, oh, we found something really cool along the way, and then we realized, hey, this could be applied to this. And so on. And so I think it's just like getting started, starting to explore. And I think curiosity is probably a really important component. Like you see something that doesn't make sense and you sort of keep pursuing it.

Abhi: So you typically don't have like a two to three year plan for what paper do I expect to publish in three years?

Sergey: Well, we do start with that. Like the idea is we start off with a plan. It's like, okay, here's a problem. But I guess what I mean is when you start that problem, you might end up deviating from that problem. And that's totally okay, at least the way I view it. It's like you start... you have a... I mean, I guess in a maybe more ideal world, one would say, okay, maybe it's good to have two problems, like one that's a little more safe, you know? If you do these steps, it'll work. And then you sort of have, I guess, maybe similar to optimization... I was thinking more like simulated annealing. It's like you have different chains, sort of a fast-moving chain and a slow chain, and then you could potentially do some kind of recombination. But this is kind of getting a little... but I would say it's good to have a plan, just in case. Yeah. But don't be afraid to deviate from the plan. Because sometimes you might come across something cool or interesting or unknown, and it's okay to shift over and explore that unknown. And that's usually where the most interesting impact will come from.

Abhi: Do you think it's generally when you're setting up this initial plan to start off with, do you think it's good to focus on incredibly ambitious, largely intractable problems, or it's good to start with like a pretty close-ended thing that you know if you apply enough engineering effort, you will have a paper at the end of it?

Sergey: Yeah, that's a good question. I don't think I have a good answer for you because... at least anytime I've done any research myself personally, I've never said, okay, if we do this, we'll get a paper. It was more like, Hey, let's explore this area. I know eventually we'll get a paper about something, but whether or not we'll get the paper for that specific thing we started with... And at least that's the way I've been doing science; whether or not that's a good idea, I'll find out later. But I feel these are kind of questions that might be more relevant to ask somebody who's been a PI for many, many decades; they could probably tell you, Hey, I've seen all my students, and these are the things I experimented with, these are the things that worked and didn't work. But just coming from myself personally, I've always felt like it's good to have a goal in mind. But deviating from that goal is going to be an important component to actually find something really cool.

Abhi: I imagine being a PI, you're a relatively new PI, you've been around, I think at MIT since January 2024. And I imagine as PIs must do, you have to specialize a little bit? You can't kind of be all over the place, especially early on in your career. Do you think there are areas outside of protein structure determination and protein binders that you wish you could have an extra 12 hours of the day to focus on?

Sergey: Well, I would say, for me personally, I mean we are somewhat protein-centric. But we are trying to expand a little outside of that in terms of thinking about genome-scale related things. Because ultimately proteins don't exist in isolation. But even for the same protein, the RNA itself might actually have other components there that determine the structure. So what I mean by that is, for example, different codon usage could potentially influence how a protein folds, just because it might stall the ribosome in different parts. There might be some things upstream or downstream of the sequence that maybe change the expression levels, or maybe even change other things. So I guess in some ways we are also exploring that other side.

Abhi: Do you think like your lab will start poking at RNA and DNA models in the near future?

Sergey: Yeah. Yeah. So we, for example, have been exploring models like Evo coming out of the Arc Institute, trying to understand what they are learning, are they learning something different? And yeah, this is part of the, I guess you could say... I think we're kind of moving more and more toward thinking about proteins in the context of a genome as opposed to proteins in isolation.

[01:21:06] What are DNA models like Evo learning?

Abhi: I'm curious, what's your take on Evo1, Evo2? What are your general thoughts about it?

Sergey: Yeah, it's quite interesting. So we, I mean, we've tried applying some of these techniques that we developed, like the categorical Jacobian, to Evo. And we are currently not seeing that it's learning contacts in terms of protein interactions. We're still internally debating: is this because the approach that we developed, this categorical Jacobian, just doesn't work for these kinds of models, being autoregressive and predicting single nucleotides? Or is it because the model itself is learning some other signal that's dominating the protein signal?

Abhi: Sorry. When you say that, are you saying it's unlikely Evo2 is actively relying on co-evolutionary signal?

Sergey: So what I mean is: to step back a little bit, when you do have a model that takes a sequence and returns the sequence, you can ask the question as you perturb the inputs, how do the outputs change? And so if you think two positions interact or are dependent on each other, when you perturb something, let's say in position 10, the logits or the outputs should change differently in position 1 versus position 10. And when you do this kind of perturbation experiment in protein language models, we clearly are able to recover contacts like saying these are interactions.

But when you do the same thing with models like Evo and Evo2, you see it's able to recover RNA interactions like RNA stems and so on. But we don't really see a strong signal for protein-protein residue-residue interactions. And so one hypothesis is like, well, maybe there's another dominating signal that's completely washing it out. Like it's possible it's still learning contacts, but maybe there are other forces at play, for example, encoding start and stop codons, encoding the starts and ends of genes; maybe that signal's much more powerful, and so it's kind of washing away the contacts. Gotcha. So maybe we just have to disentangle the signal. Or maybe it's not learning any contacts at all. And so this is sort of still an area that we're investigating.
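A stripped-down sketch of that perturbation experiment (the published categorical Jacobian work also applies corrections such as APC, omitted here): `model_logits` is a stand-in for a language-model call that returns per-position logits of shape (L, alphabet size), not a real API:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def categorical_jacobian(seq, model_logits):
    """Record how every output position's logits move when each input position is substituted."""
    L, A = len(seq), len(ALPHABET)
    base = model_logits(seq)                        # (L, A) logits for the unperturbed sequence
    J = np.zeros((L, A, L, A))
    for i in range(L):
        for a, aa in enumerate(ALPHABET):
            mutant = seq[:i] + aa + seq[i + 1:]
            J[i, a] = model_logits(mutant) - base   # response of all outputs to one input change
    return J

def interaction_map(J):
    """Collapse the (L, A, L, A) tensor to a symmetric L x L interaction score."""
    C = np.sqrt((J ** 2).sum(axis=(1, 3)))
    C = (C + C.T) / 2
    np.fill_diagonal(C, 0.0)
    return C
```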

I guess another hypothesis might be that maybe the... if you look at any given codon, and if you remove, for example, the third codon position, you could still recover what the amino acid is. So, for example, masking the third nucleotide of a codon is not really informative in terms of learning coevolution, because the model can recover it from the rest of the codon. Same thing with the first position: if you mess with the first position of the codon, you often stay within either hydrophobic or hydrophilic. You're actually not changing the property of the amino acid.

Abhi: So there's something off about the masking strategy that Evo2 is using that leads to these strange results?

Sergey: Well, that's one hypothesis. Like maybe the reason it works perfectly for RNA is because with RNA, every nucleotide, or I guess ribonucleotide, is actually interacting with something else. Yeah. And so if you mask it, then of course you need to change the other one to compensate. Or if you, not mask, but mutate it, I guess. But with protein sequences on the DNA level, it's possible that maybe you could just recover a lot of the statistics of things that you've masked just based on local properties. Like if you're in this codon, you know you're going to be an alanine, so the third position is not going to tell you much. Or if you mask or mutate the first one, you could look at the other two to figure out what the first one should be. Even the middle position, like, for example, a transition versus a transversion can keep you within the same amino acid property. And so maybe it's just a little too easy to reconstruct the masked tokens without having to actually fully understand the protein interactions. But that's something we're still investigating, so we'll have to see.
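That intuition is easy to sanity-check against the standard genetic code: count how often the first two nucleotides of a codon already pin down the amino acid. The sketch below uses Biopython's codon table; stop codons are left out for simplicity, which slightly inflates the count:

```python
from collections import defaultdict
from Bio.Data.CodonTable import standard_dna_table  # Biopython's standard genetic code

amino_acids_by_prefix = defaultdict(set)
for codon, aa in standard_dna_table.forward_table.items():   # stop codons are not listed here
    amino_acids_by_prefix[codon[:2]].add(aa)

fully_degenerate = sum(len(aas) == 1 for aas in amino_acids_by_prefix.values())
print(f"{fully_degenerate} of {len(amino_acids_by_prefix)} two-nucleotide prefixes "
      "already determine the amino acid on their own")
# For those prefixes, a masked third position can be filled in from purely local
# information, so predicting it correctly says nothing about residue-residue contacts.
```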

Abhi: And kind of on this note of like potentially off masking strategy for Evo2, clearly modern day biology models have a lot of pathologies and edge cases where they don't quite work as well. Do you think the fault of that can be largely placed on the internals of the model itself? Like the engineers and scientists need to fix it? Or do you think it's often much more like a data problem? Because I think for a long time, the data problem was clearly focused on by a lot of people. And it seems like you can actually get pretty big gains by just messing around with the model, which is what your lab does a fair bit.

Sergey: I mean it could be that... so I guess another hypothesis for why we don't see contacts is that maybe the model just needs to get much larger. Yeah. So what I mean is, if we do believe that all language models are doing is just storing information or storing evolutionary statistics, there's probably a lot of evolutionary statistics that you need to store about the genome. And suddenly there are just too many things to store. And so you're probably going to store some of the, I guess you could say, low-rank signals like conservation and so on. And then maybe as the models keep scaling, you might start to learn more of these sparse signals, which are contacts.

And so it kind of makes sense that the first signal you learn is conservation. And then as you scale... and that's actually what we see, for example, with some of the earlier protein language models. Like TAPE, a lot of times, just learned conservation. But then as we moved on to ESM2 and you start to scale up, you start to see, hey, now if you start comparing the different models that were trained, some of them don't learn any contacts and some of them start to learn contacts as you make them larger and larger. So it's also possible we're just at the very early stages, like Evo2 is just the very first early model, and maybe it needs to be like 10 times bigger or more to finally start to learn those details.

Abhi: Do you align with the idea of DNA is all you need? Or do you think at some point you need to bring in structural information?

Sergey: It's... I mean, I guess one doesn't necessarily need to bring it explicitly because I guess the idea is like, if structure's important, then the model, the attention would learn those structural constraints implicitly. Like if that... like for example, if structure's important for the reconstruction task, it should be able to pick up on that information. And that's what we see with ESM2 and so on.

That being said, there's a possibility that you could unify some distance space by introducing structure. And so what I mean by that is, for example, people have trained SaProt, from a group at Westlake, and then ProstT5, for example; they started introducing structural information into language models. And the SaProt team, for example, found that sometimes you could do much better at predicting effects of mutations by introducing structure. But the structure resolution is so low. Like they're using 3Di tokens from Martin Steinegger's group. And so it's kind of like, if the resolution is so low, how is it able to use that information?

And so one of the current hypotheses is like, well, maybe what's happening there is that currently, let's say in the language model space, you learn that this is protein A, this is protein B, and they're far away from each other. But in reality, they might have really similar structures. And by introducing even some really low-resolution structural information, you sort of start to align those spaces together. And so maybe the model is sort of learning different categories of proteins, and those spaces are not well aligned to each other, and you're sort of bringing them close together. And so now you can borrow statistics from neighboring families that you didn't borrow from before. And so I think structure could maybe help learn a better latent space of protein sequences.

Abhi: So it feels unlikely that a single modality will dominate, like there'll be a single multibillion parameter model using only one modality of data that'll dominate every single benchmark.

Sergey: I mean, theoretically it should be possible. Like it should... technically, DNA should be all you need. But in practice, I think if you don't have a large enough model, or maybe if you don't have enough... something wrong with the... like maybe you get stuck in a local minimum. I think sometimes structure could help steer the training in the right direction. But theoretically it shouldn't need, like, you shouldn't need it, I think.

[01:29:11] The problem with train/test splits in biology

Abhi: Yeah. Makes sense. And one question one of your old students actually told me to ask you is, first, context for the question: making test-train splits for molecular and protein engineering is really hard, since there are so many possible ways to leak data. You need to simultaneously think about homologous sequences, structure similarities, and functional similarities between proteins in your test set and your train set. What do you think most papers get wrong about splits in this field, and how could it be improved?

Sergey: Well, I think one of the issues is that... we have, it's not an issue, I think it's good. We have lots of people coming from the computer science world into biological problems. And so I think it's great that a lot of people are coming and helping us solve all these problems. But you do have to sort of consider the relationship of things. But even when you do consider the relationships, there are things like, where do you set the cutoff? And so this is where the debate in the field is: well, you can use a sequence identity cutoff of 30%. But where that number 30 comes from is usually that people say, if it's 30 or higher, it's probably doing the same thing. But you can still have things less than 30 that do the same thing. It's more like 30 is a cutoff for something being positive, but it's not a cutoff for something being negative. And so I think people often saw the number 30 and they're like, okay, we're going to split at 30% identity. But in fact, you could have proteins that do the exact same thing and have the exact same structure, with sequence identities as low as 10%. And so that's where you say, Hey, I split by 30, but in reality, you still have an overlap between the train and test set.

And I guess one could say, well, let's use structure, let's try to use structure here. But with structure you also have a similar problem, where maybe in one organism there are domains that are in different orders. So for example, you might have essentially the exact same, I guess you could say, protein, but this protein has a different arrangement of the domains, or has a couple of extra domains, or has some disordered loops or something. And then you compare those two things and say, Hey, they look different. But in reality, they are distantly related, because any given protein is actually made of a bunch of domains, and these domains could be in a completely different order between different organisms. And so if you do a structural superposition, you might also be misled to think that they're more different than they actually are. Yeah. So it's almost like you first have to split by domain and then compare domains. But then the problem is that within domains you could also have rearrangements of secondary structures. Like, you'd have the exact same protein, but maybe the secondary structures are actually in a different order. And so that creates a bit of a problem. But then at some point you say, well, let's just split by amino acid, and then it's like, well, everything's related. And so that's one of the tricky things, I think: you have to be a little careful in that space.

It depends what your claim is though. I think for example, to step back a little bit, I mean, before deep learning, there's always been a field of remote homology search. Like say, Hey, if I could find a similarity between one thing and something super distantly related, like less than 10% identical, and I could say that they're the same, it's actually still a very, very important problem to work on. Because you could say, well, if I can find remote similarities, maybe I can make hypotheses about those things to say, if this looks like this, maybe this also does the same thing. And so there is benefit to remote homology search. And so if you say, Hey, I have a model that does better remote homology search, that's actually okay. But if you say, Hey, I have a model that works because it learned physics, then suddenly now you have to be more careful how you're going to split your data. You have to say, I have to completely make a huge giant hole in the sequence space and say, does it still generalize there or not?
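To make the splitting discussion concrete, here is a deliberately naive sketch of a cluster-level split at 30% identity (the identity function here skips alignment entirely, which a real pipeline would not). Even done properly, this is exactly the kind of split being warned about: remote homologs below the cutoff can still leak between train and test:

```python
import random

def identity(a, b):
    """Naive identity over the shorter sequence; real pipelines would align first."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a[:n], b[:n])) / max(n, 1)

def greedy_cluster(seqs, cutoff=0.30):
    """Assign each sequence to the first cluster whose representative it matches above the cutoff."""
    reps, clusters = [], []
    for s in seqs:
        for idx, rep in enumerate(reps):
            if identity(s, rep) >= cutoff:
                clusters[idx].append(s)
                break
        else:
            reps.append(s)
            clusters.append([s])
    return clusters

def cluster_split(seqs, test_frac=0.2, cutoff=0.30, seed=0):
    """Assign whole clusters, not individual sequences, to train or test."""
    clusters = greedy_cluster(seqs, cutoff)
    random.Random(seed).shuffle(clusters)
    n_test = max(1, int(test_frac * len(clusters)))
    test = [s for c in clusters[:n_test] for s in c]
    train = [s for c in clusters[n_test:] for s in c]
    return train, test
```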

Abhi: Do you think there's... I haven't actually seen that many people do this, but I've thought in the past like, why can't you just use the embeddings of a language model as a way to stratify things? Is there like some pathology there as to why that actually... like the issue you pointed out with domains, is there some fundamental issue with relying on embeddings?

Sergey: Exactly. So it turns out it's because of domains. Okay. So if you take the embedding and you average it, you've essentially already lost the information about what the order of the domains is. Okay. Yeah. In some ways you could say that's a good thing. Like maybe you don't want the orders; if you average, you got rid of the order. But the thing is, sometimes you could have a scenario where in one organism the domains might actually be in different proteins. Like there could be three proteins coming together with different arrangements of domains. In another organism it could be one protein with all those domains stitched together. Like all the domains are fused in one protein. And so if you average, all those things will look different from each other. It's almost like you want to first take the original embedding and chop it down and then compare all the different chopped segments to each other. But as soon as you start to do that, then you're essentially back to aligning sequences. And if you start aligning, then if they're in different orders, you can't really align them anymore because you can't use dynamic programming. And so it creates a bit of a challenge there.
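The averaging problem is easy to demonstrate with made-up numbers: mean pooling is permutation invariant, so two proteins built from the same domains in different orders get identical pooled embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
domain_A = rng.normal(size=(120, 64))   # fake per-residue embeddings for domain A
domain_B = rng.normal(size=(80, 64))    # fake per-residue embeddings for domain B

protein_AB = np.concatenate([domain_A, domain_B]).mean(axis=0)   # domains in order A-B
protein_BA = np.concatenate([domain_B, domain_A]).mean(axis=0)   # same domains, order B-A

print(np.allclose(protein_AB, protein_BA))   # True: averaging erases domain order
```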

Abhi: Yeah. Like the way you have set up that dilemma feels unsolvable. Like what do you think researchers should be doing?

Sergey: Well, in our case, for example, we had a recent paper with Nick Polizzi, AF2Bind. And so there, what we did was say, you know what? Let's make sure there's no overlap on the Pfam level, like the protein family level, based on sequence HMM comparisons. Let's make sure there's no structural overlap, but also let's cut out the actual binding site and make sure there's no overlap in the binding site. And so we went through levels and levels and levels of trying to say, make sure, make sure there's absolutely no similarity between these two things. And so then we were more safe to say, okay, these are probably not... but once you start doing that, like for example for protein-small molecule interactions, I think we found in total there are only about 500 proteins at the end of the day.

Abhi: Mm-hmm.

Sergey: Like independent samples that don't look like each other anymore.

Abhi: In terms of the binding site?

Sergey: Yeah. Yeah. Well, so it's like if you start to go through, say let's cluster based on sequence, structure, binding site, turns out there's not that many examples anymore.

Abhi: Gotcha.

Sergey: But of course you don't have to remove the data. Like you could still keep the members of that cluster within your training, but you quickly realize there's not as much data as you used to think there is.

Abhi: Do you think this stratification problem also pops up with DNA language models? I didn't actually look too closely as to how the Evo team stratified things and it's kind of unclear how DNA language models work in this capacity.

Sergey: Yeah. I think for language models, people are not as concerned in terms of train-test split. But what I mean by that is, there are concerns. Like you definitely don't want it to just memorize all the sequences. So you definitely want to remove some sequences to make sure you didn't memorize. But in... I guess if we come back and think about language models just storing protein families and their evolutionary statistics, then in some ways you don't actually want to remove sequences from an entire protein family. You want to remove just enough to confirm that you're not just memorizing sequences, but you don't want to remove so much that you sort of obliterate an entire protein family from the training.

Abhi: Well, the hope is that there's like transfer to like an unknown protein family, I guess this is the whole ambiguous versus non-ambiguous thing.

Sergey: Yes, exactly. So like, I think you're right. So if you want to make the argument that the protein language model has learned sort of a new space, then yes, there you want to actually make sure you remove anything that's remotely similar to it. But if you say, Hey, I just want a language model that sort of stored all the information and statistics of all the protein families, then in some ways you don't actually want to remove too much information. Like you kind of want it to see sequences from every single protein family because you want to learn an embedding of all these protein families. So I guess it depends on what you're using the model for, and what claim you want to make for those models.

Abhi: And I imagine if I'm to infer your viewpoint correctly, is it that... do you think trying to train a model such that it'll be able to divine like a new protein space entirely, like a new protein family entirely isn't actually all that fundamentally useful because most of the known protein families are in the space of things we actually care about?

Sergey: I mean, I would say it would be very useful. Okay. I mean, ultimately the goal is to be able to have a model that can generalize to all proteins. And so I think one example of plots that we like to make and other folks too is like the number of sequences versus the performance of the model, by number of sequences meaning how many sequences are there in the protein family? And currently, like most protein language models, I think all of them actually have like a curve this way. Right? So it's like if that family doesn't have that many sequences, the model has very poor performance. I mean, eventually we'd want the model just to be... this line to be flat, it shouldn't matter how many sequences there are in that family.

And one could imagine, I guess you could say, hiding some sequences, or alternatively just keep evaluating whether that curve is shifting. Yeah. Like if that curve starts to become flat, then you're like, okay, it's learned something interesting, fundamental.
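A hedged sketch of that evaluation: bucket test proteins by how many homologs their family has and report mean performance per bucket (the metric, bucket edges, and inputs here are arbitrary placeholders). A flat profile would suggest the model is doing more than leaning on family size:

```python
import numpy as np

def performance_by_family_size(family_sizes, scores, edges=(1, 10, 100, 1000, 10000, 100000)):
    """Mean score per bucket of protein-family size (e.g. MSA depth)."""
    family_sizes, scores = np.asarray(family_sizes), np.asarray(scores)
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (family_sizes >= lo) & (family_sizes < hi)
        if mask.any():
            out[f"{lo}-{hi}"] = float(scores[mask].mean())
    return out
```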

But this also sort of brings up the question sort of a fundamental difference in how sometimes people in the computer science world think about problems versus biologists. I think sometimes when I talk to my computer science colleagues and I tell them, Hey, we want to predict things out of distribution. And I think they get a little bit like, what? It's like that sort of goes against the fundamentals of machine learning. You want to learn things within distribution and you want to be able to sample things within distribution. And the thing is, in biology, often we want the out-of-distribution things. And so like how do you get a model to go out of distribution? And I think the things that I find recently more exciting is like, well, things where you could iteratively reason over things might be a way to sort of move into that space. But...

Abhi: Going back to this idea of many different seeds and like seeing where on the fitness landscape the model places you?

Sergey: Yep, yep, yep. Exactly. And I think there's been, in other fields, for example, people are now introducing like chain of thought and so on, where maybe now if you have a model that says, Hey, return something in one go, or one shot or zero shot, it sort of goes through and makes the best guess it could. But if you sort of iterate on that, maybe you have it be able to explore outside of its training. And in some ways that you can think of that, that's what essentially Alphafold is doing. Like Alphafold will say, I'm going to make a guess, that's like zero recycles, and then you iterate and you sort of move around. But maybe if you do many, many independent seeds. And I think that's actually what some of these models like o1 and o3 are doing, like they have many, many independent starting points and they explore. And so I think in some ways, I guess we could say we've been already doing that for a while in the protein world. And they're kind of catching up. But I guess one could also say, well, maybe some of the techniques learned in that space could be applied now in the protein world as well.

Abhi: Do you think there's that much utility in integrating... like instead of having... actually more straightforwardly, do you think there's useful information in integrating wet lab assay data when... I guess like one example of this is introducing binding affinity into Alphafold 3 instead of just having like two protein structures together, you also say like, oh, this is the KD that came alongside it. Do you think there's that much value in doing something like that, or empirically it doesn't turn out to actually be all that important?

Sergey: I think in practice it could be useful because right now the model, like for example, methods like Alphafold, assumes anything you give it is a good thing. Like it's a real thing, it's a real protein. But it turns out in some organisms, maybe you want it to have higher affinity; in other organisms you want less affinity. Like if you're in a hotter environment or a colder environment, maybe you want the protein to be less or more stable. And so right now the model just assumes everything you give it is a good thing, and so it sort of learns some interesting average of all that stuff. But being able to sort of tune that label during prediction or during design could actually be useful.

Abhi: Do you think there's anyone trying to integrate almost like human labels? Because alongside a lot of these protein structures that are deposited in the PDB, there are some semantic labels that go alongside them, like, this is a particularly well-resolved structure, or this residue was hard to resolve. Do you think integrating that sort of information, like bringing in semantic-level information, is actually all that useful? Or is it just kind of so far beyond what anybody else in the field is focusing on right now that it's not worth poking at too much?

Sergey: I mean, I think it's definitely an exciting direction to look into. I don't know if anyone's tried that yet. I mean, people have tried filtering on those labels. Like, for example, I guess the most recent example I can think of that was really cool is the Soluble-MPNN paper, where folks say, Hey, let's retrain ProteinMPNN, but only on soluble proteins. I guess a newer version of that is to say, well, what if you just provide a label? Say, is this protein soluble or non-soluble? Yeah. Like right now, the model sort of implicitly... like the default model, ProteinMPNN, even Alphafold by default, sort of sees that sequence and maybe tries to infer if it's soluble or non-soluble. But if you retrain on that kind of label, or introduce this label, you could essentially now tune it and say, Hey, this is a protein in the membrane, and maybe you should be folding it differently or filtering the constraints differently. So I think there's already been a demonstration of this in the context of Soluble-MPNN; this is Justas and Bruno Correia and so on... Yeah. Yeah.

Abhi: On the subject of Soluble-MPNN: one question I realized I should have asked much earlier is, what are your thoughts on sequence-structure co-design versus one step for structure, a second step for sequence?

Sergey: So Yilun in my group recently put up a preprint where we explore this a little bit, where we co-design both at the same time. Our current hypothesis is that there actually is a benefit to co-design, from the perspective that, say you start with a structure and come up with a sequence for it, maybe there's a slightly different structure that encodes an even better sequence. So having the ability to move the structure closer to the sequence space, and then find a better sequence, there's some benefit to that. Because maybe it's possible for this fold that you could find an even lower free-energy sequence, but unless you see that structure, you won't be able to come up with that sequence. So I think there's benefit in that space.
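
A minimal sketch of the alternating refinement he describes, assuming hypothetical `design_sequence` (structure to sequence) and `refold` (sequence to structure plus a score) callables; this is one way such a loop could look, not the preprint's actual method.

```python
def codesign(initial_structure, design_sequence, refold, n_rounds=5):
    """Alternate between designing a sequence for the current structure and
    letting the structure relax toward that sequence; keep the best-scoring pair."""
    structure, best = initial_structure, None
    for _ in range(n_rounds):
        sequence = design_sequence(structure)    # structure -> sequence
        structure, score = refold(sequence)      # sequence -> (structure, score)
        if best is None or score > best[2]:
            best = (sequence, structure, score)
    return best  # (sequence, structure, score)
```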

But it doesn't necessarily solve the problem of whether the sequence folds into that structure and no other structure. So you still need, I guess you could say, the structure prediction module in place.

There's been some work by folks like Jason Yim at MIT asking, hey, could we co-design things? But I feel like when you want to co-design things, what you really want is to co-design sequence, structure, and folding. Yeah. Because the way I like to think about it is, when you're designing a protein, you're not designing a sequence or a structure, you're essentially designing a folding landscape. Yeah. So it's not co-design of sequence and structure, it's co-design, or maybe tri-design if you want to call it that, of sequence, structure, and folding. And it's kind of hard to think about how models like diffusion could incorporate folding, because they're considering that one structure, but not the ways the protein sequence would fold into that structure along the way. It's not super clear to me how you would actually incorporate that at the moment.

Abhi: Like, as opposed to pure structure models?

Sergey: Well, there have been approaches where people say, let's diffuse the sequence and structure at the same time. And then the question is, are you satisfying both folding and inverse folding? And there I would argue, no, you're not satisfying inverse folding. Because inverse folding requires that you come up with a sequence that folds into one structure and no other structure. So unless your diffusion is somehow accounting for all the other ways the sequence can fold, you're not actually optimizing for the inverse folding problem.

Abhi: I guess ProteinMPNN is trying to satisfy that.

Sergey: Well, this is what I would argue: it's not, really, yet. Okay. Unfortunately, people in the field have been using the term inverse folding incorrectly, because formally, inverse folding, I think folks like Kendall defined it as finding a sequence that folds into one structure and no other structure, while also making sure that structure is actually accessible.

Abhi: Like, thermodynamically accessible?

Sergey: Or I guess I was thinking more on the kinetic side: during folding, you want to make sure it actually folds, that there's not a huge barrier. Yeah. Unless you have a chaperone; maybe that's one way to lower the barrier. So that's what inverse folding was formally defined as. Unfortunately, folks in the computer science world said, hey, let's develop a method that takes a structure and returns a sequence, and we're just going to call it inverse folding. It is a nice term, but it's not actually the real inverse folding as the term was originally defined. In the ProteinMPNN paper they were pretty careful not to call it inverse folding, but other folks who've developed similar methods have. So on Twitter, or X, every time somebody says inverse folding, I try to correct them and say, hey, this is not really inverse folding, you're not actually satisfying the inverse folding question, I guess you could say.

Abhi: I guess, though, if you use that definition, no structure-to-sequence or sequence-structure co-design method is satisfying that metric.

Sergey: Well, this is what I would argue: inverting a structure prediction model is sort of implicitly trying to satisfy that condition. Say you have a method that takes a sequence and predicts a structure; if you invert that model, then as you're changing the sequence, at every step along the way you're implicitly checking, are you satisfying that condition?

Abhi: Gotcha.

Sergey: And so I'd say that is closer to inverse folding than a method that takes a structure and tries to predict a sequence from that.
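
A toy version of what "inverting" a forward predictor can look like, with `predict_fn` and `loss_fn` as assumed stand-ins for a sequence-to-structure model and a structure-agreement loss; the greedy mutation scheme here is just the simplest possible sketch of the idea that every accepted change is one the forward model still maps onto the target.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def invert_predictor(predict_fn, target_structure, loss_fn,
                     length=80, steps=200, seed=0):
    """Greedy toy inversion: keep a mutation only if the *forward* predictor
    still maps the sequence onto the target structure (lower loss)."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    best_loss = loss_fn(predict_fn("".join(seq)), target_structure)
    for _ in range(steps):
        i = rng.randrange(length)
        old, seq[i] = seq[i], rng.choice(AMINO_ACIDS)
        loss = loss_fn(predict_fn("".join(seq)), target_structure)
        if loss < best_loss:
            best_loss = loss          # accept the mutation
        else:
            seq[i] = old              # revert it
    return "".join(seq), best_loss
```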

Abhi: Gotcha. Um. And so one thing I am still a little bit unclear about is why do you think that structure-sequence co-design models are not able to do this?

Sergey: Well, because at no point along the way are they evaluating whether that sequence would fold into something else.

Abhi: Gotcha. Whereas the sequence-to-structure model is doing that during its training process. And then if you invert it, you hope that one-to-one relationship kind of stays the same?

Sergey: Exactly. Exactly. Exactly. Gotcha. Okay. Now, I mean, people who actually study protein folding might tell you again, no, that's not true because the model doesn't explicitly fold proteins. Yeah. And so maybe you do actually need a model that does protein folding first to be able to do that. But we think it is sort of implicitly maybe pushing in that direction.

[01:49:07] What Sergey would do with $100 million

Abhi: One of the last questions I had is, you've talked a lot on Twitter about recent funding cuts to academia in general, and specifically about concerns for your own lab. But I want to talk about the best-case scenario: let's say you did have a few hundred million dollars you could spend on any type of basic or applied research you wanted. What would you want to work on?

Sergey: That's a good question. If I had lots and lots of money, I always think about whether there are experiments we could do, wet lab experiments, to collect information about protein folding. That's one of the things where right now we have almost no experimental data... I mean, there have been people, I'm trying to remember his name, a guy in Canada... it's not Lewis Kay, it's like K. Lewis or something. Let me think of a good way to explain this.

Abhi: When you say measure protein folding, is that like NMR level, like the actual process of the protein folding? Yeah. Okay. And you would want to just collect thousands of these results?

Sergey: Yes, yes. Okay. So one of the problems right now is: let's step back and say we want to solve the actual protein folding problem. And by protein folding problem, what I mean is starting from an extended chain and finding the right conformation of the structure, step by step, actually thinking about all the steps required to get there. One of the issues is that we don't really have any ground truth, so we can't really train such a model.

I mean, there have been experiments where people look at transition states. That's the researcher I was trying to remember the name of, Lewis Kay, I think. But that only covers one or two proteins; I think they measured some NMR transitions that are what they believe to be intermediate folding states. So if there was lots of money, I'm wondering whether there are ways we could develop experiments that actually measure, at large scale, how different proteins fold.

And I've been thinking about how we would go about doing this, and it's a little tricky, because you have to measure it at the individual-molecule level to some extent. Yeah. Because it turns out every protein will fold differently, so I don't know if there's one particular way. Maybe there are ways to synchronize it. For example, people have done these molecular tweezer experiments, where in some ways you can unfold and refold the protein by pulling on the N and C termini. If there were some way we could image those things.

Abhi: Like just a line of amino acids to the final state.

Sergey: Yep. Yep. Yeah. I mean, the other problem is also pretty exciting. Like, hey, how do you extract all the different dynamics a protein could have, or maybe all the different conformations it could fold into. But I guess what I'm saying is like, it would be great if we could somehow actually get step-by-step snapshots of the structure folding. And so if I had money, I think I'd throw it all on that.

Abhi: In some sense, is that not dynamics?

Sergey: It is, it is. It is dynamics. Yeah. Yeah. Yeah. But usually it just happens once and it stays there. Yeah. And so with NMR, you're kind of stuck watching that one structure vibrating and moving around. But the question is, could we take snapshots of it going from an extended chain down to the final structure?

Abhi: Let's say we did have a perfect NNP, a neural network potential, that was capable of taking this string of amino acids and seeing, femtosecond by femtosecond, how it folds into a final structure. Yeah. Would you still want this NMR dataset? Is there something useful about measuring it in the real physical world that you cannot get from a simulation?

Sergey: I mean, if we did have a good simulator that could actually do the whole process... I guess I was thinking more, could we generate data to train a better simulator to be able to do this? Gotcha. Okay. Because I'm under the impression that currently, even if you had infinite compute time, an MD trajectory might never find the solution, just because there are inaccuracies in the energy function and so on. But if we assume everything in MD is correct, then you're right: if we have infinite compute, we just do that.

Abhi: Yeah. But like, yeah, I think empirically you are correct in that they are nowhere near that place of being able to go from...

Sergey: I mean, there, I think D. E. Shaw showed they could get really, really small domains to fold up, but we're quite far from scaling that. I think we could approximate a lot of this with neural networks, but we need some of these intermediate steps that we could train on. But the question is, why would you even need this, if we already have methods that can predict a lot of protein structures' end states? Yeah. Like the end state. Right, right. But I'm under the impression that in order to predict some of these really, really complicated protein structures, it would be good to see some of those intermediates, because maybe you actually want to push the model to explore them step by step.

Abhi: Returning to the NMR approach, are you currently pursuing any attempts to really scale up wet lab efforts in your own lab, or is your group right now focused entirely on computational work?

Sergey: I guess just to clarify one point, I wasn't saying NMR was necessarily the solution to the problem, just that it's one area that might be able to get us there. But beyond that, we're not trying to do NMR right now. Hannah Wayment-Steele is actually starting a lab where they're going all in on the NMR stuff, so she'd be a cool person to talk to at some point. At least in my group, I've always thought of it as more of a theoretical, computational group. That being said, we have started some efforts in the space of building up robots that can do experiments for us, some automated-lab kind of things, but that's more in the exploratory phase at the moment. So primarily we're a computational group, and we collaborate with people. Yeah.

Abhi: You're really interested in the pathway from start to end structure. Do you think there's that much further value in mass-collecting end-structure data? Let's say we double the PDB overnight. Do you think it would actually make these models all that much better?

Sergey: I think it could. So, for example, we have a lot of structures for prokaryotic organisms, and we can now predict a lot of prokaryotic proteins, because we have lots and lots of sequences and can get coevolution from them. But I think one of the challenges has been eukaryotic organisms, where you have multiple domains and we still don't quite know how those domains come together. Often when you make a multiple sequence alignment, it will cover one domain but not the other, or cover each domain separately. And so when you give these to Alphafold, it sort of arbitrarily places the domains in 3D space. So these large multi-domain proteins are an area where we actually have no information about how the pieces come together.

Abhi: Gotcha.

Sergey: But besides that, there are also a lot of proteins that are not multi-domain but only have a few sequences. And there's some debate: are these mostly disordered, and that's why we don't have that many sequences? Or are they actually structured, or maybe only structured upon binding another protein? So I think there are still a lot of proteins it would be useful to have, especially for this problem of trying to go from single sequence to structure, where we have no evolutionary information. The more of those we can collect, the more we can create datasets to maybe start venturing into this problem.

Abhi: So if you had the choice between randomly doubling the PDB overnight versus doubling it in the areas where protein structures are either very large or suspected to be disordered, you'd prefer the latter category?

Sergey: Uh, I mean, I guess I said both, but for me personally, the more exciting part is regions that currently have really low pLDDT in Alphafold but might actually be ordered. Yeah. We think they're disordered, maybe because there's a lot of correlation between pLDDT and disorder, but I think a lot of that low pLDDT is not because of disorder, just because of the lack of MSA information. And so if someone had infinite money to give us, it'd be kind of great to say, let's go ahead and try to solve all those structures that we think are actually ordered but have really low pLDDT.

Abhi: Gotcha.

Sergey: And I mean, in some ways people used to do that back in the day with structural consortiums, saying, hey, if there are no homologs, go solve it. But I think now we can use Alphafold to quickly tell us, okay, which problems are worth putting effort into solving.

Abhi: Triage and decide these are the structures I actually want, and everything else is information that the model already kind of implicitly knows.

Sergey: Yeah. In some ways it's almost like active learning, I guess you'd say. Yeah. Predict everything in UniProt, look for everything that Alphafold failed on; those are the proteins we should be trying to solve structures for. Yeah. But that being said, I think we just have to be careful, because there are cases where the pLDDT might appear to be high, but if you start to look at the PAE matrices, the predicted aligned errors, there might be no information about how the domains come together. So even in cases where we think we know all the domains, it would be great to know how they come together. I think those are also useful problems.
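
As a toy illustration of that kind of triage, here is a hypothetical filter over per-residue pLDDT and the PAE matrix; the thresholds and the domain assignments are made-up assumptions, not values from any published pipeline.

```python
import numpy as np

def worth_solving(plddt: np.ndarray, pae: np.ndarray, domain_ids: np.ndarray,
                  plddt_cut: float = 70.0, pae_cut: float = 15.0) -> bool:
    """Flag a protein for experimental follow-up if overall confidence is low,
    or if per-domain confidence looks fine but the cross-domain PAE says we
    have no idea how the domains pack against each other."""
    if plddt.mean() < plddt_cut:
        return True
    cross_domain = domain_ids[:, None] != domain_ids[None, :]
    return bool(cross_domain.any() and pae[cross_domain].mean() > pae_cut)
```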

Abhi: Again, a little bit of a deviation from what we've been talking about, but one thing I've long been confused about is this implicit trust in what Alphafold2 is confident about and what it's not, or what it claims to be. It feels like people don't actually run into adversarial optimization all that often. Why is that? In almost every other field, if you really trust the model on one specific metric and keep optimizing against it over and over, you eventually run into adversarial edge cases, but that doesn't seem to happen with Alphafold2. Is there a strong reason you suspect why?

Sergey: Um, I think it depends on the problem. So, for example, people have shown that if you try to predict protein-peptide interactions, you get a lot of false signal: you can have a really confident PAE at the interface, but the prediction is actually completely wrong. But I think it depends how you got there. And what I mean by that is, one thing we saw recently in some of our work is that if you try to score things with Alphafold and you use a bunch of recycles, it sometimes gets confident about things that are completely wrong. But if you dial down the recycles, like go to recycle zero, you actually see it's a bit more correlated with known measurements. Because the question is: did Alphafold think they interact because there's some evolutionary information or some motif matching, or did it just reinforce itself and now it's really, really confident? Yeah. It's true that a lot of the time the confidences are pretty good. But I think the reason people do trust it is that they're often not subjecting Alphafold to truly unknown problems. What I mean by that is, often when people say Alphafold predicts something, they already know these two things interact, or they know this protein folds into a structure, and they want to know what it actually looks like, how they interact. Where it fails is when you don't know if they interact and you're asking, hey, do these two things interact? And sometimes it will give you the false impression that they do.
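
A small sketch of the recycle check he describes, under the assumption of a hypothetical `score_fn(a, b, num_recycles=...)` that returns a confidence for a candidate pair: score the same candidates at recycle zero and at a few recycles, and see which setting tracks the known measurements better.

```python
import numpy as np

def recycle_sensitivity(score_fn, candidate_pairs, measured_affinities,
                        recycle_settings=(0, 3)):
    """Score each candidate pair at different recycle counts and report how
    well each setting correlates with the known measurements."""
    correlations = {}
    for r in recycle_settings:
        scores = np.array([score_fn(a, b, num_recycles=r)
                           for a, b in candidate_pairs])
        correlations[r] = float(np.corrcoef(scores, measured_affinities)[0, 1])
    return correlations
```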

Abhi: And...

Sergey: And I think part of the reason why this happens is because Alphafold was only trained on positive data.

Abhi: I was about to mention this. Yeah, what does negative data look like in this case?

Sergey: Well, negative data would be, say you purposely take two proteins that should not interact, can you fine-tune methods like Alphafold or RoseTTAFold on that? There's a researcher, Qian Cong, at UT Southwestern, who has actually been exploring this a bit. I think they had a preprint recently where they purposely mispair things and then tell Alphafold that these should not interact. Yeah. And her group saw some success in that area, where they could potentially now do a better job at picking up what does and doesn't interact.

And I think something like that could also work in the design space, where right now, for any sequence you give Alphafold, the assumption is that it will fold into a protein. Yeah. So even if the sequence is very suboptimal, it will find a way to internally fix itself, which is great for structure prediction: you want a model that can maybe only see a few key residues and still make the right prediction. But for a design problem, you actually want it to be sensitive to point mutations. You want to say, hey, you put a, I don't know, hydrophilic residue in the core, it should not be predicted well. But it kind of assumes everything you give it is actually a valid answer.
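
A toy version of the mispairing idea for building "should not interact" examples; everything about the dataset is assumed, and a real pipeline would also want to filter out mispairings that happen to be genuine interactions.

```python
import random

def make_mispaired_negatives(positive_pairs, n_negatives, seed=0, max_tries=100_000):
    """Shuffle partners from known interacting pairs; the mispairings are very
    likely non-interactors and can be labeled 'should not interact'."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    a_list = [a for a, _ in positive_pairs]
    b_list = [b for _, b in positive_pairs]
    negatives, tries = set(), 0
    while len(negatives) < n_negatives and tries < max_tries:
        pair = (rng.choice(a_list), rng.choice(b_list))
        if pair not in positives:
            negatives.add(pair)
        tries += 1
    return list(negatives)
```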

Abhi: Yeah. That's a cool idea. Has anyone shown that... like, you mentioned this one UT Southwestern professor, but it feels like it hasn't really percolated to the field at large. Do you think long term there will be positive sets and negative sets for structural data, or is it unlikely that will be the case?

Sergey: Um, I would assume that people are going to be doing this more and more. I think for structure prediction it doesn't seem necessary, because everything is positive. That's right. But for people who do care about PPI prediction, like what's interacting with what... I mean, this is part of the reason Qian Cong was doing this, because she actually wanted to know, hey, which protein interacts with which protein? And in that case she saw a lot of false positives from just looking at the confidence metrics. And so she tried to fix that problem there.

But I would imagine it matters in other cases too, where maybe you want to say which ordered or disordered regions actually fold into a protein. For example, people do see with Alphafold 3 that if you give it regions that are completely disordered, it will still predict a bunch of helices. That's where some negative training could be useful. And to some extent the DeepMind team did do that: they took Alphafold2 predictions and used them for training, to reproduce the same extended loops and so on. But it turns out there's some kind of mode thing where if you make the protein too large, it snaps back into making helices everywhere. So that didn't completely work out the way they were hoping.

Abhi: The... I forget... there was a really good reason why they needed to use Alphafold2 predictions to bootstrap Alphafold 3.

Sergey: Well, the issue was that... so the problem with diffusion, or at least training with diffusion, is that it learns some distribution of how things tend to interact with each other. And it's only trained on ordered things, so its prior is to make everything ordered. And so one of the things was, if you start to use...

Abhi: It is unable to be not confident about anything?

Sergey: And it's still not confident. If you look at the confidence metric, it's still not confident. But I think the problem was that a lot of researchers sometimes don't even look at the confidence. They look at the structure and say, hey, if it's a giant long disordered loop, then it's probably not confident. But the problem with Alphafold 3 was, at least before they tried to correct it with Alphafold2 predictions, that every single prediction always had helices. Yeah. Anything you give it, even a completely random sequence, it'll always put in secondary structure. So then they asked, okay, how do we fix that? Well, maybe we could try to get it to do what Alphafold2 used to do: if it was not confident, it extended the region out into a loop. Yeah. And so they started training on Alphafold2 predictions to try to get around that problem. Um, it seemed a little hacky, but it seemed like...

Abhi: It feels like a bizarre design. Surely there has to be a better way to do it than just bootstrapping from...

Sergey: I'm not sure how you would do it. So I think the problem is diffusion. The problem is you need it to give you the wrong answer to tell it it's doing something wrong. Yeah. But the last step is always the right answer. With a structure module, you can have, I don't know, the loop clashing with everything, and you add a clash loss.
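
For reference, the kind of clash loss he means can be as simple as a pairwise-distance penalty on the predicted coordinates; this is a toy version on Cα atoms, with the distance cutoff just an assumed number.

```python
import numpy as np

def clash_loss(ca_coords: np.ndarray, min_dist: float = 3.5) -> float:
    """Sum of squared violations whenever two non-adjacent residues' Cα atoms
    come closer than min_dist (Å); easy to bolt onto a final predicted structure."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    n = len(ca_coords)
    nonlocal_pairs = np.triu(np.ones((n, n), dtype=bool), k=2)  # skip self/bonded neighbors
    violation = np.clip(min_dist - dist[nonlocal_pairs], 0.0, None)
    return float((violation ** 2).sum())
```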

Abhi: Yeah.

Sergey: But like how do you add a clash loss in diffusion?

Abhi: That's a fair point. Yeah. Yeah. I guess having this physics-based oracle would help things a lot.

Sergey: Yeah, yeah, yeah. Yeah. Well, it's almost like you want it to move into... so I think this is why they had to put the Alphafold2 structures in, because there you train it and say, hey, if there's no information, extend it out. And it worked to some extent, but it turns out if you make the protein too large, it just starts making helices everywhere again. Yeah. So it sort of learned to do that for small proteins, but not for very, very large ones.

Abhi: That makes sense. Yeah. And I think I am largely out of questions. Was there anything else you wanted to talk about?

Sergey: Uh, I think we're probably good.

Abhi: Okay, cool. Thank you for coming onto the show, Sergey.

Sergey: Of course. No problem. Glad I could chat.