Predicting protein structure, episode 3

A conversation with some members of the Rost lab at the Technical University of Munich: Dr. Maria Littmann, postdoctoral fellow; PhD students Konstantin Weissenow and Michael Heinzinger; and Dr. Burkhard Rost, principal investigator. (Art: J. Jackson)

Maria Littmann
In December, I was still thinking, okay, so there's a big company doing this now, but they will do it behind closed doors, and we will never really figure out what they are doing. So it's probably not going to be that relevant for us. So it's going to be a big headline in the news and everybody's going to talk about it. But we still have to find a solution that works fast and that we can actually use in academia in our context. And so that's why I would actually say that for me, the big moment regarding AlphaFold2 was when they announced that they are publishing those structures for UniProt, and they actually release the human proteome because that's something that I never expected.

Vivien
That’s Dr. Maria Littmann, a postdoctoral fellow in the lab of Burkhard Rost at the Technical University of Munich. Dr. Littmann works on ways to predict which protein residues, the building blocks of a protein, bind DNA, metals or small molecules. She uses amino acid sequences to make these predictions. More generally here, she is talking about AlphaFold2. AlphaFold is a computational approach from DeepMind Technologies that has changed the way, and the speed at which, protein structures can be predicted.

Protein structure prediction is the Nature Methods Method of the Year for 2021 because of the stir this approach is causing. You can find a bundle of commentaries on the Nature Methods site about how deep learning, especially AlphaFold, is shaping protein structure prediction, and structural biology more generally, and maybe even biology itself.

And I have a story there, too, for which I spoke to a number of scientists including Dr. Littmann; Dr. Burkhard Rost, the principal investigator; and PhD students in the lab, Konstantin Weissenow and Michael Heinzinger. This podcast series is a way to share a bit more of what I have heard.

It’s a challenging problem to predict three-dimensional protein structure from amino acid sequence. Many academic labs have been working on this for a long time, including Dr. Rost and his team.

AlphaFold has entered this space with a computational approach that has been trained on the datasets in the Protein Data Bank. AlphaFold predicts protein structure using machine learning. And there are other platforms that also handle protein structure prediction this way, such as RoseTTAFold.

Basic researchers who mainly look at genomes might not be terribly interested in protein structure prediction. But many other scientists are. Here’s Burkhard Rost, who used to be at Columbia University before moving to Munich. Just saying that because he mentions his institute in New York.

Burkhard Rost [2:30]
Making structures. Whenever a drug is developed, you need a structure. You need structural biologists to be involved in it, except for mRNA. That's a different story. But for the regular drug development, this is absolutely true. But when you do cell biology, genetics, you have heard about structures.
You have some ideas about structures. In New York, at my institute, I saw that there are many people who did amazing things. We never really understood what the structure means, and they never really cared. And that continues to be the case. So in that sense, to what extent will these models now change biology at large?

You have to really see how you can use structure in your research. If your focus is the genome, all you want to know is: is it an important mutation, yes or no? You don't care about the molecular biology or the functional, the mechanistic aspect of it. But this is a step that will now happen. So, making structures. And the first people for whom this will be relevant are structural biologists.

Vivien
AlphaFold is of interest to structural biologists and it will take a moment to see how the larger scientific community will use the many predicted protein structures.

Scientists use machine learning to predict all sorts of things in many different areas. If you are not working with proteins, say, you are a computer scientist applying machine learning to something entirely different, it might be hard to appreciate the excitement about this new capacity to predict protein structures.

Burkhard Rost has an example from one of his former students, Guy Yachdav, who used Game of Thrones in his teaching about machine learning in biology.

Burkhard Rost
One of my students, Guy Yachdav, had a great idea. He had to teach JavaScript, some programming language, web and machine learning to students. And he felt the cool thing is: I have biology.

So I'm going to have a big group of students. And indeed, he had a sign-up of 100 students. And then he got this course. And the students were absolutely frustrated. At the end of one year, they never understood biology because it's way too complex for computer scientists. They want to have clear data. And what they learned is nothing is clear, nothing is known. So they were just frustrated. They never understood the machine learning, never understood the database, JavaScript, nothing. They were only frustrated. So Guy thought, why not try a problem that is easier.

And he came up with: why don't I predict who is going to die next in Game of Thrones? And that was an incredible splash.

By the way, there were two predictions, and they were both successful. A few years later, we did not predict who is going to die next but who’s going to survive; that was in the last season. So the first one was before the fourth season.

I believe so. We predicted that Snow, I don't know whether you know Game of Thrones, that Snow would not be dead. He seemingly died in the season before, and the method predicted that he wouldn't be dead. It came out before the season. So we had no background. And for me, had I done that, I would have been such a loser because at that point, I'd never seen Game of Thrones, right.

Vivien
AlphaFold has made some noise in the way it has entered the protein structure prediction field. It was developed by the AI startup DeepMind Technologies, which was bought by Google, and so it’s part of Alphabet.

In 2018, AlphaFold1 did well in the Critical Assessment of Protein Structure Prediction (CASP), a competition in which scientists test how well their methods do in computationally predicting protein structures. And in 2020 AlphaFold2 was far better than any other platform competing at CASP.

There’s now a database, the AlphaFold database hosted at the European Bioinformatics Institute, that is filling up with computationally predicted protein structures for the research community to use.

You can hear more about AlphaFold, what it does and what it cannot do, in other podcasts in this series. I spoke with Dr. Janet Thornton of the European Bioinformatics Institute and Dr. David Jones of University College London. And with Dr. Helen Berman, the former director of the Protein Data Bank and current co-architect of the next chapter of the Protein Data Bank.

Back to Burkhard Rost and some of the scientists in his lab. I asked them what AlphaFold means to them and how it will shape their careers. I am going to start out with some thoughts Burkhard Rost has about what AlphaFold does.

AlphaFold combines evolutionary information about proteins with machine learning. And it’s a principle he has worked on, too.

Burkhard Rost [4:10]
Essentially, the idea of combining evolutionary information with machine learning, that is, in fact, something that I introduced into the literature. In fact, for secondary structure prediction, I was the first to sort of realize it.

And that ultimately is the breakthrough of AlphaFold on a grander scale, with bigger machines, with bigger evolutionary information, with bigger machine learning, that is now AI. So that's the big step. But the prediction requires the alignment. Without an alignment, so without this evolutionary information, you couldn't do it. That's where the information is that they use. And that's where a lot of the smartness is: how to use it better. Smartness is one word, but they actually could not have done it five years ago. The machines were too small, the AI was too small and the evolutionary information was too little. So a few years ago, we just did not have enough sequences. As simple as that.

Vivien
For him, this evolutionary information is not needed for predicting protein structure. In this field, what some labs, including his, use is a technique from the computational processing of natural language, or NLP.

Burkhard Rost [7:10]
So the idea behind it is now from NLP. So when you do speech recognition these days, you have these automated machines. T5, you may have heard, is one thing from Google. And essentially the way they do that is they make the machine learn sentences. So they feed it with Wikipedia, or bigger parts of sentences.

And then essentially the machine extracts grammar, right. That is what is happening now. We have used the same. We are only one group, but we are one of the groups who has used the same concept for the protein language. So in goes a protein sequence, out comes a protein sequence; in between, there's a lot of computation going on, really a lot of computation. But the point is that we learn from these sequences. The machine in between learns the grammar of protein languages. You build models that are language models of proteins.
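
To make the analogy concrete, here is a minimal sketch of the masked-language-modeling idea Rost describes, applied to amino acid sequences. The toy sequence, the 15% masking rate and the helper function are illustrative assumptions, not the lab's actual setup; real protein language models are large neural networks trained on huge sequence databases.

```python
# A minimal sketch of masked language modeling on protein sequences:
# hide some residues, and the training target is to recover them.
# The sequence, mask rate and helper are illustrative, not the lab's code.
import random

def mask_sequence(seq, mask_token="X", mask_rate=0.15, rng=random):
    """Hide a fraction of residues; a model is trained to predict them back."""
    tokens = list(seq)
    targets = {}  # position -> original residue the model must recover
    for i, aa in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = aa
            tokens[i] = mask_token
    return "".join(tokens), targets

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # a made-up example sequence
masked, targets = mask_sequence(seq)
print("input :", masked)
print("labels:", targets)
```

The training signal is simply to recover the hidden residues; whatever statistical regularities make that possible are the "grammar" the machine extracts from billions of sequences.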


Vivien
One aspect that helped the DeepMind Technologies team jump way ahead of others at CASP14 was the way AlphaFold2 tackled the challenge of protein structure prediction.

In 2017, Google Brain scientists presented at the Conference on Neural Information Processing Systems work that was later published as a paper called ‘Attention is all you need.’ The concept is part of the big jump that AlphaFold2 took at CASP14. Attention is a computational concept that, in this case, tells the computer what to pay attention to when it is taking a sequence of amino acids and predicting what three-dimensional protein structure that sequence will become.

Burkhard Rost [9:50]
Ultimately, what happens now is that the machine somehow has learned the grammar of the sequence. Right? Grammar of the sequence. What it means is context. So if I'm sitting on residue number 42 and that's an alanine, and at 81 there is also an alanine, they have different environments.

And this machine has learned these environments. Now, where the word attention comes into play here is: by learning the environment, I learn essentially, for 42, I see the entire protein around it, right. It is a sequence. I have no idea about the three-dimensional object. That's not what they learned. They just learned this thing. But then they sort of tend to put more emphasis on what is in the immediate environment. And that's where attention comes in, crucially, because attention now allows you to essentially not only put all that information in, but to teach the system that 84 is more important than 69. Yes, you're in an environment, and everything somehow about the environment is important. But put your attention on certain parts of the environment. So that's the attention mechanism.
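
For readers who want to see the mechanics, here is a minimal numpy sketch of scaled dot-product attention, the operation at the heart of ‘Attention is all you need.’ The dimensions and random embeddings are placeholders; in a structure predictor, each row would represent one residue, and the learned weights would decide, as Rost puts it, whether position 84 matters more than 69 for position 42.

```python
# A minimal numpy sketch of scaled dot-product attention.
# Embeddings are random placeholders; each row stands in for one residue.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Each residue's output is a weighted mix of all residues' values;
    weights[i, j] says how much attention residue i pays to residue j."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (L, L) pairwise relevance
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

L, d = 5, 8                               # 5 residues, 8-dim embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(L, d))               # stand-in residue embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = attention(x @ Wq, x @ Wk, x @ Wv)
print(np.round(w, 2))  # row i: how much residue i attends to each residue
```

Each output row is a weighted average over all residues, with weights learned from data rather than fixed by distance along the sequence.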

Vivien
When a machine has had as input experimental protein data from the Protein Data Bank, it can extrapolate the grammar of amino acid sequences.

Burkhard Rost [11:10]
The story gets cool, and we are not really understanding what is happening. But the point is, historically, I tended to argue that there is something like evolutionary constraints in these alignments that Google uses for AlphaFold. Evolutionary constraints means you learn that at residue number 42, certain exchanges are possible. For the same alanine at some other position, the exchanges are different. And this exchange pattern that you have, that is what Google uses; that's sort of what I introduced for secondary structure prediction with a method that then was called PhD.

Longer story, why it was called PhD. And then later it was called Prof. It is quite obvious why it was called Prof: by then I had arrived at Columbia, and obviously my role had changed. But anyway, the idea is these methods used exactly this sort of exchange pattern. And the reality is the machine can extract which exchange pattern from this alignment, from this evolutionary environment, is important and which is not. Okay. Now, this story is slightly different. If I just learn sequences, how do I know that this thing learns anything about 3D?

How do I know this thing learns anything about function, anything about families? In fact, it doesn't. Nevertheless, we can predict conservation. Predicting conservation means: at a particular position, 41, will I always observe alanine or not? That is written in that sequence without having ever taught a family, without ever having presented all these related proteins. How did that thing learn that? We actually don't know. To what extent did the thing learn that? I'm saying these systems, and that's a big fight I have with David Jones, for instance.

I've been saying for more than a year, actually: David says they have learned evolutionary information. I say they haven't, because they were never taught that way.

But more and more I'm also running around and telling people: they never learned evolution, but we see evolutionary aspects. So my reasoning is, because they essentially have learned, let's say, physical constraints, and evolutionary constraints and physical constraints are the same thing at the end of the day. It's just some constraint on what your protein can do. And by exploring this vast space of two billion protein sequences, you sort of extract that information into these very big machines. By the way, again, this is something that two years ago we could not have done.

Vivien
Every two years there is a CASP competition, the Critical Assessment of Protein Structure Prediction. AlphaFold was the surprise at these competitions, especially at CASP13 in 2018 and CASP14 in 2020.

Burkhard Rost [14:10]
Ever since CASP1, there never was a winner of CASP who had not been somehow in the field. So whoever won, somehow, even at CASP1, those who sort of won were in the field for some time. And so there was a tradition: you can only be a good person in this field, a successful person in this field, if you really know what you're doing.

And many sort of learned, so became successful. Many started as outsiders and sort of got into it and from CASP to CASP improved every two years. Fast forward to CASP13. That was the first time that anybody was a sort of winner who came from the outside.

They were hit, the organizers were hit; it was a truck that hit them or a freight train, I don't know what it was. And in fact, the other reality is that at CASP13, no matter what the shock was, AlphaFold1 was not clearly hands-down the best method, right? For every single protein in CASP13, there was another method that was better. It was incredible because it was done merely by AI, or to some extent largely by a team of people. It’s not quite true that they were outsiders, because David, for instance, was involved in that. So they had inside information. They were not really outsiders, but they were way more outsiders than any other previous winner.

Vivien
Burkhard Rost himself has not been involved with CASP for a few years, but his lab has had projects that they sent into the competition, and they work in areas connected to predicting protein structure. There are Konstantin Weissenow and Michael Heinzinger.

Burkhard Rost [16:00]
I've been out of CASP for many years. So the first time that we really actually do structure prediction, in some sense, is through Konstantin, through Michael, over the last few years, two years, to be more precise.

Vivien
Was there EMBER? I was looking at the abstracts.

Burkhard
There was EMBER, but EMBER is just from this year.

Vivien
The difference between CASP13 and CASP14 was quite substantial.

Burkhard Rost
I do not believe that anybody at the point of CASP13 predicted CASP14 to be the way it was. The event at CASP14 was again another freight train.

And the reality is, there are worlds between AlphaFold1 and AlphaFold2. AlphaFold1 is an amazingly accurate prediction. It's incredible that machine learning can do that. But they essentially did it by muscle, by the intelligence of using AI and by a lot of muscle.

AlphaFold2 is a completely different product. It's full of interesting ideas. There's a lot of expertise in there. These are not people who have never done structural biology. There are many intelligent ideas. So this is a perfect example for joining the best of both worlds.

The problem that you talk about is: now. So now AlphaFold2 has solved the protein structure prediction problem, full stop. Now that that has happened, the next issue comes up. Not only what do I do next, but also: what is Google going to do next? What is DeepMind going to do next? And what should I stay out of, because they're going to hit me in that one?

Vivien
Konstantin Weissenow is a PhD student in the Rost lab. He has been working on protein structure prediction software.

Konstantin Weissenow [17:45]
So I remember back when CASP13 hit. This was in 2018, when I was in the middle of my master’s thesis. Actually, I was not a PhD student at that time. And basically December was the time when the results were published. So this was right in the middle of my master’s thesis, and I was working in exactly the same direction. So I was using deep learning to predict, in this case, contacts and distances.

I was, of course, very impressed, but it didn't really have a huge impact on me. Firstly, because I was still in my master’s thesis; I didn't quite worry about a potential PhD at that point, at least. And also because, as Burkhard mentioned, it was a different system than AlphaFold2.

AlphaFold1 was still quite traditional in its approach. So it was certainly possible, from the information they gave out even before the paper, to try to reverse engineer the system and have one's own take on it. This is essentially what I tried to do. So I tried to adapt some things during my master’s thesis, in my project, to incorporate ideas that AlphaFold had put out there. But really the big impact came with CASP14 and with AlphaFold2, when I was already a PhD student working in that area.

And then there was this huge difference between AlphaFold2 and the rest of the field, which wasn't quite so pronounced at CASP13. Right. It was certainly the winner, but not with such a big difference. CASP14 was a different story. And this really discouraged any further investment into the same area, where we would be competing with AlphaFold2 directly.

Vivien
At the time of CASP14 DeepMind Technologies was not yet making the software code public.

Konstantin Weissenow
Absolutely. And people tried immediately to figure out how to actually reconstruct the whole thing. So there was an initiative going on of several CASP participants to try to organize and communicate and figure out if they could actually, well, reverse engineer the whole system. And I don't think it came that far, at least not up to the point where DeepMind actually published material.

Vivien
Michael Heinzinger, also a PhD student and a member of the lab, remembers CASP14 rather vividly.

Michael Heinzinger [20:10]
So I was more or less at the very end of wrapping up the story about protein language models, which was, for us, the major thing at this time. But I remember pretty precisely the moment where I was sitting in front of this live stream of CASP14.

And even though I personally was not that much involved in this competition, it was rather Konstantin's work, the point where all those experimentalists actually said: okay, hey, we have here a system that is actually competitive with our experiments, this was the point where I also thought, okay, this is now a point in time that people might actually read about in history books years or decades after this point. Because this was the first time, really, that an AI system was competitive with 3D experiments at something as sophisticated as protein fold detection. And this was just mind-blowing.

Konstantin Weissenow
This kind of paradigm shift that we saw at CASP14, when experimentalists suddenly started to use prediction systems, AlphaFold2 specifically, to actually get to a proper model for their experiments. So this was kind of not happening before, right? This was a novelty.

Vivien
Some say that academia might eventually have gotten there, to a fast way to predict many protein structures. And another aspect that has come up in my interviews is that perhaps the biggest advantage of AlphaFold2 was that the team had access to massive compute power.

Here’s Michael Heinzinger on these points, first on the idea that academia would have gotten to this point.

Michael Heinzinger [21:50]
Sure, at one point, it was just a matter of time. I also read these discussions that we as academia would also have reached this point, surely. But there were estimates that we would have reached this point in five to ten years. I mean, it's difficult to say, it's some type of prediction, but I do not think that it's unrealistic. It would have just taken longer. And I have to very strongly disagree with the statement that it's just about compute. This is so much of an oversimplification of what these people achieved. Sorry. Strongly disagree.

Burkhard Rost
There's so much novelty in AlphaFold2, not in AlphaFold1. And again, as Konstantin pointed out: AlphaFold1 you could re-engineer; AlphaFold2 was just too full of really cool new ideas that nobody had ever done like that. But here comes the next point. Nobody could ever have done that without the compute power of Google. We would have never thought in that direction. All we try for a typical CASP is to speed up the alignment generation; for decades, nobody would have ever… So we used 200 million sequences.

Nobody would have dared 2.5 billion sequences. None of us would have ever dared a genetic algorithm, because we try to get things done, and a genetic algorithm means 100 times slower. Now, they did not only use a genetic algorithm, they iterated over it. So they did that 10, 30, 40 times. Nobody would have ever come up with that idea, even if we had somehow dreamt it. The reality is we are all confined to the space in which we can move. And this was a space so far away from us.
It is a lot about computing time, but there's much more to it. With computing time alone, we would not have been able to do it. And without the computing time, we could also not have done it.

Vivien
For Burkhard Rost, what the DeepMind Technologies team managed with AlphaFold2 is an achievement he dreamed about right at the beginning of his career.

Burkhard Rost
I’ve been working on methods that sort of do something in that direction for three decades. I got to this because I'm somehow the one who sort of invented this idea of evolutionary information, but they have just taken it to another level. I mean, this has been sort of my shtick for 30 years. So I've thought a lot. I've never, never gone anywhere near what these guys have done in that respect.

Totally impressed, but totally, totally impressed. I'm utterly impressed by what they did. So when I started 30 years ago, I said that what I want to do is what they did. Very soon I realized the path is further away and I will never get there. And I changed my direction. And that is, in fact, how I got out of CASP. When I came to Columbia, my mission changed from trying to predict structure to: whatever can I do that will be visible, that will impact the large genomes.

That was the time, around 2000, at which we sequenced entire organisms. And my feeling was, I want to make statements that are relevant for entire proteomes, entire genomes, and not individual proteins. And that sort of changed my thing. But had I worked on it for 20 years, I would not have come anywhere closer to where we were in 2016. And I'm utterly impressed.

Vivien
AlphaFold has impressed. It interrupts work underway, and possibly careers; it changes viewpoints about what is possible.

Burkhard Rost [25:25]
Now let's sort of fast forward to December 2020. Big surprise. But we get into March 2021, and there still is this feeling: yes, they got there, really impressive. But this costs $100,000 per prediction, per protein structure. This costs as much as the experiment. Yes, they sort of help the experimentalists, but it's an equal sort of pay. And then we sort of still feel: if we could do that faster, we can still do things that Google cannot do. And that is the story of Konstantin. Konstantin was preparing in July to put a paper out there and, for the first time ever in history, predict the structure for every single protein in human.

And boom, we got scooped again. This was just before Google published this entire proteome for human and many others. And over the course of the next year, they will essentially publish every single structure prediction for the entire UniProt, 250 million, right. That's a collaboration with the EBI, and that's just outstanding. Now, I believe from here on, the issue really is: what can we do with it? Now the issue is: what can we do on alternative tasks? How can we use that? How useful is it?

But again, I want to make another point. Our group has not been hit, for one, because we were not really completely into CASP; we were into other things that are still not solved. But two, because of the kinds of things that we do. Currently, Google does not use language models. Yes, at some point they will use language models. But currently the level that they reach is reached by throwing a lot of things at it. Konstantin is predicting the entire human proteome with a machine under his desk, where he's sitting, and that machine runs for eight days.

And then you get essentially the entire thing. That's incredible. Let's put it differently. Our shtick now has become: I invented the combination of machine learning and evolutionary information; I now want to uninvent it. So let's just do it without. Let's do the same thing without. Google's AlphaFold just perfected that idea; let's now see what we can do without it.

(Note: Feb 14, 2022   
John Jumper, senior staff scientist at DeepMind, reached out: "An AlphaFold prediction doesn't cost $100,000 per prediction, but orders of magnitude less." He doesn't have a specific number but estimates the cost is "a few dollars for one protein." It depends on the protein, infrastructure used and other aspects. And it's "much less for larger batches of proteins due to caching etc." 
Burkhard Rost explains in reaction to this comment that in this podcast he is talking about the $100,000 related to the development and training cost of AlphaFold2, including the use of TPUs, not the phase in which AlphaFold2 is applied; for example, as in the paper "Highly accurate protein structure prediction with AlphaFold," the pipeline has been engineered to use GPUs.
Here are snippets from the paper:
"Inferencing large proteins can easily exceed the memory of a single GPU. For a V100 with 16 GB of memory, we can predict the structure of proteins up to around 1,300 residues without ensembling and the 256- and 384-residue inference times are using the memory of a single GPU. The memory usage is approximately quadratic in the number of residues, so a 2,500-residue protein involves using unified memory so that we can greatly exceed the memory of a single V100. In our cloud setup, a single V100 is used for computation on a 2,500-residue protein but we requested four GPUs to have sufficient memory."
and here:
"We train the model on Tensor Processing Unit (TPU) v3 with a batch size of 1 per TPU core, hence the model uses 128 TPU v3 cores. The model is trained until convergence (around 10 million samples) and further fine-tuned using longer crops of 384 residues, larger MSA stack and reduced learning rate (see Supplementary Methods 1.11 for the exact configuration). The initial training stage takes approximately 1 week, and the fine-tuning stage takes approximately 4 additional days."
Please stay tuned, I am finding out more on this. --Vivien)

Vivien
Maria Littmann, a postdoctoral fellow in the lab, is keen on exploring what she can do with the many protein structures now being predicted by AlphaFold. At the time of the interview, Maria Littmann had just begun her fellowship.

Maria Littmann [28:00]
I’m officially a postdoc because I defended my PhD thesis in June. I’m in a more complicated situation scientifically, because I’m eight months pregnant. I will be stepping down from my scientific work and focusing on my family.

That’s also why, currently, I'm not that much affected by AlphaFold2, because for all of the things to come, at least for the next couple of months, I won't be there.

But I think afterwards it's going to be very interesting to see how we can actually, as Burkhard just mentioned, integrate the availability of these models into our work and extend what we've been doing for the past years.

Vivien
Although she was not working directly on protein structure prediction, she, too, had a CASP14 moment of surprise.

Maria Littmann [28:50]
My CASP14 moment was similar to Michael's and Konstantin's. I was, of course, hit when they published those results, and I just saw this figure that AlphaFold2 is beating everybody by magnitudes and they actually solved the protein folding problem. But in December, I was still thinking, okay, so there's a big company doing this now, but they will do it behind closed doors, and we will never really figure out what they are doing.

So it's probably not going to be that relevant for us. So it's going to be a big headline in the news and everybody's going to talk about it. But we still have to find a solution that works fast and that we can actually use in academia in our context. And so that's why I would actually say that for me, the big moment regarding AlphaFold2 was when they announced that they are publishing those structures for UniProt, and they actually release the human proteome because that's something that I never expected.

So I always assumed that they are trying to earn a lot of money with that and that they will only give out these structures if you pay a lot of money and that we will never have access to it, and that we have to rely on other methods to predict structures. And when they actually announced that they are going to make it publicly available, I think that for me, it changed because that means that this whole ‘okay, we don't have high resolution structures for most proteins’ is just not true anymore. So we could now actually revisit whatever we have done using sequences and actually say, ‘okay, can we improve if we put a structure on top of it?’

Burkhard Rost
What Maria predicts, the input is only sequence, right? She predicts: here, my left hand is a binding residue and my right hand is a binding residue. It becomes a binding site when in 3D they come together, and that part, from sequence only, she cannot predict. She says: this is a DNA binding residue, this is a DNA binding residue. Is it the same DNA binding site? And for that she needs a model. So in that sense, her method would benefit from that. And that is something that we are going to look at.

Maria Littmann
I work on the prediction of binding residues, and it actually focuses on residues because, if you only use the sequence, you don't know what the actual binding site looks like; you need the structure for that. And now it's basically the point where I think it will be very interesting to see if you can actually move from binding residues to binding sites, and then also actually have the possibility, in a structure, to identify that there's more than one binding site. Because currently we can just say, okay, it's binding.

But we don't know if it's binding multiple things, for example. And I think that's now possible with AlphaFold2 and, in general, with just making structures available for large proteomes. I mean, also with methods like the one from Konstantin. So we are at this point where we can actually predict structures on a large scale and can use those for our methods. And I think that's something that I didn't consider when I started my PhD, for example, four years ago.
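
To make the distinction between binding residues and binding sites concrete: once a 3D model is available, residues predicted to bind can be grouped by spatial proximity. Here is a minimal sketch of that idea; the coordinates, the 8 Å cutoff and the toy predictions are illustrative assumptions, not the lab's actual pipeline.

```python
# A minimal sketch of the residue-to-site idea: given per-residue binding
# predictions (from sequence) plus a 3D model (e.g. from AlphaFold2),
# group residues that are close in space into candidate binding sites.
# The coordinates, cutoff and toy predictions are illustrative only.
import numpy as np

def group_into_sites(binding_idx, coords, cutoff=8.0):
    """Cluster predicted binding residues whose C-alpha atoms lie within
    `cutoff` angstroms of each other (single-linkage, via BFS)."""
    sites, unassigned = [], set(binding_idx)
    while unassigned:
        seed = unassigned.pop()
        site, queue = {seed}, [seed]
        while queue:
            i = queue.pop()
            near = [j for j in unassigned
                    if np.linalg.norm(coords[i] - coords[j]) < cutoff]
            for j in near:
                unassigned.remove(j)
                site.add(j)
                queue.append(j)
        sites.append(sorted(site))
    return sites

rng = np.random.default_rng(1)
coords = rng.uniform(0, 60, size=(100, 3))   # fake C-alpha coordinates
predicted_binding = [3, 7, 12, 40, 44, 90]   # fake per-residue predictions
print(group_into_sites(predicted_binding, coords))
```

Residues that are far apart in the sequence can end up in the same site if the fold brings them together, which is exactly what sequence-only methods cannot see.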

Vivien
AlphaFold is shifting things for many labs and for many individual researchers. Here’s Michael Heinzinger.

Michael Heinzinger [31:40]
Regarding the change of plans based on AlphaFold2: yes, absolutely. This changed a lot, in the sense that suddenly we have a second source of high-quality labels available that we could potentially use for transfer learning. For example, this is what my work with the protein language models is mostly focused on. And we mostly tapped into this extremely rich source of information of these large, unlabeled protein sequence databases. And this worked only so well because we just have so many unlabeled sequences available. But suddenly we have a second, orthogonal source of information available with 3D structures that we could potentially combine with the first source of information.

And I think one of the most important next steps is how to best utilize this wealth of structure information that we now have at hand. I mean, this database of the EBI and AlphaFold2 keeps growing. You can watch how it grows into the millions. And at one point it will rather be a question of how we best utilize it for something like ligand prediction. Because having these structures is amazing. It's nice to look at, but without any labels, it's very useful only for very few experts. And also, these experts usually then focus on only very few proteins. But in order to facilitate the analysis of more proteins, and make it also more easily available to maybe non-experts that are not trained for decades, you need labels. Overlaying these amazing 3D structures of AlphaFold2 with some other sort of predictions makes the analysis just a lot easier.

Vivien
Labels means the electrostatic information, the atomic information?

Michael Heinzinger
For example, DNA binding. So, I mean, otherwise you're just sitting in front of this 3D thingy and you can turn it and wiggle it. It's nice, and you can color it according to some alpha helix or beta sheet. But that's then mostly it. In order to know whether there is some ligand binding, you need some other source of label. This is what I was referring to with 'label'. Ligand binding is just the easiest example.
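
One common way to overlay such labels on a predicted structure, offered here as an illustrative sketch rather than the lab's actual tooling, is to write a per-residue score into the B-factor column of a PDB file so that any structure viewer can color the model by it. The coordinates and scores below are placeholders.

```python
# A common trick for 'overlaying' per-residue predictions on a 3D model:
# write the prediction score into the B-factor column of a PDB file, then
# color by B-factor in a structure viewer. Coordinates and scores below
# are placeholders; real input would come from a predicted model.
def write_ca_trace(path, coords, scores, chain="A"):
    """Write a C-alpha-only PDB file with prediction scores as B-factors."""
    with open(path, "w") as fh:
        for i, ((x, y, z), s) in enumerate(zip(coords, scores), start=1):
            fh.write(
                f"ATOM  {i:5d}  CA  ALA {chain}{i:4d}    "
                f"{x:8.3f}{y:8.3f}{z:8.3f}  1.00{s:6.2f}           C\n"
            )
        fh.write("END\n")

coords = [(float(i), 0.0, 0.0) for i in range(10)]            # toy backbone
scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.0, 0.7, 0.9, 0.3, 0.1]   # e.g. binding
write_ca_trace("labeled_model.pdb", coords, scores)
```

Opening labeled_model.pdb in a viewer and coloring by B-factor then shows, say, predicted ligand-binding residues highlighted on the 3D model.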

Burkhard Rost
Let me jump in, in terms of language, Vivien. I've been in this field for 30 years, and within a year, I lose the language. And I don’t understand what these guys are talking about.

This AI is developing so quickly. There is a new generation; Michael is sort of in between the two camps. So Michael, at this point, is one of the few who can communicate with both of them. But these NLP, natural language processing, people use so many different words. So we tried to--one of our manuscripts has been published in an AI journal, an IEEE journal. And it's almost impossible. One reviewer was from computational biology, one reviewer was from NLP, and one was from AI. And each one of them complained about the language because they didn't understand what we talk about. So this word label is just one of those things. Yes, you have an experimental annotation for some feature.

Vivien
Konstantin Weissenow developed software called EMBER that had the capability to predict the structure of the entire human proteome. Here’s Konstantin Weissenow:

Konstantin Weissenow [35:10]
In our case, we do not host many predicted structures. What we have is the set of predictions for the human proteome, which I did in July, where we were beaten by AlphaFold2 by about two weeks.

It wasn't quite that interesting to really put it out on a larger scale. So we just host it on our infrastructure. And apart from that, we just had the method available. So anyone who wants to run it based on just single-sequence input can just do it. You can do this on a normal PC.
So if you just do it for a handful of proteins, it's so fast it doesn't really need to be pre-computed and put anywhere.

Burkhard Rost
So the point again is, first of all, we have been scooped. And if you are scooped, there's no point trying to make your stuff, which is less good, available to people. Second point: the main angle that we have, the main advantage that we have, is speed. And that speed is exactly what Konstantin said. Anybody can run it when they have a design issue: what would happen if I changed that particular residue? What's the strongest possible effect? And for that, you don't need a database. You don't need the storage of these predictions. What you essentially need is the tool in your hands. And that is the case.

Vivien
Google has a lot of compute resources. And the DeepMind team used, and uses, those resources. The estimate from some groups, at least at the beginning, was that each predicted protein structure cost around $100,000. It’s an estimate.

Burkhard Rost [36:30]
That was the estimate. Google never stood up and said: this is what it costs. And it is also complicated because, with Google, how can you translate that to the cost of a machine that you can buy?

So Google has particular processors. They're called TPUs, as opposed to CPUs or GPUs, and only Google has them. And for some of the stuff that Michael is doing, we absolutely need TPUs. So our success would not be possible without the TPUs from Google, and we cannot buy them. You cannot just say: I ordered a TPU, put it into my lab. Impossible, right. So we cannot really say what the price is; at the time, until December, the 100 number was an estimate. I predict that they will get to the same level with much less energy, with much less; ultimately, they will take it from us and run, use fewer computers and get the same performance.

Vivien
You cannot buy these TPUs, the Google processors, but you can use them on Google Cloud. And there are versions of AlphaFold that have made predicting protein structures a little less compute-intensive. Here’s Michael Heinzinger:

Michael Heinzinger [37:45]
I think what they published during CASP was a lot more compute-intensive. There they actually ran predictions for ten different models; they ran an ensemble of models and then merged their predictions, the outcome of those models. What is currently used by the EBI to make this available at large scale is a single model, as far as I know. So they realized that the benefit of running ten models in parallel was minor, given the additional compute you would need. So I think there is also a difference between what they had to spend back then and what they are spending right now.

Maria Littmann
So I just wanted to add that I completely agree with Michael: with this large database of structures available, at some point I think our prediction methods become much more important, because we then have a lot of unlabeled structures, and it might be interesting just to say, okay, we can now predict functional aspects that we don't know for those structures. But in the end that is what we are interested in, because we want to understand how proteins work.

And I think then combining those predicted structures with our predictions is probably a great opportunity, especially for the work that our lab has been doing. And also one more generic, positive statement about AlphaFold2. For me, it was always inspiring that something like this is actually possible at all. So I think more from a personal perspective: if you tell people I'm a computational biologist, it's often like, what are you doing? And you say, we're trying to predict things that others can do in an experiment much better.

And then everybody asks: yeah, do we actually need that? And now we can actually show that what we are trying to achieve is possible, that we can actually predict something, even if it's not really us but currently just Google. We can basically prove to people that computational biology can actually achieve something that is really helpful, and it's not just happening in academia and interesting to study; it's actually something that could impact science in the bigger picture.

And I think that's something that I learned from the success of AlphaFold2: that something we never thought would be possible was possible, in an instant. And I think that's something that we can hope might be possible for other parts of computational biology. And I think that's something that is really cool.

Burkhard Rost
Let me give you one aspect of what is relevant from Maria's perspective. She said that; I just want to sort of translate it. And Michael implied that. So Maria works on a method that essentially predicts binding residues for nucleotides, for metals and for small molecules. What she doesn't do is sugars. What she doesn't do is proteins, but some fraction of all the binding of proteins is included in that, right. Now, she worked on her method for a few years, and then she finished it.

And then she wanted to see what new data had been added over the last two years with respect to experiments on binding residues, for any organism, not just human, any organism out there: two years of experiments, any organism. And then we always have some redundant information we have to throw out. And at the end of the day, for how many new proteins did we get information? What do you believe? Nonredundant, so really unique, that we can actually test our method on?

Vivien
A lot

Burkhard Rost
Give me a number

Vivien
20,000

Burkhard Rost
46.
Vivien
46, not thousand?

Burkhard Rost
46, not thousand, not million. 46. That is the yield, after sort of removing duplicates, of two years of experiments: the entire world, any experiment, any organism. And that ultimately is exactly the point that she made. From these models, her kind of approaches benefit, because there's so little out there, and anything that she can get is on top of what Google does. So she's totally happy.

Vivien
Some scientists are happy about AlphaFold, some have been scooped. AlphaFold has surprised researchers in many ways.

And that was today’s episode of Conversations with Scientists. Today’s episode was with Dr. Burkhard Rost from the Technical University of Munich, with Dr. Maria Littmann, a postdoctoral fellow in his lab, and with Konstantin Weissenow and Michael Heinzinger, who are PhD students in the lab.

And I just wanted to add, because there’s confusion about these things sometimes: the Technical University of Munich did not pay to be in this podcast. This is independent journalism produced by me in my living room. I’m Vivien Marx, thanks for listening.