Predicting protein structure, episode 1

Protein structure prediction is hard, but AlphaFold, an AI system, has tackled this big problem. Janet Thornton from the European Bioinformatics Institute and David Jones of University College London look forward, backward and all around on this subject.

Predicting protein structure, episode 1
A conversation with Janet Thornton and David Jones

Janet Thornton
When I heard about this, I thought, yes, somebody's actually really made significant progress. That was my first reaction. And I was delighted, because I didn't think we'd get quite this far in my lifetime.

Vivien:
That’s Dame Janet Thornton from the European Bioinformatics Institute (EBI). She is the institute's former director and has long worked on the challenges of protein structure prediction.

She has received many awards, one of which is a British order of chivalry: for her services to bioinformatics she was named Dame Commander of the Order of the British Empire.

I interviewed Dame Janet Thornton with David Jones of University College London, who has appointments both in computer science and in structural and molecular biology.

David Jones [1:00]
I’ll put my position this way: it’s extreme cautious optimism.

Vivien Marx
We talked about blobs, about models, about being compute-constrained and about what can happen when you are not compute-constrained. And I asked them about AlphaFold, the approach from DeepMind Technologies, an AI startup that Google bought in 2014: how DeepMind has changed and is changing the way protein structure is predicted, how it might change science more generally, and what AlphaFold can and cannot do.

You will hear more from Dr. Thornton and Dr. Jones in a bit. I interviewed them for a story in Nature Methods on the journal’s Method of the Year 2021, which is protein structure prediction. You can find a link to the story in the show notes. I am doing a number of podcasts on this topic, too, in this series, ‘Conversations with scientists’.

Proteins are familiar from the food we eat. Many of us, too many of us, have encountered a protein that the virus SARS-CoV-2 uses to get into cells and unleash COVID-19. It does so with harsh symptoms in people who are not vaccinated, whose immune systems cannot recognize the protein or fight the virus well.

There are many, many proteins. Our bodies contain, and this is an estimate, a few hundred thousand different proteins. It’s hard to know the number for sure; many proteins are unknown. Proteins come in many different sizes and fulfill many different functions. They differ from one another in their sequence of amino acids, the biochemical units of which they are built. And proteins have all sorts of helices, sheets and folds. They are complicated three-dimensional structures that move.

Predicting a protein's structure from its amino acid sequence is an entire scientific field unto itself, called protein structure prediction.

It’s not a new field, but it’s the 2021 Method of the Year because of what AlphaFold can do, which is: it can yield the 3D structure of proteins from a given amino acid sequence. There’s a dedicated European Bioinformatics Institute AlphaFold database that is filling up rapidly. But isn’t there already a protein database, the Protein Data Bank, the PDB? Indeed there is. And I spoke with one of the founders of the PDB about AlphaFold in a separate podcast.

Back to Janet Thornton and David Jones. Every two years there’s a competition, the Critical Assessment of protein Structure Prediction, or CASP for short. It’s a competition in which scientists show what their software systems can do, how well they can computationally predict the 3D structure of a protein from a sequence of amino acids.

At CASP13 in 2018, AlphaFold did very well indeed. At CASP14 in 2020, it blew the academic competition out of the water. My interviewees talk about this, and you will hear more in this podcast and others.

Dr. Thornton and Dr. Jones have known each other a long time. In fact, David Jones did his PhD research with Janet Thornton. I interviewed them together, and they had a lot to say to one another on this subject of AlphaFold. This podcast has stretches in which they talk to one another, and I was happy to just listen. But on occasion I did throw in a question or two.

I asked them what it was like for them when they first saw what AlphaFold can do.

Janet Thornton [4:30]
I'll go first. I'll be the positive one, I think. (laughs)

David Jones [4:21]
I think we saw it at a different time. You saw it before I did. And in a different context.

Janet Thornton
Yeah, I heard about it before it was published and saw some of the data, because I was peripherally involved as an independent observer. And I was just delighted that progress had finally been made in a rather, well, quite striking way. I’d been used to seeing the CASP results, and I knew about the improvements. Accuracy had sort of gradually gone up, but actually rather slowly over many years, and it began to gather pace, I think, with the introduction of machine learning, probably in 2016, something like that.

But this seemed to me to be a really clear-cut improvement for the first time, and one team had made significantly more progress than the other teams, which again is unusual, because usually the increase happened across the board.

Vivien
And it’s a community competition.

Janet
Everybody is very close to each other because they all read each other's papers and they all know what's going on. And so it's quite a tight community of researchers who are, most of them, I have to say, really very clever people.

People who do this, in general, are not second-rate scientists. They are the best of the best. And I'm not including myself in that, I hasten to add. These are people who have really been fighting with this problem for a long time. And so seeing this success was really very encouraging to me. Now, I've been aware of the problem, I guess, for more than 40 years. When I started, there were just 20 structures, and so we kind of took them apart and dealt with them.

So we looked at specific aspects of protein structure, whether alpha helices or beta turns or tertiary structures, and tried to somehow make sense of those by teasing apart the key factors that determine structure from sequence. But I realized sometime in the 1990s that I didn't have anything more to bring to the field, because I didn't have any good ideas about what to do next. And many students would come to me and say, I want to work on protein folding, including David, actually.

But he was the exception, in that he was full of new ideas, whereas most students didn't have a clue about how they might improve this situation. And so I very happily went off and started looking at functions and various other things, and didn't really go to the CASP meetings. But every year, or every other year, I would get an update from David: well, what happened? How did it go? Who won? Because that's always a question people ask, although it's not supposed to be a competition. But nevertheless: how's it all going?

And then when I heard about this, I thought, yes, somebody's actually really made significant progress. That was my first reaction. And I was delighted, because I didn't think we'd get quite this far in my lifetime. I'd sort of given up a little bit. But nevertheless, I started to think, well, what's the impact of this going to be? Of course, the fact that it's a commercial company rather than an academic group was in a way quite disappointing, because I know the academics and I would have very much wanted them to succeed.

But one also realized that this company had access to a lot of compute, and access to people who were real experts in machine learning, at the forefront of that field. And so, really, I was just pleased overall. And I thought this would have a beneficial effect on the field.

Now, perhaps I should hand over to David at this point, and he can give his more on-the-ground view. My view is very much high-level and doesn't go into the details. David obviously knows all the nitty-gritty details.

David Jones
I look at it, I guess, more as someone who was working in that field. I should say I worked with DeepMind on the first iteration of AlphaFold, so I wasn't surprised. I would have put very good money on DeepMind winning the next CASP experiment. That wasn't a surprise. I didn't think anyone in the field had caught up at that point, so it wasn't surprising that they were at the top.

Obviously, the gap, the jump between the second-place group and the first-place group, was a bit of a shock, I'll be honest with you. And you go through different emotions. First of all, it wasn't that I was surprised they won; the amount of improvement was the surprise. Initially, it was such a big jump. And this is something I have been racking my brains over, as we all have as a field: I expected there to be something in what they did that we just hadn't thought of doing.

I just thought they'd found a new source of information or some new trick. Because a few years back, and it shouldn't be forgotten, I guess the biggest jump that happened methodologically was looking at co-variation in sequences: looking at multiple sequence alignments of related sequences, and then looking at correlations between different positions. And that's still part of AlphaFold2. I mean, that is still core to what it does. Without that, it can't work, and it doesn't work. That's quite clear from the data that they've shown and other people have shown, so it's still dependent on this information.
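
To make the co-variation idea concrete, here is a minimal sketch in Python of how one can score correlated columns in a multiple sequence alignment with mutual information. The toy alignment and positions are invented for illustration, and this is not DeepMind's or any CASP group's actual code; real contact predictors add corrections, such as average-product correction, or use direct-coupling analysis rather than raw mutual information.

```python
# Sketch: co-variation between columns of a multiple sequence alignment
# (MSA), scored with mutual information (MI). Illustrative only.
import math
from collections import Counter

msa = [          # toy alignment: one aligned sequence per row
    "MKVLA",
    "MKILA",
    "MRVLG",
    "MRILG",
]

def column(i):
    """All residues observed at alignment position i."""
    return [seq[i] for seq in msa]

def entropy(symbols):
    """Shannon entropy (bits) of the observed symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(i, j):
    """MI(i, j) = H(i) + H(j) - H(i, j). High MI means positions i and j
    co-vary across the family, a hint that they may touch in 3D."""
    joint = list(zip(column(i), column(j)))
    return entropy(column(i)) + entropy(column(j)) - entropy(joint)

# Positions 1 and 4 co-vary perfectly in this toy MSA (K with A, R with G),
# while positions 1 and 2 vary independently.
print(mutual_information(1, 4))  # 1.0 bit
print(mutual_information(1, 2))  # 0.0 bits
```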

And a number of groups, Chris Sander and Debbie Marks, my group, and a few others, were involved in that ten years ago. It's amazing that it's ten years ago now. And it was clear that that was the direction a working solution would come from, that it was worth pushing hard along that line. I don't want to take credit and say it's all down to me. One of the reasons I thought it was interesting to work with DeepMind was that they didn't know about this work themselves.

And one of the things we discussed when I worked with them was these new developments. One of the reasons for doing it was that I felt this approach could probably be pushed a lot further than the academics had pushed it. I mean, we tried our best, but I did feel we were compute-constrained, and I went to them saying, ‘Well, it'll be really interesting to know what you'd be able to do here if you weren't constrained by compute. What would happen then?’

So I think that's the background, I suppose. It wasn't surprising that they would push it forward. What I hadn't realized, I guess, was that essentially no new information was needed to get that level of improvement. And that was a shock. I'm still processing that a bit, really. Ultimately, AlphaFold2 is doing the same thing that everyone else was doing, but in a new kind of frame, a new way of representing that information and allowing it to mix. The key is that you're bringing everything into play to get the solution that you want.

And I guess none of us had really appreciated how powerful that was. And that was a shock. We saw these great results, we all went into shock, but we all went away saying, ‘They haven't told us something there.’ There must be something they're holding back, some clever trick they've come up with, a bit like the covariation that was worked on ten years before.

We thought they'd found a new source of information that we hadn't seen before to make that huge leap. But it's turned out, as far as we know, that that doesn't exist. It's still basically just doing what we were already doing, but so much better, in a more principled way, I guess, in a more uniform way. It's bringing all the information together and weighting it correctly to make these predictions. And so that's taken some recalibration of my thinking.

We didn't need anything else. We could have done it. So, in theory, academics could have done it. It's not that the expertise in machine learning wasn't available, but we certainly didn't have the compute power to go through the number of experiments needed to get it to work the way they've done. And that's not to denigrate what they've done. It's just the difference between the way academia works, one postdoc on a project with the resources you can get from one grant, and what you can do when you can leverage both the skills and the compute, and both are important, to tackle all parts of the problem at the same time.

And I don't mean that to be negative. It's just the difference in the way things work. I suspect we would have eventually got to a similar solution in academia, but I think it would have taken ten years. It may have been a slightly different solution, but I think the same machine learning would have come through, and by then we would have had faster computers, so we could have done it with the resources we've got at the universities. But that's just a guess. Maybe it would never have happened. I don't know.

Janet Thornton [15:25]
But I also think, David, they did a very thorough job, in that they improved, as far as I can tell, homology modeling and the accuracy of side-chain placement, and so on.

David Jones
Yeah, I'll be honest with you. I've been looking at it, and I'd say it's unclear at the moment how true that is, because we've been doing lots of experiments with AlphaFold2 ourselves, and the side-chain aspects don't seem to be as important. And that was my guess: my original guess was that by dealing with the side-chain atoms in a very accurate way, they were able to get this extra accuracy. All I can say at the moment is, I'm not convinced. It may be true, but we're seeing some results that suggest it's not as easy as that.

That was my best guess, that it comes from having the side chains there. But it turns out, for example, that you can get AlphaFold2 to build a very good model with all the residues set to tryptophan. So it's not just the side-chain packing, though that would have been my guess. And I'm not saying it isn't that; it's early days, we've only had the program in our hands for a short time. It's not as crystal clear as I perhaps thought it was. And you obviously thought it was too, I think, but I may be wrong on that.

Janet Thornton [17:00]
But I think we can say that the accuracy, however they got it, and I don't really understand how they got it, seems to be better even for those homology models.

David Jones
Yes.

Janet Thornton
And that's really what's important in terms of using those models, for example, with tomography or EM data.

David Jones
Absolutely, I wasn't disagreeing with that point; it’s better across the board. But I'll give you the alternative possibility, which is that it's doing homology modeling, but at an ultra-fine level. In other words, it’s not just taking a homologous structure, building the side chains on and then moving the loops around. It’s taking little pieces of everything it needs from the whole of the PDB and building, well, the worst jigsaw puzzle in history, made up of tiny little pieces. And that’s kind of what I feel is at the heart of it.

The side chains, and the trouble is, everything helps. It’s hard to eliminate any part of it. I initially thought that the only thing we hadn't been doing was the side-chain modeling, and now I'm not so sure that's really all there is to it. I think it's more than the…

Janet Thornton
The side-chain modeling is no good without getting the backbone right.

David Jones
That's true.

Janet Thornton
You can say they are all so intimately intertwined. And that, to me, is what this all-in-together approach has given us: everything is helping everything else.

David Jones
I'm looking at it from too fine a level, though. For example, it still makes mistakes in side chains that need to be fixed by another program. So it’s not as if it has a perfect idea of what the side chains look like.

Janet Thornton
I have to say, David, in my head I think of it as an infinitely clever cut-and-paste algorithm.

David Jones
Again, that's not denigrating it.

Janet Thornton
It's not.

David Jones
It's probably the most accurate description you can make of it.

Janet Thornton
I'm sure that's sort of what it’s managed to do. That's why it's so good at all sorts of data. It can always find those details.

David Jones
I think we all kind of knew that a bit, because people had often said that in homology modeling the pieces are there.

Janet Thornton
Yes

David Jones
You can always find a piece of a protein to model every piece of every other protein, essentially. And the problem was that we didn't have any way of… We tried, as you know.

Janet Thornton
Finding the right bits.

David Jones
That’s an idea we had back when I was doing my PhD: taking fragments of proteins and…

Janet Thornton
I know

David Jones
and joining them together. And that sort of thing had an impact early on, through David Baker's work. We kicked it off, but then David produced Rosetta, and that was a big change. But that was built on our idea of what small fragments of proteins look like. What AlphaFold can do is think in higher dimensions: it can not just say this linear piece needs to be added to this linear piece, it can model interactions between disparate pieces, and it can solve a hyper-dimensional jigsaw puzzle. You like jigsaw puzzles, in your…

Janet Thornton
It’s like a fractal jigsaw puzzle

David Jones
We'll go with that. We'll go with a fractal jigsaw puzzle.

Vivien:
A fractal jigsaw puzzle? Janet Thornton and David Jones were having a grand time explaining their thoughts and blowing my mind. As they spoke and went back and forth, I wondered how we might illustrate this story: what the heck does a fractal jigsaw puzzle even look like? I asked them how to imagine this mathematical concept.

David Jones [20:50]
Mathematically, it's not fractional dimensions, I suppose.

Janet Thornton
No

David Jones
It’s just higher dimensions, really. But it's solving the problem not just in 2D or 3D. In fact, in homology modeling we solve it in 2D, really: we just look at the alignment, we just get linear pieces to fit together. And then, going further, you try to model it in 3D. But AlphaFold is representing the problem in higher dimensions and then bringing it back to 3D at the end. And I think that's why it solved the problem: in that sense, it’s finding the right pieces amongst all of the high-dimensional space that it represents.

That's why machine learning was so important for this, because that's what machine learning does. It finds higher-dimensional representations of the data that you're processing, and without that, I think it would be very difficult to solve the problem that way, or that well.

Vivien
David Jones consulted for DeepMind as a temporary contractor. He recalls how, before he even started this relationship, he discussed the field of protein structure prediction more generally with the DeepMind team.

David Jones [21:50]
I was called in after they’d already thought that maybe this was something they could look at. So I think they hadn't planned it all out. To be honest with you, and I’m not giving any secrets away here, it wasn’t clear from the beginning whether they could do anything. That was quite clear. And quite a lot of my work with them was really about exploring that, to see what could be done.

Vivien
The AlphaFold team participated in CASP13 in 2018. And the team won. The margin by which the team won at CASP14 in 2020 was much larger. Here’s David Jones from University College London.

David Jones [22:25]
I was hired as a consultant to work on this. But we had a long conversation before I signed anything, just discussing the field in general. And I did say to them, to be honest with you, my original impression was that anybody with a lot of compute power could do better in this area; compute was a limitation. Obviously, I hadn't at that point factored in how much you get from the machine learning aspects of it. I've been using machine learning, as Janet knows, for a long time, right through my PhD days.

At that point, I perhaps hadn't caught up with the state of the art of that field, and I hadn't factored in how much further it had gone beyond the machine learning that we were doing.

And I said to DeepMind at the time, even before we discussed anything: their ideas in protein structure were not quite there, weren’t up to date, but the machine learning was way ahead of anything that we were doing, at least that I was doing in bioinformatics. I can’t speak for my colleagues in AI and machine learning; they may have known all about this already. But in bioinformatics, I felt that the machine learning they were telling me about was of a higher order than anything we were doing. I think we’ve caught up a bit now, generally. I think the field has now embraced this.

Janet Thornton [23:45]
Part of that is due to AlphaFold, actually; it’s really made people appreciate the power of the new machine learning.

Vivien
An important aspect that drove the CASP14 jump was the way AlphaFold tackled the challenge of protein structure prediction. David Jones and Janet Thornton talk about this next.
In 2017, Google Brain scientists presented their approach at the Conference on Neural Information Processing Systems and published it in a paper called ‘Attention is all you need.’ It plays a role in the big jump that AlphaFold2 made at CASP14.

David Jones [24:25]
The big difference was this idea of using language modeling, specifically the self-attention models that have been used to do amazing stuff in natural language processing. At that point, the bioinformaticians hadn't got hold of that technology. But to be honest, in fairness to us, it didn't really exist at the beginning. So this famous paper that came from Google, “Attention is all you need”, has had a huge impact, not just on AlphaFold2, but on just about everything.

Now, in machine learning, the only thing people really want to discuss is some new variation of attention models. And so it’s not the case that we weren’t up to date; there wasn't anything to be up to date on, really, in fairness, at the beginning of the previous CASP experiments. This is not making excuses for us in academia: this technical development happened, not quite, because nothing ever quite fits, but to a large extent, in the span between CASP13 and CASP14; that’s when the excitement in computer science in this area took off.

Although, as far as I can tell from DeepMind's own papers, they were working on this before, probably before CASP13 happened. It’s not like we were completely oblivious to something long established: no one was doing it two years before CASP13, including DeepMind. It just wasn’t on the map. So in that sense, it’s been a rapid flash of technology development that happened all at the same time.

So it's a perfect storm, if you like, and I’m just trying to defend us bioinformaticians here. It looked as if we'd all spent 30 years staring at the wall doing nothing, and DeepMind came along and had known all about this for years. But it literally wasn’t on the map back in 2016. No one was doing this, even in machine learning.

And it's only now, and it’s amazing what happens in four or five years, that it's the only thing people can think of or do, both in AI and machine learning, to be honest, and also in bioinformatics and protein structure. Now everyone's doing it, because it's so powerful.

It’s basically the ultimate generalization of a neural network, and this is why it's so effective here. People call this inductive bias: how much the kind of equations you're using in the machine learning biases the predictions you make. If you make assumptions about your data that are inherent in your model, it means you can only produce outputs that fit those constraints. Before, we were all using convolutional nets, which are very good at images, at 2D things. And so obviously everything they predicted was 2D, and then we turned it into 3D.

But these attention models essentially work at such a fine-grained level that they can mix data across everything you feed in, in any way necessary to solve the problem. And without those constraints on the model, the neural networks are so much more powerful. So that’s been a big jump, and we’ve had to learn a lot quickly. As my colleagues in computer science can confirm, it's not something that was known for ten years before, with us bioinformaticians only now catching up. This really was relatively new technology. It certainly became apparent within the last five years, that's for sure.
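
For listeners who want to see the mechanics, here is a minimal NumPy sketch of the scaled dot-product self-attention that the ‘Attention is all you need’ paper introduced. The sizes and random weights are illustrative, and this is of course nothing like AlphaFold2's actual implementation; it only shows the weak-inductive-bias property David Jones describes, in which every position can mix information with every other position.

```python
# Sketch of scaled dot-product self-attention. Unlike a convolution,
# which only mixes neighboring positions, each output row here is a
# data-dependent weighted mix of ALL input rows.
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (n_positions, d_model) array, e.g. one row per residue."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])       # all-pairs similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v  # every output attends to every input position

rng = np.random.default_rng(0)
n, d = 8, 16                        # say, 8 residues with 16 features each
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # (8, 16)
```

The point of the sketch: nothing in these equations assumes locality, a grid, or 2D structure, which is the weak inductive bias that lets attention-based models mix data in any way necessary to solve the problem.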

Vivien
One issue that comes up in discussions with scientists about AlphaFold is: is it science or is it engineering? The discussion about what AlphaFold does becomes intense, and a bit philosophical, in a way, too. Here’s Janet Thornton first; David Jones joins in on this subject of science and engineering later.

Janet Thornton [28:30]
What it does is predict protein structure coordinates, and that's all it does. It doesn't tell us anything about the protein-folding pathway. It isn't able to fold up a protein as nature does, where you have one sequence and it folds up. So in that sense, it's not inherently a scientific approach; it just learns from what it's seen. But in my mind, a lot of science is exactly that. We see things and we learn from them, and we use that to make predictions.

Weather forecasting is a perfect example. And gradually, things like Coulomb's law and all the electrostatics and the physics of protein folding are, in a way, captured in this neural net. It doesn't answer many questions, and in many ways it is a technology. But for me, looking at data and abstracting knowledge from that data is part of what science is about, and it takes different forms. This is a high-level technical form of exactly that.

Vivien
David Jones had to step away from the conversation for a moment, so he didn’t hear what Janet Thornton said about the question: is AlphaFold science or is it engineering? Spoiler alert here. He… you know what, no spoilers. Here’s David Jones on that question.

David Jones [38:20]
Well, it's an argument that's always tricky. I have a joint appointment where I work: I work in both an engineering faculty and a science faculty. So I have to be careful that I don't say the wrong thing to the wrong people. The thing is, I think what's becoming apparent is that you can’t do the science without the engineering. And you can't really say this is all engineering or this is all science, because if the science is wrong, it doesn't matter how good your engineering is.

And the science can be right, but if you have bad engineering, you're not going to get the right answer. So I do see both sides. And I don’t think AlphaFold2 would have appeared if they were just good engineers, if it was just down to engineering, without some scientific input from the various people who contributed. But it's a difficult one, because engineering makes things a reality, and the science builds the foundations on which that happens.

So nothing useful happens without engineering. But if the science isn't done, there's nothing to engineer, no underlying thing to build on. I can't separate them in my head, really. AlphaFold2 wouldn't have existed without what DeepMind calls research engineers. They have a very good number of them, and they're excellent people, whose job is to engineer the solutions and find the best way, from a highly technical perspective, of realizing whatever the scientists come up with. But they don't treat them as mere technicians.

Because in science, we do have this divide between technicians and academics and researchers. They had a very nice way of looking at it: we’re a team trying to solve this problem, and the scientists do this bit of it, the engineers do this bit of it. Sometimes they cross over, and I think that's the key. I think ‘engineering’ is a label that gets misused as kind of a pejorative.

People say: it's just engineering. But it's amazingly hard to do really good engineering. I still write my own code, and I'm still in awe of people who engineer large software systems that work and are reliable. It's incredibly hard.

I mean, it's hard to say engineering is the hard part and the science isn't; they're just different. Science is more inspiration-based, more ideas-based. You have a flash of inspiration, but it doesn't help to have a great idea if you can't realize it in a form that actually solves a problem. That's why you need both. So I'm sitting on the fence completely on that.

Janet Thornton [32:55]
So did I, David, but with a slightly different rationale.

Vivien
As I mentioned earlier in the podcast, David Jones completed his PhD research with Janet Thornton. And this phase came up as they talked about science and engineering.

Janet Thornton
David is totally exceptional in this respect, I find. When you have many students, you get all sorts. You get people who are brilliant programmers, and David, he might not think he is, because he knows better than I do, but to me, he was always one of the best programmers and implementers, an engineer producing things that work and that people could use. On the other side, you have people who have all these ideas, and it never comes to anything useful, because they don't do the engineering. In my mind, David is both.

David Jones [33:50]
Nice of you to say it. The trouble is, I've kind of met my match these days, because what's changed is that the scale of the engineering is so much larger. I think I now understand most of how AlphaFold works, but the difficulty of putting it all together and actually having a sensible result come out at the end is almost unimaginable, even to me.

I do still program myself, but obviously now they've told you how to do it. To have been the first people to put all those bits together, that's the bit I understand least: how many iterations of the model there were, and at what point did it work? We heard from Demis Hassabis that it got worse at some point; they went backwards for quite a long time, and then it started going. And that's the kind of process I think would be very interesting to know about. I wasn't working with them then.

I would love to have been working with them, just to see that process happening and to see how that great result emerged. Because it must have emerged somehow. I don't know whether it emerged gradually, every day one percent better, or whether it suddenly shot up, like a stock price, a sudden massive improvement. The shape of that engineering process that arrived at that result is something I would be very interested to know more about. I hope they tell us that story, perhaps in talks or whatever, in the future.

Vivien
This process, the making-of, and the AlphaFold system itself might hold lessons for other areas of biology, such as neuroscience, and for chemistry and other scientific fields, too. David Jones explains his observation that many are now looking for, as he calls it, their “AlphaFold2 moment.”

David Jones [35:55]
We'd like to know what the process is. We may not be able to do it on such a large scale, but I think the framework, the kind of philosophy of what they've done, could be applied, and I've spoken to Janet about this many times, to so many other problems in biology and chemistry and medicine and other areas. It's not always going to be so easy, because you don't always have the actual datasets available, and things like that.

There are lots of reasons why it can't always be done; you may not even have a metric to tell you whether one result is better than another. That's another factor. But I do think this way of engineering a solution, one that has the power of taking everything into account and then just optimizing the final benchmark number, whatever it is, is a very powerful concept. And I think we're all going to be thinking along those lines for everything. I think everyone is now looking for the ‘AlphaFold2 moment’ in their favorite area of biology, and probably in other areas of science as well.

It will happen, I think, but not all the time. I don't think every problem is as well posed; we don’t have the metrics and we don’t have the datasets available. But if and when that happens, this will, I think, offer a way of similarly shooting the performance upwards. In areas like medicine, there's a lot of work to be done first.

Both getting the datasets sorted out to the right level of quality, and deciding what the actual thing is that you're trying to improve. If you're trying to diagnose something, it's hard to put that into a formula, into a neural network, to optimize. So just saying, I want to treat a patient: what do you mean, numerically? What is better treatment for this patient? Because then you force the person to put it in terms of one number.

I want to know one number that says this drug is better than that drug, or this treatment is better than that treatment, and that's hard to do in many areas of science, and in biology in particular, I think.

Vivien
One aspect I brought up with Janet Thornton and David Jones is how they are telling their students and trainees about AlphaFold. How will they integrate this into their teaching and mentoring? What do they think is most important about understanding this method?

David Jones [38:25]
A good question. I'm starting to plan my teaching for next year now, and I have made the decision: I am going to try to teach our biology students how AlphaFold2 works, at some level. But that’s not going to be at the level of them being able to implement it. We all use tools. I use an iPhone; I couldn't tell you how to build one, or a mass spectrometer, or whatever. You understand the basic principles, but you couldn't solve all the engineering and technical problems to actually get one to work.

Lots of things we use, we don't have to understand at that level; the task is working out what that level is for the tool to be useful in your work. So how much do they need to know? I think Janet probably has something to say on this, but I think it's understanding the limitations of the results and the data. That’s what I feel is the most important thing to get across: enough of an understanding of the method that you know when it won’t work, how well it's going to work, and how much confidence you can have in it.

And that, I think, is going to be very interesting. The work the EBI has done making the data available is the first step, because it shows everyone a very global picture of what the range of data is. This is for AlphaFold2, but of course you could say the same for any area of science, I guess.

Vivien:
Talking about how they are going to teach others about AlphaFold brought them back to memories of how proteins are, for some, not for them mind you, but for some: proteins are just blobs.

David Jones [39:55]
What is a protein structure? That is the first question, I think. And that sounds very rude to say to a lot of biologists. I had a colleague early in my career, when I was working on structural bioinformatics, when I got my first independent job, who just said, ‘I don't need to know anything about protein structure. As far as I'm concerned, they're just blobs that do things and they stick to other blobs. And that’s all I need to know.’ It was hard to argue back from that.

And I don't think that's the opinion of every biologist in the world, but it is an issue, I think. Janet, please weigh in on this. I suppose that understanding goes back to my PhD with her: looking at protein structures and trying to make sense of them, of what they tell you about biology and mechanism. I think AlphaFold is telling us how much progress we still need to make in that area. If you ask me, that's the bottleneck now.

Janet Thornton [41:00]
I wouldn't want us to not mention the impact this will have on experimental structure determination, especially in cryo-electron microscopy and in tomography: being able to look in a cell, see blobs, and then work out which blobs they are, whether it's a ribosome or something else. At the moment, the resolution is, well, infinitely higher than it was a few years ago, but it's still relatively low. I really think that is going to change, and these models really can be used to help identify what a particular blob is.

And that's already happening: over the last few years, most crystallographers have become electron microscopists, and they’re looking at bigger complexes, bigger assemblies, or doing electron tomography. For this, it’s a bit like image recognition: you get a blob and you say, well, which protein is this, can I fit these coordinates into this blob, and does it fit my data? There’s already a lot of evidence that that’s quite powerful. So to me, that is the very first impact, because of course it's the structural biologists who are most interested in these structures.

David Jones [42:45]
Oh, absolutely. What I think is useful about that is this: in structural biology, I would argue, too much time is spent worrying about the technology. With all these models and these improvements, it’s time for us to go back and ask, ‘Why are we solving these structures? What can we get out of them?’ I think this has sharpened the focus on that. And it harks back to my early days, I guess, when it was clear in our heads why we were doing it. Then it all got distracted by other things: other technologies, increasing the volume of data, trying to solve more structures.

And ultimately, it still comes down to the fact that once these blobs become realized in 3D, and you can actually see the atoms, you've then got the problem of: now what do you do with it? That is what keeps me thinking at night: what are we going to do with all this stuff now? I think that's the thing.

Janet Thornton [43:50]
So in fact, going back to a bit of history: when David was a consultant, the DeepMind people came up to see me at the EBI. Do you remember?

David Jones
I do

Janet Thornton
And all they wanted to know was: if we could do this, what would you use the structures for? That was their only question. And we went through the different things you could do, which haven't changed, frankly. They are still mostly quite difficult, and they don't fall into this nicely defined problem area. And in many parts of biology, actually, a blob is enough.

Vivien Marx
For some scientific questions, a blob is enough; in other areas, higher resolution is needed. Janet Thornton points out how having the human proteome is making her think differently, and that may happen to others, too.

With AlphaFold here, it’s not as if all the work on proteins is done. I asked what priorities they see next for studying proteins.

Janet Thornton [48:55]
The obvious stuff that everybody has said. Now it’s looking at protein-protein interactions, at protein-DNA interactions, at protein-small molecule interactions. It’s drug design. It’s looking at how these big complexes form and how they operate. How enzymes are distributed in the cell is also a very interesting question. So having, if you like, a complete proteome, I think, makes you think differently. Think about when we had the complete genome.

I know when the human genome paper came out, I thought, wow, this is amazing, because back in the day I never thought that would happen anytime soon at all. And once it came out, I read the paper. The paper wasn’t interesting to me, because it didn't cover any of the things I was interested in. But it made me think, wow, human beings really only have 20,000 proteins. That is amazing.

And flies have 15,000 or 16,000. It's just amazing. And, of course, the complexity comes through all these interactions, and trying to understand those is quite difficult if you don't have protein structures, actually, because the interactions are all driven by the proteins.

David Jones
It's not always easy when you have them. (laughs)

Janet Thornton
It's difficult if you've got them; it's impossible if you don’t. But as I said while David was dealing with his lawn, it doesn’t answer the protein-folding problem, how these things fold up. It doesn’t answer the flexibility question. I’ve always thought, we know these structures, even in the crystals, are very, very flexible. And yet these structures do explain a lot of what proteins do. If you look at a catalytic site, even though you know it’s still flexible, it still gives you insight into what’s going on.

And that, to me, has always been a surprise. Rather than saying, oh, crystallography only gives you snapshots and it's irrelevant: to me, the amazing thing is that these snapshots are really informative in terms of the biology. It's just amazing.

David Jones [47:30]
Because you can actually control the experiment. You can do 'what if' experiments: what happens if you crystallize it with this other chain, or what happens if you crystallize it with this ligand bound? And that's what machine learning struggles with at the moment. It gives you the best average guess as to what the protein looks like, based on sequence information alone and what it’s seen in the PDB. But it can’t tackle the 'what if' experiments that you really want to do in biology to make sense of it.

Janet Thornton
And it can't deal with the variants.

David Jones
That's another factor indeed.

Janet Thornton
I mean, it can tell you, oh, this is close to the core of the protein, or the active site, or whatever. But it can't actually predict whether a variant is going to be benign or not.

David Jones [48:30]
Indeed. It just takes one mutation and the protein won't fold; that's all it needs. It's borderline stability. Essentially, the wrong mutation in the wrong place, and the protein simply won't fold. But AlphaFold won't see that, because it'll just see one small change in the set of amino acid letters used as input. And it doesn’t change its thinking at all about what the protein looks like. It just goes: that looks like the proteins I’ve seen before. But that particular position in that particular protein is unknown to it, because it hasn't seen all the variants.

It hasn't been trained on all the variants. It’s never seen an unfolded protein in training. It’s never been shown a protein that doesn't fold; it only sees the ones that do. And so it has no idea how changes affect the crucial aspect of protein stability, which is kind of what it all comes down to.

Janet Thornton
It's all down to delta delta G.

David Jones
It all comes down to that.

Janet Thornton
Very sadly, free energy.

Vivien
Delta delta G is essentially an energy difference between a protein and a version of that protein that has a mutation. The altered protein is slightly different, so there's an energy difference between the wild-type protein and the altered one. Alterations in proteins play a role in diseases and disorders, which brings Janet Thornton to the clinical realm and the role AlphaFold might play there.
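
In symbols, and as a rough sketch only, since sign conventions differ across the literature, the quantity looks like this:

```latex
% Folding free energy of one protein: folded state minus unfolded state.
\Delta G_{\mathrm{fold}} = G_{\mathrm{folded}} - G_{\mathrm{unfolded}}
% Delta delta G: how a mutation shifts that folding free energy.
\Delta\Delta G = \Delta G_{\mathrm{fold}}^{\mathrm{mutant}} - \Delta G_{\mathrm{fold}}^{\mathrm{wild\;type}}
```

Under this convention, a stable protein has a negative folding free energy, and a mutation with a positive delta delta G destabilizes the fold, the borderline-stability effect David Jones just described.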

Janet Thornton [48:45]
I would just like to come back to the clinical side. I agree with David that the excitement in the AI field has been building for the last five years, and people are aware of it, with facial recognition and voice recognition, all these things. We all know about those. But to have it, and I think this is what Demis wanted, to have it solve a long-standing scientific problem, to show its power. And I think that's what they've done, by very cleverly choosing a problem that is amenable to machine learning, which I think this is.

But to the biological and medical world, this begins to say: look what you can do with machine learning, look how powerful it is. And whilst I agree that diseases are infinitely complex, protein structures are pretty complex too, actually. And if we have the data, in the right forms, really.

I think in the medical field, it does come down to having the data organized properly, making it accessible while keeping it safe. And then, to me, the opportunities in that field for helping clinicians do their work even better, not just from their own personal experience but with all this data having been mined for what it can tell them about the likely trajectory of somebody's disease, are immense. I think that will be one of the main outcomes from AlphaFold, actually: the appreciation of the power of this approach to answer deeply complicated problems.

David Jones [52:50]
I wouldn't disagree, I guess; it's always good to offer the opposite view a little bit. Not to what Janet said, which I absolutely agree with. But I think it may turn out that the protein-folding problem, well, it's not really the protein-folding problem but the protein-modeling problem that they’ve tackled here, is probably the lowest-hanging fruit of the scientific challenges it could have been applied to. I mean, that makes it sound like, well, why didn't you do it before, right away.

It had everything going for it in many respects. And I think Janet is absolutely right that it shows what could be done, which is always important: having the confidence to go and try something is absolutely essential. But I still worry that a lot of time could be wasted on areas that are just not ready for machine learning yet, where people have got to slow down and…

Janet Thornton
gather the data.

David Jones
You know, data is the biggest problem. There have been papers coming out looking, for example, at the various diagnostic efforts with AI and COVID, looking at various imaging techniques. And it turned out that very little came out of that work, because the data just wasn’t up to doing the job. It was contaminated in various ways and too limited: too-small datasets, groups that couldn’t exchange data. So there were lots of little studies.

It’s the usual thing that goes on in medical research. So there are lots of problems harder to solve than the protein-folding problem, in the philosophy of how the research is done and how the data are handled.

But Janet has had first-hand experience of that, with ELIXIR and things like that. There's a lot of work there that I think gets underestimated, and I'm sure Janet would agree with that.

Janet Thornton [53:30]
I certainly agree with that.

Vivien
ELIXIR is the electronic exchange…

Janet Thornton
ELIXIR is a pan-European infrastructure for biological data. Getting that established meant dealing with scientists in 25 or however many different countries and getting them to agree on things. And these things are very difficult. For medical data it’s even more difficult, in part because of the structure of medical research, but also because of the need to keep the data secure; being able to handle that is going to be really important. But I really do think that there are enormous opportunities for new students to build on this in all sorts of ways.

David Jones
I agree

Janet Thornton
So, I don't know. It gives me a spring in my step when I think about it.

David Jones
I'll put my position down as, I would say, extreme cautious optimism.

Janet Thornton
It will take time.

David Jones
It will take time. One thing that worries me a little at the moment is that things are going too quickly, that too many things are happening at once. It’s whipped rather than…

Janet Thornton
Stepped. Yeah.

David Jones
I've joked to my group that I want to take two years off, just sitting on a desert island somewhere, to take stock of what's happened and to think about where we go next, because I just don't feel there’s time at the moment to even assess what’s happened.

Janet Thornton [55:30]
That's a completely different conversation that we don't have time for now. But I think there's a lot to be said for being careful and going slowly in using this new technology. But it has power.

David Jones
When used correctly, it has a lot of power, and the trick is to use it correctly and to get the right data to feed it. It's the old joke in computer science: garbage in, garbage out.

Janet Thornton [56:05]
And to ask the right questions. Absolutely. What you realize is that science is about asking the questions that are feasible and doable. That’s the art of good, timely science, and that's what we need in this case.

David Jones
That sounds absolutely fantastic. I couldn’t say anything better than that, Janet; you've summarized the current situation perfectly. Science is fun.

Janet Thornton
And it's fun to talk about. We've had some good conversations about it.

David Jones
Doing it is sometimes a bit of a pain, but it is fun to sit down and talk about it. We’ve missed that over the last couple of years, for obvious reasons. With AlphaFold, everything's happened on Zoom calls.

Two or three years ago, we would have had probably three or four conferences on it, sat around discussing it and throwing ideas around. We’re going back to that, hopefully; we’ve missed that a bit. So it's all felt like we've been in limbo a bit, with a big breakthrough and without the chance to get hold of it and discuss it and argue about it and all the things that we do.

I take my science very seriously, but that doesn't mean you can't have fun doing it. It's one of those things: you have some fun, but you take the actual business of delivering reliable results seriously, and that’s what it will come down to. That's what Janet taught me. I’ve hopefully carried the torch since then.

Janet Thornton
Indeed, David.

Vivien:
That was Conversations with scientists.
Today’s guests were Dame Janet Thornton from the European Bioinformatics Institute and Dr. David Jones of University College London.

Dr. Thornton is the former director of the European Bioinformatics Institute. Among her awards is one for her services to bioinformatics: she has been named Dame Commander of the Order of the British Empire.

Dr. Jones is on the faculty of University College London, with appointments both in computer science and in structural and molecular biology.

And I just want to add, because there’s sometimes confusion about these things: the European Bioinformatics Institute and University College London didn’t pay to be in this podcast.

This is independent journalism, produced by me in my living room. I’m Vivien Marx. Thanks for listening.