Predicting protein structure, episode 2

Data about proteins has a home in the Protein Data Bank (PDB), which holds structural data for over 180,000 proteins. Now, with AlphaFold from DeepMind Technologies, an Alphabet company, there's an EBI-AlphaFold database for structures generated computationally. What does this mean for the Protein Data Bank? And how does the PDB relate to AlphaFold? What's next? Helen Berman, co-founder of the PDB and co-architect of its next chapter, shares her thoughts about the future and the past.

A conversation with Helen Berman

Helen Berman
AlphaFold is a triumph. And I will make the following analogy. When I was young, I worked in a cancer center. Nixon declared the War on Cancer and the whole cancer center, and all people in cancer research spent a lot of time thinking about what that actually meant. Various people said, Well, we have a ‘War on Cancer’ and we cure cancer, then what are we going to do? And I said, what a great thing to happen. I feel the same exact way about this. If it turns out that simple structures can be easily modeled and predicted, then we can work on more complicated things and more, big macromolecular machines. That's all good, okay. That's my view.

Vivien
That’s Dr. Helen Berman. Helen Berman co-founded the Protein Data Bank, the PDB, in 1971 and directed it for quite some time. The PDB is a database that is the home to over 180,000 protein structures that have been resolved experimentally. There are many more proteins than these but proteins are quite hard to pin down. They are built from biochemical units, amino acids, and have complex three dimensional shapes—twirls and curls of many kinds. And they are not static, the structures move in various ways such as when they interact with other proteins.

Scientists have long worked on the tough nut problem of predicting three dimensional protein structure from amino acid sequence. Now that AlphaFold has come onto the scene, that challenge has become much easier. And there are other platforms that also handle protein prediction such as RoseTTAFold.

But AlphaFold has created a stir since it came out way ahead in the Critical Assessment of Protein Structure Prediction (CASP), a competition in which scientists test the prowess of their methods that computationally predict protein structures.

AlphaFold was developed by a company, DeepMind Technologies, that was bought by Google in 2014, and so it’s now part of Alphabet. There’s a database called the European Bioinformatics Institute-AlphaFold database that is filling up with computationally predicted protein structures.

You can hear more about AlphaFold, what it does and what it cannot do in other podcasts in this series. And I wrote an article for Nature Methods about it, too. For example, I spoke with Dr. Janet Thornton of the European Bioinformatics Institute and Dr David Jones of University College London. And I spoke with a number of junior scientists who explain how AlphaFold is shaping their science and their careers.

Back to the Protein Data Bank, the PDB. So there’s the PDB and the EBI-AlphaFold database. There’s PDB-Dev for structures determined with experimental and computational approaches. There’s a knowledgebase from the Protein Structure Initiative called Model Archive for in silico protein structures. So…there are now several homes for protein structures. Understanding what comes next also takes a bit of history. Here’s Helen Berman.

Helen Berman [3:30]
So the PDB began in 1971. It was definitely grassroots, a bunch of people like myself who really thought we needed to have. I was a postdoc at the time, and I really was interested in the structures. I was interested in the protein folding. And I knew, those of us who were activists knew, that these data would have what we need to understand folding. So we knew that back then. We didn't know how and all that.

But we knew that. And then it was kind of a sleepy database. It just collected data and somewhat curated but not heavily curated. There was no one who said that you had to put data in the PDB until the 80s, when a bunch of very respected senior structural biologists, Dick Dickerson, Fred Richards, to name two, said it is basically immoral not to have these data required. Okay. So the structural biology community comes from this very strong idea of the public good from the very beginning.

So in order to figure out what to do, there was a bunch of committees set up, and the people who were producing the data had to come up with: what do we mean by mandatory deposition? What is the meaning of that? What data should be deposited? And so in 1989, there were guidelines set up for the data. And then there were many, many discussions with journals because it was understood that if the journals didn't record. Journals are the place that you need to have all this stuff. And so the journals, one by one.

So the first one was the International Union of Crystallography Journals, and then JBC and eventually Nature and Science. And once Nature and Science came on then. So people always put their data in.

One of the things that were in the PDB until that point were some in silico models, and that presented a problem. So we had a workshop in 2005 where we brought together modelers and experimentalists to figure out what to do. And it was decided at that point 2005, we wrote a white paper, that the models, strictly in silico models should not be mixed in with the experimental models, because when people downloaded data, they might be taking in silico and then everything would get all mixed up.

So we came up with that recommendation, and at some point, the PDB then separated the in silico models and put them aside. Then, as part of the protein

Vivien
Does that mean that there is a separate PDB database?

Helen Berman
That's what I'm going to tell you about. Around 2000 or so the Protein Structure Initiative began. The idea was to basically create models of everything that was important. Okay. Using basically homology modeling. So you did structures that would yield the most possible models so that we could map all of the structures. That was the idea. As part of that initiative, we had something called the Structural Biology Knowledgebase. And in that knowledgebase, we set up a Model Archive. The Model Archive was run by Torsten Schwede. Torsten is at the Swiss Institute of Bioinformatics.

So he set up two things. One was the Model Archive. And the second was a protein modeling portal where all the different kinds of homology models could be obtained by all the different methods. And so that's the way that was all set up. And that was then a real home for models. Okay, that's one piece of data. The second piece of data is in around the year 2014, we began to realize that there were a lot of these integrative structural models, which meant that you had a lot of data and a lot of computing.

So it's a sort of hybrid way of doing structures. We began a project to figure out how to manage integrative models. And we have something called PDB-Dev. So in PDB-Dev, we have these models that are sort of a hybrid between computational and experimental. The idea, then, is that eventually this would be part of the PDB. Now, when AlphaFold came about, those models are strictly in silico.

So right now, there's multiple discussions, and I really don't want to go into all the details because there's multiple discussions about what is the right way to handle these in silico AlphaFold, and then there's RoseTTAFold. There are going to be more and more. So there are discussions going on among the people who know the subject matter.

Vivien
From a naive standpoint, is it important for you all to have these discussions because scientists who are not structural biologists need to kind of know where to go to get what, and what they're going to get when they download?

Helen Berman
Yeah, and that's why we have to figure out the right way, because AlphaFold would never, ever have succeeded, ever if the models had been improperly mixed with the experimental, because AlphaFold depends on having well curated experimental data, okay. And if you don't have well, well curated experimental data, then you don't have something to train with. So the big challenge now is how this is all going to happen. Various people have come up with solutions.

Vivien
The solutions are all in the making. Dr Berman has recently retired from her faculty post at Rutgers University and, at least officially, from the Protein Data Bank. But in fact she is right in the middle of these discussions about the next chapter for the Protein Data Bank. Her days don’t exactly sound like she has retired.

The actual solutions and ideas are being kept under wraps but Helen Berman does share a little bit of what is going on.

Helen Berman [11:25]
Some people think, oh, we do it this way. And other people think, oh, we do it that way. And there's a lot of I mean, I'm involved in these discussions. Okay. It's daily. Discussions about what is the right way. So those of us who come from the J. D. Bernal public good point of view, which came in the 1930s, take this really seriously. Okay. You do not want to do something to mess up.

The other thing that we did that was really important with the PDB is in 2003, we set up something called the Worldwide PDB, where we agreed that the PDBe, PDBj, and RCSB PDB would work together to manage these structural data.

And this has worked extremely well. We also set up validation workshops so that each kind of data was being carefully validated. And the people who made the decisions about what these validation standards should be are the people who do the science. So the X-ray crystallographers figured it out for X-ray, NMR for NMR, EM for EM. This is evolving, and it continues. And now we're adding this new thing to the mix, okay. And so to me the most important thing from my point of view, having been involved in this for 50 years.

Vivien
Rock on.

Helen Berman
The most important thing is to make sure that people when they come to the PDB are getting what they think they're getting.

So if they think they're getting well-curated experimental data, you want to make sure they can get it. And you can't say, oh, well, that doesn't matter. Buyer beware, unh-unh. You have to really do this right. So AlphaFold would never have succeeded had we not made that decision in 2005. Now we, the public we, whoever we are, have to make a new kind of decision.

And there's lots of egos involved. There's lots of points of view involved. And my view is we just have to figure out how to do this the right way. And it requires a little bit of thought. So lots of people are talking. Right now, it's one on one, one on one, one on one. Eventually, I'm hoping, because even though I'm retired, okay, you would never know it. Okay. This is something that we got all the way up to 50 years, and there's a triumph in AlphaFold. That's a triumph.

That is exactly what those of us who cared about making a PDB wanted to see happen. And it has now happened. Now we want it to happen even better. We want new things to happen. Those of us who are looking at it in a broader way.

So we're really in an incredible transition point. Now, I suspect within six months, there will be a solution to how to handle all this. I do have a prejudice about how it should be handled.

Vivien
So what is the prejudice that she has about how it should be handled? Dr Berman and I are both New Yorkers so we wrangled a bit on this.

Vivien
What good would science be if there weren't sort of multiple opinions, right.

Helen Berman
There’s a huge number of opinions.

Vivien
I do understand she wants to keep the details of the plans and the discussions more or less under wraps.

But she did share some things that are flowing into how the new, um, I am just going to call it ‘offering,’ is going to be built that relates to the PDB and the computationally generated proteins such as those from AlphaFold.

There’s a bit of history involved: the Protein Structure Initiative, PSI for short, was an NIH-funded program all about generating protein structural data. It began in 2000 and, what is that word that is used, it was sunset in 2015. Closed down. Not everyone was happy about this of course. What was built for the PSI will be playing a role in the future of the PDB and the digital home for all those computationally predicted proteins. Here’s Helen Berman.

Helen Berman [16:00]
The infrastructure that we created for the models data, which was set up at Swiss Institute of Bioinformatics, that infrastructure was carefully preserved. So when they shut down PSI, we made sure to keep all the bones in a safe place. So I've been in discussions with Torsten Schwede, who is still in Switzerland and is heavily involved in ELIXIR and all that stuff. Okay. So this could be revived, and then this could be a way to handle it. And we've been talking about different technical solutions for where to store the data.

There's two issues: where to store the data so that people will know that what they're getting are models that are not backed by experiment, they're predictions. And then the other issue is how to serve the data. Serving the data is an open field. Anybody can figure out different ways of serving the data. And that's good. That's a competitive thing. How do you make the data usable and all that kind of stuff? That's an open playing field. But the part about where to store the data, really, we need to have some kind of international agreement, and that's what we're trying to figure out. Some of us are trying to figure out.

Vivien
So for now there is the PDB and there is the EBI-AlphaFold Protein Structure Database. This duality is likely not going to remain. But as of right now as I produce this podcast, you have heard where things stand.

In thinking about the future, more came up in our conversation about the PSI. History always shapes the future. The PSI generated structures on a pretty large scale.

Helen Berman [18:00]
That was one part of PSI. The other part of PSI had to do with: you want to select structures to work. The initial part of PSI, there was huge effort to track structures to work on that one could leverage into models for more structures. And had PSI not been shut down, we would be there. In my opinion, it was a big mistake to shut down PSI.

It was really a very important initiative that actually taught us in terms of experiment, how to fast track structures and yet have higher quality, that's one of the things that came out of PSI that you could get higher quality experimental structures. Because people developed all these methods. And then the other thing was that you could model structures, do comparative modeling.
We had a way to serve up all that.

Vivien
There are some who feel AlphaFold has crushed their work. And others worry maybe there is nothing left to research related to protein structure prediction now that ‘an AI can do it.’ To talk about this Helen Berman draws a parallel to the War on Cancer that US president Nixon declared in 1971.

Here’s a quick biographic excursion about Dr Berman. While a high school student, she worked in a lab at Barnard College, then as a Barnard College student she worked in a lab with Dr Barbara Low. She became fascinated with crystallography and became a crystallographer.

Helen Berman [19:40]
AlphaFold is a triumph. And I will make the following analogy. When I was young, I worked in a cancer center. Nixon declared the War on Cancer and the whole cancer center, and all people in cancer research spent a lot of time thinking about what that actually meant. Various people said, Well, we have a ‘War on Cancer’ and we cure cancer, then what are we going to do? And I said, what a great thing to happen. I feel the same exact way about this. If it turns out that simple structures can be easily modeled and predicted, then we can work on more complicated things and more, big macromolecular machines. That's all good, okay. That's my view.

Vivien
The PDB is open and always has been. And quite a number of people download it in its entirety.

Helen Berman [20:45]
The PDB is completely open. People download it all the time. The pharmaceutical companies download regularly because they don't want anyone to know what they're doing, so they don't do web searches.

So every week, every month, I don't know any more what it is, they download the whole archive, the whole thing, and then they put it into their own, behind their own firewall and do whatever they're going to do with it. And that's what the PDB is for.

Vivien
There was one series of downloads that caught Helen Berman’s eye. Mind you downloads are not monitored in any way.

Helen Berman [21:20]
When I was still in New Jersey before I got marooned in California, which I'm getting used to. But we had a little board, a little electronic board, and we could see what kind of activity was going on, what kind of downloads were going on, what kind of hits were going on. And we could see it in real time. And I saw that there was something massive going on around the world. There was something in London, there was something in New York, there was something in San Francisco.

And I kept saying to people, shouldn't we find out who's doing all of this? I said, this can't be normal structural biology. This is something else. And I said, somebody thinks they're going to make money on this. I remember saying that. And people said, "Oh, come on." I said, "Why are all these people downloading all this stuff so much?" You could see big downloads. But we never discouraged that because it's my view that that's what it was for. If you have an open database, then you're going to have people use it. And if you don't have people use it, then why have it?

We have about two and a half million coordinate downloads per day. Okay, that's what we have. It's a huge usage day to day. I'm talking about whole coordinate sets. Okay, 2.3, 2.5, something like

Vivien
That's global. Obviously, that's not just it's global.

Helen Berman
And the wwPDB keeps all those stats. We try to keep up with them. It's harder to do. Okay. We have to be careful. We do not want and should not know who's doing what; I don't care. It's supposed to be for knowledge. I remember saying when people would say, ‘Are we allowed to do X, Y and Z?’ And I would say, and it would be a commercial person, I’d say if somebody can figure out how to make money and do a better job than us. Well, that means they're making a better product. So that's okay.

Vivien
That's great.

Helen Berman
That was my point of view. Remember, I'm a 60s person. I have a certain point of view about things, and that's what I believe.

Vivien
Being able to predict structure from sequence is something Helen Berman has long thought was going to be an enabler for the field.

Helen Berman [23:45]
As a graduate student in 1967, I wrote something. We needed a PDB precisely for this: to be able to predict structure from sequence. Okay. That's why we needed a PDB. And that was what I wrote in this thing for my PhD qualifier. I didn't get it out of my own head. I had visited MIT. I had met Cyrus Levinthal. I saw what was potentially possible, and it seemed to me that this is what we should be doing. But the only way we could do it is if we had data.

Vivien
The PDB is a core part of Helen Berman’s biography. As she and others think about the next step in the PDB’s development, now that systems like AlphaFold exist, it’s interesting to remember that the PDB was initiated by a group of students and postdoctoral fellows.

Helen Berman
Remember the PDB was started by postdocs and trainees and graduate students. That's who was agitating for it way back.
I wasn't the only one. There were a few of us back then. This was the 60s. We were very young. We talked a lot. We were so excited by looking at the structures, and we thought, what can we do with all this? And I remember we had these meetings and we wrote petitions and we did all kinds of things to see if we could get the data out there.

It was the kids who did it. And then we had to convince this elderly guy, the 40 year old guy. We knew that somebody important had to make it happen, that we couldn't make it happen. But we had to convince him to do it. And we did. But the initial people involved were all very young.

Among the people who were involved among many people. But it was Gerson Cohen. Unfortunately passed away. Edgar Meyer passed away, myself, were people who were very active and collaborated sort of. And we had to do it all by snail mail. There was no email. And we would have these meetings about how do we make this happen? And we were all very young people, just beginning in our careers. And then we went to this Cold Spring Harbor meeting in 1971 and Walter had driven down from Brookhaven, and we kind of assaulted him and said, you know, we really need someone to do this.

And we knew we had enough sense to know that on our own, we couldn't do it. We needed somebody who had credentials.

We were writing letters and telling people what we thought had to happen. I think Edgar, Edgar, and Gerson were both, like beginning in their independent careers. Or one of them might have been a postdoc or research associate when this all began. I met Edgar when he was a postdoc. So we were young and I was a student. So that's how things really happened.

Yeah. That's absolutely the way it happened. And then we had the, as they say in my language, the chutzpah to go and say, this is what should happen. And that's what happened.

Vivien
One aspect that comes up regularly when talking about AlphaFold is that it might put some people out of work, it might cancel some scientific endeavors currently underway. For example software tools in the making. Helen Berman doesn’t know of anyone specifically who is being affected this way.

Helen Berman
I'm sure that's true. But that's not the gang I'm hanging out with, okay. But that's just normal. There are better ways to do things and you get used to it. And as you get older, as somebody who’s pretty old, you kind of say, okay, that's great, if there's a better way to do something than the way I originally thought it should be done, then that's a good thing.

Vivien
Talking about the possible impact of AlphaFold takes Helen Berman back in time to the crystallographer Walter Hamilton.

Helen Berman
This reminds me. Okay. In 1970, the person who ultimately founded the PDB was named Walter Hamilton, and he was at Brookhaven, and he wrote an article I think it was for Science called the Crystallographic Revolution. He used the word revolution. And what he was talking about was the fact that small molecules are so easy to determine that it's no longer a challenge. That was in 1970 for small molecules. The professional crystallographers really got on his. He was a young man at the time. He was not even 40.

The professional crystallographers really got on his case and said, 'You're going to get us out of business.' And he said, 'but it's true.' Okay. He's then the one that we convinced to set up the PDB, which he did and unfortunately passed away very shortly thereafter of some terrible disease. But at any rate, Walter set up the PDB. And now I think we are in exactly the same place with protein structure as we were with small molecule structure. And I think that's okay, because there's so many challenges, structural challenges and everything we learned about how you do structures rapidly and all that stuff will just inform what happens next.

So I don't see this as a negative thing at all. I just see: you can get rid of the easy stuff. And now everybody wants to, at least most of us want to, work on challenges. Okay. So there'll be new challenges. That's how I see it. There's a huge amount to do, and you just have to take the long view. And maybe those of us who are older can take a longer view.

Vivien
Big projects, especially ones that involve data sharing, can be quite challenging. One of them is called ELIXIR. It’s a pan-European organization to enable sharing of data and software in the life sciences across country boundaries. It was put in motion by Dr Janet Thornton, the former director of the European Bioinformatics Institute. There is a separate podcast with Dr. Thornton and Dr. David Jones of University College London about AlphaFold and they talk a little about ELIXIR as well. Here’s Helen Berman:

Helen Berman [30:30]
Ultimately, ELIXIR is a model for how one should handle all those different kinds of data repositories and knowledge bases. In my opinion, it's brilliant. And I think the US should have modeled how they handle things that way. But they didn't, we didn't. So that's too bad. And I think she handled it brilliantly.

Vivien
Now that AlphaFold and other systems such as RoseTTAFold are here, what’s next? I have produced other podcasts in which I chat with junior scientists about their views on this. Huge challenges and unanswered questions await in structural biology. AlphaFold has not solved all the challenges that proteins present scientists with.

Helen Berman [31:15]
Oh, without question. There are these huge challenges, which in a way because there's going to be more time. There's going to be more creativity. I think it's a great thing. Okay. People will know, okay. If I have a structure that's a small domain, I can probably predict it. And I don't have to worry about that. So now I can worry about the bigger things. But there are still lots of challenges having to do with protein-protein interactions, protein-ligand interactions, very, very large structures, multi-domain structures, multi-state structures. There's all kinds of stuff that we need to face.

Vivien
In her lifetime Dr Berman has seen the possibilities of solving protein structure and predicting structure change radically.

Helen Berman [32:05]
I was trained as a small molecule crystallographer. It's basically anything. It could be an amino acid or it could be a drug. So when I was trained, those structures were very difficult to do. And they're done in a different way than we do protein structures. And they were very challenging.

So, I mean, for my PhD thesis, I had five such structures, and they were challenging because they just were. Because you had no idea what you were going to find. And there were various indicators to let you know when you had found the right answer. But by the time, now, the structures that took me maybe six months to a year each probably can be done in five minutes now.

Vivien
Walter Hamilton, the crystallographer who pioneered the PDB, titled his 1970 paper in Science “The Revolution in Crystallography.” Automation and computers, he wrote, “have made X-ray structure determination a routine laboratory tool.” As you heard from Helen Berman, the X-ray crystallographers got on his case for saying this, for saying how routine the technique was becoming to analyze a complex chemical problem in a reasonable amount of time.

Helen Berman [32:25]
His background, he was a small molecule crystallographer. And people called him the crystallographer's crystallographer. He was trained at Caltech and at Oxford. He was a small town boy from Oklahoma, and he was full of energy and had lots of ideas. And the last thing he was working on before he died was amino acid structures using neutron diffraction in order to define precisely the positions of the hydrogens and all that, which is still an issue, okay. So that's what he was working on.

Vivien
AlphaFold has changed the landscape of protein structure prediction and it will likely continue to change it.

Helen Berman
Yeah, I don't see any downside to what happened. Okay. I just don't. The only thing that has to be worked out is how best to collect the data and serve the data so that we don't ruin the tradition that we have with the PDB, which is be open, be collaborative, cooperative, fight it out behind the scenes and then have a united front. And that's what we did with the WW. Very important.

Vivien
That was Conversations with scientists. Today’s guest was Dr. Helen Berman, former researcher at Rutgers University, former director of the Protein Data Bank and current co-architect of the next phase of the Protein Data Bank.