One of the principal unsolved problems of science is protein folding: how simple strings of amino acids (called polypeptide chains) are able to form very rapidly into the intricate three-dimensional shapes necessary for protein function. Scientists have been struggling with this problem for more than 50 years. Protein folding goes on constantly inside the cells of your body, which are always synthesizing new proteins. The correct function of proteins depends on their having specific three-dimensional shapes.
Last month the New York Times had an article suggesting that the protein folding problem had been solved. But this insinuation is not at all correct. Not only has the protein folding problem not been solved, but the most systematic assessment of progress on this problem suggests that scientists are light-years away from solving it.
In DNA, proteins are represented simply as a sequence of nucleotide base pairs that represents a linear sequence of amino acids. A series of amino acids such as this, existing merely as a wire-like length, is sometimes called a polypeptide chain.
But a protein molecule isn't shaped like a simple length of copper wire – it looks more like some intricate copper wire sculpture that some artisan might make.
Below are two examples of the 3D shapes that protein molecules can take. There are countless different variations. Each type of protein has its own distinctive 3D shape.
The phenomenon of a protein molecule forming into a 3D shape is called protein folding. How would you make an intricate 3D sculpture from a long length of copper wire? You would do a lot of folding and bending of the wire. Something similar seems to go on with protein folding, causing the one-dimensional series of amino acids in a protein to end up as a complex three-dimensional shape. In the body this happens very rapidly, in a few minutes or less. It has been estimated that it would take 10^42 years for a protein to form into a shape as functional as the shape it takes, if mere trial and error were involved.
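A back-of-the-envelope version of this kind of trial-and-error estimate (often associated with Levinthal's paradox) can be sketched in a few lines. The inputs here are assumptions chosen purely for illustration, not the inputs behind the figure quoted above: a chain of 100 amino acids, 3 possible conformations per amino acid, and 10^13 conformations tried per second. The function name is made up for this sketch.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def log10_search_years(residues, states_per_residue=3, samples_per_second=1e13):
    """Return log10 of the years needed to try every conformation once.

    All three inputs are illustrative assumptions, not measured values.
    Working in logarithms avoids overflow for long chains.
    """
    # Number of conformations is states_per_residue ** residues.
    log10_conformations = residues * math.log10(states_per_residue)
    # Time to visit each conformation once, at the assumed sampling rate.
    log10_seconds = log10_conformations - math.log10(samples_per_second)
    return log10_seconds - math.log10(SECONDS_PER_YEAR)


exponent = log10_search_years(100)
print(f"roughly 10^{exponent:.0f} years of blind search")
```

With these assumed inputs the exhaustive search comes out at around 10^27 years. Different assumptions about chain length or the number of states per amino acid shift the exponent considerably, which is why published estimates vary, but the result stays astronomically larger than the folding times seen in cells.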
The question is: how does this happen? This is the protein folding problem that biochemists have been struggling with for decades. It has been often said that when the protein folding problem is solved, scientists will be able to reliably predict the 3D shape of a protein from only its sequence of amino acids.
In December 2017 the New York Times had a story about attempts to create artificial proteins. The story seemed to announce a monumental success – that the protein folding problem had been solved. Here is what the Times article said:
But they’ve been stumped by one great mystery: how the building blocks in a protein take their final shape. David Baker, 55, the director of the Institute for Protein Design at the University of Washington, has been investigating that enigma for a quarter-century. Now, it looks as if he and his colleagues have cracked it.
This claim in the article inspired computational biologist Mike Inouye to send out a triumphal tweet proclaiming: “Mind blowing...the protein folding problem is essentially solved.” But there is an elaborate, systematic methodology in place for measuring the progress made so far on the protein folding problem, and that methodology is currently telling us loud and clear that progress has been very small, and that the problem is far less solved than the New York Times has suggested.
The Critical Assessment of Protein Structure Prediction (CASP) is a competition to assess the progress being made on the protein folding problem. It has been run every two years since 1994. You can read about the competition and see its results at this site. The first competition in 1994 was called CASP1, and the latest competition in 2016 was called CASP12. Particular prediction models are used to make predictions about the 3D shape of a protein. The competitors don't know the 3D shape, and are given only the amino acid sequence. The competitors make their best guess about the 3D shape, using some prediction model that is often computerized.
The competition is broken up into two categories, one in which "template-based" modeling can be used, and one in which the predictions are supposed to come from “template-free” approaches (also called de novo approaches or ab initio approaches). The latter approaches are not supposed to depend on a large database of proteins or a database of protein fragments (something that a cell doesn't have when a 3D protein shape appears).
While looking at the CASP web site, I found the paper here, which gives a graph summarizing what kind of success level was reported in the CASP competitions up until the CASP10 competition in 2012. The graph is below.
The GDT_TS shown on the left is the “global distance test” total score, a measure of how closely a predicted structure matches the experimentally determined one. It ranges from 0 to 100: a GDT_TS near 100 means a very accurate prediction, and a GDT_TS of only about 20 means a poor, inaccurate prediction.
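To make the score concrete, here is a minimal sketch of how a GDT_TS-style number can be computed. It averages the fraction of residues whose predicted position lies within 1, 2, 4 and 8 angstroms of the true position, and scales the result to 0-100. The sketch assumes the predicted and true structures have already been superimposed and that the per-residue distances are supplied directly; the real procedure also searches over superpositions to maximize each fraction. The function name and the sample distances are made up for illustration.

```python
from typing import Sequence

# Distance cutoffs (in angstroms) used by the GDT_TS score.
CUTOFFS_ANGSTROM = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(distances: Sequence[float]) -> float:
    """Simplified GDT_TS from per-residue distances (angstroms).

    Assumes a single, fixed superposition of prediction and target.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    n = len(distances)
    # Fraction of residues within each cutoff.
    fractions = [sum(d <= cutoff for d in distances) / n
                 for cutoff in CUTOFFS_ANGSTROM]
    # Average the four fractions and scale to 0-100.
    return 100.0 * sum(fractions) / len(CUTOFFS_ANGSTROM)


# Hypothetical distances for a five-residue toy example.
sample = [0.5, 1.5, 3.0, 6.0, 12.0]
print(f"GDT_TS = {gdt_ts(sample):.1f}")
```

In this toy example one residue in five falls within 1 angstrom, two within 2, three within 4 and four within 8, so the score lands at about 50. A prediction with every residue within 1 angstrom would score 100, and one with every residue more than 8 angstroms off would score 0.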
We can see from the graph above that the same failure has plagued the prediction models in all of the competitions: the models work well with simple cases (trying to predict the 3D shape of a protein with few amino acids), but do not work well with more complex cases (trying to predict the 3D shape of a protein with many amino acids). Moreover, it seems that while some progress was made between CASP1 in 1994 and CASP4 in the year 2000, little progress was made between CASP4 in the year 2000 and CASP10 in the year 2012.
The paper here summarizes the results of “template free” protein-folding prediction in the 2012 CASP10 competition. This is the harder type of prediction, in which you are not supposed to use templates, which are patterns derived from studying many different proteins. A cell itself does not use any such thing, so anyone claiming to have an explanation for how nature folds proteins shouldn't be resorting to such meta-data.
The paper found that the results were poor: “Even the most successful one submitted best models for only four of the 19 FM targets and eight of the 36 ROLL targets.” It also found that “Many, if not most, good models appear to have been produced by template-based modeling or the related technique of server model selection and refinement.” This amounts to basically an accusation of widespread rule-breaking that resembles cheating. This part of the competition was supposed to be for “template free” predictions, but many of the competitors used templates anyway (like some swimming competitor cheating by sneaking in some freestyle strokes during a breaststroke competition). Even with this rule-breaking resembling cheating, the prediction results were poor.
Looking at the results from the latest and greatest competition in 2016 (CASP12), there seems to be no big recent progress. The page here shows the same type of poor numbers as shown in the graph above. The GDT_TS numbers are almost all very low.
From this examination, we can see that the New York Times story has misinformed us by insinuating that the protein folding problem has been “cracked.” Nothing of the sort has happened. Scientists cannot predict with anything close to accuracy the 3D shape of a typical protein from the sequence of amino acids found in a gene. Scientists have been knocking their heads on this problem for 50 years, and seem to be stalled at a very low level of success, in which only the shapes of very simple proteins can be reliably predicted. The median size of a human protein is 375 amino acids, and scientists cannot predict the 3D shape of a protein with such a size.
I may note that the small success that scientists have had in the area of protein structure prediction is based mostly on data-crunching techniques completely unavailable to a cell where protein folding occurs. The template-based approach involves pattern matching utilizing our knowledge of thousands of proteins. Even the techniques called “template-free” or de novo or ab initio do not live up to their original goal of being techniques using only the amino acid sequence. For these de novo or ab initio techniques also have a dependency on data obtained from analyzing many proteins, knowledge other than just the amino acid sequence. For example, the Rosetta technique makes use of a “fragments library” created by analyzing a large library of proteins. If scientists were to use the same knowledge limitations in a cell (having only the amino acid sequence and no other data), they wouldn't even be able to report the small degree of success in this area they have reported.
Let us imagine that astronauts were to travel to some strange planet. On the planet they might notice a very astonishing thing: whenever the astronauts chopped down trees and put the logs in a long row, the logs conveniently assembled all by themselves into log cabins. This would be an indication that something very dramatic was occurring on this planet: perhaps the action of some mysterious unseen force, one with signs of intelligence.
A planet like this has been discovered. It is our own planet. The only difference is that rather than rows of logs conveniently forming by themselves into log cabins through some mysterious unknown effect, we see linear sequences of amino acids conveniently forming into three-dimensional protein shapes often much more elaborate than the structure of a log cabin. We should not at all assume that it is ordinary chemistry that produces this astonishing protein folding effect. If it were mere chemistry, the chemical rules producing such an effect would have been discovered long ago, and the protein folding problem would have been solved long ago.
Just as astronauts witnessing this log-cabin marvel should suspect that some mysterious force with signs of intelligence was behind the marvel they were seeing, we should suspect that the marvel of protein folding is an indication of some great, mysterious reality of nature far beyond our ken – perhaps some mysterious life force involved not just in protein folding but also in the comparable marvel of morphogenesis, where a fertilized egg mysteriously progresses to the complexity of a newborn infant. DNA (which is essentially just a long set of lists of amino acids) has nothing that can explain either of these marvels.
No matter what marvel a typical scientist may observe, he will attempt to squeeze the wonder out of it by describing it as something explicable by ordinary laws of chemistry and physics. Let us imagine a planet named Volpurnia where a strange thing always happens: whenever anyone jumps off a cliff or high building, they always decelerate and land softly on the ground, without any damage. Would the scientists of Volpurnia say this was a sign of some mysterious providential force at work? Of course not. They might instead call this “the law of harmless falls,” and say that it was caused by just run-of-the-mill physics. They might say, “Give us a few decades, and we'll explain it.” If you then returned 40 years later and asked whether they had figured out what was causing this “law of harmless falls,” the scientists would say something like, “We haven't quite figured that out, so give us 40 more years.” The scientists who have struggled for decades to explain protein folding, with little substantial success, are like these scientists of Volpurnia; and the marvel of protein folding is no less astonishing than such a “law of harmless falls.”