One
of the principal unsolved problems of science is the problem of
protein folding, the problem of how simple strings of amino acids
(called polypeptide chains) are able to form very rapidly into the
intricate three-dimensional shapes necessary for protein function.
Scientists have been struggling with this problem for more than 50
years. Protein folding is constantly going on inside the cells of
your body, which are constantly synthesizing new proteins. The
correct function of proteins depends on them having specific
three-dimensional shapes.
Last
month the New York Times had an article suggesting that the protein
folding problem had been solved. But this insinuation is not at all
correct. Not only has the protein folding problem not been solved,
but the most systematic assessment of progress on this problem
suggests that scientists are light-years away from solving it.
In DNA, proteins are represented simply as a sequence of nucleotide base pairs that represents a linear sequence of amino acids. A series of
amino acids such as this, existing merely as a wire-like length, is
sometimes called a polypeptide chain.
But a protein molecule isn't shaped like a simple length of
copper wire – it looks more like some intricate copper wire
sculpture that some artisan might make.
Below
are two examples of the 3D shapes that protein molecules can take.
There are countless different variations. Each type of protein has its own distinctive 3D shape.
The
phenomenon of a protein molecule forming into a 3D shape is called
protein folding. How would you make an intricate 3D sculpture from a
long length of copper wire? You would do a lot of folding and bending
of the wire. Something similar seems to go on with protein folding,
causing the one-dimensional series of amino acids in a protein to end
up as a complex three-dimensional shape. In the body this happens
very rapidly, in a few minutes or less. It has been estimated that it would take 10^42 years for a
protein to form into a shape as functional as the shape it takes, if
mere trial and error were involved.
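To get a feel for why mere trial and error is hopeless, here is a rough back-of-the-envelope sketch in Python. It is not the calculation behind the estimate quoted above. The function name, the figure of 3 possible shapes per amino acid, and the rate of 10^13 trials per second are all illustrative assumptions of mine, chosen only to show how quickly the numbers explode.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def log10_years_for_exhaustive_search(n_residues,
                                      states_per_residue=3,
                                      trials_per_second=1e13):
    """Return log10 of the years needed to try every conformation once.

    Assumptions (hypothetical, for illustration only):
      - each amino acid independently takes one of `states_per_residue` shapes
      - the chain samples `trials_per_second` conformations every second
    Working in logarithms avoids overflow for long chains.
    """
    log10_conformations = n_residues * math.log10(states_per_residue)
    log10_seconds = log10_conformations - math.log10(trials_per_second)
    return log10_seconds - math.log10(SECONDS_PER_YEAR)


if __name__ == "__main__":
    exponent = log10_years_for_exhaustive_search(100)
    print(f"100 amino acids: about 10^{exponent:.0f} years")
```

Under these assumptions, a chain of only 100 amino acids would need roughly 10^27 years to try every shape. Different assumptions give different exponents, but any reasonable choice yields a time vastly longer than the time a protein actually takes to fold.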
The
question is: how does this happen? This is the protein folding
problem that biochemists have been struggling with for decades. It
has been often said that when the protein folding problem is solved,
scientists will be able to reliably predict the 3D shape of a protein
from only its sequence of amino acids.
In
December 2017 the New York Times had a story about attempts to create
artificial proteins. The story seemed to announce a monumental
success – that the protein folding problem had been solved. Here is
what the Times article said:
But
they’ve been stumped by one great mystery: how the building blocks
in a protein take their final shape. David Baker, 55, the director of
the Institute for Protein Design at the University of Washington, has
been investigating that enigma for a quarter-century. Now, it looks
as if he and his colleagues have cracked it.
This
claim in the article inspired computational biologist Mike Inouye to
send out a triumphal tweet proclaiming: “Mind blowing...the protein
folding problem is essentially solved.”
But there is an elaborate, systematic methodology in place for
determining the progress made so far on the protein folding problem,
and that methodology is currently telling us loud and clear that progress
on the problem is very small, and that the problem is far more
unsolved than the New York Times has suggested.
What
is called the Critical Assessment of Protein Structure Prediction
(CASP) is a competition to assess the progress being made on the
protein folding problem. The competition has been run every
two years since 1994. You can read about the competition and see its
results at this site. The first competition in 1994 was called
CASP1, and the latest competition in 2016 was called CASP12.
Particular prediction models are used to make predictions about the
3D shape of a protein. The competitors don't know the 3D shape, but
only are given the amino acid sequence. The competitors make their
best guess about the 3D shape, using some prediction model that is
often computerized.
The
competition is broken up into two categories, one category in which
"template-based" modeling can be used, and one in which the predictions
are supposed to be “template-free” approaches (also called de
novo approaches or ab initio approaches). The latter
approach is not supposed to depend on a large database of
proteins or a database of protein fragments (something that a cell
doesn't have when a 3D protein shape appears).
While
looking at the CASP web site, I found the paper here, which gives a
graph summarizing what kind of success level was reported in the CASP
competitions up until the CASP10 competition in 2012. The
graph is below.
The
GDT_TS shown on the left is the “global distance test” total
score, a measure of how accurate a prediction is, on a scale from 0
to 100. A GDT_TS near 100 means a highly accurate prediction, and a
GDT_TS of only about 20 means a poor, inaccurate prediction.
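For readers who want to see how such a score works, here is a minimal sketch in Python. It uses the commonly described form of GDT_TS: the average, over distance cutoffs of 1, 2, 4 and 8 angstroms, of the percentage of amino acids whose predicted position lies within that cutoff of the true position. The function name and the sample coordinates are hypothetical, and the sketch assumes the two structures have already been superimposed. The real CASP evaluation searches over many superpositions, which this simplified version does not do.

```python
import math


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0 to 100) for two already-superimposed structures.

    `predicted` and `reference` are equal-length lists of (x, y, z)
    coordinates in angstroms, one point per amino acid.
    """
    if not reference or len(predicted) != len(reference):
        raise ValueError("structures must be non-empty and the same length")

    # Distance between each predicted position and its true position.
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]

    # Fraction of amino acids within each cutoff, then the mean of those.
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Made-up coordinates: four amino acids, displaced by 0.5, 1.5, 3 and 10 angstroms.
    true_shape = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    guess = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]
    print(gdt_ts(guess, true_shape))
```

In this made-up case the score is 56.25: one amino acid is within 1 angstrom, two are within 2, and three are within both 4 and 8. A score around 20 means that most amino acids are not even within the loosest of these cutoffs.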
We
can see from the graph above that the same failure has plagued the
prediction models in all of the competitions: the models work well
with simple cases (trying to predict the 3D shape of a protein with
few amino acids), but do not work well with more complex cases
(trying to predict the 3D shape of a protein with many amino acids).
Moreover, it seems that while some progress was made between CASP1 in
1994 and CASP4 in the year 2000, little progress was made between
CASP4 in the year 2000 and CASP10 in the year 2012.
The
paper here summarizes the results of “template free”
protein-folding prediction in the 2012 CASP10 competition. This is
the harder type of prediction, in which you are not supposed to use
templates, which are patterns derived from studying many
different cases of proteins. A cell itself does not use any such
thing, so anyone claiming to have an explanation for how nature folds
proteins shouldn't be resorting to such meta-data.
The
paper found that the results were poor: “Even the most successful
one submitted best models for only four of the 19 FM
targets and eight of the 36 ROLL targets.” It also found that
“Many, if not most, good models appear to have been produced
by template-based modeling or the related technique of server
model selection and refinement.” This basically amounts to an
accusation of widespread rule-breaking that resembles cheating. This part of the competition was
supposed to be for “template free” predictions, but many of the
competitors used templates anyway (like some swimming competitor
cheating by sneaking in some freestyle strokes during a breaststroke
competition). Even with this rule-breaking resembling cheating, the prediction results were
poor.
Looking
at the results from the latest and greatest competition in 2016
(CASP12), there seems to be no big recent progress. The page here
shows the same type of poor numbers as shown in the graph above. The
GDT_TS numbers are almost all very low.
From
this examination, we can see that the New York Times story has
misinformed us by insinuating that the protein folding problem has
been “cracked.” Nothing of the sort has happened. Scientists
cannot predict with anything close to accuracy the 3D shape of a
typical protein from the sequence of amino acids found in a gene. Scientists
have been knocking their heads on this problem for 50 years, and seem
to be stalled at a very low level of success, in which only the
shapes of very simple proteins can be reliably predicted. The median size of a human protein is 375 amino acids, and scientists cannot predict the 3D shape of a protein with such a size.
I
may note that the small success that scientists have had in the area
of protein structure prediction is based mostly on data-crunching
techniques completely unavailable to a cell where protein folding
occurs. The template-based approach involves pattern matching
utilizing our knowledge of thousands of proteins. Even the techniques
called “template-free” or de novo or ab initio do
not live up to their original goal of being techniques using only the
amino acid sequence. For these de novo or ab initio techniques
also have a dependency on data obtained from analyzing many
proteins, knowledge other than just the amino acid sequence. For
example, the Rosetta technique makes use of a “fragments library”
created by analyzing a large library of proteins. If scientists were
to use the same knowledge limitations in a cell (having only the
amino acid sequence and no other data), they wouldn't even be able to
report the small degree of success in this area they have reported.
Let
us imagine that astronauts were to travel to some strange planet. On
the planet they might notice a very astonishing thing: whenever the
astronauts chopped down trees, and put the logs in a long row, the
logs conveniently assemble all by themselves into log cabins. This
would be an indication that something very dramatic was occurring on
this planet: perhaps the action of some mysterious unseen force, one
with signs of intelligence.
A
planet like this has been discovered. It is our own planet. The only
difference is that rather than rows of logs conveniently forming by
themselves into log cabins through some mysterious unknown effect, we
see linear sequences of amino acids conveniently forming into
three-dimensional protein shapes often much more elaborate than the
structure of a log cabin. We should not at all assume that it is
ordinary chemistry that produces this astonishing protein folding
effect. If it were mere chemistry, the chemical rules producing such
an effect would have been discovered long ago, and the protein
folding problem would have been solved long ago.
Just
as astronauts witnessing this log-cabin marvel should suspect that
some mysterious force with signs of intelligence was behind the
marvel they were seeing, we should suspect that the marvel of protein
folding is an indication of some great, mysterious reality of nature
far beyond our ken – perhaps some mysterious life force involved
not just in protein folding but also in the comparable marvel of
morphogenesis, where a fertilized egg mysteriously progresses to the
complexity of a newborn infant. DNA (which is essentially just a long set of
lists of amino acids) has nothing that can explain either of these
marvels.
No
matter what marvel a typical scientist may observe, he will attempt to squeeze
the wonder out of it by describing it as something explicable by
ordinary laws of chemistry and physics. Let us imagine a planet named
Volpurnia where a strange thing always happens: whenever anyone jumps
off of a cliff or high building, they always decelerate and land
softly on the ground, without any damage. Would the scientists of
Volpurnia say this was a sign of some mysterious providential force
at work? Of course not. They might instead call this “the law of
harmless falls,” and say that it was caused by just run-of-the-mill
physics. They might say, “Give us a few decades, and we'll explain
it.” If you then returned 40 years later, and asked if they had figured
out what is causing this “law of harmless falls,” the scientists
would say something like, “We haven't quite figured that out, so
give us 40 more years.” The scientists who have struggled for
decades to explain protein folding, with little substantial success,
are like these scientists of Volpurnia; and the marvel of protein
folding is no less astonishing than such a “law of harmless falls.”