Wednesday, January 30, 2019

The Science-Flavored Guesswork Known as Phylogenetics

Our scientists often give us visual displays designed to impress us with their grasp of nature. Such visuals should often be taken with a large grain of salt. An example is the type of “composition of the universe” pie graph that claims the universe is about 72% dark energy, 23% dark matter and 5% regular matter. As discussed here, the case for dark matter is wobbly. Moreover, a 2016 study has cast doubt on the research used to make the claim that the universe is 72% dark energy, raising doubts about whether dark energy even exists.

Another type of scientific visual we should have little trust in is the “tree of life” diagram that supposedly shows how one type of life evolved into another. Such visuals are generated using what is called phylogenetics, which involves attempts to compute the ancestry of living things from studying their genomes.

The genome of a single organism contains a gigantic amount of data. Comparing the genomes of many different organisms for similarities is a task too data-intensive for a person to do in his head or on paper. And once you try to estimate hypothetical inheritance trees, the number of possibilities becomes so gigantic that the task has to be handed over to computers.

The idea of doing computer analysis on genomes may sound very impressive, but there are several reasons why this type of analysis does not in general provide convincing evidence that some species had a particular ancestry.

1. Phylogenetic programs assume common descent rather than prove it.

The computer programs used for phylogenetic analysis are not programmed to analyze the likelihood that a particular set of species share a common ancestor. Instead, such programs typically assume from the beginning that such species do share a common ancestor, and the programs busy themselves with trying to compute the most probable inheritance tree that can link such species. 

2. Phylogenetic programs compute a “most likely” tree of evolution, but such a tree is not a likely “tree of evolution.”

One must be careful to distinguish between the concept of “most likely” and “likely.” “Likely” means having a probability of greater than 50%. But “most likely” means more likely than any other possibility. It is very common for a “most likely” possibility to be unlikely, with a probability of less than 50%. For example, if you choose a random word from a book, the “most likely” choice is the word “the.” But such a choice is not a likely choice, and has a likelihood of less than 10%.
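The distinction can be checked with a few lines of Python. This is only a toy sketch: the sample sentence is made up for illustration (it is not from any real book), and the function name `most_likely_word` is my own. In such a short sample the word “the” has a larger share than it would in a full book, but the share is still well under 50%.

```python
from collections import Counter

def most_likely_word(text):
    """Return the most frequent word and its share of all words."""
    words = text.lower().split()
    counts = Counter(words)
    word, count = counts.most_common(1)[0]
    return word, count / len(words)

# Hypothetical sample text, invented for this illustration.
sample = "the cat saw the dog and the dog saw a bird near the old tree"

word, share = most_likely_word(sample)
# The winner is the "most likely" word, yet its probability is below 0.5.
print(word, round(share, 2), share > 0.5)
```

The word that wins the contest is “most likely” only in the sense of beating every other single word. A random pick is still more likely to be some other word.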

Phylogenetic software does not produce an inheritance tree that is likely to be correct. It merely produces the tree that rates as “most likely” among the many alternatives that the software explores. Such a tree may still be very unlikely to be accurate.

3. With any complicated inheritance tree problem, there is a “combinatorial explosion” that prevents phylogenetic programs from being able to try all possibilities, so the software resorts to a fragmentary exploration of the solution space.

Anyone who has studied computer science knows that when there are many variables or data points, the number of possible arrangements grows explosively, even faster than exponentially. A classic example is what is known as the traveling salesman problem. If a salesman has to travel to 20 cities, then the total number of possible travel routes is roughly 20 factorial (about 2.4 quintillion), which is far too many routes to check one by one.
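To put a number on the traveling salesman example, here is a rough sketch in Python. The checking speed of one billion routes per second is purely an assumption of mine, chosen to be generous, and the function names are hypothetical.

```python
import math

def route_count(cities):
    """Number of orderings in which `cities` cities can be visited."""
    return math.factorial(cities)

def years_to_enumerate(cities, routes_per_second=1_000_000_000):
    """Rough years needed to check every ordering at an assumed speed."""
    seconds = route_count(cities) / routes_per_second
    return seconds / (365 * 24 * 3600)

print(route_count(20))                # 2432902008176640000
print(round(years_to_enumerate(20)))  # about 77 years at the assumed speed
print(round(years_to_enumerate(25)))  # five more cities makes it far worse
```

The count itself takes a computer no time at all to write down. The trouble is visiting every one of those routes in turn.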

Given more than 200 species, the possible number of inheritance trees to be considered becomes so great there is no possible way for any computer program to compute all the possibilities. So phylogenetic programs typically resort to a shortcut. They simply allow you to try a certain number of possibilities, and rate each one for its likelihood. The one with the best rating is singled out as the winner. But that's not a method that should inspire confidence. The winner is unlikely to be the actual inheritance tree for the set of species, whenever there are many species being considered.
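The size of the search space can be estimated with a short Python sketch. It uses the standard count of distinct unrooted branching trees for n labeled species, the double factorial (2n - 5)!!, and it assumes strictly bifurcating trees. The function name is my own.

```python
def unrooted_tree_count(species):
    """Number of distinct unrooted binary trees for `species` labeled tips.

    Uses the standard double-factorial count (2n - 5)!! for n >= 3.
    """
    if species < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., (2n - 5).
    for odd in range(3, 2 * species - 4, 2):
        count *= odd
    return count

for n in (4, 5, 10, 50, 200):
    digits = len(str(unrooted_tree_count(n)))
    print(n, "species:", digits, "digit count of possible trees")
```

With 10 species there are already about two million candidate trees. With 200 species the count runs to hundreds of digits, so any search that finishes can only have sampled a vanishing fraction of the candidates.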

4. Too few living species have had their genomes analyzed for phylogenetic programs to be very reliable.

According to this government web site, “more than 250” animal species have had their genomes analyzed. The problem for phylogenetic programs is that this is but a tiny fragment of the total number of living species, which has been estimated as 8 million. Consequently, we don't have enough data to reliably calculate an inheritance tree from so few genomes. Perhaps after very many thousands of genomes have been cataloged, such analysis may be more reliable.

5. We don't have any DNA data for even 1% of the species that previously existed.

The reliability of phylogenetic programs is proportional to how much DNA data we have for extinct species that lived long ago. But we have very, very little DNA data for species that lived long ago. The half-life of DNA has been estimated at only 521 years, meaning that every 521 years about half of the remaining DNA information is lost. So we have no DNA information for species such as dinosaurs. There is no truth to the idea that dinosaur DNA has been preserved because insects that bit dinosaurs have been preserved in amber. That's a fantasy of a "Jurassic Park" movie. When phylogenetic programs try to place dinosaurs in a phylogenetic “tree of life,” they must use guesses about what the DNA of dinosaurs looked like. Similar guesses must be made about almost all of the species being considered.
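The half-life arithmetic is easy to run. The sketch below takes the 521-year figure at face value and assumes simple exponential decay at a constant rate, which is an idealization. It works in base-10 logarithms because the fraction left after millions of years is too small for ordinary floating-point numbers. The function name and the sample ages are my own choices.

```python
import math

HALF_LIFE_YEARS = 521  # the half-life figure quoted in this post

def log10_fraction_remaining(age_years, half_life=HALF_LIFE_YEARS):
    """Base-10 log of the fraction left after `age_years` of simple decay."""
    half_lives = age_years / half_life
    return half_lives * math.log10(0.5)

# Ages chosen for illustration: one half-life, ten thousand years,
# and 66 million years (roughly the end of the age of dinosaurs).
for age in (521, 10_000, 66_000_000):
    exponent = log10_fraction_remaining(age)
    print(age, "years: fraction remaining is about 10 to the power", round(exponent, 1))
```

Under these assumptions the exponent for a 66-million-year-old sample is a negative number in the tens of thousands, which for practical purposes means nothing is left to sequence.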

6. We should have little confidence in phylogenetic programs, given their extremely complicated algorithms that are anything but straightforward.

A document on molecular phylogenetics says this: “The likelihood calculations required for evolutionary trees are far from straightforward and usually require complex computations that must allow for all possible unobserved sequences at the LCA nodes of hypothesized trees.” The same document shows an equation for calculating likelihood, the type of equation used by such a program. It looks as complicated as one of the more complicated equations used in Einstein's theory of general relativity. See here to look at some of the extremely complicated math involved.

When computer programs are based on extremely complicated algorithms, there will very often be bugs in the program, either because of an error in the complicated algorithm or because of a failure to accurately translate the complicated algorithm into computer code such as Java. For example, a recent study found bugs in software used to analyze brain scans, and estimated that thousands of scientific studies using such software may be inaccurate. The more complicated an algorithm, the greater the likelihood it will not be accurately implemented in bug-free computer code.

A paper entitled “The State of Software in Evolutionary Biology” reviewed various computer programs used in phylogenetics, and concluded “the software quality of the tools we analyzed is rather mediocre.” A later paper entitled "The State of Software for Evolutionary Biology" stated, "The software engineering quality of the tools we analyzed is rather unsatisfying." It is a huge problem in science that software programs used for scientific analysis are often written by scientists who dabble in computer programming, and the quality of their work is often second-rate. We should no more expect high-quality code from a scientist dabbling in computer programming than we should expect high-quality house-building and plumbing from a professional musician who dabbles in making houses.

7. We should have little confidence in phylogenetic programs, because there is no way to test the output of such programs.

As a general rule, our confidence in a type of software should be proportional to the degree to which the software has passed tests. For example, if some baseball prediction software were to predict that a particular player would have a batting average next season of .314, and the player did produce exactly such a batting average, and the same type of prediction succeeded for other players, that would be a good sign that the software was reliable. But in the case of phylogenetic software, there is no way to test its outputs. Although certain types of consistency checks and statistical checks can be applied to the output of phylogenetic software, we have no way of verifying that a "tree of life" or an inheritance tree produced by such software is historically accurate. Anyone in the software industry knows that untested software is not something you should have much confidence in.

8. Lateral gene transfers cast doubt on the reliability of phylogenetic estimates.

Here is a quote from a 2016 scientific paper:

One of the several ways in which microbiology puts the neo-Darwinian synthesis in jeopardy is by the threatening to “uproot the Tree of Life (TOL)” [1]. Lateral gene transfer (LGT) is much more frequent than most biologists would have imagined up until about 20 years ago, so phylogenetic trees based on sequences of different prokaryotic genes are often different. How to tease out from such conflicting data something that might correspond to a single, universal Tree of Life becomes problematic. Moreover, since many important evolutionary transitions involve lineage fusions at one level or another, the aptness of a tree (a pattern of successive bifurcations) as a summary of life’s history is uncertain.

The paper then goes on to say this:

Students of animals and plants have long accepted that incomplete lineage sorting, introgression, and full-species hybridization pose difficulties for the sorts of trees that Darwin might have had us draw. But it is microbes, with their promiscuous willingness to exchange genes between widely separated branches of any “tree,” that have most seriously jeopardized the neo-Darwinian synthesis.

9. Disagreement about mutation rates undermines the reliability of phylogenetic estimates.

The output of a phylogenetic program may rely on some estimate regarding a rate of mutation. But there is great disagreement about the rate of mutation in the past. A scientist quoted in Nature News says this about the “DNA clock” used in phylogenetics:

“The fact that the clock is so uncertain is very problematic for us,” he says. “It means that the dates we get out of genetics are really quite embarrassingly bad and uncertain.”
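To see how much a date can swing with the assumed mutation rate, here is a minimal sketch of the simplest "strict clock" calculation, in which divergence time equals genetic distance divided by twice the mutation rate. The distance and the two rates below are hypothetical numbers picked only for illustration, and real programs use more elaborate models than this one-line formula.

```python
def divergence_time_years(distance_per_site, rate_per_site_per_year):
    """Strict-clock estimate: both lineages accumulate changes, hence the 2."""
    return distance_per_site / (2 * rate_per_site_per_year)

# Hypothetical inputs, not taken from any study.
distance = 0.012      # differences per DNA site between two species
fast_rate = 1.0e-9    # assumed mutations per site per year
slow_rate = 0.5e-9    # an assumed rate half as fast

print(round(divergence_time_years(distance, fast_rate)))
print(round(divergence_time_years(distance, slow_rate)))
```

Halving the assumed rate doubles the estimated date, so any disagreement about the rate passes straight through to the dates printed on the tree.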

10. Phylogenetic estimates based on microRNAs or fossils conflict with other phylogenetic estimates.

The quote below is from a 2012 article published in the mainstream publication Nature:

A molecular palaeobiologist at nearby Dartmouth College, Peterson has been reshaping phylogenetic trees for the past few years, ever since he pioneered a technique that uses short molecules called microRNAs to work out evolutionary branchings. He has now sketched out a radically different diagram for mammals: one that aligns humans more closely with elephants than with rodents. “I've looked at thousands of microRNA genes, and I can't find a single example that would support the traditional tree,” he says. The technique “just changes everything about our understanding of mammal evolution.”

The mainstream scientific paper "How reliable are human phylogenetic hypotheses?" gives a troubling answer to such a question. It tells us that "phylogenetic hypotheses regarding humans and their fossil relatives" have "never been subjected to external validation." When the authors tried to do such a validation, they found that "phylogenetic hypotheses based on the craniodental data were incompatible with the molecular phylogenies." This led them to conclude that "existing phylogenetic hypotheses about human evolution are unlikely to be reliable."

Below is a visual from a 2016 paper "A new view of the tree of life." In this paper this visual comes underneath a headline "A current view of the tree of life." You may notice that the strange shape has no actual resemblance to a tree, although it looks a little like the erupting sparkler sticks that I would use as a young boy on the Fourth of July.



No doubt computational phylogenetics will continue to be very popular. Although such analysis seems to add little to our knowledge, it's a nice easy way to make a living if you are an evolutionary biologist. Rather than having to do the messy and frustrating work of trying to dig up fossils, an evolutionary biologist can just comfortably sit in an office and crunch genome data. It's a lot easier than writing software, where there is typically the requirement that your computer work must actually achieve some useful innovation. A scientist specializing in phylogenetics can just grind out hypothetical “trees of life” or “ancestry trees” year after year, with very little disturbance from people objecting to his work or analyzing his methods. So if you are an evolutionary biologist making a living doing such comfortable work in a clean office, you will vigorously defend the value of what you are doing. The last thing you want is to have to go out in the mud and get your fingernails dirty.
