Our future, our universe, and other weighty topics


Monday, May 21, 2018

Memory Experimenters Have Giant Claims but Low Statistical Power

Last week the BBC reported a science experiment with the headline “'Memory transplant' achieved in snails.” This was all over the science news on May 14. Scientific American reported it with a headline stating “Memory transferred between snails,” and other sites such as the New York Times site made similar matter-of-fact announcements of a discovery. But you need not think very hard to realize that there's something very fishy about such a story. How could someone possibly get decent evidence about a memory in a snail?

To explain why this story and similar stories do not tell us anything reliable about memory, we should consider the issue of small sample sizes in neuroscience studies. The issue was discussed in a paper in the journal Nature, entitled “Power failure: why small sample size undermines the reliability of neuroscience.” The paper tells us that neuroscience studies tend to be unreliable because they use sample sizes that are too small. When the sample size is too small, the chance is too high that the effect reported by a study is just a false alarm.
An article on this important Nature article states the following:


The group discovered that neuroscience as a field is tremendously underpowered, meaning that most experiments are too small to be likely to find the subtle effects being looked for and the effects that are found are far more likely to be false positives than previously thought. It is likely that many theories that were previously thought to be robust might be far weaker than previously imagined.

I can give a simple example illustrating the problem. Imagine you try to test extrasensory perception (ESP) using a few trials with your friends. You ask them to guess whether you are thinking of a man or a woman. Suppose you try only 10 trials with each friend, and the best result is that one friend guessed correctly 70% of the time. This would be very unconvincing as evidence of anything. There's about a 17 percent chance of getting 7 or more correct guesses on any such test, purely by chance; and if you test with five people, there are roughly 3 chances in 5 that at least one of them will make 7 or more correct guesses, purely by chance. So having one friend get 7 out of 10 guesses correct is no real evidence of anything. But if you used a much larger sample size it would be a different situation. For example, if you tried 1000 trials with a friend, and your friend guessed correctly 700 times, that would have a probability of less than 1 in a million. That would be much better evidence.
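The chances in a guessing test like this can be checked with a short binomial calculation. The following is a minimal sketch, assuming each guess is an independent 50/50 coin flip; the function name `prob_at_least` is made up for illustration.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Chance of k or more correct guesses in n independent trials,
    each with probability p of being correct by luck alone."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# One friend, 10 trials, 7 or more correct by pure chance.
one_friend = prob_at_least(7, 10)

# Five friends tested: chance that at least one reaches 7 or more.
any_of_five = 1 - (1 - one_friend) ** 5

# One friend, 1000 trials, 700 or more correct by pure chance.
big_test = prob_at_least(700, 1000)

print(f"one friend, 7+ of 10:      {one_friend:.2f}")
print(f"best of five friends:      {any_of_five:.2f}")
print(f"700+ of 1000 under 1e-6?   {big_test < 1e-6}")
```

The second figure shows why picking the best result out of several small tests is so misleading: the more people you test, the more likely it is that someone gets a lucky streak.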

Now, the problem with many a neuroscience study is that very small sample sizes are being used. Such studies fail to provide convincing evidence for anything. The snail memory test is an example.

The study involved giving shocks to some snails, extracting RNA from their tiny brains, and then injecting that RNA into other snails that had not been shocked. It was reported that such snails had a higher chance of withdrawing into their shells, as if they were afraid and remembered being shocked when they had not been. But it might have been that such snails were merely acting randomly, not experiencing any fear memory transferred from the first set of snails. How can you have confidence that mere chance was not involved? You would have to do many trials or use a sample size that guarantees that sufficient trials will occur. This paper states that in order to have moderate confidence in results, reaching what is called a statistical power of .8, there should be at least 15 animals in each group. This statistical power of .8 is a standard for doing good science.

But judging from the snail paper, the scientists did not do a large number of trials. The effect described seems to have involved only 7 snails (the number listed on lines 571-572 of the paper). There is no mention of trying the test more than once on such snails. Such a result is completely unimpressive, and could easily have been achieved by pure chance, without any real “memory transfer” going on. Whether the snail does or does not withdraw into its shell is like a coin flip. It could easily be that by pure chance you might see some number of “into the shell withdrawals” that you interpret as “memory transfer.”

Whether a snail is withdrawing into its shell requires a subjective judgment, and scientists eager to see one result might let their bias influence their judgments about whether the snail withdrew into its shell or not. Also, a snail might withdraw into its shell simply because it has been injected with something, not because it is remembering something. Given such factors and the large chance of a false alarm when dealing with such a small sample size, this “snail memory transfer” experiment offers no compelling evidence for anything like memory transfer. We may also note that the idea that RNA stores long-term memories in animals is entirely implausible, because of RNA's very short lifetime. According to this source, RNA molecules typically last only about two minutes, with 10 to 20 percent lasting between 5 and 10 minutes. And according to this source, if you were to inject RNA into a bloodstream, the RNA molecules would be too large to pass through cell membranes.

The Tonegawa memory research lab at MIT periodically puts out sensational-sounding press releases on its animal experiments with memory. Among the headlines on its site are the following:
  • “Neuroscientists identify two neuron populations that encode happy or fearful memories.”
  • “Scientists identify neurons devoted to social memory.”
  • “Lost memories can be found.”
  • “Researchers find 'lost' memories”
  • “Neuroscientists reverse memories' emotional associations.”
  • “How we recall the past.”
  • “Neuroscientists identify brain circuit necessary for memory formation.”
  • “Neuroscientists plant false memories in the brain.”
  • “Researchers show that memories reside in specific brain cells.”
But when we take a close look at the issue of sample size and statistical power, and the actual experiments that underlie these claims, it seems that few or none of these claims are based on solid, convincing experimental evidence. Although the experiments underlying these claims are very fancy and high-tech, the experimental results seem to involve sample sizes so small that very little of them qualifies as convincing evidence.

A typical experiment goes like this: (1) Some rodents are given electrical shocks; (2) the scientists try to figure out where in the rodent's brain the memory was; (3) the scientists then use an optogenetic switch to “light up” neurons in a similar part of another rodent's brain, one that was not fear trained; (4) a judgment is made on whether the rodent froze when such a thing was done.

Such experiments have the same problems I mentioned above with the snail experiment: the problem of subjective interpretations and alternate explanations. The MIT memory experiments typically involve a judgment of whether a mouse froze. But that may often be a hard judgment to make, particularly in borderline cases. Also, we have no way of telling whether a mouse is freezing because it is remembering something. It could be that the optogenetic zap that the mouse gets is itself sufficient to cause the mouse to freeze, regardless of whether it remembers something. If you're walking along, and someone shoots light or energy into your brain, you might stop merely because of the novel stimulus. A science paper says that it is possible to induce freezing in rodents by stimulating a wide variety of regions. It says, “It is possible to induce freezing by activating a variety of brain areas and projections, including the hippocampus (Liu et al., 2012), lateral, basal and central amygdala (Ciocchi et al., 2010; Johansen et al., 2010; Gore et al., 2015a), periaqueductal gray (Tovote et al., 2016), motor and primary sensory cortices (Kass et al., 2013), prefrontal projections (Rajasethupathy et al., 2015) and retrosplenial cortex (Cowansage et al., 2014).”

But the main problem with such MIT memory experiments is that they involve very small sample sizes, so small that all of the results could easily have happened purely because of chance. Let's look at some sample sizes, remembering that according to this scientific paper, there should be at least 15 animals in each group to have moderate confidence in your results, sufficient to reach the standard of a “statistical power of .8.”

Let's start with their paper, “Memory retrieval by activating engram cells in mouse models of early Alzheimer’s disease,” which can be accessed from the link above after clicking underneath “Lost memories can be found.” The paper states that “No statistical methods were used to predetermine sample size.” That means the authors did not do what they were supposed to have done to make sure their sample size was large enough. When we look at page 8 of the paper, we find that the sample sizes used were merely 8 mice in one group and 9 mice in another group. On page 2 and again on page 4 we read about groups with only 4 mice each. Such a paltry sample size does not result in any decent statistical power, and the results cannot be trusted, since they very easily could be false alarms. The study therefore provides no convincing evidence of engram cells.

Another example is this paper by the MIT memory lab, with the grandiose title “Creating a False Memory in the Hippocampus.” When we look at Figure 2 and Figure 3, we see that the sample sizes used were paltry: the different groups of mice had only about 8 or 9 mice per group. Such a paltry sample size does not result in any decent statistical power, and the results cannot be trusted, since they very easily could be false alarms. No convincing evidence has been provided of creating a false memory.

A third example is this paper with the grandiose title “Optogenetic stimulation of a hippocampal engram activates fear memory recall.” Figure 2 tells us that in one of the groups of mice there were only 5 mice, and that in another group there were only 3 mice. Figure 3 tells us that in two other groups of mice there were only 12 mice. Figure 4 tells us that in some other group there were only 5 mice. Such a paltry sample size does not result in any decent statistical power, and the results cannot be trusted, since they very easily could be false alarms. No convincing evidence has been provided of artificially activating a fear memory by the use of optogenetics.

Another example is this paper entitled “Silent memory engrams as the basis for retrograde amnesia.” Figure 1 tells us that the number of mice in particular groups used for the study ranged between 4 and 12. Figures 2 and 3 tell us that the number of mice in particular groups used for the study ranged between 3 and 12. Such a paltry sample size does not result in any decent statistical power, and the results cannot be trusted, since they very easily could be false alarms. Another unsound paper is the 2015 paper “Engram Cells Retain Memory Under Retrograde Amnesia,” co-authored by Tonegawa. When we look at Figure S13 at the end of the supplemental material, we find that the experimenters used only 8 mice in one study group and 7 in another. Such a paltry sample size does not result in any decent statistical power, and the results cannot be trusted, since they very easily could be false alarms.

We see the same “low statistical power” problem in this paper claiming an important experimental result regarding memory. The paper states in its Figure 2 that only 6 mice were used for a study group, and 6 mice for the control group. The same problem is shown in Figure 3 and Figure 4 of the paper. We see the same “low statistical power” problem in this paper entitled “Selective Erasure of a Fear Memory.” The paper states in its Figure 3 that only 6 to 9 mice were used for a study group. That's only about half of the “15 animals per study group” needed for a modestly reliable result. The same defect is found in this memory research paper and in this memory research paper. A 2019 paper on memory research here has the same defect.

The term “engram” means a cell or cells that store memories. Decades after the term was created, we still have no convincing evidence for the existence of engram cells. But memory researchers are shameless in using the term “engram” matter-of-factly even though no convincing evidence of an engram has been produced. So, for example, one of the MIT Lab papers may again and again refer to some cells they are studying as “engram cells,” as if they could convince us that such cells are actually engram cells by telling us again and again that they are engram cells. Doing this is rather like some ghost researcher matter-of-factly using the term “ghost blob” to refer to particular patches of infrared light that he is studying after using an infrared camera. Just as a blob of infrared light tells us only that some patch of air was slightly colder (not that such a blob is a ghost), a scientist observing a mouse freezing is merely entitled to say he saw a mouse freezing (not that the mouse is recalling a fear memory); and a scientist seeing a snail withdrawing into its shell is merely entitled to tell us that he saw a snail withdrawing into its shell (not that the snail was recalling some fear memory).

The relation between the chance of a false alarm and the statistical power of a study is clarified in this paper by R. M. Christley. The paper has an illuminating graph, which I present below with some new captions that are a little clearer than the original captions. We see from this graph that if a study has a statistical power of only about .2, then the chance of the study giving a false alarm is something like 1 in 3 if there is a 50% chance of the effect existing, and much higher (such as 50% or greater) if there is less than a 50% chance of the effect existing. But if a study has a statistical power of about .8, then the chance of the study giving a false alarm is only about 1 in 20 if there is a 50% chance of the effect existing, and much higher if there is less than a 50% chance of the effect existing. Animal studies using far fewer than 15 animals per study group (such as those I have discussed) will result in the relatively high chance of false alarms shown in the green line.

[Graph from the Christley paper: chance of a false positive at different levels of statistical power]
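The general shape of this relation can be reproduced with a few lines of arithmetic. The following is a minimal sketch of the usual formula for the share of "positive" findings that are false alarms. The significance threshold `alpha` of .05 is an assumption made here for illustration, not necessarily the setting behind the graph, so the printed numbers need not match the graph exactly. The function name `false_alarm_chance` is also made up for this example.

```python
def false_alarm_chance(power, prior, alpha=0.05):
    """Share of statistically significant results that are false alarms.

    power: chance the study detects the effect if the effect is real
    prior: chance, before the study, that the effect is real
    alpha: assumed significance threshold (chance of a 'hit' when
           there is no effect at all)
    """
    false_hits = alpha * (1 - prior)   # no real effect, but 'significant'
    true_hits = power * prior          # real effect, correctly detected
    return false_hits / (false_hits + true_hits)

for power in (0.2, 0.5, 0.8):
    for prior in (0.5, 0.1):
        chance = false_alarm_chance(power, prior)
        print(f"power={power}, prior={prior}: false alarm chance {chance:.2f}")
```

Under these assumptions, both lower power and a lower prior chance of the effect existing push the false alarm chance up, which is the pattern described above.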

The PLOS paper here analyzed 410 experiments involving fear conditioning with rodents, a large fraction of them memory experiments. The paper found that such experiments had a “mean normalized effect size” of only .29. An experiment with an effect size of only .29 is very weak, with a high chance of a false alarm. Effect size is discussed in detail here, where we learn that with an effect size of only .3, there's typically something like a 40 percent chance of a false alarm.


To determine whether a sample size is large enough, a scientific paper is supposed to do something called a sample size calculation. The PLOS paper here reported that only one of the 410 memory-related neuroscience papers it studied had such a calculation. The PLOS paper reported that in order to achieve a moderately convincing statistical power of .80, an experiment typically needs to have 15 animals per group; but only 12% of the experiments had that many animals per group. Referring to statistical power (the probability that a study will detect an effect that really exists), the PLOS paper states, “no correlation was observed between textual descriptions of results and power.” In plain English, that means that there's a whole lot of BS flying around when scientists describe their memory experiments, and that countless cases of very weak evidence have been described by scientists as if they were strong evidence.

Our science media shows very little sign of paying any attention to the statistical power of neuroscience research, partially because rigor is unprofitable. A site can make more money by trumpeting borderline, weakly suggestive research as if it were a demonstration of truth, because the more users click on a sensational-sounding headline, the more money the site makes from ads. Our neuroscientists show little sign of paying much attention to whether their studies have a decent statistical power. For the neuroscientist, it's all about publishing as many papers as possible, so it's a better career move to do 5 underpowered small-sample studies (each with a high chance of a false alarm) than a single study with an adequate sample size and high statistical power.

Postscript: In my original post I used an assumption that 15 research animals per study group are needed for a moderately persuasive result. It seems that this assumption may have been too generous. In her post “Why Most Published Neuroscience Findings Are False,” Kelly Zalocusky PhD calculates (using Ioannidis’s data) that the median effect size of neuroscience studies is about .51. She then states the following, talking about statistical power:

To get a power of 0.2, with an effect size of 0.51, the sample size needs to be 12 per group. This fits well with my intuition of sample sizes in (behavioral) neuroscience, and might actually be a little generous. To bump our power up to 0.5, we would need an n of 31 per group. A power of 0.8 would require 60 per group.

If we describe a power of .5 as being moderately convincing, it therefore seems that 31 animals per study group are needed for a neuroscience study to be moderately convincing. But most experimental neuroscience studies involving rodents and memory use fewer than 15 animals per study group.
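A rough version of this kind of sample size calculation can be sketched as follows. This is not Zalocusky's own method. It assumes a simple comparison of two equal-sized groups, a two-sided significance threshold of .05, and a normal approximation, so it gives figures in the same ballpark as the quoted ones rather than identical ones. The function name `animals_per_group` is made up for illustration.

```python
from math import ceil
from statistics import NormalDist

def animals_per_group(effect_size, power, alpha=0.05):
    """Approximate animals needed in each of two groups.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d)^2,
    where d is the standardized effect size. A calculation based on the
    t distribution would give somewhat larger numbers for small groups.
    """
    z = NormalDist().inv_cdf           # quantile function of the standard normal
    z_alpha = z(1 - alpha / 2)         # two-sided significance threshold
    z_power = z(power)                 # negative when power is below .5
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for target_power in (0.2, 0.5, 0.8):
    n = animals_per_group(0.51, target_power)
    print(f"power {target_power}: about {n} animals per group")
```

Even with this cruder approximation, the required group size for a power of .8 at an effect size of .51 comes out at around 60 animals, far more than the group sizes discussed above.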

Zalocusky states the following:

If our intuitions about our research are true, fellow graduate students, then fully 70% of published positive findings are “false positives”.This result furthermore assumes no bias, perfect use of statistics, and a complete lack of “many groups” effect. (The “many groups” effect means that many groups might work on the same question. 19 out of 20 find nothing, and the 1 “lucky” group that finds something actually publishes). Meaning—this estimate is likely to be hugely optimistic.

Post-Postscript: The latest example of a memory experiment failing to actually prove anything (because of its too-small sample size) is a study in Nature that has been hyped with headlines such as “Artificial memory created.” The study has the inaccurate title, “Memory formation in the absence of experience.” The study fails to prove any such thing occurred. When we look at the number of animals involved, we repeatedly find that the study fails to meet the minimum standard of 15 animals per study group. In Figure 1 we learn that two of the study groups consisted of only 8 mice. In Figure 2 we learn that two of the study groups consisted of only 10 mice. In Figure 3 we learn that one of the study groups consisted of only 7 mice. Moreover, the methodology used in the study is so convoluted that it fails to provide clear and convincing evidence for anything interesting. The only evidence of memory recall is that the mice supposedly avoided some area, something that might have occurred for any number of reasons other than a recall of some memory. A robust test of an artificial memory would test an actual acquired skill, such as the ability to navigate a maze in a certain time.
