In an attempt to point out that not every article that makes it through peer review survives the scrutiny of the scientific community, New Yorker author Jonah Lehrer apparently goes a little further than he intended, and says so here. The Truth Wears Off begins with a number of examples in which the effects described in peer-reviewed articles don’t seem to be real, notably in medicine, the life sciences, and psychology. Lehrer gives some examples from physics, as well.
To some, it appears that the effect first seen declines over time. Examples:
• People shown a face and asked to describe it were later less able to recognize it (verbal overshadowing) in studies from two decades ago, but the effect shrank dramatically year after year.
• Anti-psychotic drugs tested in the 1990s appear to be less effective today. Note: the article does not examine whether the schizophrenics studied today are similar to those studied a decade ago in the severity and type of their symptoms and in any other treatments they may have received.
• In an ESP test from early last century, some subjects initially appeared to show paranormal ability, but further tests failed to substantiate this result.
• A purported correlation between female barn swallows and symmetry in their mates led to a number of studies finding similar results for swallows and other species, but the correlation has since disappeared. Michael Jennions found that a large number of results in ecology and evolutionary biology demonstrate this decline effect.
In an apparent misunderstanding of the process, Lehrer treats the failure to replicate “rigorously validated findings” as a problem with science itself. Most scientists would assume instead that the problem lies with the findings, and with the sloppiness that leads to a large number of poor results.
Lehrer then discusses a few problems in the article, but does not tease out the importance of each:
• Journals and scientists look for results that disagree with the orthodoxy. Scientists are less likely to submit null results to journals, and journals are less likely to print them. Once the orthodoxy changes (from symmetry is irrelevant to symmetry is important to female barn swallows), contrary results become interesting. Note: This is considered a real phenomenon, but Lehrer gives little idea as to whether this is a problem with 0.5% or 95% of articles submitted. For climate change skeptics: if results contrary to the scientific orthodoxy on climate change are submitted to peer review, they will get prominent play, provided they make it through peer review.
• The barn swallow studies were not double-blind studies in which different people measure feather length and assess behavior. When it came time to round up or down, errors crept into measurements that differed by millimeters. Similarly, published acupuncture results vary by country, in part because the person testing for the effect knows whether acupuncture has been used.
• A number of studies, such as those finding genetic effects on hypertension and schizophrenia, were so badly done that the results are meaningless. One review of 432 such results found the vast majority worthless. Note: This is considered an important problem in some fields of science, notably medicine, and also my field, education. See comments below for what those in the life sciences and medicine think. There appears to be little support for Lehrer’s including physics experiments in his article.
• Lehrer assumes that all the later-refuted results were analyzed statistically in an appropriate way. Note: Statisticians do not; see Andrew Gelman’s comment below.
Are there reasons that explain these results besides the one favored by many, that science is a crapshoot? The person who told me of this article certainly feels that way; he picks and chooses among scientific results, except when he knows scientists are wrong and so goes with other analysis.
Lehrer says, “We like to pretend that our experiments define the truth for us. But that’s often not the case. Just because an idea is true doesn’t mean it can be proved. And just because an idea can be proved doesn’t mean it’s true. When the experiments are done, we still have to choose what to believe.” Only science doesn’t prove so much as disprove, and what is left standing gains credibility.
Lehrer does not provide enough information or context for us to make sense of what he says. He repeats what everyone in science already knows: that research in some fields, and some peer review, is of lower quality, and that while a number of peer-reviewed results turn out to be uninteresting, this is much more often true in medicine and some of the life sciences. The one important point I got from the article, that results that no longer appear to be true are still used by some doctors, is lost in the noise.
Not mentioned is that people whose exposure to science comes primarily from articles on medicine see reason to doubt medical science, and many extrapolate to other fields of science. Those who prefer to doubt science will find justification in this article.
Comments from others
Jerry Coyne believes that his field, evolutionary biology, has a problem, in part because so few eyes look at each result.
I tend to agree with Lehrer about studies in my own field of evolutionary biology. Almost no findings are replicated, there’s a premium on publishing positive results, and, unlike some other areas, findings in evolutionary biology don’t necessarily build on each other: workers usually don’t have to repeat other people’s work as a basis for their own. (I’m speaking here mostly of experimental work, not things like studies of transitional fossils.) Ditto for ecology. Yet that doesn’t mean that everything is arbitrary. I’m pretty sure, for instance, that the reason why male interspecific hybrids in Drosophila are sterile while females aren’t (“Haldane’s rule”) reflects genes whose effects on hybrid sterility are recessive. That’s been demonstrated by several workers. And I’m even more sure that humans are more closely related to chimps than to orangutans. Nevertheless, when a single new finding appears, I often find myself wondering if it would stand up if somebody repeated the study, or did it in another species.
But let’s not throw out the baby with the bathwater. In many fields, especially physics, chemistry, and molecular biology, workers regularly repeat the results of others, since progress in their own work demands it. The material basis of heredity, for example, is DNA, a double helix whose sequence of nucleotide bases codes (in a triplet code) for proteins. We’re beginning to learn the intricate ways that genes are regulated in organisms. The material basis of heredity and development is not something we “choose” to believe: it’s something that’s been forced on us by repeated findings of many scientists. This is true for physics and chemistry as well, despite Lehrer’s suggestion that “the law of gravity hasn’t always been perfect at predicting real-world phenomena.”
Lehrer, like Gould in his book The Mismeasure of Man, has done a service by pointing out that scientists are humans after all, and that their drive for reputation—and other nonscientific issues—can affect what they produce or perceive as “truth.” But it’s a mistake to imply that all scientific truth is simply a choice among explanations that aren’t very well supported. We must remember that scientific “truth” means “the best provisional explanation, but one so compelling that you’d have to be a fool not to accept it.” Truth, then, while always provisional, is not necessarily evanescent. To the degree that Lehrer implies otherwise, his article is deeply damaging to science.
Note: most scientists in physics, chemistry, and molecular biology, so far as I know, agree.
David Gorski, an advocate of science-based medicine, says that people in medicine have been talking about a number of these issues for years; however, Lehrer goes too far in generalizing from poor medical studies to problems with science as a whole.
Jennions’ article was entitled Relationships fade with time: a meta-analysis of temporal trends in publication in ecology and evolution. Reading the article, I was actually struck by how relatively small, at least compared to the impression that Lehrer gave in his article, the decline effect in evolutionary biology was found to be in Jennions’ study. Basically, Jennions examined 44 peer-reviewed meta-analyses and analyzed the relationship between effect size and year of publication; the relationship between effect size and sample size; and the relationship between standardized effect size and sample size. To boil it all down, Jennions et al concluded, “On average, there was a small but significant decline in effect size with year of publication. For the original empirical studies there was also a significant decrease in effect size as sample size increased. However, the effect of year of publication remained even after we controlled for sampling effort.” They concluded that publication bias was the “most parsimonious” explanation for this declining effect.
Personally, I’m not sure why Jennions was so reluctant to talk about such things publicly. You’d think from his responses in Lehrer’s interview that scientists would be coming for him with pitchforks, hot tar, and feathers if he dared to point out that effect sizes reported by investigators in his scientific discipline exhibit small declines over the years due to publication bias and the bandwagon effect. Perhaps it’s because he’s not in medicine; after all, we’ve been speaking of such things publicly for a long time. Indeed, we generally expect that most initially promising results, even in randomized trials, will not ultimately pan out. In any case, those of us in medicine who might not have been willing to talk about such phenomena became more than willing after John Ioannidis published his provocatively titled article Why Most Published Research Findings Are False around the time of his study Contradicted and Initially Stronger Effects in Highly Cited Clinical Research. Physicians and scientists are generally aware of the shortcomings of the biomedical literature. Most, but sadly not all of us, know that early findings that haven’t been replicated yet should be viewed with extreme skepticism and that we can become more confident in results the more they are replicated and built upon, particularly if multiple lines of evidence (basic science, clinical trials, epidemiology) all converge on the same answer. The public, on the other hand, tends not to understand this.
Gorski also discusses the effect of subject popularity on calculations of error rates. Commenters look at the challenges Lehrer presents from physical science, and do not support his conclusions.
It’s always good to run your results by someone who is very good at statistics. Andrew Gelman, statistician, says,
The short story is that if you screen for statistical significance when estimating small effects, you will necessarily overestimate the magnitudes of effects, sometimes by a huge amount. I know that Dave Krantz has thought about this issue for awhile; it came up when Francis Tuerlinckx and I wrote our paper on Type S errors, ten years ago.
My current thinking is that most (almost all?) research studies of the sort described by Lehrer should be accompanied by retrospective power analyses, or informative Bayesian inferences. Either of these approaches–whether classical or Bayesian, the key is that they incorporate real prior information, just as is done in a classical prospective power analysis–would, I think, moderate the tendency to overestimate the magnitude of effects.
Note: I don’t understand statistics, or Gelman’s solutions, but I learned early on that poor statistics is the downfall of many a conjecture.
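Gelman’s first point, that screening for statistical significance inflates small effects, can be illustrated with a short simulation. This is only a sketch under made-up assumptions, not anything taken from Gelman’s paper: the true effect (0.1), the study size (25 subjects), the number of studies, and all function names are hypothetical. Each simulated study estimates the same small effect, and only estimates more than 1.96 standard errors from zero count as significant. Since the standard error here is 0.2, every surviving estimate is larger than 0.392 in size, roughly four times the true effect, and some of them have the wrong sign (Gelman’s Type S error).

```python
import random
import statistics


def simulate_studies(true_effect=0.1, n=25, studies=4000, seed=1):
    """Run many small studies of the same small true effect.

    Hypothetical setup: each study averages n outcomes drawn from
    Normal(true_effect, 1), so the standard error of its estimate is
    1 / sqrt(n). Sigma is treated as known to keep the sketch short.
    Returns (all estimates, estimates that passed the significance screen).
    """
    rng = random.Random(seed)
    standard_error = 1.0 / n ** 0.5
    cutoff = 1.96 * standard_error  # two-sided test at the 5% level
    estimates = [
        statistics.fmean(rng.gauss(true_effect, 1.0) for _ in range(n))
        for _ in range(studies)
    ]
    significant = [e for e in estimates if abs(e) > cutoff]
    return estimates, significant


def exaggeration_ratio(significant, true_effect):
    """Average size of the significant estimates relative to the truth."""
    return statistics.fmean(abs(e) for e in significant) / true_effect


def wrong_sign_share(significant, true_effect):
    """Share of significant estimates whose sign is wrong (a Type S error)."""
    wrong = [e for e in significant if e * true_effect < 0]
    return len(wrong) / len(significant)


if __name__ == "__main__":
    estimates, significant = simulate_studies()
    print("mean of all estimates:", round(statistics.fmean(estimates), 3))
    print("share reaching significance:", round(len(significant) / len(estimates), 3))
    print("exaggeration ratio:", round(exaggeration_ratio(significant, 0.1), 2))
    print("wrong-sign share:", round(wrong_sign_share(significant, 0.1), 3))
```

Averaged over all the simulated studies, the estimate sits close to the true effect; the exaggeration appears only in the subset that passes the screen. If that subset is what gets published first, later and larger studies will look like a decline.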
PZ Myers, biologist, says,
Early in any scientific career, one should learn a couple of general rules: science is never about absolute certainty, and the absence of black & white binary results is not evidence against it; you don’t get to choose what you want to believe, but instead only accept provisionally a result; and when you’ve got a positive result, the proper response is not to claim that you’ve proved something, but instead to focus more tightly, scrutinize more strictly, and test, test, test ever more deeply.
Steven Novella, neurologist, discusses how the naive, the skeptical (scientists mostly fit in this category), and the deniers see science, then says,
Lehrer is ultimately referring to aspects of science that skeptics have been pointing out for years (as a way of discerning science from pseudoscience), but Lehrer takes it to the nihilistic conclusion that it is difficult to prove anything, and that ultimately “we still have to choose what to believe.” Bollocks!
John Horgan sees this as the decline of illusion. He is not a big fan of truthiness.
Lehrer’s reference to physics was checked by Charles Petit. He quotes Lawrence Krauss,
“The physics references are (deposit scatological bovine expletive here) … the neutron data have fallen, reflecting under-estimation of errors, but the lower lifetime doesn’t change anything having to do with the model of the neutron, which is well understood and robust … And as for discrepancies with gravity, the deep borehole stuff is interesting but highly suspect. Moreover, all theories conflict with some experiments, because not all experiments are right.” / LMK
The January 5 NY Times has an article, Journal’s Paper on ESP Expected to Prompt Outrage.
It repeats a number of the points made by people cited above:
Claims that defy almost every law of science are by definition extraordinary and thus require extraordinary evidence. Neglecting to take this into account — as conventional social science analyses do — makes many findings look far more significant than they really are, these experts say. …
Peer review is usually an anonymous process, with authors and reviewers unknown to one another. But all four reviewers of this paper were social psychologists, and all would have known whose work they were checking and would have been responsive to the way it was reasoned.
Perhaps more important, none were topflight statisticians. “The problem was that this paper was treated like any other,” said an editor at the journal, Laura King, a psychologist at the University of Missouri. “And it wasn’t.”
Many statisticians say that conventional social-science techniques for analyzing data make an assumption that is disingenuous and ultimately self-deceiving: that researchers know nothing about the probability of the so-called null hypothesis.
In this case, the null hypothesis would be that ESP does not exist. Refusing to give that hypothesis weight makes no sense, these experts say; if ESP exists, why aren’t people getting rich by reliably predicting the movement of the stock market or the outcome of football games?
Instead, these statisticians prefer a technique called Bayesian analysis, which seeks to determine whether the outcome of a particular experiment “changes the odds that a hypothesis is true,” in the words of Jeffrey N. Rouder, a psychologist at the University of Missouri who, with Richard D. Morey of the University of Groningen in the Netherlands, has also submitted a critique of Dr. Bem’s paper to the journal.
Attempts to replicate the results have failed.
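The Bayesian argument in the Times excerpt comes down to one line of arithmetic: the odds after an experiment equal the odds before it multiplied by how much more likely the data are if the hypothesis is true than if the null hypothesis is true (the Bayes factor). The sketch below is only an illustration. The function name, the priors, and the Bayes factor of 20 are made up, and none of the numbers come from Dr. Bem’s paper or from the Rouder and Morey critique.

```python
def posterior_probability(prior_probability, bayes_factor):
    """Update the probability of a hypothesis after seeing data.

    bayes_factor = P(data | hypothesis) / P(data | null hypothesis).
    Posterior odds = prior odds * Bayes factor.
    """
    if not 0.0 < prior_probability < 1.0:
        raise ValueError("prior_probability must be strictly between 0 and 1")
    if bayes_factor < 0.0:
        raise ValueError("bayes_factor cannot be negative")
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Same hypothetical evidence (data 20 times likelier under the
    # hypothesis than under the null), two different starting points.
    for label, prior in [("even odds", 0.5), ("one in a million", 1e-6)]:
        print(label, "->", posterior_probability(prior, 20.0))
```

Starting from even odds, evidence of that strength raises the probability to 20/21, about 95 percent. Starting from one in a million, which is closer to how these statisticians would treat a claim that defies almost every law of science, the same evidence leaves the probability at roughly 2 in 100,000. That is the sense in which extraordinary claims require extraordinary evidence.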