Reliability and replication in science

This fall after my sabbatical I will be teaching Graduate Statistics. Thus I have spent some time this year looking at the state of research methodology in science, particularly in psychology. I mentioned in my last column that sometimes scientific discoveries are retracted because further work or analysis shows simpler, alternate explanations for the findings. It is also sometimes the case that when other scientists try to repeat – that is, replicate – the original experiment, it does not come out the same. Replicability is one of the main pillars on which science rests; evidence must, at least in principle, be such that anyone with the right skills and equipment could repeat the same finding again. If something cannot be demonstrated repeatedly, then the original finding is considered suspect.

In August last year a group of researchers named the Open Science Collaboration published an article in Science, reporting on their attempts to replicate (as exactly as possible) 100 published peer-reviewed experiments in cognitive and social psychology. They found that a relatively large percentage of the experiments did not replicate – depending on how you measured it, over 60 percent. Needless to say, this finding attracted a lot of attention and concern. While this study is probably the most complete attempt to replicate a large body of findings in one discipline, similar failures to replicate have been found in other areas of science, like cancer research and drug testing. These failures have raised questions about how we do science.

Part of the problem is that the statistical tools we use in science do not give absolute answers but rather probability estimates. Probability estimates sometimes lead to wrong conclusions, called false positives. Very simplistically, we often say that if our findings would occur by chance less than five percent of the time, then something other than chance is going on, that is, a real effect. But by this standard, when there is in fact no real effect, we will still declare one five percent of the time. On the other hand, if there is a real effect and we try to replicate it, sometimes we will fail to detect it, which is called a false negative.
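The five-percent logic can be illustrated with a small simulation. This is a hypothetical sketch, not part of the column or the Collaboration's study: the coin-flip design, sample sizes and variable names are my own. Each simulated experiment flips a fair coin 100 times and tests whether the heads rate differs from one half. Since the coin really is fair, every "significant" result here is a false positive, and the overall rate should land near the five-percent mark.

```python
import math
import random

random.seed(1)

N_EXPERIMENTS = 10_000  # how many simulated experiments to run
FLIPS = 100             # coin flips per experiment

false_positives = 0
for _ in range(N_EXPERIMENTS):
    # A fair coin: the null hypothesis of "no effect" is true by construction.
    heads = sum(random.random() < 0.5 for _ in range(FLIPS))
    # Two-sided z-test of the heads proportion against 0.5.
    z = (heads - FLIPS * 0.5) / math.sqrt(FLIPS * 0.25)
    if abs(z) > 1.96:  # the conventional five-percent threshold
        false_positives += 1

rate = false_positives / N_EXPERIMENTS
print(f"False-positive rate: {rate:.3f}")
```

Because chance alone is operating, roughly one experiment in twenty crosses the significance threshold anyway, which is exactly the trade-off the five-percent convention accepts.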

Behind the disparity

What is unsettling about the recent attempts to replicate is that the number of unrepeatable studies is much higher than would be expected if there were only false positives and negatives. This high number has led to collective soul searching among scientists as to why there are so many failures to replicate. Part of the problem may be summed up in the old phrase “publish or perish”: if you do not successfully advance science, measured by scientific publications, your career prospects in science are reduced. Without publications you are less likely to be promoted, receive tenure or be awarded research grants.

This pressure to publish has led to people abusing statistical processes to find results that are “real” and “significant” (unlikely to occur by chance) and thus publishable. There is the stopping-rule problem: if results are not significant after a first analysis, scientists can keep collecting more data until the results do become significant – and then stop. There is the data-trolling problem: data can often be analyzed in several ways, and only the analyses that produce significant results are reported. A third, “file drawer” problem is that an experiment that does not produce significant results is often never published at all. Replications are also deemed less creative than new findings, so they are less likely to be attempted. This list of problems demonstrates that scientists are human and, like all humans, have failings; we all fall short and have to work in a world filled with brokenness that ultimately only Christ can heal.
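The stopping-rule problem in particular is easy to demonstrate with a simulation. This is again a hypothetical sketch of my own (the coin-flip setup, peeking schedule and names are assumptions, not anything from the studies discussed here): one simulated researcher tests a fair coin once, after all 200 flips are in, while another peeks at the data every 10 flips and stops as soon as the test looks significant. Both are studying pure chance, yet the peeking strategy declares far more than five percent of its experiments "significant."

```python
import math
import random

random.seed(42)

def significant(heads, n):
    """Two-sided z-test of a coin's heads rate against 0.5 at the 5% level."""
    z = (heads - n * 0.5) / math.sqrt(n * 0.25)
    return abs(z) > 1.96

N_EXPERIMENTS = 5_000
MAX_FLIPS = 200

fixed_hits = 0    # honest design: one test, after 200 flips
peeking_hits = 0  # stopping rule: test every 10 flips, stop at the first "hit"

for _ in range(N_EXPERIMENTS):
    # A fair coin, so any "significant" result is a false positive.
    flips = [random.random() < 0.5 for _ in range(MAX_FLIPS)]

    # Honest design: analyze once, after all the data are collected.
    fixed_hits += significant(sum(flips), MAX_FLIPS)

    # Stopping rule: peek after every 10 flips, stop as soon as it "works".
    heads = 0
    for n, flip in enumerate(flips, start=1):
        heads += flip
        if n % 10 == 0 and significant(heads, n):
            peeking_hits += 1
            break

fixed_rate = fixed_hits / N_EXPERIMENTS
peeking_rate = peeking_hits / N_EXPERIMENTS
print(f"Fixed-sample false-positive rate:     {fixed_rate:.3f}")
print(f"Optional-stopping false-positive rate: {peeking_rate:.3f}")
```

The honest design errs at roughly the advertised five-percent rate, while repeated peeking gives chance twenty opportunities to cross the threshold, multiplying the false-positive rate several times over. That is why continuing to collect data "until it becomes significant" quietly breaks the statistics.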

Scientists are becoming aware of these statistical problems. One response is to look for corroborating evidence. While complete replications of studies may not happen routinely, parts of earlier experiments may be included in subsequent work and should produce results consistent with the earlier work. For example, if I find that male and female Laurier University students differ in their love of chocolate, one would expect similar differences at other universities and among non-university students. Further, the chocolate difference should lead to reliable differences in behaviour. These partial replications or extensions of the original finding would either confirm its general importance and correctness or suggest that it is a Laurier-specific finding or a false positive. Thus one check on research is that individual findings should probably be accepted only if they can be fit into the larger fabric of scientific evidence.
