The northern hemisphere has just about reached its maximum tilt away from the sun, which means many academics will soon get a few days or weeks off to . . . revise statistics! Winter holidays are the perfect time to sit back, relax, take a fresh, introspective look at the research you may have been doing (and that which you haven’t) and catch up on all that work you were too distracted by work to do. It is a great time to think about the statistical methods in common use in your field and what they actually imply about the claims being made. Perhaps an unusual dedication to statistical rigour will help you become a stellar researcher, a beacon to others in your discipline. Perhaps it will just turn you into a vengefully cynical reviewer. At the least it should help you to make a fool of yourself ever-so-slightly less often.

First test your humour (description follows in case you prefer a mundane account to a hilarious webcomic): http://xkcd.com/882/

In the piece linked above, Randall Munroe highlights the low threshold for reporting significant results in much of science (particularly biomedical research) and specifically the way these uncertain results are over- and mis-reported in the lay press. The premise is that researchers perform experiments to determine whether jelly beans of 20 different colours have anything to do with acne. After setting their p-value threshold at 0.05, they find in one of the 20 experiments that there is a statistically significant association between green jelly beans and acne. I would consider the humour response to this webcomic a good first-hurdle metric if I were a PI interviewing applicants for student or post-doc positions.

In Munroe’s comic, the assumption is that jelly beans never have anything to do with acne and that 100% of the statistically significant results are due to chance. Assuming that all of the other results were also reported in the literature somewhere (although not likely to be picked up by the sensationalist press), this would give the proportion of reported results that fail to reflect reality at an intuitive and moderately acceptable 0.05, or 5%.
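The arithmetic behind the comic can be sketched in a few lines of Python. This is only an illustration, and it assumes the 20 tests are independent, which the comic does not state; the variable names are my own. With no real effects anywhere, we expect one false positive among the 20 tests on average, and the chance of seeing at least one is roughly 64%.

```python
# Jelly bean scenario: 20 tests, no real effect in any of them.
# Assumption (not from the comic): the 20 tests are independent.
n_tests = 20
alpha = 0.05  # p-value threshold

# Average number of tests that come up "significant" purely by chance.
expected_false_positives = n_tests * alpha

# Probability that at least one of the 20 tests is a false positive.
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"Expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one false positive): {p_at_least_one:.3f}")
```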

Let us instead consider a slightly more lab-relevant version:

Consider a situation where some jelly beans do have some relationship to the medical condition of interest: say 1 in 100 jelly bean variants is actually associated in some way with acne. Let us also swap small molecules for jelly beans and cancer for acne, and use the same p-value threshold of 0.05. We test 10,000 different compounds for some change in a cancer phenotype in vitro, and we are unlikely to report negative results where the small molecule has no relationship to the condition.

Physicists may generally wait for 3-6 sigma of significance before scheduling a press release, but for biologists publishing papers the typical p-value threshold is 0.05. If we use this threshold, perform our experiment, and go directly to press with the statistically significant results, 83.9% of our reported positive findings will be wrong. In the press, a 0.05 p-value will often be interpreted as “only 5% chance of being wrong.” This is certainly not what we see here, but after some thought the error rate is expected and fairly intuitive. Allow me to illustrate with numbers.

As expected from the conditions of the thought experiment, 1% of the 10,000 compounds, or 100, have a real effect. Setting our p-value threshold at the widely accepted 0.05, we will also uncover, purely by chance, non-existent relationships between 495 of the compounds (0.05 * 9900 with no effect) and our cancer phenotype of interest. If we assume that the probability of failing to detect a real effect due to chance is the same as that of detecting a fake one (0.05, so that we detect 95% of real effects), we will pick up 95 of the 100 actual cases we are interested in. Our total positive results will be 495 + 95 = 590, but only 95 of those reflect a real association. 495/590, or about 83.9%, will be false positives.
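For anyone who wants to play with the numbers, here is a minimal Python sketch of the same calculation. The function and parameter names are hypothetical, chosen for illustration only, and the 0.95 detection rate is the assumption made in the thought experiment rather than a property of any real screen.

```python
def screen_outcomes(n_compounds, prevalence, alpha, power):
    """Expected counts for a screen, given the thought experiment's inputs.

    prevalence: fraction of compounds with a real effect
    alpha:      p-value threshold (false positive rate among true nulls)
    power:      fraction of real effects detected (assumed, not measured)
    """
    n_real = n_compounds * prevalence   # compounds with a real effect
    n_null = n_compounds - n_real       # compounds with no effect
    false_positives = alpha * n_null    # nulls flagged purely by chance
    true_positives = power * n_real     # real effects we detect
    total_positives = false_positives + true_positives
    return {
        "false_positives": false_positives,
        "true_positives": true_positives,
        "total_positives": total_positives,
        "false_discovery_rate": false_positives / total_positives,
    }


# The numbers used in the text: 10,000 compounds, 1% real, alpha = 0.05.
result = screen_outcomes(n_compounds=10000, prevalence=0.01,
                         alpha=0.05, power=0.95)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Changing `prevalence` or `power` shows how quickly the proportion of false positives moves: rarer real effects or a less sensitive assay make matters worse, even though the p-value threshold stays at 0.05.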

Such is the premise of a short and interesting write-up by David Colquhoun on false discovery rates [2]. The emphasis is on biological research because that is where the problem is most visible, but the considerations discussed should be of interest to anyone conducting research. On the other hand, let us remember that confidence due to technical replicates does not generally translate to confidence in a description of reality. For example, the statistical confidence in the data from the now-infamous faster-than-light neutrinos from the OPERA detector (http://arxiv.org/pdf/1109.4897v4.pdf) was very high, but the source of the anomaly was instrumentation, and two top figures from the project eventually resigned after overzealous press coverage pushed the experiment into the limelight. Paul Blainey et al. discuss the importance of considering the effects of technical and biological (or more generally, experimentally relevant) replicates in a recent Nature Methods commentary [3].

I hope the above illustrates my thought that a conscientious awareness of the common pitfalls in one’s own field, as well as in those fields with which one closely interacts, is important for slogging through the avalanche of results published every day and for producing brilliant work of one’s own. This requires continued effort in addition to an early general study of statistics, but I would suggest it is worth it. To quote [2]: “In order to avoid making a fool of yourself you need to know how often you are right when you declare a result to be significant, and how often you are wrong.”

Reading:

[1] Munroe, Randall. Significant. XKCD. http://xkcd.com/882/

[2] Colquhoun, David. An investigation of the false discovery rate and the misinterpretation of p-values. DOI: 10.1098/rsos.140216. Royal Society Open Science. Published 19 November 2014. http://rsos.royalsocietypublishing.org/content/1/3/140216

[3] Blainey, Paul, Krzywinski, Martin, Altman, Naomi. Points of Significance: Replication. Nat Meth (2014) 11.9 879-880. http://dx.doi.org/10.1038/nmeth.3091
