Why it always pays (95% C.I.) to think twice about your statistics


The northern hemisphere has just about reached its maximum tilt away from the sun, which means many academics will soon get a few days or weeks off to . . . revise statistics! Winter holidays are the perfect time to sit back, relax, take a fresh, introspective look at the research you may have been doing (and that which you haven’t) and catch up on all that work you were too distracted by work to do. It is a great time to think about the statistical methods in common use in your field and what they actually mean about the claims being made. Perhaps an unusual dedication to statistical rigour will help you become a stellar researcher, a beacon to others in your discipline. Perhaps it will just turn you into a vengefully cynical reviewer. At the least it should help you to make a fool of yourself ever-so-slightly less often.

First, test your humour (a description follows in case you prefer a mundane account to a hilarious webcomic): http://xkcd.com/882/

In the piece linked above, Randall Munroe highlights the low threshold for reporting significant results in much of science (particularly biomedical research) and specifically the way these uncertain results are over- and mis-reported in the lay press. The premise is that researchers perform experiments to determine whether jelly beans of 20 different colours have anything to do with acne. After setting their p-value threshold at 0.05, they find in one of the 20 experiments a statistically significant association between green jelly beans and acne. If I were a PI interviewing applicants for student and post-doc positions, I would consider their humour response to this webcomic a good first-hurdle metric.

In Munroe’s comic, the assumption is that jelly beans never have anything to do with acne and that 100% of the statistically significant results are due to chance. Assuming that all of the other results were also reported in the literature somewhere (although they are unlikely to be picked up by the sensationalist press), this would put the proportion of reported results that fail to reflect reality at an intuitive and moderately acceptable 0.05, or 5%.
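Here is a minimal simulation of Munroe’s setup (my own sketch, not part of the comic): 20 tests of a non-existent effect, each at a 0.05 threshold. Run many times, it shows both numbers at work: the chance that at least one colour comes up “significant” is roughly 64%, while the expected share of false positives among all 20 reported results stays at the 5% just described.

```python
# Illustrative sketch: Munroe's scenario, where no jelly bean colour actually
# causes acne and every test uses a p < 0.05 significance threshold.
import random

random.seed(0)
n_trials = 100_000   # simulated repeats of the whole 20-colour study
n_colours = 20
alpha = 0.05

at_least_one_hit = 0
total_false_positives = 0

for _ in range(n_trials):
    # Under the null hypothesis, each test comes up "significant" with probability alpha.
    hits = sum(random.random() < alpha for _ in range(n_colours))
    total_false_positives += hits
    at_least_one_hit += hits > 0

print(f"P(at least one 'significant' colour): {at_least_one_hit / n_trials:.2f}")                     # ~0.64
print(f"Share of false positives among all reports: {total_false_positives / (n_trials * n_colours):.3f}")  # ~0.05
```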
Let us instead consider a slightly more lab-relevant version:

Consider a situation where some jelly beans do have some relationship to the medical condition of interest, say 1 in 100 jelly bean variants are actually associated in some way with acne. Let us also swap small molecules for jelly beans, and cancer for acne, and use the same p-value threshold of 0.05. We are unlikely to report negative results where the small molecule has no relationship to the condition. We test 10000 different compounds for some change in a cancer phenotype in vitro.

Physicists may generally wait for 3-6 sigmas of significance before scheduling a press release, but for biologists publishing papers the typical p-value threshold is 0.05. If we use this threshold and go directly to press with the statistically significant results of the experiment, about 83.9% of our reported positive findings will be wrong. In the press, a 0.05 p-value will often be interpreted as “only 5% chance of being wrong.” That is certainly not what we see here, but after some thought the error rate is expected and fairly intuitive. Allow me to illustrate with numbers.

As specified in the conditions of the thought experiment, 1% of these compounds, or 100, have a real effect. Setting our p-value threshold at the widely accepted 0.05, we will also uncover, purely by chance, non-existent relationships between 495 of the compounds (0.05 × 9,900 compounds with no effect) and our cancer phenotype of interest. If we assume the experiment has 95% power, i.e. that the probability of missing a real effect equals the probability of detecting a fake one, we will pick up 95 of the 100 actual cases we are interested in. Our total positive results will be 495 + 95 = 590, but only 95 of those reflect a real association; 495/590, or about 83.9%, will be false positives.
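The same bookkeeping in a few lines of Python (a minimal sketch of the thought experiment above, with the 95% power assumption made explicit):

```python
# Sketch of the false discovery rate calculation described above.
n_compounds = 10_000
prevalence = 0.01   # 1 in 100 compounds has a real effect
alpha = 0.05        # p-value threshold
power = 0.95        # assumed probability of detecting a real effect

real_effect = n_compounds * prevalence       # 100 compounds with a real effect
no_effect = n_compounds - real_effect        # 9,900 compounds with no effect

false_positives = alpha * no_effect          # 495 spurious "hits"
true_positives = power * real_effect         # 95 genuine hits

total_positives = false_positives + true_positives         # 590 reported findings
false_discovery_rate = false_positives / total_positives

print(f"False discovery rate: {false_discovery_rate:.1%}")  # ~83.9%
```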

Such is the premise of a short and interesting write-up by David Colquhoun on false discovery rates [2]. The emphasis is on biological research because that is where the problem is most visible, but the considerations discussed should be of interest to anyone conducting research. On the other hand, let us remember that confidence due to technical replicates does not generally translate to confidence in a description of reality: the statistical confidence in the data behind the now-infamous faster-than-light neutrinos from the OPERA detector (http://arxiv.org/pdf/1109.4897v4.pdf) was very high, but the source of the anomaly turned out to be instrumentation, and two senior figures from the project eventually resigned after overzealous press coverage pushed the experiment into the limelight. Paul Blainey et al. discuss the importance of considering the effect of technical and biological (or, more generally, experimentally relevant) replicates in a recent Nature Methods commentary [3].

I hope the above illustrates my point that a conscientious awareness of the common pitfalls in one’s own field, as well as in those with which one closely interacts, is important both for slogging through the avalanche of results published every day and for producing brilliant work of one’s own. This requires continued effort in addition to an early general study of statistics, but I would suggest it is worth it. To quote [2]: “In order to avoid making a fool of yourself you need to know how often you are right when you declare a result to be significant, and how often you are wrong.”

Reading:

[1] Munroe, Randall. Significant. XKCD. http://xkcd.com/882/

[2] Colquhoun, David. An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science (2014) 1:140216. DOI: 10.1098/rsos.140216. http://rsos.royalsocietypublishing.org/content/1/3/140216

[3] Blainey, Paul, Krzywinski, Martin, Altman, Naomi. Points of Significance: Replication. Nature Methods (2014) 11(9):879-880. DOI: 10.1038/nmeth.3091. http://dx.doi.org/10.1038/nmeth.3091


Why is there no confidence in science journalism?

[Image: magazine cover design by Norman Saunders (public domain)]

In the so-called anthropocene, meaningful participation in humanity’s trajectory requires scientific literacy. This requirement holds at the population level: it is not enough for a small proportion of select individuals to develop this expertise and apply it only to the avenues of their own interest. Rather, a general understanding and use of the scientific method in forming actionable ideas for modern problems is a requisite for a public capable of steering policy along a survivable route. As an added benefit, scientific literacy produces a rarely avoided side effect: knowing one or two things for certain, and touching upon the numinous in the universe.

Statistical literacy is a necessary foundation for building scientific literacy. Confusion about the meaning of terms such as “statistical significance” (compounded by non-standard usage of the term “significance” on its own) is widespread, so little of the import of these concepts survives when scientific results are described in mainstream publications. What’s worse, this leaves a jaded public knowing just enough to twist the jargon of science to support their own predetermined, potentially dangerous, conclusions (e.g. reasoning that, because scientific theories can be refuted by contrary evidence, a given theory can be ignored when forming personal and policy decisions, no matter how well it is supported by existing data).

I posit that a fair amount of the responsibility for improving the state of non-specialist scientific literacy lies with science journalists at all scales. The most popular science-branded media does little to nothing to impart a sense of the scientific method, the context and contribution of published experiments, or the meaning of the statistics underlying the claims. I suggest that a standardisation of the language used to describe scientific results is warranted, so that results and concepts can be communicated intuitively, without resorting to condescension, while still conveying the quantitative, comparable values used to form scientific conclusions.

A good place to start (though certainly not perfect) is the uncertainty guidance put out by the Intergovernmental Panel on Climate Change (IPCC). The IPCC reports benefit from translating statistical concepts of confidence and likelihood into intuitive terms without (mostly) sacrificing the underlying quantitative meaning. In the guidance on addressing uncertainty for the IPCC AR5 report [pdf], likelihood statements of probability are standardised as follows:

[Table: IPCC AR5 likelihood scale]

In the fourth assessment report (AR4), the guidance [pdf] roughly calibrated confidence statements to a chance of being correct. I’ve written the guidance here in terms of p-values, roughly the chance of seeing the results by coincidence alone (p = 0.10 = 10% chance), but statistical tests producing other measures of confidence were also covered.

[Table: IPCC AR4 confidence scale]

Describing results in terms of confidence, rather than the statistical significance normally used, is probably more intuitive to most people. Few general readers readily distinguish between statistical significance, i.e. the results are unlikely to be due to chance, and meaningful significance, i.e. the results matter in some way. Moreover, statistical significance conventions are not very well established in the scientific literature and vary widely by field. That being said, the IPCC’s AR4 guidance threshold for very high confidence is quite low. Many scientific results are only considered reportable at a p-value of less than 0.05, i.e. a 5% chance that the observed effect in the data is due to coincidence, whereas the AR4 guidance links a statement of very high confidence to anything with less than a 10% chance of being wrong. Likewise, a 5-in-10 chance of being correct hardly merits a statement of medium confidence in my opinion. Despite these limitations, I think the guidance should merely have been updated to better reflect the statistical reality of confidence; it was a mistake for the AR5 guidance to switch to purely qualitative standards for conveying confidence based on the table below, with the highest confidence in the top right and the lowest in the bottom left.

[Table: IPCC AR5 qualitative confidence standards (evidence vs. agreement)]

Adoption (and adaptation) of standards like these in regular usage by journalists could do a lot to improve the communication of science to a general readership. It would normalise field-variable technical jargon (e.g. sigma significance values in particle physics, p-values in biology) and reduce the need for daft analogies. Results described in this way would be amenable to meaningful comparison by generally interested but non-specialist audiences, without dumbing down the meaning for those with a little practice in statistics.
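To make the proposal concrete, here is a minimal sketch of what such a shared vocabulary could look like when applied mechanically. The probability bands are my reading of the AR5 likelihood table above, so treat the exact cut-offs as illustrative rather than authoritative:

```python
# Sketch of a standardised likelihood vocabulary for reporting results.
# The bands below follow my reading of the IPCC AR5 likelihood scale;
# the exact cut-offs are illustrative, not an official mapping.
AR5_LIKELIHOOD_TERMS = [
    (0.99, "virtually certain"),
    (0.90, "very likely"),
    (0.66, "likely"),
    (0.33, "about as likely as not"),
    (0.10, "unlikely"),
    (0.01, "very unlikely"),
    (0.00, "exceptionally unlikely"),
]

def likelihood_phrase(probability: float) -> str:
    """Translate a probability into an IPCC-style likelihood statement."""
    for threshold, phrase in AR5_LIKELIHOOD_TERMS:
        if probability >= threshold:
            return phrase
    return "exceptionally unlikely"

# e.g. a finding judged to have a 93% chance of holding up:
print(likelihood_phrase(0.93))   # "very likely"
```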

Edited 2016/06/25 for a better title and to add the comic graphic. Source file for the cover design by Norman Saunders (public domain).
23 Aug. 2014: typo in first paragraph corrected:

. . . meaningful participation in participating in humanity’s trajectory. . .

References:

Michael D. Mastrandrea et al. Guidance Note for Lead Authors of the IPCC Fifth Assessment Report on Consistent Treatment of Uncertainties. IPCC Cross-Working Group Meeting on Consistent Treatment of Uncertainties. Jasper Ridge, CA, USA, 6-7 July 2010. http://www.ipcc.ch/pdf/supporting-material/uncertainty-guidance-note.pdf

IPCC. Guidance Notes for Lead Authors of the IPCC Fourth Assessment Report on Addressing Uncertainties. July 2005. https://www.ipcc-wg1.unibe.ch/publications/supportingmaterial/uncertainty-guidance-note.pdf