Once again, the term “statistical significance” in null hypothesis significance testing is under fire. The authors of the Nature commentary on the subject favor renaming “confidence intervals” as “compatibility intervals”. This term emphasizes that the intervals represent a range of values that would be reasonable to obtain under certain statistical assumptions rather than a statement about subjective beliefs (for which Bayesian statistics, with their credible intervals, are more appropriate). However, I think the term most in need of replacement goes back even further. In our “justify your alpha” paper, we recommended abandoning the term “statistically significant”, but we didn’t offer a replacement. It wasn’t for lack of trying, but we never came up with a better label for findings that pass a statistical threshold.
My previous blog post about how to justify an alpha level tried using threshold terminology, but it felt clunky. After mulling it over for more than a year, I think I finally have an answer:
Replace “significant” with “discernible”.
The first advantage is the conceptual shift from meaning to perception. Rather than imbuing a statistical finding with a higher “significance”, the new term moves the framing earlier in the processing stream. Perception entails more than mere sensation, which is all that a term like “detectable” might imply. “Discernible” implies the arrangement of information, similar to how inferential statistics can arrange data for researchers. Thus, statistics are more readily recognized as tools to peer into a study’s data rather than arbiters of the ultimate truth – or “significance” – of a set of findings. Scientists are thus encouraged to own the interpretation of the statistical test rather than letting the numbers arrogate “significance”.
The second advantage of this terminological shift is that the boundary between results falling on either side of the statistical threshold becomes appropriately ambiguous. No longer is some omniscient “significance” or “insignificance” conferred on them automatically. Rather, a set of questions immediately arises.
1) Is there really no effect there – at least, if the null hypothesis is the nil hypothesis of exactly 0 difference or relationship? That is highly unlikely, as unlikely as finding a true vacuum, but I suppose it might be possible. In this case, “no discernible effect” allows the possibility that our perception is insufficient to recognize it against a background void.
2) Is there an effect there, but it’s so small as to be ignorable given the smallest effect size we care about? This state of affairs is likely when tiny effects exist, like the relationship between intelligence and birth order. Larger samples or more precise measures might be required to find them – like having an electron microscope instead of a light microscope. However, with the current state of affairs, we as a research community are satisfied that the effects are small enough that we’re willing not to care about them. Well-powered studies should allow relatively definitive conclusions here. In this case, “no discernible effect” suggests the effect may be wee, but so wee that we are content not to consider it further.
3) Are our tests so imprecise that we couldn’t tell whether an effect is actually there or not? Perhaps a study has too few participants to detect even medium-sized effects. Perhaps its measurements are so internally inconsistent as to render them like unto a microscope lens smeared with gel. Either way, the study just may not have enough power to make a meaningful determination one way or the other. The increased smudginess of one set of data compared to another that helped inspire the Nature commentary might be more readily appreciated when described in “discernible” instead of “significant” terms. Indeed, “no discernible effect” helps keep in mind that our perceptual-statistical apparatus may have insufficient resolution to provide a solid answer. Conversely, a discernible finding might also be due to faulty equipment or other causes unrelated to any signal, irrespective of its apparent practical significance.
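As a rough sketch of how the second and third questions might be told apart in practice, the snippet below compares an interval around an effect estimate with both the nil value of 0 and a smallest effect size of interest. Everything in it is an illustrative assumption rather than part of the argument above: the function name `describe_result`, the normal approximation, the example numbers, and the shortcut of using the same interval for both checks (a formal equivalence test would be set up more carefully). The first question, whether the effect is exactly 0, cannot be settled from data alone, so the sketch does not attempt it.

```python
from statistics import NormalDist


def describe_result(estimate, std_error, sesoi, alpha=0.05):
    """Label an effect estimate in "discernible" terms (hypothetical helper).

    estimate  -- observed effect, e.g. a standardized mean difference
    std_error -- standard error of the estimate (normal approximation assumed)
    sesoi     -- smallest effect size of interest, as a positive number
    alpha     -- threshold chosen before looking at the data
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error

    if lower > 0 or upper < 0:
        # The interval excludes the nil value of 0.
        return "statistically discernible effect"
    if -sesoi < lower and upper < sesoi:
        # Question 2: whatever is there is smaller than we care about.
        return "no discernible effect; anything present is too small to care about"
    # Question 3: the interval is too wide to rule much in or out.
    return "no discernible effect; the study is too imprecise to say"


# Made-up numbers, purely for illustration.
print(describe_result(estimate=0.50, std_error=0.10, sesoi=0.20))
print(describe_result(estimate=0.02, std_error=0.05, sesoi=0.20))
print(describe_result(estimate=0.15, std_error=0.30, sesoi=0.20))
```

Note that the same phrase, “no discernible effect”, comes back in the last two cases with very different follow-ups, which is the ambiguity the term is meant to keep in view.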
These questions lead us to ponder whether our findings are reasonable or might instead simply be the phantoms of false positives (or other illusions). Indeed, I think “discernible” nudges us to think more deeply about why a finding did or didn’t cross the statistical threshold instead of simply accepting its “significance” or lack thereof.
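To make the “phantoms of false positives” concrete, here is a minimal simulation sketch. It assumes a simple two-sided z-test with a known standard deviation of 1 and a true effect of exactly 0; the function name `false_positive_rate`, the sample sizes, and the seed are all hypothetical choices for illustration. Under those assumptions, roughly a fraction alpha of studies should turn up a “discernible” result even though nothing is there.

```python
import random
from statistics import NormalDist, mean


def false_positive_rate(n_studies=2000, n_per_study=30, alpha=0.05, seed=1):
    """Share of simulated null studies that cross the alpha threshold."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_error = 1 / n_per_study ** 0.5  # known SD of 1, by assumption

    hits = 0
    for _ in range(n_studies):
        # Every observation is drawn from the null: true mean of 0.
        sample = [rng.gauss(0, 1) for _ in range(n_per_study)]
        z = mean(sample) / std_error
        if abs(z) > z_crit:
            hits += 1  # "discernible" despite there being no effect
    return hits / n_studies


print(false_positive_rate())
```

The printed proportion should land near the chosen alpha, which is a reminder that crossing the threshold describes what the apparatus registered and says nothing by itself about why.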
In any case, I hope that “statistically discernible” is a better term for what we mean when a result passes the alpha threshold in a null hypothesis test: the result is more extreme than any value we decided in advance we would accept as arising from the null hypothesis’s distribution. I hope it can lead to meaningful shifts in how we think about the results of these statistical tests. Then again, perhaps the field will just rail against NHDT in a decade. Assuming, of course, that it doesn’t just act as Regina to my Gretchen.