I joined a panoply of scholars who argue that it is necessary to justify your alpha (that is, the acceptable long-run proportion of false positive results in a specific statistical test) rather than redefine a threshold for statistical significance across all fields. We’d hoped to emphasize that more than just alpha needed justification, but space limitations (and the original article’s title) focused our response on that specific issue. We also didn’t have the space to include specific examples of how to justify alpha or other kinds of analytic decisions. Because people are really curious about this issue, I elaborate here my thought process about how one can justify alpha and other analytic decisions.
The whole notion of an alpha level doesn’t make sense in multiple statistical frameworks. For one, alpha is only meaningful in the frequentist school of probability, in which we consider how often a particular event would occur in the real world out of the total number of such possible events. In Bayesian statistics, the concern is typically adjusting the probability that a proposition of some sort is true, which is a radically different way of thinking about statistical analysis. Indeed, the Bayes factor represents how likely the obtained data are under an alternative hypothesis relative to how likely they are under a null hypothesis. There’s also no clear mapping of p values to Bayes factors, so many discussions about alpha aren’t relevant if that’s the statistical epistemology you use.
Another set of considerations comes from Sir Ronald Fisher, one of the original statisticians. In this view, there’s no error control to be concerned with; p values instead represent levels of evidence against a null hypothesis. Crossing a prespecified low p value (i.e., alpha) then entails rejection of a statistical null hypothesis. However, this particular point of view has fallen out of favor, and Fisher’s attempts to construct a system of fiducial inference also ended up being criticized. Finally, there are systems of statistical inference based on information theory and non-Bayesian likelihood estimation that do not entail error control in their epistemological positions.
I come from a clinical background and teach the psychodiagnostic assessment of adults. As a result, I come from a world in which I must use the information we gather to make a series of dichotomous decisions. Does this client have major depressive disorder or not? Is this client likely to benefit from exposure and response prevention interventions or not? This framework extends to my thinking about how to conduct scientific studies. Will I pursue further work on a specific measure or not? Will my research presume this effect is worth including in the experimental design or not? Thus, the error control epistemological framework (born of Neyman-Pearson reasoning about statistical testing) seems to be a good one for my purposes, and I’m reading more about it to verify that assumption. In particular, I’m interested in disambiguating this kind of statistical reasoning from the common practice of null hypothesis significance testing, which amalgamates Fisherian inferential philosophy and Neyman-Pearson operations.
I don’t argue that it’s the only relevant possible framework to evaluate the results of my research. That’s why I still report p values in text whenever possible (the sea of tabular asterisks can make precise p values difficult to report there) to allow a Fisherian kind of inference. Such p value reporting also allows those in the Neyman-Pearson tradition to use different decision thresholds in assessing my work. It should also be possible to use the n/df and values of my statistics to compute Bayes factors for the simpler statistics I report (e.g., correlations and t tests), though more complex inferences may be difficult to reverse engineer.
My lab has run four different kinds of studies, each of which has unique goals based on the population being sampled, the methods being used, and the practical importance of the questions being asked. A) One kind of study uses easy-to-access and convenient students or MTurk workers as a proxy for “people at large” for studies involving self-report assessments of various characteristics. B) Another study draws from a convenient student population to make inferences about basic emotional functioning through psychophysiological or behavioral measures. C) A third study draws from (relatively) easily sampled clinical populations in the community to bridge self-report and clinical symptom ratings, behavioral, and psychophysiological methods of assessment. D) The last study type comes from my work on the Route 91 shooting, in which the population sampled is time-sensitive, non-repeatable, and accessible only through electronic means. In each case, the sampling strategy from these populations entails constraints on generality that must be discussed when contextualizing the findings.
In study type A), I want to be relatively confident that the effects I observe are there. I’m also cognizant that measuring everything through self-report has attendant biases, so it’s possible processes like memory inaccuracies, self-representation goals, and either willful or unintentional ignorance may systematically bias results. Additionally, the relative ease of running a few hundred more students (or MTurk workers, if funding is available) makes running studies with high power to detect small effects a simpler proposition than in other study types, as once the study is programmed, there’s very little work needed to run participants through it. Indeed, they often just run themselves! In MTurk studies, it may take a few weeks to run 300 participants; if I run them in the lab, I can run about 400 participants in a year on a given study. Thus, I want to have high power to detect even small effects, I’ll use a lower alpha level to guard against spurious results born of mega-multiple comparisons, and I’m happy to collect large numbers of participants to reduce the errors surrounding any parameter estimates.
Study type B) deals with taking measurements across domains that may be less reliable and that do not suffer from the same single-method biases as self-report studies. I’m willing to achieve a lower evidential value for the sake of being able to say something about these studies. In part, I think for many psychophysiological measures, the field is learning how to do this work well, and we need to walk statistically before we can run. I also believe that to the extent these studies are genuinely cross-method, we may reduce some of the crud factor especially inherent in single-method studies and produce more robust findings. However, these data require two research assistants to spend an hour of their time applying sensors to participants’ bodies and then spend an extra two to three hours collecting data and removing those sensors, so the cost of acquiring new participants is higher than in study type A). For reasons not predictable from the outset, a certain percentage of participants will also not yield interpretable data. However, the participants are still relatively easy to come by, so replacement is less of an issue, and I can plan to run around 150 participants a year if the lab’s efforts are fully devoted to the study. In such studies, I’ll sacrifice some power and precision to run a medium number of participants with a higher alpha level. I also use a lot of within-subjects designs in these kinds of studies to maximize the power of any experimental manipulations, as they allow participants to serve as their own controls.
In study type C), I run into many of the same kinds of problems as study type B) except that participants are relatively hard to come by. They come from clinical groups that require substantial recruitment resources to characterize and bring in, and they’re relatively scarce compared to unselected undergraduates. For cost comparison purposes, it’s taken me two years to bring in 120 participants for such a study (with a $100+ session compensation rate to reflect that they’d be in the lab for four hours). Within-subjects designs are imperative here to keep power high, and I also hope that study type B) has shown us how to maximize our effect sizes such that I can power a study to detect medium-sized effects as opposed to small ones.
Study type D) entails running participants who cannot be replaced in any way, making good measurement imperative to increase precision to detect effects. Recruitment is also a tremendous challenge, and it’s impossible to know ahead of time how many people will end up participating in the study. Nevertheless, it’s still possible to specify desired effect sizes ahead of time to target along with the precision needed to achieve a particular statistical power. I was fortunate to get just over 200 people initially, though our first follow-up had the approximately 125 participants I was hoping to have throughout the study. I haven’t had the ability to bring such people into the lab so far, so it’s the efforts needed in recruitment and maintenance of the sample that represent substantial costs, not the time it takes research assistants to collect the data.
Many different effect sizes can be computed to address different kinds of questions, yet many researchers answer this question by defaulting to an effect size that either corresponds to a lay person’s intuitions about the size of the effect (Cohen, 1988) or a typical effect size in the literature. However, in my research domain, it’s often important to consider whether effect sizes make a practical difference in real-world applications. That doesn’t mean that effects must be whoppingly large to be worth studying; wee effects with cheap interventions that feature small side effect profiles are still well worth pursuing. Nevertheless, whether theoretical, empirical, or pragmatic, researchers should take care to justify a minimum effect size of interest, as this choice will guide the rest of the justification process.
In setting this minimum effect size of interest, researchers should also consider the reliability of the measures being used in a study. All things being equal, more reliable measures will be able to detect smaller effects given the same number of observations. However, savvy researchers should take into account the unreliability of their measures when detailing the smallest effect size of interest. For instance, a researcher may want to detect a correlation of .10 (which corresponds to an effect in which the two measures share 1% of their variance), and the two measures the researcher is using have internal consistencies of .80 and .70. Rearranging Lord and Novick’s (1968) correction formula, the actual smallest effect size of interest should be calculated as .10*√(.80*.70), or .10*√.56, or .10*.748, or .075.
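The attenuation arithmetic in that example can be sketched in a few lines of Python. The function name `attenuated_effect` and its parameters are illustrative assumptions rather than part of any library; the inputs are the .10 target correlation and the .80 and .70 internal consistencies from the example.

```python
import math

def attenuated_effect(target_r, reliability_x, reliability_y):
    """Correlation expected in observed scores when a true-score correlation
    of target_r is measured by two imperfectly reliable instruments."""
    return target_r * math.sqrt(reliability_x * reliability_y)

# Values from the worked example: target r = .10, reliabilities .80 and .70.
sesoi = attenuated_effect(0.10, 0.80, 0.70)
print(round(sesoi, 3))  # 0.075
```

Plugging in reliabilities of .80 for both measures gives .08, the planning value used for a presumed population correlation of .10.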
However, unreliability of measurement is not the only kind of uncertainty that might lead researchers to choose a smaller minimum effect size of interest to detect. Even if researchers consult previous studies for estimates of relevant effect sizes, publication bias and uncertainty around the size of an effect in the population throw additional complications into these considerations. Adjusting the expected effect size of interest in light of these issues may further aid in justifying an alpha.
In the absence of an effect that passes the statistical threshold in a well-powered study, it may be useful to examine whether the observed effect is instead inconsistent with the smallest effect size of interest. In this way, we can articulate whether it is reasonable to proceed as if even that effect is not present rather than defaulting to “retaining the null hypothesis”. This step is important for completing the error control process to ensure that some conclusions can be drawn from the study rather than leaving it out in an epistemological no-man’s land should results not pass the justified statistical threshold.
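One way to check whether a result is inconsistent with the smallest effect size of interest is an equivalence test built from two one-sided tests. Choosing that technique is an assumption here, not something the text above specifies. The sketch uses the Fisher z approximation for a correlation, and the function name and example numbers are hypothetical.

```python
import math
from statistics import NormalDist

def tost_correlation(r_obs, n, sesoi, alpha=0.05):
    """Two one-sided tests that an observed correlation lies inside
    (-sesoi, +sesoi), using the Fisher z approximation (assumed here)."""
    se = 1.0 / math.sqrt(n - 3)          # standard error of Fisher z
    z_obs = math.atanh(r_obs)
    z_bound = math.atanh(sesoi)
    norm = NormalDist()
    p_upper = norm.cdf((z_obs - z_bound) / se)        # H0: rho >= +sesoi
    p_lower = 1.0 - norm.cdf((z_obs + z_bound) / se)  # H0: rho <= -sesoi
    p = max(p_upper, p_lower)            # both one-sided tests must pass
    return p, p < alpha

# Hypothetical data: r = .01 with a smallest effect size of interest of .075.
print(tost_correlation(0.01, 3000, 0.075))  # large sample
print(tost_correlation(0.01, 100, 0.075))   # small sample
```

With the large sample, the observed correlation is inconsistent with effects as large as .075 in either direction; with the small sample, the same observed correlation is not, so no conclusion about the absence of the effect is warranted.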
Utility functions. Ideally, the field could compute some kind of utility function whose maximum value represents a balance among sample size from a given population, minimum effect size of interest, alpha, and power. This function could provide an objective alpha to use in any given situation. However, because each of these quantities has costs and benefits associated with them – and the relative costs and benefits will vary by study and investigator – such a function is unlikely to be computable. Thus, when justifying an alpha level, we need to resort to other kinds of arguments. This means that it’s unlikely all investigators will agree that a given justification is sufficient, but a careful layout of the rationale behind the reported alpha combined with detailed reporting of p values would allow other researchers to re-evaluate a set of findings to determine how they comport with those researchers’ own principles, costs, and benefits. I would also argue that there is no situation in which there are no costs, as other studies could always be run in place of one that’s chosen, participants could be allocated to other studies instead of the one proposed, and the time spent programming a study, reducing its data, and analyzing the results are all costs inherent in any study.
Possible justifications. One blog post summarizes traditionally, citationally, normatively principled, bridgingly principled, and objectively justified alphas. The traditional and cited justifications are similar; the cited version simply notes where the particular authority for the alpha level is (e.g., Fisher, 1925) instead of resting on a nebulous “everybody does it.” In this way, the paper about redefining statistical significance provides a one-stop citation for those looking to summarize that tradition and provides a list of authors who collectively represent that tradition.
However, that paper provides additional sets of justifications for a stricter alpha level that entail multiple possible inferential benefits derived from normative or bridged principles. In particular, the authors of that paper bridge frequentist and Bayesian statistical inferential principles to emphasize the added rigor an alpha of .005 would lend the field. They also note that normatively, other fields have adopted stricter levels for declaring findings “significant” or as “discoveries”, and that such a strict alpha level would reduce the false positive rate below that of the field currently while requiring less than double the number of participants to maintain the same level of power.
Bridging principles. One could theoretically justify a (two-tailed) alpha level of .05 on multiple grounds. For instance, humans tend to think in 5s and 10s, and a 5/100 cutoff seems intuitively and conveniently “rare”. I should also note here that I use the term “convenient” to denote something adopted as more than “arbitrary”, inasmuch as our 5×2 digit hands provide us a quick, shared grouping for counting across humans. I fully expect that species with non-5/10 digit groupings (or pseudopods instead of digits) might use different cutoffs, which would similarly shape their thinking about convenient cutoffs for their statistical epistemology.
Such a cutoff has also been bridged to values of the normal distribution, as a two-tailed alpha of .05 corresponds to values roughly two standard deviations (1.96, to be precise) from that distribution’s mean. Because many parametric frequentist statistical tests assume normally distributed errors, this bridge links the alpha level to a fundamental assumption of these kinds of statistical tests.
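The bridge between an alpha level and the normal distribution can be checked numerically. This is a minimal sketch using Python's standard library; `critical_z` is an illustrative helper name, not an established function.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed=True):
    """Distance from the mean of a standard normal distribution, in standard
    deviation units, that cuts off a total tail area of alpha."""
    tail = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail)

for a in (0.05, 0.005):
    print(a, round(critical_z(a), 2))
```

A two-tailed alpha of .05 lands at about 1.96 standard deviations, whereas a two-tailed alpha of .005 lands at about 2.81.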
Another bridging principle entails considering that lower alpha levels correspond to increasingly severe tests of theories. Thus, a researcher may prefer a lower alpha level if the theory is more well-developed, its logical links are clearer (from the core theoretical propositions to its auxiliary corollaries to its specific hypothetical propositions to the statistical hypotheses to test in any study), and its constructs are more precisely measured.
How to describe findings meeting or exceeding alpha? From a normative perspective, labeling findings as “statistically significant” has led to decades of misinterpretation of the practical importance of statistical tests (particularly in the absence of effect sizes). In our commentary, we encouraged abandoning that phrase, but we didn’t offer an alternative. I propose describing these results as “passing threshold” to reduce misinterpretative possibilities. This term is far less charged with…significance…and may help separate evaluation of the statistical hypotheses under test from larger practical or theoretical concerns.
Though justifying alpha is an important step, it’s just as important to justify your beta (which is the long-run proportion of false retentions of the null hypothesis). From a Neyman-Pearson perspective, the lower the beta, the more evidential value a null finding possesses. This is also why Neyman-Pearson reasoning is inductive rather than deductive: Null hypotheses have information value as opposed to being defaults that can only be refuted with deductive logic’s modus tollens tests. However, the lower the beta, the more observations are needed to make a given effect size pass the statistical threshold set by alpha. As shown above, one key to making minimum effect sizes of interest larger is measuring that effect more reliably. A second is maximizing the strength of any manipulation such that a larger minimum effect size would be interesting to a researcher.
Another angle on the question leading this section is: How precise would the estimate of the effect size need to be to make me comfortable with accepting the statistical hypothesis being tested rather than just retaining it in light of a test statistic that doesn’t pass the statistical threshold? Just because a test has a high power (on account of a large effect size) doesn’t mean that the estimate of that effect is precise. More observations are needed to make precise estimates of an effect – which also reduces the beta (and thus heightens the power) of a given statistical test.
Power curves visualize the tradeoffs among effect size, beta, and the number of observations. They can aid researchers in determining how feasible it is to have a null hypothesis with high evidential value versus being able to conduct the study in the first place. Some power curves start showing a non-linear relationship between observations and beta when beta is about .20 (or power is about .80), consistent with historical guidelines. However, other considerations may take precedence over the shape of a power curve. Traditional alpha (.05) and beta (.20) levels implicitly treat erroneously declaring that an effect passes threshold as four times worse than erroneously declaring that it does not. Some researchers might believe even higher ratios should be used. Alternatively, it may be more important for researchers to fix one error rate or another at a specific value and let the other vary as resources dictate. These values should be articulated in the justification.
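A rough sense of these tradeoffs can be had without plotting anything. The sketch below assumes the Fisher z approximation for the sample size needed to detect a correlation, so its numbers will differ slightly from exact power software; `n_for_correlation` and `error_cost_ratio` are illustrative names.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha, power, two_tailed=True):
    """Approximate observations needed to detect a correlation r
    (Fisher z approximation, assumed here for illustration)."""
    norm = NormalDist()
    tail = alpha / 2 if two_tailed else alpha
    z_alpha = norm.inv_cdf(1 - tail)
    z_beta = norm.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

def error_cost_ratio(alpha, beta):
    """How many times worse a false positive is implicitly treated
    than a false retention of the null."""
    return beta / alpha

# Points along a power curve for r = .10 at a two-tailed alpha of .05.
for power in (0.80, 0.90, 0.95):
    print(power, n_for_correlation(0.10, 0.05, power))

print(error_cost_ratio(0.05, 0.20))  # traditional alpha and beta
```

Moving from 80% to 95% power for the same effect raises the required observations by well over half, which is the kind of cost the justification should weigh against the value of an informative null result.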
Most studies do not conduct a single comparison. Indeed, many studies toss in a number of different variables and assess their relationships, mediation, and moderation among them. As a result, there are many more comparisons conducted than the chosen alpha and beta levels are designed to guard against! There are four broad methods to use when considering how multiple comparisons impact your stated alpha and beta levels.
Per-comparison error rate (PCER) does not adjust comparisons at all and simply accepts the risk of there likely being multiple spurious results. In this case, no adjustments to alpha or beta need to be made in determining how many observations are needed.
False discovery error rate (FDER) allows that there will be a certain proportion of false discoveries in any set of multiple tests; FDER corrections attempt to keep the rate of these false discoveries at the given alpha level. However, this comes at a cost of complexity for those trying to justify alpha and beta, as each comparison uses a different critical alpha level. One common method for controlling FDER adjusts alpha levels for each comparison in a relatively linear fashion, retaining null hypotheses starting from the highest p value to the last one in which p > [(step #)/(total # of comparisons)]*(justified alpha). The remaining comparisons are judged as passing threshold. So, which comparison’s alpha value should be used in justifying comparisons? This may require knowing on average how many comparisons would typically pass threshold within a given comparison set size to plan for a final alpha to justify. After that, the number of observations may need adjusting to maintain the desired beta.
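The step-wise rule described above matches the Benjamini-Hochberg procedure, which can be sketched as follows. The function name and the example p values are hypothetical, and the sketch is a plain-Python illustration rather than a vetted statistical routine.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which comparisons pass threshold
    under the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p value is at or below (rank / m) * alpha.
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    passes = [False] * m
    for rank, idx in enumerate(order, start=1):
        passes[idx] = rank <= cutoff_rank
    return passes

# Hypothetical p values from a family of four comparisons.
print(benjamini_hochberg([0.001, 0.012, 0.030, 0.200]))
# [True, True, True, False]
```

Note that the critical alpha for the smallest p value in a family of four is .05/4 = .0125, whereas the largest is compared against the full .05, which is why planning around a single alpha is awkward.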
Corrections to the family-wise error rate (FWER) seek to reduce the error rate across a set of comparisons by lowering alpha in a more dramatic way across comparisons than do corrections for FDER. One popular method for controlling FWER entails dividing the desired alpha level by the number of comparisons. If the smallest p value in that set is smaller than alpha/(# comparisons), then it passes threshold and the next smallest p value is compared to alpha/(# comparisons-1). Once the p value of a comparison is greater than that fraction, that comparison and the remaining comparisons are considered not to have passed threshold. This correction has the same problems of an ever-shifting alpha and beta as the FDER, so the same cautions apply.
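The sequential rule just described is the Holm-Bonferroni procedure. A minimal sketch follows, with a hypothetical function name and made-up p values.

```python
def holm(p_values, alpha=0.05):
    """Return a list of booleans marking which comparisons pass threshold
    under the Holm-Bonferroni step-down rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    passes = [False] * m
    for step, idx in enumerate(order):
        # The divisor shrinks by one with each comparison that passes.
        if p_values[idx] <= alpha / (m - step):
            passes[idx] = True
        else:
            break  # this comparison and all remaining ones fail
    return passes

# Same hypothetical family of four comparisons.
print(holm([0.001, 0.012, 0.030, 0.200]))
# [True, True, False, False]
```

On these p values, the third comparison (p = .030) fails against .05/2 = .025 here, even though it would pass under a false discovery correction, illustrating the stricter control.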
Per-family error rate (PFER) represents the most stringent error control of all. In this view, making multiple errors across families of comparisons is more damaging than making one error. Thus, tight control should be exercised over the possibility of making an error in any family of comparisons. The Bonferroni correction is one method of maintaining a PFER that is familiar to many researchers. In this case, alpha simply needs to be divided by the number of comparisons to be made, beta adjusted to maintain the appropriate power, and the appropriate number of observations collected.
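The Bonferroni arithmetic is simple enough to sketch directly. The helper names below are illustrative, and the p values are hypothetical.

```python
def bonferroni_threshold(family_alpha, n_comparisons):
    """Per-comparison alpha that keeps the per-family error rate at family_alpha."""
    return family_alpha / n_comparisons

def bonferroni_passes(p_values, family_alpha=0.05):
    """Mark which comparisons pass the single, fixed Bonferroni threshold."""
    threshold = bonferroni_threshold(family_alpha, len(p_values))
    return [p <= threshold for p in p_values]

# Seven comparisons at a per-family rate of .05.
print(round(bonferroni_threshold(0.05, 7), 3))  # 0.007
print(bonferroni_passes([0.001, 0.012, 0.030, 0.200]))
```

Unlike the step-wise corrections, every comparison faces the same critical alpha, so the alpha to justify and the observations needed to maintain beta can be fixed before data collection.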
Many researchers reduce alpha in the face of multiple comparisons to address the PCER without taking steps to address other kinds of error rates formally. Such ad hoc adjustments should at least also report how many tests would pass the statistical threshold by chance alone. Using FDER or FWER control techniques represents a balance between leniency and strictness in error control, though researchers should specify in advance whether false discovery or family-wise error control is more in line with their epistemological stance at a given time. Researchers may prefer to control the PFER when the number of comparisons is kept to a minimum through the use of a few critical, focal tests of a well-developed theory.
In FDER, FWER, and PFER control mechanisms, the notion of “family” must be justified. Is it all comparisons conducted in a study? Is it a set of exploratory factors that are considered separately from focal confirmatory comparisons? Does it group together conceptually similar measures (e.g., normal-range personality, abnormal personality, time-limited psychopathology, and well-being)? All of these and more may be reasonable families to use in lumping or splitting comparisons. However, to help researchers believe that these families were considered separable at the outset of a study, family membership decisions should be pre-registered.
Though the epistemological principles involved in justifying alphas and similar quantities run deeply, I don’t believe that a good alpha justification requires more than a paragraph. Ideally, I would like to see this paragraph placed at the start of a Method section in a journal article, as it sets the epistemological stage for everything that comes afterward. For each study type listed above, here are some possible paragraphs to justify a particular alpha, beta, and number of observations. I note that these are riffs off possible justifications; they do not necessarily represent the ways I elected to treat the same data detailed in the first sentences of each paragraph. To determine appropriate corrections for unreliability in measures when computing power estimates, I used Rosnow and Rosenthal’s (2003) conversions of effect sizes to correlations (rs).
Study type A): We planned to sample from a convenience population of undergraduates to provide precise estimates of two families of five effects each; we expected all of these effects to be relatively small. Because our measures in this study have historically been relatively reliable (i.e., internal consistencies > .80), we planned our study to detect a minimum correlation of .08, as that corresponds to a presumed population correlation of .10 (or proportion of variance shared of 1%) measured with instruments with at least 80% true score variance. We also recognized that our study was conducted entirely with self-report measures, making it possible that method variance would contaminate some of our findings. As a result, we adopted a critical one-tailed α level of .005. Because we believed that spuriously detecting an effect was ten times as undesirable as failing to detect an effect, we chose to run enough participants to ensure we had 95% power to detect a correlation of .08. We used the Holm-Bonferroni method to provide a family-wise error rate correction entailing a minimum α of .001 for the largest effect in the study in each of the two families. This required a sample size of 3486 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with this population (e.g., Molina, Pierce, Bergquist, & Benning, 2018), we anticipated that 2% of our sample would produce invalid personality profiles that would require replacement to avoid distorting the results (Benning & Freeman, 2017). Consequently, we targeted a sample size of 3862 participants to anticipate these replacements.
Study type B): We planned to sample from a convenience population of undergraduates to provide initial estimates of the extent to which four pleasant picture contents potentiated the postauricular reflex compared to neutral pictures. From a synthesis of the extant literature (Benning, 2018), we expected these effects to vary between ds of 0.2 and 0.5, with an average of 0.34. Because our measures in this study have historically been relatively unreliable (i.e., internal consistencies ~ .35; Aaron & Benning, 2016), we planned our study to detect a minimum mean difference of 0.20, as that corresponds to a presumed population mean difference of 0.34 measured with approximately 35% true score variance. We adopted a critical α level of .05 to keep false discoveries at a traditional level as we sought to improve the reliability and effect size of postauricular reflex potentiation. Following conventions that spuriously detecting an effect was four times as undesirable as failing to detect an effect, we chose enough participants to ensure we had 80% power to detect a d of 0.20. We used the Benjamini-Hochberg (1995) method to provide a false discovery error rate correction entailing a minimum one-tailed α of .0125 for the largest potentiation in the study. This required a sample size of 156 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with this population (e.g., Ait Oumeziane, Molina, & Benning, 2018), we anticipated that 15% of our sample would produce unscoreable data in at least one condition for various reasons and would require additional data to fill in those conditions. Thus, we targeted a sample size of 180 participants to accommodate these additional participants.
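The step from the analyzable sample size to the recruitment target can be sketched as below. The sketch assumes the target is the analyzable n inflated by the anticipated percentage of unusable data (rather than divided by the retention rate); `inflate_for_data_loss` is a hypothetical helper name.

```python
import math

def inflate_for_data_loss(n_analyzable, loss_percent):
    """Recruitment target after adding loss_percent extra participants,
    rounded up to a whole person. Integer percentages avoid floating-point
    rounding surprises."""
    return math.ceil(n_analyzable * (100 + loss_percent) / 100)

# 156 analyzable participants with 15% anticipated unscoreable data.
print(inflate_for_data_loss(156, 15))  # 180
```

Dividing by the retention rate instead (156 / .85) would give a slightly larger target, so the chosen convention is worth stating in the justification.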
Study type C): We planned to sample from our university’s community mental health clinic to examine how anhedonia manifests itself in depression across seven different measures that are modulated by emotional valence. However, because there were insufficient cases with major depressive disorder in that sampling population, we instead used two different advertisements on Craigslist to recruit local depressed and non-depressed participants who were likely to be drawn from the same population. We believed that a medium effect size for the Valence x Group interaction (i.e., an f of 0.25) represented an effect that would be clinically meaningful in this assessment context. Because our measures’ reliabilities vary widely (i.e., internal consistencies ~ .35-.75; Benning & Ait Oumeziane, 2017), we planned our study to detect a minimum f of 0.174, as that corresponds to a presumed population f of 0.25 measured with approximately 50% true score variance. We adopted a critical α level of .007 in evaluating the Valence x Group interactions, using a Bonferroni correction to maintain a per-family error rate of .05 across all seven measures. To balance the number of participants needed from this selected population with maintaining power to detect effects, we chose enough participants to ensure we had 80% power to detect an f of 0.174. This required a sample size of 280 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with these measures (e.g., Benning & Ait Oumeziane, 2017), we anticipated that 15% of our sample would produce unscoreable data in at least one condition for various reasons and would require additional data to fill in those conditions. Thus, we targeted a sample size of 322 participants to accommodate these additional participants.
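The attenuated targets in the study type B) and C) paragraphs can be approximately reproduced by converting each effect size to a correlation, attenuating it, and converting back. The sketch below assumes the common equal-group-size conversions (r = d/√(d² + 4) and, for two groups, f = d/2) and assumes a single reliability enters as its square root; these choices are inferred from the reported numbers rather than stated in the text, and all function names are illustrative.

```python
import math

def d_to_r(d):
    """Convert Cohen's d to r, assuming equal group sizes."""
    return d / math.sqrt(d * d + 4.0)

def r_to_d(r):
    """Convert r back to Cohen's d, assuming equal group sizes."""
    return 2.0 * r / math.sqrt(1.0 - r * r)

def attenuate_d(d, true_score_variance):
    """Expected observed d when only true_score_variance of the measure
    reflects true scores (attenuation applied on the r metric)."""
    r_observed = d_to_r(d) * math.sqrt(true_score_variance)
    return r_to_d(r_observed)

def attenuate_f(f, true_score_variance):
    """Same attenuation for Cohen's f in the two-group case (f = d / 2)."""
    return attenuate_d(2.0 * f, true_score_variance) / 2.0

print(round(attenuate_d(0.34, 0.35), 2))  # study type B): about 0.20
print(round(attenuate_f(0.25, 0.50), 3))  # study type C): about 0.174
```

Attenuating directly on the d or f metric gives slightly different values (for instance, .25*√.50 is about .177), so the conversion route matters when reporting how the minimum effect size was derived.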
Study type D): We planned to sample from the population of survivors of the Route 91 Harvest Festival shooting on October 1, 2017, and from the population of the greater Las Vegas valley area who learned of that shooting within 24 hours of it happening. Because we wanted to sample this population within a month after the incident to examine acute stress reactions – and recruit as many participants as possible – this study did not have an a priori number of participants or targeted power. To maximize the possibility of detecting effects in this unique population, we adopted a critical two-tailed α level of .05 with a per-comparison error rate, as we were uncertain about the possible signs of all effects. Among the 45 comparisons conducted this way, chance would predict approximately 2 to pass threshold. However, we believed the time-sensitive, unrepeatable nature of the sample justified using looser evidential thresholds to speak to the effects in the data.
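The count of comparisons expected to pass threshold by chance under a per-comparison error rate is a one-line calculation. The sketch also reports the probability of at least one chance pass, which additionally assumes the comparisons are independent and all nulls are true, an assumption the paragraph above does not make; the function names are illustrative.

```python
def expected_chance_passes(n_comparisons, alpha):
    """Expected number of comparisons passing threshold if every null is true."""
    return n_comparisons * alpha

def prob_at_least_one_pass(n_comparisons, alpha):
    """Chance of one or more spurious passes, ASSUMING independent comparisons."""
    return 1.0 - (1.0 - alpha) ** n_comparisons

# 45 comparisons at a two-tailed alpha of .05.
print(expected_chance_passes(45, 0.05))            # about 2.25
print(round(prob_at_least_one_pass(45, 0.05), 2))  # about 0.90
```

Reporting the expected count alongside the observed count of passing comparisons lets readers judge how far the results exceed what chance alone would produce.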