
Author Archives: sbenning

“Confabulating” instead of “hallucinating” in ChatGPT and generative AI errors?

Generative artificial intelligence, or generative AI, is a technology that creates material from an existing corpus of material based on a user's prompts. For texts, it uses a large language model that mimics a writer by combining words that statistically co-occur across a range of texts in the corpus. However, these models are prone to generating statistically probable but empirically false statements – and even to inventing references out of whole cloth to "support" these claims. These falsities are often called hallucinations to highlight the bizarre quality of conjuring non-existent research study results and references, or legal cases and citations.

These incorrect text productions are better described as confabulations. Hallucination refers to a sensory experience not based in reality. In contrast, confabulation refers to forming or recalling false memories without an intent to deceive or an awareness of the falsity of the information. Confabulations can be provoked by questions asking for specific information (like a prompt a user might give to a generative AI) or can arise spontaneously during the production of non-directed speech (as might happen when a generative AI calls on words that plausibly go together but are not factual when assembled during an answer or elaboration). Confabulations can be sparse or well-elaborated; the latter are more consistent with the confabulations generative AI creates. Confabulations are also distinct from amnesia, which represents a loss of memory without an attempt to fill in its gaps, and from delusions, which are false beliefs about the external world.

Confabulations are also considered ill-grounded or unjustified memories, which summarizes nicely how generative AI uses statistical relationships among words instead of facts to guide its productions. A variety of clinical conditions that damage normal memory functioning feature confabulation, ranging from harmful alcohol use to neurocognitive disorders. Likewise, a variety of large language models would be prone to generating confabulations, as their function is based on probabilistic relationships among sets of words, not semantic or episodic declarative memories. None of these errors results from a misperception of the user's prompt, as the label hallucination would imply.

In short, generative AI's errors of false text generation should be considered confabulation, not hallucination. Researchers and practitioners of generative AI should update the terminology accordingly, both to describe the phenomenon correctly and to avoid stigmatizing those who experience hallucinations.

The “Big Lie” is a degenerate research program

Scientists are concerned with the progress of research programs. A research program that doesn't make progress in understanding the phenomena it studies wastes valuable professional time and resources that could be better spent improving the human condition. However, most research programs that don't progress are unlikely to actively harm society. Instead, they wither away, swirling down a drain of decreasing significance and impact.

Not so the “Big Lie“.

Imre Lakatos talked about research programs being structured around a theoretical core surrounded by auxiliary hypotheses that can be tested more formally. The theoretical core of this research program is that there was substantial fraud in the 2020 election cycle that worked to the benefit of Democrats and the detriment of Republicans. Auxiliary hypotheses swirl around this core regarding how this fraud may be detected and counteracted. If these auxiliary hypotheses survive tests of the theory, then the theoretical core holds.

However, if those hypotheses are falsified through tests (as in a boss fight), then the research program faces two challenges. The negative heuristic holds that the theoretical core itself cannot be attacked, so falsified auxiliary hypotheses must be regenerated to preserve that core. The positive heuristic states that the research program must come up with theoretically consistent methods of generating new hypotheses to be tested should some of the auxiliary belt fail, so that the theoretical core can be preserved.

I will detail the auxiliary hypotheses of the Big Lie as I understand them and then give the results of their tests.

Court Challenges Will Reveal Improprieties in Election Law Changes during the Pandemic

The Trump campaign, allied Republican entities, and sympathetic attorneys (including 17 state attorneys general) filed over 60 lawsuits challenging the propriety of various electoral counts and changes to voting laws that states made in light of the pandemic. By one count, 62 out of 64 of these cases were dismissed or found in favor of the defendants, indicating that the vast majority of the cases testing this auxiliary hypothesis found it false. The two successful cases involved denying the counting of Pennsylvania ballots with missing IDs after election day and declaring the Republican candidate for New York's 22nd Congressional district the eventual winner by 109 votes.

Thus, these two cases seem more like rogue comets passing by the theoretical core than auxiliary hypotheses in tight orbit around it. Furthermore, some of the cases that were brought were so frivolous that they drew sanctions against the attorneys who filed them or caused an attorney to be fired. Though Mike Lindell has been filing suits against voting machine manufacturers, those manufacturers are countersuing for defamation and seeking sanctions against the filing attorneys.

Audits of Electoral Counts Will Reveal Discrepant Vote Totals

The notion that recounts and audits of vote counts would show massive fraud is prominent across the Big Lie, presumably reflecting vote-dumping shenanigans, vote switching, or other corruption of electronic databases. Secretaries of State in Georgia and Texas have found no meaningful discrepancies between published and audited vote totals. Audits in Wisconsin (by Republican leaders), Michigan (by Senate Republicans), and Arizona (by a pro-Trump firm with no previous experience) likewise found no fraud, even though the auditing agencies had a strong Republican allegiance bias. It is unclear whether Pennsylvania's Republican-led Senate will release the results of its audit, though its Secretary of State has already audited the vote totals and found no meaningful discrepancies with the published totals. There is no evidence in favor of this auxiliary hypothesis, and several tests have falsified it.

Voter Fraud Prosecutions Will Show Rampant Illegal Voting by Democratic Voters

The notion that large swathes of Democratic voters cast illegal ballots is another tenet of the Big Lie. However, in the six most contested states, there is little evidence for this proposition. Indeed, Republicans in Florida (at least 4 cases), Pennsylvania (at least 3 cases), and Nevada (at least 1 case) have been accused of or convicted of voting crimes. Again, the evidence is against the Big Lie's auxiliary hypotheses – this time, in the opposite direction.

Objectors to Electoral Count Insisted on Having Their Own Votes Audited

On January 6, six Republican Senators and 121 Republican Representatives objected to the electoral count in Arizona. Thus, it would be reasonable to expect that at least the objecting Republican Representatives from Arizona would likewise entertain doubt about their own electoral victories, which appeared on the same ballots. Alas, no such protests have issued from Representatives Biggs, Gosar, or Lesko.

Conclusions

All four sets of auxiliary hypotheses have been falsified in multiple ways, sometimes in the direction opposite predictions. Any attempts to regenerate those auxiliary hypotheses through the positive heuristic have also failed.

The Big Lie has been laid bare. It fails to generate hypotheses with connection to data in the world at large. Its support does not derive from empirically supported facts.

The Big Lie is a degenerate research program.

With
real-world
consequences.

“Social spacing”, not “social distancing”: Deepening connections while staying safe

As the SARS-CoV-2 virus spreads across the globe to create the COVID-19 pandemic, people have been asked to stay out of public spaces and reduce interpersonal contact to slow the transmission of the virus. This process has the unfortunate name of "social distancing", which has connotations of removing oneself socially and emotionally, as well as physically, from the public sphere. Before modern communication technologies existed, those might have been unfortunate side effects of such a containment strategy. However, with all the methods available to us to stay connected across the large gaps between us, I propose we call this effort:

Social spacing.

In this way, we emphasize that it's only the physical space between people we seek to increase, not the interpersonal bonds we share. "Social spacing" entails a simple geographic remove from other people. It also invites people to become creative in using technological means to bridge the space between us.

How to stay faraway, so close

When using technology to stay connected, prioritize keeping deeper, meaningful connections with people. Use Skype or other video messaging to see as well as hear from people important to you. Talk to people on the phone to maintain a vocal connection. Use your favorite social media site’s individual messenger to keep a dialog going with someone. Have individual or group texts for select audiences of messages.

In these deep, close, personalized connections, it’s OK to share your anxieties and fears. Validating that other people are concerned or even scared can help them feel like they are grounded in reality. However, beyond simple validation, use these deep connections to plan out what to do, to take concrete actions to live the lives you want. To the extent possible, share hobbies or other pursuits together if you’re shut off from work or other personal strivings for success.

  • Move book clubs from living rooms or coffee shops to speaker phone calls or group Zoom sessions.
  • Find online or app versions of bridge, board games, roleplaying adventures, or other fun things you might do together in person – or find new things to do online.
  • Set an appointment with some friends to watch a show or movie on TV or streaming media. You can then have a group chat afterward to share your reactions. Then again, maybe you all would enjoy keeping the line open as you’re watching to comment along in real time!
  • Curate playlists on Spotify or other music sites to share with your friends to express your current mood or provide some uplift to each other.
  • Form a creative group so that you write novels, paint pictures, or pursue other artistic endeavors together.
  • Broaden your palate and share culinary adventures through social cooking sites or online cooking classes. Swap your favorite recipes, or have conversations as you cook to help debug them.
  • Exercise your body and share online workout videos or apps with your friends. Setting up a friendly competition might even prolong your activity gains after the pandemic passes.
  • Learn skills through shared courseware from Khan Academy, LinkedIn Learning, or other sources.

Prioritize these kinds of personalized connections over broad social media posts. That way, you have more influence over your audience – and you can ask for a deeper kind of support than a feed or timeline might provide. If you use social media, use language laden with nouns and verbs to minimize emotional contagion. Share information from trusted, reputable sources as close to the relevant data as possible. I recommend the University of Minnesota's COVID-19 CIDRAP site for a transnational perspective.

When too many connections and too frequent news become too much

You might find that the firehose of information overwhelms you at times. If you find yourself getting more anxious when you watch the news or browse social media, that’s a good sign that you’d benefit from a break. As a first step, you might disable notifications on your phone from news or social media apps so that you can control when you search for information rather than having it pushed to you. Other possibilities include:

  • Employing the muting options on Twitter, snoozing posts or posters on Facebook, or filtering words on Instagram.
  • Using a timer, an app, or a browser extension to limit the time you can spend on specific social media sites.
  • Turning off all your devices for a few hours to really unplug for a while.

Through these methods, you can give yourself space to recharge and stay connected even as you’re socially spaced.

Statistically “discernible” instead of “significant”?

Not even in New Vegas.

Once again, the term "statistical significance" in null hypothesis significance testing is under fire. The authors of that commentary favor renaming "confidence intervals" as "compatibility intervals". This term emphasizes that the intervals represent a range of values that would be reasonable to obtain under certain statistical assumptions rather than a statement about subjective beliefs (for which Bayesian statistics, with their credible intervals, are more appropriate). However, I think the term that most needs replacement goes back even further. In our "justify your alpha" paper, we recommended abandoning the term "statistically significant", but we didn't give a replacement for that term. It wasn't for lack of trying; we simply never came up with a better label for findings that pass a statistical threshold.

My previous blog post about how to justify an alpha level tried using threshold terminology, but it felt clunky and ungainly. After thinking about it for over a year, I think I finally have an answer:

Replace “significant” with “discernible”.

The first advantage is the conceptual shift from meaning to perception. Rather than imbuing a statistical finding with a higher “significance”, the framework moves earlier in the processing stream. Perception entails more than sensation, which a term like “detectable” might imply. “Discernible” implies the arrangement of information, similar to how inferential statistics can arrange data for researchers. Thus, statistics are more readily recognized as tools to peer into a study’s data rather than arbiters of the ultimate truth – or “significance” – of a set of findings. Scientists are thus encouraged to own the interpretation of the statistical test rather than letting the numbers arrogate “significance”.

The lonely vacuum of truly null findings sucks.

The second advantage of this terminological shift is that the boundary between results falling on either side of the statistical threshold becomes appropriately ambiguous. No longer is some omniscient "significance" or "insignificance" ascribed to them automatically. Rather, a set of questions immediately arises. 1) Is there really no effect there – at least, if the null hypothesis is the nil hypothesis of exactly 0 difference or relationship? That's highly unlikely, as unlikely as finding a true vacuum, but I suppose it might be possible. In this case, having "no discernible effect" allows the possibility that our perception is insufficient to recognize it against a background void.

The proper microscope reveals the richness of a mote of dust.

2) Is there an effect there, but one so small as to be ignorable given the smallest effect size we care about? This state of affairs is likely when tiny effects exist, like the relationship between intelligence and birth order. Larger samples or more precise measures might be required to find it – like having an electron microscope instead of a light microscope. However, as things currently stand, we as a research community are satisfied that such effects are small enough that we're willing not to care about them. Well-powered studies should allow relatively definitive conclusions here. Here, "no discernible effect" suggests the effect may be wee, but so wee that we are content not to consider it further.

And they didn’t even have to wipe petrolatum off Hubble’s lens.

3) Are our tests so imprecise that we couldn’t answer whether an effect is actually there or not? Perhaps a study has too few participants to detect even medium-sized effects. Perhaps its measurements are so internally inconsistent as to render them like unto a microscope lens smeared with gel. Either way, the study just may not have enough power to make a meaningful determination one way or the other. The increased smudginess of one set of data compared to another that helped inspire the Nature commentary might be more readily appreciated when described in “discernible” instead of “significant” terms. Indeed, “no discernible effect” helps keep in mind that our perceptual-statistical apparatus may have insufficient resolution to provide a solid answer. Conversely, a discernible finding might also be due to faulty equipment or other non-signal related causes, irrespective of its apparent practical significance.

These questions lead us to ponder whether our findings are reasonable or might instead simply be the phantoms of false positives (or other illusions). Indeed, I think "discernible" nudges us to think more deeply about why a finding did or didn't cross the statistical threshold instead of simply accepting its "significance" or lack thereof.

Or “significant“.





In any case, I hope that “statistically discernible” is a better term for what we mean when a result passes the alpha threshold in a null hypothesis test and is thus more extreme than we decided would be acceptable to believe as a value arising from the null hypothesis’s distribution. I hope it can lead to meaningful shifts in how we think about the results of these statistical tests. Then again, perhaps the field will just rail against NHDT in a decade. Assuming, of course, that it doesn’t just act as Regina to my Gretchen.

How to justify your alpha: step by step

Do you need alpha? | Study goals? | Minimum effect size of interest? | Justify your alpha | Justify your beta | How to control for multiple comparisons? | Examples

I joined a panoply of scholars who argue that it is necessary to justify your alpha (that is, the acceptable long-run proportion of false positive results in a specific statistical test) rather than redefine a threshold for statistical significance across all fields. We'd hoped to emphasize that more than just alpha needed justification, but space limitations (and the original article's title) focused our response on that specific issue. We also didn't have the space to include specific examples of how to justify alpha or other kinds of analytic decisions. Because people are really curious about this issue, I elaborate here on my thought process about how one can justify alpha and other analytic decisions.

0) Do you even need to consider alpha?

But not about the Bayes. Just frequentist trouble.

The whole notion of an alpha level doesn’t make sense in multiple statistical frameworks. For one, alpha is only meaningful in the frequentist school of probability, in which we consider how often a particular event would occur in the real world out of the total number of such possible events. In Bayesian statistics, the concern is typically adjusting the probability that a proposition of some sort is true, which is a radically different way of thinking about statistical analysis. Indeed, the Bayes factor represents the relative likelihood of an alternative hypothesis to that of a null hypothesis given the data obtained. There’s also no clear mapping of p values to Bayes factors, so many discussions about alpha aren’t relevant if that’s the statistical epistemology you use.
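One way to see that mismatch concretely is a rough sketch using the BIC approximation to the Bayes factor for a one-sample t-test (Wagenmakers, 2007); the exact numbers depend on the prior, but the same p = .05 result corresponds to very different Bayes factors at different sample sizes.

```python
# Sketch: the same two-tailed p = .05 result maps onto different Bayes factors
# as n grows, using the BIC approximation to the Bayes factor for a one-sample
# t-test (Wagenmakers, 2007). Illustrative only; other priors give other numbers.
import math

from scipy import stats

def bf10_bic_approx(t, n):
    """Approximate BF10 (alternative over null) for a one-sample t-test."""
    df = n - 1
    # BIC(H1) - BIC(H0): the alternative frees one parameter (the mean)
    delta_bic = -n * math.log(1 + t ** 2 / df) + math.log(n)
    return math.exp(-delta_bic / 2)

for n in (20, 100, 1000):
    t_at_p05 = stats.t.ppf(0.975, n - 1)  # t value sitting exactly at p = .05
    print(f"n = {n:4d}: t = {t_at_p05:.3f}, approximate BF10 = {bf10_bic_approx(t_at_p05, n):.2f}")
```

With this approximation, the borderline p = .05 result looks like weak evidence for the alternative in a small sample but evidence for the null in a very large one, which is why alpha-centered discussions do not translate cleanly into Bayesian terms.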

Fishers (but not dredgers) of data.

Another set of considerations comes from Sir Ronald Fisher, one of the original statisticians. In this view, there’s no error control to be concerned with; p values instead represent levels of evidence against a null hypothesis. Crossing a prespecified low p value (i.e., alpha) then entails rejection of a statistical null hypothesis. However, this particular point of view has fallen out of favor, and Fisher’s attempts to construct a system of fiducial inference also ended up being criticized. Finally, there are systems of statistical inference based on information theory and non-Bayesian likelihood estimation that do not entail error control in their epistemological positions.

I come from a clinical background and teach the psychodiagnostic assessment of adults. As a result, I work in a world in which I must use the information we gather to make a series of dichotomous decisions. Does this client have major depressive disorder or not? Is this client likely to benefit from exposure and response prevention interventions or not? This dichotomous framework extends to my thinking about how to conduct scientific studies. Will I pursue further work on a specific measure or not? Will my research presume this effect is worth including in the experimental design or not? Thus, the error control epistemological framework (borne of Neyman-Pearson reasoning about statistical testing) seems to be a good one for my purposes, and I'm reading more about it to verify that assumption. In particular, I'm interested in disambiguating this kind of statistical reasoning from the common practice of null hypothesis significance testing, which amalgamates Fisherian inferential philosophy and Neyman-Pearson operations.

I don’t argue that it’s the only relevant possible framework to evaluate the results of my research. That’s why I still report p values in text whenever possible (the sea of tabular asterisks can make precise p values difficult to report there) to allow a Fisherian kind of inference. Such p value reporting also allows those in the Neyman-Pearson tradition to use different decision thresholds in assessing my work. It should also be possible to use the n/df and values of my statistics to compute Bayes factors for the simpler statistics I report (e.g., correlations and t tests), though more complex inferences may be difficult to reverse engineer.

1) What are the goals of your study?

My lab has run four different kinds of studies, each of which has unique goals based on the population being sampled, the methods being used, and the practical importance of the questions being asked. A) One kind of study uses easy-to-access and convenient students or MTurk workers as a proxy for “people at large” for studies involving self-report assessments of various characteristics. B) Another study draws from a convenient student population to make inferences about basic emotional functioning through psychophysiological or behavioral measures. C) A third study draws from (relatively) easily sampled clinical populations in the community to bridge self-report and clinical symptom ratings, behavioral, and psychophysiological methods of assessment. D) The last study type comes from my work on the Route 91 shooting, in which the population sampled is time-sensitive, non-repeatable, and accessible only through electronic means. In each case, the sampling strategy from these populations entails constraints on generality that must be discussed when contextualizing the findings.

Five at a time, please.

In study type A), I want to be relatively confident that the effects I observe are there. I’m also cognizant that measuring everything through self-report has attendant biases, so it’s possible processes like memory inaccuracies, self-representation goals, and either willful or unintentional ignorance may systematically bias results. Additionally, the relative ease of running a few hundred more students (or MTurk workers, if funding is available) makes running studies with high power to detect small effects a simpler proposition than in other study types, as once the study is programmed, there’s very little work needed to run participants through it. Indeed, they often just run themselves! In MTurk studies, it may take a few weeks to run 300 participants; if I run them in the lab, I can run about 400 participants in a year on a given study. Thus, I want to have high power to detect even small effects, I’ll use a lower alpha level to guard against spurious results borne of mega-multiple comparisons, and I’m happy to collect large numbers of participants to reduce the errors surrounding any parameter estimates.

With sensors on the face, behind the ear, on the hands and elbows…

Study type B) deals with taking measurements across domains that may be less reliable and that do not suffer from the same single-method biases as self-report studies. I'm willing to achieve a lower evidential value for the sake of being able to say something about these studies. In part, I think for many psychophysiological measures, the field is learning how to do this work well, and we need to walk statistically before we can run. I also believe that to the extent these studies are genuinely cross-method, we may reduce some of the crud factor especially inherent in single-method studies and produce more robust findings. However, these data require two research assistants to spend an hour of their time applying sensors to participants' bodies and then spend an extra two to three hours collecting data and removing those sensors, so the cost of acquiring new participants is higher than in study type A). For reasons not predictable from the outset, a certain percentage of participants will also not yield interpretable data. However, the participants are still relatively easy to come by, so replacement is less of an issue, and I can plan to run around 150 participants a year if the lab's efforts are fully devoted to the study. In such studies, I'll sacrifice some power and precision to run a medium number of participants with a higher alpha level. I also use a lot of within-subjects designs in these kinds of studies to maximize the power of any experimental manipulations, as they allow participants to serve as their own controls.

Finding well-characterized participants imposes substantial personnel costs.

In study type C), I run into many of the same kinds of problems as study type B) except that participants are relatively hard to come by. They come from clinical groups that require substantial recruitment resources to characterize and bring in, and they’re relatively scarce compared to unselected undergraduates. For cost comparison purposes, it’s taken me two years to bring in 120 participants for such a study (with a $100+ session compensation rate to reflect that they’d be in the lab for four hours). Within-subjects designs are imperative here to keep power high, and I also hope that study type B) has shown us how to maximize our effect sizes such that I can power a study to detect medium-sized effects as opposed to small ones.

These can be the most stressful studies to recruit.

Study type D) entails running participants who cannot be replaced in any way, making good measurement imperative to increase precision to detect effects. Recruitment is also a tremendous challenge, and it’s impossible to know ahead of time how many people will end up participating in the study. Nevertheless, it’s still possible to specify desired effect sizes ahead of time to target along with the precision needed to achieve a particular statistical power. I was fortunate to get just over 200 people initially, though our first follow-up had the approximately 125 participants I was hoping to have throughout the study. I haven’t had the ability to bring such people into the lab so far, so it’s the efforts needed in recruitment and maintenance of the sample that represent substantial costs, not the time it takes research assistants to collect the data.

2) Given your study’s goals, what’s the minimum effect size that you want your statistical test to detect?

Many different effect sizes can be computed to address different kinds of questions, yet many researchers answer this question by defaulting to an effect size that either corresponds to a lay person’s intuitions about the size of the effect (Cohen, 1988) or a typical effect size in the literature. However, in my research domain, it’s often important to consider whether effect sizes make a practical difference in real-world applications. That doesn’t mean that effects must be whoppingly large to be worth studying; wee effects with cheap interventions that feature small side effect profiles are still well worth pursuing. Nevertheless, whether theoretical, empirical, or pragmatic, researchers should take care to justify a minimum effect size of interest, as this choice will guide the rest of the justification process.

Is it at least this big? Measured reliably? Well, then… {credit: Dorothy Joseph}

In setting this minimum effect size of interest, researchers should also consider the reliability of the measures being used in a study. All else being equal, more reliable measures will be able to detect smaller effects given the same number of observations. However, savvy researchers should take into account the unreliability of their measures when detailing the smallest effect size of interest. For instance, a researcher may want to detect a correlation of .10 – which corresponds to an effect explaining 1% of the variance in the linear relationship between two measures – when the two measures the researcher is using have internal consistencies of .80 and .70. Rearranging Lord and Novick's (1968) correction formula, the smallest observed effect size of interest should be calculated as .10*√(.80*.70), or .10*√.56, or .10*.748, or about .075.
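Written as a worked equation, using the internal consistencies and population correlation from the example above:

```latex
% Observed correlation implied by a population correlation of .10 measured with
% internal consistencies of .80 and .70 (rearranged from Lord & Novick, 1968)
\begin{aligned}
r_{\text{observed}} &= r_{\text{population}} \sqrt{r_{xx}\, r_{yy}} \\
                    &= .10 \times \sqrt{.80 \times .70} = .10 \times \sqrt{.56} \approx .075
\end{aligned}
```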

However, unreliability of measurement is not the only kind of uncertainty that might lead researchers to choose a smaller minimum effect size of interest to detect. Even if researchers consult previous studies for estimates of relevant effect sizes, publication bias and uncertainty around the size of an effect in the population throw additional complications into these considerations. Adjusting the expected effect size of interest in light of these issues may further aid in justifying an alpha.

In the absence of an effect that passes the statistical threshold in a well-powered study, it may be useful to examine whether the observed effect is instead inconsistent with the smallest effect size of interest. In this way, we can articulate whether proceeding as if even that effect is not present is reasonable rather than defaulting to "retaining the null hypothesis". This step is important for completing the error control process to ensure that some conclusions can be drawn from the study rather than leaving it out in an epistemological no-man's land should results not pass the justified statistical threshold.
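One way to operationalize this kind of check is an equivalence test, such as the two one-sided tests (TOST) procedure. The sketch below is only illustrative: it assumes a correlation metric, equivalence bounds of ±.10, the Fisher z approximation, and placeholder values for the observed correlation, sample size, and alpha.

```python
# Illustrative sketch of a TOST equivalence test on a correlation, using the
# Fisher z approximation. The bounds (±.10), alpha (.005), observed r, and n
# are assumed placeholders rather than prescriptions.
import math

from scipy import stats

def tost_correlation(r_obs, n, sesoi=0.10, alpha=0.005):
    z_obs = math.atanh(r_obs)      # Fisher z of the observed correlation
    z_bound = math.atanh(sesoi)    # Fisher z of the equivalence bound
    se = 1 / math.sqrt(n - 3)      # standard error of Fisher z
    p_lower = 1 - stats.norm.cdf((z_obs + z_bound) / se)  # H0: r <= -SESOI
    p_upper = stats.norm.cdf((z_obs - z_bound) / se)      # H0: r >= +SESOI
    # If both one-sided tests pass the justified alpha, proceed as if any true
    # effect is smaller than the smallest effect size of interest.
    return p_lower, p_upper, max(p_lower, p_upper) < alpha

p_lo, p_hi, equivalent = tost_correlation(r_obs=0.02, n=3486)
print(f"lower-bound p = {p_lo:.2g}, upper-bound p = {p_hi:.2g}, equivalent: {equivalent}")
```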

3) Given your study’s goals, what alpha level represents an adequate level of sensitivity to detect effects (or signals) of interest balanced against a specificity against interpreting noise?

Utility functions. Ideally, the field could compute some kind of utility function whose maximum value represents a balance among sample size from a given population, minimum effect size of interest, alpha, and power. This function could provide an objective alpha to use in any given situation. However, because each of these quantities has costs and benefits associated with it – and the relative costs and benefits will vary by study and investigator – such a function is unlikely to be computable. Thus, when justifying an alpha level, we need to resort to other kinds of arguments. This means that it's unlikely all investigators will agree that a given justification is sufficient, but a careful layout of the rationale behind the reported alpha combined with detailed reporting of p values would allow other researchers to re-evaluate a set of findings to determine how they comport with those researchers' own principles, costs, and benefits. I would also argue that there is no situation in which there are no costs, as other studies could always be run in place of the one that's chosen, participants could be allocated to other studies instead of the one proposed, and the time spent programming a study, reducing its data, and analyzing the results are all costs inherent in any study.

Possible justifications. One blog post summarizes traditional, citational, normatively principled, bridgingly principled, and objectively justified alphas. The traditional and cited justifications are similar; the cited version simply notes the particular authority for the alpha level (e.g., Fisher, 1925) instead of resting on a nebulous "everybody does it." In this way, the paper about redefining statistical significance provides a one-stop citation for those looking to summarize that tradition and provides a list of authors who collectively represent that tradition.

However, that paper provides additional justifications for a stricter alpha level that entail multiple possible inferential benefits derived from normative or bridged principles. In particular, the authors of that paper bridge frequentist and Bayesian statistical inferential principles to emphasize the added rigor an alpha of .005 would lend the field. They also note that, normatively, other fields have adopted stricter levels for declaring findings "significant" or as "discoveries", and that such a strict alpha level would reduce the false positive rate below the field's current rate while requiring less than double the number of participants to maintain the same level of power.

Who knows what alpha level pseudopodia would find convenient?

Bridging principles. One could theoretically justify a (two-tailed) alpha level of .05 on multiple grounds. For instance, humans tend to think in 5s and 10s, and a 5/100 cutoff seems intuitively and conveniently “rare”. I should also note here that I use the term “convenient” to denote something adopted as more than “arbitrary”, inasmuch as our 5×2 digit hands provide us a quick, shared grouping for counting across humans. I fully expect that species with non-5/10 digit groupings (or pseudopods instead of digits) might use different cutoffs, which would similarly shape their thinking about convenient cutoffs for their statistical epistemology.

Percentiles of the (convex) normal distribution.

Such a cutoff has also been bridged to values of the normal distribution, as a two-tailed alpha of .05 cuts off values roughly two standard errors (more precisely, 1.96) away from the mean of a normal sampling distribution. Because many parametric frequentist statistical tests assume that the sampling distributions of their estimates are normal, this bridge links the alpha level to a fundamental assumption of these kinds of statistical tests.
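In symbols, with theta-hat an estimate, theta-zero its null value, and SE its standard error:

```latex
% The "about two standard errors" bridge: under a normal sampling distribution,
% a two-tailed alpha of .05 cuts off estimates roughly 1.96 standard errors
% from their null value.
P\!\left( \left| \frac{\hat{\theta} - \theta_0}{SE(\hat{\theta})} \right| \ge 1.96 \right) = .05
```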

Another bridging principle entails considering that lower alpha levels correspond to increasingly severe tests of theories. Thus, a researcher may prefer a lower alpha level if the theory is more well-developed, its logical links are clearer (from the core theoretical propositions to its auxiliary corollaries to its specific hypothetical propositions to the statistical hypotheses to test in any study), and its constructs are more precisely measured.

How to describe findings meeting or exceeding alpha? From a normative perspective, labeling findings as “statistically significant” has led to decades of misinterpretation of the practical importance of statistical tests (particularly in the absence of effect sizes). In our commentary, we encouraged abandoning that phrase, but we didn’t offer an alternative. I propose describing these results as “passing threshold” to reduce misinterpretative possibilities. This term is far less charged with…significance…and may help separate evaluation of the statistical hypotheses under test from larger practical or theoretical concerns.

4) Given that alpha level, what beta level (or 1-power) represents an acceptable tradeoff between missing a potential effect of interest and leaving noise uninterpreted?

Absolute power requires…absolute sample size.

Though justifying alpha is an important step, it's just as important to justify your beta (which is the long-run proportion of false retentions of the null hypothesis). From a Neyman-Pearson perspective, the lower the beta, the more evidential value a null finding possesses. This is also why Neyman-Pearson reasoning is inductive rather than deductive: Null hypotheses have information value as opposed to being defaults that can only be refuted with deductive logic's modus tollens tests. However, the lower the beta, the more observations are needed to make a given effect size pass the statistical threshold set by alpha. As shown above, one key to making minimum effect sizes of interest larger is measuring the effect more reliably. A second is maximizing the strength of any manipulation such that a larger minimum effect size would be interesting to a researcher.

Another angle on the question leading this section is: How precise would the estimate of the effect size need to be to make me comfortable with accepting the null hypothesis being tested rather than just retaining it in light of a test statistic that doesn't pass the statistical threshold? Just because a test has high power (on account of a large effect size) doesn't mean that the estimate of that effect is precise. More observations are needed to make precise estimates of an effect – which also reduces the beta (and thus heightens the power) of a given statistical test.

Power curves visualize the tradeoffs among effect size, beta, and the number of observations. They can aid researchers in determining how feasible it is to have a null hypothesis with high evidential value versus being able to conduct the study in the first place. Many power curves start showing sharply diminishing returns (ever more observations per unit decrease in beta) when beta is about .20 (or power is about .80), consistent with historical guidelines. However, other considerations may take precedence over the shape of a power curve. Implicitly, traditional alpha (.05) and beta (.20) levels imply that erroneously declaring that an effect passes threshold is four times worse than erroneously declaring that an effect does not. Some researchers might believe even higher ratios should be used. Alternatively, it may be more important for researchers to fix one error rate or the other at a specific value and let the other vary as resources dictate. These values should be articulated in the justification.
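To make the shape of such a curve concrete, here is a minimal sketch for a two-sample t-test using the noncentral t distribution; the effect size, alpha, and sample sizes below are placeholder assumptions rather than recommendations.

```python
# Sketch of a power curve for a two-sample t-test via the noncentral t
# distribution. The effect size (d = 0.35) and alpha (.05) are placeholders.
import math

from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    df = 2 * n_per_group - 2
    ncp = d * math.sqrt(n_per_group / 2)        # noncentrality under H1
    t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-tailed critical value
    # Probability that |t| exceeds the critical value when the effect is real
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

for n in (25, 50, 100, 200, 400):
    power = power_two_sample_t(d=0.35, n_per_group=n)
    print(f"n per group = {n:3d}: power = {power:.2f}, beta = {1 - power:.2f}")
```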

5) Given your study’s goals and the alpha and beta levels above, how can you adjust for multiple comparisons to maintain an acceptable level of power while guarding against erroneous findings?

So many comparisons, so many techniques for dealing with them.

Most studies do not conduct a single comparison. Indeed, many studies toss in a number of different variables and assess the relationships, mediation, and moderation among them. As a result, there are many more comparisons conducted than the chosen alpha and beta levels are designed to guard against! There are four broad methods to use when considering how multiple comparisons impact your stated alpha and beta levels.

Per-comparison error rate (PCER) does not adjust comparisons at all and simply accepts the risk of there likely being multiple spurious results. In this case, no adjustments to alpha or beta need to be made in determining how many observations are needed.

False discovery error rate (FDER) allows that there will be a certain proportion of false discoveries in any set of multiple tests; FDER corrections attempt to keep the rate of these false discoveries at the given alpha level. However, this comes at a cost of complexity for those trying to justify alpha and beta, as each comparison uses a different critical alpha level. One common method for controlling FDER adjusts the critical alpha level for each comparison in a relatively linear fashion: rank the p values from smallest to largest, then retain null hypotheses starting from the highest p value down through the last one for which p > [(rank of that p value)/(total # of comparisons)]*(justified alpha). The remaining comparisons are judged as passing threshold. So, which comparison's alpha value should be used in justifying comparisons? This may require knowing on average how many comparisons would typically pass threshold within a given comparison set size to plan for a final alpha to justify. After that, the number of observations may need adjusting to maintain the desired beta.
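A minimal sketch of this step-up procedure (the Benjamini-Hochberg method cited in the examples below), with made-up p values and a placeholder alpha:

```python
# Minimal sketch of the step-up false discovery rate procedure described above
# (Benjamini & Hochberg, 1995). The p values and alpha are made-up placeholders.
def fdr_pass_threshold(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    # Find the largest rank k (1-based) whose p value sits at or below
    # (k / m) * alpha; that comparison and every smaller p value pass threshold.
    last_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * alpha:
            last_passing_rank = rank
    passing = [False] * m
    for rank, idx in enumerate(order, start=1):
        passing[idx] = rank <= last_passing_rank
    return passing

print(fdr_pass_threshold([0.001, 0.012, 0.030, 0.041, 0.200]))  # [True, True, True, False, False]
```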

Corrections to the family-wise error rate (FWER) seek to reduce the error rate across a set of comparisons by lowering alpha in a more dramatic way across comparisons than do corrections for FDER. One popular method for controlling FWER entails dividing the desired alpha level by the number of comparisons. If the smallest p value in that set is smaller than alpha/(# comparisons), then it passes threshold and the next smallest p value is compared to alpha/(# comparisons-1). Once the p value of a comparison is greater than that fraction, that comparison and the remaining comparisons are considered not to have passed threshold. This correction has the same problems of an ever-shifting alpha and beta as the FDER, so the same cautions apply.
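The analogous sketch of this step-down procedure (the Holm-Bonferroni method cited in the examples below), using the same made-up p values:

```python
# Minimal sketch of the step-down family-wise error rate procedure described
# above (Holm-Bonferroni), using the same made-up p values as the sketch above.
def holm_pass_threshold(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    passing = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] < alpha / (m - step):  # alpha/m, then alpha/(m - 1), ...
            passing[idx] = True
        else:
            break  # this comparison and all larger p values fail to pass
    return passing

print(holm_pass_threshold([0.001, 0.012, 0.030, 0.041, 0.200]))  # [True, True, False, False, False]
```

With these placeholder p values, the step-down procedure passes one fewer comparison than the step-up sketch above, reflecting its stricter control.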

Per-family error rate (PFER) represents the most stringent error control of all. In this view, making multiple errors across families of comparisons is more damaging than making one error. Thus, tight control should be exercised over the possibility of making an error in any family of comparisons. The Bonferroni correction is one method of maintaining a PFER that is familiar to many researchers. In this case, alpha simply needs to be divided by the number of comparisons to be made, beta adjusted to maintain the appropriate power, and the appropriate number of observations collected.
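In symbols, for a family of m comparisons (the .05/7 value appears in the study type C example below):

```latex
% Bonferroni control of the per-family error rate: test every comparison in the
% family at the justified alpha divided by the number of comparisons, m.
\alpha_{\text{per comparison}} = \frac{\alpha_{\text{PFER}}}{m},
\qquad \text{e.g., } \frac{.05}{7} \approx .007
```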

Many researchers reduce alpha in the face of multiple comparisons to address the PCER without taking steps to address other kinds of error rates formally. Such ad hoc adjustments should at least also report how many tests would pass the statistical threshold by chance alone. Using FDER or FWER control techniques represents a balance between leniency and strictness in error control, though researchers should specify in advance whether false discovery or family-wise error control is more in line with their epistemological stance at a given time. Researchers may prefer to control the PFER when the number of comparisons is kept to a minimum through the use of a few critical, focal tests of a well-developed theory.

Who is “family” among all these people?

In FDER, FWER, and PFER control mechanisms, the notion of "family" must be justified. Is it all comparisons conducted in a study? Is it a set of exploratory factors that are considered separately from focal confirmatory comparisons? Does it group together conceptually similar measures (e.g., normal-range personality, abnormal personality, time-limited psychopathology, and well-being)? All of these and more may be reasonable families to use in lumping or splitting comparisons. However, to help other researchers believe that these families were considered separable at the outset of a study, family membership decisions should be pre-registered.

6) Examples of justified alphas

Though the epistemological principles involved in justifying alphas and similar quantities run deep, I don't believe that a good alpha justification requires more than a paragraph. Ideally, I would like to see this paragraph placed at the start of a Method section in a journal article, as it sets the epistemological stage for everything that comes afterward. For each study type listed above, here are some possible paragraphs to justify a particular alpha, beta, and number of observations. I note that these are riffs on possible justifications; they do not necessarily represent the ways I elected to treat the same data detailed in the first sentences of each paragraph. To determine appropriate corrections for unreliability in measures when computing power estimates, I used Rosnow and Rosenthal's (2003) conversions of effect sizes to correlations (rs).

Study type A): We planned to sample from a convenience population of undergraduates to provide precise estimates of two families of five effects each; we expected all of these effects to be relatively small. Because our measures in this study have historically been relatively reliable (i.e., internal consistencies > .80), we planned our study to detect a minimum correlation of .08, as that corresponds to a presumed population correlation of .10 (or proportion of variance shared of 1%) measured with instruments with at least 80% true score variance. We also recognized that our study was conducted entirely with self-report measures, making it possible that method variance would contaminate some of our findings. As a result, we adopted a critical one-tailed α level of .005. Because we believed that spuriously detecting an effect was ten times as undesirable as failing to detect an effect, we chose to run enough participants to ensure we had 95% power to detect a correlation of .08. We used the Holm-Bonferroni method to provide a family-wise error rate correction entailing a minimum α of .001 for the largest effect in the study in each of the two families. This required a sample size of 3486 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with this population (e.g., Molina, Pierce, Bergquist, & Benning, 2018), we anticipated that 2% of our sample would produce invalid personality profiles that would require replacement to avoid distorting the results (Benning & Freeman, 2017). Consequently, we targeted a sample size of 3862 participants to anticipate these replacements.

Study type B): We planned to sample from a convenience population of undergraduates to provide initial estimates of the extent to which four pleasant picture contents potentiated the postauricular reflex compared to neutral pictures. From a synthesis of the extant literature (Benning, 2018), we expected these effects to vary between ds of 0.2 and 0.5, with an average of 0.34. Because our measures in this study have historically been relatively unreliable (i.e., internal consistencies ~ .35; Aaron & Benning, 2016), we planned our study to detect a minimum mean difference of 0.20, as that corresponds to a presumed population mean difference of 0.34 measured with approximately 35% true score variance. We adopted a critical α level of .05 to keep false discoveries at a traditional level as we sought to improve the reliability and effect size of postauricular reflex potentiation. Following conventions that spuriously detecting an effect was four times as undesirable as failing to detect an effect, we chose enough participants to ensure we had 80% power to detect a d of 0.20. We used the Benjamini-Hochberg (1995) method to provide a false discovery error rate correction entailing a minimum one-tailed α of .0125 for the largest potentiation in the study. This required a sample size of 156 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with this population (e.g., Ait Oumeziane, Molina, & Benning, 2018), we anticipated that 15% of our sample would produce unscoreable data in at least one condition for various reasons and would require additional data to fill in those conditions. Thus, we targeted a sample size of 180 participants to accommodate these additional participants.

Study type C): We planned to sample from our university’s community mental health clinic to examine how anhedonia manifests itself in depression across seven different measures that are modulated by emotional valence. However, because there were insufficient cases with major depressive disorder in that sampling population, we instead used two different advertisements on Craigslist to recruit local depressed and non-depressed participants who were likely to be drawn from the same population. We believed that a medium effect size for the Valence x Group interaction (i.e., an f of 0.25) represented an effect that would be clinically meaningful in this assessment context. Because our measures’ reliabilities vary widely (i.e., internal consistencies ~ .35-.75; Benning & Ait Oumeziane, 2017), we planned our study to detect a minimum f of 0.174, as that corresponds to a presumed population f of 0.25 measured with approximately 50% true score variance. We adopted a critical α level of .007 in evaluating the Valence x Group interactions, using a Bonferroni correction to maintain a per-family error rate of .05 across all seven measures. To balance the number of participants needed from this selected population with maintaining power to detect effects, we chose enough participants to ensure we had 80% power to detect an f of 0.174. This required a sample size of 280 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with these measures (e.g., Benning & Ait Oumeziane, 2017), we anticipated that 15% of our sample would produce unscoreable data in at least one condition for various reasons and would require additional data to fill in those conditions. Thus, we targeted a sample size of 322 participants to accommodate these additional participants.

Study type D): We planned to sample from the population of survivors of the Route 91 Harvest Festival shooting on October 1, 2017, and from the population of the greater Las Vegas valley area who learned of that shooting within 24 hours of it happening. Because we wanted to sample this population within a month after the incident to examine acute stress reactions – and recruit as many participants as possible – this study did not have an a priori number of participants or targeted power. To maximize the possibility of detecting effects in this unique population, we adopted a critical two-tailed α level of .05 with a per-comparison error rate, as we were uncertain about the possible signs of all effects. Among the 45 comparisons conducted this way, chance would predict approximately 2 to pass threshold. However, we believed the time-sensitive, unrepeatable nature of the sample justified using looser evidential thresholds to speak to the effects in the data.
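For readers without G*Power at hand, a figure like the study type A sample size above can be roughly cross-checked with the Fisher z approximation; the sketch below uses the values from that paragraph and lands within a few participants of the exact routine.

```python
# Approximate n needed to detect a correlation, via the Fisher z transformation.
# Values follow the study type A paragraph above (r = .08, one-tailed alpha of
# .001 after the Holm correction, 95% power); exact routines such as G*Power's
# differ by a handful of participants.
import math

from scipy import stats

def approx_n_for_correlation(r, alpha_one_tailed, power):
    z_alpha = stats.norm.ppf(1 - alpha_one_tailed)  # critical z for alpha
    z_beta = stats.norm.ppf(power)                  # z corresponding to desired power
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

print(approx_n_for_correlation(r=0.08, alpha_one_tailed=0.001, power=0.95))
```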
