Category Archives: Meta-science
Scientists are concerned with the progress of research programs. A research program that doesn’t make progress in understanding the phenomena it studies wastes valuable professional time and resources. It also is a sink of resources that could be better spent improving the human condition. However, most research programs that don’t progress are unlikely to actively harm society. Instead, they will wither away, swirling down a drain of decreasing significance and impact.
Not so the “Big Lie”.
Imre Lakatos talked about research programs being structured around a theoretical core surrounded by auxiliary hypotheses that can be tested more formally. The theoretical core of this research program is that there was substantial fraud in the 2020 election cycle that worked to the benefit of Democrats and the detriment of Republicans. Auxiliary hypotheses swirl around this core regarding how this fraud may be detected and counteracted. If these auxiliary hypotheses survive tests of the theory, then the theoretical core holds.
However, if those hypotheses are falsified through tests (as in a boss fight), then the research program has two challenges. The negative heuristic describes how the auxiliary hypotheses must be regenerated if they are falsified to preserve the theoretical core, as the theoretical core itself cannot be attacked. The positive heuristic states the research program must come up with theoretically consistent methods of generating new hypotheses to be tested should some of the auxiliary belt fail so that the theoretical core can be modified.
I will detail the auxiliary hypotheses of the Big Lie as I understand them and then give the results of their tests.
Court Challenges Will Reveal Improprieties in Election Law Changes during the Pandemic
The Trump campaign, allied Republican entities, and sympathetic attorneys (including 17 state attorneys general) filed over 60 lawsuits challenging the propriety of various electoral counts and changes to voting laws that states made in light of the pandemic. By one count, 62 out of 64 of these cases were dismissed or found in favor of the defendants, indicating that the vast majority of these cases testing this auxiliary hypothesis found it false. The two successful cases related to denying the counting of ballots with missing IDs after election day in Pennsylvania and the Republican candidate for New York’s 22nd Congressional district being declared the eventual winner by 109 votes.
Thus, these two cases seem more like rogue comets passing by the theoretical core than auxiliary hypotheses in tight orbit around it. Furthermore, some of the cases that were brought are so frivolous as to draw sanctions against the attorneys who filed them or cause an attorney to be fired. Though Mike Lindell has been filing suits against voting machine manufacturers, they are countersuing for defamation and seeking sanctions for the filing attorneys.
Audits of Electoral Counts Will Reveal Discrepant Vote Totals
The notion that recounts and audits of vote counts would show massive fraud is prominent across the Big Lie, presumably reflecting vote-dumping shenanigans, vote switching, or other corruption of electronic databases. Secretaries of State in Georgia and Texas have found no meaningful discrepancies between published and audited vote totals. Audits in Wisconsin (by Republican leaders), Michigan (by Senate Republicans), and Arizona (by a pro-Trump firm with no previous experience) likewise found no fraud, even though the auditing bodies had strong Republican allegiances. It is unclear whether Pennsylvania’s Republican-led Senate will release the results of its audit, though its Secretary of State has already audited the vote totals and found no meaningful discrepancies with the published totals. There is no evidence in favor of this auxiliary hypothesis, and several tests falsify it.
Voter Fraud Prosecutions Will Show Rampant Illegal Voting by Democratic Voters
The notion that large swathes of Democratic voters cast illegal ballots is another tenet of the Big Lie. However, in the six most contested states, there is little evidence for this proposition. Indeed, Republicans in Florida (at least 4 cases), Pennsylvania (at least 3 cases), and Nevada (at least 1 case) are accused or convicted of voting crimes. Again, the evidence is against the Big Lie’s auxiliary hypotheses – this time, in the opposite direction.
Objectors to Electoral Count Insisted on Having Their Own Votes Audited
On January 6, six Republican Senators and 121 Republican Representatives objected to the electoral count in Arizona. Thus, it would be reasonable to expect that at least the objecting Republican representatives from Arizona would likewise cast doubt on their own electoral victories. Alas, no such protests have issued from Representatives Biggs, Gosar, or Lesko.
All four sets of auxiliary hypotheses have been falsified in multiple ways, sometimes in the direction opposite predictions. Any attempts to regenerate those auxiliary hypotheses through the positive heuristic have also failed.
The Big Lie has been laid bare. It fails to generate hypotheses with connection to data in the world at large. Its support does not derive from empirically supported facts.
The Big Lie is a degenerate research program.
Once again, the term “statistical significance” in null hypothesis significance testing is under fire. The authors of that commentary favor renaming “confidence intervals” as “compatibility intervals”. This term emphasizes that the intervals represent a range of values that would be reasonable values to obtain under certain statistical assumptions rather than a statement about subjective beliefs (for which Bayesian statistics are more appropriate with their credible intervals). However, I think the term needing replacement most goes back even further. In our “justify your alpha” paper, we recommended abandoning the term “statistically significant”, but we didn’t give a replacement for that term. It wasn’t for lack of trying, but we never came up with a better label for findings that pass a statistical threshold.
My previous blog post about how to justify an alpha level tried using threshold terminology, but it felt clunky and ungainly. After thinking about it for over a year, I think I finally have an answer:
Replace “significant” with “discernible”.
The first advantage is the conceptual shift from meaning to perception. Rather than imbuing a statistical finding with a higher “significance”, the framework moves earlier in the processing stream. Perception entails more than sensation, which a term like “detectable” might imply. “Discernible” implies the arrangement of information, similar to how inferential statistics can arrange data for researchers. Thus, statistics are more readily recognized as tools to peer into a study’s data rather than arbiters of the ultimate truth – or “significance” – of a set of findings. Scientists are thus encouraged to own the interpretation of the statistical test rather than letting the numbers arrogate “significance”.
The second advantage of this terminological shift is that the boundary between results falling on either side of the statistical threshold becomes appropriately ambiguous. No longer is some omniscient “significance” or “insignificance” imputed to them automatically. Rather, a set of questions immediately arises. 1) Is there really no effect there – at least, if the null hypothesis is the nil hypothesis of exactly 0 difference or relationship? It’s highly unlikely, as unlikely as finding a true vacuum, but I suppose it might be possible. In this case, having “no discernible effect” allows the possibility that our perception is insufficient to recognize it against a background void.
2) Is there an effect there, but it’s so small as to be ignorable given the smallest effect size we care about? This state of affairs is likely when tiny effects exist, like the relationship between intelligence and birth order. Larger samples might be needed or more precise measures could be required to find it – like having an electron microscope instead of a light microscope. However, with the current state of affairs, we as a research community are satisfied that the effects are small enough that we’re willing not to care about them. Well-powered studies should allow relatively definitive conclusions here. Here, “no discernible effect” suggests the effect may be wee, but so wee that we are content not to consider it further.
3) Are our tests so imprecise that we couldn’t answer whether an effect is actually there or not? Perhaps a study has too few participants to detect even medium-sized effects. Perhaps its measurements are so internally inconsistent as to render them like unto a microscope lens smeared with gel. Either way, the study just may not have enough power to make a meaningful determination one way or the other. The increased smudginess of one set of data compared to another that helped inspire the Nature commentary might be more readily appreciated when described in “discernible” instead of “significant” terms. Indeed, “no discernible effect” helps keep in mind that our perceptual-statistical apparatus may have insufficient resolution to provide a solid answer. Conversely, a discernible finding might also be due to faulty equipment or other non-signal related causes, irrespective of its apparent practical significance.
These questions lead us to ponder whether our findings are reasonable or might instead simply be the phantoms of false positives (or other illusions). Indeed, I think “discernible” nudges us to think about why a finding did or didn’t cross the statistical threshold more deeply instead of simply accepting its “significance” or lack thereof.
In any case, I hope that “statistically discernible” is a better term for what we mean when a result passes the alpha threshold in a null hypothesis test and is thus more extreme than we decided would be acceptable to believe as a value arising from the null hypothesis’s distribution. I hope it can lead to meaningful shifts in how we think about the results of these statistical tests. Then again, perhaps the field will just rail against NHDT in a decade. Assuming, of course, that it doesn’t just act as Regina to my Gretchen.
I joined a panoply of scholars who argue that it is necessary to justify your alpha (that is, the acceptable long-run proportion of false positive results in a specific statistical test) rather than redefine a threshold for statistical significance across all fields. We’d hoped to emphasize that more than just alpha needed justification, but space limitations (and the original article’s title) focused our response on that specific issue. We also didn’t have the space to include specific examples of how to justify alpha or other kinds of analytic decisions. Because people are really curious about this issue, I elaborate here my thought process about how one can justify alpha and other analytic decisions.
The whole notion of an alpha level doesn’t make sense in multiple statistical frameworks. For one, alpha is only meaningful in the frequentist school of probability, in which we consider how often a particular event would occur in the real world out of the total number of such possible events. In Bayesian statistics, the concern is typically adjusting the probability that a proposition of some sort is true, which is a radically different way of thinking about statistical analysis. Indeed, the Bayes factor represents how much more likely the obtained data are under an alternative hypothesis than under a null hypothesis. There’s also no clear mapping of p values to Bayes factors, so many discussions about alpha aren’t relevant if that’s the statistical epistemology you use.
Another set of considerations comes from Sir Ronald Fisher, one of the original statisticians. In this view, there’s no error control to be concerned with; p values instead represent levels of evidence against a null hypothesis. Crossing a prespecified low p value (i.e., alpha) then entails rejection of a statistical null hypothesis. However, this particular point of view has fallen out of favor, and Fisher’s attempts to construct a system of fiducial inference also ended up being criticized. Finally, there are systems of statistical inference based on information theory and non-Bayesian likelihood estimation that do not entail error control in their epistemological positions.
I come from a clinical background and teach the psychodiagnostic assessment of adults. As a result, I come from a world in which I must use the information we gather to make a series of dichotomous decisions. Does this client have major depressive disorder or not? Is this client likely to benefit from exposure and response prevention interventions or not? This framework extends to my thinking about how to conduct scientific studies or not. Will I pursue further work on a specific measure or not? Will my research presume this effect is worth including in the experimental design or not? Thus, the error control epistemological framework (borne of Neyman-Pearson reasoning about statistical testing) seems to be a good one for my purposes, and I’m reading more about it to verify that assumption. In particular, I’m interested in disambiguating this kind of statistical reasoning from the common practice of null hypothesis significance testing, which amalgamates Fisherian inferential philosophy and Neyman-Pearson operations.
I don’t argue that it’s the only relevant possible framework to evaluate the results of my research. That’s why I still report p values in text whenever possible (the sea of tabular asterisks can make precise p values difficult to report there) to allow a Fisherian kind of inference. Such p value reporting also allows those in the Neyman-Pearson tradition to use different decision thresholds in assessing my work. It should also be possible to use the n/df and values of my statistics to compute Bayes factors for the simpler statistics I report (e.g., correlations and t tests), though more complex inferences may be difficult to reverse engineer.
My lab has run four different kinds of studies, each of which has unique goals based on the population being sampled, the methods being used, and the practical importance of the questions being asked. A) One kind of study uses easy-to-access and convenient students or MTurk workers as a proxy for “people at large” for studies involving self-report assessments of various characteristics. B) Another study draws from a convenient student population to make inferences about basic emotional functioning through psychophysiological or behavioral measures. C) A third study draws from (relatively) easily sampled clinical populations in the community to bridge self-report and clinical symptom ratings, behavioral, and psychophysiological methods of assessment. D) The last study type comes from my work on the Route 91 shooting, in which the population sampled is time-sensitive, non-repeatable, and accessible only through electronic means. In each case, the sampling strategy from these populations entails constraints on generality that must be discussed when contextualizing the findings.
In study type A), I want to be relatively confident that the effects I observe are there. I’m also cognizant that measuring everything through self-report has attendant biases, so it’s possible processes like memory inaccuracies, self-representation goals, and either willful or unintentional ignorance may systematically bias results. Additionally, the relative ease of running a few hundred more students (or MTurk workers, if funding is available) makes running studies with high power to detect small effects a simpler proposition than in other study types, as once the study is programmed, there’s very little work needed to run participants through it. Indeed, they often just run themselves! In MTurk studies, it may take a few weeks to run 300 participants; if I run them in the lab, I can run about 400 participants in a year on a given study. Thus, I want to have high power to detect even small effects, I’ll use a lower alpha level to guard against spurious results borne of mega-multiple comparisons, and I’m happy to collect large numbers of participants to reduce the errors surrounding any parameter estimates.
Study type B) deals with taking measurements across domains that may be less reliable and that do not suffer from the same single-method biases as self-report studies. I’m willing to achieve a lower evidential value for the sake of being able to say something about these studies. In part, I think for many psychophysiological measures, the field is learning how to do this work well, and we need to walk statistically before we can run. I also believe that to the extent these studies are genuinely cross-method, we may reduce some of the crud factor especially inherent in single-method studies and produce more robust findings. However, these data require two research assistants to spend an hour of their time applying sensors to participants’ bodies and then spend an extra two to three hours collecting data and removing those sensors, so the cost of acquiring new participants is higher than in study type A). For reasons not predictable from the outset, a certain percentage of participants will also not yield interpretable data. However, the participants are still relatively easy to come by, so replacement is less of an issue, and I can plan to run around 150 participants a year if the lab’s efforts are fully devoted to the study. In such studies, I’ll sacrifice some power and precision to run a medium number of participants with a higher alpha level. I also use a lot of within-subjects designs in these kinds of studies to maximize the power of any experimental manipulations, as they allow participants to serve as their own controls.
In study type C), I run into many of the same kinds of problems as study type B) except that participants are relatively hard to come by. They come from clinical groups that require substantial recruitment resources to characterize and bring in, and they’re relatively scarce compared to unselected undergraduates. For cost comparison purposes, it’s taken me two years to bring in 120 participants for such a study (with a $100+ session compensation rate to reflect that they’d be in the lab for four hours). Within-subjects designs are imperative here to keep power high, and I also hope that study type B) has shown us how to maximize our effect sizes such that I can power a study to detect medium-sized effects as opposed to small ones.
Study type D) entails running participants who cannot be replaced in any way, making good measurement imperative to increase precision to detect effects. Recruitment is also a tremendous challenge, and it’s impossible to know ahead of time how many people will end up participating in the study. Nevertheless, it’s still possible to specify desired effect sizes ahead of time to target along with the precision needed to achieve a particular statistical power. I was fortunate to get just over 200 people initially, though our first follow-up had the approximately 125 participants I was hoping to have throughout the study. I haven’t had the ability to bring such people into the lab so far, so it’s the efforts needed in recruitment and maintenance of the sample that represent substantial costs, not the time it takes research assistants to collect the data.
Many different effect sizes can be computed to address different kinds of questions, yet many researchers answer this question by defaulting to an effect size that either corresponds to a lay person’s intuitions about the size of the effect (Cohen, 1988) or a typical effect size in the literature. However, in my research domain, it’s often important to consider whether effect sizes make a practical difference in real-world applications. That doesn’t mean that effects must be whoppingly large to be worth studying; wee effects with cheap interventions that feature small side effect profiles are still well worth pursuing. Nevertheless, whether theoretical, empirical, or pragmatic, researchers should take care to justify a minimum effect size of interest, as this choice will guide the rest of the justification process.
In setting this minimum effect size of interest, researchers should also consider the reliability of the measures being used in a study. All things being equal, more reliable measures will be able to detect smaller effects given the same number of observations. However, savvy researchers should take into account the unreliability of their measures when detailing the smallest effect size of interest. For instance, a researcher may want to detect a correlation of .10 – which corresponds to an effect explaining 1% of the variance shared between two measures – and the two measures the researcher is using have internal consistencies of .80 and .70. Rearranging Lord and Novick’s (1968) correction formula, the actual smallest effect size of interest should be calculated as .10*√(.80*.70), or .10*√.56, or .10*.748, or .075.
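The arithmetic in that example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `attenuated_effect` and its parameter names are my own illustrative choices, and it assumes the simple attenuation relationship used in the example (true correlation multiplied by the square root of the product of the two reliabilities).

```python
import math

def attenuated_effect(true_r, reliability_x, reliability_y):
    """Expected observed correlation, given the true correlation and the
    reliabilities (e.g., internal consistencies) of the two measures."""
    return true_r * math.sqrt(reliability_x * reliability_y)

# Values from the example: target r = .10, reliabilities of .80 and .70.
smallest_r = attenuated_effect(0.10, 0.80, 0.70)
print(round(smallest_r, 3))  # 0.075
```

The less reliable the measures, the smaller the observed effect a study must be powered to detect.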
However, unreliability of measurement is not the only kind of uncertainty that might lead researchers to choose a smaller minimum effect size of interest to detect. Even if researchers consult previous studies for estimates of relevant effect sizes, publication bias and uncertainty around the size of an effect in the population throw additional complications into these considerations. Adjusting the expected effect size of interest in light of these issues may further aid in justifying an alpha.
In the absence of an effect that passes the statistical threshold in a well-powered study, it may be useful to examine whether it is instead inconsistent with the smallest effect size of interest. In this way, we can articulate whether proceeding as if even that effect is not present is a reasonable one rather than defaulting to “retaining the null hypothesis”. This step is important for completing the error control process to ensure that some conclusions can be drawn from the study rather than leaving it out in an epistemological no-man’s land should results not pass the justified statistical threshold.
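One way to make this check concrete is an equivalence test built from two one-sided tests. The sketch below is an assumption on my part rather than a procedure named above: it applies the Fisher z approximation to a correlation, and the function name `equivalence_test_r` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist

def equivalence_test_r(observed_r, n, sesoi_r, alpha=0.05):
    """Two one-sided tests of a correlation against the bounds -sesoi_r and
    +sesoi_r, using the Fisher z approximation.

    Returns (p_value, is_equivalent). A small p value means the observed
    correlation is inconsistent with effects as large as the smallest
    effect size of interest in either direction."""
    std_error = 1.0 / math.sqrt(n - 3)
    z_observed = math.atanh(observed_r)
    z_bound = math.atanh(sesoi_r)
    std_normal = NormalDist()
    # Null for the lower bound: the true correlation is at or below -sesoi_r.
    p_lower = 1.0 - std_normal.cdf((z_observed + z_bound) / std_error)
    # Null for the upper bound: the true correlation is at or above +sesoi_r.
    p_upper = std_normal.cdf((z_observed - z_bound) / std_error)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha

# An observed r of 0 with a smallest effect size of interest of .10:
print(equivalence_test_r(0.0, 1000, 0.10))  # large sample
print(equivalence_test_r(0.0, 100, 0.10))   # small sample
```

With the same observed correlation, only the larger sample is precise enough to rule out effects as large as the smallest effect size of interest.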
Utility functions. Ideally, the field could compute some kind of utility function whose maximum value represents a balance among sample size from a given population, minimum effect size of interest, alpha, and power. This function could provide an objective alpha to use in any given situation. However, because each of these quantities has costs and benefits associated with them – and the relative costs and benefits will vary by study and investigator – such a function is unlikely to be computable. Thus, when justifying an alpha level, we need to resort to other kinds of arguments. This means that it’s unlikely all investigators will agree that a given justification is sufficient, but a careful layout of the rationale behind the reported alpha combined with detailed reporting of p values would allow other researchers to re-evaluate a set of findings to determine how they comport with those researchers’ own principles, costs, and benefits. I would also argue that there is no situation in which there are no costs, as other studies could always be run in place of one that’s chosen, participants could be allocated to other studies instead of the one proposed, and the time spent programming a study, reducing its data, and analyzing the results are all costs inherent in any study.
Possible justifications. One blog post summarizes traditionally, citationally, normatively principled, bridgingly principled, and objectively justified alphas. The traditional and cited justifications are similar; the cited version simply notes where the particular authority for the alpha level is (e.g., Fisher, 1925) instead of resting on a nebulous “everybody does it.” In this way, the paper about redefining statistical significance provides a one-stop citation for those looking to summarize that tradition and provides a list of authors who collectively represent that tradition.
However, that paper provides additional sets of justifications for a stricter alpha level that entail multiple possible inferential benefits derived from normative or bridged principles. In particular, the authors of that paper bridge frequentist and Bayesian statistical inferential principles to emphasize the added rigor an alpha of .005 would lend the field. They also note that normatively, other fields have adopted stricter levels for declaring findings “significant” or as “discoveries”, and that such a strict alpha level would reduce the false positive rate below that of the field currently while requiring less than double the number of participants to maintain the same level of power.
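The sample size claim can be checked roughly. The sketch below assumes a two-tailed z test, for which the required number of observations scales with the squared sum of the critical value and the z value for the desired power. The function name `relative_sample_size` is hypothetical, and exact figures will differ somewhat for other tests.

```python
from statistics import NormalDist

def relative_sample_size(alpha_new, alpha_old=0.05, power=0.80):
    """Ratio of observations needed at alpha_new versus alpha_old to keep
    the same power in a two-tailed z test (normal approximation)."""
    z = NormalDist().inv_cdf
    needed_new = (z(1 - alpha_new / 2) + z(power)) ** 2
    needed_old = (z(1 - alpha_old / 2) + z(power)) ** 2
    return needed_new / needed_old

# Moving from alpha = .05 to alpha = .005 at 80% power.
print(round(relative_sample_size(0.005), 2))
```

Under these assumptions the ratio comes out near 1.7, which is consistent with needing less than double the number of participants.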
Bridging principles. One could theoretically justify a (two-tailed) alpha level of .05 on multiple grounds. For instance, humans tend to think in 5s and 10s, and a 5/100 cutoff seems intuitively and conveniently “rare”. I should also note here that I use the term “convenient” to denote something adopted as more than “arbitrary”, inasmuch as our 5×2 digit hands provide us a quick, shared grouping for counting across humans. I fully expect that species with non-5/10 digit groupings (or pseudopods instead of digits) might use different cutoffs, which would similarly shape their thinking about convenient cutoffs for their statistical epistemology.
Such a cutoff has also been bridged to values of the normal distribution, as a two-tailed alpha of .05 corresponds to values roughly two standard deviations (more precisely, 1.96) away from that distribution’s mean. Because many parametric frequentist statistical tests assume normal distributions of the standard errors of scores, this bridge links the alpha level to a fundamental assumption of these kinds of statistical tests.
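The link between alpha and the normal distribution can be checked directly with Python's standard library. In this small sketch, the helper name `two_tailed_critical_z` is my own. It returns how many standard deviations from the mean of a standard normal distribution a two-tailed cutoff lies.

```python
from statistics import NormalDist

def two_tailed_critical_z(alpha):
    """Distance from the mean, in standard deviations, beyond which a total
    proportion alpha of a standard normal distribution falls (both tails)."""
    return NormalDist().inv_cdf(1 - alpha / 2)

print(round(two_tailed_critical_z(0.05), 2))   # 1.96
print(round(two_tailed_critical_z(0.005), 2))  # 2.81
```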
Another bridging principle entails considering that lower alpha levels correspond to increasingly severe tests of theories. Thus, a researcher may prefer a lower alpha level if the theory is more well-developed, its logical links are clearer (from the core theoretical propositions to its auxiliary corollaries to its specific hypothetical propositions to the statistical hypotheses to test in any study), and its constructs are more precisely measured.
How to describe findings meeting or exceeding alpha? From a normative perspective, labeling findings as “statistically significant” has led to decades of misinterpretation of the practical importance of statistical tests (particularly in the absence of effect sizes). In our commentary, we encouraged abandoning that phrase, but we didn’t offer an alternative. I propose describing these results as “passing threshold” to reduce misinterpretative possibilities. This term is far less charged with…significance…and may help separate evaluation of the statistical hypotheses under test from larger practical or theoretical concerns.
Though justifying alpha is an important step, it’s just as important to justify your beta (which is the long-run proportion of false retentions of the null hypothesis). From a Neyman-Pearson perspective, the lower the beta, the more evidential value a null finding possesses. This is also why Neyman-Pearson reasoning is inductive rather than deductive: Null hypotheses have information value as opposed to being defaults that can only be refuted with deductive logic’s modus tollens tests. However, the lower the beta, the more observations are needed to make a given effect size pass the statistical threshold set by alpha. As shown above, one key to making minimum effect sizes of interest larger is measuring that effect more reliably. A second is maximizing the strength of any manipulation such that a larger minimum effect size would be interesting to a researcher.
Another angle on the question leading this section is: How precise would the estimate of the effect size need to be to make me comfortable with accepting the statistical hypothesis being tested rather than just retaining it in light of a test statistic that doesn’t pass the statistical threshold? Just because a test has a high power (on account of a large effect size) doesn’t mean that the estimate of that effect is precise. More observations are needed to make precise estimates of an effect – which also reduces the beta (and thus heightens the power) of a given statistical test.
Power curves visualize the tradeoffs among effect size, beta, and the number of observations. They can aid researchers in determining how feasible it is to have a null hypothesis with high evidential value versus being able to conduct the study in the first place. Some power curves start showing a non-linear relationship between observations and beta when beta is about .20 (or power is about .80), consistent with historical guidelines. However, other considerations may take precedence over the shape of a power curve. Implicitly, traditional alpha (.05) and beta (.20) levels imply that erroneously declaring an effect passes threshold is four times worse than erroneously declaring an effect does not. Some researchers might believe even higher ratios should be used. Alternatively, it may be more important for researchers to fix one error rate or another at a specific value and let the other vary as resources dictate. These values should be articulated in the justification.
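A power curve of this kind can be tabulated without special software. The sketch below makes several assumptions: a two-tailed test of a correlation, the Fisher z approximation, and a hypothetical function name `power_for_correlation`. Dedicated power software may give slightly different values.

```python
import math
from statistics import NormalDist

def power_for_correlation(r, n, alpha=0.05):
    """Approximate two-tailed power to detect a true correlation r with n
    observations, using the Fisher z approximation."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    noncentrality = math.atanh(r) * math.sqrt(n - 3)
    upper_tail = 1 - std_normal.cdf(z_crit - noncentrality)
    lower_tail = std_normal.cdf(-z_crit - noncentrality)
    return upper_tail + lower_tail

# Tabulate one curve: a small effect (r = .10) across sample sizes.
for n in (50, 100, 200, 400, 800, 1600):
    power = power_for_correlation(0.10, n)
    print(f"n={n:5d}  power={power:.2f}  beta={1 - power:.2f}")
```

Rerunning the loop with a different `alpha` or `r` shows how each choice shifts the number of observations needed to reach a given beta.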
Most studies do not conduct a single comparison. Indeed, many studies toss in a number of different variables and assess their relationships, mediation, and moderation among them. As a result, there are many more comparisons conducted than the chosen alpha and beta levels are designed to guard against! There are four broad methods to use when considering how multiple comparisons impact your stated alpha and beta levels.
Per-comparison error rate (PCER) does not adjust comparisons at all and simply accepts the risk of there likely being multiple spurious results. In this case, no adjustments to alpha or beta need to be made in determining how many observations are needed.
False discovery error rate (FDER) allows that there will be a certain proportion of false discoveries in any set of multiple tests; FDER corrections attempt to keep the rate of these false discoveries at the given alpha level. However, this comes at a cost of complexity for those trying to justify alpha and beta, as each comparison uses a different critical alpha level. One common method for controlling FDER adjusts alpha levels for each comparison in a relatively linear fashion, retaining null hypotheses starting from the highest p value to the last one in which p > [(step #)/(total # of comparisons)]*(justified alpha). The remaining comparisons are judged as passing threshold. So, which comparison’s alpha value should be used in justifying comparisons? This may require knowing on average how many comparisons would typically pass threshold within a given comparison set size to plan for a final alpha to justify. After that, the number of observations may need adjusting to maintain the desired beta.
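The step-wise procedure described here matches what is usually called the Benjamini-Hochberg method. The sketch below assumes that is the method meant; the function name and the example p values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a comparison passes threshold
    under the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    # Indices of the comparisons, ordered from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p value is at or below (rank / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every comparison ranked at or below the cutoff passes threshold.
    passes = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            passes[idx] = True
    return passes

example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(example_p))
# [True, True, False, False, False, False, False, False]
```

The effective per-comparison alpha depends on how many comparisons end up passing, which is why planning a single alpha to justify in advance is awkward.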
Corrections to the family-wise error rate (FWER) seek to reduce the error rate across a set of comparisons by lowering alpha in a more dramatic way across comparisons than do corrections for FDER. One popular method for controlling FWER entails dividing the desired alpha level by the number of comparisons. If the smallest p value in that set is smaller than alpha/(# comparisons), then it passes threshold and the next smallest p value is compared to alpha/(# comparisons-1). Once the p value of a comparison is greater than that fraction, that comparison and the remaining comparisons are considered not to have passed threshold. This correction has the same problems of an ever-shifting alpha and beta as the FDER, so the same cautions apply.
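This sequential procedure is commonly known as the Holm (or Holm-Bonferroni) method. The sketch below assumes that is the method meant and uses the conventional "less than or equal to" comparison at each step. The function name and example values are illustrative only.

```python
def holm(p_values, alpha=0.05):
    """Return a list of booleans, True where a comparison passes threshold
    under the Holm step-down procedure."""
    m = len(p_values)
    # Indices of the comparisons, ordered from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    passes = [False] * m
    for step, idx in enumerate(order):
        # The divisor shrinks by one at each step: m, m - 1, ..., 1.
        if p_values[idx] <= alpha / (m - step):
            passes[idx] = True
        else:
            # This comparison and all remaining ones fail to pass threshold.
            break
    return passes

example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(holm(example_p))
# [True, False, False, False, False, False, False, False]
```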
Per-family error rate (PFER) represents the most stringent error control of all. In this view, making multiple errors across families of comparisons is more damaging than making one error. Thus, tight control should be exercised over the possibility of making an error in any family of comparisons. The Bonferroni correction is one method of maintaining a PFER that is familiar to many researchers. In this case, alpha simply needs to be divided by the number of comparisons to be made, beta adjusted to maintain the appropriate power, and the appropriate number of observations collected.
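The Bonferroni adjustment is simple enough to sketch in a few lines. The function name and example p values below are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-comparison alpha and a list of booleans marking which
    comparisons pass threshold under the Bonferroni correction."""
    per_test_alpha = alpha / len(p_values)
    return per_test_alpha, [p <= per_test_alpha for p in p_values]

per_test_alpha, passes = bonferroni([0.001, 0.02, 0.04, 0.3])
print(per_test_alpha)  # 0.0125
print(passes)          # [True, False, False, False]
```

Because every comparison uses the same adjusted alpha, the number of observations can be planned against that single value.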
Many researchers reduce alpha in the face of multiple comparisons to address the PCER without taking steps to address other kinds of error rates formally. Such ad hoc adjustments should at least also report how many tests would pass the statistical threshold by chance alone. Using FDER or FWER control techniques represents a balance between leniency and strictness in error control, though researchers should specify in advance whether false discovery or family-wise error control is more in line with their epistemological stance at a given time. Researchers may prefer to control the PFER when the number of comparisons is kept to a minimum through the use of a few critical, focal tests of a well-developed theory.
In FDER, FWER, and PFER control mechanisms, the notion of “family” must be justified. Is it all comparisons conducted in a study? Is it a set of exploratory factors that are considered separately from focal confirmatory comparisons? Does it group together conceptually similar measures (e.g., normal-range personality, abnormal personality, time-limited psychopathology, and well-being)? All of these and more may be reasonable families to use in lumping or splitting comparisons. However, to help researchers believe that these families were considered separable at the outset of a study, family membership decisions should be pre-registered.
Though the epistemological principles involved in justifying alphas and similar quantities run deeply, I don’t believe that a good alpha justification requires more than a paragraph. Ideally, I would like to see this paragraph placed at the start of a Method section in a journal article, as it sets the epistemological stage for everything that comes afterward. For each study type listed above, here are some possible paragraphs to justify a particular alpha, beta, and number of observations. I note that these are riffs off possible justifications; they do not necessarily represent the ways I elected to treat the same data detailed in the first sentences of each paragraph. To determine appropriate corrections for unreliability in measures when computing power estimates, I used Rosnow and Rosenthal’s (2003) conversions of effect sizes to correlations (rs).
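For readers who want to check the reliability-corrected effect sizes in the paragraphs below, here is a rough Python sketch. It assumes the standard equal-group-size conversions between d, f, and r, plus the classical attenuation formula (observed r = true r × the square root of the product of the reliabilities). The function names are mine, and I am inferring rather than quoting the exact steps used, though these formulas do reproduce the figures of .08, 0.20, and 0.174 given below.

```python
import math


def d_to_r(d):
    # Cohen's d to a correlation, assuming equal group sizes.
    return d / math.sqrt(d ** 2 + 4)


def r_to_d(r):
    return 2 * r / math.sqrt(1 - r ** 2)


def f_to_r(f):
    # Cohen's f to a correlation, using f^2 = r^2 / (1 - r^2).
    return f / math.sqrt(1 + f ** 2)


def r_to_f(r):
    return r / math.sqrt(1 - r ** 2)


def attenuate(r, *reliabilities):
    # Classical attenuation: shrink r by the square root of each reliability.
    return r * math.sqrt(math.prod(reliabilities))


# Study type A: r = .10 with two measures at 80% true score variance.
print(round(attenuate(0.10, 0.80, 0.80), 2))
# Study type B: d = 0.34 with one measure at 35% true score variance.
print(round(r_to_d(attenuate(d_to_r(0.34), 0.35)), 2))
# Study type C: f = 0.25 with one measure at 50% true score variance.
print(round(r_to_f(attenuate(f_to_r(0.25), 0.50)), 3))
```

Note that for the group-difference designs, only the outcome measure's reliability enters the correction, as group membership is assumed to be measured without error.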
Study type A): We planned to sample from a convenience population of undergraduates to provide precise estimates of two families of five effects each; we expected all of these effects to be relatively small. Because our measures in this study have historically been relatively reliable (i.e., internal consistencies > .80), we planned our study to detect a minimum correlation of .08, as that corresponds to a presumed population correlation of .10 (or proportion of variance shared of 1%) measured with instruments with at least 80% true score variance. We also recognized that our study was conducted entirely with self-report measures, making it possible that method variance would contaminate some of our findings. As a result, we adopted a critical one-tailed α level of .005. Because we believed that spuriously detecting an effect was ten times as undesirable as failing to detect an effect, we chose to run enough participants to ensure we had 95% power to detect a correlation of .08. We used the Holm-Bonferroni method to provide a family-wise error rate correction entailing a minimum α of .001 for the largest effect in the study in each of the two families. This required a sample size of 3486 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with this population (e.g., Molina, Pierce, Bergquist, & Benning, 2018), we anticipated that 2% of our sample would produce invalid personality profiles that would require replacement to avoid distorting the results (Benning & Freeman, 2017). Consequently, we targeted a sample size of 3862 participants to anticipate these replacements.
Study type B): We planned to sample from a convenience population of undergraduates to provide initial estimates of the extent to which four pleasant picture contents potentiated the postauricular reflex compared to neutral pictures. From a synthesis of the extant literature (Benning, 2018), we expected these effects to vary between ds of 0.2 and 0.5, with an average of 0.34. Because our measures in this study have historically been relatively unreliable (i.e., internal consistencies ~ .35; Aaron & Benning, 2016), we planned our study to detect a minimum mean difference of 0.20, as that corresponds to a presumed population mean difference of 0.34 measured with approximately 35% true score variance. We adopted a critical α level of .05 to keep false discoveries at a traditional level as we sought to improve the reliability and effect size of postauricular reflex potentiation. Following conventions that spuriously detecting an effect was four times as undesirable as failing to detect an effect, we chose enough participants to ensure we had 80% power to detect a d of 0.20. We used the Benjamini-Hochberg (1995) method to provide a false discovery error rate correction entailing a minimum one-tailed α of .0125 for the largest potentiation in the study. This required a sample size of 156 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with this population (e.g., Ait Oumeziane, Molina, & Benning, 2018), we anticipated that 15% of our sample would produce unscoreable data in at least one condition for various reasons and would require additional data to fill in those conditions. Thus, we targeted a sample size of 180 participants to accommodate these additional participants.
Study type C): We planned to sample from our university’s community mental health clinic to examine how anhedonia manifests itself in depression across seven different measures that are modulated by emotional valence. However, because there were insufficient cases with major depressive disorder in that sampling population, we instead used two different advertisements on Craigslist to recruit local depressed and non-depressed participants who were likely to be drawn from the same population. We believed that a medium effect size for the Valence x Group interaction (i.e., an f of 0.25) represented an effect that would be clinically meaningful in this assessment context. Because our measures’ reliabilities vary widely (i.e., internal consistencies ~ .35-.75; Benning & Ait Oumeziane, 2017), we planned our study to detect a minimum f of 0.174, as that corresponds to a presumed population f of 0.25 measured with approximately 50% true score variance. We adopted a critical α level of .007 in evaluating the Valence x Group interactions, using a Bonferroni correction to maintain a per-family error rate of .05 across all seven measures. To balance the number of participants needed from this selected population with maintaining power to detect effects, we chose enough participants to ensure we had 80% power to detect an f of 0.174. This required a sample size of 280 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with these measures (e.g., Benning & Ait Oumeziane, 2017), we anticipated that 15% of our sample would produce unscoreable data in at least one condition for various reasons and would require additional data to fill in those conditions. Thus, we targeted a sample size of 322 participants to accommodate these additional participants.
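The jump from analyzable to targeted sample sizes in the study type B and C paragraphs can be reproduced with a small calculation. This sketch assumes that "15% of our sample" means recruiting an extra 15% of the analyzable sample size, rounded up; that reading matches the 156 to 180 and 280 to 322 figures above, but it is my inference. The function name is hypothetical.

```python
def target_sample_size(n_analyzable, replacement_pct):
    """Add the expected percentage of unusable cases, rounding up."""
    # Integer ceiling division avoids floating-point rounding surprises.
    extra = -(-n_analyzable * replacement_pct // 100)
    return n_analyzable + extra


print(target_sample_size(156, 15))  # study type B
print(target_sample_size(280, 15))  # study type C
```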
Study type D): We planned to sample from the population of survivors of the Route 91 Harvest Festival shooting on October 1, 2017, and from the population of the greater Las Vegas valley area who learned of that shooting within 24 hours of it happening. Because we wanted to sample this population within a month after the incident to examine acute stress reactions – and recruit as many participants as possible – this study did not have an a priori number of participants or targeted power. To maximize the possibility of detecting effects in this unique population, we adopted a critical two-tailed α level of .05 with a per-comparison error rate, as we were uncertain about the possible signs of all effects. Among the 45 comparisons conducted this way, chance would predict approximately 2 to pass threshold. However, we believed the time-sensitive, unrepeatable nature of the sample justified using looser evidential thresholds to speak to the effects in the data.
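The "approximately 2" figure in the study type D paragraph is simply the number of comparisons multiplied by alpha. The sketch below shows that arithmetic and, as an extra, the chance of at least one spurious result. That second figure assumes all 45 comparisons are independent and all null hypotheses are true, which is my simplification rather than a claim about these data.

```python
n_comparisons = 45
alpha = 0.05

# Expected number of comparisons passing threshold by chance alone.
expected_by_chance = n_comparisons * alpha

# Probability of at least one chance result, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_comparisons

print(f"Expected by chance: {expected_by_chance:.2f}")
print(f"P(at least one): {p_at_least_one:.2f}")
```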
This post is long; the links below will take you to potential topics of interest.
Our study about the Route 91 shooting represents a substantial addition to my lab’s research skills portfolio. Specifically, it’s my first foray into anything remotely resembling community psychology, in which researchers actively engage in helping to solve problems in an identified community. In this case, I thought of the Route 91 festival survivors (and potentially the broader Las Vegas community affected by the shooting) as the community. Below are the steps we used to conduct time-sensitive research in this community, with eyes both toward doing the best science possible and toward serving the community from which our participants were drawn.
0. In time-sensitive situations, call or visit your Institutional Review Board (IRB) in person and delegate work.
Because research cannot be performed without IRB approval, I gave my IRB a call as soon as a) I had a concrete idea for the study I wanted to do and b) support from my lab to work as hard as necessary to make it happen. In my case, that was the Friday after the shooting – and three days before I was scheduled to fly out of the country for a conference. Fortunately, I was able to call our IRB’s administrative personnel, and Dax Miller guided me through the areas we’d need to make sure we addressed for a successful application. He also agreed to perform a thorough administrative review on Sunday so that I could get initial revisions back to the IRB before leaving. We performed the revisions so the study could be looked over by an IRB reviewer by Wednesday, who had additional questions I could address from afar so that the study could be approved by Thursday afternoon in Vegas (or early Friday morning in Europe). Without that kind of heads-up teamwork from the IRB, we simply couldn’t have done this study, which sought to look at people’s stories of the trauma along with their symptoms within the acute stress window of a month of the shooting.
I also drafted my lab to perform a number of tasks I simply couldn’t do by myself in such a short period of time. Three students provided literature to help me conceptualize the risks and benefits this study might pose. Two worked to provide a list of therapeutic resources for participants. Two others scoured the internet for various beliefs people espoused about the shooting to develop a measure of those. Yet another two programmed the study in Qualtrics and coordinated the transfer of a study-specific ID variable so that we could keep contact information separate from participants’ stories and scores; one of them also drafted social media advertisements. A final student created a flyer to use for recruiting participants (including a QR code to scan instead of forcing participants to remember our study’s URL). Again, without their help, this study simply couldn’t have been done, as I was already at or exceeding my capacity to stay up late in putting together the IRB application and its supporting documentation (along with programming in personality feedback in Qualtrics as our only incentive for participating).
1. Social media can be your recruitment friend, as can internal email lists.
We spread our flyers far and wide, including a benefit event for Route 91 survivors as well as coffee shops, community bulletin boards, and other such locations across the greater Las Vegas valley. Nevertheless, my RAs used their social media to help promote our study with the IRB-approved text, as did I. Other friends took up the cause and shared posts, spreading the reach of our study into the broader Las Vegas community in ways that would have been impossible otherwise.
A number of UNLV students, faculty, and staff had also attended Route 91 (and all were affected in some way by the shooting), so we distributed our study through internal email lists. At first, I had access to send an announcement through the College of Liberal Arts’ weekly student email list along with the faculty and staff daily email. After word spread of the study (see the point below), I was also allowed to send a message out to all students at the university. Those contacts helped bump our recruitment substantially, getting both people at the festival and from the broader Las Vegas community in the study.
2. The news media extends your reach even more deeply into the community, both for recruitment and dissemination.
Over the years, I’ve been fortunate enough to have multiple members of the news media contact me about stories they’re doing that can help put psychological research into context for the public. At first, I thought of contacting them as having them return the favor to me to help get the word out about my study. However, as I did so, I also recognized that approaching them with content relevant to their beats may have made their jobs mildly easier. They have airtime or column inches they have to fill, and if you provide them meaningful stories, it’ll save them effort in locating material to fill that time. Thus, if you’re prepared with a camera- or phone-ready set of points, both you and your media contacts can have a satisfying professional relationship.
I made sure to have a reasonable and concise story about what the study was about, what motivated it, what all we were looking at, and what the benefits might be to the community. That way, the journalists had plain-English descriptions of the study that could be understood by the average reader or viewer and that could be used more or less as-is, without a lot of editing. In general, I recommend having a good handle on about 3 well-rehearsed bullet points you want to make sure you get across – and that are expressed in calm, clear language you could defend as if in peer review. Those points may not all fit in with the particular story that the journalist is telling, but they’ll get the gist, especially if you have an action item at the end to motivate people. For me, that was my study’s web address.
As time went on, more journalists started contacting me. I made sure I engaged all of them, as I wanted the story of our study out in as many places as possible. Generally speaking, with each new story that came out, I had 5-10 new people participate in the study. If your study is interesting, it may snowball, and you never know which media your potential community participants might consume. The university’s press office helped in getting the word out as well, crafting a press release that was suitable for other outlets to pick up and modify.
The media can also help you re-engage your community during and after research has commenced (which I discuss more in point 5 below). They have a reach beyond your specific community you’ll likely never have, and they can help tell your community’s story to the larger world. Again, it’s imperative to do so in a way that’s not stigmatizing or harmful (see point 4 below), but you can help make prominent people whose voices otherwise wouldn’t be heard or considered.
3. Lead your approach to community groups directly with helpful resources after building credibility.
Another good reason to approach the media beyond increasing participation immediately is that having a public presence for your research will give you more credibility when approaching your community of interest directly. After about a week of press, one of the participants mentioned a survivors’ Facebook group, and I recognized the time was right to make direct inquiries to the community I wanted to help. To that end, I messaged the administrators of survivors’ Facebook groups, asking them to post the free and reduced cost therapy resources we gave to participants after the study. I was also careful not to ask to join groups, as I didn’t want to run the risk of violating that community’s healing places.
Two of the groups’ administrators asked me to join the group directly to post them, and I was honored they asked me to do so. However, in those groups, I confined myself to being someone who posted resources when general calls went out rather than offering advice about coping with trauma. I didn’t want to over-insinuate myself into the group and thereby distort their culture, and I also wanted to maintain a professionally respectful distance to allow the group to function as a community resource. A couple of other groups said they would be willing to consider posting on my behalf but that the groups were closed to all but survivors. I thanked them for their consideration and emphasized I just wanted to spread the word about available resources.
All in all, it seems imperative to approach a community with something to give, rather than just wanting to receive from them. In this unique case, I had something to offer almost immediately. However, if it’s not clear what you might bring to the table, research your community’s needs and talk to some representatives to see what they might need. To the extent your professional skills might help and that the community believes you’ll help them (and not harm them), you’re more likely to get accepted into the community to conduct your research.
4. Engage the community in developing your study.
I learned quickly about forming a partnership with the community in developing my research when one of the members of a Route 91 survivor’s group contacted me about our study. I noted that she was local and had a background in psychology, thus making her an excellent bridge between my research team and the broader community. She zipped through IRB training and provided invaluable feedback about the types of experiences people have had after Route 91 (and helped develop items to measure those) along with providing feedback about a plan to compensate participants (confirming that offering the opportunity to donate compensation to a victim’s fund might alleviate some people’s discomfort). She also gave excellent advice about how to present the study’s results to the community, down to the colors used on the graph to make more obvious the meaning of the curves I drew. Consistent with best practices in community psychology, I intend to have her as an author on the final paper(s) so that the community has a voice in this research’s reports.
Though the ad hoc, geographically dispersed nature of this community makes more centralized planning with it more challenging (especially with a short time frame), I hope our efforts thus far have helped stay true to the community’s perceptions and have avoided stigmatizing them. In communities with leadership structures of their own, engaging those leaders in study planning, participation, and dissemination helps make the research truer to the community’s experience and will likely make people more comfortable with participating. Those people may want to make changes that may initially seem to compromise your goals for a study, but in this framework, the community is a co-creator of the research. If you can’t explain well why certain procedures you really want to use are important in ways the community can accept, then you’ll need to listen to the community to figure out how to work together. Treat education as a two-way street: You have a lot to learn about the community, and you can also show them the ins and outs of research procedures, including why certain procedures (e.g., informed consent) have been developed to protect participants, not harm them.
5. Give research results back to your community in an easily digestible form.
In community psychology, the research must feed back into the community somehow. Because we’re not doing formal interventions in this study, the best I think we can do at this point is share our results in a format that’s accessible to people without a statistical education. In the web page I designed to do just that, I use language as plain as I can to describe our findings without giving tons of numbers in the text. In the numerical graphs I feature, I’ve used animated GIFs to introduce the public to the layers comprising a graph rather than expecting them to comprehend the whole thing at once. I hope that it works.
I also posted my findings on all the groups that had me and engaged reporters who’d asked for follow-up stories once we had our first round of data collected so that their investment in helping me recruit would see fruit. It seemed like many Route 91 survivors reported being perceived as not having anything “real” wrong with them or being misunderstood by their families, friends, coworkers, or romantic partners. Thus, I tried to diagram how people at Route 91 had much higher levels of post-traumatic stress than people in the community, such that about half of them would qualify for a provisional PTSD diagnosis if their symptoms persisted for longer than a month.
6. Advocate for your community.
This is one of the trickier parts of this kind of research for me, as I don’t want to speak as a representative for a broad, decentralized community of which I’m ultimately not a part. Nevertheless, I think data from this research could help advocate for the survivors in their claims, especially those who may not be eligible for other kinds of compensatory funds. I only found out about the town hall meetings of the Las Vegas Victims Fund as they were happening, so I was unable to provide an in-person comment to the board administering the funds. Fortunately, Nicole Raz alerted me to the videos of the town halls, and I was privileged to hear the voices of those who want to be remembered in this process. Right now, I’m drafting a proposal based on this study’s data and considerations of how the disability weights assigned to PTSD by the World Health Organization compare to other conditions that may be granted compensation.
In essence, I’m hoping to make a case that post-traumatic stress is worth compensating, especially given that preliminary results suggest that post-traumatic stress symptomatology as a whole doesn’t seem to have declined in this sample over the course of a month. One of the biggest problems facing this particular victims’ fund is that there are tens of thousands of possible claimants unlike just about any other mass tragedy in modern US history, so the fund administrators have terribly difficult decisions to make. I hope to create as effective an argument as possible for their consideration, and I also hope to make those who are suffering aware of other resources that may help them reduce the burden dealing with the shooting has placed on them.
7. Use statistical decision thresholds that reflect the relative difficulty of sampling from your community.
This is a point that’s likely of interest only to researchers, but it bears heavily on how you conceptualize your study’s design and analytic framework when writing for professional publication. In this case, I knew I was dealing with (hopefully) a once-in-a-lifetime sample. Originally, I was swayed by arguments to define statistical significance more stringently and computed power estimates based on finding statistically significant effects at the .005 level with 80% power using one-tailed tests. My initial thought was that I wanted any results about which I wrote to have as much evidential value as possible.
However, as I took to heart calls I’ve joined to justify one’s threshold for discussing results instead of accepting a blanket threshold, I realized that was too stringent a standard to uphold given the unrepeatable nature of this sample. I recognized I was willing to trade a lower evidential threshold for the ability to discuss more fully the results of our study. To that end, I’m now thinking we should use an alpha level of .05, corrected for multiple comparisons using the Holm-Bonferroni method within fairly narrowly defined families of results.
Specifically, for each conceptual set of measures (i.e., psychopathology, normal-range personality, other personality, well-being, beliefs about the event, and demographics), I’ll adjust the critical p value through dividing .05 by the number of measures in that family. We have two measures of psychopathology (i.e., the PCL-5 and PHQ-9), 11 normal-range personality traits, 3 other personality traits, 5 measures of well-being, and (probably) 2 measures of beliefs. Thus, if I’m interested in how those at the festival vs. those who weren’t at the festival differed in their normal range personality traits, I could conduct a series of 11 independent sample Welch’s t tests (potentially after a MANOVA involving all traits suggested there are some variables whose means differ between groups).
I’d evaluate the significance of the largest difference at a critical value of .05/11, the second largest (if that first one is significant) at a critical value of .05/10, and so on until the comparison is no longer significant. For my psychopathology variables, I’d evaluate (likely) the PTSD difference first at a critical value of .05/2, then (likely) the depression difference at a critical value of .05/1 (or .05).
That way, I’ll keep my overall error rate at .05 within a conceptual family of comparisons without overcorrecting for multiple comparisons. When dealing with correlations of variables across families of comparisons, I’ll use the larger family’s number in the initial critical value’s denominator. This procedure seems to balance having some kind of evidential value (albeit potentially small) with these findings and a reasonable amount of statistical rigor. Using the new suggested threshold, I’d have to divide .005 by the number of comparisons in a family to maintain my stated family-wise error rate, which would make for some incredibly difficult thresholds to meet!
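To make those critical values concrete, here is a small Python sketch that lists the step-by-step thresholds for each family using the measure counts given above. The dictionary keys and function names are my own labels. I've left out demographics because no count is given for that family, and the beliefs count is the provisional one.

```python
# Number of measures per conceptual family, as listed in the text.
families = {
    "psychopathology": 2,
    "normal-range personality": 11,
    "other personality": 3,
    "well-being": 5,
    "beliefs": 2,  # provisional count
}


def holm_thresholds(n_measures, alpha=0.05):
    """Critical values for the largest effect, the second largest, and so on."""
    return [alpha / (n_measures - step) for step in range(n_measures)]


def cross_family_first_threshold(family_a, family_b, alpha=0.05):
    """The larger family's size sets the first denominator across families."""
    return alpha / max(families[family_a], families[family_b])


for name, size in families.items():
    first, last = holm_thresholds(size)[0], holm_thresholds(size)[-1]
    print(f"{name}: first = {first:.5f}, last = {last:.5f}")

# The same first step under the stricter .005 level, for comparison.
print(f"{holm_thresholds(11, alpha=0.005)[0]:.6f}")
```

For the 11 normal-range personality traits, the first critical value is about .0045 at an alpha of .05 but about .00045 at an alpha of .005, which illustrates how demanding the stricter threshold would be.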
There are other design decisions I made (e.g., imputing missing values of many study measures using mice rather than only using complete cases in analyses) that also furthered my desire to keep as many voices represented as possible and make our findings as plausible as we can. In our initial study design, we also did not pay participants so that a) there wouldn’t be undue inducement to participate, b) we could accurately estimate the costs of the study when having no idea how many people might actually sign up, and c) we wouldn’t have to worry as much about the validity of responses that may have been driven more by the desire to obtain money than to provide accurate information. In each case, I intend on reporting these justifications and registering them before conducting data analyses to provide as much transparency as possible, even in a situation in which genuine preregistration wasn’t possible.
The current culture of science thrives on peer review – that is, the willingness of your colleagues to read through your work, critique it, and thereby improve it. Science magazine recently collected a slew of tips on how to review papers, which give people getting started in the process of peer reviewing some lovely overarching strategies about how to prepare a review.
But how can you keep in your head all those pieces of good advice and apply them to the specifics of a paper in front of you? I’d argue that like many human endeavors, it’s impossible. There are too many complexities in each paper to collate loads of disparate recommendations and keep them straight in your head. To that end, I’ve created a template for reviewing papers our lab either puts out or critiques. Not incidentally, I highly recommend using your lab group as a first round of review before sending papers out for review, as even the greenest RA can parse the paper for problems in logic and comprehensibility (inculding teh dreded “tpyoese”).
To help my lab out in doing this, I’ve prepared the following template. It organizes questions I typically have about various pieces of manuscripts, and I’ve found that undergrads have given nice reviews with its help. In particular, I find it helps them focus on things beyond the analytic details to which they may have not been exposed so that they don’t feel so overwhelmed. It may also be helpful for more experienced reviewers to judge what they could contribute as a reviewer in an unfamiliar topic or analytical approach. I encourage my lab members to copy and paste it verbatim when they draft their feedback, so please do the same if it’s useful to you!
Summarize in a sentence or two the strengths of the manuscript. Summarize in a sentence or two the chief weaknesses of the manuscript that must be addressed.
How coherent, crisp, and focused is the literature summary? Are all the studies discussed relevant to the topic at hand?
Are there important pieces of literature that are omitted? If so, note what they are, and provide full citations at the end of the review.
Does the literature summary flow directly into the questions posed by this study? Are their hypotheses clearly laid out?
Are the participants’ ages, sexes, and ethnic/racial distribution reasonably characterized? Is it clear from what population the sample is drawn? Are any criteria used to exclude participants from overall analyses clearly specified?
Are the measures described in brief but with enough data so that the naive reader knows what to expect? Are there internal consistency or other reliability statistics presented for inventories and other measures that can have these presented?
For any experimental task, is it described in sufficient detail to allow a naive reader to replicate the task and understand how it works? Are all critical experimental measures and dependent variables clearly explained?
Was the procedure sufficiently detailed to allow you to know what the experience was like from the perspective of the participant? Could you rerun the study with this description and that provided above of the measures and tasks?
Is each step that the authors took to get from raw data to the data that were analyzed laid out plainly? Are particular equipment settings, scoring algorithms, or the like described in sufficient detail that you could take the authors’ data and get out exactly what they analyzed?
Do the authors specify the analyses they used to test all of their hypotheses? Are those analytic tools proper to use given their research design and data at hand? Are any post hoc analyses properly described as such? Is the criterion used for statistical significance given? What measure of effect size do the authors report? Does there appear to be adequate power to test the effects of interest? Do the authors report what software they used to analyze their data?
How easily can you read the Results section? How does it flow from analysis to analysis, and from section to section? Do the authors use appropriate references to tables and/or figures to clarify the patterns they discuss?
How correct are the statistics? Are they correctly annotated in tables and/or figures? Do the degrees of freedom match up to what they should based on what’s reported in the Method section?
Do the authors provide reasonable numbers to substantiate the verbal descriptions they use in the text?
If differences among groups or correlations are given, are there actual statistical tests performed that assess these differences, or do the authors simply rely on results falling on either side of a line of statistical significance?
If models are being compared, are the fit indexes both varied in their domains they assess (e.g., error of approximation, percentage of variance explained relative to a null model, containing more information given the number of parameters) and interpreted appropriately?
Are all the findings reported on in the Results mentioned in the Discussion?
Does the discussion contextualize the findings of this study back into the broader literature in a way that flows, is sensible, and appropriately characterizes the findings and the state of the literature? If any relevant citations are missing, again give the full citation at the end of the review.
How reasonable is the authors’ scope in the Discussion? Do they exceed the boundaries of their data substantially at any point?
What limitations of the study do the authors acknowledge? Are there major ones they omitted?
Are compelling future directions given for future research? Are you left with a sense of the broader impact of these findings beyond the narrow scope of this study?
REFERENCES FOR THIS REVIEW (only if you cited articles beyond what the authors already included in the manuscript)