As the SARS-CoV-2 virus spreads across the globe to create the COVID-19 pandemic, people have been asked to stay out of public spaces and reduce interpersonal contact to slow the transmission of the virus. This process has the unfortunate name of “social distancing”, which has connotations of removing oneself socially and emotionally as well as physically from the public sphere. Before modern communication technologies existed, those might have been unfortunate side effects of such a containment strategy. However, with all the methods available to us to stay connected across large gaps between us, I propose we call this effort: social spacing.
In this way, we emphasize that it’s the physical space between people we seek to minimize, not the interpersonal bonds we share. “Social spacing” entails a simple geographic remove from other people. It also invites people to become creative in using technological means to bridge the space between us.
How to stay faraway, so close
When using technology to stay connected, prioritize keeping deeper, meaningful connections with people. Use Skype or other video messaging to see as well as hear from people important to you. Talk to people on the phone to maintain a vocal connection. Use your favorite social media site’s individual messenger to keep a dialog going with someone. Set up individual or group texts tailored to select audiences.
In these deep, close, personalized connections, it’s OK to share your anxieties and fears. Validating that other people are concerned or even scared can help them feel like they are grounded in reality. However, beyond simple validation, use these deep connections to plan out what to do, to take concrete actions to live the lives you want. To the extent possible, share hobbies or other pursuits together if you’re shut off from work or other personal strivings for success.
- Move book clubs from living rooms or coffee shops to speaker phone calls or group Zoom sessions.
- Find online or app versions of bridge, board games, roleplaying adventures, or other fun things you might do together in person – or find new things to do online.
- Set an appointment with some friends to watch a show or movie on TV or streaming media. You can then have a group chat afterward to share your reactions. Then again, maybe you all would enjoy keeping the line open as you’re watching to comment along in real time!
- Curate playlists on Spotify or other music sites to share with your friends to express your current mood or provide some uplift to each other.
- Form a creative group to write novels, paint pictures, or pursue other artistic endeavors together.
- Broaden your palate and share culinary adventures with your friends with social cooking sites or online cooking classes. Share your favorites with your friends, or have conversations as you cook to help debug recipes.
- Exercise your body and share online workout videos or apps with your friends. Setting up a friendly competition might even prolong your activity gains after the pandemic passes.
- Learn skills through shared courseware from Khan Academy, LinkedIn Learning, or other sources.
These kinds of personalized connections can be prioritized over broad social media posts. In that way, you have more of an influence over your audience – and you can ask for a deeper kind of support than a feed or timeline might provide. If you use social media, favor concrete nouns and verbs over emotionally charged language to minimize emotional contagion. Share information from trusted, reputable sources as close to the relevant data as possible. I recommend the University of Minnesota’s COVID-19 CIDRAP site for a transnational perspective.
When too many connections and too frequent news become too much
You might find that the firehose of information overwhelms you at times. If you find yourself getting more anxious when you watch the news or browse social media, that’s a good sign that you’d benefit from a break. As a first step, you might disable notifications on your phone from news or social media apps so that you can control when you search for information rather than having it pushed to you. Other possibilities include:
- Employing the muting options on Twitter, snoozing posts or posters on Facebook, or filtering words on Instagram.
- Using a timer, an app, or a browser extension to limit the time you can spend on specific social media sites.
- Turning off all your devices for a few hours to really unplug for a while.
Through these methods, you can give yourself space to recharge and stay connected even as you’re socially spaced.
Once again, the term “statistical significance” in null hypothesis significance testing is under fire. The authors of that commentary favor renaming “confidence intervals” as “compatibility intervals”. This term emphasizes that the intervals represent a range of values that would be reasonable to obtain under certain statistical assumptions rather than a statement about subjective beliefs (for which Bayesian statistics are more appropriate with their credible intervals). However, I think the term needing replacement most goes back even further. In our “justify your alpha” paper, we recommended abandoning the term “statistically significant”, but we didn’t give a replacement for that term. It wasn’t for lack of trying, but we never came up with a better label for findings that pass a statistical threshold.
My previous blog post about how to justify an alpha level tried using threshold terminology, but it felt clunky and ungainly. After thinking about it for over a year, I think I finally have an answer:
Replace “significant” with “discernible”.
The first advantage is the conceptual shift from meaning to perception. Rather than imbuing a statistical finding with a higher “significance”, the framework moves earlier in the processing stream. Perception entails more than sensation, which a term like “detectable” might imply. “Discernible” implies the arrangement of information, similar to how inferential statistics can arrange data for researchers. Thus, statistics are more readily recognized as tools to peer into a study’s data rather than arbiters of the ultimate truth – or “significance” – of a set of findings. Scientists are thus encouraged to own the interpretation of the statistical test rather than letting the numbers arrogate “significance”.
The second advantage of this terminological shift is that the boundary between results falling on either side of the statistical threshold becomes appropriately ambiguous. No longer is some omniscient “significance” or “insignificance” ascribed to them automatically. Rather, a set of questions immediately arises. 1) Is there really no effect there – at least, if the null hypothesis is the nil hypothesis of exactly 0 difference or relationship? It’s highly unlikely, as unlikely as finding a true vacuum, but I suppose it might be possible. In this case, having “no discernible effect” allows the possibility that our perception is insufficient to recognize it against a background void.
2) Is there an effect there, but it’s so small as to be ignorable given the smallest effect size we care about? This state of affairs is likely when tiny effects exist, like the relationship between intelligence and birth order. Larger samples might be needed or more precise measures could be required to find it – like having an electron microscope instead of a light microscope. However, as matters stand, we as a research community are satisfied that the effects are small enough that we’re willing not to care about them. Well-powered studies should allow relatively definitive conclusions in this case. Here, “no discernible effect” suggests the effect may be wee – so wee that we are content not to consider it further.
3) Are our tests so imprecise that we couldn’t answer whether an effect is actually there or not? Perhaps a study has too few participants to detect even medium-sized effects. Perhaps its measurements are so internally inconsistent as to render them like unto a microscope lens smeared with gel. Either way, the study just may not have enough power to make a meaningful determination one way or the other. The increased smudginess of one set of data compared to another that helped inspire the Nature commentary might be more readily appreciated when described in “discernible” instead of “significant” terms. Indeed, “no discernible effect” helps keep in mind that our perceptual-statistical apparatus may have insufficient resolution to provide a solid answer. Conversely, a discernible finding might also be due to faulty equipment or other non-signal-related causes, irrespective of its apparent practical significance.
These questions lead us to ponder whether our findings are reasonable or might instead simply be the phantoms of false positives (or other illusions). Indeed, I think “discernible” nudges us to think about why a finding did or didn’t cross the statistical threshold more deeply instead of simply accepting its “significance” or lack thereof.
In any case, I hope that “statistically discernible” is a better term for what we mean when a result passes the alpha threshold in a null hypothesis test and is thus more extreme than we decided would be acceptable to believe as a value arising from the null hypothesis’s distribution. I hope it can lead to meaningful shifts in how we think about the results of these statistical tests. Then again, perhaps the field will just rail against NHDT in a decade. Assuming, of course, that it doesn’t just act as Regina to my Gretchen.
I joined a panoply of scholars who argue that it is necessary to justify your alpha (that is, the acceptable long-run proportion of false positive results in a specific statistical test) rather than redefine a threshold for statistical significance across all fields. We’d hoped to emphasize that more than just alpha needed justification, but space limitations (and the original article’s title) focused our response on that specific issue. We also didn’t have the space to include specific examples of how to justify alpha or other kinds of analytic decisions. Because people are really curious about this issue, I elaborate here my thought process about how one can justify alpha and other analytic decisions.
The whole notion of an alpha level doesn’t make sense in multiple statistical frameworks. For one, alpha is only meaningful in the frequentist school of probability, in which we consider how often a particular event would occur in the real world out of the total number of such possible events. In Bayesian statistics, the concern is typically adjusting the probability that a proposition of some sort is true, which is a radically different way of thinking about statistical analysis. Indeed, the Bayes factor represents the relative likelihood of the data under an alternative hypothesis versus under a null hypothesis. There’s also no clear mapping of p values to Bayes factors, so many discussions about alpha aren’t relevant if that’s the statistical epistemology you use.
Another set of considerations comes from Sir Ronald Fisher, one of the founders of modern statistics. In this view, there’s no error control to be concerned with; p values instead represent levels of evidence against a null hypothesis. Crossing a prespecified low p value (i.e., alpha) then entails rejection of a statistical null hypothesis. However, this particular point of view has fallen out of favor, and Fisher’s attempts to construct a system of fiducial inference also ended up being criticized. Finally, there are systems of statistical inference based on information theory and non-Bayesian likelihood estimation that do not entail error control in their epistemological positions.
I come from a clinical background and teach the psychodiagnostic assessment of adults. As a result, I come from a world in which I must use the information we gather to make a series of dichotomous decisions. Does this client have major depressive disorder or not? Is this client likely to benefit from exposure and response prevention interventions or not? This framework extends to my thinking about how to conduct scientific studies. Will I pursue further work on a specific measure or not? Will my research presume this effect is worth including in the experimental design or not? Thus, the error control epistemological framework (born of Neyman-Pearson reasoning about statistical testing) seems to be a good one for my purposes, and I’m reading more about it to verify that assumption. In particular, I’m interested in disambiguating this kind of statistical reasoning from the common practice of null hypothesis significance testing, which amalgamates Fisherian inferential philosophy and Neyman-Pearson operations.
I don’t argue that it’s the only relevant possible framework to evaluate the results of my research. That’s why I still report p values in text whenever possible (the sea of tabular asterisks can make precise p values difficult to report there) to allow a Fisherian kind of inference. Such p value reporting also allows those in the Neyman-Pearson tradition to use different decision thresholds in assessing my work. It should also be possible to use the n/df and values of my statistics to compute Bayes factors for the simpler statistics I report (e.g., correlations and t tests), though more complex inferences may be difficult to reverse engineer.
My lab has run four different kinds of studies, each of which has unique goals based on the population being sampled, the methods being used, and the practical importance of the questions being asked. A) One kind of study uses easy-to-access and convenient students or MTurk workers as a proxy for “people at large” for studies involving self-report assessments of various characteristics. B) Another study draws from a convenient student population to make inferences about basic emotional functioning through psychophysiological or behavioral measures. C) A third study draws from (relatively) easily sampled clinical populations in the community to bridge self-report and clinical symptom ratings, behavioral, and psychophysiological methods of assessment. D) The last study type comes from my work on the Route 91 shooting, in which the population sampled is time-sensitive, non-repeatable, and accessible only through electronic means. In each case, the sampling strategy from these populations entails constraints on generality that must be discussed when contextualizing the findings.
In study type A), I want to be relatively confident that the effects I observe are there. I’m also cognizant that measuring everything through self-report has attendant biases, so it’s possible processes like memory inaccuracies, self-representation goals, and either willful or unintentional ignorance may systematically bias results. Additionally, the relative ease of running a few hundred more students (or MTurk workers, if funding is available) makes running studies with high power to detect small effects a simpler proposition than in other study types, as once the study is programmed, there’s very little work needed to run participants through it. Indeed, they often just run themselves! In MTurk studies, it may take a few weeks to run 300 participants; if I run them in the lab, I can run about 400 participants in a year on a given study. Thus, I want to have high power to detect even small effects, I’ll use a lower alpha level to guard against spurious results borne of mega-multiple comparisons, and I’m happy to collect large numbers of participants to reduce the errors surrounding any parameter estimates.
Study type B) deals with taking measurements across domains that may be less reliable and that do not suffer from the same single-method biases as self-report studies. I’m willing to achieve a lower evidential value for the sake of being able to say something about these studies. In part, I think for many psychophysiological measures, the field is learning how to do this work well, and we need to walk statistically before we can run. I also believe that to the extent these studies are genuinely cross-method, we may reduce some of the crud factor especially inherent in single-method studies and produce more robust findings. However, these data require two research assistants to spend an hour of their time applying sensors to participants’ bodies and then spend an extra two to three hours collecting data and removing those sensors, so the cost of acquiring new participants is higher than in study type A). For reasons not predictable from the outset, a certain percentage of participants will also not yield interpretable data. However, the participants are still relatively easy to come by, so replacement is less of an issue, and I can plan to run around 150 participants a year if the lab’s efforts are fully devoted to the study. In such studies, I’ll sacrifice some power and precision to run a medium number of participants with a higher alpha level. I also use a lot of within-subjects designs in these kinds of studies to maximize the power of any experimental manipulations, as they allow participants to serve as their own controls.
In study type C), I run into many of the same kinds of problems as study type B) except that participants are relatively hard to come by. They come from clinical groups that require substantial recruitment resources to characterize and bring in, and they’re relatively scarce compared to unselected undergraduates. For cost comparison purposes, it’s taken me two years to bring in 120 participants for such a study (with a $100+ session compensation rate to reflect that they’d be in the lab for four hours). Within-subjects designs are imperative here to keep power high, and I also hope that study type B) has shown us how to maximize our effect sizes such that I can power a study to detect medium-sized effects as opposed to small ones.
Study type D) entails running participants who cannot be replaced in any way, making good measurement imperative to increase precision to detect effects. Recruitment is also a tremendous challenge, and it’s impossible to know ahead of time how many people will end up participating in the study. Nevertheless, it’s still possible to specify desired effect sizes ahead of time to target along with the precision needed to achieve a particular statistical power. I was fortunate to get just over 200 people initially, though our first follow-up had the approximately 125 participants I was hoping to have throughout the study. I haven’t had the ability to bring such people into the lab so far, so it’s the efforts needed in recruitment and maintenance of the sample that represent substantial costs, not the time it takes research assistants to collect the data.
Many different effect sizes can be computed to address different kinds of questions, yet many researchers default to an effect size that either corresponds to a lay person’s intuitions about the size of the effect (Cohen, 1988) or to a typical effect size in the literature. However, in my research domain, it’s often important to consider whether effect sizes make a practical difference in real-world applications. That doesn’t mean that effects must be whoppingly large to be worth studying; wee effects with cheap interventions that feature small side effect profiles are still well worth pursuing. Nevertheless, whether theoretical, empirical, or pragmatic, researchers should take care to justify a minimum effect size of interest, as this choice will guide the rest of the justification process.
In setting this minimum effect size of interest, researchers should also consider the reliability of the measures being used in a study. All things being equal, more reliable measures will be able to detect smaller effects given the same number of observations. However, savvy researchers should take into account the unreliability of their measures when detailing the smallest effect size of interest. For instance, a researcher may want to detect a correlation of .10 – which corresponds to an effect explaining 1% of the linear relationship between two measures – and the two measures the researcher is using have internal consistencies of .80 and .70. Rearranging Lord and Novick’s (1968) correction formula, the actual smallest effect size of interest should be calculated as .10*√(.80*.70), or .10*√.56, or .10*.748, or .075.
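As a quick check on this arithmetic, the attenuation adjustment can be computed in a few lines of Python (a minimal sketch; the function name is mine, not from Lord and Novick):

```python
import math

def attenuated_effect(true_r, rel_x, rel_y):
    """Observed correlation implied by a presumed population correlation
    of true_r when the two measures have reliabilities rel_x and rel_y
    (a rearrangement of the classic correction for attenuation)."""
    return true_r * math.sqrt(rel_x * rel_y)

# The worked example from the text: a population correlation of .10
# measured with internal consistencies of .80 and .70.
print(round(attenuated_effect(0.10, 0.80, 0.70), 3))  # → 0.075
```

With perfectly reliable measures (reliabilities of 1.0), the observed and population correlations coincide, which is why the smallest effect size of interest shrinks only as reliability falls.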
However, unreliability of measurement is not the only kind of uncertainty that might lead researchers to choose a smaller minimum effect size of interest to detect. Even if researchers consult previous studies for estimates of relevant effect sizes, publication bias and uncertainty around the size of an effect in the population throw additional complications into these considerations. Adjusting the expected effect size of interest in light of these issues may further aid in justifying an alpha.
In the absence of an effect that passes the statistical threshold in a well-powered study, it may be useful to examine whether the estimated effect is instead inconsistent with the smallest effect size of interest. In this way, we can articulate whether proceeding as if even that effect is absent is reasonable rather than defaulting to “retaining the null hypothesis”. This step is important for completing the error control process to ensure that some conclusions can be drawn from the study rather than leaving it out in an epistemological no-man’s land should results not pass the justified statistical threshold.
Utility functions. Ideally, the field could compute some kind of utility function whose maximum value represents a balance among sample size from a given population, minimum effect size of interest, alpha, and power. This function could provide an objective alpha to use in any given situation. However, because each of these quantities has costs and benefits associated with it – and the relative costs and benefits will vary by study and investigator – such a function is unlikely to be computable. Thus, when justifying an alpha level, we need to resort to other kinds of arguments. This means that it’s unlikely all investigators will agree that a given justification is sufficient, but a careful layout of the rationale behind the reported alpha combined with detailed reporting of p values would allow other researchers to re-evaluate a set of findings to determine how they comport with those researchers’ own principles, costs, and benefits. I would also argue that there is no situation in which there are no costs, as other studies could always be run in place of one that’s chosen, participants could be allocated to other studies instead of the one proposed, and the time spent programming a study, reducing its data, and analyzing the results are all costs inherent in any study.
Possible justifications. One blog post summarizes traditional, citational, normatively principled, bridgingly principled, and objective justifications for alphas. The traditional and citational justifications are similar; the citational version simply notes where the particular authority for the alpha level is (e.g., Fisher, 1925) instead of resting on a nebulous “everybody does it.” In this way, the paper about redefining statistical significance provides a one-stop citation for those looking to summarize that tradition and provides a list of authors who collectively represent that tradition.
However, that paper provides additional sets of justifications for a stricter alpha level that entail multiple possible inferential benefits derived from normative or bridged principles. In particular, the authors of that paper bridge frequentist and Bayesian statistical inferential principles to emphasize the added rigor an alpha of .005 would lend the field. They also note that normatively, other fields have adopted stricter levels for declaring findings “significant” or as “discoveries”, and that such a strict alpha level would reduce the false positive rate below the field’s current rate while requiring less than double the number of participants to maintain the same level of power.
Bridging principles. One could theoretically justify a (two-tailed) alpha level of .05 on multiple grounds. For instance, humans tend to think in 5s and 10s, and a 5/100 cutoff seems intuitively and conveniently “rare”. I should also note here that I use the term “convenient” to denote something adopted as more than “arbitrary”, inasmuch as our 5×2 digit hands provide us a quick, shared grouping for counting across humans. I fully expect that species with non-5/10 digit groupings (or pseudopods instead of digits) might use different cutoffs, which would similarly shape their thinking about convenient cutoffs for their statistical epistemology.
Such a cutoff has also been bridged to values of the normal distribution, as a two-tailed alpha of .05 corresponds to values roughly two standard deviations (more precisely, 1.96) from that distribution’s mean. Because many parametric frequentist statistical tests assume normally distributed errors, this bridge links the alpha level to a fundamental assumption of these kinds of statistical tests.
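That correspondence is easy to verify with Python's standard library (a quick illustrative check, not part of the original argument):

```python
from statistics import NormalDist

# A two-tailed alpha of .05 leaves .025 in each tail of the standard
# normal distribution; the matching critical value is ~1.96, i.e.,
# roughly two standard deviations from the mean.
critical_z = NormalDist().inv_cdf(1 - 0.05 / 2)
print(round(critical_z, 2))  # → 1.96
```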
Another bridging principle entails considering that lower alpha levels correspond to increasingly severe tests of theories. Thus, a researcher may prefer a lower alpha level if the theory is more well-developed, its logical links are clearer (from the core theoretical propositions to its auxiliary corollaries to its specific hypothetical propositions to the statistical hypotheses to test in any study), and its constructs are more precisely measured.
How to describe findings meeting or exceeding alpha? From a normative perspective, labeling findings as “statistically significant” has led to decades of misinterpretation of the practical importance of statistical tests (particularly in the absence of effect sizes). In our commentary, we encouraged abandoning that phrase, but we didn’t offer an alternative. I propose describing these results as “passing threshold” to reduce misinterpretative possibilities. This term is far less charged with…significance…and may help separate evaluation of the statistical hypotheses under test from larger practical or theoretical concerns.
Though justifying alpha is an important step, it’s just as important to justify your beta (which is the long-run proportion of false retentions of the null hypothesis). From a Neyman-Pearson perspective, the lower the beta, the more evidential value a null finding possesses. This is also why Neyman-Pearson reasoning is inductive rather than deductive: Null hypotheses have information value as opposed to being defaults that can only be refuted with deductive logic’s modus tollens tests. However, the lower the beta, the more observations are needed to make a given effect size pass the statistical threshold set by alpha. As shown above, one key to making minimum effect sizes of interest larger is measuring that effect more reliably. A second is maximizing the strength of any manipulation such that a larger minimum effect size would be interesting to a researcher.
Another angle on the question leading this section is: How precise would the estimate of the effect size need to be to make me comfortable with accepting the statistical hypothesis being tested rather than just retaining it in light of a test statistic that doesn’t pass the statistical threshold? Just because a test has a high power (on account of a large effect size) doesn’t mean that the estimate of that effect is precise. More observations are needed to make precise estimates of an effect – which also reduces the beta (and thus heightens the power) of a given statistical test.
Power curves visualize the tradeoffs among effect size, beta, and the number of observations. They can aid researchers in determining how feasible it is to have a null hypothesis with high evidential value versus being able to conduct the study in the first place. Some power curves start showing a non-linear relationship between observations and beta when beta is about .20 (or power is about .80), consistent with historical guidelines. However, other considerations may take precedence over the shape of a power curve. Traditional alpha (.05) and beta (.20) levels implicitly assume that erroneously declaring an effect passes threshold is four times worse than erroneously declaring it does not. Some researchers might believe even higher ratios should be used. Alternatively, it may be more important for researchers to fix one error rate or another at a specific value and let the other vary as resources dictate. These values should be articulated in the justification.
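A rough power curve for a correlation can be sketched via the Fisher z transformation (a normal approximation I've chosen for illustration; dedicated tools such as G*Power use exact methods and will differ slightly):

```python
from math import atanh, sqrt
from statistics import NormalDist

def power_for_r(r, n, alpha=0.05, tails=2):
    """Approximate power to detect a population correlation r with n
    observations, using the Fisher z transformation: atanh(r) is
    roughly normal with standard error 1/sqrt(n - 3)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / tails)
    z_effect = atanh(r) * sqrt(n - 3)
    return 1 - nd.cdf(z_crit - z_effect)

# Tracing the curve for r = .30 at a two-tailed alpha of .05 shows
# power climbing steeply at first and flattening past ~.80.
for n in (20, 40, 60, 80, 100):
    print(n, round(power_for_r(0.30, n), 2))
```

Plotting beta (1 minus these values) against n shows the elbow around beta = .20 that the historical guidelines reference: beyond that point, each additional observation buys less error reduction.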
Most studies do not conduct a single comparison. Indeed, many studies toss in a number of different variables and assess the relationships, mediation, and moderation among them. As a result, there are many more comparisons conducted than the chosen alpha and beta levels are designed to guard against! There are four broad methods to use when considering how multiple comparisons impact your stated alpha and beta levels.
Per-comparison error rate (PCER) does not adjust comparisons at all and simply accepts the risk of there likely being multiple spurious results. In this case, no adjustments to alpha or beta need to be made in determining how many observations are needed.
False discovery error rate (FDER) allows that there will be a certain proportion of false discoveries in any set of multiple tests; FDER corrections attempt to keep the rate of these false discoveries at the given alpha level. However, this comes at a cost of complexity for those trying to justify alpha and beta, as each comparison uses a different critical alpha level. One common method for controlling FDER adjusts alpha levels for each comparison in a relatively linear fashion, retaining null hypotheses starting from the highest p value to the last one in which p > [(step #)/(total # of comparisons)]*(justified alpha). The remaining comparisons are judged as passing threshold. So, which comparison’s alpha value should be used in justifying comparisons? This may require knowing on average how many comparisons would typically pass threshold within a given comparison set size to plan for a final alpha to justify. After that, the number of observations may need adjusting to maintain the desired beta.
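The step-up procedure described above matches the Benjamini-Hochberg method, which might be sketched in Python as follows (a minimal illustration; the function name is mine):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean list: True where the comparison passes
    threshold under Benjamini-Hochberg false discovery rate control."""
    m = len(pvals)
    # Sort p values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k (1-based) with p_(k) <= (k/m) * alpha;
    # working down from the largest p, hypotheses are retained until
    # one first dips under its stepwise threshold.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= (rank / m) * alpha:
            max_k = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            reject[idx] = True
    return reject

# Five comparisons: the first four pass threshold, the last does not.
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.04, 0.5]))
```

Note how each comparison faces a different critical alpha ((rank/m) × alpha), which is exactly the complexity the paragraph above flags for anyone trying to justify a single alpha and beta in advance.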
Corrections to the family-wise error rate (FWER) seek to reduce the error rate across a set of comparisons by lowering alpha in a more dramatic way across comparisons than do corrections for FDER. One popular method for controlling FWER entails dividing the desired alpha level by the number of comparisons. If the smallest p value in that set is smaller than alpha/(# comparisons), then it passes threshold and the next smallest p value is compared to alpha/(# comparisons-1). Once the p value of a comparison is greater than that fraction, that comparison and the remaining comparisons are considered not to have passed threshold. This correction has the same problems of an ever-shifting alpha and beta as the FDER, so the same cautions apply.
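The step-down procedure described here is the Holm-Bonferroni method; a minimal Python sketch (function name mine) makes the shifting denominators concrete:

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Step-down Holm-Bonferroni control of the family-wise error
    rate: compare the smallest p to alpha/m, the next to alpha/(m-1),
    and so on, stopping at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        # The (step+1)-th smallest p value faces alpha / (m - step).
        if pvals[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # once one fails, all larger p values are retained
    return reject

# Only the smallest p value survives its threshold here:
# .01 <= .05/4 passes, but .02 > .05/3 stops the procedure.
print(holm_bonferroni([0.01, 0.02, 0.03, 0.04]))
```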
Per-family error rate (PFER) represents the most stringent error control of all. In this view, making multiple errors across families of comparisons is more damaging than making one error. Thus, tight control should be exercised over the possibility of making an error in any family of comparisons. The Bonferroni correction is one method of maintaining a PFER that is familiar to many researchers. In this case, alpha simply needs to be divided by the number of comparisons to be made, beta adjusted to maintain the appropriate power, and the appropriate number of observations collected.
Many researchers reduce alpha in the face of multiple comparisons to address the PCER without taking steps to address other kinds of error rates formally. Such ad hoc adjustments should at least also report how many tests would pass the statistical threshold by chance alone. Using FDER or FWER control techniques represents a balance between leniency and strictness in error control, though researchers should specify in advance whether false discovery or family-wise error control is more in line with their epistemological stance at a given time. Researchers may prefer to control the PFER when the number of comparisons is kept to a minimum through the use of a few critical, focal tests of a well-developed theory.
In FDER, FWER, and PFER control mechanisms, the notion of “family” must be justified. Is it all comparisons conducted in a study? Is it a set of exploratory factors that are considered separately from focal confirmatory comparisons? Does it group together conceptually similar measures (e.g., normal-range personality, abnormal personality, time-limited psychopathology, and well-being)? All of these and more may be reasonable families to use in lumping or splitting comparisons. However, to help researchers believe that these families were considered separable at the outset of a study, family membership decisions should be pre-registered.
Though the epistemological principles involved in justifying alphas and similar quantities run deep, I don’t believe that a good alpha justification requires more than a paragraph. Ideally, I would like to see this paragraph placed at the start of a Method section in a journal article, as it sets the epistemological stage for everything that comes afterward. For each study type listed above, here are some possible paragraphs to justify a particular alpha, beta, and number of observations. I note that these are riffs on possible justifications; they do not necessarily represent the ways I elected to treat the same data detailed in the first sentences of each paragraph. To determine appropriate corrections for unreliability in measures when computing power estimates, I used Rosnow and Rosenthal’s (2003) conversions of effect sizes to correlations (rs).
Study type A): We planned to sample from a convenience population of undergraduates to provide precise estimates of two families of five effects each; we expected all of these effects to be relatively small. Because our measures in this study have historically been relatively reliable (i.e., internal consistencies > .80), we planned our study to detect a minimum correlation of .08, as that corresponds to a presumed population correlation of .10 (or proportion of variance shared of 1%) measured with instruments with at least 80% true score variance. We also recognized that our study was conducted entirely with self-report measures, making it possible that method variance would contaminate some of our findings. As a result, we adopted a critical one-tailed α level of .005. Because we believed that spuriously detecting an effect was ten times as undesirable as failing to detect an effect, we chose to run enough participants to ensure we had 95% power to detect a correlation of .08. We used the Holm-Bonferroni method to provide a family-wise error rate correction entailing a minimum α of .001 for the largest effect in the study in each of the two families. This required a sample size of 3486 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with this population (e.g., Molina, Pierce, Bergquist, & Benning, 2018), we anticipated that 2% of our sample would produce invalid personality profiles that would require replacement to avoid distorting the results (Benning & Freeman, 2017). Consequently, we targeted a sample size of 3862 participants to anticipate these replacements.
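As a sanity check on the arithmetic above, the attenuation and sample-size figures can be approximated in a few lines of Python using the Fisher z transformation. This normal approximation lands close to, but not exactly at, G*Power’s exact-test figure of 3486; the reliabilities, alpha, and power are the ones stated in the paragraph.

```python
from math import atanh, ceil
from statistics import NormalDist

# Attenuation: the observed correlation is the population correlation
# times the square root of the product of the two measures' reliabilities.
rho = 0.10            # presumed population correlation
rel_x = rel_y = 0.80  # internal consistencies of the two measures
r_obs = rho * (rel_x * rel_y) ** 0.5  # 0.08, the minimum detectable r

# Required n for a one-tailed test via the Fisher z approximation,
# at the Holm floor of alpha = .001 with 95% power.
alpha, power = 0.001, 0.95
z_a = NormalDist().inv_cdf(1 - alpha)
z_b = NormalDist().inv_cdf(power)
n = ceil(((z_a + z_b) / atanh(r_obs)) ** 2 + 3)  # close to G*Power's 3486
```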
Study type B): We planned to sample from a convenience population of undergraduates to provide initial estimates of the extent to which four pleasant picture contents potentiated the postauricular reflex compared to neutral pictures. From a synthesis of the extant literature (Benning, 2018), we expected these effects to vary between ds of 0.2 and 0.5, with an average of 0.34. Because our measures in this study have historically been relatively unreliable (i.e., internal consistencies ~ .35; Aaron & Benning, 2016), we planned our study to detect a minimum mean difference of 0.20, as that corresponds to a presumed population mean difference of 0.34 measured with approximately 35% true score variance. We adopted a critical α level of .05 to keep false discoveries at a traditional level as we sought to improve the reliability and effect size of postauricular reflex potentiation. Following conventions that spuriously detecting an effect was four times as undesirable as failing to detect an effect, we chose enough participants to ensure we had 80% power to detect a d of 0.20. We used the Benjamini-Hochberg (1995) method to provide a false detection error rate correction entailing a minimum one-tailed α of .0125 for the largest potentiation in the study. This required a sample size of 156 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with this population (e.g., Ait Oumeziane, Molina, & Benning, 2018), we anticipated that 15% of our sample would produce unscoreable data in at least one condition for various reasons and would require additional data to fill in those conditions. Thus, we targeted a sample size of 180 participants to accommodate these additional participants.
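The Benjamini-Hochberg step-up procedure referenced above can be sketched as follows; the p values are hypothetical, and with four comparisons the smallest p value’s threshold is (1/4) × .05 = .0125, the floor quoted in the paragraph.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flag discoveries under the Benjamini-Hochberg step-up procedure.

    Sort the p values ascending and find the largest rank i (1-based)
    with p_(i) <= (i / m) * alpha; that comparison and every smaller
    p value count as discoveries.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    passed = [False] * m
    for i in order[:cutoff]:
        passed[i] = True
    return passed

# Hypothetical potentiation p values; the step-up logic rescues middling
# p values when a larger one clears its own (i/m) * alpha line:
print(benjamini_hochberg([0.004, 0.030, 0.039, 0.041]))  # all four pass
```

Note the contrast with Holm’s step-down logic: here .030 and .039 exceed their own thresholds, but because .041 sits under its threshold of .05, all four count as discoveries.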
Study type C): We planned to sample from our university’s community mental health clinic to examine how anhedonia manifests itself in depression across seven different measures that are modulated by emotional valence. However, because there were insufficient cases with major depressive disorder in that sampling population, we instead used two different advertisements on Craigslist to recruit local depressed and non-depressed participants who were likely to be drawn from the same population. We believed that a medium effect size for the Valence x Group interaction (i.e., an f of 0.25) represented an effect that would be clinically meaningful in this assessment context. Because our measures’ reliabilities vary widely (i.e., internal consistencies ~ .35-.75; Benning & Ait Oumeziane, 2017), we planned our study to detect a minimum f of 0.174, as that corresponds to a presumed population f of 0.25 measured with approximately 50% true score variance. We adopted a critical α level of .007 in evaluating the Valence x Group interactions, using a Bonferroni correction to maintain a per-family error rate of .05 across all seven measures. To balance the number of participants needed from this selected population with maintaining power to detect effects, we chose enough participants to ensure we had 80% power to detect an f of 0.174. This required a sample size of 280 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with these measures (e.g., Benning & Ait Oumeziane, 2017), we anticipated that 15% of our sample would produce unscoreable data in at least one condition for various reasons and would require additional data to fill in those conditions. Thus, we targeted a sample size of 322 participants to accommodate these additional participants.
Study type D): We planned to sample from the population of survivors of the Route 91 Harvest Festival shooting on October 1, 2017, and from the population of the greater Las Vegas valley area who learned of that shooting within 24 hours of it happening. Because we wanted to sample this population within a month after the incident to examine acute stress reactions – and recruit as many participants as possible – this study did not have an a priori number of participants or targeted power. To maximize the possibility of detecting effects in this unique population, we adopted a critical two-tailed α level of .05 with a per-comparison error rate, as we were uncertain about the possible signs of all effects. Among the 45 comparisons conducted this way, chance would predict approximately 2 to pass threshold. However, we believed the time-sensitive, unrepeatable nature of the sample justified using looser evidential thresholds to speak to the effects in the data.
On New Year’s Eve 2016, Mariah Carey had a…notable performance in which she had difficulties rendering the songs “Emotions” and “We Belong Together”. She roared back on New Year’s Eve 2017, sparking the first meme of 2018.
Alas, it is unlikely that the field of psychophysiology will un-mangle its measurement of emotions with reflexes in such a short span of time.
My lab uses two reflexes to assess the experience of emotion, both of which can be elicited through short, loud noise probes. The startle blink reflex is measured underneath the eye, and it measures a defensive negative emotional state. The postauricular reflex is a tiny reflex behind the ear that measures a variety of positive emotional states. Unfortunately, neither reflex assesses emotion reliably.
When I say “reliably”, I mean an old-school meaning of reliability that addresses what percentage of variability in a measurement’s score is due to the construct it’s supposed to measure. The higher that percentage, the more reliable the measurement. In the case of these reflexes, in the best-case scenarios, about half of the variability in scores is due to the emotion they’re supposed to assess.
That’s pretty bad.
For comparison, the reliability of many personality traits is at least 80%, especially from modern scales with good attention to the internal consistency of what’s being measured. The reliability of height measurements is almost 95%.
Why is reflexive emotion’s reliability so bad?
Part of it likely stems from the fact that (at least in my lab), we measure emotion as a difference of reactivity during a specific emotion versus during neutral. For the postauricular reflex, we take the reflex magnitude during pleasant pictures and subtract from that the reflex magnitude during neutral pictures. For the startle blink, we take the reflex magnitude during aversive pictures and subtract from that the reflex magnitude during neutral pictures. Differences can have lower reliabilities than single measurements because the unreliability in both emotion and neutral measures compound when making the difference scores.
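The compounding described above follows from the classical formula for the reliability of a difference score: assuming the two condition scores have equal variances, it is the average of the two reliabilities minus their intercorrelation, divided by one minus the intercorrelation. The numbers below are illustrative, not estimates from my data.

```python
def difference_reliability(rel_x, rel_y, r_xy):
    """Reliability of the difference X - Y, assuming X and Y have
    equal variances; r_xy is the correlation between the two scores."""
    return ((rel_x + rel_y) / 2 - r_xy) / (1 - r_xy)

# Hypothetical condition scores that are each moderately reliable (.70)
# but strongly intercorrelated (.55), as when overall reflex size drives
# both the emotional and neutral magnitudes:
print(difference_reliability(0.70, 0.70, 0.55))  # ~0.33

# With uncorrelated components, the difference retains the components'
# average reliability:
print(difference_reliability(0.70, 0.70, 0.0))   # ~0.70
```

The stronger the correlation between the emotion and neutral scores, the more reliability the difference score sheds, which is exactly the situation with reflex magnitudes.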
However, it’s even worse when we use reflex magnitudes just during pleasant or aversive pictures. In fact, it’s so bad that I’ve found both reflexes have negative reliabilities when measured just as the average magnitude during either pleasant or aversive pictures! That’s a recipe for a terrible, awful, no good, very bad day in the lab. That’s why I don’t look at reflexes during single emotions by themselves as good measures of emotion.
Now, some of these difficulties look like they can be alleviated if you look at raw reflex magnitude during each emotion. If you do that, it looks like we could get reliabilities of 98% or more! So why don’t I do this?
Because from person to person, reflex magnitudes during any stimulus can differ more than 100-fold, which means that raw reflex magnitudes mostly measure a person’s overall reflex size – irrespective of any emotional state the person’s in at that moment.
Let’s take the example of height again. Let’s also suppose that feeling sad makes people’s shoulders stoop and heads droop, so they should be shorter (that is, have a lower height measurement) whenever they’re feeling sad. I have people stand up while watching a neutral movie and a sad movie, and I measure their height four times during each movie to get a sense of how reliable the measurement of height is.
If all I do is measure the reliability of people’s mean height across the four sadness measurements, I’m likely to get a really high value. But what have I really measured there? Well, it’s just basically how tall people are – it doesn’t have anything to do with the effect of sadness on their height! To understand how sadness specifically affects people’s heights, I’d have to subtract their measured height in the neutral condition from that in the sad condition: a difference score.
Furthermore, if I wanted to take out entirely the variability associated with people’s heights from the effects of sadness I’m measuring (perhaps because I’m measuring participants whose heights vary from 1 inch to 100 inches), I can use a process called “within-subject z scoring”, which is what I use in my work. It doesn’t seem like the overall reflex magnitude people have predicts many interesting psychological states, so I feel confident in this procedure. Though my measurements aren’t great, at least they measure what I want to some degree.
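A minimal sketch of within-subject z scoring, with hypothetical magnitudes: each person’s reflexes are standardized against that person’s own mean and standard deviation, so two people with vastly different raw magnitudes end up on the same scale.

```python
from statistics import mean, pstdev

def within_subject_z(magnitudes):
    """z score each trial against that person's own mean and standard
    deviation, removing between-person differences in overall reflex size."""
    mu, sigma = mean(magnitudes), pstdev(magnitudes)
    return [(m - mu) / sigma for m in magnitudes]

# Two hypothetical participants whose raw magnitudes differ 100-fold
# but who show the same relative pattern across four trials; after
# within-subject z scoring their profiles are numerically identical:
small = within_subject_z([1.0, 2.0, 3.0, 2.0])
large = within_subject_z([100.0, 200.0, 300.0, 200.0])
```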
What could I do to make reflexive measures of emotion better? Well, I’ve used four noise probes in each of four different picture contents to cover a broad range of positive emotions. One thing I could do is target a specific emotion within the positive or negative emotional domain and probe it sixteen times. Though it would reduce the generalizability of my findings, it would substantially improve reliability of the reflexes, as reliabilities tend to increase the more trials you include (because random variations have more opportunities to get cancelled out through averaging). For the postauricular reflex, I could also present lots of noise clicks instead of probes to increase the number of reflexes elicited during each picture. Unfortunately, click-elicited and probe-elicited reflexes only share about 16% of their variability, so it may be difficult to argue they’re measuring the same thing. That paper also shows you can’t do that for startle blinks, so that’s a dead end method for that reflex.
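The trial-count logic can be quantified with the Spearman-Brown prophecy formula: quadrupling the number of probes for a single emotion (from 4 to 16) would, in principle, lift a best-case 50% reliability to about 80%. This is a projection under the formula’s assumption of parallel trials, not an empirical result.

```python
def spearman_brown(rel, k):
    """Projected reliability after multiplying the number of trials by k,
    assuming the new trials behave like the existing ones (parallel trials)."""
    return k * rel / (1 + (k - 1) * rel)

# Going from 4 probes per emotion to 16 (k = 4), starting from the
# best-case ~50% reliability mentioned earlier:
print(spearman_brown(0.50, 4))  # 0.8
```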
In short, there’s a lot of work to do before the psychophysiology of reflexive emotion can relax with its cup of tea after redeeming itself with a reliable, well-received performance (in the lab).
Our study about the Route 91 shooting represents a substantial addition to my lab’s research skills portfolio. Specifically, it’s my first foray into anything remotely resembling community psychology, in which researchers actively engage in helping to solve problems in an identified community. In this case, I thought of the Route 91 festival survivors (and potentially the broader Las Vegas community affected by the shooting) as the community. Below are the steps we used to conduct time-sensitive research in this community, with eyes both toward doing the best science possible and toward serving the community from which our participants were drawn.
0. In time-sensitive situations, call or visit your Institutional Review Board (IRB) in person and delegate work.
Because research cannot be performed without IRB approval, I gave my IRB a call as soon as a) I had a concrete idea for the study I wanted to do and b) support from my lab to work as hard as necessary to make it happen. In my case, that was the Friday after the shooting – and three days before I was scheduled to fly out of the country for a conference. Fortunately, I was able to get our IRB’s administrative personnel on the phone, and Dax Miller guided me through the areas we’d need to make sure we addressed for a successful application. He also agreed to perform a thorough administrative review on Sunday so that I could get initial revisions back to the IRB before leaving. We made those revisions so that an IRB reviewer could look the study over by Wednesday; that reviewer had additional questions I could address from afar, and the study was approved by Thursday afternoon in Vegas (or early Friday morning in Europe). Without that kind of heads-up teamwork from the IRB, we simply couldn’t have done this study, which sought to look at people’s stories of the trauma along with their symptoms within the acute stress window of a month of the shooting.
I also drafted my lab to perform a number of tasks I simply couldn’t do by myself in such a short period of time. Three students provided literature to help me conceptualize the risks and benefits this study might pose. Two worked to provide a list of therapeutic resources for participants. Two others scoured the internet for various beliefs people espoused about the shooting to develop a measure of those. Yet another two programmed the study in Qualtrics and coordinated the transfer of a study-specific ID variable so that we could keep contact information separate from participants’ stories and scores, one of whom also drafted social media advertisements. A final student created a flyer to use for recruiting participants (including a QR code to scan instead of forcing participants to remember our study’s URL). Again, without their help, this study simply couldn’t have been done, as I was already at or exceeding my capacity to stay up late in putting together the IRB application and its supporting documentation (along with programming in personality feedback in Qualtrics as our only incentive for participating).
1. Social media can be your recruitment friend, as can internal email lists.
We spread our flyers far and wide, including a benefit event for Route 91 survivors as well as coffee shops, community bulletin boards, and other such locations across the greater Las Vegas valley. In addition, my RAs used their social media accounts to help promote our study with the IRB-approved text, as did I. Other friends took up the cause and shared posts, spreading the reach of our study into the broader Las Vegas community in ways that would have been impossible otherwise.
A number of UNLV students, faculty, and staff had also attended Route 91 (and all were affected in some way by the shooting), so we distributed our study through internal email lists. At first, I had access to send an announcement through the College of Liberal Arts’ weekly student email list along with the faculty and staff daily email. After word spread of the study (see the point below), I was also allowed to send a message out to all students at the university. Those contacts helped bump our recruitment substantially, getting both people at the festival and from the broader Las Vegas community in the study.
2. The news media extends your reach even more deeply into the community, both for recruitment and dissemination.
Over the years, I’ve been fortunate enough to have multiple members of the news media contact me about stories they’re doing that can help put psychological research into context for the public. At first, I thought of contacting them as asking them to return the favor by helping get the word out about my study. However, as I did so, I also recognized that approaching them with content relevant to their beats may have made their jobs mildly easier. They have airtime or column inches to fill, and if you provide them meaningful stories, it’ll save them effort in locating material to fill that time. Thus, if you’re prepared with a camera- or phone-ready set of points, both you and your media contacts can have a satisfying professional relationship.
I made sure to have a reasonable and concise story about what the study was about, what motivated it, what all we were looking at, and what the benefits might be to the community. That way, the journalists had plain-English descriptions of the study that could be understood by the average reader or viewer and that could be used more or less as-is, without a lot of editing. In general, I recommend having a good handle on about 3 well-rehearsed bullet points you want to make sure you get across – and that are expressed in calm, clear language you could defend as if in peer review. Those points may not all fit in with the particular story that the journalist is telling, but they’ll get the gist, especially if you have an action item at the end to motivate people. For me, that was my study’s web address.
As time went on, more journalists started contacting me. I made sure I engaged all of them, as I wanted the story of our study out in as many places as possible. Generally speaking, with each new story that came out, I had 5-10 new people participate in the study. If your study is interesting, it may snowball, and you never know which media your potential community participants might consume. The university’s press office helped in getting the word out as well, crafting a press release that was suitable for other outlets to pick up and modify.
The media can also help you re-engage your community during and after research has commenced (which I discuss more in point 5 below). They have a reach beyond your specific community you’ll likely never have, and they can help tell your community’s story to the larger world. Again, it’s imperative to do so in a way that’s not stigmatizing or harmful (see point 4 below), but you can help make prominent the voices of people who otherwise wouldn’t be heard or considered.
3. Lead your approach to community groups directly with helpful resources after building credibility.
Another good reason to approach the media beyond increasing participation immediately is that having a public presence for your research will give you more credibility when approaching your community of interest directly. After about a week of press, one of the participants mentioned a survivors’ Facebook group, and I believed the time was right to make direct inquiries to the community I wanted to help. To that end, I messaged the administrators of survivors’ Facebook groups, asking them to post the free and reduced-cost therapy resources we gave to participants after the study. I was also careful not to ask to join the groups, as I didn’t want to run the risk of violating that community’s healing places.
Two of the groups’ administrators asked me to join the group directly to post them, and I was honored they asked me to do so. However, in those groups, I confined myself to being someone who posted resources when general calls went out rather than offering advice about coping with trauma. I didn’t want to over-insinuate myself into the group and thereby distort their culture, and I also wanted to maintain a professionally respectful distance to allow the group to function as a community resource. A couple of other groups said they would be willing to consider posting on my behalf but that the groups were closed to all but survivors. I thanked them for their consideration and emphasized I just wanted to spread the word about available resources.
All in all, it seems imperative to approach a community with something to give, rather than just wanting to receive from them. In this unique case, I had something to offer almost immediately. However, if it’s not clear what you might bring to the table, research your community’s needs and talk to some representatives to see what they might need. To the extent your professional skills might help and that the community believes you’ll help them (and not harm them), you’re more likely to get accepted into the community to conduct your research.
4. Engage the community in developing your study.
I learned quickly about forming a partnership with the community in developing my research when a member of a Route 91 survivors’ group contacted me about our study. She was local and had a background in psychology, making her an excellent bridge between my research team and the broader community. She zipped through IRB training and provided invaluable feedback about the types of experiences people have had after Route 91 (and helped develop items to measure those), along with feedback about a plan to compensate participants (confirming that offering the opportunity to donate compensation to a victims’ fund might alleviate some people’s discomfort). She also gave excellent advice about how to present the study’s results to the community, down to the colors used on the graphs to make the meaning of the curves more obvious. Consistent with best practices in community psychology, I intend to have her as an author on the final paper(s) so that the community has a voice in this research’s reports.
Though the ad hoc, geographically dispersed nature of this community makes more centralized planning with it more challenging (especially with a short time frame), I hope our efforts thus far have stayed true to the community’s perceptions and have avoided stigmatizing them. In communities with leadership structures of their own, engaging those leaders in study planning, participation, and dissemination helps make the research truer to the community’s experience and will likely make people more comfortable with participating. Those people may want to make changes that may initially seem to compromise your goals for a study, but in this framework, the community is a co-creator of the research. If you can’t explain well why certain procedures you really want to use are important in ways the community can accept, then you’ll need to listen to the community to figure out how to work together. Treat education as a two-way street: You have a lot to learn about the community, and you can also show them the ins and outs of research procedures, including why certain procedures (e.g., informed consent) have been developed to protect participants, not harm them.
5. Give research results back to your community in an easily digestible form.
In community psychology, the research must feed back into the community somehow. Because we’re not doing formal interventions in this study, the best I think we can do at this point is share our results in a format that’s accessible to people without a statistical education. In the web page I designed to do just that, I use language as plain as I can to describe our findings without giving tons of numbers in the text. In the numerical graphs I feature, I’ve used animated GIFs to introduce the public to the layers comprising a graph rather than expecting them to comprehend the whole thing at once. I hope that it works.
I also posted my findings on all the groups that had me and engaged reporters who’d asked for follow-up stories once we had our first round of data collected so that their investment in helping me recruit would bear fruit. Many Route 91 survivors reported being perceived as not having anything “real” wrong with them or being misunderstood by their families, friends, coworkers, or romantic partners. Thus, I tried to diagram how people at Route 91 had much higher levels of post-traumatic stress than people in the community, such that about half of them would qualify for a provisional PTSD diagnosis if their symptoms persisted for longer than a month.
6. Advocate for your community.
This is one of the trickier parts of this kind of research for me, as I don’t want to speak as a representative for a broad, decentralized community of which I’m ultimately not a part. Nevertheless, I think data from this research could help advocate for the survivors in their claims, especially those who may not be eligible for other kinds of compensatory funds. I only found out about the town hall meetings of the Las Vegas Victims Fund as they were happening, so I was unable to provide an in-person comment to the board administering the funds. Fortunately, Nicole Raz alerted me to the videos of the town halls, and I was privileged to hear the voices of those who want to be remembered in this process. Right now, I’m drafting a proposal based on this study’s data and considerations of how the disability weights assigned to PTSD by the World Health Organization compare to other conditions that may be granted compensation.
In essence, I’m hoping to make a case that post-traumatic stress is worth compensating, especially given that preliminary results suggest that post-traumatic stress symptomatology as a whole doesn’t seem to have declined in this sample over the course of a month. One of the biggest problems facing this particular victims’ fund is that there are tens of thousands of possible claimants unlike just about any other mass tragedy in modern US history, so the fund administrators have terribly difficult decisions to make. I hope to create as effective an argument as possible for their consideration, and I also hope to make those who are suffering aware of other resources that may help them reduce the burden dealing with the shooting has placed on them.
7. Use statistical decision thresholds that reflect the relative difficulty of sampling from your community.
This is a point that’s likely of interest only to researchers, but it bears heavily on how you conceptualize your study’s design and analytic framework when writing for professional publication. In this case, I knew I was dealing with (hopefully) a once-in-a-lifetime sample. Originally, I was swayed by arguments to define statistical significance more stringently and computed power estimates based on finding statistically significant effects at the .005 level with 80% power using one-tailed tests. My initial thought was that I wanted any results about which I wrote to have as much evidential value as possible.
However, as I took to heart calls I’ve joined to justify one’s threshold for discussing results instead of accepting a blanket threshold, I realized that was too stringent a standard to uphold given the unrepeatable nature of this sample. I recognized I was willing to trade a lower evidential threshold for the ability to discuss more fully the results of our study. To that end, I’m now thinking we should use an alpha level of .05, corrected for multiple comparisons within fairly narrowly defined families of results using the Holm-Bonferroni method.
Specifically, for each conceptual set of measures (i.e., psychopathology, normal-range personality, other personality, well-being, beliefs about the event, and demographics), I’ll adjust the critical p value through dividing .05 by the number of measures in that family. We have two measures of psychopathology (i.e., the PCL-5 and PHQ-9), 11 normal-range personality traits, 3 other personality traits, 5 measures of well-being, and (probably) 2 measures of beliefs. Thus, if I’m interested in how those at the festival vs. those who weren’t at the festival differed in their normal range personality traits, I could conduct a series of 11 independent sample Welch’s t tests (potentially after a MANOVA involving all traits suggested there are some variables whose means differ between groups).
I’d evaluate the significance of the largest difference at a critical value of .05/11, the second largest (if that first one is significant) at a critical value of .05/10, and so on until the comparison is no longer significant. For my psychopathology variables, I’d evaluate (likely) the PTSD difference first at a critical value of .05/2, then (likely) the depression difference at a critical value of .05/1 (or .05).
That way, I’ll keep my overall error rate at .05 within a conceptual family of comparisons without overcorrecting for multiple comparisons. When dealing with correlations of variables across families of comparisons, I’ll use the larger family’s number in the initial critical value’s denominator. This procedure seems to balance having some kind of evidential value (albeit potentially small) with these findings and a reasonable amount of statistical rigor. Using the new suggested threshold, I’d have to divide .005 by the number of comparisons in a family to maintain my stated family-wise error rate, which would make for some incredibly difficult thresholds to meet!
There are other design decisions I made (e.g., imputing missing values of many study measures using mice rather than only using complete cases in analyses) that also furthered my desire to keep as many voices represented as possible and make our findings as plausible as we can. In our initial study design, we also did not pay participants so that a) there wouldn’t be undue inducement to participate, b) we could accurately estimate the costs of the study when having no idea how many people might actually sign up, and c) we wouldn’t have to worry as much about the validity of responses that may have been driven more by the desire to obtain money than to provide accurate information. In each case, I intend on reporting these justifications and registering them before conducting data analyses to provide as much transparency as possible, even in a situation in which genuine preregistration wasn’t possible.