How to justify your alpha: step by step

Do you need alpha? | Study goals? | Minimum effect size of interest? | Justify your alpha | Justify your beta | How to control for multiple comparisons? | Examples

I joined a panoply of scholars who argue that it is necessary to justify your alpha (that is, the acceptable long-run proportion of false positive results in a specific statistical test) rather than redefine a threshold for statistical significance across all fields. We’d hoped to emphasize that more than just alpha needed justification, but space limitations (and the original article’s title) focused our response on that specific issue. We also didn’t have the space to include specific examples of how to justify alpha or other kinds of analytic decisions. Because people are really curious about this issue, I elaborate here my thought process about how one can justify alpha and other analytic decisions.

 

0) Do you even need to consider alpha?

But not about the Bayes. Just frequentist trouble.

The whole notion of an alpha level doesn’t make sense in multiple statistical frameworks. For one, alpha is only meaningful in the frequentist school of probability, in which we consider how often a particular event would occur in the real world out of the total number of such possible events. In Bayesian statistics, the concern is typically adjusting the probability that a proposition of some sort is true, which is a radically different way of thinking about statistical analysis. Indeed, the Bayes factor represents how much more likely the obtained data are under an alternative hypothesis than under a null hypothesis. There’s also no clear mapping of p values to Bayes factors, so many discussions about alpha aren’t relevant if that’s the statistical epistemology you use.

Fishers (but not dredgers) of data.

Another set of considerations comes from Sir Ronald Fisher, one of the original statisticians. In this view, there’s no error control to be concerned with; p values instead represent levels of evidence against a null hypothesis. Crossing a prespecified low p value (i.e., alpha) then entails rejection of a statistical null hypothesis. However, this particular point of view has fallen out of favor, and Fisher’s attempts to construct a system of fiducial inference also ended up being criticized. Finally, there are systems of statistical inference based on information theory and non-Bayesian likelihood estimation that do not entail error control in their epistemological positions.

I come from a clinical background and teach the psychodiagnostic assessment of adults. As a result, I come from a world in which I must use the information we gather to make a series of dichotomous decisions. Does this client have major depressive disorder or not? Is this client likely to benefit from exposure and response prevention interventions or not? This framework extends to my thinking about how to conduct scientific studies. Will I pursue further work on a specific measure or not? Will my research presume this effect is worth including in the experimental design or not? Thus, the error control epistemological framework (born of Neyman-Pearson reasoning about statistical testing) seems to be a good one for my purposes, and I’m reading more about it to verify that assumption. In particular, I’m interested in disambiguating this kind of statistical reasoning from the common practice of null hypothesis significance testing, which amalgamates Fisherian inferential philosophy and Neyman-Pearson operations.

I don’t argue that it’s the only relevant possible framework to evaluate the results of my research. That’s why I still report p values in text whenever possible (the sea of tabular asterisks can make precise p values difficult to report there) to allow a Fisherian kind of inference. Such p value reporting also allows those in the Neyman-Pearson tradition to use different decision thresholds in assessing my work. It should also be possible to use the n/df and values of my statistics to compute Bayes factors for the simpler statistics I report (e.g., correlations and t tests), though more complex inferences may be difficult to reverse engineer.

 

1) What are the goals of your study?

My lab has run four different kinds of studies, each of which has unique goals based on the population being sampled, the methods being used, and the practical importance of the questions being asked. A) One kind of study uses easy-to-access and convenient students or MTurk workers as a proxy for “people at large” for studies involving self-report assessments of various characteristics. B) Another study draws from a convenient student population to make inferences about basic emotional functioning through psychophysiological or behavioral measures. C) A third study draws from (relatively) easily sampled clinical populations in the community to bridge self-report and clinical symptom ratings, behavioral, and psychophysiological methods of assessment. D) The last study type comes from my work on the Route 91 shooting, in which the population sampled is time-sensitive, non-repeatable, and accessible only through electronic means. In each case, the sampling strategy from these populations entails constraints on generality that must be discussed when contextualizing the findings.

Five at a time, please.

In study type A), I want to be relatively confident that the effects I observe are there. I’m also cognizant that measuring everything through self-report has attendant biases, so it’s possible processes like memory inaccuracies, self-representation goals, and either willful or unintentional ignorance may systematically bias results. Additionally, the relative ease of running a few hundred more students (or MTurk workers, if funding is available) makes running studies with high power to detect small effects a simpler proposition than in other study types, as once the study is programmed, there’s very little work needed to run participants through it. Indeed, they often just run themselves! In MTurk studies, it may take a few weeks to run 300 participants; if I run them in the lab, I can run about 400 participants in a year on a given study. Thus, I want to have high power to detect even small effects, I’ll use a lower alpha level to guard against spurious results born of mega-multiple comparisons, and I’m happy to collect large numbers of participants to reduce the errors surrounding any parameter estimates.

With sensors on the face, behind the ear, on the hands and elbows…

Study type B) deals with taking measurements across domains that may be less reliable and that do not suffer from the same single-method biases as self-report studies. I’m willing to achieve a lower evidential value for the sake of being able to say something about these studies. In part, I think for many psychophysiological measures, the field is learning how to do this work well, and we need to walk statistically before we can run. I also believe that to the extent these studies are genuinely cross-method, we may reduce some of the crud factor especially inherent in single-method studies and produce more robust findings. However, these data require two research assistants to spend an hour of their time applying sensors to participants’ bodies and then spend an extra two to three hours collecting data and removing those sensors, so the cost of acquiring new participants is higher than in study type A). For reasons not predictable from the outset, a certain percentage of participants will also not yield interpretable data. However, the participants are still relatively easy to come by, so replacement is less of an issue, and I can plan to run around 150 participants a year if the lab’s efforts are fully devoted to the study. In such studies, I’ll sacrifice some power and precision to run a medium number of participants with a higher alpha level. I also use a lot of within-subjects designs in these kinds of studies to maximize the power of any experimental manipulations, as they allow participants to serve as their own controls.

Finding well-characterized participants imposes substantial personnel costs.

In study type C), I run into many of the same kinds of problems as study type B) except that participants are relatively hard to come by. They come from clinical groups that require substantial recruitment resources to characterize and bring in, and they’re relatively scarce compared to unselected undergraduates. For cost comparison purposes, it’s taken me two years to bring in 120 participants for such a study (with a $100+ session compensation rate to reflect that they’d be in the lab for four hours). Within-subjects designs are imperative here to keep power high, and I also hope that study type B) has shown us how to maximize our effect sizes such that I can power a study to detect medium-sized effects as opposed to small ones.

These can be the most stressful studies to recruit.

Study type D) entails running participants who cannot be replaced in any way, making good measurement imperative to increase precision to detect effects. Recruitment is also a tremendous challenge, and it’s impossible to know ahead of time how many people will end up participating in the study. Nevertheless, it’s still possible to specify desired effect sizes ahead of time to target along with the precision needed to achieve a particular statistical power. I was fortunate to get just over 200 people initially, though our first follow-up had the approximately 125 participants I was hoping to have throughout the study. I haven’t had the ability to bring such people into the lab so far, so it’s the efforts needed in recruitment and maintenance of the sample that represent substantial costs, not the time it takes research assistants to collect the data.

 

2) Given your study’s goals, what’s the minimum effect size that you want your statistical test to detect?

Many different effect sizes can be computed to address different kinds of questions, yet many researchers answer this question by defaulting to an effect size that either corresponds to a lay person’s intuitions about the size of the effect (Cohen, 1988) or a typical effect size in the literature. However, in my research domain, it’s often important to consider whether effect sizes make a practical difference in real-world applications. That doesn’t mean that effects must be whoppingly large to be worth studying; wee effects with cheap interventions that feature small side effect profiles are still well worth pursuing. Nevertheless, whether theoretical, empirical, or pragmatic, researchers should take care to justify a minimum effect size of interest, as this choice will guide the rest of the justification process.

Is it at least this big? Measured reliably? Well, then…

In setting this minimum effect size of interest, researchers should also consider the reliability of the measures being used in a study. All things being equal, more reliable measures will be able to detect smaller effects given the same number of observations. However, savvy researchers should take into account the unreliability of their measures when detailing the smallest effect size of interest. For instance, a researcher may want to detect a correlation of .10 – which corresponds to an effect in which the two measures share 1% of their variance – and the two measures the researcher is using have internal consistencies of .80 and .70. Rearranging Lord and Novick’s (1968) correction formula, the actual smallest effect size of interest should be calculated as .10*√(.80*.70), or .10*√.56, or .10*.748, or .075.
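The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration of the rearranged attenuation formula; the function name `attenuated_effect` is made up for this sketch and does not come from any package.

```python
import math

def attenuated_effect(population_r: float, reliability_x: float, reliability_y: float) -> float:
    """Correlation expected in the observed scores, given a presumed population
    correlation and the reliabilities of the two measures (hypothetical helper)."""
    return population_r * math.sqrt(reliability_x * reliability_y)

# Values from the example above: target r = .10, internal consistencies of .80 and .70.
observed_target = attenuated_effect(0.10, 0.80, 0.70)
print(round(observed_target, 3))  # 0.075
```

Plugging in your own reliabilities shows how quickly the effect you actually have to detect shrinks as measurement gets noisier.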

However, unreliability of measurement is not the only kind of uncertainty that might lead researchers to choose a smaller minimum effect size of interest to detect. Even if researchers consult previous studies for estimates of relevant effect sizes, publication bias and uncertainty around the size of an effect in the population throw additional complications into these considerations. Adjusting the expected effect size of interest in light of these issues may further aid in justifying an alpha.

In the absence of an effect that passes the statistical threshold in a well-powered study, it may be useful to examine whether the observed effect is instead inconsistent with the smallest effect size of interest. In this way, we can articulate whether proceeding as if even that effect is not present is reasonable rather than defaulting to “retaining the null hypothesis”. This step is important for completing the error control process to ensure that some conclusions can be drawn from the study rather than leaving it out in an epistemological no-man’s land should results not pass the justified statistical threshold.

 

3) Given your study’s goals, what alpha level represents an adequate level of sensitivity to detect effects (or signals) of interest balanced against a specificity against interpreting noise?

Graph of global and local maxima and minima

Utility functions. Ideally, the field could compute some kind of utility function whose maximum value represents a balance among sample size from a given population, minimum effect size of interest, alpha, and power. This function could provide an objective alpha to use in any given situation. However, because each of these quantities has costs and benefits associated with them – and the relative costs and benefits will vary by study and investigator – such a function is unlikely to be computable. Thus, when justifying an alpha level, we need to resort to other kinds of arguments. This means that it’s unlikely all investigators will agree that a given justification is sufficient, but a careful layout of the rationale behind the reported alpha combined with detailed reporting of p values would allow other researchers to re-evaluate a set of findings to determine how they comport with those researchers’ own principles, costs, and benefits. I would also argue that there is no situation in which there are no costs, as other studies could always be run in place of one that’s chosen, participants could be allocated to other studies instead of the one proposed, and the time spent programming a study, reducing its data, and analyzing the results are all costs inherent in any study.

Possible justifications. One blog post summarizes traditionally, citationally, normatively principled, bridgingly principled, and objectively justified alphas. The traditional and cited justifications are similar; the cited version simply notes where the particular authority for the alpha level is (e.g., Fisher, 1925) instead of resting on a nebulous “everybody does it.” In this way, the paper about redefining statistical significance provides a one-stop citation for those looking to summarize that tradition and provides a list of authors who collectively represent that tradition.

However, that paper provides additional sets of justifications for a stricter alpha level that entail multiple possible inferential benefits derived from normative or bridged principles. In particular, the authors of that paper bridge frequentist and Bayesian statistical inferential principles to emphasize the added rigor an alpha of .005 would lend the field. They also note that normatively, other fields have adopted stricter levels for declaring findings “significant” or as “discoveries”, and that such a strict alpha level would reduce the false positive rate below that of the field currently while requiring less than double the number of participants to maintain the same level of power.

Who knows what alpha level pseudopodia would find convenient?

Bridging principles. One could theoretically justify a (two-tailed) alpha level of .05 on multiple grounds. For instance, humans tend to think in 5s and 10s, and a 5/100 cutoff seems intuitively and conveniently “rare”. I should also note here that I use the term “convenient” to denote something adopted as more than “arbitrary”, inasmuch as our 5×2 digit hands provide us a quick, shared grouping for counting across humans. I fully expect that species with non-5/10 digit groupings (or pseudopods instead of digits) might use different cutoffs, which would similarly shape their thinking about convenient cutoffs for their statistical epistemology.

Percentiles of the (convex) normal distribution.

Such a cutoff has also been bridged to values of the normal distribution, as a two-tailed alpha of .05 corresponds to values roughly two (1.96) standard deviations away from that distribution’s mean. Because many parametric frequentist statistical tests assume that the sampling distributions of their statistics are normal, this bridge links the alpha level to a fundamental assumption of these kinds of statistical tests.
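To make the bridge to the normal distribution concrete, here is a small sketch that converts an alpha level into the corresponding critical z value using Python’s standard library. The helper name `critical_z` is hypothetical.

```python
from statistics import NormalDist

def critical_z(alpha: float, two_tailed: bool = True) -> float:
    """Distance from the mean of a standard normal distribution (in standard
    deviations) beyond which a proportion alpha of the distribution lies."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

print(round(critical_z(0.05), 2))   # 1.96
print(round(critical_z(0.005), 2))  # 2.81
```

The same function shows how far out in the tails a stricter alpha such as .005 pushes the cutoff.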

Another bridging principle entails considering that lower alpha levels correspond to increasingly severe tests of theories. Thus, a researcher may prefer a lower alpha level if the theory is more well-developed, its logical links are clearer (from the core theoretical propositions to its auxiliary corollaries to its specific hypothetical propositions to the statistical hypotheses to test in any study), and its constructs are more precisely measured.

How to describe findings meeting or exceeding alpha? From a normative perspective, labeling findings as “statistically significant” has led to decades of misinterpretation of the practical importance of statistical tests (particularly in the absence of effect sizes). In our commentary, we encouraged abandoning that phrase, but we didn’t offer an alternative. I propose describing these results as “passing threshold” to reduce misinterpretative possibilities. This term is far less charged with…significance…and may help separate evaluation of the statistical hypotheses under test from larger practical or theoretical concerns.

 

4) Given that alpha level, what beta level (or 1-power) represents an acceptable tradeoff between missing a potential effect of interest and leaving noise uninterpreted?

Absolute power requires…absolute sample size.

Though justifying alpha is an important step, it’s just as important to justify your beta (which is the long-run proportion of false retentions of the null hypothesis). From a Neyman-Pearson perspective, the lower the beta, the more evidential value a null finding possesses. This is also why Neyman-Pearson reasoning is inductive rather than deductive: Null hypotheses have information value as opposed to being defaults that can only be refuted with deductive logic’s modus tollens tests. However, the lower the beta, the more observations are needed to make a given effect size pass the statistical threshold set by alpha. As shown above, one key to making minimum effect sizes of interest larger is measuring that effect more reliably. A second is maximizing the strength of any manipulation such that a larger minimum effect size would be interesting to a researcher.

Another angle on the question leading this section is: How precise would the estimate of the effect size need to be to make me comfortable with accepting the statistical hypothesis being tested rather than just retaining it in light of a test statistic that doesn’t pass the statistical threshold? Just because a test has a high power (on account of a large effect size) doesn’t mean that the estimate of that effect is precise. More observations are needed to make precise estimates of an effect – which also reduces the beta (and thus heightens the power) of a given statistical test.

Power curves visualize the tradeoffs among effect size, beta, and the number of observations. They can aid researchers in determining how feasible it is to have a null hypothesis with high evidential value versus being able to conduct the study in the first place. Some power curves start showing a non-linear relationship between observations and beta when beta is about .20 (or power is about .80), consistent with historical guidelines. However, other considerations may take precedence over the shape of a power curve. Implicitly, traditional alpha (.05) and beta (.20) levels imply that erroneously declaring an effect passes threshold is four times worse than erroneously declaring an effect does not. Some researchers might believe even higher ratios should be used. Alternatively, it may be more important for researchers to fix one error rate or another at a specific value and let the other vary as resources dictate. These values should be articulated in the justification.
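For readers who want to trace a rough power curve themselves, the sketch below approximates the power of a test of a correlation using the Fisher z transformation. This is an approximation chosen for illustration, not the exact routine that programs like G*Power use, so sample sizes may differ by a few observations. The function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def power_for_r(r: float, n: int, alpha: float = 0.05, two_tailed: bool = True) -> float:
    """Approximate power (1 - beta) to detect a correlation r with n observations.
    Uses the Fisher z approximation and ignores the negligible chance of passing
    threshold in the wrong direction."""
    tail_area = alpha / 2 if two_tailed else alpha
    z_crit = _STD_NORMAL.inv_cdf(1 - tail_area)
    expected_z = math.atanh(r) * math.sqrt(n - 3)
    return _STD_NORMAL.cdf(expected_z - z_crit)

def n_for_r(r: float, alpha: float = 0.05, power: float = 0.80, two_tailed: bool = True) -> int:
    """Approximate number of observations needed to reach the desired power."""
    tail_area = alpha / 2 if two_tailed else alpha
    z_alpha = _STD_NORMAL.inv_cdf(1 - tail_area)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

# A crude text "power curve" for r = .10 at the traditional alpha of .05.
for n in (100, 200, 400, 800, 1600):
    print(n, round(power_for_r(0.10, n), 2))

print(n_for_r(0.10, alpha=0.05, power=0.80))
```

Changing `alpha` or `power` in the last line makes the cost of a stricter error rate visible as a required number of observations.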

 

5) Given your study’s goals and the alpha and beta levels above, how can you adjust for multiple comparisons to maintain an acceptable level of power while guarding against erroneous findings?

So many comparisons, so many techniques for dealing with them.

Most studies do not conduct a single comparison. Indeed, many studies toss in a number of different variables and assess their relationships, mediation, and moderation among them. As a result, there are many more comparisons conducted than the chosen alpha and beta levels are designed to guard against! There are four broad methods to use when considering how multiple comparisons impact your stated alpha and beta levels.

Per-comparison error rate (PCER) does not adjust comparisons at all and simply accepts the risk of there likely being multiple spurious results. In this case, no adjustments to alpha or beta need to be made in determining how many observations are needed.
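To see what accepting the per-comparison error rate means in practice, this sketch computes how many comparisons would be expected to pass threshold by chance and how likely at least one such result is. It assumes every null hypothesis is true and, for the second function, that the comparisons are independent; neither assumption will hold exactly in real data. The function names are hypothetical.

```python
def expected_chance_passes(n_comparisons: int, alpha: float) -> float:
    """Expected number of comparisons passing threshold if every null is true."""
    return n_comparisons * alpha

def prob_at_least_one_chance_pass(n_comparisons: int, alpha: float) -> float:
    """Probability of one or more spurious results, ASSUMING independent
    comparisons and that every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_comparisons

# 45 comparisons at alpha = .05, as in the study type D) example later in this post.
print(round(expected_chance_passes(45, 0.05), 2))         # 2.25
print(round(prob_at_least_one_chance_pass(45, 0.05), 2))  # 0.9
```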

False discovery error rate (FDER) allows that there will be a certain proportion of false discoveries in any set of multiple tests; FDER corrections attempt to keep the rate of these false discoveries at the given alpha level. However, this comes at a cost of complexity for those trying to justify alpha and beta, as each comparison uses a different critical alpha level. One common method for controlling FDER adjusts alpha levels for each comparison in a relatively linear fashion, retaining null hypotheses starting from the highest p value to the last one in which p > [(step #)/(total # of comparisons)]*(justified alpha). The remaining comparisons are judged as passing threshold. So, which comparison’s alpha value should be used in justifying comparisons? This may require knowing on average how many comparisons would typically pass threshold within a given comparison set size to plan for a final alpha to justify. After that, the number of observations may need adjusting to maintain the desired beta.
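The step-up procedure described above (commonly attributed to Benjamini and Hochberg) can be sketched as follows. The p values in the example are invented for illustration, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return True for each comparison that passes threshold under the
    step-up false discovery procedure, in the original order of p_values."""
    m = len(p_values)
    # Indices of the comparisons ordered from the smallest to the largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Start at the largest p value and retain nulls until some p value falls
    # at or below (rank / m) * alpha.
    cutoff_rank = 0
    for rank in range(m, 0, -1):
        if p_values[order[rank - 1]] <= rank / m * alpha:
            cutoff_rank = rank
            break
    passes = [False] * m
    # Every comparison ranked at or below the cutoff passes threshold.
    for rank in range(1, cutoff_rank + 1):
        passes[order[rank - 1]] = True
    return passes

# Invented p values: the second one exceeds its own cutoff (.025) but still
# passes because the third one (.035) falls below its cutoff (.0375).
print(benjamini_hochberg([0.001, 0.030, 0.035, 0.200], alpha=0.05))
# [True, True, True, False]
```

Note how the critical alpha differs for every rank, which is the planning headache described above.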

Corrections to the family-wise error rate (FWER) seek to reduce the error rate across a set of comparisons by lowering alpha in a more dramatic way across comparisons than do corrections for FDER. One popular method for controlling FWER entails dividing the desired alpha level by the number of comparisons. If the smallest p value in that set is smaller than alpha/(# comparisons), then it passes threshold and the next smallest p value is compared to alpha/(# comparisons-1). Once the p value of a comparison is greater than that fraction, that comparison and the remaining comparisons are considered not to have passed threshold. This correction has the same problems of an ever-shifting alpha and beta as the FDER, so the same cautions apply.
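Here is a minimal sketch of the step-down procedure just described (the Holm-Bonferroni method). The p values are invented, the function name is hypothetical, and the sketch treats a p value exactly equal to its cutoff as passing, which is the usual convention.

```python
def holm(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return True for each comparison that passes threshold under the
    step-down family-wise procedure, in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p value first
    passes = [False] * m
    for step, index in enumerate(order):
        # The cutoff is alpha / m for the smallest p value, then alpha / (m - 1), ...
        if p_values[index] <= alpha / (m - step):
            passes[index] = True
        else:
            break  # this comparison and all remaining ones do not pass threshold
    return passes

# Same invented p values as one might feed a false discovery correction:
# only the smallest survives here.
print(holm([0.001, 0.030, 0.035, 0.200], alpha=0.05))
# [True, False, False, False]
```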

Per-family error rate (PFER) represents the most stringent error control of all. In this view, making multiple errors across families of comparisons is more damaging than making one error. Thus, tight control should be exercised over the possibility of making an error in any family of comparisons. The Bonferroni correction is one method of maintaining a PFER that is familiar to many researchers. In this case, alpha simply needs to be divided by the number of comparisons to be made, beta adjusted to maintain the appropriate power, and the appropriate number of observations collected.
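The Bonferroni arithmetic is simple enough to do by hand, but a short sketch makes the planning implication explicit: the per-comparison alpha is what should be fed into any power analysis. The function name is hypothetical.

```python
def bonferroni_alpha(family_alpha: float, n_comparisons: int) -> float:
    """Critical alpha for each comparison when dividing a family's alpha evenly."""
    return family_alpha / n_comparisons

# Seven comparisons at a per-family error rate of .05,
# as in the study type C) example later in this post.
print(round(bonferroni_alpha(0.05, 7), 4))  # 0.0071
```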

Many researchers reduce alpha in the face of multiple comparisons to address the PCER without taking steps to address other kinds of error rates formally. Such ad hoc adjustments should at least also report how many tests would pass the statistical threshold by chance alone. Using FDER or FWER control techniques represents a balance between leniency and strictness in error control, though researchers should specify in advance whether false discovery or family-wise error control is more in line with their epistemological stance at a given time. Researchers may prefer to control the PFER when the number of comparisons is kept to a minimum through the use of a few critical, focal tests of a well-developed theory.

Who is “family” among all these people?

In FDER, FWER, and PFER control mechanisms, the notion of “family” must be justified. Is it all comparisons conducted in a study? Is it a set of exploratory factors that are considered separately from focal confirmatory comparisons? Does it group together conceptually similar measures (e.g., normal-range personality, abnormal personality, time-limited psychopathology, and well-being)? All of these and more may be reasonable families to use in lumping or splitting comparisons. However, to help researchers believe that these families were considered separable at the outset of a study, family membership decisions should be pre-registered.

 

6) Examples of justified alphas

Though the epistemological principles involved in justifying alphas and similar quantities run deeply, I don’t believe that a good alpha justification requires more than a paragraph. Ideally, I would like to see this paragraph placed at the start of a Method section in a journal article, as it sets the epistemological stage for everything that comes afterward. For each study type listed above, here are some possible paragraphs to justify a particular alpha, beta, and number of observations. I note that these are riffs off possible justifications; they do not necessarily represent the ways I elected to treat the same data detailed in the first sentences of each paragraph. To determine appropriate corrections for unreliability in measures when computing power estimates, I used Rosnow and Rosenthal’s (2003) conversions of effect sizes to correlations (rs).
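As a rough illustration of that conversion step, the sketch below turns a d or an f into a correlation, attenuates it for unreliability, and converts it back. It assumes the common textbook conversions r = d/√(d² + 4) (equal group sizes) and f = d/2; these may not be exactly the formulas used for the paragraphs that follow, though they reproduce the adjusted d of 0.20 and f of 0.174 reported in study types B) and C). All function names are hypothetical.

```python
import math

def d_to_r(d: float) -> float:
    """Cohen's d to a correlation, assuming equal group sizes."""
    return d / math.sqrt(d ** 2 + 4)

def r_to_d(r: float) -> float:
    return 2 * r / math.sqrt(1 - r ** 2)

def f_to_r(f: float) -> float:
    """Cohen's f to a correlation, assuming f = d / 2."""
    return f / math.sqrt(f ** 2 + 1)

def r_to_f(r: float) -> float:
    return r / math.sqrt(1 - r ** 2)

def attenuate(r: float, true_score_variance: float) -> float:
    """Shrink a correlation for one unreliable measure."""
    return r * math.sqrt(true_score_variance)

# Study type B): population d of 0.34 measured with about 35% true score variance.
print(round(r_to_d(attenuate(d_to_r(0.34), 0.35)), 2))  # 0.2

# Study type C): population f of 0.25 measured with about 50% true score variance.
print(round(r_to_f(attenuate(f_to_r(0.25), 0.50)), 3))  # 0.174
```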

Study type A): We planned to sample from a convenience population of undergraduates to provide precise estimates of two families of five effects each; we expected all of these effects to be relatively small. Because our measures in this study have historically been relatively reliable (i.e., internal consistencies > .80), we planned our study to detect a minimum correlation of .08, as that corresponds to a presumed population correlation of .10 (or proportion of variance shared of 1%) measured with instruments with at least 80% true score variance. We also recognized that our study was conducted entirely with self-report measures, making it possible that method variance would contaminate some of our findings. As a result, we adopted a critical one-tailed α level of .005. Because we believed that spuriously detecting an effect was ten times as undesirable as failing to detect an effect, we chose to run enough participants to ensure we had 95% power to detect a correlation of .08. We used the Holm-Bonferroni method to provide a family-wise error rate correction entailing a minimum α of .001 for the largest effect in the study in each of the two families. This required a sample size of 3486 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with this population (e.g., Molina, Pierce, Bergquist, & Benning, 2018), we anticipated that 2% of our sample would produce invalid personality profiles that would require replacement to avoid distorting the results (Benning & Freeman, 2017). Consequently, we targeted a sample size of 3862 participants to anticipate these replacements.

Study type B): We planned to sample from a convenience population of undergraduates to provide initial estimates of the extent to which four pleasant picture contents potentiated the postauricular reflex compared to neutral pictures. From a synthesis of the extant literature (Benning, 2018), we expected these effects to vary between ds of 0.2 and 0.5, with an average of 0.34. Because our measures in this study have historically been relatively unreliable (i.e., internal consistencies ~ .35; Aaron & Benning, 2016), we planned our study to detect a minimum mean difference of 0.20, as that corresponds to a presumed population mean difference of 0.34 measured with approximately 35% true score variance. We adopted a critical α level of .05 to keep false discoveries at a traditional level as we sought to improve the reliability and effect size of postauricular reflex potentiation. Following conventions that spuriously detecting an effect was four times as undesirable as failing to detect an effect, we chose enough participants to ensure we had 80% power to detect a d of 0.20. We used the Benjamini-Hochberg (1995) method to provide a false discovery error rate correction entailing a minimum one-tailed α of .0125 for the largest potentiation in the study. This required a sample size of 156 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with this population (e.g., Ait Oumeziane, Molina, & Benning, 2018), we anticipated that 15% of our sample would produce unscoreable data in at least one condition for various reasons and would require additional data to fill in those conditions. Thus, we targeted a sample size of 180 participants to accommodate these additional participants.

Study type C): We planned to sample from our university’s community mental health clinic to examine how anhedonia manifests itself in depression across seven different measures that are modulated by emotional valence. However, because there were insufficient cases with major depressive disorder in that sampling population, we instead used two different advertisements on Craigslist to recruit local depressed and non-depressed participants who were likely to be drawn from the same population. We believed that a medium effect size for the Valence x Group interaction (i.e., an f of 0.25) represented an effect that would be clinically meaningful in this assessment context. Because our measures’ reliabilities vary widely (i.e., internal consistencies ~ .35-.75; Benning & Ait Oumeziane, 2017), we planned our study to detect a minimum f of 0.174, as that corresponds to a presumed population f of 0.25 measured with approximately 50% true score variance. We adopted a critical α level of .007 in evaluating the Valence x Group interactions, using a Bonferroni correction to maintain a per-family error rate of .05 across all seven measures. To balance the number of participants needed from this selected population with maintaining power to detect effects, we chose enough participants to ensure we had 80% power to detect an f of 0.174. This required a sample size of 280 participants with analyzable data according to G*Power (Faul, Erdfelder, Buchner, & Lang, 2009). Based on previous studies with these measures (e.g., Benning & Ait Oumeziane, 2017), we anticipated that 15% of our sample would produce unscoreable data in at least one condition for various reasons and would require additional data to fill in those conditions. Thus, we targeted a sample size of 322 participants to accommodate these additional participants.

Study type D): We planned to sample from the population of survivors of the Route 91 Harvest Festival shooting on October 1, 2017, and from the population of the greater Las Vegas valley area who learned of that shooting within 24 hours of it happening. Because we wanted to sample this population within a month after the incident to examine acute stress reactions – and recruit as many participants as possible – this study did not have an a priori number of participants or targeted power. To maximize the possibility of detecting effects in this unique population, we adopted a critical two-tailed α level of .05 with a per-comparison error rate, as we were uncertain about the possible signs of all effects. Among the 45 comparisons conducted this way, chance would predict approximately 2 to pass threshold. However, we believed the time-sensitive, unrepeatable nature of the sample justified using looser evidential thresholds to speak to the effects in the data.

The (mis)measure of emotion through psychophysiology

On New Year’s Eve 2016, Mariah Carey had a…notable performance in which she had difficulties rendering the songs “Emotions” and “We Belong Together”. She roared back on New Year’s Eve 2017, sparking the first meme of 2018.

Alas, it is unlikely that the field of psychophysiology will un-mangle its measurement of emotions with reflexes in such a short span of time.

My lab uses two reflexes to assess the experience of emotion, both of which can be elicited through short, loud noise probes. The startle blink reflex is measured underneath the eye, and it measures a defensive negative emotional state. The postauricular reflex is a tiny reflex behind the ear that measures a variety of positive emotional states. Unfortunately, neither reflex assesses emotion reliably.

When I say “reliably”, I mean an old-school meaning of reliability that addresses what percentage of variability in a measurement’s score is due to the construct it’s supposed to measure. The higher that percentage, the more reliable the measurement. In the case of these reflexes, in the best-case scenarios, about half of the variability in scores is due to the emotion they’re supposed to assess.

That’s pretty bad.

For comparison, the reliability of many personality traits is at least 80%, especially from modern scales with good attention to the internal consistency of what’s being measured. The reliability of height measurements is almost 95%.

Why is reflexive emotion’s reliability so bad?

Part of it likely stems from the fact that (at least in my lab), we measure emotion as a difference of reactivity during a specific emotion versus during neutral. For the postauricular reflex, we take the reflex magnitude during pleasant pictures and subtract from that the reflex magnitude during neutral pictures. For the startle blink, we take the reflex magnitude during aversive pictures and subtract from that the reflex magnitude during neutral pictures. Differences can have lower reliabilities than single measurements because the unreliability in both the emotion and neutral measures compounds when making the difference scores.
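To see how that compounding plays out, here is a small sketch of the classical test theory formula for the reliability of a difference score. It assumes measurement errors in the two conditions are uncorrelated, and the input values are hypothetical rather than taken from my data.

```python
def difference_score_reliability(rel_a, rel_b, r_ab, sd_a=1.0, sd_b=1.0):
    """Classical test theory reliability of the difference score A - B."""
    covariance = r_ab * sd_a * sd_b
    true_variance = rel_a * sd_a**2 + rel_b * sd_b**2 - 2 * covariance
    total_variance = sd_a**2 + sd_b**2 - 2 * covariance
    return true_variance / total_variance

# Hypothetical values: each condition is measured with reliability .80,
# and scores in the two conditions correlate .60 across people.
print(round(difference_score_reliability(0.80, 0.80, 0.60), 2))
```

With these made-up inputs, two fairly reliable measures yield a difference score with much lower reliability, and the drop grows as the correlation between the two conditions rises.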

However, it’s even worse when we use reflex magnitudes just during pleasant or aversive pictures. In fact, it’s so bad that I’ve found both reflexes have negative reliabilities when measured just as the average magnitude during either pleasant or aversive pictures! That’s a recipe for a terrible, awful, no good, very bad day in the lab. That’s why I don’t look at reflexes during single emotions by themselves as good measures of emotion.

Now, it looks like some of these difficulties can be alleviated if you look at raw reflex magnitude during each emotion. If you do that, it looks like we could get reliabilities of 98% or more! So why don’t I do this?

Because from person to person, reflex magnitudes during any stimulus can differ by a factor of more than 100, which means that it’s a person’s overall reflex magnitude that raw reflex magnitudes are measuring – irrespective of any emotional state the person’s in at that moment.

Let’s take the example of height again. Let’s also suppose that feeling sad makes people’s shoulders stoop and heads droop, so they should be shorter (that is, have a lower height measurement) whenever they’re feeling sad. I have people stand up while watching a neutral movie and a sad movie, and I measure their height four times during each movie to get a sense of how reliable the measurement of height is.

If all I do is measure the reliability of people’s mean height across the four sadness measurements, I’m likely to get a really high value. But what have I really measured there? Well, it’s just basically how tall people are – it doesn’t have anything to do with the effect of sadness on their height! To understand how sadness specifically affects people’s heights, I’d have to subtract their measured height in the neutral condition from that in the sad condition: a difference score.

Furthermore, if I wanted to take out entirely the variability associated with people’s heights from the effects of sadness I’m measuring (perhaps because I’m measuring participants whose heights vary from 1 inch to 100 inches), I can use a process called “within-subject z scoring”, which is what I use in my work. It doesn’t seem like the overall reflex magnitude people have predicts many interesting psychological states, so I feel confident in this procedure. Though my measurements aren’t great, at least they measure what I want to some degree.
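Here is a minimal sketch of one common form of within-subject z scoring: each person's trials are standardized against that person's own mean and standard deviation across all conditions before the difference score is taken. The function names and the reflex magnitudes are made up for illustration, and the details of a real scoring pipeline may differ.

```python
from statistics import mean, stdev

def within_subject_z(trials_by_condition):
    """Z-score one person's trials against that person's own mean and SD."""
    all_trials = [x for trials in trials_by_condition.values() for x in trials]
    m, sd = mean(all_trials), stdev(all_trials)
    return {cond: [(x - m) / sd for x in trials]
            for cond, trials in trials_by_condition.items()}

def emotion_score(trials_by_condition, emotion="pleasant", baseline="neutral"):
    """Difference between mean z scores in the emotion and neutral conditions."""
    z = within_subject_z(trials_by_condition)
    return mean(z[emotion]) - mean(z[baseline])

# Made-up magnitudes: person B's reflexes are 100 times larger than person A's.
person_a = {"neutral": [1.0, 2.0], "pleasant": [3.0, 4.0]}
person_b = {"neutral": [100.0, 200.0], "pleasant": [300.0, 400.0]}

print(emotion_score(person_a), emotion_score(person_b))
```

Both people get the same emotion score because the hundredfold difference in overall magnitude is removed, leaving only the pattern across conditions.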

What could I do to make reflexive measures of emotion better? Well, I’ve used four noise probes in each of four different picture contents to cover a broad range of positive emotions. One thing I could do is target a specific emotion within the positive or negative emotional domain and probe it sixteen times. Though it would reduce the generalizability of my findings, it would substantially improve reliability of the reflexes, as reliabilities tend to increase the more trials you include (because random variations have more opportunities to get cancelled out through averaging). For the postauricular reflex, I could also present lots of noise clicks instead of probes to increase the number of reflexes elicited during each picture. Unfortunately, click-elicited and probe-elicited reflexes only share about 16% of their variability, so it may be difficult to argue they’re measuring the same thing. That paper also shows you can’t do that for startle blinks, so that’s a dead end method for that reflex.
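The gain from adding trials can be roughed out with the Spearman-Brown prophecy formula. This sketch assumes the added trials behave like parallel measurements of the same thing, which real reflex trials may not, and the starting reliability of .50 is a hypothetical round number.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the number of trials is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical: reliability of .50 with 4 probes, extended to 16 probes (4x).
print(round(spearman_brown(0.50, 16 / 4), 2))
```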

In short, there’s a lot of work to do before the psychophysiology of reflexive emotion can relax with its cup of tea after redeeming itself with a reliable, well-received performance (in the lab).

Community psychology: Lessons learned from Route 91

This post is long; the links below will take you to potential topics of interest.

BEHIND THE SCENES: Call your IRB & delegate | Justify your statistical decisions

ON THE GROUND: Promote your study | Involve the media | Lead with helpful resources | Listen to community reps | Give results back to community | Advocate for your community

Our study about the Route 91 shooting represents a substantial addition to my lab’s research skills portfolio. Specifically, it’s my first foray into anything remotely resembling community psychology, in which researchers actively engage in helping to solve problems in an identified community. In this case, I thought of the Route 91 festival survivors (and potentially the broader Las Vegas community affected by the shooting) as the community. Below are the steps we used to conduct time-sensitive research in this community, with eyes both toward doing the best science possible and toward serving the community from which our participants were drawn.

0. In time-sensitive situations, call or visit your Institutional Review Board (IRB) in person and delegate work.

Because research cannot be performed without IRB approval, I gave my IRB a call as soon as I had a) a concrete idea for the study I wanted to do and b) support from my lab to work as hard as necessary to make it happen. In my case, that was the Friday after the shooting – and three days before I was scheduled to fly out of the country for a conference. Fortunately, I was able to call our IRB’s administrative personnel, and Dax Miller guided me through the areas we’d need to make sure we addressed for a successful application. He also agreed to perform a thorough administrative review on Sunday so that I could get initial revisions back to the IRB before leaving. We performed the revisions so the study could be looked over by an IRB reviewer by Wednesday, who had additional questions I could address from afar so that the study could be approved by Thursday afternoon in Vegas (or early Friday morning in Europe). Without that kind of heads-up teamwork from the IRB, we simply couldn’t have done this study, which sought to look at people’s stories of the trauma along with their symptoms within the acute stress window of a month after the shooting.

I also drafted my lab to perform a number of tasks I simply couldn’t do by myself in such a short period of time. Three students provided literature to help me conceptualize the risks and benefits this study might pose. Two worked to provide a list of therapeutic resources for participants. Two others scoured the internet for various beliefs people espoused about the shooting to develop a measure of those. Yet another two programmed the study in Qualtrics and coordinated the transfer of a study-specific ID variable so that we could keep contact information separate from participants’ stories and scores, one of whom also drafted social media advertisements. A final student created a flyer to use for recruiting participants (including a QR code to scan instead of forcing participants to remember our study’s URL). Again, without their help, this study simply couldn’t have been done, as I was already at or exceeding my capacity to stay up late in putting together the IRB application and its supporting documentation (along with programming in personality feedback in Qualtrics as our only incentive for participating).

1. Social media can be your recruitment friends, as can internal email lists.

We spread our flyers far and wide, including a benefit event for Route 91 survivors as well as coffee shops, community bulletin boards, and other such locations across the greater Las Vegas valley. My RAs also used their social media to help promote our study with the IRB-approved text, as did I. Other friends took up the cause and shared posts, spreading the reach of our study into the broader Las Vegas community in ways that would have been impossible otherwise.

A number of UNLV students, faculty, and staff had also attended Route 91 (and all were affected in some way by the shooting), so we distributed our study through internal email lists. At first, I had access to send an announcement through the College of Liberal Arts’ weekly student email list along with the faculty and staff daily email. After word spread of the study (see the point below), I was also allowed to send a message out to all students at the university. Those contacts helped bump our recruitment substantially, getting both people at the festival and from the broader Las Vegas community in the study.

2. The news media extends your reach even more deeply into the community, both for recruitment and dissemination.

Over the years, I’ve been fortunate enough to have multiple members of the news media contact me about stories they’re doing that can help put psychological research into context for the public. At first, I thought of contacting them as having them return the favor to me to help get the word out about my study. However, as I did so, I also recognized that approaching them with content relevant to their beats may have made their jobs mildly easier. They have airtime or column inches they have to fill, and if you provide them meaningful stories, it’ll save them effort in locating material to fill that time. Thus, if you’re prepared with a camera- or phone-ready set of points, both you and your media contacts can have a satisfying professional relationship.

I made sure to have a reasonable and concise story about what the study was about, what motivated it, what all we were looking at, and what the benefits might be to the community. That way, the journalists had plain-English descriptions of the study that could be understood by the average reader or viewer and that could be used more or less as-is, without a lot of editing. In general, I recommend having a good handle on about 3 well-rehearsed bullet points you want to make sure you get across – and that are expressed in calm, clear language you could defend as if in peer review. Those points may not all fit in with the particular story that the journalist is telling, but they’ll get the gist, especially if you have an action item at the end to motivate people. For me, that was my study’s web address.

As time went on, more journalists started contacting me. I made sure I engaged all of them, as I wanted the story of our study out in as many places as possible. Generally speaking, with each new story that came out, I had 5-10 new people participate in the study. If your study is interesting, it may snowball, and you never know which media your potential community participants might consume. The university’s press office helped in getting the word out as well, crafting a press release that was suitable for other outlets to pick up and modify.

The media can also help you re-engage your community during and after research has commenced (which I discuss more in point 5 below). They have a reach beyond your specific community you’ll likely never have, and they can help tell your community’s story to the larger world. Again, it’s imperative to do so in a way that’s not stigmatizing or harmful (see point 4 below), but you can help make prominent people whose voices otherwise wouldn’t be heard or considered.

3. Lead your approach to community groups directly with helpful resources after building credibility.

Another good reason to approach the media beyond increasing participation immediately is that having a public presence for your research will give you more credibility when approaching your community of interest directly. After about a week of press, one of the participants mentioned a survivors’ Facebook group, and I believed the time was right to make direct inquiries to the community I wanted to help. To that end, I messaged the administrators of survivors’ Facebook groups, asking them to post the free and reduced-cost therapy resources we gave to participants after the study. I was also careful not to ask to join groups, as I didn’t want to run the risk of violating that community’s healing places.

Two of the groups’ administrators asked me to join the group directly to post them, and I was honored they asked me to do so. However, in those groups, I confined myself to being someone who posted resources when general calls went out rather than offering advice about coping with trauma. I didn’t want to over-insinuate myself into the group and thereby distort their culture, and I also wanted to maintain a professionally respectful distance to allow the group to function as a community resource. A couple of other groups said they would be willing to consider posting on my behalf but that the groups were closed to all but survivors. I thanked them for their consideration and emphasized I just wanted to spread the word about available resources.

All in all, it seems imperative to approach a community with something to give, rather than just wanting to receive from them. In this unique case, I had something to offer almost immediately. However, if it’s not clear what you might bring to the table, research your community’s needs and talk to some representatives to see what they might need. To the extent your professional skills might help and that the community believes you’ll help them (and not harm them), you’re more likely to get accepted into the community to conduct your research.

4. Engage the community in developing your study.

I learned quickly about forming a partnership with the community in developing my research when one of the members of a Route 91 survivors’ group contacted me about our study. I noted that she was local and had a background in psychology, thus making her an excellent bridge between my research team and the broader community. She zipped through IRB training and provided invaluable feedback about the types of experiences people have had after Route 91 (and helped develop items to measure those) along with providing feedback about a plan to compensate participants (confirming that offering the opportunity to donate compensation to a victims’ fund might alleviate some people’s discomfort). She also gave excellent advice about how to present the study’s results to the community, down to the colors used on the graph to make more obvious the meaning of the curves I drew. Consistent with best practices in community psychology, I intend to have her as an author on the final paper(s) so that the community has a voice in this research’s reports.

Though the ad hoc, geographically dispersed nature of this community makes more centralized planning with it more challenging (especially with a short time frame), I hope our efforts thus far have helped us stay true to the community’s perceptions and have avoided stigmatizing them. In communities with leadership structures of their own, engaging those leaders in study planning, participation, and dissemination helps make the research truer to the community’s experience and will likely make people more comfortable with participating. Those people may want to make changes that may initially seem to compromise your goals for a study, but in this framework, the community is a co-creator of the research. If you can’t explain well why certain procedures you really want to use are important in ways the community can accept, then you’ll need to listen to the community to figure out how to work together. Treat education as a two-way street: You have a lot to learn about the community, and you can also show them the ins and outs of research procedures, including why certain procedures (e.g., informed consent) have been developed to protect participants, not harm them.

5. Give research results back to your community in an easily digestible form.

In community psychology, the research must feed back into the community somehow. Because we’re not doing formal interventions in this study, the best I think we can do at this point is share our results in a format that’s accessible to people without a statistical education. In the web page I designed to do just that, I use language as plain as I can to describe our findings without giving tons of numbers in the text. In the numerical graphs I feature, I’ve used animated GIFs to introduce the public to the layers comprising a graph rather than expecting them to comprehend the whole thing at once. I hope that it works.

I also posted my findings on all the groups that had me and engaged reporters who’d asked for follow-up stories once we had our first round of data collected so that their investment in helping me recruit would see fruit. It seemed like many Route 91 survivors reported being perceived as not having anything “real” wrong with them or being misunderstood by their families, friends, coworkers, or romantic partners. Thus, I tried to diagram how people at Route 91 had much higher levels of post-traumatic stress than people in the community, such that about half of them would qualify for a provisional PTSD diagnosis if their symptoms persisted for longer than a month.

6. Advocate for your community.

This is one of the trickier parts of this kind of research for me, as I don’t want to speak as a representative for a broad, decentralized community of which I’m ultimately not a part. Nevertheless, I think data from this research could help advocate for the survivors in their claims, especially those who may not be eligible for other kinds of compensatory funds. I only found out about the town hall meetings of the Las Vegas Victims Fund as they were happening, so I was unable to provide an in-person comment to the board administering the funds. Fortunately, Nicole Raz alerted me to the videos of the town halls, and I was privileged to hear the voices of those who want to be remembered in this process. Right now, I’m drafting a proposal based on this study’s data and considerations of how the disability weights assigned to PTSD by the World Health Organization compare to other conditions that may be granted compensation.

In essence, I’m hoping to make a case that post-traumatic stress is worth compensating, especially given that preliminary results suggest that post-traumatic stress symptomatology as a whole doesn’t seem to have declined in this sample over the course of a month. One of the biggest problems facing this particular victims’ fund is that there are tens of thousands of possible claimants unlike just about any other mass tragedy in modern US history, so the fund administrators have terribly difficult decisions to make. I hope to create as effective an argument as possible for their consideration, and I also hope to make those who are suffering aware of other resources that may help them reduce the burden dealing with the shooting has placed on them.

7. Use statistical decision thresholds that reflect the relative difficulty of sampling from your community.

This is a point that’s likely of interest only to researchers, but it bears heavily on how you conceptualize your study’s design and analytic framework when writing for professional publication. In this case, I knew I was dealing with (hopefully) a once-in-a-lifetime sample. Originally, I was swayed by arguments to define statistical significance more stringently and computed power estimates based on finding statistically significant effects at the .005 level with 80% power using one-tailed tests. My initial thought was that I wanted any results about which I wrote to have as much evidential value as possible.

However, as I took to heart calls I’ve joined to justify one’s threshold for discussing results instead of accepting a blanket threshold, I realized that was too stringent a standard to uphold given the unrepeatable nature of this sample. I recognized I was willing to trade a lower evidential threshold for the ability to discuss more fully the results of our study. To that end, I’m now thinking we should use an alpha level of .05, corrected for multiple comparisons using the Holm-Bonferroni method within fairly narrowly defined families of results.
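To give a feel for the trade-off, here is a rough sketch comparing the sample size per group needed for 80% power under the two thresholds. It uses a normal approximation for a simple two-group mean comparison and a hypothetical standardized difference of 0.5, neither of which comes from this study. The function name is also just for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha, power, tails):
    """Normal-approximation sample size per group for a two-group mean difference."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / tails)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

d = 0.5  # hypothetical standardized difference, not taken from the study
n_strict = n_per_group(d, alpha=0.005, power=0.80, tails=1)
n_looser = n_per_group(d, alpha=0.05, power=0.80, tails=2)
print(n_strict, n_looser)
```

When the sample can't be enlarged or repeated, the stricter threshold shows up as lower power rather than as a larger study.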

Specifically, for each conceptual set of measures (i.e., psychopathology, normal-range personality, other personality, well-being, beliefs about the event, and demographics), I’ll set the initial critical p value by dividing .05 by the number of measures in that family. We have two measures of psychopathology (i.e., the PCL-5 and PHQ-9), 11 normal-range personality traits, 3 other personality traits, 5 measures of well-being, and (probably) 2 measures of beliefs. Thus, if I’m interested in how those at the festival vs. those who weren’t at the festival differed in their normal-range personality traits, I could conduct a series of 11 independent-samples Welch’s t tests (potentially after a MANOVA involving all traits suggested there are some variables whose means differ between groups).

I’d evaluate the significance of the largest difference at a critical value of .05/11, the second largest (if that first one is significant) at a critical value of .05/10, and so on until the comparison is no longer significant. For my psychopathology variables, I’d evaluate (likely) the PTSD difference first at a critical value of .05/2, then (likely) the depression difference at a critical value of .05/1 (or .05).
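The step-down procedure just described can be sketched as a short function. This version orders the tests by p value (smallest first), which is how the Holm-Bonferroni method is usually stated. The p values in the example are hypothetical, not results from this study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a dict mapping each label to True if its null is rejected."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = {label: False for label in p_values}
    for step, (label, p) in enumerate(ordered):
        critical = alpha / (m - step)   # .05/m, then .05/(m-1), ...
        if p > critical:
            break                       # stop at the first non-significant test
        rejected[label] = True
    return rejected

# Hypothetical p values for the two psychopathology measures.
print(holm_bonferroni({"PCL-5": 0.010, "PHQ-9": 0.040}))
print(holm_bonferroni({"PCL-5": 0.030, "PHQ-9": 0.040}))
```

In the first call both tests pass (.010 against .05/2, then .040 against .05/1). In the second, the smallest p value misses .05/2, so testing stops and neither is rejected.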

That way, I’ll keep my overall error rate at .05 within a conceptual family of comparisons without overcorrecting for multiple comparisons. When dealing with correlations of variables across families of comparisons, I’ll use the larger family’s number in the initial critical value’s denominator. This procedure seems to balance having some kind of evidential value (albeit potentially small) with these findings and a reasonable amount of statistical rigor. Using the new suggested threshold, I’d have to divide .005 by the number of comparisons in a family to maintain my stated family-wise error rate, which would make for some incredibly difficult thresholds to meet!
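To show how much harder the .005 starting point would be, this sketch prints the most stringent (first-step) critical value for each family under both thresholds. The family sizes are the ones listed above. Demographics is left out because no count is given, and the beliefs count is the tentative one.

```python
family_sizes = {
    "psychopathology": 2,
    "normal-range personality": 11,
    "other personality": 3,
    "well-being": 5,
    "beliefs": 2,   # "(probably) 2" in the text
}

def first_step_threshold(alpha, n_measures):
    """Most stringent critical value in a family (the first Holm step)."""
    return alpha / n_measures

for family, m in family_sizes.items():
    at_05 = first_step_threshold(0.05, m)
    at_005 = first_step_threshold(0.005, m)
    print(f"{family}: {at_05:.4f} (alpha = .05) vs {at_005:.5f} (alpha = .005)")
```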

There are other design decisions I made (e.g., imputing missing values of many study measures using mice rather than only using complete cases in analyses) that also furthered my desire to keep as many voices represented as possible and make our findings as plausible as we can. In our initial study design, we also did not pay participants so that a) there wouldn’t be undue inducement to participate, b) we could accurately estimate the costs of the study when having no idea how many people might actually sign up, and c) we wouldn’t have to worry as much about the validity of responses that may have been driven more by the desire to obtain money than to provide accurate information. In each case, I intend on reporting these justifications and registering them before conducting data analyses to provide as much transparency as possible, even in a situation in which genuine preregistration wasn’t possible.

A template for reviewing papers

Peer review’s technology (but not volume) has changed over the decades.

The current culture of science thrives on peer review – that is, the willingness of your colleagues to read through your work, critique it, and thereby improve it. Science magazine recently collected a slew of tips on how to review papers, which give people getting started in the process of peer reviewing some lovely overarching strategies about how to prepare a review.

But how can you keep in your head all those pieces of good advice and apply them to the specifics of a paper in front of you? I’d argue that like many human endeavors, it’s impossible. There are too many complexities in each paper to collate loads of disparate recommendations and keep them straight in your head. To that end, I’ve created a template for reviewing papers our lab either puts out or critiques. Not incidentally, I highly recommend using your lab group as a first round of review before sending papers out for review, as even the greenest RA can parse the paper for problems in logic and comprehensibility (inculding teh dreded “tpyoese”).

To help my lab out in doing this, I’ve prepared the following template. It organizes questions I typically have about various pieces of manuscripts, and I’ve found that undergrads give nice reviews with its help. In particular, I find it helps them focus on things beyond the analytic details to which they may have not been exposed so that they don’t feel so overwhelmed. It may also be helpful for more experienced reviewers to judge what they could contribute as a reviewer in an unfamiliar topic or analytical approach. I encourage my lab members to copy and paste it verbatim when they draft their feedback, so please do the same if it’s useful to you!


Summarize in a sentence or two the strengths of the manuscript. Summarize in a sentence or two the chief weaknesses of the manuscript that must be addressed.

 

INTRODUCTION

How coherent, crisp, and focused is the literature summary? Are all the studies discussed relevant to the topic at hand?

 

Are there important pieces of literature that are omitted? If so, note what they are, and provide full citations at the end of the review.

 

Does the literature summary flow directly into the questions posed by this study? Are their hypotheses clearly laid out?

 

METHOD

Are the participants’ ages, sexes, and ethnic/racial distribution reasonably characterized? Is it clear from what population the sample is drawn? Are any criteria used to exclude participants from overall analyses clearly specified?

 

Are the measures described in brief but with enough data so that the naive reader knows what to expect? Are there internal consistency or other reliability statistics presented for inventories and other measures that can have these presented?

 

For any experimental task, is it described in sufficient detail to allow a naive reader to replicate the task and understand how it works? Are all critical experimental measures and dependent variables clearly explained?

 

Was the procedure sufficiently detailed to allow you to know what the experience was like from the perspective of the participant? Could you rerun the study with this description and that provided above of the measures and tasks?

 

Is each step that the authors took to get from raw data to the data that were analyzed laid out plainly? Are particular equipment settings, scoring algorithms, or the like described in sufficient detail that you could take the authors’ data and get out exactly what they analyzed?

 

Do the authors specify the analyses they used to test all of their hypotheses? Are those analytic tools proper to use given their research design and data at hand? Are any post hoc analyses properly described as such? Is the criterion used for statistical significance given? What measure of effect size do the authors report? Does there appear to be adequate power to test the effects of interest? Do the authors report what software they used to analyze their data?

 

RESULTS

How easily can you read the Results section? How does it flow from analysis to analysis, and from section to section? Do the authors use appropriate references to tables and/or figures to clarify the patterns they discuss?

 

How correct are the statistics? Are they correctly annotated in tables and/or figures? Do the degrees of freedom match up to what they should based on what’s reported in the Method section?

 

Do the authors provide reasonable numbers to substantiate the verbal descriptions they use in the text?

 

If differences among groups or correlations are given, are there actual statistical tests performed that assess these differences, or do the authors simply rely on results falling on either side of a line of statistical significance?

 

If models are being compared, are the fit indexes both varied in their domains they assess (e.g., error of approximation, percentage of variance explained relative to a null model, containing more information given the number of parameters) and interpreted appropriately?

 

DISCUSSION

Are all the findings reported on in the Results mentioned in the Discussion?

 

Does the discussion contextualize the findings of this study back into the broader literature in a way that flows, is sensible, and appropriately characterizes the findings and the state of the literature? If any relevant citations are missing, again give the full citation at the end of the review.

 

How reasonable is the authors’ scope in the Discussion? Do they exceed the boundaries of their data substantially at any point?

 

What limitations of the study do the authors acknowledge? Are there major ones they omitted?

 

Are compelling directions given for future research? Are you left with a sense of the broader impact of these findings beyond the narrow scope of this study?

 

REFERENCES FOR THIS REVIEW (only if you cited articles beyond what the authors already included in the manuscript)

50 years of Star Trek: best episode and reflections on autism

The 50th anniversary of the TV show Star Trek’s first broadcast is today. It was a formative franchise for me growing up, informing many of my first ideas about space exploration, heroism, and a collaborative society. Debates abound about the best episode of the series. However, I agree with Business Insider’s choice of the episode Balance of Terror. It’s essentially a space version of submarine warfare, for which I’ve been a sucker ever since the game Red Storm Rising for the Commodore 64. This episode has everything: Lore building of the political and technological history of the Federation, the introduction of a new opponent, a glimpse of life on the lower decks, and character development galore for multiple cast members – including a guest star.

One of the moments that always stuck with me was one in the Captain’s quarters as the Enterprise and its Romulan counterpart wait each other out in silence. Dr. McCoy comes to speak with Captain Kirk, who expresses a rare moment of self-doubt regarding his decisions during tactical combat. The doctor’s compassionate nature comes through as he reminds the captain how across 3 million Earth-like planets that might exist, recapitulated across 3 million million galaxies, there’s only one of each of us – and not to destroy the one named Kirk. The lesson of that moment resonates 50 years later and is one I like to revisit when I feel myself beset by doubts about myself or my career.

Another moment I appreciate is the imperfection allowed in Spock’s character without being under the influence of spores, temporal vortices, or other sci-fi contrivances. Already, he has been accused of being a Romulan spy by a bigoted member of the crew who lost multiple family members in a war with the Romulans decades before visual communication was possible. Now, Spock breaks the silence under which the Enterprise was operating with a clumsy grip on the console he is repairing. Is this the action of a spy? Or just an honest mistake that anyone could make, especially when under heightened scrutiny?

Indeed, this error might be expected when Mr. Spock operates under stereotype threat. Just hours earlier, he was revealed to share striking physiological similarities with the Romulan enemies, whom Spock described as possible warrior offshoots of the Vulcan race before Vulcans embraced logic. This revelation caused Lt. Stiles, who had branches of his family wiped out in the prior war with the Romulans, to view Spock with distrust and bigotry so blatant that the captain called him on it on the bridge. Still, Stiles’s prejudice against Spock is keenly displayed throughout the episode, making it more likely that Spock would conform to the sabotaging behavior expected of him by his bridgemate.

On their own ship, the sneaky and cunning Romulans were not depicted as mere stereotypes of those adjectives but instead as a richly developed martial culture. Their commander and his centurion have a deep bond that extends over a hundred campaigns; the regard these two have for each other is highlighted in the actors’ subtle inflections and camaraderie. The internal politics of the Romulan empire are detailed through select lines of dialog surrounding the character of Decius and the pique that character elicits in his commander. In the end, the Romulan commander is shown to be sensitive to the demands of his culture and his subordinates in the culminating action of the episode, though the conflict between these and his own plans is palpable.

The contrast between Romulans and Spock highlights how alien Vulcan logic seems to everyone else. Spock is a character who represents the outsider, the one struggling for acceptance among an emotional human crew even as he struggles to maintain his culture’s logical discipline. Authors with autism have even remarked that Spock helped them understand how their highly logical way of perceiving the world differs from that of neurotypicals. However, given the emotional militarism of the Romulans, I believe that Vulcan logic is a strongly culturally conditioned behavior rather than a reflection of fundamental differences in baseline neurobiological processing.

There are neurobiological differences in sustained attention to different kinds of objects in autism compared to neurotypical controls. Work I did in collaboration with Gabriel Dichter has demonstrated that individuals with autism spectrum disorders have heightened attention to objects of high interest to these individuals (e.g., trains, computers) compared to faces, whereas neurotypicals show the opposite pattern of attention (access here). Based on decades of cultural influence, Mr. Spock might be expected to show equal attention to objects and faces, but Dr. McCoy, Captain Kirk, and the Romulans all would be expected to be exquisitely sensitive to faces, as they convey a lot of information about the social world.
