The (mis)measure of emotion through psychophysiology

On New Year’s Eve 2016, Mariah Carey had a…notable performance in which she had difficulties rendering the songs “Emotions” and “We Belong Together”. She roared back on New Year’s Eve 2017, sparking the first meme of 2018.

Alas, it is unlikely that the field of psychophysiology will un-mangle its measurement of emotions with reflexes in such a short span of time.

My lab uses two reflexes to assess the experience of emotion, both of which can be elicited through short, loud noise probes. The startle blink reflex is recorded underneath the eye and indexes a defensive negative emotional state. The postauricular reflex is a tiny reflex recorded behind the ear that indexes a variety of positive emotional states. Unfortunately, neither reflex assesses emotion reliably.

When I say “reliably”, I mean an old-school meaning of reliability that addresses what percentage of variability in a measurement’s score is due to the construct it’s supposed to measure. The higher that percentage, the more reliable the measurement. In the case of these reflexes, in the best-case scenarios, about half of the variability in scores is due to the emotion they’re supposed to assess.

That’s pretty bad.

For comparison, the reliability of many personality traits is at least 80%, especially from modern scales with good attention to the internal consistency of what’s being measured. The reliability of height measurements is almost 95%.

Why is reflexive emotion’s reliability so bad?

Part of it likely stems from the fact that (at least in my lab) we measure emotion as a difference in reactivity during a specific emotion versus during neutral. For the postauricular reflex, we take the reflex magnitude during pleasant pictures and subtract from that the reflex magnitude during neutral pictures. For the startle blink, we take the reflex magnitude during aversive pictures and subtract from that the reflex magnitude during neutral pictures. Difference scores can have lower reliabilities than single measurements because the unreliabilities of the emotion and neutral measures compound when the difference is taken.
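
To make that compounding concrete, here's a minimal sketch of the classical test theory formula for the reliability of a difference score. The specific reliability and correlation values are hypothetical placeholders for illustration, not estimates from my data.

```python
def difference_score_reliability(rel_x, rel_y, r_xy, sd_x=1.0, sd_y=1.0):
    """Classical test theory reliability of a difference score D = X - Y.

    rel_x, rel_y: reliabilities of the two component measures
    r_xy: correlation between the two components
    sd_x, sd_y: standard deviations of the components
    """
    var_x, var_y = sd_x ** 2, sd_y ** 2
    cov_xy = r_xy * sd_x * sd_y
    true_var = rel_x * var_x + rel_y * var_y - 2 * cov_xy  # reliable variance in D
    total_var = var_x + var_y - 2 * cov_xy                 # total variance in D
    return true_var / total_var

# Hypothetical values: two components that are each 70% reliable and
# correlate .55 yield a difference score that is only ~33% reliable.
print(difference_score_reliability(0.70, 0.70, 0.55))  # ~0.33
```

The general pattern: the more strongly the two component measures correlate, the less reliable their difference becomes, even when each component is measured fairly well on its own.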

However, it’s even worse when we use reflex magnitudes just during pleasant or aversive pictures. In fact, it’s so bad that I’ve found both reflexes have negative reliabilities when measured just as the average magnitude during either pleasant or aversive pictures! That’s a recipe for a terrible, awful, no good, very bad day in the lab. That’s why I don’t look at reflexes during single emotions by themselves as good measures of emotion.

Now, some of these difficulties look like they could be alleviated if you look at raw reflex magnitude during each emotion. If you do that, it looks like we could get reliabilities of 98% or more! So why don't I do this?

Because from person to person, reflex magnitudes during any stimulus can differ by a factor of more than 100, which means that raw reflex magnitudes mostly measure a person's overall reflex size – irrespective of any emotional state the person's in at that moment.

Let's take the example of height again. Let's also suppose that feeling sad makes people's shoulders stoop and heads droop, so they should be shorter (that is, have a lower height measurement) whenever they're feeling sad. I have people stand up while watching a neutral movie and a sad movie, and I measure their height four times during each movie to get a sense of how reliable the measurement of height is.

If all I do is measure the reliability of people’s mean height across the four sadness measurements, I’m likely to get a really high value. But what have I really measured there? Well, it’s just basically how tall people are – it doesn’t have anything to do with the effect of sadness on their height! To understand how sadness specifically affects people’s heights, I’d have to subtract their measured height in the neutral condition from that in the sad condition: a difference score.

Furthermore, if I wanted to remove entirely the variability associated with people's overall heights from the effects of sadness I'm measuring (perhaps because I'm measuring participants whose heights vary from 1 inch to 100 inches), I could use a process called "within-subject z scoring", which is what I use in my work. Overall reflex magnitude doesn't seem to predict many interesting psychological states, so I feel confident in this procedure. Though my measurements aren't great, at least they measure what I want to some degree.
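
Here's a minimal sketch of one common way to implement within-subject z scoring; the column names, the made-up magnitudes, and the choice to standardize each person against all of their own trials are my assumptions for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per noise probe.
df = pd.DataFrame({
    "participant": [1, 1, 1, 1, 2, 2, 2, 2],
    "condition":   ["pleasant", "pleasant", "neutral", "neutral"] * 2,
    "magnitude":   [120.0, 90.0, 80.0, 70.0, 4.0, 3.0, 2.5, 2.0],
})

# Standardize each person's magnitudes against their own mean and SD,
# which removes between-person differences in overall reflex size.
df["z"] = df.groupby("participant")["magnitude"].transform(
    lambda x: (x - x.mean()) / x.std(ddof=0)
)

# Emotion score per person: mean z during pleasant minus mean z during neutral.
scores = (
    df.pivot_table(index="participant", columns="condition",
                   values="z", aggfunc="mean")
      .assign(emotion_effect=lambda t: t["pleasant"] - t["neutral"])
)
print(scores)
```

Note how the two hypothetical participants differ enormously in raw magnitude but end up on a comparable scale once each is standardized against their own trials.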

What could I do to make reflexive measures of emotion better? Well, I've used four noise probes in each of four different picture contents to cover a broad range of positive emotions. One thing I could do is target a specific emotion within the positive or negative emotional domain and probe it sixteen times. Though it would reduce the generalizability of my findings, it would substantially improve the reliability of the reflexes, as reliabilities tend to increase the more trials you include (because random variations have more opportunities to get cancelled out through averaging). For the postauricular reflex, I could also present lots of noise clicks instead of probes to increase the number of reflexes elicited during each picture. Unfortunately, click-elicited and probe-elicited reflexes only share about 16% of their variability, so it may be difficult to argue they're measuring the same thing. That paper also shows you can't do that for startle blinks, so that's a dead-end method for that reflex.
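
The trial-count logic can be quantified with the Spearman-Brown prophecy formula. Here's a quick sketch, using the roughly 50% best-case figure mentioned above as a hypothetical starting reliability for four probes:

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability after lengthening a measure by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical: if 4 probes yield a reliability of .50, quadrupling to
# 16 probes of the same kind projects to a reliability of about .80.
print(spearman_brown(0.50, 4))  # 0.8
```

The projection assumes the added probes behave like the original ones (no habituation or fatigue effects), so it's an upper bound rather than a guarantee.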

In short, there’s a lot of work to do before the psychophysiology of reflexive emotion can relax with its cup of tea after redeeming itself with a reliable, well-received performance (in the lab).

Community psychology: Lessons learned from Route 91

This post is long; the links below will take you to potential topics of interest.

BEHIND THE SCENES: Call your IRB & delegate | Justify your statistical decisions

ON THE GROUND: Promote your study | Involve the media | Lead with helpful resources | Listen to community reps | Give results back to community | Advocate for your community

Our study about the Route 91 shooting represents a substantial addition to my lab’s research skills portfolio. Specifically, it’s my first foray into anything remotely resembling community psychology, in which researchers actively engage in helping to solve problems in an identified community. In this case, I thought of the Route 91 festival survivors (and potentially the broader Las Vegas community affected by the shooting) as the community. Below are the steps we used to conduct time-sensitive research in this community, with eyes both toward doing the best science possible and toward serving the community from which our participants were drawn.

0. In time-sensitive situations, call or visit your Institutional Review Board (IRB) in person and delegate work.

Because research cannot be performed without IRB approval, I gave my IRB a call as soon as a) I had a concrete idea for the study I wanted to do and b) support from my lab to work as hard as necessary to make it happen. In my case, that was the Friday after the shooting – and three days before I was scheduled to fly out of the country for a conference. Fortunately, I was able to reach our IRB's administrative personnel, and Dax Miller guided me through the areas we'd need to make sure we addressed for a successful application. He also agreed to perform a thorough administrative review on Sunday so that I could get initial revisions back to the IRB before leaving. We made those revisions so that an IRB reviewer could look over the study by Wednesday; that reviewer had additional questions I could address from afar, and the study was approved by Thursday afternoon in Vegas (or early Friday morning in Europe). Without that kind of heads-up teamwork from the IRB, we simply couldn't have done this study, which sought to capture people's stories of the trauma along with their symptoms within the month-long acute stress window following the shooting.

I also drafted my lab to perform a number of tasks I simply couldn’t do by myself in such a short period of time. Three students provided literature to help me conceptualize the risks and benefits this study might pose. Two worked to provide a list of therapeutic resources for participants. Two others scoured the internet for various beliefs people espoused about the shooting to develop a measure of those. Yet another two programmed the study in Qualtrics and coordinated the transfer of a study-specific ID variable so that we could keep contact information separate from participants’ stories and scores, one of whom also drafted social media advertisements. A final student created a flyer to use for recruiting participants (including a QR code to scan instead of forcing participants to remember our study’s URL). Again, without their help, this study simply couldn’t have been done, as I was already at or exceeding my capacity to stay up late in putting together the IRB application and its supporting documentation (along with programming in personality feedback in Qualtrics as our only incentive for participating).

1. Social media can be your recruitment friends, as can internal email lists.

We spread our flyers far and wide, including at a benefit event for Route 91 survivors as well as coffee shops, community bulletin boards, and other such locations across the greater Las Vegas valley. In addition, my RAs used their social media to help promote our study with the IRB-approved text, as did I. Other friends took up the cause and shared posts, spreading the reach of our study into the broader Las Vegas community in ways that would have been impossible otherwise.

A number of UNLV students, faculty, and staff had also attended Route 91 (and all were affected in some way by the shooting), so we distributed our study through internal email lists. At first, I had access to send an announcement through the College of Liberal Arts' weekly student email list along with the faculty and staff daily email. After word spread of the study (see the point below), I was also allowed to send a message out to all students at the university. Those contacts bumped our recruitment substantially, bringing both people who were at the festival and people from the broader Las Vegas community into the study.

2. The news media extends your reach even more deeply into the community, both for recruitment and dissemination.

Over the years, I've been fortunate enough to have multiple members of the news media contact me about stories they're doing that can help put psychological research into context for the public. At first, I thought of contacting them as asking them to return the favor and help get the word out about my study. However, as I did so, I also recognized that approaching them with content relevant to their beats may have made their jobs mildly easier. They have airtime or column inches they have to fill, and if you provide them meaningful stories, it'll save them effort in locating material to fill that time. Thus, if you're prepared with a camera- or phone-ready set of points, both you and your media contacts can have a satisfying professional relationship.

I made sure to have a reasonable and concise story about what the study was about, what motivated it, what all we were looking at, and what the benefits might be to the community. That way, the journalists had plain-English descriptions of the study that could be understood by the average reader or viewer and that could be used more or less as-is, without a lot of editing. In general, I recommend having a good handle on about 3 well-rehearsed bullet points you want to make sure you get across – and that are expressed in calm, clear language you could defend as if in peer review. Those points may not all fit in with the particular story that the journalist is telling, but they’ll get the gist, especially if you have an action item at the end to motivate people. For me, that was my study’s web address.

As time went on, more journalists started contacting me. I made sure I engaged all of them, as I wanted the story of our study out in as many places as possible. Generally speaking, with each new story that came out, I had 5-10 new people participate in the study. If your study is interesting, it may snowball, and you never know which media your potential community participants might consume. The university’s press office helped in getting the word out as well, crafting a press release that was suitable for other outlets to pick up and modify.

The media can also help you re-engage your community during and after research has commenced (which I discuss more in point 5 below). They have a reach beyond your specific community you'll likely never have, and they can help tell your community's story to the larger world. Again, it's imperative to do so in a way that's not stigmatizing or harmful (see point 4 below), but you can help give prominence to people whose voices otherwise wouldn't be heard or considered.

3. Lead your approach to community groups directly with helpful resources after building credibility.

Another good reason to approach the media beyond increasing participation immediately is that having a public presence for your research will give you more credibility when approaching your community of interest directly. After about a week of press, one of the participants mentioned a survivors' Facebook group, and I believed the time was right to make direct inquiries to the community I wanted to help. To that end, I messaged the administrators of survivors' Facebook groups, asking them to post the free and reduced-cost therapy resources we gave to participants after the study. I was also careful not to ask to join the groups, as I didn't want to run the risk of intruding on that community's healing places.

Two of the groups' administrators asked me to join their groups directly to post the resources myself, and I was honored they asked me to do so. However, in those groups, I confined myself to being someone who posted resources when general calls went out rather than offering advice about coping with trauma. I didn't want to over-insinuate myself into the group and thereby distort its culture, and I also wanted to maintain a professionally respectful distance to allow the group to function as a community resource. A couple of other groups said they would be willing to consider posting on my behalf but that the groups were closed to all but survivors. I thanked them for their consideration and emphasized I just wanted to spread the word about available resources.

All in all, it seems imperative to approach a community with something to give, rather than just wanting to receive from them. In this unique case, I had something to offer almost immediately. However, if it’s not clear what you might bring to the table, research your community’s needs and talk to some representatives to see what they might need. To the extent your professional skills might help and that the community believes you’ll help them (and not harm them), you’re more likely to get accepted into the community to conduct your research.

4. Engage the community in developing your study.

I learned quickly about forming a partnership with the community in developing my research when one of the members of a Route 91 survivors' group contacted me about our study. She was local and had a background in psychology, making her an excellent bridge between my research team and the broader community. She zipped through IRB training, provided invaluable feedback about the types of experiences people have had after Route 91 (and helped develop items to measure those), and gave feedback about a plan to compensate participants (confirming that offering the opportunity to donate compensation to a victims' fund might alleviate some people's discomfort). She also gave excellent advice about how to present the study's results to the community, down to the colors used on the graph to make the meaning of the curves I drew more obvious. Consistent with best practices in community psychology, I intend to have her as an author on the final paper(s) so that the community has a voice in this research's reports.

Though the ad hoc, geographically dispersed nature of this community makes centralized planning with it challenging (especially on a short time frame), I hope our efforts thus far have helped us stay true to the community's perceptions and have avoided stigmatizing its members. In communities with leadership structures of their own, engaging those leaders in study planning, participation, and dissemination helps make the research truer to the community's experience and will likely make people more comfortable with participating. Those people may want to make changes that may initially seem to compromise your goals for a study, but in this framework, the community is a co-creator of the research. If you can't explain well why certain procedures you really want to use are important in ways the community can accept, then you'll need to listen to the community to figure out how to work together. Treat education as a two-way street: You have a lot to learn about the community, and you can also show them the ins and outs of research procedures, including why certain procedures (e.g., informed consent) have been developed to protect participants, not harm them.

5. Give research results back to your community in an easily digestible form.

In community psychology, the research must feed back into the community somehow. Because we’re not doing formal interventions in this study, the best I think we can do at this point is share our results in a format that’s accessible to people without a statistical education. In the web page I designed to do just that, I use language as plain as I can to describe our findings without giving tons of numbers in the text. In the numerical graphs I feature, I’ve used animated GIFs to introduce the public to the layers comprising a graph rather than expecting them to comprehend the whole thing at once. I hope that it works.

I also posted my findings in all the groups that had admitted me and engaged reporters who'd asked for follow-up stories once we had our first round of data collected so that their investment in helping me recruit would bear fruit. Many Route 91 survivors reported being perceived as not having anything "real" wrong with them or being misunderstood by their families, friends, coworkers, or romantic partners. Thus, I tried to diagram how people at Route 91 had much higher levels of post-traumatic stress than people in the community, such that about half of them would qualify for a provisional PTSD diagnosis if their symptoms persisted for longer than a month.

6. Advocate for your community.

This is one of the trickier parts of this kind of research for me, as I don’t want to speak as a representative for a broad, decentralized community of which I’m ultimately not a part. Nevertheless, I think data from this research could help advocate for the survivors in their claims, especially those who may not be eligible for other kinds of compensatory funds. I only found out about the town hall meetings of the Las Vegas Victims Fund as they were happening, so I was unable to provide an in-person comment to the board administering the funds. Fortunately, Nicole Raz alerted me to the videos of the town halls, and I was privileged to hear the voices of those who want to be remembered in this process. Right now, I’m drafting a proposal based on this study’s data and considerations of how the disability weights assigned to PTSD by the World Health Organization compare to other conditions that may be granted compensation.

In essence, I'm hoping to make a case that post-traumatic stress is worth compensating, especially given that preliminary results suggest post-traumatic stress symptomatology as a whole doesn't seem to have declined in this sample over the course of a month. One of the biggest problems facing this particular victims' fund is that there are tens of thousands of possible claimants, unlike just about any other mass tragedy in modern US history, so the fund administrators have terribly difficult decisions to make. I hope to create as effective an argument as possible for their consideration, and I also hope to make those who are suffering aware of other resources that may help them reduce the burden dealing with the shooting has placed on them.

7. Use statistical decision thresholds that reflect the relative difficulty of sampling from your community.

This is a point that’s likely of interest only to researchers, but it bears heavily on how you conceptualize your study’s design and analytic framework when writing for professional publication. In this case, I knew I was dealing with (hopefully) a once-in-a-lifetime sample. Originally, I was swayed by arguments to define statistical significance more stringently and computed power estimates based on finding statistically significant effects at the .005 level with 80% power using one-tailed tests. My initial thought was that I wanted any results about which I wrote to have as much evidential value as possible.
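
To illustrate how much that threshold choice matters for sample size planning, here's a hedged sketch of a two-group power calculation using statsmodels. The medium effect size (d = 0.5) is a placeholder assumption for illustration, not an estimate from this study.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group n for 80% power at a one-tailed alpha of .005 (hypothetical d = 0.5)
n_strict = analysis.solve_power(effect_size=0.5, alpha=0.005, power=0.80,
                                alternative="larger")

# Per-group n for 80% power at a one-tailed alpha of .05
n_lenient = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                 alternative="larger")

print(round(n_strict), round(n_lenient))  # roughly 95 vs. 51 per group
```

In other words, holding design and effect size constant, the stricter threshold roughly doubles the required sample – a steep cost when the sample cannot be expanded at will.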

However, as I took to heart calls I've joined to justify one's threshold for discussing results instead of accepting a blanket threshold, I realized that was too stringent a standard to uphold given the unrepeatable nature of this sample. I recognized I was willing to trade a lower evidential threshold for the ability to discuss more fully the results of our study. To that end, I'm now thinking we should use an alpha level of .05, corrected for multiple comparisons using the Holm-Bonferroni method within fairly narrowly defined families of results.

Specifically, for each conceptual set of measures (i.e., psychopathology, normal-range personality, other personality, well-being, beliefs about the event, and demographics), I'll adjust the critical p value by dividing .05 by the number of measures in that family. We have two measures of psychopathology (i.e., the PCL-5 and PHQ-9), 11 normal-range personality traits, 3 other personality traits, 5 measures of well-being, and (probably) 2 measures of beliefs. Thus, if I'm interested in how those at the festival vs. those who weren't at the festival differed in their normal-range personality traits, I could conduct a series of 11 independent-sample Welch's t tests (potentially after a MANOVA involving all traits suggested there are some variables whose means differ between groups).

I’d evaluate the significance of the largest difference at a critical value of .05/11, the second largest (if that first one is significant) at a critical value of .05/10, and so on until the comparison is no longer significant. For my psychopathology variables, I’d evaluate (likely) the PTSD difference first at a critical value of .05/2, then (likely) the depression difference at a critical value of .05/1 (or .05).

That way, I’ll keep my overall error rate at .05 within a conceptual family of comparisons without overcorrecting for multiple comparisons. When dealing with correlations of variables across families of comparisons, I’ll use the larger family’s number in the initial critical value’s denominator. This procedure seems to balance having some kind of evidential value (albeit potentially small) with these findings and a reasonable amount of statistical rigor. Using the new suggested threshold, I’d have to divide .005 by the number of comparisons in a family to maintain my stated family-wise error rate, which would make for some incredibly difficult thresholds to meet!
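
For concreteness, here's a minimal sketch of that step-down procedure in code; the p values at the bottom are made up purely for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Sequentially rejective Holm-Bonferroni procedure.

    Returns a list of booleans (same order as p_values) indicating which
    hypotheses are rejected while holding the familywise error rate at alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # Smallest p is tested at alpha/m, the next at alpha/(m - 1), and so on.
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # stop at the first comparison that is not significant
    return rejected

# Hypothetical p values for 11 normal-range personality comparisons
p_vals = [0.001, 0.004, 0.006, 0.02, 0.03, 0.04, 0.10, 0.20, 0.35, 0.50, 0.80]
print(holm_bonferroni(p_vals))  # only the first two survive in this example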

There are other design decisions I made (e.g., imputing missing values of many study measures using mice rather than only using complete cases in analyses) that also furthered my desire to keep as many voices represented as possible and make our findings as plausible as we can. In our initial study design, we also did not pay participants so that a) there wouldn't be undue inducement to participate, b) we could accurately estimate the costs of the study when having no idea how many people might actually sign up, and c) we wouldn't have to worry as much about the validity of responses that may have been driven more by the desire to obtain money than to provide accurate information. In each case, I intend on reporting these justifications and registering them before conducting data analyses to provide as much transparency as possible, even in a situation in which genuine preregistration wasn't possible.
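
mice itself is an R package; as a rough Python analogue of the same chained-equations idea (and producing only a single completed dataset rather than mice's multiple imputations), here's a hedged sketch using scikit-learn's IterativeImputer with hypothetical variable names.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical questionnaire scores with scattered missing values.
data = pd.DataFrame({
    "pcl5":      [34.0, np.nan, 12.0, 55.0, 20.0],
    "phq9":      [10.0, 8.0, np.nan, 18.0, 5.0],
    "wellbeing": [3.2, 4.1, 4.8, np.nan, 3.9],
})

# Each variable with missing values is modeled from the others, iteratively,
# in the spirit of multiple imputation by chained equations.
imputer = IterativeImputer(random_state=0, max_iter=10)
completed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(completed)
```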

A template for reviewing papers

Peer review’s technology (but not volume) has changed over the decades.

The current culture of science thrives on peer review – that is, the willingness of your colleagues to read through your work, critique it, and thereby improve it. Science magazine recently collected a slew of tips on how to review papers, which give people getting started in the process of peer reviewing some lovely overarching strategies about how to prepare a review.

But how can you keep in your head all those pieces of good advice and apply them to the specifics of a paper in front of you? I’d argue that like many human endeavors, it’s impossible. There are too many complexities in each paper to collate loads of disparate recommendations and keep them straight in your head. To that end, I’ve created a template for reviewing papers our lab either puts out or critiques. Not incidentally, I highly recommend using your lab group as a first round of review before sending papers out for review, as even the greenest RA can parse the paper for problems in logic and comprehensibility (inculding teh dreded “tpyoese”).

To help my lab out in doing this, I've prepared the following template. It organizes questions I typically have about various pieces of manuscripts, and I've found that undergrads produce nice reviews with its help. In particular, I find it helps them focus on things beyond the analytic details to which they may not have been exposed so that they don't feel so overwhelmed. It may also be helpful for more experienced reviewers in judging what they could contribute as a reviewer on an unfamiliar topic or analytical approach. I encourage my lab members to copy and paste it verbatim when they draft their feedback, so please do the same if it's useful to you!


Summarize in a sentence or two the strengths of the manuscript. Summarize in a sentence or two the chief weaknesses of the manuscript that must be addressed.

 

INTRODUCTION

How coherent, crisp, and focused is the literature summary? Are all the studies discussed relevant to the topic at hand?

 

Are there important pieces of literature that are omitted? If so, note what they are, and provide full citations at the end of the review.

 

Does the literature summary flow directly into the questions posed by this study? Are their hypotheses clearly laid out?

 

METHOD

Are the participants’ ages, sexes, and ethnic/racial distribution reasonably characterized? Is it clear from what population the sample is drawn? Are any criteria used to exclude participants from overall analyses clearly specified?

 

Are the measures described briefly but with enough detail that the naive reader knows what to expect? Are internal consistency or other reliability statistics presented for inventories and other measures for which such statistics can be computed?

 

For any experimental task, is it described in sufficient detail to allow a naive reader to replicate the task and understand how it works? Are all critical experimental measures and dependent variables clearly explained?

 

Was the procedure sufficiently detailed to allow you to know what the experience was like from the perspective of the participant? Could you rerun the study with this description and that provided above of the measures and tasks?

 

Is each step that the authors took to get from raw data to the data that were analyzed laid out plainly? Are particular equipment settings, scoring algorithms, or the like described in sufficient detail that you could take the authors’ data and get out exactly what they analyzed?

 

Do the authors specify the analyses they used to test all of their hypotheses? Are those analytic tools proper to use given their research design and data at hand? Are any post hoc analyses properly described as such? Is the criterion used for statistical significance given? What measure of effect size do the authors report? Does there appear to be adequate power to test the effects of interest? Do the authors report what software they used to analyze their data?

 

RESULTS

How easily can you read the Results section? How does it flow from analysis to analysis, and from section to section? Do the authors use appropriate references to tables and/or figures to clarify the patterns they discuss?

 

How correct are the statistics? Are they correctly annotated in tables and/or figures? Do the degrees of freedom match up to what they should based on what’s reported in the Method section?

 

Do the authors provide reasonable numbers to substantiate the verbal descriptions they use in the text?

 

If differences among groups or correlations are given, are there actual statistical tests performed that assess these differences, or do the authors simply rely on results falling on either side of a line of statistical significance?

 

If models are being compared, are the fit indexes varied in the domains they assess (e.g., error of approximation, percentage of variance explained relative to a null model, information criteria that account for the number of parameters), and are they interpreted appropriately?

 

DISCUSSION

Are all the findings reported on in the Results mentioned in the Discussion?

 

Does the discussion contextualize the findings of this study back into the broader literature in a way that flows, is sensible, and appropriately characterizes the findings and the state of the literature? If any relevant citations are missing, again give the full citations at the end of the review.

 

How reasonable is the authors’ scope in the Discussion? Do they exceed the boundaries of their data substantially at any point?

 

What limitations of the study do the authors acknowledge? Are there major ones they omitted?

 

Are compelling directions given for future research? Are you left with a sense of the broader impact of these findings beyond the narrow scope of this study?

 

REFERENCES FOR THIS REVIEW (only if you cited articles beyond what the authors already included in the manuscript)

50 years of Star Trek: best episode and reflections on autism

The 50th anniversary of the TV show Star Trek's first broadcast is today. It was a formative franchise for me growing up, informing many of my first ideas about space exploration, heroism, and a collaborative society. Debates abound about the best episode of the series. However, I agree with Business Insider's choice of the episode Balance of Terror. It's essentially a space version of submarine warfare, for which I've been a sucker ever since the game Red Storm Rising for the Commodore 64. This episode has everything: lore-building of the political and technological history of the Federation, the introduction of a new opponent, a glimpse of life on the lower decks, and character development galore for multiple cast members – including a guest star.

One of the moments that always stuck with me was one in the Captain’s quarters as the Enterprise and its Romulan counterpart wait each other out in silence. Dr. McCoy comes to speak with Captain Kirk, who expresses a rare moment of self-doubt regarding his decisions during tactical combat. The doctor’s compassionate nature comes through as he reminds the captain how across 3 million Earth-like planets that might exist, recapitulated across 3 million million galaxies, there’s only one of each of us – and not to destroy the one named Kirk. The lesson of that moment resonates 50 years later and is one I like to revisit when I feel myself beset by doubts about myself or my career.

Another moment I appreciate is the imperfection allowed in Spock’s character without being under the influence of spores, temporal vortices, or other sci-fi contrivances. Already, he has been accused of being a Romulan spy by a bigoted member of the crew who lost multiple family members in a war with the Romulans decades before visual communication was possible. Now, Spock breaks the silence under which the Enterprise was operating with a clumsy grip on the console he is repairing. Is this the action of a spy? Or just an errant mistake that anyone could make, especially when under heightened scrutiny?

Indeed, this error might be expected when Mr. Spock operates under stereotype threat. Just hours earlier, he was revealed to share striking physiological similarities with the Romulan enemies, who Spock described as possible warrior offshoots of the Vulcan race before Vulcans embraced logic. This revelation caused Lt. Stiles, who had branches of his family wiped out in the prior war with the Romulans, to view Spock with distrust and outright bigotry that was so blatant that the captain called him on it on the bridge. Still, Stiles’s prejudice against Spock is keenly displayed throughout the episode, making it more likely that Spock would conform to the sabotaging behavior expected of him by his bridgemate.

On their own ship, the sneaky and cunning Romulans were not depicted as mere stereotypes of those adjectives but instead as a richly developed martial culture. Their commander and his centurion have a deep bond that extends over a hundred campaigns; the regard these two have for each other is highlighted in the actors’ subtle inflections and camaraderie. The internal politics of the Romulan empire are detailed through select lines of dialog surrounding the character of Decius and the pique that character elicits in his commander. In the end, the Romulan commander is shown to be sensitive to the demands of his culture and his subordinates in the culminating action of the episode, though the conflict between these and his own plans is palpable.

The contrast between Romulans and Spock highlights how alien Vulcan logic seems to everyone else. Spock is a character who represents the outsider, the one struggling for acceptance among an emotional human crew even as he struggles to maintain his culture’s logical discipline. Authors with autism have even remarked how Spock helped them understand how they perceive the world differently from neurotypicals in a highly logical fashion. However, given the emotional militarism of the Romulans, I believe that Vulcan logic is a strongly culturally conditioned behavior rather than a reflection of fundamental differences in baseline neurobiological processing.

There are neurobiological differences in sustained attention to different kinds of objects in autism compared to neurotypical controls. Work I did in collaboration with Gabriel Dichter has demonstrated that individuals with autism spectrum disorders have heightened attention to objects of high interest to these individuals (e.g., trains, computers) compared to faces, whereas neurotypicals show the opposite pattern of attention (access here). Based on decades of cultural influence, Mr. Spock might be expected to show equal attention to objects and faces, but Dr. McCoy, Captain Kirk, and the Romulans all would be expected to be exquisitely sensitive to faces, as they convey a lot of information about the social world.

Preregistration as a guide to reproducibility and scientific competence

UPDATE 20190820: This post led to this paper in the special issue of the Journal of Abnormal Psychology about increasing replicability, transparency, and openness in clinical psychological research. In it, we describe a two-dimensional continuum of registration efforts and now describe preregistrations as those that occur before data are collected, coregistrations as those that occur after data collection starts but before data analysis begins, and postregistrations as those that occur after data analysis begins. The preprint is here.

This is a long post written for both professionals and curious lay people; the links below allow you to jump among the post’s sections. The links in all CAPS represent the portions of this post I view as its unique intellectual contributions.

Navigation: Prelude | Reproducibility | Conflicts of Interest | SCIENTIFIC COMPETENCE | Model | TEMPLATE

The Preregistration Knights who say Ni require a shrubbery instead of a garden of forking paths.

Preregistration: prelude, problems addressed, and concerns

Psychology is beset with ways to find things that are untrue. Many famous and influential findings in the field are not standing up to closer scrutiny with tightly controlled designs and methods for analyzing data. For instance, a registered replication report in which my lab was involved found that holding a pen between your teeth to force a smiling pose does not, in fact, make cartoons funnier. Indeed, fewer than half of 100 studies published in top-tier psychology journals replicated.

But it’s not only psychology that has this problem. Only 6 out of 53 “landmark” cancer therapy studies replicated. An attempt to induce other labs to reproduce findings in cancer research has scaled back substantially in the face of technical and logistical difficulties. Nearly two thirds of relatively recent economics papers failed to replicate, though this improved to about half when the researchers had help from the original teams. In fact, some argue that most published research findings are false due to the myriad ways researchers can find statistically significant results from their data.

One proposal for solving these problems is preregistration. Preregistration refers to making available – in an accessible repository – a detailed plan about how researchers will conduct a study and analyze its results. Any report that is subsequently written on the study would ideally refer to this plan and hew closely to it in its initial methods and results descriptions. Preregistration can help mitigate a host of questionable research practices that take advantage of researcher degrees of freedom, or the hidden steps behind the scenes that researchers can take to influence their results. This garden of forking paths can transmute data from almost any study into something statistically significant that could be written up somewhere; preregistration prunes this garden into a single, well-defined shrub for any set of studies.

Yet prominent figures doubt the benefits of preregistration. Some even deny there's a replication crisis that would require these kinds of corrections. And to be sure, there are other steps to take to solve the reproducibility crisis. However, I argue that preregistration has three virtues, which I describe below. In addition to enhancing the reproducibility of scientific findings, it provides a method for managing conflicts of interest in a transparent way above and beyond required institutional disclosures. Furthermore, I believe preregistration permits a lab to demonstrate its increasing competence and a field's cumulative knowledge.

Enhancing reproducibility

Chief among the proposed benefits of preregistration is the ability of science to know what actually happened in a study. Preregistration is one part of a larger open science movement that aims to make science more transparent to everyone – fellow researchers and the public alike. Preregistration is probably more useful for people on the inside, though, as it helps people knowledgeable in the field assess how a study was done and what the boundaries were on the initial design and analysis. Nevertheless, letting the general public see how science is conducted would hopefully foster trust in the research enterprise, even if it may be challenging to understand the particulars without formal training.

Here are some of the problems preregistration promises to solve:

  • Hypothesizing After the Results are Known (HARKing): You can’t say you thought all along something you found in your data if it’s not described in your preregistration.
  • Altering sample sizes to stop data collection prematurely (if you find the effect you want) or prolong it (to increase power, or the likelihood of detecting effects): You said how many observations you were going to make, so you have a preregistered point to stop. Ideally, this stopping point would be determined from a power analysis using reasonable assumptions from the literature or basic study design about the expected effect sizes (e.g., differences between conditions or strengths of relationships between variables).
  • Eliminating participants or data points that don’t yield the effect you want: There are many reasons to drop participants after you’ve seen the data, but preregistering reasons for eliminating any participants or data from your analyses stops you from doing so to “jazz up” your results.
  • Dropping variables that were analyzed: If you collect lots of measures, you've got lots of ways to avoid putting your hypotheses to rigorous tests; preregistration forces you to specify which variables are focal tests of your hypothesis beforehand. It also ensures you think about making appropriate corrections for running lots of tests. If you run 20 different analyses, each with a 5% chance (or .05 probability) of yielding a result you want (a typical setup in psychology), then you'd expect about 1 significant result by chance alone (a quick sketch of this arithmetic follows the list)!
  • Dropping conditions or groups that “didn’t work”: Though it may be convenient to collect some conditions “just to see what happens”, preregistering your conditions and groups makes you consider them when you write them up.
  • Invoking hidden moderators to explain group differences: Preregistering all the things you believe might change your results ensures you won’t pull an analytic rabbit out of your hat.
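
Here's the quick calculation behind that 20-test example:

```python
alpha, n_tests = 0.05, 20

# Expected number of false positives if every null hypothesis is true
expected_false_positives = n_tests * alpha           # 1.0

# Chance of at least one false positive across the 20 independent tests
p_at_least_one = 1 - (1 - alpha) ** n_tests          # ~0.64

print(expected_false_positives, round(p_at_least_one, 2))
```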

Many of these solutions can be summed up in 21 words. Ultimately, rather than having lots of hidden “lab secrets” about how to get an effect to work or a multitude of unknown ingredients working their way into the fruit of the garden of forking paths, research will be cleanly defined and obvious, with bright and shiny fruit from its shrubbery.

Managing conflicts of interest

As I was renewing my CITI training (the stuff we researchers have to refresh every 4 years to ensure we keep up to date on performing research ethically and responsibly), I also realized that preregistration of analytic plans creates a conflict of interest management plan. Preregistered methods and data analytic plans require researchers to describe exactly what they're going to do in a study. Those plans can be reviewed by experts – including officials at a researcher's university, at a funding agency, or in a journal's editorial process – to detect ways in which researchers' own interests might be put ahead of the integrity of the study's data or analyses. Conscientious researchers can also scrutinize their own plans to see how their own best interests might have crept ahead of the most scientifically justifiable procedures to follow in a study.

These considerations led the clinical trials field to adopt a set of guidelines to prevent conflicts of interest from altering the scientific record. Far more than institutional disclosure forms, these guidelines force scientists to show their work and stick to the script of their initial study design. Since adopting these guidelines, the number of clinical trials showing null outcomes has increased dramatically. This pattern suggests that conflicts of interest may have guided some of the positive findings for various therapies rather than scientific evidence analyzed according to best practices. The preregistered shrub may not bear as much fruit as the garden of forking paths, but the fruit preregistered science bears is less likely to be poisonous to the consumer of the research literature.

Demonstrating scientific competence and cumulative knowledge

One underappreciated benefit of preregistration is the way it allows researchers to demonstrate their increasing competence in an area of study. When we start out exploring something totally new, we have ideas about basic things to consider in designing, implementing, and analyzing our studies. However, we often don’t think of all the probable ways that data might not comport with our assumptions, the procedural shifts that might be needed to make things work better, or the optimal analytic paths to follow.

When you run a first study, loads of these issues creep up. For example, I didn’t realize how hard it was going to be to recruit depressed patients from our clinic for my grant work on depression (especially after changing institutions right as the grant started), so I had to switch recruitment strategies. Right as we were starting to recruit participants, there was also a conference talk in 2013 that totally changed the way I wanted to analyze our data, as the mood reactivity item was better for what we wanted to look at than an entire set of diagnostic subtypes. In dealing with those challenges, you learn a lot for the second time you run a similar study. Now I know how to specify my recruitment population, and I can point to that talk as a reason for doing things a different way than my grant described. Over time, I’ll know more and more about this topic and the experimental methods in it, plugging additional things into my preregistrations to reflect my increased mastery of the domain.

Ideally, the transition from less detailed exploratory analyses to more detailed confirmatory work is a marker of a lab’s competence with a specific set of techniques. One could even judge a lab’s technical proficiency by the number of considerations advanced in their preregistrations. Surveying preregistered projects for various studies might let you know who the really skilled scientists in an area are. That information could be useful to graduate students wanting to know with whom they’d like to work – or potential collaborators seeking out expertise in a particular topic. Ideally, a set of techniques would be well-established enough within a lab to develop a standard operating procedure (SOP) for analyzing data, just as many labs have SOPs for collecting data.

In this way, the fruits of research become clearer and more readily picked. Rather than taking fruitless dead ends down the garden of forking paths with hidden practices and ad hoc revisions to study designs, the well-manicured shrubbery of preregistered research and SOPs gives everyone a way to evaluate the soundness of a lab's methods without ever having to visit. Indeed, some journals take preregistration so seriously now that they are willing to provisionally pre-accept papers with sound, rigorous, and preregistered methodology. Tenure committees can likewise peek under the hood of the studies you've conducted, which could alleviate a bit of the publish-or-perish culture in academia. A university's standards could even reward an investigator's research rigor beyond a publication history (which may be more like a lottery than a meritocracy).

A model for confirmatory and exploratory reporting and review

In my ideal world, results sections would be divided into confirmatory and exploratory sections. Literally. Whether written as RESULTS: CONFIRMATORY and RESULTS: EXPLORATORY, as PREREGISTERED RESULTS and EXPLORATORY RESULTS, or as some other set of headings, it should be glaringly obvious to the reader which is which. The confirmatory section contains all the stuff in the preregistered plan; the exploratory section contains all the stuff that came after. Right now, I would prefer that details about the exploratory analyses be kept in that exploratory results section to make it clear it came after the fact and to create a narrative of the process of discovery. However, similar Data Analysis: Confirmatory and Data Analysis: Exploratory or Preregistered Data Analysis and Exploratory Data Analysis sections might make it easier to separate the data analytics from the meat of the results.

It’s also important to recognize that exploratory analyses shouldn’t be pooh-poohed. Curious scientists who didn’t find what they expected could systematically explore a number of questions in their data subsequent to its collection and preliminary analysis. However, it is critical that all deviations from the preregistration be reported in full detail and with sufficient justification to convince the skeptical reader that the extra analyses were reasonable to perform. Much of the problem with our existing literature is that we haven’t reported these details and justifications; in my view, we just need to make them explicit to bolster confidence in exploratory findings.

Reviewers should ask about those justifications if they’re not present, but exploratory analyses should be held to essentially the same standards as we hold current results sections. After all, without preregistration, we’re all basically doing exploratory analyses! As time passes, confirmatory analyses will likely hold more weight with reviewers. However, for the next 5-10 years, we should all recall that we came from an exploratory framework, and to an exploratory framework we may return when justified. When considering an article, reviewers should also look carefully at the confirmatory plan (which should be provided as an appendix to a reviewed article if a link that would not compromise reviewer anonymity cannot be provided). If the researchers deviated from their preregistered plan, call them on it and make them run their preregistered analyses! In any case, preregistration’s goals can fail if reviewers don’t exercise due diligence in following up the correspondence between the preregistration and the final report.

The broad strokes of a paper I'm working on right now demonstrate the value of preregistration in correcting mistakes and the ways exploratory results might be described. I was showing a graduate student a dataset I'd collected years before, and there were three primary dependent variables I planned on analyzing. To my chagrin, when the student looked through the data, the student pointed out that one of those three variables had never been computed! Had I preregistered my data analytic plan, I would have remembered to compute that variable before conducting all of my analyses. When that variable turned out to be the only one with interesting effects, we also thought of ways to drill down and better understand the conditions under which the effect we found held true. We found these breakdowns were justifiable in the literature but were not part of our original analytic plan. Preregistration would have given us a cleaner way to separate these exploratory analyses from the original confirmatory analyses.

In any future work with the experimental paradigm, we’ll preregister both our original and follow-up analyses so there’s no confusion. Such preregistration also acts as a signal of our growing competence with this paradigm. We’ll be able to give sample sizes based on power analyses from the original work, prespecify criteria for excluding data and methods of dealing with missing values, and more precisely articulate how we will conduct our analyses.

My template

Many people talk about the difficulties of preregistering studies, so I advance a template I've been working on. In it, I pose a bunch of questions in a format structured like a journal article to guide researchers through the questions I'd like to have answered as I start a study. It's a work in progress, and I hope to add to it as my own thoughts on what all could be preregistered grow. I also hope we can publish some data analytic SOPs along with the psychophysiological SOPs that we use in the lab (a shortened version of which we have available for participants to view). I hope it's useful in considering your own work and the way you'd preregister it. If this seems too daunting, a simplified version of preregistration that hosts the registration for you can get you started!
