September 4, 2015

Motivating Problem

Social science datasets accommodate many ways of analyzing the data. Everything can be p-hacked!

  • Multiple treatment arms and numerous covariates collected for experiments
  • Dozens (typical) or hundreds (ANES) of survey questions
  • Various measures or indices as independent variables

Conducting many tests

  • Rejecting whenever p < .05 gives no control of an overall error rate
  • Controlling the family-wise error rate is too conservative and doesn't adapt when making more than one rejection
  • Controlling the false discovery rate is a good compromise, but it doesn't reflect the researcher's utility

Controlling the Weighted False Discovery Rate is the best choice

Weighted False Discovery Rate (Benjamini and Hochberg, 1997)

\[ \begin{align} Q(w) &= \begin{cases} \frac{ \sum_{i=1}^k w_i V_i}{\sum_{i=1}^k w_i},& R > 0 \\ 0,& R= 0 \end{cases} \end{align} \]

Order the \(m\) p-values from smallest to largest, and reject the first \(\hat{k}\) hypotheses at level \(\alpha\), choosing \(\hat{k}\) by:

\[ \begin{align} \hat{k}= \max \{k : p_k \leq \tfrac{\sum_{i=1}^k w_i}{m} \alpha \} \label{eqn:wbh} \end{align} \]
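The step-up rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `weighted_bh` is hypothetical, and the code assumes the weights are rescaled to sum to \(m\), the number of hypotheses, so that equal weights recover the ordinary BH thresholds.

```python
def weighted_bh(pvalues, weights, alpha=0.05):
    """Return a list of booleans: True where weighted BH rejects the hypothesis."""
    m = len(pvalues)
    if len(weights) != m:
        raise ValueError("need one weight per p-value")
    # Assumption: weights are rescaled so that they sum to m.
    scale = m / sum(weights)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_hat = 0
    cumulative_weight = 0.0
    for k, i in enumerate(order, start=1):
        cumulative_weight += weights[i] * scale
        # Threshold for the k-th smallest p-value: (sum of first k weights / m) * alpha.
        if pvalues[i] <= cumulative_weight / m * alpha:
            k_hat = k  # keep the largest k that meets its threshold
    reject = [False] * m
    for i in order[:k_hat]:
        reject[i] = True
    return reject
```

Calling `weighted_bh(pvalues, [1] * len(pvalues))` gives the unweighted procedure, which makes it easy to see how a given set of weights changes the rejections.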

Graphical Intuition

Use-Case 1: Preregistered Experiments

In a preregistered setting, researchers are free to specify any set of weights and hypotheses

  • Weights must be pre-specified; otherwise you're probably HARKing (Kerr, 1998)
  • Consider the minimal p-value thresholds you're willing to subject your hypotheses to
  • For 5 hypotheses, would I accept a guaranteed rejection at p < .04 for one hypothesis in exchange for guaranteed rejections only at p < .0025 for the rest?
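The trade-off in the last bullet follows from the weighted BH threshold: a hypothesis with weight \(w_i\) is always rejected when its p-value is at most \(w_i \alpha / m\), whatever the other p-values turn out to be. A small sketch of that arithmetic, assuming \(\alpha = .05\) and weights that sum to \(m\) (the helper name `minimal_thresholds` is hypothetical):

```python
def minimal_thresholds(weights, alpha=0.05):
    """Smallest p-value threshold each hypothesis can face under weighted BH."""
    m = len(weights)
    # Assumption: the weights are already scaled to sum to m.
    if abs(sum(weights) - m) > 1e-9:
        raise ValueError("weights must sum to the number of hypotheses")
    return [w / m * alpha for w in weights]


# One heavily weighted hypothesis and four lightly weighted ones.
thresholds = minimal_thresholds([4, 0.25, 0.25, 0.25, 0.25])
```

With these weights the first threshold works out to about .04 and the other four to about .0025, matching the numbers in the bullet.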

Example Weights

Consider an experiment that collected race and gender as covariates. Hypothetical weights:

            Hypotheses                      Share of Total Weight   Individual Weight
Top-Line    1: ATE Treatment vs. Control    1/2                     5.5
1st Order   4: 1 Gender + 3 Race            1/3                     .92
2nd Order   7: Gender x Race                1/6                     .26

Pre-analysis plan suggestions

  • Discard underpowered hypotheses
  • Put a lot of weight on the hypotheses you care about
  • Think in terms of worst-case scenarios. If treatment is less potent than you guessed, what paper can you write?

Use-Case 2: No Preregistration

Default weights must be used. Here are some rules:

  • Every plausible research hypothesis should get positive weight
  • Hypotheses at the same level of interaction get the same weight
  • More specific groups of hypotheses get less weight

Type         Hypotheses                                   Share of Total Weight
Top-Line     ATE                                          1/2
1st Order    Heterogeneous Treatment Effect               1/3
2nd+ Order   Interaction Heterogeneous Treatment Effect   1/6
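To apply these default shares, each level's share of the total weight has to be split evenly among the hypotheses at that level. The sketch below shows one way to do that, assuming the weights are scaled to sum to \(m\); the function name `weights_from_shares` and the hypothesis counts in the usage line are hypothetical and not taken from any study.

```python
def weights_from_shares(group_sizes, group_shares):
    """Spread each group's share of the total weight evenly over its hypotheses."""
    if len(group_sizes) != len(group_shares):
        raise ValueError("need one share per group")
    if abs(sum(group_shares) - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    m = sum(group_sizes)  # total number of hypotheses
    weights = []
    for size, share in zip(group_sizes, group_shares):
        # The group receives share * m in total, divided equally among its members.
        weights.extend([share * m / size] * size)
    return weights


# Hypothetical counts: 1 top-line, 3 first-order, 2 higher-order hypotheses.
default_weights = weights_from_shares([1, 3, 2], [1 / 2, 1 / 3, 1 / 6])
```

Because every hypothesis at the same level gets the same weight and deeper levels get a smaller share, the result respects all three rules listed above.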

Example: Iyengar, Hahn, Krosnick and Walker (2008)

Use-Case 3: Forensic Analysis

If you think a paper was p-hacked, are its results still believable?

  • Make a guess about how many hypotheses were tested but not reported
  • Assume p > .05 for those
  • Assume reported hypotheses were the lowest p-values
  • Run both weighted BH (with the default weights) and unweighted BH
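The steps above can be sketched for the unweighted case as follows. Under the stated assumptions, an unreported test with p > .05 can never meet a BH threshold (no threshold exceeds \(\alpha\)), so the hidden tests matter only through the total count \(m\). The function name `forensic_bh` and the example p-values are hypothetical; a weighted version would also need a guess about which level each hidden hypothesis belongs to.

```python
def forensic_bh(reported_pvalues, n_unreported, alpha=0.05):
    """Unweighted BH on reported p-values plus hidden tests assumed to have p > alpha.

    Returns the reported p-values that would still be rejected.
    """
    # Guessed total number of tests: reported plus unreported.
    m = len(reported_pvalues) + n_unreported
    ordered = sorted(reported_pvalues)
    k_hat = 0
    for k, p in enumerate(ordered, start=1):
        if p <= k / m * alpha:
            k_hat = k  # largest rank whose p-value meets its threshold
    return ordered[:k_hat]


# Hypothetical paper reporting three results, with a guess of 7 unreported tests.
surviving = forensic_bh([0.001, 0.02, 0.04], n_unreported=7)
```

Rerunning the function over a range of guesses for `n_unreported` shows how many hidden tests it would take before a reported result stops surviving.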

Do not fall into a pit of despair