September 4, 2015

## Motivating Problem

Social scientific datasets accommodate many ways of analyzing the data. Everything can be p-hacked!

• Multiple treatment arms and numerous covariates collected for experiments
• Dozens (typical) or hundreds (ANES) of survey questions
• Various measures or indices as independent variables

## Conducting many tests

• Rejecting whenever p < .05 does not control any overall error rate.
• The family-wise error rate is too conservative and does not adapt when more than one rejection is made.
• The false discovery rate is a good compromise, but it does not reflect the researcher's utility.

Controlling the Weighted False Discovery Rate is the best choice

## Weighted False Discovery Rate (Benjamini and Hochberg, 1997)

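\begin{align} Q(w) &= \begin{cases} \frac{ \sum_{i=1}^k w_i V_i}{\sum_{i=1}^k w_i},& R > 0 \\ 0,& R= 0 \end{cases} \end{align}

Here $$w_i$$ is the weight on hypothesis $$i$$, $$V_i = 1$$ if hypothesis $$i$$ is rejected in error (and 0 otherwise), $$R$$ is the number of rejections, and the sums run over the $$k$$ rejected hypotheses. The weighted false discovery rate is the expectation of $$Q(w)$$.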

Scale the weights to sum to $$m$$, the number of hypotheses. Order the p-values from smallest to largest and reject the first $$\hat{k}$$ hypotheses, where:

\begin{align} \hat{k}= \max \{k : p_k \leq \tfrac{\sum_{i=1}^k w_i}{m} \alpha \} \label{eqn:wbh} \end{align}
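The step-up rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `weighted_bh`, the example p-values, and the example weights are all hypothetical, and the weights are rescaled inside the function so that they sum to the number of hypotheses.

```python
def weighted_bh(pvalues, weights, alpha=0.05):
    """Return a list of booleans marking the hypotheses rejected by weighted BH."""
    m = len(pvalues)
    scale = m / sum(weights)  # rescale so the weights sum to m
    order = sorted(range(m), key=lambda i: pvalues[i])  # smallest p-value first

    k_hat = 0
    cum_weight = 0.0
    for rank, i in enumerate(order, start=1):
        cum_weight += weights[i] * scale
        # keep the largest rank whose p-value clears its weighted threshold
        if pvalues[i] <= cum_weight / m * alpha:
            k_hat = rank

    rejected = [False] * m
    for i in order[:k_hat]:
        rejected[i] = True
    return rejected


# Hypothetical data: the first hypothesis carries most of the weight.
pvals = [0.030, 0.001, 0.200, 0.040, 0.600]
weighted = weighted_bh(pvals, [4.0, 0.25, 0.25, 0.25, 0.25])
unweighted = weighted_bh(pvals, [1.0] * 5)
print(weighted)
print(unweighted)
```

With equal weights the function reduces to the ordinary BH procedure, so the two calls show how moving weight onto one hypothesis changes which p-values clear their thresholds.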

## Use-Case 1: Preregistered Experiments

In a pre-registered setting, researchers are free to specify any set of weights and hypotheses.

• Weights must be pre-specified, otherwise you're probably HARKing (Kerr, 1998)
• Consider the minimal p-value thresholds you're willing to subject your hypotheses to.
• With 5 hypotheses, would I accept a guaranteed rejection at p < .04 for one hypothesis in exchange for guarantees only at p < .0025 for the rest?
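The trade-off in the last bullet can be worked out directly. A hypothesis with weight w is always rejected when its p-value is at most w/m times alpha, because its own weight already counts toward the cumulative sum at its rank. The sketch below assumes alpha = .05 and weights of 4 and .25 (summing to m = 5); these weights are inferred here because they reproduce the .04 and .0025 figures, and the function name is hypothetical.

```python
def guaranteed_thresholds(weights, alpha=0.05):
    """Per-hypothesis p-value below which rejection is guaranteed under weighted BH."""
    m = len(weights)
    scale = m / sum(weights)  # rescale so the weights sum to m
    return [w * scale / m * alpha for w in weights]


# Assumed weights: one favored hypothesis and four others sharing the remainder.
thresholds = guaranteed_thresholds([4.0, 0.25, 0.25, 0.25, 0.25])
print([round(t, 4) for t in thresholds])
```

Trying a few candidate weight vectors this way makes the cost of favoring one hypothesis explicit before the weights are registered.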

## Example Weights

Consider an experiment that collected race and gender as covariates. Hypothetical weights:

| Type | Hypotheses | % of Total | Individual Weights |
|---|---|---|---|
| Top-Line | 1: ATE Treatment vs. Control | 1/2 | 5.5 |
| 1st Order | 4: 1 Gender + 3 Race | 1/3 | .92 |
| 2nd Order | 7: Gender x Race | 1/6 | .26 |

## Pre-analysis plan suggestions

• Put a lot of weight on the hypotheses you care about
• Think in terms of worst-case scenarios. If treatment is less potent than you guessed, what paper can you write?

## Use-Case 2: No preregistration

Default weights must be used. Here are some rules:

• Every plausible research hypothesis should get positive weight
• Hypotheses at the same level of interaction get the same weight
• More specific groups of hypotheses get less weight

| Type | Hypotheses | % of Total |
|---|---|---|
| Top-Line | ATE | 1/2 |
| 1st Order | Heterogeneous Treatment Effect | 1/3 |
| 2+'th Order | Interaction Heterogeneous Treatment Effect | 1/6 |
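One way to turn the tier shares above into per-hypothesis weights is to split each tier's share of the total weight m evenly among the hypotheses in that tier. The sketch below assumes that reading of the rules; the function name and the hypothesis counts (1, 3, and 2) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from fractions import Fraction


def tier_weights(counts, shares):
    """Per-hypothesis weight for each tier, with total weight equal to m."""
    if sum(shares) != 1:
        raise ValueError("tier shares must sum to 1")
    m = sum(counts)  # total number of hypotheses
    # each tier receives share * m in total, split evenly within the tier
    return [share * m / count for count, share in zip(counts, shares)]


# Hypothetical counts: 1 top-line, 3 first-order, 2 higher-order hypotheses.
counts = [1, 3, 2]
shares = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
weights = tier_weights(counts, shares)
print([float(w) for w in weights])
```

Exact fractions are used so that the check that the weights sum to m is not affected by rounding.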

## Use-Case 3: Forensic Analysis

If you think a paper was p-hacked, are its results still believable?

• Make a guess about how many hypotheses were tested but not reported
• Assume p > .05 for those
• Assume reported hypotheses were the lowest p-values
• Run the Benjamini-Hochberg (BH) procedure twice: weighted, using the default weights, and unweighted
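The steps above can be sketched as follows. Everything specific here is an assumption for illustration: the function names, the two reported p-values, the guess of 8 unreported tests, and the split of those tests across tiers. Unreported p-values are set to 1.0, which gives the same rejections as any value above .05 when alpha = .05, since no weighted BH threshold exceeds alpha.

```python
def weighted_bh(pvalues, weights, alpha=0.05):
    """Weighted BH step-up rule; returns one rejection flag per hypothesis."""
    m = len(pvalues)
    scale = m / sum(weights)  # rescale so the weights sum to m
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_hat, cum_weight = 0, 0.0
    for rank, i in enumerate(order, start=1):
        cum_weight += weights[i] * scale
        if pvalues[i] <= cum_weight / m * alpha:
            k_hat = rank
    rejected = [False] * m
    for i in order[:k_hat]:
        rejected[i] = True
    return rejected


def forensic_rejections(reported_p, reported_w, hidden_w, alpha=0.05):
    """Re-test reported results after padding with guessed unreported tests."""
    pvalues = list(reported_p) + [1.0] * len(hidden_w)  # hidden tests: p > .05
    weights = list(reported_w) + list(hidden_w)
    return weighted_bh(pvalues, weights, alpha)[:len(reported_p)]


# Hypothetical paper: reports a top-line ATE (p = .030) and one first-order
# effect (p = .004). Guess: 2 more first-order and 6 higher-order tests hidden.
reported_p = [0.030, 0.004]
m = 10
top, first, higher = m / 2, (m / 3) / 3, (m / 6) / 6  # default tier weights
unweighted = forensic_rejections(reported_p, [1, 1], [1] * 8)
weighted = forensic_rejections(reported_p, [top, first], [first] * 2 + [higher] * 6)
print(unweighted)
print(weighted)
```

Comparing the two outputs shows whether a reported result survives only because of the weighting scheme, only without it, or in both cases.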

Do not fall into a pit of despair