Pitfalls of data snooping

STATS 191

2024-04-01

Inference after selection

  • Most model selection procedures will choose features that have large \(T\)-statistics when testing whether their coefficients are 0 or not…

  • Even when no feature is related to the response, some features will have large \(T\)-statistics by chance!

Consequences

  • Using \(p\)-values from summary() of a selected model is misleading.

  • Using confidence intervals from confint() of a selected model is misleading.

Null example

  • Let’s make some data where there is no relation between Y and X.
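A minimal sketch of such a null data set, assuming 100 observations and 20 pure-noise features (the sample size, the number of features, the seed and all variable names are illustrative assumptions, not taken from the slides). It fits the full model and pulls the slope \(p\)-values out of summary():

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n <- 100                           # number of observations (assumed)
p <- 20                            # number of features (assumed)

X <- matrix(rnorm(n * p), n, p)    # features: pure noise
colnames(X) <- paste0("X", 1:p)
Y <- rnorm(n)                      # response generated independently of X

null_data <- data.frame(Y = Y, X)
full_fit <- lm(Y ~ ., data = null_data)

# p-values of the T-tests for each slope (drop the intercept row)
pvals <- summary(full_fit)$coefficients[-1, "Pr(>|t|)"]
min(pvals)                         # the smallest one can look "significant"
```

Every true slope is 0 here, so any small value of min(pvals) is produced by chance alone.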

Behavior of \(p\)-values

[1] 0.001327097

Best single variable model

  • Repeat 100 times, each time taking the single variable with the smallest \(p\)-value
  • What proportion below 5%?
[1] 0.83
  • About 83% of the time we’ll falsely declare a relationship between Y and X!

  • About 83% of our confidence intervals won’t cover 0 (truth)…
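The repetition described above can be sketched as follows. The helper best_single_pvalue() is hypothetical, and n = 100, p = 20 are assumed values, so the proportion printed need not match the figure reported on the slide. If the p tests were independent, the chance that at least one falls below 5% would be \(1 - 0.95^{20} \approx 0.64\), which gives a rough idea of what to expect under these assumptions.

```r
set.seed(2)                          # arbitrary seed
n <- 100; p <- 20; n_rep <- 100      # assumed sizes; 100 repetitions as in the text

# Hypothetical helper: simulate one null data set, fit every
# one-variable regression, and return the smallest slope p-value.
best_single_pvalue <- function(n, p) {
  X <- matrix(rnorm(n * p), n, p)
  Y <- rnorm(n)
  pvals <- apply(X, 2, function(x) {
    summary(lm(Y ~ x))$coefficients["x", "Pr(>|t|)"]
  })
  min(pvals)                         # keep only the "best" variable
}

min_pvals <- replicate(n_rep, best_single_pvalue(n, p))
mean(min_pvals < 0.05)               # proportion of false discoveries at the 5% level
```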

Inference after AIC selection

  • Let’s look at a selection procedure we have used…

  • We’ll build up 100 null data sets and store them for a few analyses

  • In practice, there will likely be some signals – here there are none…
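One way to build and store the null data sets, then run AIC selection on one of them, is sketched below. It assumes backward selection with step() starting from the full model; the slides do not say which direction or scope was used, and make_null_data(), n and p are illustrative choices.

```r
set.seed(3)                          # arbitrary seed
n <- 100; p <- 10; n_sets <- 100     # assumed sizes; 100 data sets as in the text

# Hypothetical helper: one data set in which Y is unrelated to every feature
make_null_data <- function(n, p) {
  X <- matrix(rnorm(n * p), n, p)
  colnames(X) <- paste0("X", 1:p)
  data.frame(Y = rnorm(n), X)
}

# Store the data sets in a list so later analyses can reuse them
null_sets <- replicate(n_sets, make_null_data(n, p), simplify = FALSE)

# AIC-based backward selection on the first data set
full_fit <- lm(Y ~ ., data = null_sets[[1]])
aic_fit <- step(full_fit, direction = "backward", trace = 0)
formula(aic_fit)                     # whatever survives was selected from pure noise
```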

Behavior of \(p\)-values from AIC selected model

Distribution of \(p\)-values

  • Valid \(p\)-values are uniform under the null, so the distribution function here should be diagonal…

  • 50% of our 95% confidence intervals will not cover 0 (truth)

[1] 0.5005834
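A sketch of how such a proportion could be computed, assuming backward step() selection on null data with n = 100 and p = 10 (assumed values, so the number printed will generally differ from the one above). A slope's \(p\)-value is below 5% exactly when its 95% interval from confint() excludes 0, so one proportion answers both questions.

```r
set.seed(4)                          # arbitrary seed
n <- 100; p <- 10; n_sets <- 100     # assumed sizes

# Hypothetical helper: select by AIC on one null data set and return
# the naive slope p-values reported for the selected model.
selected_pvalues <- function() {
  X <- matrix(rnorm(n * p), n, p)
  colnames(X) <- paste0("X", 1:p)
  d <- data.frame(Y = rnorm(n), X)
  aic_fit <- step(lm(Y ~ ., data = d), trace = 0)
  summary(aic_fit)$coefficients[-1, "Pr(>|t|)"]   # empty if nothing is selected
}

pvals <- unlist(replicate(n_sets, selected_pvalues(), simplify = FALSE))

# Compare the empirical distribution function with the diagonal;
# plot(ecdf(pvals)); abline(0, 1) shows the same thing graphically.
grid <- c(0.05, 0.25, 0.50, 0.75)
cbind(diagonal = grid, ecdf = ecdf(pvals)(grid))

mean(pvals < 0.05)                   # share of naive 95% intervals that miss 0
```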

Data splitting

  • Randomly split the data: use some % of the observations to select the model, and the remaining observations to construct CIs and compute \(p\)-values

Data splitting \(p\)-values

[1] 0.04
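Data splitting can be sketched as below, assuming a 50/50 split, backward step() selection, and null data with n = 100 and p = 10 (all assumptions; split_pvalues() is a hypothetical helper). Selection sees only one half of the data, and the selected formula is refit on the other half, so the reported \(p\)-values come from observations that played no part in the selection.

```r
set.seed(5)                          # arbitrary seed
n <- 100; p <- 10; n_rep <- 100      # assumed sizes

split_pvalues <- function() {
  X <- matrix(rnorm(n * p), n, p)
  colnames(X) <- paste0("X", 1:p)
  d <- data.frame(Y = rnorm(n), X)

  in_select <- sample(n, size = n / 2)   # rows used for selection only
  d_select <- d[in_select, ]
  d_infer <- d[-in_select, ]             # held-out rows used for inference

  aic_fit <- step(lm(Y ~ ., data = d_select), trace = 0)

  # Refit the selected formula on the held-out half
  refit <- lm(formula(aic_fit), data = d_infer)
  summary(refit)$coefficients[-1, "Pr(>|t|)"]
}

pvals <- unlist(replicate(n_rep, split_pvalues(), simplify = FALSE))
mean(pvals < 0.05)                   # should sit near the nominal 5% level
```

The price is power: each stage works with only part of the data.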