Pitfalls of data snooping

STATS 191

2024-04-01

Inference after selection

Most model selection procedures tend to choose features whose \(T\)-statistics are large when testing whether their coefficients are 0…
Even when nothing is happening, some features will have large \(T\)-statistics purely by chance!
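
A rough back-of-the-envelope calculation shows how quickly this happens. It assumes \(p\) null features whose tests are independent, each carried out at level \(\alpha\); the independence and the specific numbers are simplifying assumptions for illustration, not values taken from these slides. The chance that at least one \(T\)-statistic looks significant is

\[
P(\text{at least one rejection}) = 1 - (1 - \alpha)^p, \qquad \text{e.g. } 1 - 0.95^{20} \approx 0.64 \text{ for } \alpha = 0.05,\ p = 20.
\]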

Consequences

Using \(p\)-values from summary() of a selected model is misleading.
Using confidence intervals from confint() of a selected model is misleading.

Null example

Let’s make some data where there is no relation between Y and X.

Behavior of \(p\)-values

[1] 0.001327097

Best single variable model

Repeat 100 times, taking

What proportion below 5%?

[1] 0.83

In this simulation, about 83% of the time we’ll falsely declare a relationship between Y and X!
Equivalently, about 83% of our 95% confidence intervals won’t cover 0 (the truth)…
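
The experiment above can be sketched in a few lines. This is a standard-library Python sketch rather than the R behind summary() and confint(); the sample size (n = 100), the number of features (p = 20), the Gaussian noise, and the helper names `slope_t_stat`, `t_pvalue` and `best_single_variable_pvalue` are all assumptions made for illustration, so the proportion it prints need not match the 0.83 reported above.

```python
import math
import random


def t_pvalue(t, df, steps=2000):
    """Two-sided p-value for a t statistic (Simpson integration of the t density)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def density(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    # 2 * integral over [0, a] is P(|T| <= a)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def slope_t_stat(x, y):
    """T-statistic for the slope in a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))


def best_single_variable_pvalue(rng, n=100, p=20):
    """Null data: Y is unrelated to every feature. Return the naive p-value
    of the single feature with the largest |T| (the 'best' one-variable model)."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    best_t = 0.0
    for _ in range(p):
        x = [rng.gauss(0, 1) for _ in range(n)]
        t = slope_t_stat(x, y)
        if abs(t) > abs(best_t):
            best_t = t
    return t_pvalue(best_t, n - 2)


rng = random.Random(0)
pvals = [best_single_variable_pvalue(rng) for _ in range(100)]
naive_rate = sum(pv < 0.05 for pv in pvals) / len(pvals)
print(naive_rate)  # proportion of selected-model p-values below 5%
```

Because the feature is chosen for having the largest \(|T|\), its naive \(p\)-value is the smallest of the p candidates, which is why the printed proportion sits far above the nominal 5%.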

Inference after AIC selection

Let’s look at a selection procedure we have used…
We’ll build up 100 null data sets and store them for a few analyses
In practice, there will likely be some signals – here there are none…

Behavior of \(p\)-values from AIC selected model

Distribution of \(p\)-values

If these \(p\)-values were valid, their distribution function would lie on the diagonal (uniform under the null)…
About 50% of our 95% confidence intervals will not cover 0 (the truth)

[1] 0.5005834

Data splitting

Randomly set aside some % of the data for model selection; use the remaining data to construct CIs and compute \(p\)-values

Data splitting \(p\)-values

[1] 0.04
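
A minimal sketch of data splitting on null data, written in standard-library Python rather than R. The 50/50 split, n = 100, p = 20, the Gaussian noise, the use of "largest \(|T|\) among single-variable fits" as the selection rule, and the helper names are all assumptions for illustration; they are not the procedure that produced the number above.

```python
import math
import random


def t_pvalue(t, df, steps=2000):
    """Two-sided p-value for a t statistic (Simpson integration of the t density)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def density(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def slope_t_stat(x, y):
    """T-statistic for the slope in a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))


def split_pvalue(rng, n=100, p=20, frac_select=0.5):
    """Select the best single feature on one random part of the data,
    then compute its p-value on the remaining, untouched part."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]

    idx = list(range(n))
    rng.shuffle(idx)  # random split of the rows
    n_select = int(frac_select * n)
    select_idx, infer_idx = idx[:n_select], idx[n_select:]

    def t_on(col, rows):
        return slope_t_stat([col[i] for i in rows], [y[i] for i in rows])

    # Selection uses only the selection rows...
    best = max(columns, key=lambda col: abs(t_on(col, select_idx)))
    # ...and inference uses only the held-out rows.
    return t_pvalue(t_on(best, infer_idx), len(infer_idx) - 2)


rng = random.Random(1)
split_pvals = [split_pvalue(rng) for _ in range(100)]
split_rate = sum(pv < 0.05 for pv in split_pvals) / len(split_pvals)
print(split_rate)  # should be close to the nominal 5%
```

The held-out rows play no part in choosing the feature, so the \(T\)-statistic computed on them has its usual null distribution. The price is that both selection and inference work with fewer observations.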