Inference & Robustness

Wild bootstrap p-values and multiple testing corrections (Bonferroni, BH, Holm) for robust event study inference.

Standard event study inference relies on asymptotic theory, which assumes residuals are approximately normal and independent. When these assumptions fail, two tools strengthen your conclusions: the wild bootstrap for more accurate p-values, and multiple testing corrections to control false discoveries when testing many hypotheses.

Wild Bootstrap

Why Bootstrap?

Asymptotic p-values (from t-tests) can be unreliable when:

  • The estimation window is short (< 60 days)
  • Residuals are non-normal or heavy-tailed
  • The sample size (number of firms) is small
  • There is cross-sectional dependence

The wild bootstrap generates the null distribution by resampling residuals with random sign flips, preserving the heteroskedasticity structure of the original data.

Algorithm

  1. Compute abnormal returns \(AR_{i,t}\) and the test statistic \(T\) (e.g., CAAR t-statistic)
  2. For \(b = 1, \ldots, B\) bootstrap replications:
    1. Draw Rademacher weights \(\eta_i \in \{-1, +1\}\) with equal probability (or Mammen weights)
    2. Construct bootstrap residuals: \(AR^*_{i,t} = \eta_i \cdot AR_{i,t}\)
    3. Recompute the test statistic \(T^*_b\) from the bootstrap residuals
  3. The bootstrap p-value is:

\[ p_{\text{boot}} = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}[|T^*_b| \geq |T|] \]
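The steps above can be sketched in a few lines of base R. The matrix `ar` and the helper `caar_t` below are illustrative stand-ins (simulated abnormal returns and a one-sample CAAR t-statistic), not part of the package API:

```r
# Wild bootstrap of a CAAR t-statistic (illustrative sketch)
set.seed(42)
ar <- matrix(rnorm(5 * 11, sd = 0.01), nrow = 5)  # 5 firms x 11 event days

caar_t <- function(x) {
  car_i <- rowSums(x)                              # CAR per firm over the window
  mean(car_i) / (sd(car_i) / sqrt(length(car_i)))  # one-sample t on firm CARs
}

t_obs <- caar_t(ar)
B <- 999
t_boot <- replicate(B, {
  eta <- sample(c(-1, 1), nrow(ar), replace = TRUE)  # Rademacher weights, one per firm
  caar_t(eta * ar)                                   # flip each firm's residuals, recompute
})
p_boot <- mean(abs(t_boot) >= abs(t_obs))
```

Note that the sign flip is applied per firm, not per observation, so the within-firm dependence of residuals across event days is preserved under the bootstrap null.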

Running in R

# Wild bootstrap for AAR/CAAR significance
boot_result <- bootstrap_test(
  task,
  n_boot = 999,
  weight_type = "rademacher"
)
boot_result
# A tibble: 11 × 5
   relative_index observed_aar observed_caar boot_p_aar boot_p_caar
            <int>        <dbl>         <dbl>      <dbl>       <dbl>
 1             -5     0.0103         0.0103       0.059       0.059
 2             -4     0.000819       0.0111       1           0.269
 3             -3    -0.00645        0.00464      0.374       0.646
 4             -2    -0.00296        0.00168      0.921       0.881
 5             -1    -0.00340       -0.00172      0.709       0.942
 6              0    -0.00291       -0.00462      0.643       0.863
 7              1     0.00625        0.00162      0.393       0.942
 8              2    -0.00732       -0.00569      0.576       0.81 
 9              3    -0.0108        -0.0165       0.181       0.548
10              4    -0.00809       -0.0246       0.424       0.416
11              5     0.0108        -0.0138       0.173       0.799

Parameter     Description                                              Default
n_boot        Number of bootstrap replications                         999
weight_type   Weight distribution: "rademacher" or "mammen"            "rademacher"
statistic     Which statistic to bootstrap: "aar", "caar", or "both"   "both"

Rademacher vs. Mammen. Rademacher weights (\(\pm 1\) with equal probability) are simpler and work well in most cases. Mammen weights take the values \(-(\sqrt{5}-1)/2\) and \((\sqrt{5}+1)/2\) with probabilities \((\sqrt{5}+1)/(2\sqrt{5})\) and \((\sqrt{5}-1)/(2\sqrt{5})\); this two-point distribution has mean 0, variance 1, and third moment 1, so it better preserves skewness in the residuals. Use Mammen when residuals are notably skewed.
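The Mammen two-point distribution is easy to draw directly. The sampler below implements the standard golden-ratio construction (mean 0, variance 1, third moment 1); `rmammen` is an illustrative helper name, not a package function:

```r
# Mammen two-point weights:
#   eta = -(sqrt(5) - 1) / 2  with probability (sqrt(5) + 1) / (2 * sqrt(5))
#   eta =  (sqrt(5) + 1) / 2  with probability (sqrt(5) - 1) / (2 * sqrt(5))
rmammen <- function(n) {
  p    <- (sqrt(5) + 1) / (2 * sqrt(5))            # ~0.724
  vals <- c(-(sqrt(5) - 1) / 2, (sqrt(5) + 1) / 2) # ~-0.618, ~1.618
  sample(vals, n, replace = TRUE, prob = c(p, 1 - p))
}

set.seed(1)
eta <- rmammen(1e5)
c(mean(eta), var(eta))  # close to (0, 1)
```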

Multiple Testing Corrections

The Problem

When you test multiple event windows, subgroups, or test statistics, the probability of at least one false positive grows rapidly. With \(m\) independent tests at \(\alpha = 0.05\):

\[ P(\text{at least one false positive}) = 1 - (1 - \alpha)^m \]

For \(m = 10\) tests, the family-wise error rate is 40%. Multiple testing corrections adjust p-values to control this inflation.
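The inflation is easy to verify numerically:

```r
# Family-wise error rate for m independent tests at alpha = 0.05
alpha <- 0.05
m     <- 1:20
fwer  <- 1 - (1 - alpha)^m
round(fwer[c(1, 5, 10, 20)], 3)  # 0.050 0.226 0.401 0.642
```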

Available Methods

# Adjust p-values from multi-event test statistics for multiple testing
adjusted <- adjust_p_values(task, method = "BH")
adjusted
# A tibble: 11 × 15
   relative_index       aar n_events n_valid_events n_pos n_neg  aar_t     caar
            <int>     <dbl>    <int>          <int> <int> <int>  <dbl>    <dbl>
 1             -5  0.0103          5              5     5     0  2.48   0.0103 
 2             -4  0.000819        5              5     1     4  0.208  0.0111 
 3             -3 -0.00645         5              5     2     3 -1.05   0.00464
 4             -2 -0.00296         5              5     3     2 -0.293  0.00168
 5             -1 -0.00340         5              5     3     2 -0.413 -0.00172
 6              0 -0.00291         5              5     2     3 -0.772 -0.00462
 7              1  0.00625         5              5     3     2  1.17   0.00162
 8              2 -0.00732         5              5     1     4 -0.815 -0.00569
 9              3 -0.0108          5              5     1     4 -1.62  -0.0165 
10              4 -0.00809         5              5     2     3 -0.865 -0.0246 
11              5  0.0108          5              5     4     1  2.16  -0.0138 
# ℹ 7 more variables: caar_t <dbl>, car_window <chr>, p_raw_aar <dbl>,
#   p_adj_aar <dbl>, p_raw_caar <dbl>, p_adj_caar <dbl>, group <chr>

Method               R Code                  Controls                        Best For
Bonferroni           method = "bonferroni"   Family-wise error rate (FWER)   Conservative; few tests
Holm                 method = "holm"         FWER (step-down)                Default for FWER control
Hochberg             method = "hochberg"     FWER (step-up)                  Independent tests
Benjamini-Hochberg   method = "BH"           False discovery rate (FDR)      Many tests; exploratory
Benjamini-Yekutieli  method = "BY"           FDR under dependence            Correlated test statistics
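The same five method strings are accepted by base R's stats::p.adjust(), which makes it easy to compare how aggressively each method penalizes a set of raw p-values (the values below are illustrative):

```r
# Compare multiple-testing corrections on illustrative raw p-values
p_raw <- c(0.001, 0.012, 0.034, 0.2, 0.6)
adj <- sapply(c("bonferroni", "holm", "hochberg", "BH", "BY"),
              function(m) p.adjust(p_raw, method = m))
round(adj, 3)
# Holm is never larger than Bonferroni; BH is never larger than Holm.
```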

FWER vs. FDR

            FWER (Bonferroni, Holm)             FDR (BH, BY)
Controls    Probability of any false positive   Expected proportion of false positives
Strictness  Very conservative                   Less conservative
Power       Lower; misses more true effects     Higher; better detection
Use when    Few tests, all must be valid        Many tests, some false positives acceptable

Recommendation. Use Holm (step-down Bonferroni) as the default for FWER control — it is uniformly more powerful than Bonferroni. Use Benjamini-Hochberg when testing many hypotheses (e.g., AAR significance at each event-time day) and you can tolerate a controlled rate of false discoveries.

Practical Example

Adjust p-values from different test statistics or across subgroups:

# Adjust using Holm's step-down method (more powerful than Bonferroni)
adjusted_holm <- adjust_p_values(task, method = "holm")

# Adjust using Benjamini-Yekutieli (for dependent tests)
adjusted_by <- adjust_p_values(task, method = "BY")

When to Use Each Tool

Scenario                              Recommended Approach
Short estimation window (< 60 days)   Wild bootstrap
Non-normal residuals                  Wild bootstrap
Multiple event windows tested         BH or Holm correction
Subgroup comparisons                  BH correction
Single pre-specified window           No correction needed
Maximum robustness                    Wild bootstrap + BH correction

Literature

  • Cameron, A.C., Gelbach, J.B. & Miller, D.L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3), 414–427.
  • Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
  • Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.

Next Steps