Inference & Robustness

Wild bootstrap p-values and multiple testing corrections (Bonferroni, BH, Holm) for robust event study inference.

Standard event study inference relies on asymptotic theory, which assumes residuals are approximately normal and independent. When these assumptions fail, two tools strengthen your conclusions: the wild bootstrap for more accurate p-values, and multiple testing corrections to control false discoveries when testing many hypotheses.

Wild Bootstrap

Why Bootstrap?

Asymptotic p-values (from t-tests) can be unreliable when:

  • The estimation window is short (< 60 days)
  • Residuals are non-normal or heavy-tailed
  • The sample size (number of firms) is small
  • There is cross-sectional dependence

The wild bootstrap generates the null distribution by resampling residuals with random sign flips, preserving the heteroskedasticity structure of the original data.

Algorithm

  1. Compute abnormal returns \(AR_{i,t}\) and the test statistic \(T\) (e.g., CAAR t-statistic)
  2. For \(b = 1, \ldots, B\) bootstrap replications:
    1. Draw Rademacher weights \(\eta_i \in \{-1, +1\}\) with equal probability (or Mammen weights)
    2. Construct bootstrap residuals: \(AR^*_{i,t} = \eta_i \cdot AR_{i,t}\)
    3. Recompute the test statistic \(T^*_b\) from the bootstrap residuals
  3. The bootstrap p-value is:

\[ p_{\text{boot}} = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}[|T^*_b| \geq |T|] \]
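The steps above can be sketched in a few lines of base R. The matrix `ar` and the helper `caar_t` below are illustrative stand-ins (simulated abnormal returns and a one-sample CAAR t-statistic), not part of the package API:

```r
# Wild bootstrap of a CAAR t-statistic (illustrative sketch)
set.seed(42)
ar <- matrix(rnorm(5 * 11, sd = 0.01), nrow = 5)  # 5 firms x 11 event days

caar_t <- function(x) {
  car_i <- rowSums(x)                              # CAR per firm over the window
  mean(car_i) / (sd(car_i) / sqrt(length(car_i)))  # one-sample t on firm CARs
}

t_obs <- caar_t(ar)
B <- 999
t_boot <- replicate(B, {
  eta <- sample(c(-1, 1), nrow(ar), replace = TRUE)  # Rademacher weights, one per firm
  caar_t(eta * ar)                                   # flip each firm's residuals, recompute
})
p_boot <- mean(abs(t_boot) >= abs(t_obs))
```

Note that the sign flip is applied per firm, not per observation, so the within-firm dependence of residuals across event days is preserved under the bootstrap null.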

Running in R

# Wild bootstrap for AAR/CAAR significance
boot_result <- bootstrap_test(
  task,
  n_boot = 999,
  weight_type = "rademacher"
)
boot_result
# A tibble: 11 × 5
   relative_index observed_aar observed_caar boot_p_aar boot_p_caar
            <int>        <dbl>         <dbl>      <dbl>       <dbl>
 1             -5     0.0103         0.0103       0.059       0.059
 2             -4     0.000819       0.0111       1           0.269
 3             -3    -0.00645        0.00464      0.374       0.646
 4             -2    -0.00296        0.00168      0.921       0.881
 5             -1    -0.00340       -0.00172      0.709       0.942
 6              0    -0.00291       -0.00462      0.643       0.863
 7              1     0.00625        0.00162      0.393       0.942
 8              2    -0.00732       -0.00569      0.576       0.81 
 9              3    -0.0108        -0.0165       0.181       0.548
10              4    -0.00809       -0.0246       0.424       0.416
11              5     0.0108        -0.0138       0.173       0.799

Parameter     Description                                              Default
n_boot        Number of bootstrap replications                         999
weight_type   Weight distribution: "rademacher" or "mammen"            "rademacher"
statistic     Which statistic to bootstrap: "aar", "caar", or "both"   "both"

Rademacher vs. Mammen. Rademacher weights (\(\pm 1\) with equal probability) are simpler and work well in most cases. Mammen weights take the values \(-(\sqrt{5}-1)/2\) and \((\sqrt{5}+1)/2\) with probabilities \((\sqrt{5}+1)/(2\sqrt{5})\) and \((\sqrt{5}-1)/(2\sqrt{5})\); this two-point distribution has mean 0, variance 1, and third moment 1, so it better preserves skewness in the residuals. Use Mammen when residuals are notably skewed.
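The Mammen two-point distribution is easy to draw directly. The sampler below implements the standard golden-ratio construction (mean 0, variance 1, third moment 1); `rmammen` is an illustrative helper name, not a package function:

```r
# Mammen two-point weights:
#   eta = -(sqrt(5) - 1) / 2  with probability (sqrt(5) + 1) / (2 * sqrt(5))
#   eta =  (sqrt(5) + 1) / 2  with probability (sqrt(5) - 1) / (2 * sqrt(5))
rmammen <- function(n) {
  p    <- (sqrt(5) + 1) / (2 * sqrt(5))            # ~0.724
  vals <- c(-(sqrt(5) - 1) / 2, (sqrt(5) + 1) / 2) # ~-0.618, ~1.618
  sample(vals, n, replace = TRUE, prob = c(p, 1 - p))
}

set.seed(1)
eta <- rmammen(1e5)
c(mean(eta), var(eta))  # close to (0, 1)
```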

Multiple Testing Corrections

The Problem

When you test multiple event windows, subgroups, or test statistics, the probability of at least one false positive grows rapidly. With \(m\) independent tests at \(\alpha = 0.05\):

\[ P(\text{at least one false positive}) = 1 - (1 - \alpha)^m \]

For \(m = 10\) tests, the family-wise error rate is 40%. Multiple testing corrections adjust p-values to control this inflation.
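The inflation is easy to verify numerically:

```r
# Family-wise error rate for m independent tests at alpha = 0.05
alpha <- 0.05
m     <- 1:20
fwer  <- 1 - (1 - alpha)^m
round(fwer[c(1, 5, 10, 20)], 3)  # 0.050 0.226 0.401 0.642
```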

Available Methods

# Adjust p-values from multi-event test statistics for multiple testing
adjusted <- adjust_p_values(task, method = "BH")
adjusted
# A tibble: 11 × 15
   relative_index       aar n_events n_valid_events n_pos n_neg  aar_t     caar
            <int>     <dbl>    <int>          <int> <int> <int>  <dbl>    <dbl>
 1             -5  0.0103          5              5     5     0  2.48   0.0103 
 2             -4  0.000819        5              5     1     4  0.208  0.0111 
 3             -3 -0.00645         5              5     2     3 -1.05   0.00464
 4             -2 -0.00296         5              5     3     2 -0.293  0.00168
 5             -1 -0.00340         5              5     3     2 -0.413 -0.00172
 6              0 -0.00291         5              5     2     3 -0.772 -0.00462
 7              1  0.00625         5              5     3     2  1.17   0.00162
 8              2 -0.00732         5              5     1     4 -0.815 -0.00569
 9              3 -0.0108          5              5     1     4 -1.62  -0.0165 
10              4 -0.00809         5              5     2     3 -0.865 -0.0246 
11              5  0.0108          5              5     4     1  2.16  -0.0138 
# ℹ 7 more variables: caar_t <dbl>, car_window <chr>, p_raw_aar <dbl>,
#   p_adj_aar <dbl>, p_raw_caar <dbl>, p_adj_caar <dbl>, group <chr>

Method               R Code                  Controls                        Best For
Bonferroni           method = "bonferroni"   Family-wise error rate (FWER)   Conservative; few tests
Holm                 method = "holm"         FWER (step-down)                Default for FWER control
Hochberg             method = "hochberg"     FWER (step-up)                  Independent tests
Benjamini-Hochberg   method = "BH"           False discovery rate (FDR)      Many tests; exploratory
Benjamini-Yekutieli  method = "BY"           FDR under dependence            Correlated test statistics
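The same five method strings are accepted by base R's stats::p.adjust(), which makes it easy to compare how aggressively each method penalizes a set of raw p-values (the values below are illustrative):

```r
# Compare multiple-testing corrections on illustrative raw p-values
p_raw <- c(0.001, 0.012, 0.034, 0.2, 0.6)
adj <- sapply(c("bonferroni", "holm", "hochberg", "BH", "BY"),
              function(m) p.adjust(p_raw, method = m))
round(adj, 3)
# Holm is never larger than Bonferroni; BH is never larger than Holm.
```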

FWER vs. FDR

            FWER (Bonferroni, Holm)             FDR (BH, BY)
Controls    Probability of any false positive   Expected proportion of false positives
Strictness  Very conservative                   Less conservative
Power       Lower; misses more true effects     Higher; better detection
Use when    Few tests, all must be valid        Many tests, some false positives acceptable

Recommendation. Use Holm (step-down Bonferroni) as the default for FWER control — it is uniformly more powerful than Bonferroni. Use Benjamini-Hochberg when testing many hypotheses (e.g., AAR significance at each event-time day) and you can tolerate a controlled rate of false discoveries.

Practical Example

Adjust p-values from different test statistics or across subgroups:

# Adjust using Holm's step-down method (more powerful than Bonferroni)
adjusted_holm <- adjust_p_values(task, method = "holm")

# Adjust using Benjamini-Yekutieli (for dependent tests)
adjusted_by <- adjust_p_values(task, method = "BY")

When to Use Each Tool

Scenario                              Recommended Approach
Short estimation window (< 60 days)   Wild bootstrap
Non-normal residuals                  Wild bootstrap
Multiple event windows tested         BH or Holm correction
Subgroup comparisons                  BH correction
Single pre-specified window           No correction needed
Maximum robustness                    Wild bootstrap + BH correction

Literature

  • Cameron, A.C., Gelbach, J.B. & Miller, D.L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3), 414–427.
  • Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
  • Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.

Next Steps