Inference & Robustness

Standard event study inference relies on asymptotic theory — assuming residuals are normal and independent. When these assumptions fail, two tools strengthen your conclusions: wild bootstrap for more accurate p-values, and multiple testing corrections to control false discovery when testing many hypotheses.

Robust inference methods protect against inflated significance levels that arise from violated assumptions. According to Cameron, Gelbach, and Miller (2008), wild bootstrap inference can reduce size distortion by 50 to 70 percent relative to asymptotic tests in samples with fewer than 50 clusters. Combining bootstrap p-values with multiple testing corrections provides the strongest available protection against both Type I and Type II errors in event study applications.

What Is the Wild Bootstrap?

Robust inference in event studies refers to statistical procedures that produce valid p-values and confidence intervals even when standard asymptotic assumptions — such as normality, independence, and homoskedasticity of residuals — are violated. The two primary tools are the wild bootstrap, which generates finite-sample-corrected p-values through residual resampling, and multiple testing corrections, which control false discovery rates when evaluating many hypotheses simultaneously.

Why Bootstrap?

Asymptotic p-values (from t-tests) can be unreliable when:

  • The estimation window is short (< 60 days)
  • Residuals are non-normal or heavy-tailed
  • The sample size (number of firms) is small
  • There is cross-sectional dependence

The wild bootstrap generates the null distribution by resampling residuals with random sign flips, preserving the heteroskedasticity structure of the original data. With 999 bootstrap replications, the smallest attainable nonzero p-value is 1/999 ≈ 0.001, which gives sufficient resolution for conventional significance reporting.

Algorithm

  1. Compute abnormal returns $AR_{i,t}$ and the test statistic $T$ (e.g., the CAAR t-statistic).
  2. For $b = 1, \ldots, B$ bootstrap replications:
    1. Draw Rademacher weights $\eta_i \in \{-1, +1\}$ with equal probability (or Mammen weights).
    2. Construct bootstrap residuals: $AR^*_{i,t} = \eta_i \cdot AR_{i,t}$.
    3. Recompute the test statistic $T^*_b$ from the bootstrap residuals.
  3. The bootstrap p-value is:

$$p_{\text{boot}} = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\left[|T^*_b| \geq |T|\right]$$
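The steps above can be sketched in a few lines of base R. Here `ar` is an N × T matrix of abnormal returns (firms × event-window days) and the statistic is a simple cross-sectional CAAR t-statistic; all names and the toy data are illustrative, not EventStudy package internals.

```r
# Minimal wild bootstrap sketch for a cross-sectional CAAR t-test.
caar_t <- function(ar) {
  car <- rowSums(ar)                                # CAR per firm
  mean(car) / (sd(car) / sqrt(length(car)))         # cross-sectional t-stat
}

wild_bootstrap_p <- function(ar, n_boot = 999) {
  t_obs <- caar_t(ar)
  t_star <- replicate(n_boot, {
    eta <- sample(c(-1, 1), nrow(ar), replace = TRUE)  # Rademacher weights
    caar_t(eta * ar)                                   # flip each firm's ARs
  })
  mean(abs(t_star) >= abs(t_obs))                   # bootstrap p-value
}

set.seed(42)
ar <- matrix(rnorm(50 * 5, mean = 0.002, sd = 0.02), nrow = 50)  # toy data
wild_bootstrap_p(ar)
```

Note that multiplying the weight vector `eta` by the matrix `ar` recycles the weights down the columns, so each firm's entire row of abnormal returns is flipped with the same sign, preserving within-firm dependence.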

Running in R

```r
# Wild bootstrap for AAR/CAAR significance
boot_result <- bootstrap_test(
  task,
  n_boot = 999,
  weight_type = "rademacher"
)
boot_result
```

Bootstrap Parameters

| Parameter | Options | Default |
|---|---|---|
| n_boot | Number of bootstrap replications | 999 |
| weight_type | Weight distribution: "rademacher" or "mammen" | "rademacher" |
| statistic | Which statistic to bootstrap: "aar", "caar", or "both" | "both" |

Rademacher vs. Mammen

Rademacher weights ($\pm 1$ with equal probability) are simpler and work well in most cases. Mammen weights, a two-point distribution taking the values $-(\sqrt{5}-1)/2 \approx -0.618$ and $(\sqrt{5}+1)/2 \approx 1.618$, better preserve skewness in the residuals. Use Mammen when residuals are notably skewed.
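A draw of Mammen weights can be generated in a few lines of base R (a sketch of the two-point distribution, not the package's internal sampler):

```r
# Two-point Mammen distribution: mean 0, variance 1, third moment 1,
# so bootstrap residuals inherit the skewness of the original residuals.
rmammen <- function(n) {
  a <- -(sqrt(5) - 1) / 2                # ~ -0.618, drawn with probability p
  b <-  (sqrt(5) + 1) / 2                # ~  1.618, drawn with probability 1 - p
  p <-  (sqrt(5) + 1) / (2 * sqrt(5))    # ~ 0.724
  sample(c(a, b), n, replace = TRUE, prob = c(p, 1 - p))
}
```

Swapping `rmammen(nrow(ar))` in for the Rademacher draw turns a Rademacher wild bootstrap into a Mammen one.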

What Are Multiple Testing Corrections?

The Problem

When you test multiple event windows, subgroups, or test statistics, the probability of at least one false positive grows rapidly. With $m$ independent tests at significance level $\alpha = 0.05$:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

For $m = 10$ tests, the family-wise error rate is 40%; for $m = 20$ tests it reaches 64%. Multiple testing corrections adjust p-values to control this inflation. The Benjamini-Hochberg (1995) procedure, one of the most cited statistical papers with over 90,000 citations, controls the false discovery rate rather than the family-wise error rate, offering substantially higher power when many hypotheses are tested.
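The error-rate inflation is easy to verify directly in R:

```r
# Family-wise error rate for m independent tests at alpha = 0.05
alpha <- 0.05
m <- c(1, 5, 10, 20)
fwer <- 1 - (1 - alpha)^m
round(fwer, 2)   # 0.05 0.23 0.40 0.64
```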

Available Methods

Multiple Testing Correction Methods

| Method | R Code | Controls | Best For |
|---|---|---|---|
| Bonferroni | method = "bonferroni" | Family-wise error rate (FWER) | Conservative, few tests |
| Holm | method = "holm" | FWER (step-down) | Default for FWER control |
| Hochberg | method = "hochberg" | FWER (step-up) | Independent tests |
| Benjamini-Hochberg | method = "BH" | False discovery rate (FDR) | Many tests, exploratory |
| Benjamini-Yekutieli | method = "BY" | FDR under dependence | Correlated test statistics |

FWER vs. FDR

| | FWER (Bonferroni, Holm) | FDR (BH, BY) |
|---|---|---|
| Controls | Probability of any false positive | Expected proportion of false positives |
| Strictness | Very conservative | Less conservative |
| Power | Lower; many true effects missed | Higher; better detection |
| Use when | Few tests, all must be valid | Many tests, some false positives acceptable |

Recommendation

Use Holm (step-down Bonferroni) as the default for FWER control — it is uniformly more powerful than Bonferroni. Use Benjamini-Hochberg when testing many hypotheses (e.g., AAR significance at each event-time day) and you can tolerate a controlled rate of false discoveries.
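Both corrections are available through base R's `p.adjust`; the raw p-values below are illustrative:

```r
# Raw p-values from, e.g., AAR tests at several event days (illustrative)
p_raw <- c(0.001, 0.008, 0.020, 0.041, 0.300)

p.adjust(p_raw, method = "holm")  # step-down FWER control
p.adjust(p_raw, method = "BH")    # false discovery rate control
```

Because BH controls the less strict FDR, its adjusted p-values are never larger than Holm's, which is exactly the power advantage described above.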

Wild Bootstrap
A resampling method that generates bootstrap test statistics by multiplying residuals with random sign-flip weights (Rademacher or Mammen), preserving heteroskedasticity while imposing the null hypothesis. Typically uses 999–9,999 replications.
Family-Wise Error Rate (FWER)
The probability of making at least one false rejection among all tested hypotheses. Controlled by Bonferroni and Holm procedures, which are conservative but bound the chance of any false positive at the chosen significance level.
False Discovery Rate (FDR)
The expected proportion of rejected hypotheses that are false positives. Controlled by Benjamini-Hochberg and Benjamini-Yekutieli procedures, offering higher power than FWER methods at the cost of tolerating some false discoveries.

When Should I Use Each Approach?

| Scenario | Recommended Approach |
|---|---|
| Short estimation window (< 60 days) | Wild bootstrap |
| Non-normal residuals | Wild bootstrap |
| Multiple event windows tested | BH or Holm correction |
| Subgroup comparisons | BH correction |
| Single pre-specified window | No correction needed |
| Maximum robustness | Wild bootstrap + BH correction |
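For the maximum-robustness scenario, the two tools compose naturally: compute a bootstrap p-value for each event day, then correct the resulting vector with BH. The per-day p-values below are hypothetical stand-ins for wild bootstrap output:

```r
# Hypothetical bootstrap p-values for AAR tests at event days -2..+2
p_boot <- c(0.004, 0.120, 0.001, 0.035, 0.450)

# Control the false discovery rate across the five daily tests
p_bh <- p.adjust(p_boot, method = "BH")
which(p_bh < 0.05)   # positions of the days still significant after correction
```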

Literature

  • Cameron, A.C., Gelbach, J.B. & Miller, D.L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3), 414–427.
  • Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
  • Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.

Implement this with the R package

Access advanced features and full customization through the EventStudy R package.
