Inference & Robustness

Standard event study inference relies on asymptotic theory — assuming residuals are normal and independent. When these assumptions fail, two tools strengthen your conclusions: wild bootstrap for more accurate p-values, and multiple testing corrections to control false discovery when testing many hypotheses.

Robust inference methods protect against inflated significance levels that arise from violated assumptions. According to Cameron, Gelbach, and Miller (2008), wild bootstrap inference can reduce size distortion by 50 to 70 percent relative to asymptotic tests in samples with fewer than 50 clusters. Combining bootstrap p-values with multiple testing corrections provides the strongest available protection against both Type I and Type II errors in event study applications.

What Is the Wild Bootstrap?

Robust inference in event studies refers to statistical procedures that produce valid p-values and confidence intervals even when standard asymptotic assumptions — such as normality, independence, and homoskedasticity of residuals — are violated. The two primary tools are the wild bootstrap, which generates finite-sample-corrected p-values through residual resampling, and multiple testing corrections, which control false discovery rates when evaluating many hypotheses simultaneously.

Why Bootstrap?

Asymptotic p-values (from t-tests) can be unreliable when:

  • The estimation window is short (< 60 days)
  • Residuals are non-normal or heavy-tailed
  • The sample size (number of firms) is small
  • There is cross-sectional dependence

The wild bootstrap generates the null distribution by resampling residuals with random sign flips, preserving the heteroskedasticity structure of the original data. With 999 bootstrap replications, the smallest attainable nonzero p-value is 1/999 ≈ 0.001, which gives sufficient resolution for conventional significance reporting.

Algorithm

  1. Compute abnormal returns $AR_{i,t}$ and the test statistic $T$ (e.g., the CAAR t-statistic).
  2. For $b = 1, \ldots, B$ bootstrap replications:
    1. Draw Rademacher weights $\eta_i \in \{-1, +1\}$ with equal probability (or Mammen weights).
    2. Construct bootstrap residuals: $AR^*_{i,t} = \eta_i \cdot AR_{i,t}$.
    3. Recompute the test statistic $T^*_b$ from the bootstrap residuals.
  3. The bootstrap p-value is:

$$p_{\text{boot}} = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\left[|T^*_b| \geq |T|\right]$$
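The steps above can be sketched in a few lines of base R. Here `ar` is an N × T matrix of abnormal returns (firms × event-window days) and the statistic is a simple cross-sectional CAAR t-statistic; all names and the toy data are illustrative, not EventStudy package internals.

```r
# Minimal wild bootstrap sketch for a cross-sectional CAAR t-test.
caar_t <- function(ar) {
  car <- rowSums(ar)                                # CAR per firm
  mean(car) / (sd(car) / sqrt(length(car)))         # cross-sectional t-stat
}

wild_bootstrap_p <- function(ar, n_boot = 999) {
  t_obs <- caar_t(ar)
  t_star <- replicate(n_boot, {
    eta <- sample(c(-1, 1), nrow(ar), replace = TRUE)  # Rademacher weights
    caar_t(eta * ar)                                   # flip each firm's ARs
  })
  mean(abs(t_star) >= abs(t_obs))                   # bootstrap p-value
}

set.seed(42)
ar <- matrix(rnorm(50 * 5, mean = 0.002, sd = 0.02), nrow = 50)  # toy data
wild_bootstrap_p(ar)
```

Note that multiplying the weight vector `eta` by the matrix `ar` recycles the weights down the columns, so each firm's entire row of abnormal returns is flipped with the same sign, preserving within-firm dependence.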

Running in R

```r
# Wild bootstrap for AAR/CAAR significance
boot_result <- bootstrap_test(
  task,
  n_boot = 999,
  weight_type = "rademacher"
)
boot_result
```

Bootstrap Parameters

| Parameter | Options | Default |
|---|---|---|
| n_boot | Number of bootstrap replications | 999 |
| weight_type | Weight distribution: "rademacher" or "mammen" | "rademacher" |
| statistic | Which statistic to bootstrap: "aar", "caar", or "both" | "both" |

Rademacher vs. Mammen

Rademacher weights ($\pm 1$ with equal probability) are simpler and work well in most cases. Mammen weights, a two-point distribution taking the values $-(\sqrt{5}-1)/2 \approx -0.618$ and $(\sqrt{5}+1)/2 \approx 1.618$, better preserve skewness in the residuals. Use Mammen when residuals are notably skewed.
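A draw of Mammen weights can be generated in a few lines of base R (a sketch of the two-point distribution, not the package's internal sampler):

```r
# Two-point Mammen distribution: mean 0, variance 1, third moment 1,
# so bootstrap residuals inherit the skewness of the original residuals.
rmammen <- function(n) {
  a <- -(sqrt(5) - 1) / 2                # ~ -0.618, drawn with probability p
  b <-  (sqrt(5) + 1) / 2                # ~  1.618, drawn with probability 1 - p
  p <-  (sqrt(5) + 1) / (2 * sqrt(5))    # ~ 0.724
  sample(c(a, b), n, replace = TRUE, prob = c(p, 1 - p))
}
```

Swapping `rmammen(nrow(ar))` in for the Rademacher draw turns a Rademacher wild bootstrap into a Mammen one.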

What Are Multiple Testing Corrections?

The Problem

When you test multiple event windows, subgroups, or test statistics, the probability of at least one false positive grows rapidly. With $m$ independent tests at significance level $\alpha = 0.05$:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

For $m = 10$ tests, the family-wise error rate is 40%; for $m = 20$ tests it reaches 64%. Multiple testing corrections adjust p-values to control this inflation. The Benjamini-Hochberg (1995) procedure, one of the most cited statistical papers with over 90,000 citations, controls the false discovery rate rather than the family-wise error rate, offering substantially higher power when many hypotheses are tested.
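The error-rate inflation is easy to verify directly in R:

```r
# Family-wise error rate for m independent tests at alpha = 0.05
alpha <- 0.05
m <- c(1, 5, 10, 20)
fwer <- 1 - (1 - alpha)^m
round(fwer, 2)   # 0.05 0.23 0.40 0.64
```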

Available Methods

Multiple Testing Correction Methods

| Method | R Code | Controls | Best For |
|---|---|---|---|
| Bonferroni | method = "bonferroni" | Family-wise error rate (FWER) | Conservative, few tests |
| Holm | method = "holm" | FWER (step-down) | Default for FWER control |
| Hochberg | method = "hochberg" | FWER (step-up) | Independent tests |
| Benjamini-Hochberg | method = "BH" | False discovery rate (FDR) | Many tests, exploratory |
| Benjamini-Yekutieli | method = "BY" | FDR under dependence | Correlated test statistics |

FWER vs. FDR

| | FWER (Bonferroni, Holm) | FDR (BH, BY) |
|---|---|---|
| Controls | Probability of any false positive | Expected proportion of false positives |
| Strictness | Very conservative | Less conservative |
| Power | Lower; many true effects missed | Higher; better detection |
| Use when | Few tests, all must be valid | Many tests, some false positives acceptable |

Recommendation

Use Holm (step-down Bonferroni) as the default for FWER control — it is uniformly more powerful than Bonferroni. Use Benjamini-Hochberg when testing many hypotheses (e.g., AAR significance at each event-time day) and you can tolerate a controlled rate of false discoveries.
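Both corrections are available through base R's `p.adjust`; the raw p-values below are illustrative:

```r
# Raw p-values from, e.g., AAR tests at several event days (illustrative)
p_raw <- c(0.001, 0.008, 0.020, 0.041, 0.300)

p.adjust(p_raw, method = "holm")  # step-down FWER control
p.adjust(p_raw, method = "BH")    # false discovery rate control
```

Because BH controls the less strict FDR, its adjusted p-values are never larger than Holm's, which is exactly the power advantage described above.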

Wild Bootstrap
A resampling method that generates bootstrap test statistics by multiplying residuals with random sign-flip weights (Rademacher or Mammen), preserving heteroskedasticity while imposing the null hypothesis. Typically uses 999–9,999 replications.
Family-Wise Error Rate (FWER)
The probability of making at least one false rejection among all tested hypotheses. Controlled by Bonferroni and Holm procedures, which are conservative but bound the chance of any false positive at the chosen significance level.
False Discovery Rate (FDR)
The expected proportion of rejected hypotheses that are false positives. Controlled by Benjamini-Hochberg and Benjamini-Yekutieli procedures, offering higher power than FWER methods at the cost of tolerating some false discoveries.

When Should I Use Each Approach?

| Scenario | Recommended Approach |
|---|---|
| Short estimation window (< 60 days) | Wild bootstrap |
| Non-normal residuals | Wild bootstrap |
| Multiple event windows tested | BH or Holm correction |
| Subgroup comparisons | BH correction |
| Single pre-specified window | No correction needed |
| Maximum robustness | Wild bootstrap + BH correction |
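For the maximum-robustness scenario, the two tools compose naturally: compute a bootstrap p-value for each event day, then correct the resulting vector with BH. The per-day p-values below are hypothetical stand-ins for wild bootstrap output:

```r
# Hypothetical bootstrap p-values for AAR tests at event days -2..+2
p_boot <- c(0.004, 0.120, 0.001, 0.035, 0.450)

# Control the false discovery rate across the five daily tests
p_bh <- p.adjust(p_boot, method = "BH")
which(p_bh < 0.05)   # positions of the days still significant after correction
```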

Literature

  • Cameron, A.C., Gelbach, J.B. & Miller, D.L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3), 414–427.
  • Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
  • Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.

Implement this with the R package

Access advanced features and full customization through the EventStudy R package.
