Inference & Robustness
Standard event study inference relies on asymptotic theory — assuming residuals are normal and independent. When these assumptions fail, two tools strengthen your conclusions: wild bootstrap for more accurate p-values, and multiple testing corrections to control false discovery when testing many hypotheses.
Robust inference methods protect against inflated significance levels that arise from violated assumptions. According to Cameron, Gelbach, and Miller (2008), wild bootstrap inference can reduce size distortion by 50 to 70 percent relative to asymptotic tests in samples with fewer than 50 clusters. Combining bootstrap p-values with multiple testing corrections provides the strongest available protection against both Type I and Type II errors in event study applications.
What Is Robust Inference?
Robust inference in event studies refers to statistical procedures that produce valid p-values and confidence intervals even when standard asymptotic assumptions — such as normality, independence, and homoskedasticity of residuals — are violated. The two primary tools are the wild bootstrap, which generates finite-sample-corrected p-values through residual resampling, and multiple testing corrections, which control false discovery rates when evaluating many hypotheses simultaneously.
Why Bootstrap?
Asymptotic p-values (from t-tests) can be unreliable when:
- The estimation window is short (< 60 days)
- Residuals are non-normal or heavy-tailed
- The sample size (number of firms) is small
- There is cross-sectional dependence
The wild bootstrap generates the null distribution by resampling residuals with random sign flips, preserving the heteroskedasticity structure of the original data. With 999 bootstrap replications, the minimum achievable p-value is 0.001, providing a resolution of 3 decimal places for significance reporting.
Algorithm
- Compute abnormal returns and the observed test statistic $T$ (e.g., the CAAR t-statistic).
- For each bootstrap replication $b = 1, \dots, B$:
  - Draw Rademacher weights $w_i \in \{-1, +1\}$ with equal probability (or Mammen weights).
  - Construct bootstrap residuals: $\varepsilon_i^{*(b)} = w_i \, \hat{\varepsilon}_i$.
  - Recompute the test statistic $T^{*(b)}$ from the bootstrap residuals.
- The bootstrap p-value is: $p = \dfrac{1 + \#\{b : |T^{*(b)}| \ge |T|\}}{1 + B}$.
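The algorithm above can be sketched in base R for a one-sample test on simulated abnormal returns (illustrative only; the internals of the package's `bootstrap_test` may differ):

```r
# Wild bootstrap sketch for H0: mean abnormal return = 0
set.seed(42)
ar <- rt(30, df = 3) * 0.02                        # heavy-tailed ARs, n = 30 firms
t_obs <- mean(ar) / (sd(ar) / sqrt(length(ar)))    # observed t-statistic

B <- 999
resid0 <- ar - mean(ar)                            # residuals with H0 imposed
t_boot <- numeric(B)
for (b in seq_len(B)) {
  w <- sample(c(-1, 1), length(ar), replace = TRUE)  # Rademacher sign flips
  ar_b <- resid0 * w                               # bootstrap sample under H0
  t_boot[b] <- mean(ar_b) / (sd(ar_b) / sqrt(length(ar_b)))
}
# Two-sided p-value with the +1 correction; minimum attainable value is 1/(B+1)
p_boot <- (1 + sum(abs(t_boot) >= abs(t_obs))) / (1 + B)
p_boot
```

Because each weight vector flips signs firm by firm, the bootstrap preserves each observation's variance (heteroskedasticity) while enforcing the null.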
Running in R
```r
# Wild bootstrap for AAR/CAAR significance
boot_result <- bootstrap_test(
  task,
  n_boot = 999,
  weight_type = "rademacher"
)
boot_result
```

Bootstrap Parameters
| Parameter | Description | Default |
|---|---|---|
| n_boot | Number of bootstrap replications | 999 |
| weight_type | Weight distribution: "rademacher" or "mammen" | "rademacher" |
| statistic | Which statistic to bootstrap: "aar", "caar", or "both" | "both" |
Rademacher vs. Mammen
Rademacher weights ($w = \pm 1$, each with probability $1/2$) are simpler and work well in most cases. Mammen weights ($w = -(\sqrt{5}-1)/2$ with probability $(\sqrt{5}+1)/(2\sqrt{5})$, $w = (\sqrt{5}+1)/2$ with probability $(\sqrt{5}-1)/(2\sqrt{5})$) better preserve skewness in the residuals. Use Mammen when residuals are notably skewed.
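A quick base R check of both weight distributions (a sketch; the Mammen two-point values used here are the standard ones, constructed to match mean 0, variance 1, and third moment 1):

```r
# Sample both weight distributions and verify their moments
set.seed(1)
n <- 1e6
w_rad <- sample(c(-1, 1), n, replace = TRUE)       # Rademacher

phi   <- (sqrt(5) + 1) / 2                         # golden ratio
w_vals <- c(-(phi - 1), phi)                       # Mammen two-point support
p_neg  <- phi / sqrt(5)                            # P(w = -(sqrt(5)-1)/2)
w_mam  <- sample(w_vals, n, replace = TRUE, prob = c(p_neg, 1 - p_neg))

round(c(mean(w_rad), var(w_rad)), 2)               # close to 0 and 1
round(c(mean(w_mam), var(w_mam), mean(w_mam^3)), 2) # close to 0, 1, and 1
```

The matched third moment ($E[w^3] = 1$) is what lets Mammen weights carry residual skewness into the bootstrap distribution; Rademacher weights are symmetric, so $E[w^3] = 0$.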
What Are Multiple Testing Corrections?
The Problem
When you test multiple event windows, subgroups, or test statistics, the probability of at least one false positive grows rapidly. With $m$ independent tests at significance level $\alpha = 0.05$:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

For $m = 10$ tests, the family-wise error rate is 40%; for $m = 20$ tests it reaches 64%. Multiple testing corrections adjust p-values to control this inflation. The Benjamini-Hochberg (1995) procedure, one of the most cited statistical papers with over 90,000 citations, controls the false discovery rate rather than the family-wise error rate, offering substantially higher power when many hypotheses are tested.
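The inflation is easy to verify directly in R:

```r
# Family-wise error rate for m independent tests at alpha = 0.05:
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha <- 0.05
m <- c(1, 5, 10, 20, 50)
fwer <- 1 - (1 - alpha)^m
round(fwer, 2)   # rises from 0.05 (m = 1) to 0.40 (m = 10) and 0.64 (m = 20)
```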
Available Methods
Multiple Testing Correction Methods
| Method | R Code | Controls | Best For |
|---|---|---|---|
| Bonferroni | method = "bonferroni" | Family-wise error rate (FWER) | Conservative, few tests |
| Holm | method = "holm" | FWER (step-down) | Default for FWER control |
| Hochberg | method = "hochberg" | FWER (step-up) | Independent tests |
| Benjamini-Hochberg | method = "BH" | False discovery rate (FDR) | Many tests, exploratory |
| Benjamini-Yekutieli | method = "BY" | FDR under dependence | Correlated test statistics |
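All five methods in the table are available through base R's `p.adjust()`; a short sketch with illustrative p-values:

```r
# Compare every correction method on the same vector of raw p-values
p_raw <- c(0.001, 0.008, 0.012, 0.030, 0.040, 0.250)
sapply(c("bonferroni", "holm", "hochberg", "BH", "BY"),
       function(m) p.adjust(p_raw, method = m))
```

At a 0.05 threshold, BH retains five of the six tests as significant while Bonferroni retains only two — the power difference the table describes.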
FWER vs. FDR
| | FWER (Bonferroni, Holm) | FDR (BH, BY) |
|---|---|---|
| Controls | Probability of any false positive | Expected proportion of false positives |
| Strictness | Very conservative | Less conservative |
| Power | Lower — many true effects missed | Higher — better detection |
| Use when | Few tests, all must be valid | Many tests, some false positives acceptable |
Recommendation
Use Holm (step-down Bonferroni) as the default for FWER control — it is uniformly more powerful than Bonferroni. Use Benjamini-Hochberg when testing many hypotheses (e.g., AAR significance at each event-time day) and you can tolerate a controlled rate of false discoveries.
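The dominance of Holm over Bonferroni is visible directly in the adjusted p-values (illustrative values; Holm's step-down multipliers shrink as hypotheses are rejected):

```r
# Holm-adjusted p-values are never larger than Bonferroni's,
# so every Bonferroni rejection is also a Holm rejection
p_raw <- c(0.004, 0.010, 0.020, 0.045)
cbind(bonferroni = p.adjust(p_raw, "bonferroni"),
      holm       = p.adjust(p_raw, "holm"))
# At alpha = 0.05, Holm rejects all four tests; Bonferroni only the first two
```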
- Wild Bootstrap
- A resampling method that generates bootstrap test statistics by multiplying residuals with random sign-flip weights (Rademacher or Mammen), preserving heteroskedasticity while imposing the null hypothesis. Typically uses 999–9,999 replications.
- Family-Wise Error Rate (FWER)
- The probability of making at least one false rejection among all tested hypotheses. Controlled by Bonferroni and Holm procedures, which are conservative but keep this probability below the chosen significance level.
- False Discovery Rate (FDR)
- The expected proportion of rejected hypotheses that are false positives. Controlled by Benjamini-Hochberg and Benjamini-Yekutieli procedures, offering higher power than FWER methods at the cost of tolerating some false discoveries.
When Should I Use Each Approach?
| Scenario | Recommended Approach |
|---|---|
| Short estimation window (< 60 days) | Wild bootstrap |
| Non-normal residuals | Wild bootstrap |
| Multiple event windows tested | BH or Holm correction |
| Subgroup comparisons | BH correction |
| Single pre-specified window | No correction needed |
| Maximum robustness | Wild bootstrap + BH correction |
Literature
- Cameron, A.C., Gelbach, J.B. & Miller, D.L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3), 414–427.
- Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
- Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.
Implement this with the R package
Access advanced features and full customization through the EventStudy R package.
What Should I Read Next?
- Test Statistics — choosing the right significance test
- Diagnostics & Export — model validation
- Power Analysis — Monte Carlo simulation for study design