Data Preparation for Event Studies
Correct data preparation is the foundation of a reliable event study. The EventStudy R package requires three input datasets: firm stock prices, a market index, and an event request table. Each must follow a specific format. This page covers the required data structures, date formatting, how to handle missing observations, and quality checks you should perform before running an analysis.
Part of the Methodology Guide
This page is part of the Event Study Methodology Guide.
What Data Do I Need?
Every event study requires three datasets. The firm stock prices and market index provide the return data used to estimate the expected return model and compute abnormal returns. The event request table defines which events to analyze and how.
| Dataset | Purpose | Required Columns | Rows |
|---|---|---|---|
| Firm stock prices | Daily adjusted closing prices for each firm | symbol, date, adjusted | One row per firm per trading day |
| Market index | Benchmark return data (e.g., S&P 500) | symbol, date, adjusted | One row per trading day |
| Event request table | Defines events and window parameters | event_id, firm_symbol, index_symbol, event_date, group, window params | One row per event |
Use adjusted closing prices
Always use adjusted closing prices that account for stock splits, dividends, and other corporate actions. Unadjusted prices create artificial return spikes on ex-dividend and split dates, which contaminate both the estimation and event windows.
Firm Stock Price Data
The firm stock price data frame contains daily adjusted closing prices for each firm in your study. It must have exactly three columns:
| Column | Type | Description | Example |
|---|---|---|---|
| symbol | Character | Unique firm identifier (ticker or ISIN) | AAPL |
| date | Character | Trading date in DD.MM.YYYY format | 03.01.2020 |
| adjusted | Numeric | Adjusted closing price | 102.50 |
The data must span the entire period needed for each event: the estimation window plus the event window. For example, if you use a 250-day estimation window with a 10-day gap and a [-5, +5] event window, you need at least 270 trading days of data before and 5 days after each event date.
library(EventStudy)
library(tidyverse)
# Load price data (e.g., from Yahoo Finance via tidyquant)
# library(tidyquant)
# raw_prices <- tq_get(c("AAPL", "MSFT", "GOOGL"),
# from = "2018-01-01", to = "2023-12-31")
# Format for the EventStudy package
firm_data <- raw_prices |>
transmute(
symbol = symbol,
date = format(date, "%d.%m.%Y"),
adjusted = adjusted
)
head(firm_data)
# # A tibble: 6 x 3
# symbol date adjusted
# <chr> <chr> <dbl>
# 1 AAPL 02.01.2018 172.
# 2 AAPL 03.01.2018 172.
# 3 AAPL 04.01.2018 173.
# 4 AAPL 05.01.2018 175.
# 5 AAPL 08.01.2018 174.
# 6 AAPL 09.01.2018 174.Data sources
Common sources for adjusted stock prices include Yahoo Finance (via the tidyquant or quantmod R packages), Bloomberg, Refinitiv (formerly Thomson Reuters), and CRSP for academic research.
Market Index Data
The market index data frame has the same three-column structure as the firm data. It provides the benchmark return for the expected return model (e.g., the Market Model). Choose an index that represents the market in which your firms trade.
| Market | Common Index | Symbol Example |
|---|---|---|
| US equities | S&P 500, CRSP Value-Weighted | SP500 |
| European equities | STOXX Europe 600, DAX | STOXX600 |
| UK equities | FTSE 100 | FTSE100 |
| Japanese equities | Nikkei 225, TOPIX | TOPIX |
| Emerging markets | MSCI Emerging Markets | MSCIEM |
# Market index data (same structure as firm data)
index_data <- sp500_prices |>
transmute(
symbol = "SP500",
date = format(date, "%d.%m.%Y"),
adjusted = adjusted
)
# Ensure the index covers the same date range as firm data
stopifnot(min(index_data$date) <= min(firm_data$date))
stopifnot(max(index_data$date) >= max(firm_data$date))Index must match firm dates
The market index must cover at least the same date range as the firm data. If the index has trading days that a firm does not (e.g., the firm was not yet listed), the package handles this automatically. But if the index data is shorter than the required estimation + event window, the event will fail.
Event Request Table
The event request table defines each event you want to analyze. Each row represents one event: a specific firm experiencing a specific event on a specific date.
| Column | Type | Description | Example |
|---|---|---|---|
| event_id | Integer | Unique identifier for the event | 1 |
| firm_symbol | Character | Must match a symbol in firm data | AAPL |
| index_symbol | Character | Must match a symbol in index data | SP500 |
| event_date | Character | Date of the event in DD.MM.YYYY | 15.03.2022 |
| group | Character | Grouping variable for cross-sectional analysis | Tech |
| event_window_start | Integer | Start of event window relative to event date | -5 |
| event_window_end | Integer | End of event window relative to event date | 5 |
| shift_estimation_window | Integer | Gap between estimation and event windows | -10 |
| estimation_window_length | Integer | Number of trading days in estimation window | 250 |
# Define events
request <- tibble(
event_id = 1:3,
firm_symbol = c("AAPL", "MSFT", "GOOGL"),
index_symbol = "SP500",
event_date = c("15.03.2022", "20.04.2022", "10.05.2022"),
group = c("Tech", "Tech", "Tech"),
event_window_start = -5,
event_window_end = 5,
shift_estimation_window = -10,
estimation_window_length = 250
)
# The estimation window will be:
# [event_date + shift - estimation_window_length,
# event_date + shift]
# i.e., 250 days ending 10 days before the eventThe shift_estimation_window parameter creates a gap between the end of the estimation window and the start of the event window. This gap prevents the event from contaminating the estimation of normal returns. A typical gap is 10 trading days. For more on window selection, see Event Window & Estimation Window Selection.
Date Formatting
The EventStudy package uses the DD.MM.YYYY format for all dates. This applies to both the price data and the event request table. Dates must represent trading days — weekends and public holidays are not valid event dates.
# Convert from Date object to DD.MM.YYYY string
date_obj <- as.Date("2022-03-15")
formatted <- format(date_obj, "%d.%m.%Y")
# "15.03.2022"
# Convert from common formats
# From YYYY-MM-DD (ISO)
as.Date("2022-03-15") |> format("%d.%m.%Y")
# From MM/DD/YYYY (US)
as.Date("03/15/2022", format = "%m/%d/%Y") |> format("%d.%m.%Y")
# From YYYYMMDD (compact)
as.Date("20220315", format = "%Y%m%d") |> format("%d.%m.%Y")
# Batch convert a data frame column
df$date <- format(as.Date(df$date), "%d.%m.%Y")Common date mistakes
The most frequent data preparation error is incorrect date formatting. Ensure you use periods as separators (not hyphens or slashes) and the day-month-year order (not month-day-year). A date like "03.01.2020" means January 3, 2020 — not March 1.
Handling Missing Data
Missing data arises from trading halts, newly listed firms, delistings, and data provider gaps. How you handle missing observations depends on where they appear.
| Location | Impact | Recommended Action |
|---|---|---|
| Estimation window (few days) | Slightly fewer observations for model fitting | Usually acceptable; the package handles this automatically |
| Estimation window (>20% missing) | Unreliable model parameter estimates | Exclude the event from the study |
| Event window | Gap in abnormal returns; cumulative AR is unreliable | Exclude the event or use the last available price (carry forward) |
| Market index | Cannot compute market return for that day | Fill from alternative source or exclude affected days |
# Check for missing values
sum(is.na(firm_data$adjusted))
# Check for gaps in trading dates per firm
firm_data |>
group_by(symbol) |>
summarise(
n_days = n(),
first_date = min(date),
last_date = max(date),
n_missing = sum(is.na(adjusted))
)
# Remove firms with insufficient data
min_required_days <- 250 + 10 + 5 # estimation + gap + event window
firm_data <- firm_data |>
group_by(symbol) |>
filter(n() >= min_required_days) |>
ungroup()
# Optional: forward-fill isolated missing prices
# (use with caution -- this assumes zero return on missing days)
firm_data <- firm_data |>
group_by(symbol) |>
arrange(date) |>
fill(adjusted, .direction = "down") |>
ungroup()Data Quality Checks
Before running an event study, perform the following quality checks to catch common data issues that can invalidate results.
| Check | What to Look For | How to Fix |
|---|---|---|
| Duplicate rows | Same symbol + date appears more than once | distinct(symbol, date, .keep_all = TRUE) |
| Zero or negative prices | Adjusted price <= 0 | Remove or replace from alternative source |
| Extreme returns | Daily return > 100% or < -90% | Verify against source; likely a data error or split not adjusted |
| Date alignment | Firm and index dates do not overlap | Ensure both cover the same trading calendar |
| Sufficient history | Fewer days than estimation_window_length + gap + event_window | Extend data range or shorten estimation window |
| Event date is a trading day | Event date falls on weekend or holiday | Shift to next trading day or verify against trading calendar |
# 1. Remove duplicates
firm_data <- firm_data |> distinct(symbol, date, .keep_all = TRUE)
index_data <- index_data |> distinct(symbol, date, .keep_all = TRUE)
# 2. Check for zero or negative prices
stopifnot(all(firm_data$adjusted > 0, na.rm = TRUE))
stopifnot(all(index_data$adjusted > 0, na.rm = TRUE))
# 3. Check for extreme returns (flag but don't auto-remove)
firm_returns <- firm_data |>
group_by(symbol) |>
arrange(date) |>
mutate(ret = adjusted / lag(adjusted) - 1) |>
ungroup()
extreme <- firm_returns |> filter(abs(ret) > 1)
if (nrow(extreme) > 0) {
warning(paste(nrow(extreme), "extreme returns detected. Verify data."))
print(extreme)
}
# 4. Check that event dates exist in firm data
event_dates_valid <- request |>
rowwise() |>
mutate(
date_in_firm = event_date %in%
(firm_data |> filter(symbol == firm_symbol) |> pull(date)),
date_in_index = event_date %in%
(index_data |> filter(symbol == index_symbol) |> pull(date))
)
invalid_events <- event_dates_valid |>
filter(!date_in_firm | !date_in_index)
if (nrow(invalid_events) > 0) {
warning("Some event dates are not trading days. Adjust or remove:")
print(invalid_events)
}Automate quality checks
Wrap these checks in a function and run them every time you prepare data. This prevents subtle data errors from silently producing unreliable results.
Complete Example: From Raw Data to Event Study
Here is a full workflow that downloads data, prepares it, runs quality checks, and executes an event study.
library(EventStudy)
library(tidyverse)
# --- Step 1: Prepare firm prices ---
firm_data <- tibble(
symbol = rep(c("AAPL", "MSFT"), each = 300),
date = rep(format(
seq(as.Date("2021-01-04"), by = "day", length.out = 430) |>
((d) d[!weekdays(d) %in% c("Saturday", "Sunday")])() |>
head(300),
"%d.%m.%Y"
), 2),
adjusted = c(
cumprod(1 + rnorm(300, 0.0005, 0.015)) * 130,
cumprod(1 + rnorm(300, 0.0004, 0.012)) * 240
)
)
# --- Step 2: Prepare market index ---
index_data <- tibble(
symbol = "SP500",
date = unique(firm_data$date),
adjusted = cumprod(1 + rnorm(300, 0.0003, 0.010)) * 4200
)
# --- Step 3: Create event request ---
request <- tibble(
event_id = 1:2,
firm_symbol = c("AAPL", "MSFT"),
index_symbol = "SP500",
event_date = firm_data |>
filter(symbol == "AAPL") |>
slice(260) |>
pull(date),
group = "Tech",
event_window_start = -5,
event_window_end = 5,
shift_estimation_window = -10,
estimation_window_length = 240
)
# --- Step 4: Quality checks ---
stopifnot(all(firm_data$adjusted > 0))
stopifnot(nrow(distinct(firm_data, symbol, date)) == nrow(firm_data))
# --- Step 5: Run event study ---
task <- EventStudyTask$new(firm_data, index_data, request)
ps <- ParameterSet$new()
task <- run_event_study(task, ps)
# --- Step 6: View results ---
task$get_aar()Run this in R
The EventStudy R package lets you run these calculations programmatically with full control over parameters.
What Should I Do Next?
- Event Window & Estimation Window Selection — choose appropriate window lengths for your study
- Expected Return Models — select the right benchmark model
- Return Calculation — simple vs. log returns
- Diagnostics & Export — validate your results and export for publication