Stacked DiD (Wing, Freedman & Hollingsworth 2024)#

This tutorial demonstrates the StackedDiD estimator, which implements the stacked difference-in-differences method from Wing, Freedman & Hollingsworth (2024), “Stacked Difference-in-Differences”, NBER Working Paper 32054.

When to use StackedDiD:

You have staggered adoption and want a regression-based event-study framework (a natural fit if you think in OLS terms)
You want to inspect the clean comparison dataset directly; the stacked data is a first-class output
You want a robustness check alongside Callaway-Sant’Anna or Imputation DiD

Topics covered:

Basic usage and overall ATT
Event study estimation and visualization
Inside the stacked dataset — sub-experiments, event times, and Q-weights
Event window and trimming (IC1/IC2)
Q-weight schemes (aggregate, population, sample share)
Clean control definitions (not-yet-treated, strict, never-treated)
Comparison with Callaway-Sant’Anna and Imputation DiD
Advanced features: anticipation and clustering

See also: Tutorial 02 for Callaway-Sant’Anna and Sun-Abraham, Tutorial 11 for Imputation DiD, Tutorial 12 for Two-Stage DiD.

[ ]:

import numpy as np

from diff_diff import (
    StackedDiD, CallawaySantAnna, ImputationDiD,
    generate_staggered_data, plot_event_study
)

# For nicer plots (optional)
try:
    import matplotlib.pyplot as plt
    HAS_MATPLOTLIB = True
except ImportError:
    HAS_MATPLOTLIB = False
    print("matplotlib not installed - visualization examples will be skipped")

if HAS_MATPLOTLIB:
    try:
        plt.style.use('seaborn-v0_8-whitegrid')
    except OSError:
        # This style name needs matplotlib >= 3.6; keep the default style otherwise
        pass

Basic Usage#

The stacked DiD estimator follows a four-step process:

Partition the data into sub-experiments — one per adoption cohort, each with its own treated units and clean controls
Restrict each sub-experiment to an event window defined by kappa_pre and kappa_post (which can differ)
Compute Q-weights that correct for compositional imbalance across sub-experiments
Run a pooled WLS regression on the weighted stacked dataset

The main modeling choice is the event window size (kappa_pre, kappa_post).
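
To make steps 1 and 2 concrete, here is a minimal pure-Python sketch of the stacking logic on a toy panel. It is not the library's implementation: the function name build_stack, the column names, and the use of inf to mark never-treated units are assumptions made for illustration, and the Q-weights and the WLS regression (steps 3 and 4) are left out.

```python
from math import inf


def build_stack(first_treat, periods, kappa_pre, kappa_post):
    """Stack one sub-experiment per adoption cohort (steps 1 and 2 only)."""
    t_min, t_max = min(periods), max(periods)
    cohorts = sorted({a for a in first_treat.values() if a != inf})
    rows = []
    for a in cohorts:
        lo, hi = a - kappa_pre, a + kappa_post
        if lo < t_min or hi > t_max:
            continue  # the event window does not fit in the panel
        for unit, g in first_treat.items():
            treated = g == a
            clean_control = g > hi  # not yet treated when the window ends
            if not (treated or clean_control):
                continue
            for t in range(lo, hi + 1):
                rows.append({"unit": unit, "period": t, "sub_exp": a,
                             "event_time": t - a, "D": int(treated)})
    return rows


# Hypothetical toy panel: periods 0..9, inf marks a never-treated unit
first_treat = {"u1": 3, "u2": 5, "u3": 7, "u4": inf}
stack = build_stack(first_treat, range(0, 10), kappa_pre=2, kappa_post=2)

print(f"stacked rows: {len(stack)}")
print("sub-experiments containing u4:",
      sorted({r["sub_exp"] for r in stack if r["unit"] == "u4"}))
```

The never-treated unit u4 shows up in every sub-experiment, which is why the stacked dataset has more rows than the original panel.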

[ ]:

# Generate staggered adoption data with known treatment effect
data = generate_staggered_data(n_units=300, n_periods=10, treatment_effect=2.0, seed=42)

# Fit the stacked DiD estimator
est = StackedDiD(kappa_pre=2, kappa_post=2)
results = est.fit(data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')
results.print_summary()

Event Study#

Event study estimates effects at each relative time horizon. The reference period is e = -1 (last pre-treatment period). Pre-treatment coefficients assess parallel trends; post-treatment coefficients capture dynamic effects.
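
The role of the reference period can be seen in a hand computation for a single sub-experiment. The sketch below uses made-up group means (not the tutorial data) and a hypothetical helper, event_study_coefs; it illustrates the idea of normalizing to e = -1 and is not the pooled WLS regression the estimator runs.

```python
def event_study_coefs(treated_means, control_means, reference=-1):
    """Treated-minus-control gap at each event time, net of the gap at the reference period."""
    ref_gap = treated_means[reference] - control_means[reference]
    return {e: (treated_means[e] - control_means[e]) - ref_gap
            for e in sorted(treated_means) if e != reference}


# Hypothetical mean outcomes by event time within one sub-experiment
treated = {-2: 1.0, -1: 1.5, 0: 4.0, 1: 4.5, 2: 5.0}
control = {-2: 0.5, -1: 1.0, 0: 1.5, 1: 2.0, 2: 2.5}

coefs = event_study_coefs(treated, control)
for e, b in coefs.items():
    print(f"e = {e:>2}: {b:.2f}")
```

In this toy example the coefficient at e = -2 is zero (the groups moved in parallel before treatment) and the post-treatment coefficients equal 2.0.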

[ ]:

# Fit with event study aggregation
est = StackedDiD(kappa_pre=2, kappa_post=2)
results_es = est.fit(data, outcome='outcome', unit='unit', time='period',
                     first_treat='first_treat', aggregate='event_study')

# Plot event study
if HAS_MATPLOTLIB:
    plot_event_study(results_es, title='Stacked DiD Event Study')
else:
    print("Install matplotlib to see visualizations: pip install matplotlib")

[ ]:

# View event study effects as a table
results_es.to_dataframe(level='event_study')

Inside the Stacked Dataset#

A key feature of StackedDiD is that the stacked dataset is a first-class output. You can inspect results.stacked_data to see exactly how the estimator constructs its comparisons.

The stacked data contains four added columns:

_sub_exp: Which adoption cohort defines this sub-experiment
_event_time: Relative time to treatment (e.g., -2, -1, 0, 1, 2)
_D_sa: Treatment indicator (1 = treated unit, 0 = clean control)
_Q_weight: Corrective weight for compositional balance

Each sub-experiment is a “mini DiD” with its own treated cohort and a set of clean controls. The same control unit can appear in multiple sub-experiments. Q-weights correct for the fact that naive stacking implicitly overweights cohorts with more controls.
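
As a rough illustration of what the corrective weight does, the sketch below computes a control-unit weight per sub-experiment as the treated share divided by the control share, which is how the aggregate weight in Wing et al. (2024) is usually written. Treat the formula as an assumption to check against the paper; the counts and the helper name aggregate_q_weights are made up for this example.

```python
def aggregate_q_weights(n_treated, n_control):
    """Control-unit weight per sub-experiment: treated share divided by control share."""
    total_treated = sum(n_treated.values())
    total_control = sum(n_control.values())
    return {a: (n_treated[a] / total_treated) / (n_control[a] / total_control)
            for a in n_treated}


# Hypothetical unit counts per sub-experiment (not taken from the tutorial data)
n_treated = {3: 50, 5: 100, 7: 50}
n_control = {3: 200, 5: 120, 7: 80}

q = aggregate_q_weights(n_treated, n_control)
weighted_controls = {a: q[a] * n_control[a] for a in q}
total_weighted = sum(weighted_controls.values())

for a in sorted(q):
    print(f"sub-exp {a}: Q(ctrl) = {q[a]:.3f}, "
          f"weighted control share = {weighted_controls[a] / total_weighted:.2f}, "
          f"treated share = {n_treated[a] / sum(n_treated.values()):.2f}")
```

After weighting, each sub-experiment's share of controls matches its share of treated units, so a sub-experiment with many controls no longer dominates the pooled regression.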

[ ]:

# Inspect stacked data structure
sd = results.stacked_data
print(f"Original data shape:  {data.shape}")
print(f"Stacked data shape:   {sd.shape}")
print(f"Row expansion factor: {len(sd) / len(data):.1f}x")
print(f"Number of sub-experiments: {results.n_sub_experiments}")
print(f"Added columns: {[c for c in sd.columns if c.startswith('_')]}")
print()

# Rows per sub-experiment
print("Rows per sub-experiment:")
print(sd.groupby('_sub_exp').size().to_string())

[ ]:

# Q-weight summary by sub-experiment
print("Q-Weight Summary by Sub-Experiment")
print("=" * 56)
print(f"{'Sub-Exp':>8} {'Treated':>10} {'Controls':>10} {'Avg Q (ctrl)':>14}")
print("-" * 56)

for sub_exp in sorted(sd['_sub_exp'].unique()):
    sub = sd[sd['_sub_exp'] == sub_exp]
    treated = sub[sub['_D_sa'] == 1]
    controls = sub[sub['_D_sa'] == 0]
    n_treated = treated['unit'].nunique()
    n_controls = controls['unit'].nunique()
    avg_q = controls['_Q_weight'].mean() if len(controls) > 0 else 0.0
    print(f"{int(sub_exp):>8} {n_treated:>10} {n_controls:>10} {avg_q:>14.3f}")

print()
print("Note: Treated units always have Q = 1. Controls get adjusted weights.")

Event Window and Trimming#

The kappa_pre and kappa_post parameters define the event window (they can differ for asymmetric windows). Not all cohorts can be included at every window size:

IC1 (Window fits in panel): The event window [a - kappa_pre, a + kappa_post] must fall within the panel’s time range
IC2 (Clean controls exist): At least one clean control unit must exist for the sub-experiment

Cohorts that fail either condition are trimmed. When this happens, the estimator emits a UserWarning telling you which cohorts were dropped and why. You should expect to see these warnings in the next cell as we increase the window size — they are informative, not errors.

Tradeoff: A wider window gives more pre/post periods for trend assessment and dynamic effects, but trims more cohorts.

[ ]:

# Trimming with different kappa values
print("Effect of Event Window Size on Trimming")
print("=" * 65)
print(f"{'Window':>10} {'Included':>12} {'Trimmed':>12} {'ATT':>10} {'SE':>10}")
print("-" * 65)

for kp, kq in [(1, 1), (2, 2), (3, 3), (4, 4)]:
    window = f"[{-kp}, {kq}]"
    try:
        r = StackedDiD(kappa_pre=kp, kappa_post=kq).fit(
            data, outcome='outcome', unit='unit', time='period', first_treat='first_treat'
        )
    except ValueError:
        # This tutorial treats a ValueError here as "every cohort was trimmed"
        print(f"{window:>10}   All cohorts trimmed - window too wide")
        continue
    incl = str(r.groups)
    trim = str(r.trimmed_groups) if r.trimmed_groups else "[]"
    print(f"{window:>10} {incl:>12} {trim:>12} {r.overall_att:>10.3f} {r.overall_se:>10.3f}")

Reading the trimming warnings. At kappa=3, you should see a warning like:

Trimmed 1 adoption event(s) that don’t satisfy inclusion criteria: [7.0]. IC1 requires event window [-3, 3] to fit within data range [0, 9]. IC2 requires clean controls to exist.

This tells you cohort 7 was dropped because its event window [7-3, 7+3] = [4, 10] extends past the last observed period (9). At kappa=4, cohort 3 is also trimmed — its window [3-4, 3+4] = [-1, 7] starts before the first observed period (0).
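
The window arithmetic above can be checked by hand. This sketch applies only the IC1 condition to cohorts 3, 5, and 7 with the data range [0, 9] quoted in the warning; passes_ic1 is a hypothetical helper, not a library function, and IC2 is not checked.

```python
def passes_ic1(cohort, kappa_pre, kappa_post, t_min, t_max):
    """True when [cohort - kappa_pre, cohort + kappa_post] lies inside [t_min, t_max]."""
    return cohort - kappa_pre >= t_min and cohort + kappa_post <= t_max


cohorts = [3, 5, 7]
for kp, kq in [(1, 1), (2, 2), (3, 3), (4, 4)]:
    trimmed = [a for a in cohorts if not passes_ic1(a, kp, kq, t_min=0, t_max=9)]
    print(f"window [-{kp}, {kq}]: trimmed by IC1 = {trimmed}")
```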

What to do when you see these warnings:

Check which cohorts were lost. Inspect results.trimmed_groups — if the trimmed cohorts are central to your research question, the wider window may not be appropriate.
Assess the bias-variance tradeoff. Wider windows give you more pre-treatment periods to assess parallel trends and more post-treatment periods to capture dynamic effects — but at the cost of dropping cohorts at the panel edges. In the table above, notice how the point estimate and SE change as cohorts are trimmed.
Use an asymmetric window. You don’t need kappa_pre == kappa_post. If you need 3 pre-treatment periods for trend assessment but cohort 7 is trimmed because the post-treatment side overflows the panel, you can shorten just kappa_post.

Let’s walk through option 3. The symmetric window [-3, 3] trimmed cohort 7 because 7 + 3 = 10 exceeds the last period (9). If we keep 3 pre-treatment periods but reduce kappa_post to 2, the window becomes [7-3, 7+2] = [4, 9] — which fits:

[ ]:

# Asymmetric window recovers cohort 7 — no warning this time
r_asym = StackedDiD(kappa_pre=3, kappa_post=2).fit(
    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')
print(f"Asymmetric [-3, 2]: groups={r_asym.groups}, trimmed={r_asym.trimmed_groups}")
print(f"\nAll 3 cohorts included. ATT={r_asym.overall_att:.3f} (SE={r_asym.overall_se:.3f})")

Q-Weight Schemes#

Wing et al. (2024, Table 1) define three target estimands, each with a different Q-weight formula:

"aggregate" (default): Weight by treated cohort size (N_a^D / N_Ω^D), which targets the trimmed aggregate ATT. (For unbalanced panels, weights are computed at the observation level per (event_time, sub_exp), which reduces to cohort-size weighting when panels are balanced.)
"population": Weight by population size of treated cohort (requires a population column)
"sample_share": Weight by sample share of each sub-experiment

The choice depends on whether you want cohorts weighted by their treated unit count (aggregate), by an external population measure (population), or by their share of the stacked sample (sample_share).
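
To see how the three targets differ, the sketch below computes the share of the overall estimate that each cohort would receive under each scheme. The unit counts are made up, the population sizes reuse the values from the next cell, and cohort_weights is a hypothetical helper. The sample_share line assumes "sample share" means treated plus control units in the sub-experiment as a fraction of the stacked total, which is worth checking against Table 1 of the paper.

```python
def cohort_weights(n_treated, n_control, population, scheme):
    """Share of the overall ATT assigned to each cohort under a given scheme."""
    if scheme == "aggregate":
        size = dict(n_treated)
    elif scheme == "population":
        size = dict(population)
    elif scheme == "sample_share":
        # Assumption: treated plus control units in each sub-experiment
        size = {a: n_treated[a] + n_control[a] for a in n_treated}
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(size.values())
    return {a: size[a] / total for a in size}


# Hypothetical counts per sub-experiment; population sizes as in the cell below
n_treated = {3: 50, 5: 100, 7: 50}
n_control = {3: 200, 5: 120, 7: 80}
population = {3: 1000, 5: 2000, 7: 500}

for scheme in ["aggregate", "population", "sample_share"]:
    w = cohort_weights(n_treated, n_control, population, scheme)
    print(scheme.ljust(14), {a: round(v, 3) for a, v in w.items()})
```

If cohort-level effects are similar, the three schemes give similar ATTs; they diverge when effects differ across cohorts.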

[ ]:

# Add a unit-level population column representing cohort size
# (constant per unit, since Q-weight computation groups by [unit, sub_exp])
pop_map = {3: 1000, 5: 2000, 7: 500}  # cohort-level population sizes
data['population'] = data['first_treat'].map(pop_map).fillna(0).astype(int)

# Compare Q-weight schemes
print("Q-Weight Scheme Comparison")
print("=" * 60)
print(f"{'Scheme':<16} {'ATT':>10} {'SE':>10} {'CI Width':>12}")
print("-" * 60)

for scheme in ['aggregate', 'population', 'sample_share']:
    kwargs = {'kappa_pre': 2, 'kappa_post': 2, 'weighting': scheme}
    fit_kwargs = dict(outcome='outcome', unit='unit', time='period', first_treat='first_treat')
    if scheme == 'population':
        fit_kwargs['population'] = 'population'
    r = StackedDiD(**kwargs).fit(data, **fit_kwargs)
    ci_width = r.overall_conf_int[1] - r.overall_conf_int[0]
    print(f"{scheme:<16} {r.overall_att:>10.3f} {r.overall_se:>10.3f} {ci_width:>12.3f}")

Clean Control Definitions#

The clean_control parameter determines which units serve as controls in each sub-experiment:

"not_yet_treated" (default): Units adopted after a + kappa_post; most inclusive, maximizes statistical power
"strict": Units adopted after a + kappa_post + kappa_pre; more conservative, excludes units treated during the window
"never_treated": Only units with first_treat = inf; most restrictive, strongest identification

More restrictive definitions yield fewer controls and wider standard errors, but provide stronger causal identification.
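
The three rules can be written as simple predicates on a unit's adoption time. This is a sketch of the definitions as stated above, with a hypothetical helper is_clean_control and made-up adoption times; it is not the library's internal code.

```python
from math import inf


def is_clean_control(first_treat, a, kappa_pre, kappa_post, rule):
    """Apply one of the three clean-control rules to a single unit."""
    if rule == "not_yet_treated":
        return first_treat > a + kappa_post
    if rule == "strict":
        return first_treat > a + kappa_post + kappa_pre
    if rule == "never_treated":
        return first_treat == inf
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical adoption times of candidate controls for cohort a = 3
candidates = [5, 6, 8, inf]
counts = {}
for rule in ["not_yet_treated", "strict", "never_treated"]:
    kept = [g for g in candidates
            if is_clean_control(g, a=3, kappa_pre=2, kappa_post=2, rule=rule)]
    counts[rule] = len(kept)
    print(f"{rule:<16} keeps {kept}")
```

The control pool shrinks as the rule tightens, which is the source of the wider standard errors mentioned above.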

[ ]:

# Compare clean control definitions
print("Clean Control Definition Comparison")
print("=" * 70)
print(f"{'Definition':<18} {'ATT':>8} {'SE':>8} {'Ctrl Units':>12} {'Cohorts':>10}")
print("-" * 70)

for cc in ['not_yet_treated', 'strict', 'never_treated']:
    r = StackedDiD(kappa_pre=2, kappa_post=2, clean_control=cc).fit(
        data, outcome='outcome', unit='unit', time='period', first_treat='first_treat'
    )
    print(f"{cc:<18} {r.overall_att:>8.3f} {r.overall_se:>8.3f} {r.n_control_units:>12} {len(r.groups):>10}")

Comparison with Other Estimators#

StackedDiD, CallawaySantAnna, and ImputationDiD all address TWFE bias in staggered settings, but via different approaches:

StackedDiD: Constructs sub-experiments, applies Q-weights, runs pooled WLS
CallawaySantAnna: Computes group-time ATT(g,t) effects, then aggregates
ImputationDiD: Imputes counterfactual Y(0) via fixed effect model

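Under homogeneous treatment effects, all three should produce similar point estimates. Disagreement can flag treatment effect heterogeneity, but it can also reflect different estimands: StackedDiD averages only over the chosen event window and the cohorts that survive trimming, while the other two use all available periods. Agreement across estimators strengthens causal claims.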

[ ]:

# Fit all three estimators on the same data
sd_r = StackedDiD(kappa_pre=2, kappa_post=2).fit(
    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')
cs_r = CallawaySantAnna().fit(
    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')
imp_r = ImputationDiD().fit(
    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')

print("Estimator Comparison (True effect = 2.0)")
print("=" * 55)
print(f"{'Estimator':<25} {'ATT':>8} {'SE':>8} {'CI Width':>10}")
print("-" * 55)

for name, r in [("StackedDiD", sd_r), ("CallawaySantAnna", cs_r), ("ImputationDiD", imp_r)]:
    ci_width = r.overall_conf_int[1] - r.overall_conf_int[0]
    print(f"{name:<25} {r.overall_att:>8.3f} {r.overall_se:>8.3f} {ci_width:>10.3f}")

Advanced Features#

Anticipation#

If treatment effects begin before the official treatment date (e.g., firms change behavior in anticipation of a policy), use the anticipation parameter. Setting anticipation=k shifts the reference period from e = -1 to e = -1 - k, classifying periods e >= -k as post-treatment.
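
The relabeling rule can be sketched directly from that description. The helper classify_event_times is hypothetical, and it assumes the event window stays at [-kappa_pre, kappa_post] when anticipation is set; the library may handle the window differently.

```python
def classify_event_times(kappa_pre, kappa_post, anticipation=0):
    """Return (reference period, pre-treatment event times, post-treatment event times)."""
    reference = -1 - anticipation
    # Assumption: the window itself is not shifted by anticipation
    event_times = range(-kappa_pre, kappa_post + 1)
    post = [e for e in event_times if e >= -anticipation]
    pre = [e for e in event_times if e < -anticipation and e != reference]
    return reference, pre, post


for k in (0, 1):
    ref, pre, post = classify_event_times(kappa_pre=3, kappa_post=2, anticipation=k)
    print(f"anticipation={k}: reference e={ref}, pre={pre}, post={post}")
```

Under this assumption, each period of anticipation removes one pre-treatment period from the trend assessment, so a larger kappa_pre may be needed.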

[ ]:

# Compare ATT with and without anticipation
est_no_antic = StackedDiD(kappa_pre=2, kappa_post=2)
results_no_antic = est_no_antic.fit(
    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')

est_antic = StackedDiD(kappa_pre=2, kappa_post=2, anticipation=1)
results_antic = est_antic.fit(
    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')

print(f"ATT (no anticipation):       {results_no_antic.overall_att:.3f}")
print(f"ATT (1-period anticipation): {results_antic.overall_att:.3f}")

Clustering#

Standard errors can be clustered at two levels:

cluster='unit' (default): Conservative; accounts for the fact that the same unit appears across multiple sub-experiments
cluster='unit_subexp': Treats each sub-experiment appearance as independent; narrower SEs, but assumes independence across sub-experiments
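
The difference between the two levels is easiest to see by counting clusters. The sketch below uses a few made-up stacked rows and a hypothetical helper, cluster_ids; it only illustrates how the cluster identifier is formed, not how the library computes the variance.

```python
def cluster_ids(rows, level):
    """Build the cluster identifier for each stacked row."""
    if level == "unit":
        return [r["unit"] for r in rows]
    if level == "unit_subexp":
        return [(r["unit"], r["sub_exp"]) for r in rows]
    raise ValueError(f"unknown cluster level: {level}")


# Hypothetical stacked rows: control unit u4 is reused in three sub-experiments
rows = [
    {"unit": "u1", "sub_exp": 3},
    {"unit": "u4", "sub_exp": 3},
    {"unit": "u2", "sub_exp": 5},
    {"unit": "u4", "sub_exp": 5},
    {"unit": "u3", "sub_exp": 7},
    {"unit": "u4", "sub_exp": 7},
]

n_unit = len(set(cluster_ids(rows, "unit")))
n_unit_subexp = len(set(cluster_ids(rows, "unit_subexp")))
print(f"clusters at unit level:        {n_unit}")
print(f"clusters at unit_subexp level: {n_unit_subexp}")
```

With cluster='unit', all three appearances of u4 share one cluster, so their correlated errors are accounted for; with cluster='unit_subexp' they count as three separate clusters.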

[ ]:

# Clustering comparison
r_unit = StackedDiD(kappa_pre=2, kappa_post=2, cluster='unit').fit(
    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')
r_subexp = StackedDiD(kappa_pre=2, kappa_post=2, cluster='unit_subexp').fit(
    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')

print("Clustering Comparison")
print("=" * 50)
print(f"{'Cluster Level':<20} {'ATT':>10} {'SE':>10}")
print("-" * 50)
print(f"{'unit':<20} {r_unit.overall_att:>10.3f} {r_unit.overall_se:>10.3f}")
print(f"{'unit_subexp':<20} {r_subexp.overall_att:>10.3f} {r_subexp.overall_se:>10.3f}")
print()
print("Point estimates are identical; SEs differ due to clustering level.")

Summary#

Feature	StackedDiD	CallawaySantAnna	ImputationDiD
Approach	Stack sub-experiments, pooled WLS	Group-time ATT(g,t) aggregation	Impute Y(0) via FE model
Framework	Regression (event-study)	Nonparametric	Regression (imputation)
Event window	Explicit (kappa_pre, kappa_post)	Implicit (all periods)	Implicit (all periods)
Group effects	No (pooled regression)	Yes	Yes
Control group	Configurable (3 options)	Never-treated or not-yet-treated	All untreated
Inspectable data	Yes (stacked_data)	No	Yes (treatment_effects)
Best for	Regression intuition, transparency	Heterogeneous effects	Maximum efficiency

Reference: Wing, C., Freedman, S. M., & Hollingsworth, A. (2024). Stacked Difference-in-Differences. NBER Working Paper 32054.