Interactive notebook

This tutorial is a Jupyter notebook. You can view it on GitHub or download it to run locally.

Staggered Difference-in-Differences#

This notebook demonstrates how to handle staggered treatment adoption using modern DiD estimators. In staggered DiD settings:

  • Different units get treated at different times

  • Traditional TWFE can give biased estimates due to “forbidden comparisons”

  • Modern estimators compute group-time specific effects and aggregate them properly

We’ll cover:

  1. Understanding staggered adoption

  2. The problem with TWFE (and Goodman-Bacon decomposition)

  3. The Callaway-Sant’Anna estimator

  4. Group-time effects ATT(g,t)

  5. Aggregating effects (simple, group, event-study)

  6. Bootstrap inference for valid standard errors

  7. Visualization

  8. Pre-treatment effects and parallel trends testing

  9. Different control group options

  10. Handling anticipation effects

  11. Adding covariates

  12. Comparing with MultiPeriodDiD

  13. Sun-Abraham interaction-weighted estimator

  14. Comparing CS and SA as a robustness check

[ ]:
import numpy as np
import pandas as pd
from diff_diff import CallawaySantAnna, SunAbraham, MultiPeriodDiD
from diff_diff.visualization import plot_event_study, plot_group_effects

# For nicer plots (optional)
try:
    import matplotlib.pyplot as plt
    plt.style.use('seaborn-v0_8-whitegrid')
    HAS_MATPLOTLIB = True
except ImportError:
    HAS_MATPLOTLIB = False
    print("matplotlib not installed - visualization examples will be skipped")

1. Understanding Staggered Adoption#

In a staggered adoption design, units adopt treatment at different times. We call the period when a unit first receives treatment its cohort or group.

[ ]:
# Generate staggered adoption data using the library function
from diff_diff import generate_staggered_data

# Generate data with 100 units, 8 periods, two treatment cohorts (periods 3 and 5),
# and 40% never-treated
df = generate_staggered_data(
    n_units=100,
    n_periods=8,
    cohort_periods=[3, 5],  # Treatment cohorts at periods 3 and 5
    never_treated_frac=0.4,
    treatment_effect=2.0,
    dynamic_effects=True,
    effect_growth=0.5,  # Effect grows 0.5 per period
    unit_fe_sd=2.0,
    noise_sd=0.5,
    seed=42
)

# The DGP returns 'first_treat' column: 0 = never-treated, >0 = first treatment period

print(f"Dataset: {len(df)} observations, {df['unit'].nunique()} units, {df['period'].nunique()} periods")
df.head(10)
[ ]:
# Examine treatment timing
cohort_summary = df.groupby('unit').agg({'first_treat': 'first', 'treated': 'sum'}).reset_index()
print("Treatment cohorts:")
print(cohort_summary.groupby('first_treat').size())

print("\nTreatment adoption over time:")
print(df.groupby('period')['treated'].mean().round(3))

2. The Problem with TWFE in Staggered Settings#

Traditional Two-Way Fixed Effects (TWFE) can give biased estimates because:

  • It uses already-treated units as controls for newly-treated units

  • With heterogeneous treatment effects, this leads to “negative weighting”

Let’s see what TWFE would give us:

[ ]:
from diff_diff import TwoWayFixedEffects

# TWFE estimation (potentially biased with heterogeneous effects)
twfe = TwoWayFixedEffects()
results_twfe = twfe.fit(
    df,
    outcome="outcome",
    treatment="treated",
    unit="unit",
    time="period"
)

print("TWFE Estimate (potentially biased):")
print(f"ATT: {results_twfe.att:.4f}")

Understanding Why TWFE Fails: Goodman-Bacon Decomposition#

The Goodman-Bacon (2021) decomposition reveals exactly why TWFE can be biased. It shows that the TWFE estimate is a weighted average of all possible 2x2 DiD comparisons, including problematic “forbidden comparisons” where already-treated units are used as controls.

There are three types of comparisons:

  1. Treated vs Never-treated (green): Clean comparisons using never-treated units

  2. Earlier vs Later treated (blue): Uses later-treated as controls before they’re treated

  3. Later vs Earlier treated (red): Uses already-treated as controls — the “forbidden comparisons”

When treatment effects are heterogeneous (as in our data where effects grow over time), the forbidden comparisons can bias the TWFE estimate.

[ ]:
from diff_diff import bacon_decompose, plot_bacon

# Perform the Goodman-Bacon decomposition
bacon_results = bacon_decompose(
    df,
    outcome='outcome',
    unit='unit',
    time='period',
    first_treat='first_treat'  # 0 means never-treated
)

# View the decomposition summary
bacon_results.print_summary()
[ ]:
# Visualize the decomposition
if HAS_MATPLOTLIB:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Scatter plot: shows each 2x2 comparison
    plot_bacon(bacon_results, ax=axes[0], plot_type='scatter', show=False)

    # Bar chart: shows total weight by comparison type
    plot_bacon(bacon_results, ax=axes[1], plot_type='bar', show=False)

    plt.tight_layout()
    plt.show()

    # Interpret the results
    forbidden_weight = bacon_results.total_weight_later_vs_earlier
    print(f"\n⚠️  {forbidden_weight:.1%} of the TWFE weight comes from 'forbidden comparisons'")
    print("   where already-treated units are used as controls.")
    print("\n→ This explains why TWFE can be biased. Use Callaway-Sant'Anna instead!")

3. Callaway-Sant’Anna Estimator#

The CS estimator avoids these problems by:

  1. Computing separate effects for each (group, time) pair: ATT(g,t)

  2. Only using not-yet-treated or never-treated units as controls

  3. Properly aggregating these effects

[ ]:
# Callaway-Sant'Anna estimation
cs = CallawaySantAnna(
    control_group="never_treated",  # Use never-treated as controls
    anticipation=0  # No anticipation effects
)

results_cs = cs.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat",  # Column with first treatment period (0 = never treated)
    aggregate="all"  # Compute all aggregations (simple, event_study, group)
)

print(results_cs.summary())

4. Group-Time Effects ATT(g,t)#

The CS estimator computes separate effects for each combination of:

  • g: Treatment cohort (when the group was first treated)

  • t: Calendar time period

[ ]:
# View all group-time effects
print("Group-Time Effects ATT(g,t):")
print("=" * 60)

for (g, t), data in results_cs.group_time_effects.items():
    sig = "*" if data['p_value'] < 0.05 else ""
    print(f"ATT({g},{t}): {data['effect']:>7.4f} "
          f"(SE: {data['se']:.4f}, p: {data['p_value']:.3f}) {sig}")
[ ]:
# Convert to DataFrame for easier analysis
gt_df = results_cs.to_dataframe()
print("\nGroup-time effects as DataFrame:")
gt_df

5. Aggregating Effects#

We often want to summarize the group-time effects into a single number or event-study style estimates.

[ ]:
# Simple aggregation: weighted average across all (g,t)
# This is computed automatically and stored in overall_att/overall_se
print("Simple Aggregation (Overall ATT):")
print(f"ATT: {results_cs.overall_att:.4f}")
print(f"SE: {results_cs.overall_se:.4f}")
print(f"95% CI: [{results_cs.overall_conf_int[0]:.4f}, {results_cs.overall_conf_int[1]:.4f}]")
[ ]:
# Group aggregation: average effect by cohort
# Requires aggregate="group" or "all" in fit()
print("\nGroup Aggregation (ATT by cohort):")
for cohort, effects in results_cs.group_effects.items():
    print(f"Cohort {cohort}: ATT = {effects['effect']:.4f} (SE: {effects['se']:.4f})")
[ ]:
# Event-study aggregation: average effect by time relative to treatment
# Requires aggregate="event_study" or "all" in fit()
print("\nEvent-Study Aggregation (ATT by event time):")
print(f"{'Event Time':>12} {'ATT':>10} {'SE':>10} {'95% CI':>25}")
print("-" * 60)

for event_time in sorted(results_cs.event_study_effects.keys()):
    effects = results_cs.event_study_effects[event_time]
    ci = effects['conf_int']
    print(f"{event_time:>12} {effects['effect']:>10.4f} {effects['se']:>10.4f} "
          f"[{ci[0]:>8.4f}, {ci[1]:>8.4f}]")

6. Bootstrap Inference#

With few clusters or when analytical standard errors may be unreliable, the multiplier bootstrap provides valid inference. This implements the approach from Callaway & Sant’Anna (2021), perturbing unit-level influence functions.

Why use bootstrap?

  • Analytical SEs may understate uncertainty with few clusters

  • Bootstrap provides finite-sample valid confidence intervals

  • P-values are computed from the bootstrap distribution

Weight types:

  • 'rademacher' - Default, ±1 with p=0.5, good for most cases

  • 'mammen' - Two-point distribution, matches first 3 moments

  • 'webb' - Six-point distribution, recommended for very few clusters (<10)

[ ]:
# Callaway-Sant'Anna with bootstrap inference
cs_boot = CallawaySantAnna(
    control_group="never_treated",
    n_bootstrap=499,              # Number of bootstrap iterations
    bootstrap_weights='rademacher',  # or 'mammen', 'webb'
    seed=42                        # For reproducibility
)

results_boot = cs_boot.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat",    # Column with first treatment period
    aggregate="event_study"       # Compute event study aggregation
)

# Access bootstrap results
print("Bootstrap Inference Results:")
print("=" * 60)
print(f"\nOverall ATT: {results_boot.overall_att:.4f}")
print(f"Bootstrap SE: {results_boot.bootstrap_results.overall_att_se:.4f}")
print(f"Bootstrap 95% CI: [{results_boot.bootstrap_results.overall_att_ci[0]:.4f}, "
      f"{results_boot.bootstrap_results.overall_att_ci[1]:.4f}]")
print(f"Bootstrap p-value: {results_boot.bootstrap_results.overall_att_p_value:.4f}")
[ ]:
# Event study with bootstrap confidence intervals
print("\nEvent Study with Bootstrap Inference:")
print(f"{'Event Time':>12} {'ATT':>10} {'Boot SE':>10} {'Boot 95% CI':>25} {'p-value':>10}")
print("-" * 70)

event_ses = results_boot.bootstrap_results.event_study_ses
event_cis = results_boot.bootstrap_results.event_study_cis
event_pvals = results_boot.bootstrap_results.event_study_p_values

for event_time in sorted(event_ses.keys()):
    att = results_boot.event_study_effects[event_time]['effect']
    se = event_ses[event_time]
    ci = event_cis[event_time]
    pval = event_pvals[event_time]
    sig = "*" if pval < 0.05 else ""
    print(f"{event_time:>12} {att:>10.4f} {se:>10.4f} [{ci[0]:>8.4f}, {ci[1]:>8.4f}] {pval:>10.4f} {sig}")

7. Visualization#

Event-study plots are the standard way to visualize DiD results with multiple periods.

[ ]:
if HAS_MATPLOTLIB:
    # Event study plot
    fig, ax = plt.subplots(figsize=(10, 6))
    plot_event_study(
        results=results_cs,
        ax=ax,
        title="Event Study: Effect of Treatment Over Time",
        xlabel="Periods Since Treatment",
        ylabel="ATT"
    )
    plt.tight_layout()
    plt.show()
else:
    print("Install matplotlib to see visualizations: pip install matplotlib")
[ ]:
if HAS_MATPLOTLIB:
    # Plot effects by cohort
    fig, ax = plt.subplots(figsize=(10, 6))
    plot_group_effects(
        results=results_cs,
        ax=ax,
        title="Treatment Effects by Cohort"
    )
    plt.tight_layout()
    plt.show()

9. Different Control Group Options#

The CS estimator supports different control group specifications:

  • "never_treated": Only use units that are never treated

  • "not_yet_treated": Use units that haven’t been treated yet at time t

[ ]:
# Using not-yet-treated as control
cs_nyt = CallawaySantAnna(
    control_group="not_yet_treated"
)

results_nyt = cs_nyt.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat"
)

# Compare using overall_att/overall_se attributes
print("Comparison of control group specifications:")
print(f"{'Control Group':<20} {'ATT':>10} {'SE':>10}")
print("-" * 40)
print(f"{'Never-treated':<20} {results_cs.overall_att:>10.4f} {results_cs.overall_se:>10.4f}")
print(f"{'Not-yet-treated':<20} {results_nyt.overall_att:>10.4f} {results_nyt.overall_se:>10.4f}")

10. Handling Anticipation Effects#

If units start changing behavior before official treatment (anticipation), you can specify the anticipation period.

[ ]:
# Allow for 1 period of anticipation
cs_antic = CallawaySantAnna(
    control_group="never_treated",
    anticipation=1  # Treatment effects may start 1 period early
)

results_antic = cs_antic.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat"
)

print(f"With anticipation=1: ATT = {results_antic.overall_att:.4f}")

11. Adding Covariates#

You can include covariates to improve precision through outcome regression or propensity score methods.

[ ]:
# Add covariates to data
df['size'] = np.random.normal(100, 20, len(df))
df['age'] = np.random.normal(10, 3, len(df))

# Fit with covariates
cs_cov = CallawaySantAnna(
    control_group="never_treated"
)

results_cov = cs_cov.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat",
    covariates=["size", "age"]
)

print(f"With covariates: ATT = {results_cov.overall_att:.4f} (SE: {results_cov.overall_se:.4f})")

12. Comparing with MultiPeriodDiD#

For comparison, here’s how you would use MultiPeriodDiD which estimates period-specific effects.

Important: MultiPeriodDiD assumes simultaneous treatment timing (all treated units get treated at the same time). For staggered adoption, always use CallawaySantAnna or SunAbraham instead.

To demonstrate MultiPeriodDiD properly, we’ll create a simple dataset where all treated units receive treatment at the same time.

[ ]:
# Create a simple dataset with simultaneous treatment timing
# This is the appropriate data structure for MultiPeriodDiD
from diff_diff import generate_did_data

# Generate data with simultaneous treatment at period 4
mp_data = generate_did_data(
    n_units=100,
    n_periods=8,
    treatment_period=4,  # All treated units get treatment at period 4
    treatment_fraction=0.5,
    treatment_effect=2.5,
    seed=42
)

print(f"MultiPeriodDiD dataset: {len(mp_data)} obs")
print(f"Treatment starts at period 4 for all treated units")

mp_did = MultiPeriodDiD()
results_mp = mp_did.fit(
    mp_data,
    outcome="outcome",
    treatment="treated",
    time="period",
    post_periods=[4, 5, 6, 7]
)

print(results_mp.summary())
[ ]:
# Period-specific effects from MultiPeriodDiD
print("\nPeriod-specific effects:")
for period, pe in results_mp.period_effects.items():
    print(f"Period {period}: {pe.effect:.4f} (SE: {pe.se:.4f})")

13. Sun-Abraham Interaction-Weighted Estimator#

The Sun-Abraham (2021) estimator provides an alternative approach to staggered DiD. While Callaway-Sant’Anna aggregates 2x2 DiD comparisons, Sun-Abraham uses an interaction-weighted regression approach:

  1. Run a saturated regression with cohort × relative-time indicators

  2. Weight cohort-specific effects by each cohort’s share of treated observations at each relative time

Key differences from CS:

  • Regression-based vs. 2x2 DiD aggregation

  • Different weighting scheme

  • More efficient under homogeneous effects

  • Consistent under heterogeneous effects (like CS)

When to use both: Running both CS and SA provides a useful robustness check. When they agree, results are more credible.

[ ]:
# Sun-Abraham estimation
sa = SunAbraham(
    control_group="never_treated",  # Use never-treated as controls
    anticipation=0                   # No anticipation effects
)

results_sa = sa.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat"  # Column with first treatment period (0 = never treated)
)

# View summary
results_sa.print_summary()
[ ]:
# Event study effects by relative time
print("Sun-Abraham Event Study Effects:")
print(f"{'Rel. Time':>12} {'Effect':>10} {'SE':>10} {'p-value':>10}")
print("-" * 45)

for rel_time in sorted(results_sa.event_study_effects.keys()):
    eff = results_sa.event_study_effects[rel_time]
    sig = "*" if eff['p_value'] < 0.05 else ""
    print(f"{rel_time:>12} {eff['effect']:>10.4f} {eff['se']:>10.4f} {eff['p_value']:>10.4f} {sig}")

# Cohort weights show how each cohort contributes to event-study estimates
print("\n\nCohort Weights by Relative Time:")
for rel_time in sorted(results_sa.cohort_weights.keys()):
    weights = results_sa.cohort_weights[rel_time]
    print(f"e={rel_time}: {weights}")

14. Comparing CS and SA as a Robustness Check#

Running both estimators provides a useful robustness check. When they agree, results are more credible.

Understanding Pre-Period Differences#

You may notice that post-treatment effects align closely between CS and SA, but pre-treatment effects can differ in magnitude and significance. This is expected methodological behavior, not a bug.

Why the difference?

  1. Callaway-Sant’Anna with ``base_period=”varying”`` (default):

    • Pre-treatment effects use consecutive period comparisons (period t vs period t-1)

    • Each pre-period coefficient represents a one-period change

    • These smaller incremental changes often yield lower t-statistics

  2. Sun-Abraham:

    • Uses a fixed reference period (e=-1 when anticipation=0, or e=-1-anticipation otherwise)

    • All coefficients are deviations from this single reference

    • Pre-period coefficients show cumulative difference from the reference

To make CS pre-periods more comparable to SA, use base_period="universal":

cs_universal = CallawaySantAnna(base_period="universal")

This makes CS compare all periods to g-1 (like SA), producing more similar pre-treatment estimates.

[ ]:
# Compare overall ATT from both estimators
cs_label = "Callaway-Sant'Anna (varying)"
print("Robustness Check: CS vs SA")
print("=" * 60)
print(f"{'Estimator':<30} {'Overall ATT':>12} {'SE':>10}")
print("-" * 60)
print(f"{cs_label:<30} {results_cs.overall_att:>12.4f} {results_cs.overall_se:>10.4f}")
print(f"{'Sun-Abraham':<30} {results_sa.overall_att:>12.4f} {results_sa.overall_se:>10.4f}")

# Also fit CS with universal base period for comparison
cs_universal = CallawaySantAnna(control_group="never_treated", base_period="universal")
results_cs_univ = cs_universal.fit(
    df, outcome="outcome", unit="unit",
    time="period", first_treat="first_treat",
    aggregate="event_study"
)

# Compare event study effects
print("\n\nEvent Study Comparison:")
print("Note: Pre-periods differ due to base period methodology (see explanation above)")
print(f"{'Rel. Time':>10} {'CS (vary)':>12} {'CS (univ)':>12} {'SA':>10} {'Note':>20}")
print("-" * 70)

for rel_time in sorted(results_sa.event_study_effects.keys()):
    sa_eff = results_sa.event_study_effects[rel_time]['effect']
    cs_vary = results_cs.event_study_effects.get(rel_time, {}).get('effect', np.nan)
    cs_univ = results_cs_univ.event_study_effects.get(rel_time, {}).get('effect', np.nan)

    note = "pre (differs)" if rel_time < 0 else "post (matches)"
    print(f"{rel_time:>10} {cs_vary:>12.4f} {cs_univ:>12.4f} {sa_eff:>10.4f} {note:>20}")

print("\nPost-treatment effects should be similar across all methods")
print("Pre-treatment differences are expected due to base period methodology")

Summary#

Key takeaways:

  1. TWFE can be biased with staggered adoption and heterogeneous effects

  2. Goodman-Bacon decomposition reveals why TWFE fails by showing:

    • The implicit 2x2 comparisons and their weights

    • How much weight falls on “forbidden comparisons” (already-treated as controls)

  3. Callaway-Sant’Anna properly handles staggered adoption by:

    • Computing group-time specific effects ATT(g,t)

    • Only using valid comparison groups

    • Properly aggregating effects

  4. Sun-Abraham provides an alternative approach using:

    • Interaction-weighted regression with cohort x relative-time indicators

    • Different weighting scheme than CS

    • More efficient under homogeneous effects

  5. Run both CS and SA as a robustness check—when they agree, results are more credible

  6. Aggregation options:

    • "simple": Overall ATT

    • "group": ATT by cohort

    • "event": ATT by event time (for event-study plots)

  7. Bootstrap inference provides valid standard errors and confidence intervals:

    • Use n_bootstrap parameter to enable multiplier bootstrap

    • Choose weight type: 'rademacher', 'mammen', or 'webb'

    • Bootstrap results include SEs, CIs, and p-values for all aggregations

  8. Pre-treatment effects provide parallel trends diagnostics:

    • Use base_period="varying" for consecutive period comparisons

    • Pre-treatment ATT(g,t) should be near zero

    • 95% CIs including zero is consistent with parallel trends

    • See Tutorial 07 for pre-trends power analysis (Roth 2022)

  9. Control group choices affect efficiency and assumptions:

    • "never_treated": Stronger parallel trends assumption

    • "not_yet_treated": Weaker assumption, uses more data

  10. CS vs SA pre-period differences are expected:

    • Post-treatment effects should be similar (robustness check)

    • Pre-treatment effects differ due to base period methodology

    • CS (varying): consecutive comparisons → one-period changes

    • SA: fixed reference (e=-1-anticipation) → cumulative deviations

    • Use base_period="universal" in CS for comparable pre-periods

For more details, see:

  • Callaway, B., & Sant’Anna, P. H. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics.

  • Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics.

  • Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics.