Interactive notebook

This tutorial is a Jupyter notebook. You can view it on GitHub or download it to run locally.

Staggered Difference-in-Differences#

This notebook demonstrates how to handle staggered treatment adoption using modern DiD estimators. In staggered DiD settings:

Different units get treated at different times
Traditional TWFE can give biased estimates due to “forbidden comparisons”
Modern estimators compute group-time specific effects and aggregate them properly

We’ll cover:

Understanding staggered adoption
The problem with TWFE (and Goodman-Bacon decomposition)
The Callaway-Sant’Anna estimator
Group-time effects ATT(g,t)
Aggregating effects (simple, group, event-study)
Bootstrap inference for valid standard errors
Visualization
Pre-treatment effects and parallel trends testing
Different control group options
Handling anticipation effects
Adding covariates
Comparing with MultiPeriodDiD
Sun-Abraham interaction-weighted estimator
Comparing CS and SA as a robustness check

[ ]:

import numpy as np
import pandas as pd
from diff_diff import CallawaySantAnna, SunAbraham, MultiPeriodDiD
from diff_diff.visualization import plot_event_study, plot_group_effects

# For nicer plots (optional)
try:
    import matplotlib.pyplot as plt
    plt.style.use('seaborn-v0_8-whitegrid')
    HAS_MATPLOTLIB = True
except ImportError:
    HAS_MATPLOTLIB = False
    print("matplotlib not installed - visualization examples will be skipped")

1. Understanding Staggered Adoption#

In a staggered adoption design, units adopt treatment at different times. We call the period when a unit first receives treatment its cohort or group.

[ ]:

# Generate staggered adoption data using the library function
from diff_diff import generate_staggered_data

# Generate data with 100 units, 8 periods, two treatment cohorts (periods 3 and 5),
# and 40% never-treated
df = generate_staggered_data(
    n_units=100,
    n_periods=8,
    cohort_periods=[3, 5],  # Treatment cohorts at periods 3 and 5
    never_treated_frac=0.4,
    treatment_effect=2.0,
    dynamic_effects=True,
    effect_growth=0.5,  # Effect grows 0.5 per period
    unit_fe_sd=2.0,
    noise_sd=0.5,
    seed=42
)

# The DGP returns 'first_treat' column: 0 = never-treated, >0 = first treatment period

print(f"Dataset: {len(df)} observations, {df['unit'].nunique()} units, {df['period'].nunique()} periods")
df.head(10)

[ ]:

# Examine treatment timing
cohort_summary = df.groupby('unit').agg({'first_treat': 'first', 'treated': 'sum'}).reset_index()
print("Treatment cohorts:")
print(cohort_summary.groupby('first_treat').size())

print("\nTreatment adoption over time:")
print(df.groupby('period')['treated'].mean().round(3))

2. The Problem with TWFE in Staggered Settings#

Traditional Two-Way Fixed Effects (TWFE) can give biased estimates because:

It uses already-treated units as controls for newly-treated units
With heterogeneous treatment effects, this leads to “negative weighting”

Let’s see what TWFE would give us:

[ ]:

from diff_diff import TwoWayFixedEffects

# TWFE estimation (potentially biased with heterogeneous effects)
twfe = TwoWayFixedEffects()
results_twfe = twfe.fit(
    df,
    outcome="outcome",
    treatment="treated",
    unit="unit",
    time="period"
)

print("TWFE Estimate (potentially biased):")
print(f"ATT: {results_twfe.att:.4f}")

Understanding Why TWFE Fails: Goodman-Bacon Decomposition#

The Goodman-Bacon (2021) decomposition reveals exactly why TWFE can be biased. It shows that the TWFE estimate is a weighted average of all possible 2x2 DiD comparisons, including problematic “forbidden comparisons” where already-treated units are used as controls.

There are three types of comparisons:

Treated vs Never-treated (green): Clean comparisons using never-treated units
Earlier vs Later treated (blue): Uses later-treated as controls before they’re treated
Later vs Earlier treated (red): Uses already-treated as controls — the “forbidden comparisons”

When treatment effects are heterogeneous (as in our data where effects grow over time), the forbidden comparisons can bias the TWFE estimate.

[ ]:

from diff_diff import bacon_decompose, plot_bacon

# Perform the Goodman-Bacon decomposition
bacon_results = bacon_decompose(
    df,
    outcome='outcome',
    unit='unit',
    time='period',
    first_treat='first_treat'  # 0 means never-treated
)

# View the decomposition summary
bacon_results.print_summary()

[ ]:

# Visualize the decomposition
if HAS_MATPLOTLIB:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Scatter plot: shows each 2x2 comparison
    plot_bacon(bacon_results, ax=axes[0], plot_type='scatter', show=False)

    # Bar chart: shows total weight by comparison type
    plot_bacon(bacon_results, ax=axes[1], plot_type='bar', show=False)

    plt.tight_layout()
    plt.show()

    # Interpret the results
    forbidden_weight = bacon_results.total_weight_later_vs_earlier
    print(f"\n⚠️  {forbidden_weight:.1%} of the TWFE weight comes from 'forbidden comparisons'")
    print("   where already-treated units are used as controls.")
    print("\n→ This explains why TWFE can be biased. Use Callaway-Sant'Anna instead!")

3. Callaway-Sant’Anna Estimator#

The CS estimator avoids these problems by:

Computing separate effects for each (group, time) pair: ATT(g,t)
Only using not-yet-treated or never-treated units as controls
Properly aggregating these effects

[ ]:

# Callaway-Sant'Anna estimation
cs = CallawaySantAnna(
    control_group="never_treated",  # Use never-treated as controls
    anticipation=0  # No anticipation effects
)

results_cs = cs.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat",  # Column with first treatment period (0 = never treated)
    aggregate="all"  # Compute all aggregations (simple, event_study, group)
)

print(results_cs.summary())

4. Group-Time Effects ATT(g,t)#

The CS estimator computes separate effects for each combination of:

g: Treatment cohort (when the group was first treated)
t: Calendar time period

[ ]:

# View all group-time effects
print("Group-Time Effects ATT(g,t):")
print("=" * 60)

for (g, t), data in results_cs.group_time_effects.items():
    sig = "*" if data['p_value'] < 0.05 else ""
    print(f"ATT({g},{t}): {data['effect']:>7.4f} "
          f"(SE: {data['se']:.4f}, p: {data['p_value']:.3f}) {sig}")

[ ]:

# Convert to DataFrame for easier analysis
gt_df = results_cs.to_dataframe()
print("\nGroup-time effects as DataFrame:")
gt_df

5. Aggregating Effects#

We often want to summarize the group-time effects into a single number or event-study style estimates.

[ ]:

# Simple aggregation: weighted average across all (g,t)
# This is computed automatically and stored in overall_att/overall_se
print("Simple Aggregation (Overall ATT):")
print(f"ATT: {results_cs.overall_att:.4f}")
print(f"SE: {results_cs.overall_se:.4f}")
print(f"95% CI: [{results_cs.overall_conf_int[0]:.4f}, {results_cs.overall_conf_int[1]:.4f}]")

[ ]:

# Group aggregation: average effect by cohort
# Requires aggregate="group" or "all" in fit()
print("\nGroup Aggregation (ATT by cohort):")
for cohort, effects in results_cs.group_effects.items():
    print(f"Cohort {cohort}: ATT = {effects['effect']:.4f} (SE: {effects['se']:.4f})")

[ ]:

# Event-study aggregation: average effect by time relative to treatment
# Requires aggregate="event_study" or "all" in fit()
print("\nEvent-Study Aggregation (ATT by event time):")
print(f"{'Event Time':>12} {'ATT':>10} {'SE':>10} {'95% CI':>25}")
print("-" * 60)

for event_time in sorted(results_cs.event_study_effects.keys()):
    effects = results_cs.event_study_effects[event_time]
    ci = effects['conf_int']
    print(f"{event_time:>12} {effects['effect']:>10.4f} {effects['se']:>10.4f} "
          f"[{ci[0]:>8.4f}, {ci[1]:>8.4f}]")

6. Bootstrap Inference#

With few clusters or when analytical standard errors may be unreliable, the multiplier bootstrap provides valid inference. This implements the approach from Callaway & Sant’Anna (2021), perturbing unit-level influence functions.

Why use bootstrap?

Analytical SEs may understate uncertainty with few clusters
Bootstrap provides finite-sample valid confidence intervals
P-values are computed from the bootstrap distribution

Weight types:

'rademacher' - Default, ±1 with p=0.5, good for most cases
'mammen' - Two-point distribution, matches first 3 moments
'webb' - Six-point distribution, recommended for very few clusters (<10)

[ ]:

# Callaway-Sant'Anna with bootstrap inference
cs_boot = CallawaySantAnna(
    control_group="never_treated",
    n_bootstrap=499,              # Number of bootstrap iterations
    bootstrap_weights='rademacher',  # or 'mammen', 'webb'
    seed=42                        # For reproducibility
)

results_boot = cs_boot.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat",    # Column with first treatment period
    aggregate="event_study"       # Compute event study aggregation
)

# Access bootstrap results
print("Bootstrap Inference Results:")
print("=" * 60)
print(f"\nOverall ATT: {results_boot.overall_att:.4f}")
print(f"Bootstrap SE: {results_boot.bootstrap_results.overall_att_se:.4f}")
print(f"Bootstrap 95% CI: [{results_boot.bootstrap_results.overall_att_ci[0]:.4f}, "
      f"{results_boot.bootstrap_results.overall_att_ci[1]:.4f}]")
print(f"Bootstrap p-value: {results_boot.bootstrap_results.overall_att_p_value:.4f}")

[ ]:

# Event study with bootstrap confidence intervals
print("\nEvent Study with Bootstrap Inference:")
print(f"{'Event Time':>12} {'ATT':>10} {'Boot SE':>10} {'Boot 95% CI':>25} {'p-value':>10}")
print("-" * 70)

event_ses = results_boot.bootstrap_results.event_study_ses
event_cis = results_boot.bootstrap_results.event_study_cis
event_pvals = results_boot.bootstrap_results.event_study_p_values

for event_time in sorted(event_ses.keys()):
    att = results_boot.event_study_effects[event_time]['effect']
    se = event_ses[event_time]
    ci = event_cis[event_time]
    pval = event_pvals[event_time]
    sig = "*" if pval < 0.05 else ""
    print(f"{event_time:>12} {att:>10.4f} {se:>10.4f} [{ci[0]:>8.4f}, {ci[1]:>8.4f}] {pval:>10.4f} {sig}")

7. Visualization#

Event-study plots are the standard way to visualize DiD results with multiple periods.

[ ]:

if HAS_MATPLOTLIB:
    # Event study plot
    fig, ax = plt.subplots(figsize=(10, 6))
    plot_event_study(
        results=results_cs,
        ax=ax,
        title="Event Study: Effect of Treatment Over Time",
        xlabel="Periods Since Treatment",
        ylabel="ATT"
    )
    plt.tight_layout()
    plt.show()
else:
    print("Install matplotlib to see visualizations: pip install matplotlib")

[ ]:

if HAS_MATPLOTLIB:
    # Plot effects by cohort
    fig, ax = plt.subplots(figsize=(10, 6))
    plot_group_effects(
        results=results_cs,
        ax=ax,
        title="Treatment Effects by Cohort"
    )
    plt.tight_layout()
    plt.show()

8. Pre-Treatment Effects and Parallel Trends Testing#

The Callaway-Sant’Anna estimator can compute pre-treatment effects ATT(g,t) for periods before treatment. These should be near zero if parallel trends holds.

The base_period parameter controls how the reference period is selected:

"varying" (default): For pre-treatment periods, compares t to t-1 (consecutive comparisons)
"universal": Always compares to g-1 (or g-anticipation-1 when anticipation > 0)

Both produce identical post-treatment effects; they differ only for pre-treatment diagnostics.

[ ]:

# CallawaySantAnna with explicit base_period for pre-treatment effects
cs_pretrends = CallawaySantAnna(
    control_group="never_treated",
    base_period="varying"  # Default: consecutive comparisons for pre-periods
)

results_pretrends = cs_pretrends.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat",
    aggregate="event_study"
)

# The base_period is recorded in results
print(f"Base period method: {results_pretrends.base_period}")

[ ]:

# Examine pre-treatment effects (event time < 0)
print("Pre-Treatment Effects (Parallel Trends Diagnostic):")
print("=" * 65)
print(f"{'Event Time':>12} {'ATT':>10} {'SE':>10} {'95% CI':>25} {'Test'}")
print("-" * 65)

pre_period_effects = []
for event_time in sorted(results_pretrends.event_study_effects.keys()):
    if event_time < 0:
        effects = results_pretrends.event_study_effects[event_time]
        ci = effects['conf_int']
        includes_zero = ci[0] <= 0 <= ci[1]
        marker = "Pass" if includes_zero else "Fail"
        pre_period_effects.append(effects['effect'])
        print(f"{event_time:>12} {effects['effect']:>10.4f} {effects['se']:>10.4f} "
              f"[{ci[0]:>8.4f}, {ci[1]:>8.4f}] {marker}")

if pre_period_effects:
    print(f"\n-> All pre-treatment effects should be close to zero")
    print(f"   Mean pre-treatment effect: {np.mean(pre_period_effects):.4f}")
else:
    print("No pre-treatment effects computed (insufficient pre-periods)")

Comparing Base Period Methods#

Let’s compare the two base period methods to understand their difference:

[ ]:

# Compare varying vs universal base period
cs_universal = CallawaySantAnna(
    control_group="never_treated",
    base_period="universal"  # Always use g-1 as base (g-anticipation-1 if anticipation > 0)
)

results_universal = cs_universal.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat",
    aggregate="event_study"
)

print("Pre-Treatment Effects: Varying vs Universal Base Period")
print("=" * 70)
print(f"{'Event Time':>12} {'Varying':>12} {'Universal':>12} {'Difference':>12}")
print("-" * 70)

for event_time in sorted(results_pretrends.event_study_effects.keys()):
    if event_time < 0:
        varying_eff = results_pretrends.event_study_effects[event_time]['effect']
        universal_eff = results_universal.event_study_effects.get(event_time, {}).get('effect', np.nan)
        diff = varying_eff - universal_eff if not np.isnan(universal_eff) else np.nan
        print(f"{event_time:>12} {varying_eff:>12.4f} {universal_eff:>12.4f} {diff:>12.4f}")

print("\nNote: 'Varying' uses consecutive period comparisons (t vs t-1)")
print("      'Universal' compares all periods to g-1 (g-anticipation-1 if anticipation > 0)")

Interpreting Pre-Treatment Effects#

What we’re testing:

Pre-treatment ATT(g,t) should be approximately zero if parallel trends holds
Significant non-zero pre-treatment effects suggest potential parallel trends violations

Key insights:

Visual inspection in the event study plot shows pre-period coefficients
Formal tests: 95% CIs including zero is consistent with parallel trends
Important caveat: A “passing” test doesn’t prove parallel trends—the test may lack power

When concerned about pre-trends:

Add covariates for precision (Section 11)
Use control_group="not_yet_treated" for more data (Section 9)
Apply Honest DiD sensitivity analysis to bound effects under violations (Tutorial 05)
Assess pre-trends test power using Tutorial 07

For comprehensive parallel trends testing: Tutorial 04 For pre-trends power analysis (Roth 2022): Tutorial 07

9. Different Control Group Options#

The CS estimator supports different control group specifications:

"never_treated": Only use units that are never treated
"not_yet_treated": Use units that haven’t been treated yet at time t

[ ]:

# Using not-yet-treated as control
cs_nyt = CallawaySantAnna(
    control_group="not_yet_treated"
)

results_nyt = cs_nyt.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat"
)

# Compare using overall_att/overall_se attributes
print("Comparison of control group specifications:")
print(f"{'Control Group':<20} {'ATT':>10} {'SE':>10}")
print("-" * 40)
print(f"{'Never-treated':<20} {results_cs.overall_att:>10.4f} {results_cs.overall_se:>10.4f}")
print(f"{'Not-yet-treated':<20} {results_nyt.overall_att:>10.4f} {results_nyt.overall_se:>10.4f}")

10. Handling Anticipation Effects#

If units start changing behavior before official treatment (anticipation), you can specify the anticipation period.

[ ]:

# Allow for 1 period of anticipation
cs_antic = CallawaySantAnna(
    control_group="never_treated",
    anticipation=1  # Treatment effects may start 1 period early
)

results_antic = cs_antic.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat"
)

print(f"With anticipation=1: ATT = {results_antic.overall_att:.4f}")

11. Adding Covariates#

You can include covariates to improve precision through outcome regression or propensity score methods.

[ ]:

# Add covariates to data
df['size'] = np.random.normal(100, 20, len(df))
df['age'] = np.random.normal(10, 3, len(df))

# Fit with covariates
cs_cov = CallawaySantAnna(
    control_group="never_treated"
)

results_cov = cs_cov.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat",
    covariates=["size", "age"]
)

print(f"With covariates: ATT = {results_cov.overall_att:.4f} (SE: {results_cov.overall_se:.4f})")

12. Comparing with MultiPeriodDiD#

For comparison, here’s how you would use MultiPeriodDiD which estimates period-specific effects.

Important: MultiPeriodDiD assumes simultaneous treatment timing (all treated units get treated at the same time). For staggered adoption, always use CallawaySantAnna or SunAbraham instead.

To demonstrate MultiPeriodDiD properly, we’ll create a simple dataset where all treated units receive treatment at the same time.

[ ]:

# Create a simple dataset with simultaneous treatment timing
# This is the appropriate data structure for MultiPeriodDiD
from diff_diff import generate_did_data

# Generate data with simultaneous treatment at period 4
mp_data = generate_did_data(
    n_units=100,
    n_periods=8,
    treatment_period=4,  # All treated units get treatment at period 4
    treatment_fraction=0.5,
    treatment_effect=2.5,
    seed=42
)

print(f"MultiPeriodDiD dataset: {len(mp_data)} obs")
print(f"Treatment starts at period 4 for all treated units")

mp_did = MultiPeriodDiD()
results_mp = mp_did.fit(
    mp_data,
    outcome="outcome",
    treatment="treated",
    time="period",
    post_periods=[4, 5, 6, 7]
)

print(results_mp.summary())

[ ]:

# Period-specific effects from MultiPeriodDiD
print("\nPeriod-specific effects:")
for period, pe in results_mp.period_effects.items():
    print(f"Period {period}: {pe.effect:.4f} (SE: {pe.se:.4f})")

13. Sun-Abraham Interaction-Weighted Estimator#

The Sun-Abraham (2021) estimator provides an alternative approach to staggered DiD. While Callaway-Sant’Anna aggregates 2x2 DiD comparisons, Sun-Abraham uses an interaction-weighted regression approach:

Run a saturated regression with cohort × relative-time indicators
Weight cohort-specific effects by each cohort’s share of treated observations at each relative time

Key differences from CS:

Regression-based vs. 2x2 DiD aggregation
Different weighting scheme
More efficient under homogeneous effects
Consistent under heterogeneous effects (like CS)

When to use both: Running both CS and SA provides a useful robustness check. When they agree, results are more credible.

[ ]:

# Sun-Abraham estimation
sa = SunAbraham(
    control_group="never_treated",  # Use never-treated as controls
    anticipation=0                   # No anticipation effects
)

results_sa = sa.fit(
    df,
    outcome="outcome",
    unit="unit",
    time="period",
    first_treat="first_treat"  # Column with first treatment period (0 = never treated)
)

# View summary
results_sa.print_summary()

[ ]:

# Event study effects by relative time
print("Sun-Abraham Event Study Effects:")
print(f"{'Rel. Time':>12} {'Effect':>10} {'SE':>10} {'p-value':>10}")
print("-" * 45)

for rel_time in sorted(results_sa.event_study_effects.keys()):
    eff = results_sa.event_study_effects[rel_time]
    sig = "*" if eff['p_value'] < 0.05 else ""
    print(f"{rel_time:>12} {eff['effect']:>10.4f} {eff['se']:>10.4f} {eff['p_value']:>10.4f} {sig}")

# Cohort weights show how each cohort contributes to event-study estimates
print("\n\nCohort Weights by Relative Time:")
for rel_time in sorted(results_sa.cohort_weights.keys()):
    weights = results_sa.cohort_weights[rel_time]
    print(f"e={rel_time}: {weights}")

14. Comparing CS and SA as a Robustness Check#

Running both estimators provides a useful robustness check. When they agree, results are more credible.

Understanding Pre-Period Differences#

You may notice that post-treatment effects align closely between CS and SA, but pre-treatment effects can differ in magnitude and significance. This is expected methodological behavior, not a bug.

Why the difference?

Callaway-Sant’Anna with ``base_period=”varying”`` (default):
- Pre-treatment effects use consecutive period comparisons (period t vs period t-1)
- Each pre-period coefficient represents a one-period change
- These smaller incremental changes often yield lower t-statistics
Sun-Abraham:
- Uses a fixed reference period (e=-1 when anticipation=0, or e=-1-anticipation otherwise)
- All coefficients are deviations from this single reference
- Pre-period coefficients show cumulative difference from the reference

To make CS pre-periods more comparable to SA, use base_period="universal":

cs_universal = CallawaySantAnna(base_period="universal")

This makes CS compare all periods to g-1 (like SA), producing more similar pre-treatment estimates.

[ ]:

# Compare overall ATT from both estimators
cs_label = "Callaway-Sant'Anna (varying)"
print("Robustness Check: CS vs SA")
print("=" * 60)
print(f"{'Estimator':<30} {'Overall ATT':>12} {'SE':>10}")
print("-" * 60)
print(f"{cs_label:<30} {results_cs.overall_att:>12.4f} {results_cs.overall_se:>10.4f}")
print(f"{'Sun-Abraham':<30} {results_sa.overall_att:>12.4f} {results_sa.overall_se:>10.4f}")

# Also fit CS with universal base period for comparison
cs_universal = CallawaySantAnna(control_group="never_treated", base_period="universal")
results_cs_univ = cs_universal.fit(
    df, outcome="outcome", unit="unit",
    time="period", first_treat="first_treat",
    aggregate="event_study"
)

# Compare event study effects
print("\n\nEvent Study Comparison:")
print("Note: Pre-periods differ due to base period methodology (see explanation above)")
print(f"{'Rel. Time':>10} {'CS (vary)':>12} {'CS (univ)':>12} {'SA':>10} {'Note':>20}")
print("-" * 70)

for rel_time in sorted(results_sa.event_study_effects.keys()):
    sa_eff = results_sa.event_study_effects[rel_time]['effect']
    cs_vary = results_cs.event_study_effects.get(rel_time, {}).get('effect', np.nan)
    cs_univ = results_cs_univ.event_study_effects.get(rel_time, {}).get('effect', np.nan)

    note = "pre (differs)" if rel_time < 0 else "post (matches)"
    print(f"{rel_time:>10} {cs_vary:>12.4f} {cs_univ:>12.4f} {sa_eff:>10.4f} {note:>20}")

print("\nPost-treatment effects should be similar across all methods")
print("Pre-treatment differences are expected due to base period methodology")

Summary#

Key takeaways:

TWFE can be biased with staggered adoption and heterogeneous effects
Goodman-Bacon decomposition reveals why TWFE fails by showing:
- The implicit 2x2 comparisons and their weights
- How much weight falls on “forbidden comparisons” (already-treated as controls)
Callaway-Sant’Anna properly handles staggered adoption by:
- Computing group-time specific effects ATT(g,t)
- Only using valid comparison groups
- Properly aggregating effects
Sun-Abraham provides an alternative approach using:
- Interaction-weighted regression with cohort x relative-time indicators
- Different weighting scheme than CS
- More efficient under homogeneous effects
Run both CS and SA as a robustness check—when they agree, results are more credible
Aggregation options:
- "simple": Overall ATT
- "group": ATT by cohort
- "event": ATT by event time (for event-study plots)
Bootstrap inference provides valid standard errors and confidence intervals:
- Use n_bootstrap parameter to enable multiplier bootstrap
- Choose weight type: 'rademacher', 'mammen', or 'webb'
- Bootstrap results include SEs, CIs, and p-values for all aggregations
Pre-treatment effects provide parallel trends diagnostics:
- Use base_period="varying" for consecutive period comparisons
- Pre-treatment ATT(g,t) should be near zero
- 95% CIs including zero is consistent with parallel trends
- See Tutorial 07 for pre-trends power analysis (Roth 2022)
Control group choices affect efficiency and assumptions:
- "never_treated": Stronger parallel trends assumption
- "not_yet_treated": Weaker assumption, uses more data
CS vs SA pre-period differences are expected:
- Post-treatment effects should be similar (robustness check)
- Pre-treatment effects differ due to base period methodology
- CS (varying): consecutive comparisons → one-period changes
- SA: fixed reference (e=-1-anticipation) → cumulative deviations
- Use base_period="universal" in CS for comparable pre-periods

For more details, see:

Callaway, B., & Sant’Anna, P. H. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics.
Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics.
Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics.