Two-Stage DiD (Gardner 2022)#

Two-stage residualization estimator for staggered Difference-in-Differences.

This module implements the methodology from Gardner (2022), “Two-stage differences in differences”. The method:

  1. Estimates unit + time fixed effects on untreated observations only

  2. Residualizes ALL outcomes using the estimated fixed effects

  3. Regresses residualized outcomes on treatment indicators (Stage 2)

Inference uses the GMM sandwich variance estimator from Butts & Gardner (2022) that correctly accounts for first-stage estimation uncertainty. Point estimates are identical to ImputationDiD (Borusyak et al. 2024); the key difference is the variance estimator (GMM sandwich vs. conservative).

When to use TwoStageDiD:

  • Staggered adoption settings where you want efficient point estimates with variance that accounts for first-stage estimation uncertainty

  • When you prefer the GMM sandwich variance over the conservative variance used by ImputationDiD — the sandwich estimator can yield tighter confidence intervals when first-stage uncertainty is small

  • As a robustness check alongside CallawaySantAnna and ImputationDiD: if all estimators agree, results are robust; if they disagree, investigate treatment effect heterogeneity

  • When you need an event study that is free of TWFE contamination bias

Reference: Gardner, J. (2022). Two-stage differences in differences. arXiv:2207.05943. Butts, K. & Gardner, J. (2022). did2s: Two-Stage Difference-in-Differences. R Journal, 14(1), 162-173.

TwoStageDiD#

Main estimator class for two-stage DiD estimation.

class diff_diff.TwoStageDiD[source]

Bases: TwoStageDiDBootstrapMixin

Gardner (2022) two-stage Difference-in-Differences estimator.

This estimator addresses TWFE bias under heterogeneous treatment effects by: 1. Estimating unit + time FEs on untreated observations only 2. Residualizing ALL outcomes using estimated FEs 3. Regressing residualized outcomes on treatment indicators

Point estimates are identical to ImputationDiD (Borusyak et al. 2024). The key difference is the variance estimator: TwoStageDiD uses a GMM sandwich variance that accounts for first-stage estimation uncertainty, while ImputationDiD uses the conservative variance from Theorem 3.

Parameters:
  • anticipation (int, default=0) – Number of periods before treatment where effects may occur.

  • alpha (float, default=0.05) – Significance level for confidence intervals.

  • cluster (str, optional) – Column name for cluster-robust standard errors. If None, clusters at the unit level by default.

  • n_bootstrap (int, default=0) – Number of bootstrap iterations. If 0, uses analytical GMM sandwich inference.

  • bootstrap_weights (str, default="rademacher") – Type of bootstrap weights: “rademacher”, “mammen”, or “webb”.

  • seed (int, optional) – Random seed for reproducibility.

  • rank_deficient_action (str, default="warn") – Action when design matrix is rank-deficient: - “warn”: Issue warning and drop linearly dependent columns - “error”: Raise ValueError - “silent”: Drop columns silently

  • horizon_max (int, optional) – Maximum event-study horizon. If set, event study effects are only computed for abs(h) <= horizon_max.

  • pretrends (bool, default=False) – If True, event study includes pre-treatment horizons for visual pre-trends assessment. Pre-period effects should be ~0 under parallel trends. Only affects event_study aggregation; overall ATT and group aggregation are unchanged.

results_

Estimation results after calling fit().

Type:

TwoStageDiDResults

is_fitted_

Whether the model has been fitted.

Type:

bool

Examples

Basic usage:

>>> from diff_diff import TwoStageDiD, generate_staggered_data
>>> data = generate_staggered_data(n_units=200, seed=42)
>>> est = TwoStageDiD()
>>> results = est.fit(data, outcome='outcome', unit='unit',
...                   time='period', first_treat='first_treat')
>>> results.print_summary()

With event study:

>>> est = TwoStageDiD()
>>> results = est.fit(data, outcome='outcome', unit='unit',
...                   time='period', first_treat='first_treat',
...                   aggregate='event_study')
>>> from diff_diff import plot_event_study
>>> plot_event_study(results)

Notes

The two-stage estimator uses ALL untreated observations (never-treated + not-yet-treated periods of eventually-treated units) to estimate the counterfactual model.

References

Gardner, J. (2022). Two-stage differences in differences.

arXiv:2207.05943.

Butts, K. & Gardner, J. (2022). did2s: Two-Stage

Difference-in-Differences. R Journal, 14(1), 162-173.

Methods

fit(data, outcome, unit, time, first_treat)

Fit the two-stage DiD estimator.

get_params()

Get estimator parameters (sklearn-compatible).

set_params(**params)

Set estimator parameters (sklearn-compatible).

__init__(anticipation=0, alpha=0.05, cluster=None, n_bootstrap=0, bootstrap_weights='rademacher', seed=None, rank_deficient_action='warn', horizon_max=None, pretrends=False)[source]
Parameters:
  • anticipation (int)

  • alpha (float)

  • cluster (str | None)

  • n_bootstrap (int)

  • bootstrap_weights (str)

  • seed (int | None)

  • rank_deficient_action (str)

  • horizon_max (int | None)

  • pretrends (bool)

alpha: float
n_bootstrap: int
bootstrap_weights: str
seed: int | None
horizon_max: int | None
results_: TwoStageDiDResults | None
fit(data, outcome, unit, time, first_treat, covariates=None, aggregate=None, balance_e=None, survey_design=None)[source]

Fit the two-stage DiD estimator.

Parameters:
  • data (pd.DataFrame) – Panel data with unit and time identifiers.

  • outcome (str) – Name of outcome variable column.

  • unit (str) – Name of unit identifier column.

  • time (str) – Name of time period column.

  • first_treat (str) – Name of column indicating when unit was first treated. Use 0 (or np.inf) for never-treated units.

  • covariates (list of str, optional) – List of covariate column names.

  • aggregate (str, optional) – Aggregation mode: None/”simple” (overall ATT only), “event_study”, “group”, or “all”.

  • balance_e (int, optional) – When computing event study, restrict to cohorts observed at all relative times in [-balance_e, max_h].

  • survey_design (SurveyDesign, optional) – Survey design specification for design-based inference. Supports pweight only (aweight/fweight raise ValueError). Supports strata, PSU, and FPC for design-based GMM sandwich variance. Strata enters survey df for t-distribution inference. Both analytical (n_bootstrap=0) and bootstrap inference are supported.

Returns:

Object containing all estimation results.

Return type:

TwoStageDiDResults

Raises:

ValueError – If required columns are missing or data validation fails.

get_params()[source]

Get estimator parameters (sklearn-compatible).

Return type:

Dict[str, Any]

set_params(**params)[source]

Set estimator parameters (sklearn-compatible).

Return type:

TwoStageDiD

summary()[source]

Get summary of estimation results.

Return type:

str

print_summary()[source]

Print summary to stdout.

Return type:

None

TwoStageDiDResults#

Results container for two-stage DiD estimation.

class diff_diff.TwoStageDiDResults[source]

Bases: object

Results from Gardner (2022) two-stage DiD estimation.

treatment_effects

Per-observation treatment effects with columns: unit, time, tau_hat, weight. tau_hat is the residualized outcome y_tilde for treated observations; weight is 1/n_treated.

Type:

pd.DataFrame

overall_att

Overall average treatment effect on the treated.

Type:

float

overall_se

Standard error of overall ATT (GMM sandwich).

Type:

float

overall_t_stat

T-statistic for overall ATT.

Type:

float

overall_p_value

P-value for overall ATT.

Type:

float

overall_conf_int

Confidence interval for overall ATT.

Type:

tuple

event_study_effects

Dictionary mapping relative time h to effect dict with keys: ‘effect’, ‘se’, ‘t_stat’, ‘p_value’, ‘conf_int’, ‘n_obs’.

Type:

dict, optional

group_effects

Dictionary mapping cohort g to effect dict.

Type:

dict, optional

groups

List of treatment cohorts.

Type:

list

time_periods

List of all time periods.

Type:

list

n_obs

Total number of observations.

Type:

int

n_treated_obs

Number of treated observations.

Type:

int

n_untreated_obs

Number of untreated observations.

Type:

int

n_treated_units

Number of ever-treated units.

Type:

int

n_control_units

Number of units contributing to untreated observations.

Type:

int

alpha

Significance level used.

Type:

float

bootstrap_results

Bootstrap inference results.

Type:

TwoStageBootstrapResults, optional

Methods

summary([alpha])

Generate formatted summary of estimation results.

print_summary([alpha])

Print summary to stdout.

to_dataframe([level])

Convert results to DataFrame.

treatment_effects: DataFrame
overall_att: float
overall_se: float
overall_t_stat: float
overall_p_value: float
overall_conf_int: Tuple[float, float]
event_study_effects: Dict[int, Dict[str, Any]] | None
group_effects: Dict[Any, Dict[str, Any]] | None
groups: List[Any]
time_periods: List[Any]
n_obs: int
n_treated_obs: int
n_untreated_obs: int
n_treated_units: int
n_control_units: int
alpha: float = 0.05
anticipation: int = 0
bootstrap_results: TwoStageBootstrapResults | None = None
survey_metadata: Any | None = None
property att: float
property se: float
property conf_int: Tuple[float, float]
property p_value: float
property t_stat: float
__repr__()[source]

Concise string representation.

Return type:

str

property coef_var: float

SE / abs(overall ATT). NaN when ATT is 0 or SE non-finite.

Type:

Coefficient of variation

summary(alpha=None)[source]

Generate formatted summary of estimation results.

Parameters:

alpha (float, optional) – Significance level. Defaults to alpha used in estimation.

Returns:

Formatted summary.

Return type:

str

print_summary(alpha=None)[source]

Print summary to stdout.

Parameters:

alpha (float | None)

Return type:

None

to_dataframe(level='event_study')[source]

Convert results to DataFrame.

Parameters:

level (str, default="event_study") – Level of aggregation: - “event_study”: Event study effects by relative time - “group”: Group (cohort) effects - “observation”: Per-observation treatment effects

Returns:

Results as DataFrame.

Return type:

pd.DataFrame

property is_significant: bool

Check if overall ATT is significant.

property significance_stars: str

Significance stars for overall ATT.

__init__(treatment_effects, overall_att, overall_se, overall_t_stat, overall_p_value, overall_conf_int, event_study_effects, group_effects, groups, time_periods, n_obs, n_treated_obs, n_untreated_obs, n_treated_units, n_control_units, alpha=0.05, anticipation=0, bootstrap_results=None, survey_metadata=None)
Parameters:
Return type:

None

TwoStageBootstrapResults#

Bootstrap inference results.

class diff_diff.TwoStageBootstrapResults[source]

Bases: object

Results from TwoStageDiD bootstrap inference.

Bootstrap uses multiplier bootstrap on the GMM influence function, consistent with other library estimators. The R did2s package uses block bootstrap by default; multiplier bootstrap is asymptotically equivalent.

n_bootstrap

Number of bootstrap iterations.

Type:

int

weight_type

Type of bootstrap weights: “rademacher”, “mammen”, or “webb”.

Type:

str

alpha

Significance level used for confidence intervals.

Type:

float

overall_att_se

Bootstrap standard error for overall ATT.

Type:

float

overall_att_ci

Bootstrap confidence interval for overall ATT.

Type:

tuple

overall_att_p_value

Bootstrap p-value for overall ATT.

Type:

float

event_study_ses

Bootstrap SEs for event study effects.

Type:

dict, optional

event_study_cis

Bootstrap CIs for event study effects.

Type:

dict, optional

event_study_p_values

Bootstrap p-values for event study effects.

Type:

dict, optional

group_ses

Bootstrap SEs for group effects.

Type:

dict, optional

group_cis

Bootstrap CIs for group effects.

Type:

dict, optional

group_p_values

Bootstrap p-values for group effects.

Type:

dict, optional

bootstrap_distribution

Full bootstrap distribution of overall ATT.

Type:

np.ndarray, optional

n_bootstrap: int
weight_type: str
alpha: float
overall_att_se: float
overall_att_ci: Tuple[float, float]
overall_att_p_value: float
event_study_ses: Dict[int, float] | None = None
event_study_cis: Dict[int, Tuple[float, float]] | None = None
event_study_p_values: Dict[int, float] | None = None
group_ses: Dict[Any, float] | None = None
group_cis: Dict[Any, Tuple[float, float]] | None = None
group_p_values: Dict[Any, float] | None = None
bootstrap_distribution: ndarray | None = None
__init__(n_bootstrap, weight_type, alpha, overall_att_se, overall_att_ci, overall_att_p_value, event_study_ses=None, event_study_cis=None, event_study_p_values=None, group_ses=None, group_cis=None, group_p_values=None, bootstrap_distribution=None)
Parameters:
Return type:

None

Convenience Function#

diff_diff.two_stage_did(data, outcome, unit, time, first_treat, covariates=None, aggregate=None, balance_e=None, survey_design=None, **kwargs)[source]#

Convenience function for two-stage DiD estimation.

This is a shortcut for creating a TwoStageDiD estimator and calling fit().

Parameters:
  • data (pd.DataFrame) – Panel data.

  • outcome (str) – Outcome variable column name.

  • unit (str) – Unit identifier column name.

  • time (str) – Time period column name.

  • first_treat (str) – Column indicating first treatment period (0 for never-treated).

  • covariates (list of str, optional) – Covariate column names.

  • aggregate (str, optional) – Aggregation mode: None, “simple”, “event_study”, “group”, “all”.

  • balance_e (int, optional) – Balance event study to cohorts observed at all relative times.

  • survey_design (SurveyDesign, optional) – Survey design specification for design-based inference. Supports pweight only (aweight/fweight raise ValueError). Supports strata, PSU, and FPC for design-based GMM sandwich variance. Strata enters survey df for t-distribution inference. Both analytical (n_bootstrap=0) and bootstrap inference are supported.

  • **kwargs – Additional keyword arguments passed to TwoStageDiD constructor.

Returns:

Estimation results.

Return type:

TwoStageDiDResults

Examples

>>> from diff_diff import two_stage_did, generate_staggered_data
>>> data = generate_staggered_data(seed=42)
>>> results = two_stage_did(data, 'outcome', 'unit', 'period',
...                         'first_treat', aggregate='event_study')
>>> results.print_summary()

Example Usage#

Basic usage:

from diff_diff import TwoStageDiD, generate_staggered_data

data = generate_staggered_data(n_units=200, seed=42)
est = TwoStageDiD()
results = est.fit(data, outcome='outcome', unit='unit',
                  time='period', first_treat='first_treat')
results.print_summary()

Event study with visualization:

from diff_diff import TwoStageDiD, plot_event_study

est = TwoStageDiD()
results = est.fit(data, outcome='outcome', unit='unit',
                  time='period', first_treat='first_treat',
                  aggregate='event_study')
plot_event_study(results)

Comparison with other estimators:

from diff_diff import TwoStageDiD, CallawaySantAnna, ImputationDiD

# All three should agree under homogeneous effects
ts = TwoStageDiD().fit(data, outcome='outcome', unit='unit',
                       time='period', first_treat='first_treat')
cs = CallawaySantAnna().fit(data, outcome='outcome', unit='unit',
                            time='period', first_treat='first_treat')
imp = ImputationDiD().fit(data, outcome='outcome', unit='unit',
                          time='period', first_treat='first_treat')

print(f"Two-Stage ATT: {ts.overall_att:.3f} (SE: {ts.overall_se:.3f})")
print(f"CS ATT:        {cs.overall_att:.3f} (SE: {cs.overall_se:.3f})")
print(f"Imputation ATT:{imp.overall_att:.3f} (SE: {imp.overall_se:.3f})")

Estimator Comparison#

TwoStageDiD vs. CallawaySantAnna vs. ImputationDiD#

Feature

TwoStageDiD

CallawaySantAnna

ImputationDiD

Point estimates

Identical to ImputationDiD

Group-time ATT(g,t)

Identical to TwoStageDiD

Variance estimator

GMM sandwich (accounts for first-stage uncertainty)

Analytical IF/WIF or multiplier bootstrap

Conservative (Theorem 3)

Control group

Never-treated + not-yet-treated

Never-treated or not-yet-treated

Never-treated + not-yet-treated

Efficiency

High (uses all untreated obs)

Lower (2x2 comparisons)

High (uses all untreated obs)

Heterogeneous effects

Consistent under homogeneity

Robust to heterogeneity

Consistent under homogeneity

Covariates

Supported

Supported (outcome regression or IPW)

Supported

Bootstrap

Multiplier bootstrap on GMM influence function

Multiplier bootstrap (IF/WIF)

Multiplier bootstrap