Stacked Difference-in-Differences

Stacked DiD estimator for staggered adoption designs with corrective Q-weights.

This module implements the methodology from Wing, Freedman & Hollingsworth (2024), which addresses bias in naive stacked DiD regressions by:

1. Constructing sub-experiments: One per adoption cohort with clean controls
2. Applying corrective Q-weights: Ensures proper weighting of treatment and control group trends across sub-experiments
3. Running weighted event-study regression: WLS with Q-weights identifies the “trimmed aggregate ATT”
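The first step above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names `is_clean_control` and `build_stack` are hypothetical, the clean-control rules follow the `clean_control` options documented for the estimator, and the IC1/IC2 trimming of adoption events is omitted.

```python
import math


def is_clean_control(adopt, cohort, kappa_pre, kappa_post, rule="not_yet_treated"):
    """Return True if a unit adopting at `adopt` is a clean control for `cohort`."""
    if rule == "not_yet_treated":
        return adopt > cohort + kappa_post
    if rule == "strict":
        return adopt > cohort + kappa_post + kappa_pre
    if rule == "never_treated":
        return math.isinf(adopt)
    raise ValueError(f"unknown clean-control rule: {rule!r}")


def build_stack(first_treat, periods, kappa_pre, kappa_post, rule="not_yet_treated"):
    """Stack one sub-experiment per adoption cohort (no IC1/IC2 trimming).

    first_treat maps unit -> adoption period (math.inf for never-treated).
    Returns rows of (sub_exp, unit, period, event_time, treated).
    """
    rows = []
    cohorts = sorted({a for a in first_treat.values() if not math.isinf(a)})
    for a in cohorts:
        # Calendar periods inside the event window [-kappa_pre, ..., kappa_post].
        window = [t for t in periods if a - kappa_pre <= t <= a + kappa_post]
        for unit, adopt in first_treat.items():
            treated = adopt == a
            if not treated and not is_clean_control(adopt, a, kappa_pre, kappa_post, rule):
                continue  # neither treated in this cohort nor a clean control
            for t in window:
                rows.append((a, unit, t, t - a, int(treated)))
    return rows


# Toy panel: two adoption cohorts (periods 4 and 6) and one never-treated unit.
first_treat = {"u1": 4, "u2": 6, "u3": math.inf}
stack = build_stack(first_treat, periods=range(1, 9), kappa_pre=1, kappa_post=1)
print(len(stack))
```

In this toy panel the never-treated unit `u3` appears in both sub-experiments, which is the unit replication that motivates clustering on the original unit identifier.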

When to use Stacked DiD:

- You have a staggered adoption design with multiple treatment cohorts.
- You want an intuitive sub-experiment-based approach (vs. aggregation methods).
- You want compositional balance: treatment group composition is fixed across event times.
- You need direct access to the stacked dataset for custom analysis.

Reference: Wing, C., Freedman, S. M., & Hollingsworth, A. (2024). Stacked Difference-in-Differences. NBER Working Paper 32054. http://www.nber.org/papers/w32054

StackedDiD

Main estimator class for Stacked Difference-in-Differences.

class diff_diff.StackedDiD[source]

Bases: object

Stacked Difference-in-Differences estimator.

Implements Wing, Freedman & Hollingsworth (2024). Builds a stacked dataset of sub-experiments (one per adoption cohort), applies corrective Q-weights to address implicit weighting bias in naive stacked regressions, and runs a weighted event-study regression.

Parameters:

kappa_pre (int, default=1) – Number of pre-treatment event-time periods in the event window. The event window spans [-kappa_pre, …, kappa_post].
kappa_post (int, default=1) – Number of post-treatment event-time periods.
weighting (str, default="aggregate") – Target estimand weighting scheme per Table 1 of the paper:
- "aggregate": Equal weight per adoption event (trimmed aggregate ATT)
- "population": Weight by population size of treated cohort
- "sample_share": Weight by sample share of each sub-experiment
clean_control (str, default="not_yet_treated") – How to define clean controls per Appendix A of the paper:
- "not_yet_treated": Units with A_s > a + kappa_post
- "strict": Units with A_s > a + kappa_post + kappa_pre
- "never_treated": Only units with A_s = infinity
cluster (str, default="unit") – Clustering level for standard errors:
- "unit": Cluster on original unit identifier
- "unit_subexp": Cluster on (unit, sub_experiment) pairs
alpha (float, default=0.05) – Significance level for confidence intervals.
anticipation (int, default=0) – Number of anticipation periods. When anticipation > 0:
- Reference period shifts from e=-1 to e=-1-anticipation
- Post-treatment includes anticipation periods (e >= -anticipation)
- Event window expands by anticipation pre-periods
Consistent with ImputationDiD, TwoStageDiD, SunAbraham.
rank_deficient_action (str, default="warn") – Action when design matrix is rank-deficient:
- "warn": Issue warning and drop linearly dependent columns
- "error": Raise ValueError
- "silent": Drop columns silently
vcov_type ({"classical","hc1","hc2","hc2_bm"}, default="hc1") –
Analytical variance family for the stacked WLS regression. StackedDiD is intrinsically clustered (cluster is required, no cluster=None opt-out), so one-way families that don’t compose with cluster_ids are rejected at __init__:
- "hc1" (default): CR1 Liang-Zeger cluster-robust on the Q-weighted design via solve_ols(weights=composed_weights, vcov_type="hc1"). Bit-equal to the prior bake-Q-into-X output up to float64 multiplication ordering at machine precision (HC1 WLS sandwich is algebraically invariant between the two forms). Matches clubSandwich::vcovCR(lm(weights=Q,...), cluster=~unit, type="CR1S") at atol=1e-10 (target is CR1S — Stata-style G/(G-1) * (n-1)/(n-p) finite-sample correction — NOT plain CR1 which omits the (n-1)/(n-p) factor and would diverge by ~1.4%).
- "hc2_bm": CR2 Bell-McCaffrey via solve_ols(weights=composed_weights, vcov_type="hc2_bm"), routed through the clubSandwich WLS-CR2 port (matches clubSandwich::vcovCR(lm(weights=Q,...), cluster=~unit, type="CR2") + coef_test()$df_Satt at atol=1e-10). See REGISTRY.md Phase 1a hc2_bm + weights row for the algebra (W not √W in hat matrix, W² in bias term, unweighted residuals in score).
- "classical" and "hc2" are REJECTED at __init__ with a cluster-incompatibility ValueError: StackedDiD requires a cluster structure, so one-way families don’t compose with the linalg validator. Use "hc1" or "hc2_bm".
- "conley" is REJECTED at __init__ for a methodology reason (NOT plumbing): the stacked design replicates units across sub-experiments, so Conley would see same-unit copies at distance 0; no conleyreg anchor; paper-gated. Tracked in TODO.md.
Survey-design precedence: when survey_design= is supplied to fit() with vcov_type != "hc1", a NotImplementedError is raised — the survey Taylor-series linearization (or replicate-weight refit) variance overrides the analytical sandwich. Use the default vcov_type="hc1" for survey designs.
balance ({"none", "entropy"}, default="none") – Within-sub-experiment covariate balancing (Covariate-Balanced Weighted Stacked DID; Ustyuzhanin 2026). With "entropy" and a fit(..., covariates=[...]) list, each clean-control group is reweighted by entropy balancing (Hainmueller 2012) so its covariate means match the treated cohort’s (measured at the last pre-treatment period), and the resulting design weights b_sa are composed with the Wing corrective weights via the effective control mass into the final stacked weights W_sa. This is control-only reweighting, so it preserves the trimmed-aggregate-ATT estimand (it changes only how untreated trends are estimated, not the treated-cohort weights); at b_sa=1 it reduces to the paper’s unit-count weighted stacked DID, equal to weighting="aggregate" on balanced event windows. v1 requires weighting="aggregate" and balanced event windows (ragged windows raise a ValueError), and does not support survey_design=; matching-based balancing and the repeated-treatment extension are out of scope. Default "none" reproduces plain weighted stacked DID.
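The effect of `anticipation` on the event window can be made concrete with a small helper. This is a sketch under one stated assumption: "expands by anticipation pre-periods" is read here as moving the lower bound of the window to `-kappa_pre - anticipation`. The function name `event_window` is hypothetical and is not part of the library.

```python
def event_window(kappa_pre=1, kappa_post=1, anticipation=0):
    """Describe the event-time layout implied by the three parameters."""
    if min(kappa_pre, kappa_post, anticipation) < 0:
        raise ValueError("kappa_pre, kappa_post and anticipation must be >= 0")
    # Assumption: the window gains `anticipation` extra pre-periods on the left.
    event_times = list(range(-kappa_pre - anticipation, kappa_post + 1))
    # The reference period shifts from e=-1 to e=-1-anticipation.
    reference = -1 - anticipation
    # Post-treatment includes the anticipation periods (e >= -anticipation).
    post = [e for e in event_times if e >= -anticipation]
    pre = [e for e in event_times if e < -anticipation and e != reference]
    return {"event_times": event_times, "reference": reference, "pre": pre, "post": post}


layout = event_window(kappa_pre=2, kappa_post=2, anticipation=1)
print(layout)
```

With `anticipation=0` the helper returns the default layout: reference period `-1` and post-treatment periods `0` through `kappa_post`.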

results_

Estimation results after calling fit().

Type:: StackedDiDResults

is_fitted_

Whether the model has been fitted.

Type:: bool

Examples

Basic usage:

>>> from diff_diff import StackedDiD, generate_staggered_data
>>> data = generate_staggered_data(n_units=200, seed=42)
>>> est = StackedDiD(kappa_pre=2, kappa_post=2)
>>> results = est.fit(data, outcome='outcome', unit='unit',
...                   time='period', first_treat='first_treat')
>>> results.print_summary()

With event study:

>>> results = est.fit(data, outcome='outcome', unit='unit',
...                   time='period', first_treat='first_treat',
...                   aggregate='event_study')
>>> from diff_diff import plot_event_study
>>> plot_event_study(results)

Notes

The stacked estimator addresses TWFE bias by:
1. Creating one sub-experiment per adoption cohort with clean controls
2. Applying Q-weights to reweight the stacked regression
3. Running a single event-study WLS regression on the weighted stack
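The Q-weighting step can be illustrated with a short sketch. The formula used here is an assumption based on a reading of the aggregate weighting in Wing, Freedman & Hollingsworth (2024): treated observations keep weight 1, and control observations in sub-experiment a receive (N_a^D / N^D) / (N_a^C / N^C), the sub-experiment's share of treated observations divided by its share of control observations. Check Table 1 of the paper before relying on it; the function name `aggregate_q_weights` is hypothetical.

```python
from collections import Counter


def aggregate_q_weights(rows):
    """Compute one weight per stacked observation.

    rows: iterable of (sub_exp, treated) pairs, treated in {0, 1}.
    Assumed formula (see lead-in): controls get treated share / control share.
    """
    rows = list(rows)
    n_treat = Counter(s for s, d in rows if d == 1)
    n_ctrl = Counter(s for s, d in rows if d == 0)
    total_treat = sum(n_treat.values())
    total_ctrl = sum(n_ctrl.values())

    weights = []
    for s, d in rows:
        if d == 1:
            weights.append(1.0)  # treated observations are not reweighted
        else:
            treat_share = n_treat[s] / total_treat
            ctrl_share = n_ctrl[s] / total_ctrl
            weights.append(treat_share / ctrl_share)
    return weights


# Sub-experiment "A": 1 treated, 3 controls. Sub-experiment "B": 1 treated, 1 control.
rows = [("A", 1), ("A", 0), ("A", 0), ("A", 0), ("B", 1), ("B", 0)]
q = aggregate_q_weights(rows)
print(q)
```

In the toy stack both sub-experiments hold the same share of treated observations, and after reweighting their control groups carry the same total weight (3 × 2/3 in "A", 1 × 2 in "B"), which is the imbalance a naive unweighted stack would leave uncorrected.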

References

Wing, C., Freedman, S. M., & Hollingsworth, A. (2024). Stacked Difference-in-Differences. NBER Working Paper 32054.

Methods

`fit`(data, outcome, unit, time, first_treat)	Fit the stacked DiD estimator.
`get_params`()	Get estimator parameters (sklearn-compatible).
`set_params`(**params)	Set estimator parameters (sklearn-compatible).

__init__(kappa_pre=1, kappa_post=1, weighting='aggregate', clean_control='not_yet_treated', cluster='unit', alpha=0.05, anticipation=0, rank_deficient_action='warn', vcov_type='hc1', balance='none')[source]

Parameters:

kappa_pre (int)
kappa_post (int)
weighting (str)
clean_control (str)
cluster (str)
alpha (float)
anticipation (int)
rank_deficient_action (str)
vcov_type (str)
balance (str)

fit(data, outcome, unit, time, first_treat, aggregate=None, population=None, survey_design=None, covariates=None)[source]

Fit the stacked DiD estimator.

Parameters:

data (pd.DataFrame) – Panel data with unit and time identifiers.
outcome (str) – Name of outcome variable column.
unit (str) – Name of unit identifier column.
time (str) – Name of time period column.
first_treat (str) – Name of column indicating when unit was first treated. Use 0 or np.inf for never-treated units.
aggregate (str, optional) – Aggregation mode: None/"simple" (overall ATT only) or "event_study". Group aggregation is not supported because the pooled stacked regression cannot produce cohort-specific effects. Use CallawaySantAnna or ImputationDiD for cohort-level estimates.
population (str, optional) – Column name for population weights. Required only when weighting="population".
survey_design (SurveyDesign, optional) – Survey design specification for design-based inference. When provided, uses Taylor Series Linearization for variance estimation and applies sampling weights to the regression.
covariates (list of str, optional) – Covariate column names to balance the clean controls toward the treated cohort (requires balance="entropy"; see the constructor balance parameter). Values are read at the last pre-treatment period t = a-1-anticipation per sub-experiment, so balancing uses only pre-treatment information (Assumption 4). Raises ValueError if covariates are passed with balance="none" (or balance="entropy" is set without covariates), if a name is absent from data, or if a cohort cannot be balanced (infeasible).
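To show what balancing the clean controls toward the treated cohort means, here is a minimal sketch of entropy balancing for a single covariate with uniform base weights. Under those assumptions the solution has the exponential-tilting form w_i ∝ exp(λ·x_i), and λ is found by bisection because the weighted mean is increasing in λ. The function name `entropy_balance` and the bisection solver are illustrative choices; the library's solver, its scaling of the weights b_sa, and its multi-covariate handling are not documented here.

```python
import math


def entropy_balance(x_control, target_mean, tol=1e-10, max_iter=200):
    """Return weights summing to 1 whose weighted mean of x equals target_mean."""
    if not min(x_control) < target_mean < max(x_control):
        # The target must lie strictly inside the range of control values.
        raise ValueError("infeasible: target mean outside the range of the controls")

    def tilted(lam):
        # Exponential tilting; subtract the max exponent for numerical stability.
        z = [lam * x for x in x_control]
        m = max(z)
        w = [math.exp(v - m) for v in z]
        total = sum(w)
        w = [v / total for v in w]
        return w, sum(wi * xi for wi, xi in zip(w, x_control))

    # Bracket the root: the tilted mean is monotonically increasing in lam.
    lo, hi = -1.0, 1.0
    while tilted(lo)[1] > target_mean:
        lo *= 2.0
    while tilted(hi)[1] < target_mean:
        hi *= 2.0

    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        w, mean = tilted(mid)
        if abs(mean - target_mean) < tol:
            break
        if mean < target_mean:
            lo = mid
        else:
            hi = mid
    return w


x_control = [1.0, 2.0, 3.0, 4.0]   # covariate values of the clean controls
w = entropy_balance(x_control, target_mean=3.0)  # 3.0 plays the treated-cohort mean
print([round(v, 4) for v in w])
```

The infeasible case raised here (a treated mean outside the range of the control values) is one simple example of a cohort that cannot be balanced.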

Returns:

Object containing all estimation results.

Return type:

StackedDiDResults

Raises:

ValueError – If required columns are missing or data validation fails.

get_params()[source]

Get estimator parameters (sklearn-compatible).

Return type:: Dict[str, Any]

set_params(**params)[source]

Set estimator parameters (sklearn-compatible).

Re-validates vcov_type via the shared _validate_vcov_type helper so sklearn-style mutation hits the estimator-level guard before fit() (avoids a later, less-informative failure in the linalg layer).

Parameters:: params (Any)
Return type:: StackedDiD
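The re-validation behaviour described for set_params can be sketched with a toy class. This is a hypothetical stand-in, not the diff_diff source: `ToyEstimator` and this version of `_validate_vcov_type` are written only to illustrate the pattern, and the allowed set simply mirrors the values the constructor documentation says are accepted.

```python
_ALLOWED_VCOV = {"hc1", "hc2_bm"}  # values the constructor docs list as accepted


class ToyEstimator:
    """Hypothetical estimator showing sklearn-style set_params with validation."""

    def __init__(self, vcov_type="hc1", alpha=0.05):
        self.vcov_type = self._validate_vcov_type(vcov_type)
        self.alpha = alpha

    @staticmethod
    def _validate_vcov_type(value):
        if value not in _ALLOWED_VCOV:
            raise ValueError(
                f"vcov_type={value!r} is not supported; use one of {sorted(_ALLOWED_VCOV)}"
            )
        return value

    def get_params(self):
        return {"vcov_type": self.vcov_type, "alpha": self.alpha}

    def set_params(self, **params):
        valid = self.get_params()
        for name in params:
            if name not in valid:
                raise ValueError(f"unknown parameter: {name!r}")
        # Validate before mutating, so a rejected value leaves the object unchanged.
        if "vcov_type" in params:
            self._validate_vcov_type(params["vcov_type"])
        for name, value in params.items():
            setattr(self, name, value)
        return self


est = ToyEstimator()
est.set_params(alpha=0.1, vcov_type="hc2_bm")
print(est.get_params())
```

Validating before assignment means a bad value fails at the set_params call rather than later inside fit(), which is the motivation the method description gives.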

summary()[source]

Get summary of estimation results.

Return type:: str

print_summary()[source]

Print summary to stdout.

Return type:: None

StackedDiDResults

Results container for Stacked DiD estimation.

class diff_diff.StackedDiDResults[source]

Bases: object

Results from Stacked DiD estimation (Wing, Freedman & Hollingsworth 2024).

overall_att

Overall average treatment effect on the treated (average of post-treatment event-study coefficients).

Type:: float

overall_se

Standard error of overall ATT (delta method on VCV).

Type:: float

overall_t_stat

T-statistic for overall ATT.

Type:: float

overall_p_value

P-value for overall ATT.

Type:: float

overall_conf_int

Confidence interval for overall ATT.

Type:: tuple

event_study_effects

Dictionary mapping event time h to effect dict with keys: ‘effect’, ‘se’, ‘t_stat’, ‘p_value’, ‘conf_int’, ‘n_obs’.

Type:: dict, optional

group_effects

Dictionary mapping cohort g to effect dict.

Type:: dict, optional

stacked_data

Full stacked dataset with _sub_exp, _event_time, _D_sa, _Q_weight columns. Accessible for custom analysis.

Type:: pd.DataFrame

groups

Adoption events in the trimmed set (Omega_kappa).

Type:: list

trimmed_groups

Adoption events excluded by IC1/IC2.

Type:: list

time_periods

All time periods in the original data.

Type:: list

n_obs

Number of observations in the original data.

Type:: int

n_stacked_obs

Number of observations in the stacked dataset.

Type:: int

n_sub_experiments

Number of sub-experiments in the stack.

Type:: int

n_treated_units

Distinct treated units across trimmed set.

Type:: int

n_control_units

Distinct control units across trimmed set.

Type:: int

kappa_pre

Pre-treatment event-time window size.

Type:: int

kappa_post

Post-treatment event-time window size.

Type:: int

weighting

Weighting scheme used.

Type:: str

clean_control

Clean control definition used.

Type:: str

alpha

Significance level used.

Type:: float
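The overall_att and overall_se attributes above are described as the average of the post-treatment event-study coefficients and a delta-method standard error. Because the average is linear in the coefficients, the delta method reduces to the quadratic form w'Vw with equal weights w = 1/H. The sketch below shows that arithmetic; the function name and the toy numbers are illustrative and are not taken from the library.

```python
import math


def overall_att_and_se(post_coefs, post_vcov):
    """Average H post-treatment coefficients and propagate their covariance.

    post_coefs: list of H event-study coefficients (e >= 0).
    post_vcov:  H x H covariance matrix of those coefficients (list of lists).
    """
    h = len(post_coefs)
    weights = [1.0 / h] * h  # equal weight on each post-treatment event time
    att = sum(w * b for w, b in zip(weights, post_coefs))
    # Variance of a linear combination: w' V w.
    var = sum(
        weights[i] * post_vcov[i][j] * weights[j]
        for i in range(h)
        for j in range(h)
    )
    return att, math.sqrt(var)


coefs = [1.0, 2.0, 3.0]
vcov = [
    [0.09, 0.0, 0.0],
    [0.0, 0.09, 0.0],
    [0.0, 0.0, 0.09],
]
att, se = overall_att_and_se(coefs, vcov)
print(att, se)
```

With correlated coefficients the off-diagonal entries of the covariance matrix enter the sum as well, so the standard error cannot be recovered from the per-period standard errors alone.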

Methods

`summary`([alpha])	Generate formatted summary of estimation results.
`print_summary`([alpha])	Print summary to stdout.
`to_dataframe`([level])	Convert results to DataFrame.

overall_att: float

overall_se: float

overall_t_stat: float

overall_p_value: float

overall_conf_int: Tuple[float, float]

event_study_effects: Dict[int, Dict[str, Any]] | None

group_effects: Dict[Any, Dict[str, Any]] | None

stacked_data: DataFrame

groups: List[Any]

trimmed_groups: List[Any]

time_periods: List[Any]

n_obs: int = 0

n_stacked_obs: int = 0

n_sub_experiments: int = 0

n_treated_units: int = 0

n_control_units: int = 0

kappa_pre: int = 1

kappa_post: int = 1

weighting: str = 'aggregate'

clean_control: str = 'not_yet_treated'

alpha: float = 0.05

anticipation: int = 0

vcov_type: str = 'hc1'

cluster_name: str | None = None

n_clusters: int | None = None

survey_metadata: Any | None = None

balance: str = 'none'

covariates: List[str] | None = None

balance_diagnostics: Dict[Any, Dict[str, Any]] | None = None

property att: float

property se: float

property conf_int: Tuple[float, float]

property p_value: float

property t_stat: float

__repr__()[source]

Concise string representation.

Return type:: str

property coef_var: float

Coefficient of variation: SE / abs(overall ATT). NaN when the ATT is 0 or the SE is non-finite.

Type:: float
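The coef_var rule can be written out as a small function. This is a sketch of the documented behaviour only (SE divided by the absolute ATT, NaN when the ATT is 0 or the SE is non-finite), not the property's actual source.

```python
import math


def coef_var(att, se):
    """Coefficient of variation of the overall ATT, following the documented rule."""
    if att == 0 or not math.isfinite(se):
        return math.nan
    return se / abs(att)


print(coef_var(-2.0, 0.5))
```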

summary(alpha=None)[source]

Generate formatted summary of estimation results.

Parameters:: alpha (float, optional) – Significance level. Defaults to alpha used in estimation.
Returns:: Formatted summary.
Return type:: str

print_summary(alpha=None)[source]

Print summary to stdout.

Parameters:: alpha (float | None)
Return type:: None

to_dataframe(level='event_study')[source]

Convert results to DataFrame.

Parameters:: level (str, default="event_study") – Level of aggregation:
- "event_study": Event study effects by relative time
- "group": Group (cohort) effects
Returns:: Results as DataFrame.
Return type:: pd.DataFrame

property is_significant: bool: Check if overall ATT is significant.

property significance_stars: str: Significance stars for overall ATT.

__init__(overall_att, overall_se, overall_t_stat, overall_p_value, overall_conf_int, event_study_effects, group_effects, stacked_data, groups=<factory>, trimmed_groups=<factory>, time_periods=<factory>, n_obs=0, n_stacked_obs=0, n_sub_experiments=0, n_treated_units=0, n_control_units=0, kappa_pre=1, kappa_post=1, weighting='aggregate', clean_control='not_yet_treated', alpha=0.05, anticipation=0, vcov_type='hc1', cluster_name=None, n_clusters=None, survey_metadata=None, balance='none', covariates=None, balance_diagnostics=None)

Parameters:

overall_att (float)
overall_se (float)
overall_t_stat (float)
overall_p_value (float)
overall_conf_int (Tuple[float, float])
event_study_effects (Dict[int, Dict[str, Any]] | None)
group_effects (Dict[Any, Dict[str, Any]] | None)
stacked_data (DataFrame)
groups (List[Any])
trimmed_groups (List[Any])
time_periods (List[Any])
n_obs (int)
n_stacked_obs (int)
n_sub_experiments (int)
n_treated_units (int)
n_control_units (int)
kappa_pre (int)
kappa_post (int)
weighting (str)
clean_control (str)
alpha (float)
anticipation (int)
vcov_type (str)
cluster_name (str | None)
n_clusters (int | None)
survey_metadata (Any | None)
balance (str)
covariates (List[str] | None)
balance_diagnostics (Dict[Any, Dict[str, Any]] | None)

Return type:

None

Convenience Function

diff_diff.stacked_did(data, outcome, unit, time, first_treat, kappa_pre=1, kappa_post=1, aggregate=None, population=None, survey_design=None, covariates=None, **kwargs)[source]

Convenience function for stacked DiD estimation.

This is a shortcut for creating a StackedDiD estimator and calling fit().

Parameters:

data (pd.DataFrame) – Panel data.
outcome (str) – Outcome variable column name.
unit (str) – Unit identifier column name.
time (str) – Time period column name.
first_treat (str) – Column indicating first treatment period (0 or inf for never-treated).
kappa_pre (int, default=1) – Pre-treatment event-time periods.
kappa_post (int, default=1) – Post-treatment event-time periods.
aggregate (str, optional) – Aggregation mode: None, "simple", or "event_study".
population (str, optional) – Population column for weighting="population".
survey_design (SurveyDesign, optional) – Survey design specification for design-based inference.
covariates (list of str, optional) – Covariate columns to balance the clean controls toward the treated cohort (pass balance="entropy" via **kwargs to enable). See StackedDiD.fit.
**kwargs – Additional keyword arguments passed to StackedDiD constructor (e.g. balance="entropy", weighting, cluster, vcov_type).

Returns:

Estimation results.

Return type:

StackedDiDResults

Examples

>>> from diff_diff import stacked_did, generate_staggered_data
>>> data = generate_staggered_data(seed=42)
>>> results = stacked_did(data, 'outcome', 'unit', 'period',
...                       'first_treat', kappa_pre=2, kappa_post=2,
...                       aggregate='event_study')
>>> results.print_summary()

Example Usage

Basic usage:

from diff_diff import StackedDiD, generate_staggered_data

data = generate_staggered_data(n_units=200, n_periods=12,
                                cohort_periods=[4, 6, 8], seed=42)

est = StackedDiD(kappa_pre=2, kappa_post=2)
results = est.fit(data, outcome='outcome', unit='unit',
                  time='period', first_treat='first_treat',
                  aggregate='event_study')
results.print_summary()

Accessing the stacked dataset:

# The stacked data is available for custom analysis
stacked = results.stacked_data
print(stacked[['unit', 'period', '_sub_exp', '_event_time', '_D_sa', '_Q_weight']].head())
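As one example of custom analysis on the stacked data, the sketch below computes Q-weighted outcome means by event time and treatment status, then differences the treated-control gap against the reference period. It runs on hand-made rows that reuse the documented column names (`_event_time`, `_D_sa`, `_Q_weight`) plus an `outcome` column; the helper names and the toy values are hypothetical. It is a difference in weighted means, not the regression the estimator runs, and it produces no standard errors.

```python
from collections import defaultdict


def weighted_means(rows):
    """Q-weighted outcome mean for each (_event_time, _D_sa) cell."""
    num = defaultdict(float)
    den = defaultdict(float)
    for r in rows:
        key = (r["_event_time"], r["_D_sa"])
        num[key] += r["_Q_weight"] * r["outcome"]
        den[key] += r["_Q_weight"]
    return {k: num[k] / den[k] for k in num}


def did_by_event_time(rows, reference=-1):
    """Treated-minus-control gap at each event time, relative to the reference."""
    m = weighted_means(rows)
    gap = {e: m[(e, 1)] - m[(e, 0)] for (e, d) in m if d == 1 and (e, 0) in m}
    return {e: g - gap[reference] for e, g in sorted(gap.items())}


# Toy stacked rows: one treated unit and two controls with different Q-weights.
rows = [
    {"_event_time": -1, "_D_sa": 1, "_Q_weight": 1.0, "outcome": 1.0},
    {"_event_time": 0, "_D_sa": 1, "_Q_weight": 1.0, "outcome": 4.0},
    {"_event_time": -1, "_D_sa": 0, "_Q_weight": 0.5, "outcome": 1.0},
    {"_event_time": 0, "_D_sa": 0, "_Q_weight": 0.5, "outcome": 2.0},
    {"_event_time": -1, "_D_sa": 0, "_Q_weight": 1.5, "outcome": 2.0},
    {"_event_time": 0, "_D_sa": 0, "_Q_weight": 1.5, "outcome": 3.0},
]
effects = did_by_event_time(rows)
print(effects)
```

With a real fit, the same functions could be fed `results.stacked_data.to_dict("records")`, assuming the outcome column keeps its original name in the stacked dataset.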

Different weighting schemes:

# Population-weighted ATT (requires population column)
est = StackedDiD(kappa_pre=2, kappa_post=2, weighting='population')
results = est.fit(data, outcome='outcome', unit='unit',
                  time='period', first_treat='first_treat',
                  population='pop_size')

# Sample-share weighted ATT
est = StackedDiD(kappa_pre=2, kappa_post=2, weighting='sample_share')
results = est.fit(data, outcome='outcome', unit='unit',
                  time='period', first_treat='first_treat')

Comparison with Other Staggered Estimators

Feature	Stacked DiD	Callaway-Sant’Anna
Approach	Pooled WLS on stacked sub-experiments	Separate group-time regressions
Compositional balance	Enforced by IC1/IC2 trimming	Via balanced event study aggregation
Target parameter	Trimmed aggregate ATT	Weighted average of ATT(g,t)
Custom analysis	Full stacked dataset accessible	Group-time effects accessible
Covariates	Entropy balancing via `balance="entropy"` + `fit(covariates=...)` (CBWSDID, Ustyuzhanin 2026); requires `weighting="aggregate"` + balanced windows, no `survey_design`	Supported (OR, IPW, DR)