Data Preparation#

Utilities for preparing and validating data for DiD analysis.

Data Generation#

generate_did_data#

Generate synthetic data with known treatment effects for testing.

diff_diff.generate_did_data(n_units=100, n_periods=4, treatment_effect=5.0, treatment_fraction=0.5, treatment_period=2, unit_fe_sd=2.0, time_trend=0.5, noise_sd=1.0, seed=None)

Generate synthetic data for DiD analysis with known treatment effect.

Creates a balanced panel dataset with realistic features including unit fixed effects, time trends, and a known treatment effect.

Parameters:
  • n_units (int, default=100) – Number of units in the panel.

  • n_periods (int, default=4) – Number of time periods.

  • treatment_effect (float, default=5.0) – True average treatment effect on the treated.

  • treatment_fraction (float, default=0.5) – Fraction of units that receive treatment.

  • treatment_period (int, default=2) – First post-treatment period (0-indexed). Periods >= this are post.

  • unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.

  • time_trend (float, default=0.5) – Linear time trend coefficient.

  • noise_sd (float, default=1.0) – Standard deviation of idiosyncratic noise.

  • seed (int, optional) – Random seed for reproducibility.

Returns:

Synthetic panel data with columns:

  • unit: Unit identifier

  • period: Time period

  • treated: Treatment indicator (0/1)

  • post: Post-treatment indicator (0/1)

  • outcome: Outcome variable

  • true_effect: The true treatment effect (for validation)

Return type:

pd.DataFrame

Examples

Generate simple data for testing:

>>> data = generate_did_data(n_units=50, n_periods=4, treatment_effect=3.0, seed=42)
>>> len(data)
200
>>> data.columns.tolist()
['unit', 'period', 'treated', 'post', 'outcome', 'true_effect']

Verify treatment effect recovery:

>>> from diff_diff import DifferenceInDifferences
>>> did = DifferenceInDifferences()
>>> results = did.fit(data, outcome='outcome', treatment='treated', time='post')
>>> abs(results.att - 3.0) < 1.0  # Close to true effect
True
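To see why the effect is recoverable, the following standard-library sketch simulates an additive panel (unit fixed effect + linear time trend + treatment effect + noise) and computes the 2x2 difference of cell means. It is not the library's implementation: the additive form, the helper names simulate_did and did_from_cell_means, and the rule that the first units are the treated ones are all assumptions made for illustration.

```python
import random
from statistics import mean


def simulate_did(n_units=400, n_periods=4, treatment_effect=3.0,
                 treatment_fraction=0.5, treatment_period=2,
                 unit_fe_sd=2.0, time_trend=0.5, noise_sd=1.0, seed=0):
    """Hypothetical stand-in for the generator (assumed additive DGP)."""
    rng = random.Random(seed)
    n_treated = int(n_units * treatment_fraction)
    rows = []
    for unit in range(n_units):
        treated = int(unit < n_treated)  # assumption: first IDs are treated
        unit_fe = rng.gauss(0.0, unit_fe_sd)
        for period in range(n_periods):
            post = int(period >= treatment_period)
            true_effect = treatment_effect * treated * post
            outcome = (unit_fe + time_trend * period + true_effect
                       + rng.gauss(0.0, noise_sd))
            rows.append({"unit": unit, "period": period, "treated": treated,
                         "post": post, "outcome": outcome,
                         "true_effect": true_effect})
    return rows


def did_from_cell_means(rows):
    """(treated post - treated pre) - (control post - control pre)."""
    def cell(treated, post):
        return mean(r["outcome"] for r in rows
                    if r["treated"] == treated and r["post"] == post)
    return (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))


rows = simulate_did()
estimate = did_from_cell_means(rows)
print(len(rows), round(estimate, 2))
```

In a balanced panel the unit fixed effects cancel within each group and the common time trend cancels across groups, so only the treatment effect and averaged noise remain in the estimate.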

Example#

from diff_diff import generate_did_data

# Generate basic 2x2 DiD data
data = generate_did_data(
    n_units=100,
    n_periods=10,
    treatment_effect=5.0,
    treatment_period=5,
    treatment_fraction=0.5,
    noise_sd=1.0
)

print(data.head())
# Columns: unit, period, treated, post, outcome, true_effect

generate_staggered_data#

Generate synthetic staggered adoption data for testing.

diff_diff.generate_staggered_data(n_units=100, n_periods=10, cohort_periods=None, never_treated_frac=0.3, treatment_effect=2.0, dynamic_effects=True, effect_growth=0.1, unit_fe_sd=2.0, time_trend=0.1, noise_sd=0.5, seed=None, panel=True)

Generate synthetic data for staggered adoption DiD analysis.

Creates panel data where different units receive treatment at different times (staggered rollout). Useful for testing CallawaySantAnna, SunAbraham, and other staggered DiD estimators.

Parameters:
  • n_units (int, default=100) – Total number of units in the panel.

  • n_periods (int, default=10) – Number of time periods.

  • cohort_periods (list of int, optional) – Periods when treatment cohorts are first treated. If None, defaults to [3, 5, 7] for a 10-period panel.

  • never_treated_frac (float, default=0.3) – Fraction of units that are never treated (cohort 0).

  • treatment_effect (float, default=2.0) – Base treatment effect at time of treatment.

  • dynamic_effects (bool, default=True) – If True, treatment effects grow over time since treatment.

  • effect_growth (float, default=0.1) – Per-period growth in treatment effect (if dynamic_effects=True). Effect at time t since treatment: effect * (1 + effect_growth * t).

  • unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.

  • time_trend (float, default=0.1) – Linear time trend coefficient.

  • noise_sd (float, default=0.5) – Standard deviation of idiosyncratic noise.

  • seed (int, optional) – Random seed for reproducibility.

  • panel (bool, default=True) – If True (default), generate balanced panel data (same units across all periods). If False, generate repeated cross-section data where each period draws independent observations with globally unique IDs.

Returns:

Synthetic staggered adoption data with columns:

  • unit: Unit identifier

  • period: Time period

  • outcome: Outcome variable

  • first_treat: First treatment period (0 = never treated)

  • treated: Binary indicator (1 if treated at this observation)

  • treat: Binary unit-level ever-treated indicator

  • true_effect: The true treatment effect for this observation

Return type:

pd.DataFrame

Examples

Generate staggered adoption data:

>>> data = generate_staggered_data(n_units=100, n_periods=10, seed=42)
>>> data['first_treat'].value_counts().sort_index()
0     30
3     24
5     23
7     23
Name: first_treat, dtype: int64

Use with Callaway-Sant’Anna estimator:

>>> from diff_diff import CallawaySantAnna
>>> cs = CallawaySantAnna()
>>> results = cs.fit(data, outcome='outcome', unit='unit',
...                  time='period', first_treat='first_treat')
>>> results.overall_att > 0
True
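The dynamic-effect rule documented for effect_growth, effect * (1 + effect_growth * t), can be sketched in a few lines. The helper name true_effect is hypothetical, and the sketch assumes t is counted from the cohort's first treated period (t = period - first_treat, so t = 0 gives the base effect); the library may index t differently.

```python
def true_effect(period, first_treat, treatment_effect=2.0,
                dynamic_effects=True, effect_growth=0.1):
    """Per-observation effect under the documented growth formula."""
    # first_treat == 0 marks never-treated units; earlier periods are untreated.
    if first_treat == 0 or period < first_treat:
        return 0.0
    if not dynamic_effects:
        return treatment_effect
    t = period - first_treat  # assumption: t = 0 in the first treated period
    return treatment_effect * (1 + effect_growth * t)


# Effect path for the cohort first treated in period 3.
for period in range(2, 7):
    print(period, true_effect(period, first_treat=3))
```

Because the effect differs by cohort and by time since treatment, the true_effect column is the natural benchmark when checking cohort-specific or event-time estimates.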

Example#

from diff_diff import generate_staggered_data

data = generate_staggered_data(
    n_units=200,
    n_periods=10,
    cohort_periods=[4, 6, 8],
    seed=42
)

generate_event_study_data#

Generate synthetic event study data for testing.

diff_diff.generate_event_study_data(n_units=300, n_pre=5, n_post=5, treatment_fraction=0.5, treatment_effect=5.0, unit_fe_sd=2.0, noise_sd=2.0, seed=None)

Generate synthetic data for event study analysis.

Creates panel data with simultaneous treatment at period n_pre. Useful for testing MultiPeriodDiD, pre-trends power analysis, and HonestDiD sensitivity analysis.

Parameters:
  • n_units (int, default=300) – Total number of units in the panel.

  • n_pre (int, default=5) – Number of pre-treatment periods.

  • n_post (int, default=5) – Number of post-treatment periods.

  • treatment_fraction (float, default=0.5) – Fraction of units that receive treatment.

  • treatment_effect (float, default=5.0) – True average treatment effect on the treated.

  • unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.

  • noise_sd (float, default=2.0) – Standard deviation of idiosyncratic noise.

  • seed (int, optional) – Random seed for reproducibility.

Returns:

Synthetic event study data with columns:

  • unit: Unit identifier

  • period: Time period

  • treated: Binary unit-level treatment indicator

  • post: Binary post-treatment indicator

  • outcome: Outcome variable

  • event_time: Time relative to treatment (negative=pre, 0+=post)

  • true_effect: The true treatment effect for this observation

Return type:

pd.DataFrame

Examples

Generate event study data:

>>> data = generate_event_study_data(n_units=300, n_pre=5, n_post=5, seed=42)
>>> data['event_time'].unique()
array([-5, -4, -3, -2, -1,  0,  1,  2,  3,  4])

Use with MultiPeriodDiD:

>>> from diff_diff import MultiPeriodDiD
>>> mp_did = MultiPeriodDiD()
>>> results = mp_did.fit(data, outcome='outcome', treatment='treated',
...                      time='period', post_periods=[5, 6, 7, 8, 9])

Notes

The event_time column is relative to treatment:

  • Negative values: pre-treatment periods

  • 0: first post-treatment period

  • Positive values: subsequent post-treatment periods
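The relationship between period, event_time, and the post_periods list passed to MultiPeriodDiD can be sketched as follows. The helper event_study_index is hypothetical; it only relies on the documented facts that periods are 0-indexed and that treatment starts at period n_pre.

```python
def event_study_index(n_pre=5, n_post=5):
    """Return (periods, event_time, post_periods) for a simultaneous design."""
    periods = list(range(n_pre + n_post))
    # Treatment starts at period n_pre, so event time 0 is the first post period.
    event_time = [p - n_pre for p in periods]
    post_periods = [p for p in periods if p >= n_pre]
    return periods, event_time, post_periods


periods, event_time, post_periods = event_study_index()
print(event_time)
print(post_periods)
```

With the defaults this reproduces the event times from -5 to 4 and the post-period list [5, 6, 7, 8, 9] used in the example above.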

generate_ddd_data#

Generate synthetic Triple Difference data.

diff_diff.generate_ddd_data(n_per_cell=100, treatment_effect=2.0, group_effect=2.0, partition_effect=1.0, time_effect=0.5, noise_sd=1.0, add_covariates=False, seed=None)

Generate synthetic data for Triple Difference (DDD) analysis.

Creates data following the DGP: Y = mu + G + P + T + G*P + G*T + P*T + tau*G*P*T + eps

where G=group, P=partition, T=time. The treatment effect (tau) only applies to units that are in the treated group (G=1), eligible partition (P=1), and post-treatment period (T=1).

Parameters:
  • n_per_cell (int, default=100) – Number of observations per cell (8 cells total: 2x2x2).

  • treatment_effect (float, default=2.0) – True average treatment effect on the treated (G=1, P=1, T=1).

  • group_effect (float, default=2.0) – Main effect of being in treated group.

  • partition_effect (float, default=1.0) – Main effect of being in eligible partition.

  • time_effect (float, default=0.5) – Main effect of post-treatment period.

  • noise_sd (float, default=1.0) – Standard deviation of idiosyncratic noise.

  • add_covariates (bool, default=False) – If True, adds age and education covariates that affect outcome.

  • seed (int, optional) – Random seed for reproducibility.

Returns:

Synthetic DDD data with columns:

  • outcome: Outcome variable

  • group: Group indicator (0=control, 1=treated)

  • partition: Partition indicator (0=ineligible, 1=eligible)

  • time: Time indicator (0=pre, 1=post)

  • unit_id: Unique unit identifier

  • true_effect: The true treatment effect for this observation

  • age: Age covariate (if add_covariates=True)

  • education: Education covariate (if add_covariates=True)

Return type:

pd.DataFrame

Examples

Generate DDD data:

>>> data = generate_ddd_data(n_per_cell=100, treatment_effect=3.0, seed=42)
>>> data.shape
(800, 6)
>>> data.groupby(['group', 'partition', 'time']).size()
group  partition  time
0      0          0       100
                  1       100
       1          0       100
                  1       100
1      0          0       100
                  1       100
       1          0       100
                  1       100
dtype: int64

Use with TripleDifference estimator:

>>> from diff_diff import TripleDifference
>>> ddd = TripleDifference()
>>> results = ddd.fit(data, outcome='outcome', group='group',
...                   partition='partition', time='time')
>>> abs(results.att - 3.0) < 1.0
True
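To make the identification argument behind the DGP concrete, the sketch below evaluates noise-free cell means and takes the triple difference by hand. It is an illustration, not the estimator's code: the intercept mu and the two-way interaction coefficients (gp, gt, pt) are hypothetical values chosen for the example, since the cross-sectional generator does not expose them as parameters.

```python
from itertools import product


def cell_mean(g, p, t, tau=3.0, mu=1.0, group_effect=2.0,
              partition_effect=1.0, time_effect=0.5,
              gp=1.5, gt=1.0, pt=0.5):
    """Noise-free E[Y | G=g, P=p, T=t] under the stated DGP."""
    return (mu + group_effect * g + partition_effect * p + time_effect * t
            + gp * g * p + gt * g * t + pt * p * t  # hypothetical coefficients
            + tau * g * p * t)


def triple_difference(mean_fn):
    """DiD across partitions in the treated group minus the same DiD in the control group."""
    y = {cell: mean_fn(*cell) for cell in product((0, 1), repeat=3)}
    did_treated_group = (y[1, 1, 1] - y[1, 1, 0]) - (y[1, 0, 1] - y[1, 0, 0])
    did_control_group = (y[0, 1, 1] - y[0, 1, 0]) - (y[0, 0, 1] - y[0, 0, 0])
    return did_treated_group - did_control_group


print(triple_difference(cell_mean))
```

Every main effect and two-way interaction cancels in the triple difference, leaving only tau, which is why the generator's treatment_effect is the quantity TripleDifference targets.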

generate_ddd_panel_data#

Generate synthetic panel-structured Triple Difference data for power analysis.

diff_diff.generate_ddd_panel_data(n_units=200, n_periods=8, treatment_period=4, group_frac=0.5, partition_frac=0.5, treatment_effect=2.0, group_effect=2.0, partition_effect=1.0, time_effect=0.5, group_time_interaction=1.0, partition_time_interaction=0.5, group_partition_interaction=1.5, unit_fe_sd=1.5, noise_sd=1.0, add_covariates=False, seed=None)

Generate synthetic panel data for Triple Difference (DDD) power analysis.

Creates a balanced panel of n_units observed over n_periods with two time-invariant binary dimensions (group and partition) and a derived binary post indicator. The triple-interaction effect (group * partition * post) is the identifying ATT under DDD-CPT.

The DGP equation is:

Y_{i,t} = unit_fe_i
        + group_i        * group_effect
        + partition_i    * partition_effect
        + post_t         * time_effect
        + (group_i * partition_i)  * group_partition_interaction
        + (group_i * post_t)       * group_time_interaction
        + (partition_i * post_t)   * partition_time_interaction
        + treatment_effect * group_i * partition_i * post_t
        + epsilon_{i,t}

where group_i and partition_i are unit-level (constant in t) and post_t = 1[period >= treatment_period]. DDD-CPT identification holds because group_partition_interaction enters only as a unit-level (time-invariant) effect, leaving the triple-interaction as the sole source of differential group × partition trend.

Unlike the cross-sectional generate_ddd_data, this DGP provides panel-realistic unit fixed effects and within-unit serial structure, making it suitable for panel-aware power-analysis simulations or sanity-checking estimators that ignore the panel dimension.

Warning

TripleDifference is a repeated-cross-section panel=FALSE estimator: its analytical default treats each row as an independent observation (df = n_obs - 8). When fitting against generate_ddd_panel_data output, the within-unit serial correlation makes unclustered SEs anti-conservative — they understate sampling variability and overstate power. Always pass cluster="unit" (Liang-Zeger CR1) when fitting on panel-generated data; the point estimate att is invariant to clustering but the inference contract is not. See the TripleDifference REGISTRY entry for the clustering contract.

Parameters:
  • n_units (int, default=200) – Number of units in the panel.

  • n_periods (int, default=8) – Number of time periods.

  • treatment_period (int, default=4) – Period (0-indexed) at which post switches from 0 to 1. Must satisfy 1 <= treatment_period < n_periods.

  • group_frac (float, default=0.5) – Fraction of units with group=1. Must be in (0, 1). The partition split is then drawn stratified-by-group at the requested partition_frac so every (group, partition) cell receives at least one unit; a ValueError is raised when the rounded cell counts would leave any cell empty.

  • partition_frac (float, default=0.5) – Fraction of units with partition=1 within each group stratum. Must be in (0, 1). The stratified allocation is what makes TripleDifference.fit’s 2x2x2 surface populated for any valid (n_units, group_frac, partition_frac).

  • treatment_effect (float, default=2.0) – True ATT for the triple-interaction cell (group=1, partition=1, post=1).

  • group_effect (float, default=2.0) – Main effect of group=1 (unit-level).

  • partition_effect (float, default=1.0) – Main effect of partition=1 (unit-level).

  • time_effect (float, default=0.5) – Main effect of post=1 (time-level).

  • group_time_interaction (float, default=1.0) – Coefficient on group * post (differential trend for the group dimension).

  • partition_time_interaction (float, default=0.5) – Coefficient on partition * post (differential trend for the partition dimension).

  • group_partition_interaction (float, default=1.5) – Coefficient on the unit-level group * partition interaction. Must be time-invariant for DDD-CPT to hold.

  • unit_fe_sd (float, default=1.5) – Standard deviation of the unit fixed effect.

  • noise_sd (float, default=1.0) – Standard deviation of the idiosyncratic noise term.

  • add_covariates (bool, default=False) – If True, add unit-level covariates x1 (continuous) and x2 (binary) that affect the outcome.

  • seed (int, optional) – Random seed for reproducibility.

Returns:

Long-format panel with columns:

  • unit: integer unit ID.

  • period: integer time period (0-indexed).

  • outcome: outcome variable.

  • group: unit-level binary group indicator (time-invariant).

  • partition: unit-level binary partition indicator (time-invariant, orthogonal to group).

  • post: binary indicator, 1 if period >= treatment_period.

  • treated: group * partition * post (binary).

  • true_effect: treatment_effect when treated, else 0.

  • x1, x2: optional unit-level covariates (only if add_covariates=True).

Return type:

pd.DataFrame

Examples

Generate a balanced panel with default parameters:

>>> data = generate_ddd_panel_data(n_units=200, n_periods=8, seed=42)
>>> data.shape
(1600, 8)
>>> data.groupby('unit')['period'].count().eq(8).all()
True

Fit with TripleDifference. Note time="post" (the derived binary indicator) and cluster="unit" (required for valid inference on panel-generated data; see the warning above):

>>> from diff_diff import TripleDifference
>>> result = TripleDifference(cluster="unit").fit(
...     data, outcome='outcome', group='group',
...     partition='partition', time='post',
... )
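The stratified allocation described for group_frac and partition_frac can be pictured with a small sketch. The helper allocate_cells is hypothetical, and the rounding rule (Python's built-in round) is an assumption; the library's exact rounding and error message are not shown here.

```python
def allocate_cells(n_units, group_frac=0.5, partition_frac=0.5):
    """Count units per (group, partition) cell, stratifying partition by group."""
    if not (0 < group_frac < 1 and 0 < partition_frac < 1):
        raise ValueError("group_frac and partition_frac must be in (0, 1)")
    n_group1 = round(n_units * group_frac)  # assumed rounding rule
    counts = {}
    for g, n_g in ((0, n_units - n_group1), (1, n_group1)):
        n_part1 = round(n_g * partition_frac)
        counts[(g, 1)] = n_part1
        counts[(g, 0)] = n_g - n_part1
    empty = sorted(cell for cell, n in counts.items() if n < 1)
    if empty:
        raise ValueError(f"empty (group, partition) cells: {empty}")
    return counts


print(allocate_cells(200))
```

Drawing the partition split inside each group stratum, rather than independently over all units, is what guarantees that no cell of the 2x2x2 design can be empty by chance.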

generate_factor_data#

Generate synthetic data with factor structure for TROP testing.

diff_diff.generate_factor_data(n_units=50, n_pre=10, n_post=5, n_treated=10, n_factors=2, treatment_effect=2.0, factor_strength=1.0, treated_loading_shift=0.5, unit_fe_sd=1.0, noise_sd=0.5, seed=None)

Generate synthetic panel data with interactive fixed effects (factor model).

Creates data following the DGP: Y_it = mu + alpha_i + beta_t + Lambda_i’F_t + tau*D_it + eps_it

where Lambda_i’F_t is the interactive fixed effects component. Useful for testing TROP (Triply Robust Panel) and comparing with SyntheticDiD.

Parameters:
  • n_units (int, default=50) – Total number of units in the panel.

  • n_pre (int, default=10) – Number of pre-treatment periods.

  • n_post (int, default=5) – Number of post-treatment periods.

  • n_treated (int, default=10) – Number of treated units (assigned to first n_treated unit IDs).

  • n_factors (int, default=2) – Number of latent factors in the interactive fixed effects.

  • treatment_effect (float, default=2.0) – True average treatment effect on the treated.

  • factor_strength (float, default=1.0) – Scaling factor for interactive fixed effects.

  • treated_loading_shift (float, default=0.5) – Shift in factor loadings for treated units (creates confounding).

  • unit_fe_sd (float, default=1.0) – Standard deviation of unit fixed effects.

  • noise_sd (float, default=0.5) – Standard deviation of idiosyncratic noise.

  • seed (int, optional) – Random seed for reproducibility.

Returns:

Synthetic factor model data with columns:

  • unit: Unit identifier

  • period: Time period

  • outcome: Outcome variable

  • treated: Binary indicator (1 if treated at this observation)

  • treat: Binary unit-level ever-treated indicator

  • true_effect: The true treatment effect for this observation

Return type:

pd.DataFrame

Examples

Generate data with factor structure:

>>> data = generate_factor_data(n_units=50, n_factors=2, seed=42)
>>> data.shape
(750, 6)

Use with TROP estimator:

>>> from diff_diff import TROP
>>> trop = TROP(n_bootstrap=50, seed=42)
>>> results = trop.fit(data, outcome='outcome', treatment='treated',
...                    unit='unit', time='period',
...                    post_periods=list(range(10, 15)))

Notes

The treated units have systematically different factor loadings (shifted by treated_loading_shift), which creates confounding that standard DiD cannot address but TROP can handle.
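A noise-free toy version of this confounding is sketched below. It assumes a single latent factor that trends linearly (F_t = t) and a control loading of zero; both are illustrative choices, not the generator's actual factor draws. The terms mu, alpha_i, and beta_t are omitted because they difference out of a DiD comparison.

```python
def outcome(period, loading, tau, treated, n_pre):
    """Y_it = Lambda_i * F_t + tau * D_it, with F_t = t (illustrative)."""
    factor = float(period)
    d_it = int(treated and period >= n_pre)
    return loading * factor + tau * d_it


def naive_did(tau=2.0, treated_loading_shift=0.5, n_pre=2, n_post=2):
    """Plain difference-in-differences of group mean changes."""
    periods = range(n_pre + n_post)

    def mean_change(loading, treated):
        pre = [outcome(t, loading, tau, treated, n_pre)
               for t in periods if t < n_pre]
        post = [outcome(t, loading, tau, treated, n_pre)
                for t in periods if t >= n_pre]
        return sum(post) / len(post) - sum(pre) / len(pre)

    return mean_change(treated_loading_shift, True) - mean_change(0.0, False)


print(naive_did())                           # shifted loadings
print(naive_did(treated_loading_shift=0.0))  # identical loadings
```

With shifted loadings the treated group rides the factor more strongly, so the naive estimate absorbs the differential factor path on top of tau; with identical loadings the factor cancels and the estimate equals tau.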

generate_panel_data#

Generate generic synthetic panel data.

diff_diff.generate_panel_data(n_units=100, n_periods=8, treatment_period=4, treatment_fraction=0.5, treatment_effect=5.0, parallel_trends=True, trend_violation=1.0, unit_fe_sd=2.0, noise_sd=0.5, seed=None)

Generate synthetic panel data for parallel trends testing.

Creates panel data with optional violation of parallel trends, useful for testing parallel trends diagnostics, placebo tests, and sensitivity analysis methods.

Parameters:
  • n_units (int, default=100) – Total number of units in the panel.

  • n_periods (int, default=8) – Number of time periods.

  • treatment_period (int, default=4) – First post-treatment period (0-indexed).

  • treatment_fraction (float, default=0.5) – Fraction of units that receive treatment.

  • treatment_effect (float, default=5.0) – True average treatment effect on the treated.

  • parallel_trends (bool, default=True) – If True, treated and control groups have parallel pre-treatment trends. If False, treated group has a steeper pre-treatment trend.

  • trend_violation (float, default=1.0) – Size of the differential trend for treated group when parallel_trends=False. Treated units have trend = common_trend + trend_violation.

  • unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.

  • noise_sd (float, default=0.5) – Standard deviation of idiosyncratic noise.

  • seed (int, optional) – Random seed for reproducibility.

Returns:

Synthetic panel data with columns:

  • unit: Unit identifier

  • period: Time period

  • treated: Binary unit-level treatment indicator

  • post: Binary post-treatment indicator

  • outcome: Outcome variable

  • true_effect: The true treatment effect for this observation

Return type:

pd.DataFrame

Examples

Generate data with parallel trends:

>>> data_parallel = generate_panel_data(parallel_trends=True, seed=42)
>>> from diff_diff.utils import check_parallel_trends
>>> result = check_parallel_trends(data_parallel, outcome='outcome',
...                                time='period', treatment_group='treated',
...                                pre_periods=[0, 1, 2, 3])
>>> result['parallel_trends_plausible']
True

Generate data with trend violation:

>>> data_violation = generate_panel_data(parallel_trends=False, seed=42)
>>> result = check_parallel_trends(data_violation, outcome='outcome',
...                                time='period', treatment_group='treated',
...                                pre_periods=[0, 1, 2, 3])
>>> result['parallel_trends_plausible']
False
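What a parallel-trends diagnostic looks for can be seen from noise-free group mean paths. The sketch below is an assumption-laden illustration: common_trend=0.5 is a hypothetical value, and the differential slope is applied in every period, following the documented rule that treated units have trend = common_trend + trend_violation.

```python
def group_mean_path(treated, n_periods=8, treatment_period=4,
                    treatment_effect=5.0, parallel_trends=True,
                    trend_violation=1.0, common_trend=0.5):
    """Noise-free mean outcome per period for one group."""
    extra = trend_violation if (treated and not parallel_trends) else 0.0
    slope = common_trend + extra
    return [slope * t
            + (treatment_effect if treated and t >= treatment_period else 0.0)
            for t in range(n_periods)]


def pre_trend_gap_changes(parallel_trends, treatment_period=4):
    """Period-to-period change in the treated-minus-control gap before treatment."""
    treated = group_mean_path(True, parallel_trends=parallel_trends)
    control = group_mean_path(False, parallel_trends=parallel_trends)
    gaps = [treated[t] - control[t] for t in range(treatment_period)]
    return [later - earlier for earlier, later in zip(gaps, gaps[1:])]


print(pre_trend_gap_changes(parallel_trends=True))
print(pre_trend_gap_changes(parallel_trends=False))
```

Under parallel trends the pre-treatment gap is constant, so its changes are zero; under a violation the gap widens by trend_violation each period, which is the pattern check_parallel_trends is meant to flag.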

generate_continuous_did_data#

Generate synthetic continuous treatment DiD data with known dose-response.

diff_diff.generate_continuous_did_data(n_units=500, n_periods=4, cohort_periods=None, never_treated_frac=0.3, dose_distribution='lognormal', dose_params=None, att_function='linear', att_slope=2.0, att_intercept=1.0, unit_fe_sd=2.0, time_trend=0.5, noise_sd=1.0, seed=None)

Generate synthetic data for continuous DiD analysis with known dose-response.

Creates a balanced panel with continuous treatment doses and known ATT(d) function, satisfying strong parallel trends by construction.

Parameters:
  • n_units (int, default=500) – Number of units in the panel.

  • n_periods (int, default=4) – Number of time periods (1-indexed).

  • cohort_periods (list of int, optional) – Treatment cohort periods. Default: [2] (single cohort).

  • never_treated_frac (float, default=0.3) – Fraction of units that are never-treated.

  • dose_distribution (str, default="lognormal") – Distribution for dose: "lognormal", "uniform", "exponential".

  • dose_params (dict, optional) – Distribution-specific parameters. Defaults:

    • lognormal: {"mean": 0.5, "sigma": 0.5}

    • uniform: {"low": 0.5, "high": 5.0}

    • exponential: {"scale": 2.0}

  • att_function (str, default="linear") – Functional form of ATT(d): "linear", "quadratic", "log".

  • att_slope (float, default=2.0) – Slope parameter for ATT function.

  • att_intercept (float, default=1.0) – Intercept parameter for ATT function.

  • unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.

  • time_trend (float, default=0.5) – Linear time trend coefficient.

  • noise_sd (float, default=1.0) – Standard deviation of idiosyncratic noise.

  • seed (int, optional) – Random seed for reproducibility.

Returns:

Panel data with columns: unit, period, outcome, first_treat, dose, true_att.

Return type:

pd.DataFrame
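The exact formulas behind the three att_function options are not spelled out above. The sketch below shows one plausible reading, purely as an assumption: linear as intercept + slope * d, quadratic as intercept + slope * d^2, and log as intercept + slope * log(1 + d). The helper name att is hypothetical; check the generated true_att column against these forms before relying on them.

```python
import math


def att(dose, att_function="linear", att_slope=2.0, att_intercept=1.0):
    """Assumed dose-response forms; not confirmed against the library."""
    if att_function == "linear":
        return att_intercept + att_slope * dose
    if att_function == "quadratic":
        return att_intercept + att_slope * dose ** 2
    if att_function == "log":
        return att_intercept + att_slope * math.log1p(dose)
    raise ValueError(f"unknown att_function: {att_function!r}")


for form in ("linear", "quadratic", "log"):
    print(form, [round(att(d, form), 3) for d in (0.5, 1.0, 2.0)])
```

Whatever the exact forms are, a known ATT(d) lets an estimated dose-response curve be compared point by point with the truth.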

generate_reversible_did_data#

Generate synthetic reversible-treatment panel data — treatment can switch on and off over time. Use this with ChaisemartinDHaultfoeuille for testing the dCDH estimator on non-absorbing treatments.

diff_diff.generate_reversible_did_data(n_groups=50, n_periods=6, pattern='single_switch', p_switch=0.2, initial_treat_frac=0.3, cycle_length=2, treatment_effect=2.0, heterogeneous_effects=False, effect_sd=0.5, group_fe_sd=2.0, time_trend=0.1, noise_sd=0.5, seed=None)

Generate synthetic panel data with reversible (non-absorbing) treatment.

Treatment can switch on and off over time, supporting designs where the canonical staggered-adoption assumption (once treated, always treated) does not hold. This is the only generator in the library that produces reversible-treatment data; intended for the ChaisemartinDHaultfoeuille (dCDH) estimator.

Seven patterns are supported. Four of them are guaranteed to keep every group as a “single-switch” group (each group switches treatment status at most once), so the dCDH drop_larger_lower=True filter is a no-op. The other three deliberately produce multi-switch groups for stress-testing the drop logic.

Parameters:
  • n_groups (int, default=50) – Number of groups in the panel.

  • n_periods (int, default=6) – Number of time periods. Must be at least 2.

  • pattern (str, default="single_switch") –

    Treatment pattern. One of:

    • "single_switch" (default, single-switch): each group switches exactly once at a uniform-random time. Mix of joiners and leavers determined by initial_treat_frac.

    • "joiners_only" (single-switch): all groups start untreated and each switches to treated once. Pure staggered adoption.

    • "leavers_only" (single-switch): mirror of joiners_only — all groups start treated and each switches to untreated once.

    • "mixed_single_switch" (single-switch): deterministic 50/50 mix of joiners and leavers, each with exactly one switch. Useful for parity tests where you want a guaranteed split independent of seed.

    • "random" (often multi-switch): each (g, t >= 1) flips treatment from the previous period with probability p_switch. Initial state drawn from Bernoulli(initial_treat_frac). With n_periods >= 4 and p_switch > 0, many groups will switch more than once and will be dropped under drop_larger_lower=True. Useful for stress-testing the drop filter.

    • "cycles" (always multi-switch): deterministic on/off cycles of length cycle_length. Half the groups start in the “0” phase and half in the “1” phase, so the panel always contains both joiner and leaver transitions. Every group is multi-switch when n_periods > 2 * cycle_length.

    • "marketing" (always multi-switch): seasonal “2 on, 1 off” pattern starting in the on phase, identical across groups. Mimics a marketing campaign with periodic breaks.

  • p_switch (float, default=0.2) – Per-period flip probability. Only used when pattern="random".

  • initial_treat_frac (float, default=0.3) – Fraction of groups starting in the treated state at period 0. Only used by "single_switch" and "random".

  • cycle_length (int, default=2) – Length of each on/off phase. Only used when pattern="cycles".

  • treatment_effect (float, default=2.0) – Average treatment effect on treated cells. With heterogeneous_effects=False, every treated cell has exactly this effect. With True, this is the mean of a Normal distribution.

  • heterogeneous_effects (bool, default=False) – If True, per-cell true effects are drawn independently from Normal(treatment_effect, effect_sd).

  • effect_sd (float, default=0.5) – Standard deviation of per-cell effects when heterogeneous_effects=True.

  • group_fe_sd (float, default=2.0) – Standard deviation of group fixed effects.

  • time_trend (float, default=0.1) – Linear time trend coefficient.

  • noise_sd (float, default=0.5) – Standard deviation of idiosyncratic noise.

  • seed (int, optional) – Random seed for reproducibility.

Returns:

Synthetic balanced panel with one row per (group, period) cell and the following columns:

  • group (int): group identifier in [0, n_groups)

  • period (int): time period in [0, n_periods)

  • treatment (int): per-cell binary treatment (0 or 1)

  • outcome (float): outcome variable

  • true_effect (float): per-cell true treatment effect (0 if untreated)

  • d_lag (float): previous-period treatment, NaN at period 0

  • switcher_type (object): one of "initial" (period 0), "joiner" (d_lag=0, treatment=1), "leaver" (d_lag=1, treatment=0), "stable_0" (d_lag=0, treatment=0), or "stable_1" (d_lag=1, treatment=1)

Return type:

pd.DataFrame

Notes

The default pattern is "single_switch" so the generator’s happy path produces data that the dCDH estimator can use directly without dropping groups. The "random", "cycles", and "marketing" patterns are primarily for stress-testing the drop_larger_lower=True filter and will produce data where many or all groups are filtered out before estimation.

The default pattern="single_switch" is A5-safe by construction: every group has at most one transition, so no group can be a “crosser” that switches in and back out. The dCDH estimator’s drop_larger_lower=True filter (matching R DIDmultiplegtDYN) is a no-op on this pattern. Other patterns (random, cycles, marketing) ARE allowed to violate A5 and are useful primarily for stress-testing the multi-switch drop filter — passing them through the estimator with drop_larger_lower=True should drop a non-zero count of crosser groups, which is the intended check. The cohort-recentered variance formula in Web Appendix Section 3.7.3 of the dynamic companion paper is derived under A5, which is why the drop filter is on by default.

Examples

Default single-switch panel (mix of joiners and leavers, all groups survive drop_larger_lower=True):

>>> data = generate_reversible_did_data(n_groups=20, n_periods=6, seed=42)
>>> sorted(data.columns.tolist())
['d_lag', 'group', 'outcome', 'period', 'switcher_type', 'treatment', 'true_effect']
>>> set(data['switcher_type']).issubset(
...     {'initial', 'joiner', 'leaver', 'stable_0', 'stable_1'}
... )
True

Joiners-only (pure staggered adoption):

>>> data = generate_reversible_did_data(
...     n_groups=20, pattern="joiners_only", seed=1
... )
>>> set(data.query("period == 0")['treatment'].unique()) == {0}
True

Leavers-only:

>>> data = generate_reversible_did_data(
...     n_groups=20, pattern="leavers_only", seed=2
... )
>>> set(data.query("period == 0")['treatment'].unique()) == {1}
True
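The switcher_type labels follow mechanically from each cell's d_lag and treatment values, as listed in the Returns section. The sketch below applies that mapping to a single group's treatment path and counts its switches; the helper names switcher_types and n_switches are hypothetical, and the sample paths are hand-written rather than generated.

```python
def switcher_types(path):
    """Label each period of one group's 0/1 treatment path."""
    labels = {(0, 1): "joiner", (1, 0): "leaver",
              (0, 0): "stable_0", (1, 1): "stable_1"}
    # Period 0 has no lagged treatment, so it is always "initial".
    return ["initial"] + [labels[(d_lag, d)]
                          for d_lag, d in zip(path, path[1:])]


def n_switches(path):
    """Number of treatment status changes along the path."""
    return sum(d_lag != d for d_lag, d in zip(path, path[1:]))


joiner_path = [0, 0, 1, 1, 1, 1]     # switches once
on_off_path = [1, 1, 0, 1, 1, 0]     # "2 on, 1 off" style path
print(switcher_types(joiner_path), n_switches(joiner_path))
print(switcher_types(on_off_path), n_switches(on_off_path))
```

A group with more than one switch is the kind of multi-switch group that the drop_larger_lower=True filter is described as removing.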

Example#

from diff_diff import generate_reversible_did_data, ChaisemartinDHaultfoeuille

data = generate_reversible_did_data(
    n_groups=80,
    n_periods=6,
    pattern="single_switch",  # or "joiners_only", "leavers_only", "mixed_single_switch"
    treatment_effect=2.0,
    seed=42,
)

est = ChaisemartinDHaultfoeuille()
results = est.fit(
    data, outcome="outcome", group="group",
    time="period", treatment="treatment",
)

Indicator Creation#

make_treatment_indicator#

Create binary treatment indicator from categorical or numeric columns.

diff_diff.make_treatment_indicator(data, column, treated_values=None, threshold=None, above_threshold=True, new_column='treated')

Create a binary treatment indicator column from various input types.

Parameters:
  • data (pd.DataFrame) – Input DataFrame.

  • column (str) – Name of the column to use for creating the treatment indicator.

  • treated_values (Any or list, optional) – Value(s) that indicate treatment. Units with these values get treatment=1, others get treatment=0.

  • threshold (float, optional) – Numeric threshold for creating treatment. Used when the treatment is based on a continuous variable (e.g., treat firms above median size).

  • above_threshold (bool, default=True) – If True, values >= threshold are treated. If False, values <= threshold are treated. Only used when threshold is specified.

  • new_column (str, default="treated") – Name of the new treatment indicator column.

Returns:

DataFrame with the new treatment indicator column added.

Return type:

pd.DataFrame

Examples

Create treatment from categorical variable:

>>> import pandas as pd
>>> from diff_diff import make_treatment_indicator
>>> df = pd.DataFrame({'group': ['A', 'A', 'B', 'B'], 'y': [1, 2, 3, 4]})
>>> df = make_treatment_indicator(df, 'group', treated_values='A')
>>> df['treated'].tolist()
[1, 1, 0, 0]

Create treatment from numeric threshold:

>>> df = pd.DataFrame({'size': [10, 50, 100, 200], 'y': [1, 2, 3, 4]})
>>> df = make_treatment_indicator(df, 'size', threshold=75)
>>> df['treated'].tolist()
[0, 0, 1, 1]

Treat units below a threshold:

>>> df = make_treatment_indicator(df, 'size', threshold=75, above_threshold=False)
>>> df['treated'].tolist()
[1, 1, 0, 0]
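The two documented modes (matching treated_values, or comparing against threshold with >= or <=) can be sketched on plain lists. The helper treatment_indicator is hypothetical, and raising an error unless exactly one mode is given is this sketch's own choice, not confirmed library behaviour.

```python
def treatment_indicator(values, treated_values=None, threshold=None,
                        above_threshold=True):
    """Return a 0/1 list mirroring the documented indicator rules."""
    if (treated_values is None) == (threshold is None):
        # Sketch-only guard: require exactly one mode.
        raise ValueError("specify exactly one of treated_values or threshold")
    if treated_values is not None:
        if not isinstance(treated_values, (list, tuple, set)):
            treated_values = [treated_values]
        return [int(v in treated_values) for v in values]
    if above_threshold:
        return [int(v >= threshold) for v in values]
    return [int(v <= threshold) for v in values]


print(treatment_indicator(["A", "A", "B", "B"], treated_values="A"))
print(treatment_indicator([10, 50, 100, 200], threshold=75))
print(treatment_indicator([10, 50, 100, 200], threshold=75,
                          above_threshold=False))
```

Note that both threshold directions are inclusive as documented, so a value exactly equal to threshold counts as treated either way.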

Example#

from diff_diff import make_treatment_indicator

# From categorical
data = make_treatment_indicator(
    data,
    column='group',
    treated_values='treatment'
)

# From numeric threshold
data = make_treatment_indicator(
    data,
    column='exposure',
    threshold=0.5,
    new_column='high_exposure'
)

make_post_indicator#

Create post-treatment period indicator.

diff_diff.make_post_indicator(data, time_column, post_periods=None, treatment_start=None, new_column='post')

Create a binary post-treatment indicator column.

Parameters:
  • data (pd.DataFrame) – Input DataFrame.

  • time_column (str) – Name of the time/period column.

  • post_periods (Any or list, optional) – Specific period value(s) that are post-treatment. Periods matching these values get post=1, others get post=0.

  • treatment_start (Any, optional) – The first post-treatment period. All periods >= this value get post=1. Works with numeric periods, strings (sorted alphabetically), or dates.

  • new_column (str, default="post") – Name of the new post indicator column.

Returns:

DataFrame with the new post indicator column added.

Return type:

pd.DataFrame

Examples

Using specific post periods:

>>> df = pd.DataFrame({'year': [2018, 2019, 2020, 2021], 'y': [1, 2, 3, 4]})
>>> df = make_post_indicator(df, 'year', post_periods=[2020, 2021])
>>> df['post'].tolist()
[0, 0, 1, 1]

Using treatment start:

>>> df = make_post_indicator(df, 'year', treatment_start=2020)
>>> df['post'].tolist()
[0, 0, 1, 1]

Works with date columns:

>>> df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-06-01', '2021-01-01'])})
>>> df = make_post_indicator(df, 'date', treatment_start='2020-06-01')

Example#

from diff_diff import make_post_indicator

data = make_post_indicator(
    data,
    time_column='period',
    treatment_start=5
)
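
Because treatment_start is applied as a >= comparison, the ordering of the period values matters. The pandas-only sketch below illustrates that comparison for dates and for strings; it assumes a string start date is parsed to a timestamp before comparing, and it is not the library's code.

```python
import pandas as pd

# Date periods: compare against a parsed timestamp.
dates = pd.DataFrame({"date": pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01"])})
dates["post"] = (dates["date"] >= pd.Timestamp("2020-06-01")).astype(int)
print(dates["post"].tolist())  # [0, 1, 1]

# String periods compare alphabetically, so fixed-width labels sort as intended.
quarters = pd.DataFrame({"period": ["2019Q4", "2020Q1", "2020Q2"]})
quarters["post"] = (quarters["period"] >= "2020Q1").astype(int)
print(quarters["post"].tolist())  # [0, 1, 1]

# Unpadded labels do not: "P10" sorts before "P9".
labels = pd.DataFrame({"period": ["P8", "P9", "P10"]})
labels["post"] = (labels["period"] >= "P9").astype(int)
print(labels["post"].tolist())  # [0, 1, 0]
```

For string periods that are not fixed-width, passing the post-treatment values explicitly through post_periods avoids relying on alphabetical order.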

Panel Data Utilities#

wide_to_long#

Reshape wide panel data to long format.

diff_diff.wide_to_long(data, value_columns, id_column, time_name='period', value_name='value', time_values=None)[source]#

Convert wide-format panel data to long format for DiD analysis.

Wide format has one row per unit, with a separate column for each time period. Long format has one row per unit-period combination.

Parameters:
  • data (pd.DataFrame) – Wide-format DataFrame with one row per unit.

  • value_columns (list of str) – Column names containing the outcome values for each period. These should be in chronological order.

  • id_column (str) – Column name for the unit identifier.

  • time_name (str, default="period") – Name for the new time period column.

  • value_name (str, default="value") – Name for the new value/outcome column.

  • time_values (list, optional) – Values to use for time periods. If None, uses 0, 1, 2, … Must have same length as value_columns.

Returns:

Long-format DataFrame with one row per unit-period.

Return type:

pd.DataFrame

Examples

>>> wide_df = pd.DataFrame({
...     'firm_id': [1, 2, 3],
...     'sales_2019': [100, 150, 200],
...     'sales_2020': [110, 160, 210],
...     'sales_2021': [120, 170, 220]
... })
>>> long_df = wide_to_long(
...     wide_df,
...     value_columns=['sales_2019', 'sales_2020', 'sales_2021'],
...     id_column='firm_id',
...     time_name='year',
...     value_name='sales',
...     time_values=[2019, 2020, 2021]
... )
>>> len(long_df)
9
>>> long_df.columns.tolist()
['firm_id', 'year', 'sales']

Example#

from diff_diff import wide_to_long

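# Wide format: one row per unit, one column per time period
# unit_id, y_2019, y_2020, y_2021, y_2022
long_data = wide_to_long(
    wide_data,
    value_columns=['y_2019', 'y_2020', 'y_2021', 'y_2022'],
    id_column='unit_id',
    time_name='year',
    value_name='outcome',
    time_values=[2019, 2020, 2021, 2022]
)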
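
For readers who want to see the reshape itself, the same wide-to-long transformation can be sketched with pandas melt. This is an illustrative equivalent under the assumption that each entry of time_values labels the value column in the same position; it is not how the library is implemented.

```python
import pandas as pd

wide = pd.DataFrame({
    "firm_id": [1, 2, 3],
    "sales_2019": [100, 150, 200],
    "sales_2020": [110, 160, 210],
    "sales_2021": [120, 170, 220],
})
value_columns = ["sales_2019", "sales_2020", "sales_2021"]
time_values = [2019, 2020, 2021]

# Stack the period columns into (firm_id, column name, value) rows.
long = wide.melt(id_vars="firm_id", value_vars=value_columns,
                 var_name="year", value_name="sales")

# Map each wide column name to its period label, in the order given.
long["year"] = long["year"].map(dict(zip(value_columns, time_values)))
long = long.sort_values(["firm_id", "year"]).reset_index(drop=True)

print(len(long))              # 9
print(long.columns.tolist())  # ['firm_id', 'year', 'sales']
```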

balance_panel#

Balance panel data by filling or dropping incomplete observations.

diff_diff.balance_panel(data, unit_column, time_column, method='inner', fill_value=None)[source]#

Balance a panel dataset to ensure all units have all time periods.

Parameters:
  • data (pd.DataFrame) – Unbalanced panel data.

  • unit_column (str) – Column name for unit identifier.

  • time_column (str) – Column name for time period.

  • method (str, default="inner") –

    Balancing method:

    • "inner": Keep only units that appear in all periods (drops units).

    • "outer": Include all unit-period combinations (creates NaN).

    • "fill": Include all combinations and fill missing values.

  • fill_value (float, optional) – Value used to fill missing observations when method="fill". If None with method="fill", uses column-specific forward fill.

Returns:

Balanced panel DataFrame.

Return type:

pd.DataFrame

Examples

Keep only complete units:

>>> df = pd.DataFrame({
...     'unit': [1, 1, 1, 2, 2, 3, 3, 3],
...     'period': [1, 2, 3, 1, 2, 1, 2, 3],
...     'y': [10, 11, 12, 20, 21, 30, 31, 32]
... })
>>> balanced = balance_panel(df, 'unit', 'period', method='inner')
>>> balanced['unit'].unique().tolist()
[1, 3]

Include all combinations:

>>> balanced = balance_panel(df, 'unit', 'period', method='outer')
>>> len(balanced)
9

Example#

from diff_diff import balance_panel

# Add missing unit-period rows and fill them (forward fill when fill_value is None)
balanced = balance_panel(
    data,
    unit_column='unit_id',
    time_column='period',
    method='fill'
)

# Or keep only units with all periods (default)
balanced = balance_panel(
    data,
    unit_column='unit_id',
    time_column='period',
    method='inner'
)
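
The three balancing methods can be pictured with plain pandas operations. The sketch below is only an approximation of the documented behavior; in particular, it assumes that "fill" without a fill_value forward-fills within each unit, which is one reading of "column-specific forward fill".

```python
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 3, 3, 3],
    "period": [1, 2, 3, 1, 2, 1, 2, 3],
    "y": [10, 11, 12, 20, 21, 30, 31, 32],
})
n_periods = df["period"].nunique()

# "inner": keep only units observed in every period.
inner = df[df.groupby("unit")["period"].transform("nunique") == n_periods]

# "outer": reindex to the full unit x period grid; missing cells become NaN.
grid = pd.MultiIndex.from_product(
    [sorted(df["unit"].unique()), sorted(df["period"].unique())],
    names=["unit", "period"],
)
outer = df.set_index(["unit", "period"]).reindex(grid).reset_index()

# "fill" without fill_value: forward-fill within each unit (assumption).
filled = outer.copy()
filled["y"] = filled.groupby("unit")["y"].ffill()

print(inner["unit"].unique().tolist())            # [1, 3]
print(len(outer), int(outer["y"].isna().sum()))   # 9 1
print(filled.loc[(filled["unit"] == 2) & (filled["period"] == 3), "y"].item())  # 21.0
```

Note that "inner" changes the sample (unit 2 is dropped), whereas "fill" keeps every unit but introduces imputed outcomes, so the choice affects what the downstream estimate represents.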

Staggered Adoption Utilities#

create_event_time#

Create event-time column for staggered adoption designs.

diff_diff.create_event_time(data, time_column, treatment_time_column, new_column='event_time')[source]#

Create an event-time column relative to treatment timing.

Useful for event study designs where treatment occurs at different times for different units.

Parameters:
  • data (pd.DataFrame) – Panel data.

  • time_column (str) – Name of the calendar time column.

  • treatment_time_column (str) – Name of the column indicating when each unit was treated. Units with NaN or infinity are considered never-treated.

  • new_column (str, default="event_time") – Name of the new event-time column.

Returns:

DataFrame with event-time column added. Values are:

  • Negative for pre-treatment periods

  • 0 for the treatment period

  • Positive for post-treatment periods

  • NaN for never-treated units

Return type:

pd.DataFrame

Examples

>>> df = pd.DataFrame({
...     'unit': [1, 1, 1, 2, 2, 2],
...     'year': [2018, 2019, 2020, 2018, 2019, 2020],
...     'treatment_year': [2019, 2019, 2019, 2020, 2020, 2020]
... })
>>> df = create_event_time(df, 'year', 'treatment_year')
>>> df['event_time'].tolist()
[-1, 0, 1, -2, -1, 0]

Example#

from diff_diff import create_event_time

data = create_event_time(
    data,
    time_column='period',
    treatment_time_column='first_treat'
)

# event_time = period - first_treat
# Negative values: pre-treatment
# Zero: treatment period
# Positive values: post-treatment
# NaN for never-treated
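
The never-treated case can be illustrated with a few lines of pandas. This is a sketch of the arithmetic described above, assuming both NaN and infinity in the treatment-time column mark never-treated units; it is not the library's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 2, 2, 3, 3],
    "year": [2019, 2020, 2019, 2020, 2019, 2020],
    "treatment_year": [2020, 2020, np.inf, np.inf, np.nan, np.nan],
})

# Treat NaN and +/-inf alike as "never treated".
first_treat = df["treatment_year"].replace([np.inf, -np.inf], np.nan)

# Event time is calendar time minus treatment time; NaN propagates.
df["event_time"] = df["year"] - first_treat

print(df["event_time"].tolist())  # [-1.0, 0.0, nan, nan, nan, nan]
```

In this sketch the column is float because NaN cannot be stored in a plain integer column.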

aggregate_to_cohorts#

Aggregate unit-level data to cohort means.

diff_diff.aggregate_to_cohorts(data, unit_column, time_column, treatment_column, outcome, covariates=None)[source]#

Aggregate unit-level data to treatment cohort means.

Useful for visualization and cohort-level analysis.

Parameters:
  • data (pd.DataFrame) – Unit-level panel data.

  • unit_column (str) – Name of unit identifier column.

  • time_column (str) – Name of time period column.

  • treatment_column (str) – Name of treatment indicator column.

  • outcome (str) – Name of outcome variable column.

  • covariates (list of str, optional) – Additional columns to aggregate (will compute means).

Returns:

Cohort-level data with mean outcomes by treatment status and period.

Return type:

pd.DataFrame

Examples

>>> df = pd.DataFrame({
...     'unit': [1, 1, 2, 2, 3, 3, 4, 4],
...     'period': [0, 1, 0, 1, 0, 1, 0, 1],
...     'treated': [1, 1, 1, 1, 0, 0, 0, 0],
...     'y': [10, 15, 12, 17, 8, 10, 9, 11]
... })
>>> cohort_df = aggregate_to_cohorts(df, 'unit', 'period', 'treated', 'y')
>>> len(cohort_df)
4

Example#

from diff_diff import aggregate_to_cohorts

cohort_data = aggregate_to_cohorts(
    data,
    unit_column='unit_id',
    time_column='period',
    treatment_column='first_treat',
    outcome='outcome'
)
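
Conceptually, cohort aggregation is a group-by over treatment status and period. The pandas sketch below shows that idea on the docstring data; the output column names mean_y and n_units are hypothetical and may not match what aggregate_to_cohorts returns.

```python
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 2, 2, 3, 3, 4, 4],
    "period": [0, 1, 0, 1, 0, 1, 0, 1],
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "y": [10, 15, 12, 17, 8, 10, 9, 11],
})

# One row per (treatment status, period) cell: mean outcome and unit count.
cohort_means = (
    df.groupby(["treated", "period"], as_index=False)
      .agg(mean_y=("y", "mean"), n_units=("unit", "nunique"))
)
print(cohort_means)
```

With two treatment groups and two periods this yields four rows, matching the length shown in the docstring example.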

Survey Aggregation#

aggregate_survey#

Aggregate survey microdata to geographic-period cells with design-based precision.

diff_diff.aggregate_survey(data, by, outcomes, survey_design, covariates=None, min_n=2, lonely_psu=None, second_stage_weights='pweight')[source]#

Aggregate survey microdata to geographic-period cells with design-based precision.

Computes design-weighted cell means and their Taylor-linearized (or replicate-based) standard errors for each cell defined by the by columns. Returns a panel-ready DataFrame and a pre-configured SurveyDesign for second-stage DiD estimation.

Each cell is treated as a subpopulation/domain of the full survey design: influence function values are zero-padded outside the cell, preserving full strata/PSU structure for variance estimation per Lumley (2004) Section 3.4.

Parameters:
  • data (pd.DataFrame) – Individual-level microdata.

  • by (str or list of str) – Columns defining cells (e.g., ["state", "year"]). The first element is used as the clustering variable in the returned SurveyDesign (geographic unit for second-stage inference).

  • outcomes (str or list of str) – Outcome variable(s) to aggregate with full precision tracking. Each outcome produces {name}_mean, {name}_se, {name}_n, and {name}_precision columns. When multiple outcomes are given, panel filtering (non-estimable cell removal, zero-weight PSU pruning) is based on the first outcome only, consistent with the returned SurveyDesign. For independent per-outcome support, call once per outcome.

  • survey_design (SurveyDesign) – Survey design specification for the microdata.

  • covariates (str or list of str, optional) – Additional variables to aggregate as design-weighted means only (no SE/precision columns).

  • min_n (int, default=2) – Minimum respondents per cell. Cells below this threshold use simple random sampling variance as a fallback.

  • lonely_psu (str, optional) – Override the survey design’s lonely_psu setting for within-cell computation. One of "remove", "certainty", "adjust".

  • second_stage_weights (str, default="pweight") –

    Weight type for the returned second-stage SurveyDesign:

    • "pweight" (default): Population weights - the mean of per-cell survey weight sums within each geographic unit (first by column), constant across periods. Targets population-weighted second-stage estimation. Compatible with all survey-capable estimators including those that require unit-constant survey columns.

    • "aweight": Precision weights - inverse variance (1 / V(y_bar)). Targets precision-weighted second-stage estimation via WLS. Compatible with estimators that accept aweight (DifferenceInDifferences, TwoWayFixedEffects, MultiPeriodDiD, SunAbraham, ContinuousDiD, EfficientDiD); rejected by pweight-only estimators.

Returns:

  • panel_df (pd.DataFrame) – Aggregated panel with columns: grouping variables, {outcome}_mean, {outcome}_se, {outcome}_n, {outcome}_precision, {outcome}_weight, {covariate}_mean, cell_n, cell_n_eff, cell_sum_w, srs_fallback. The _weight column contains unit-constant population weights (mean of cell_sum_w within each geographic unit) in pweight mode, or cleaned precision (NaN/Inf mapped to 0.0) in aweight mode. cell_sum_w is always present as a diagnostic column containing the sum of normalized survey weights per cell (proportional to estimated population).

  • second_stage_design (SurveyDesign) – Pre-configured for second-stage estimation with the chosen weight_type, weights from the first outcome, and geographic clustering via psu.

Return type:

Tuple[DataFrame, SurveyDesign]

Examples

>>> design = SurveyDesign(weights="finalwt", strata="strat", psu="psu")
>>> panel, stage2 = aggregate_survey(
...     microdata, by=["state", "year"],
...     outcomes="smoking_rate", survey_design=design,
... )
>>> # stage2 has weight_type="pweight" — compatible with all estimators.
>>> # Add treatment/time indicators at the panel level, then fit:
>>> # panel["first_treat"] = panel["state"].map(policy_year).fillna(0)
>>> # result = CallawaySantAnna().fit(
>>> #     panel, outcome="smoking_rate_mean",
>>> #     unit="state", time="year", first_treat="first_treat",
>>> #     survey_design=stage2,
>>> # )

Example#

from diff_diff import aggregate_survey, SurveyDesign, DifferenceInDifferences

# Define the survey design for the microdata
design = SurveyDesign(weights="finalwt", strata="strat", psu="psu")

# Aggregate to state-year panel with design-based SEs
panel, stage2 = aggregate_survey(
    microdata,
    by=["state", "year"],
    outcomes="smoking_rate",
    covariates=["age", "income"],
    survey_design=design,
)

# panel has: state, year, smoking_rate_mean, smoking_rate_se,
#   smoking_rate_n, smoking_rate_precision, smoking_rate_weight,
#   age_mean, income_mean, cell_n, cell_n_eff, cell_sum_w, srs_fallback
#
# *_weight is fit-ready: unit-constant population weight (pweight, default)
#   or cleaned precision with NaN/Inf -> 0.0 (aweight opt-in).
# cell_sum_w is a per-cell diagnostic (sum of survey weights per cell).
# Non-estimable cells and zero-weight geos are dropped automatically.

# stage2 is pre-configured: pweights + state-level clustering
# Add treatment/time indicators at the panel level, then fit:
# panel["treated"] = ...  # from policy adoption data
# panel["post"] = (panel["year"] >= treatment_year).astype(int)
# result = DifferenceInDifferences().fit(
#     panel, outcome="smoking_rate_mean",
#     treatment="treated", time="post", survey_design=stage2,
# )
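
To make the pweight description concrete, the sketch below computes weighted cell means and a unit-constant second-stage weight with plain pandas. It is a simplified illustration under stated assumptions: it uses raw rather than normalized weights, ignores strata and PSUs, and computes no standard errors, so it does not reproduce aggregate_survey's design-based output. The data and the column name smoker are made up for the example.

```python
import pandas as pd

micro = pd.DataFrame({
    "state": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year": [2019, 2019, 2020, 2020, 2019, 2019, 2020, 2020],
    "finalwt": [1.0, 3.0, 2.0, 2.0, 1.0, 1.0, 4.0, 4.0],
    "smoker": [1, 0, 1, 0, 0, 1, 1, 0],
})

# Design-weighted cell mean: sum(w * y) / sum(w) within each state-year cell.
micro["wy"] = micro["finalwt"] * micro["smoker"]
cells = micro.groupby(["state", "year"], as_index=False).agg(
    sum_wy=("wy", "sum"),
    cell_sum_w=("finalwt", "sum"),
    cell_n=("smoker", "size"),
)
cells["smoker_mean"] = cells["sum_wy"] / cells["cell_sum_w"]

# pweight mode as described: the mean of per-cell weight sums within each
# geographic unit, so the weight is constant across periods for that unit.
cells["smoker_weight"] = cells.groupby("state")["cell_sum_w"].transform("mean")

print(cells[["state", "year", "smoker_mean", "cell_sum_w", "smoker_weight"]])
```

State B's cell weight sums differ by year (2 and 8), yet its second-stage weight is 5 in both rows, which is what makes the column usable by estimators that require unit-constant survey columns.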

Data Validation#

validate_did_data#

Validate data structure for DiD analysis.

diff_diff.validate_did_data(data, outcome, treatment, time, unit=None, raise_on_error=True)[source]#

Validate that data is properly formatted for DiD analysis.

Checks for common data issues and provides informative error messages.

Parameters:
  • data (pd.DataFrame) – Data to validate.

  • outcome (str) – Name of outcome variable column.

  • treatment (str) – Name of treatment indicator column.

  • time (str) – Name of time/post indicator column.

  • unit (str, optional) – Name of unit identifier column (for panel data validation).

  • raise_on_error (bool, default=True) – If True, raises ValueError on validation failures. If False, returns validation results without raising.

Returns:

Validation results with keys:

  • valid: bool indicating if data passed all checks

  • errors: list of error messages

  • warnings: list of warning messages

  • summary: dict with data summary statistics

Return type:

dict

Examples

>>> df = pd.DataFrame({
...     'y': [1, 2, 3, 4],
...     'treated': [0, 0, 1, 1],
...     'post': [0, 1, 0, 1]
... })
>>> result = validate_did_data(df, 'y', 'treated', 'post', raise_on_error=False)
>>> result['valid']
True

Example#

from diff_diff import validate_did_data

result = validate_did_data(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    unit='unit_id',
    raise_on_error=False  # return the results dict instead of raising ValueError
)

if not result['valid']:
    for error in result['errors']:
        print(f"Error: {error}")
    for warning in result['warnings']:
        print(f"Warning: {warning}")
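
The shape of the returned dictionary can be illustrated with a small stand-in validator. The checks below are hypothetical examples of "common data issues" chosen for illustration; the checks that validate_did_data actually performs are not listed in this reference and may differ.

```python
import pandas as pd


def sketch_validate(data, outcome, treatment, time):
    """Hypothetical checks only; the library's actual checks may differ."""
    errors, warnings = [], []
    missing = [c for c in (outcome, treatment, time) if c not in data.columns]
    if missing:
        errors.append(f"Missing columns: {missing}")
    else:
        # Indicators are expected to contain only 0 and 1.
        for col in (treatment, time):
            if not set(data[col].dropna().unique()) <= {0, 1}:
                errors.append(f"Column '{col}' is not a 0/1 indicator")
        if data[[outcome, treatment, time]].isna().any().any():
            warnings.append("Missing values present")
    return {
        "valid": not errors,
        "errors": errors,
        "warnings": warnings,
        "summary": {"n_obs": len(data)},
    }


good = pd.DataFrame({"y": [1, 2, 3, 4], "treated": [0, 0, 1, 1], "post": [0, 1, 0, 1]})
bad = good.assign(post=[2018, 2019, 2018, 2019])
print(sketch_validate(good, "y", "treated", "post")["valid"])  # True
print(sketch_validate(bad, "y", "treated", "post")["errors"])
```

Returning errors and warnings separately lets a pipeline fail on the former while only logging the latter.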

summarize_did_data#

Generate summary statistics for DiD data.

diff_diff.summarize_did_data(data, outcome, treatment, time, unit=None)[source]#

Generate summary statistics by treatment group and time period.

Parameters:
  • data (pd.DataFrame) – Input data.

  • outcome (str) – Name of outcome variable column.

  • treatment (str) – Name of treatment indicator column.

  • time (str) – Name of time/period column.

  • unit (str, optional) – Name of unit identifier column.

Returns:

Summary statistics with columns for each treatment-time combination.

Return type:

pd.DataFrame

Examples

>>> df = pd.DataFrame({
...     'y': [10, 11, 12, 13, 20, 21, 22, 23],
...     'treated': [0, 0, 1, 1, 0, 0, 1, 1],
...     'post': [0, 1, 0, 1, 0, 1, 0, 1]
... })
>>> summary = summarize_did_data(df, 'y', 'treated', 'post')
>>> print(summary)

Example#

from diff_diff import summarize_did_data

summary = summarize_did_data(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    unit='unit_id'
)

print(summary)
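
A summary by treatment group and period amounts to a grouped aggregation, and the four cell means are enough to compute the raw 2x2 difference-in-differences by hand. The sketch below uses pandas directly on the docstring data; the layout of the table returned by summarize_did_data may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "y": [10, 11, 12, 13, 20, 21, 22, 23],
    "treated": [0, 0, 1, 1, 0, 0, 1, 1],
    "post": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Mean outcome and cell size for each treatment-by-period cell.
cells = df.groupby(["treated", "post"])["y"].agg(["mean", "count"])
print(cells)

# Raw 2x2 difference-in-differences from the four cell means.
m = cells["mean"]
did = (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])
print(did)  # 0.0
```

Here both groups rise by one unit between periods, so the raw estimate is zero.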

Control Unit Selection#

rank_control_units#

Rank control units by suitability for DiD or synthetic control.

diff_diff.rank_control_units(data, unit_column, time_column, outcome_column, treatment_column=None, treated_units=None, pre_periods=None, covariates=None, outcome_weight=0.7, covariate_weight=0.3, exclude_units=None, require_units=None, n_top=None, suggest_treatment_candidates=False, n_treatment_candidates=5, lambda_reg=0.0)[source]#

Rank potential control units by their suitability for DiD analysis.

Evaluates control units based on pre-treatment outcome trend similarity and optional covariate matching to treated units. Returns a ranked list with quality scores.

Parameters:
  • data (pd.DataFrame) – Panel data in long format.

  • unit_column (str) – Column name for unit identifier.

  • time_column (str) – Column name for time periods.

  • outcome_column (str) – Column name for outcome variable.

  • treatment_column (str, optional) – Column with binary treatment indicator (0/1). Used to identify treated units from data.

  • treated_units (list, optional) – Explicit list of treated unit IDs. Alternative to treatment_column.

  • pre_periods (list, optional) – Pre-treatment periods for comparison. If None, uses first half of periods.

  • covariates (list of str, optional) – Covariate columns for matching. Similarity is based on pre-treatment means.

  • outcome_weight (float, default=0.7) – Weight for pre-treatment outcome trend similarity (0-1).

  • covariate_weight (float, default=0.3) – Weight for covariate distance (0-1). Ignored if no covariates.

  • exclude_units (list, optional) – Units that cannot be in control group.

  • require_units (list, optional) – Units that must be in control group (will always appear in output).

  • n_top (int, optional) – Return only top N control units. If None, return all.

  • suggest_treatment_candidates (bool, default=False) – If True and no treated units specified, identify potential treatment candidates instead of ranking controls.

  • n_treatment_candidates (int, default=5) – Number of treatment candidates to suggest.

  • lambda_reg (float, default=0.0) – Regularization for synthetic weights. Higher values give more uniform weights across controls.

Returns:

Ranked control units with columns:

  • unit: Unit identifier

  • quality_score: Combined quality score (0-1, higher is better)

  • outcome_trend_score: Pre-treatment outcome trend similarity

  • covariate_score: Covariate match score (NaN if no covariates)

  • synthetic_weight: Informational heuristic weight from a single-pass uncentered Frank-Wolfe solve; does NOT factor into quality_score (ranking) and is NOT the canonical SDID unit weight. For canonical SDID weights use SyntheticDiD.fit().

  • pre_trend_rmse: RMSE of pre-treatment outcome vs treated mean

  • is_required: Whether unit was in require_units

If suggest_treatment_candidates=True (and no treated units):

  • unit: Unit identifier

  • treatment_candidate_score: Suitability as treatment unit

  • avg_outcome_level: Pre-treatment outcome mean

  • outcome_trend: Pre-treatment trend slope

  • n_similar_controls: Count of similar potential controls

Return type:

pd.DataFrame

Examples

Rank controls against treated units:

>>> data = generate_did_data(n_units=30, n_periods=6, seed=42)
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     n_top=10
... )
>>> ranking['quality_score'].is_monotonic_decreasing
True

With covariates:

>>> data['size'] = np.random.randn(len(data))
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     covariates=['size']
... )

Filter data for SyntheticDiD:

>>> top_controls = ranking['unit'].tolist()
>>> filtered = data[(data['treated'] == 1) | (data['unit'].isin(top_controls))]

Example#

from diff_diff import rank_control_units, generate_did_data

# treatment_period=5 keeps periods 0-4 strictly pre-treatment
panel = generate_did_data(n_units=100, n_periods=10, treatment_effect=2.0, treatment_period=5)
ranked = rank_control_units(
    panel,
    unit_column='unit',
    time_column='period',
    outcome_column='outcome',
    treatment_column='treated',
    pre_periods=[0, 1, 2, 3, 4]
)

# Select top 10 control units
best_controls = ranked.head(10)['unit'].tolist()
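
One of the returned diagnostics, pre_trend_rmse, is described as the RMSE of a control's pre-treatment outcomes against the treated mean. The sketch below computes that quantity on a toy panel with pandas and NumPy. It illustrates this single ingredient only, under the assumption that the treated mean is taken period by period; it does not reproduce quality_score or the synthetic weights.

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    "unit": ["T", "T", "T", "C1", "C1", "C1", "C2", "C2", "C2"],
    "period": [0, 1, 2] * 3,
    "treated": [1, 1, 1, 0, 0, 0, 0, 0, 0],
    "outcome": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0],
})
pre = panel[panel["period"].isin([0, 1, 2])]

# Average treated outcome in each pre-treatment period.
treated_path = pre[pre["treated"] == 1].groupby("period")["outcome"].mean()

# One row per control unit, one column per pre-treatment period.
controls = pre[pre["treated"] == 0].pivot(index="unit", columns="period", values="outcome")

# RMSE of each control's pre-period outcomes against the treated mean path.
rmse = np.sqrt(((controls - treated_path) ** 2).mean(axis=1)).sort_values()
print(rmse)
```

C1 tracks the treated path exactly and gets an RMSE of zero, while the flat C2 series ranks below it.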