Data Preparation#
Utilities for preparing and validating data for DiD analysis.
Data Generation#
generate_did_data#
Generate synthetic data with known treatment effects for testing.
- diff_diff.generate_did_data(n_units=100, n_periods=4, treatment_effect=5.0, treatment_fraction=0.5, treatment_period=2, unit_fe_sd=2.0, time_trend=0.5, noise_sd=1.0, seed=None)
Generate synthetic data for DiD analysis with known treatment effect.
Creates a balanced panel dataset with realistic features including unit fixed effects, time trends, and a known treatment effect.
- Parameters:
n_units (int, default=100) – Number of units in the panel.
n_periods (int, default=4) – Number of time periods.
treatment_effect (float, default=5.0) – True average treatment effect on the treated.
treatment_fraction (float, default=0.5) – Fraction of units that receive treatment.
treatment_period (int, default=2) – First post-treatment period (0-indexed). Periods >= this are post.
unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.
time_trend (float, default=0.5) – Linear time trend coefficient.
noise_sd (float, default=1.0) – Standard deviation of idiosyncratic noise.
seed (int, optional) – Random seed for reproducibility.
- Returns:
Synthetic panel data with columns:
- unit: Unit identifier
- period: Time period
- treated: Treatment indicator (0/1)
- post: Post-treatment indicator (0/1)
- outcome: Outcome variable
- true_effect: The true treatment effect (for validation)
- Return type:
pd.DataFrame
Examples
Generate simple data for testing:
>>> data = generate_did_data(n_units=50, n_periods=4, treatment_effect=3.0, seed=42)
>>> len(data)
200
>>> data.columns.tolist()
['unit', 'period', 'treated', 'post', 'outcome', 'true_effect']
Verify treatment effect recovery:
>>> from diff_diff import DifferenceInDifferences
>>> did = DifferenceInDifferences()
>>> results = did.fit(data, outcome='outcome', treatment='treated', time='post')
>>> abs(results.att - 3.0) < 1.0  # Close to true effect
True
Example#
from diff_diff import generate_did_data
# Generate a two-group panel: 10 periods, treatment starting at period 5
data = generate_did_data(
    n_units=100,
    n_periods=10,
    treatment_effect=5.0,
    treatment_period=5,
    treatment_fraction=0.5,
    noise_sd=1.0
)
print(data.head())
# Columns: unit, period, treated, post, outcome, true_effect
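To make the "known treatment effect" property concrete, the sketch below builds a small panel by hand with NumPy and pandas and recovers the effect from the four treated/post cell means. It does not call generate_did_data: the helper names simulate_two_group_panel and did_from_cell_means are hypothetical, and the DGP only mirrors the documented parameters, so the library's internals may differ.

```python
import numpy as np
import pandas as pd


def simulate_two_group_panel(n_units=200, n_periods=4, treatment_effect=3.0,
                             treatment_period=2, unit_fe_sd=2.0,
                             time_trend=0.5, noise_sd=1.0, seed=0):
    """Hypothetical stand-in for a DiD data generator (not the library code)."""
    rng = np.random.default_rng(seed)
    unit = np.repeat(np.arange(n_units), n_periods)
    period = np.tile(np.arange(n_periods), n_units)
    treated = (unit < n_units // 2).astype(int)      # first half of units treated
    post = (period >= treatment_period).astype(int)  # periods >= start are post
    unit_fe = rng.normal(0.0, unit_fe_sd, n_units)[unit]
    true_effect = treatment_effect * treated * post
    outcome = (unit_fe + time_trend * period + true_effect
               + rng.normal(0.0, noise_sd, unit.size))
    return pd.DataFrame({"unit": unit, "period": period, "treated": treated,
                         "post": post, "outcome": outcome,
                         "true_effect": true_effect})


def did_from_cell_means(df):
    """2x2 DiD: (treated post - treated pre) - (control post - control pre)."""
    m = df.groupby(["treated", "post"])["outcome"].mean()
    return (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])


panel = simulate_two_group_panel()
estimate = did_from_cell_means(panel)
print(f"DiD estimate: {estimate:.2f} (true effect: 3.0)")
```

Because the panel is balanced, unit fixed effects and the common time trend cancel in the double difference, so the estimate differs from the true effect only through noise.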
generate_staggered_data#
Generate synthetic staggered adoption data for testing.
- diff_diff.generate_staggered_data(n_units=100, n_periods=10, cohort_periods=None, never_treated_frac=0.3, treatment_effect=2.0, dynamic_effects=True, effect_growth=0.1, unit_fe_sd=2.0, time_trend=0.1, noise_sd=0.5, seed=None, panel=True)
Generate synthetic data for staggered adoption DiD analysis.
Creates panel data where different units receive treatment at different times (staggered rollout). Useful for testing CallawaySantAnna, SunAbraham, and other staggered DiD estimators.
- Parameters:
n_units (int, default=100) – Total number of units in the panel.
n_periods (int, default=10) – Number of time periods.
cohort_periods (list of int, optional) – Periods when treatment cohorts are first treated. If None, defaults to [3, 5, 7] for a 10-period panel.
never_treated_frac (float, default=0.3) – Fraction of units that are never treated (cohort 0).
treatment_effect (float, default=2.0) – Base treatment effect at time of treatment.
dynamic_effects (bool, default=True) – If True, treatment effects grow over time since treatment.
effect_growth (float, default=0.1) – Per-period growth in treatment effect (if dynamic_effects=True). Effect at time t since treatment: effect * (1 + effect_growth * t).
unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.
time_trend (float, default=0.1) – Linear time trend coefficient.
noise_sd (float, default=0.5) – Standard deviation of idiosyncratic noise.
seed (int, optional) – Random seed for reproducibility.
panel (bool, default=True) – If True (default), generate balanced panel data (same units across all periods). If False, generate repeated cross-section data where each period draws independent observations with globally unique IDs.
- Returns:
Synthetic staggered adoption data with columns:
- unit: Unit identifier
- period: Time period
- outcome: Outcome variable
- first_treat: First treatment period (0 = never treated)
- treated: Binary indicator (1 if treated at this observation)
- treat: Binary unit-level ever-treated indicator
- true_effect: The true treatment effect for this observation
- Return type:
pd.DataFrame
Examples
Generate staggered adoption data:
>>> data = generate_staggered_data(n_units=100, n_periods=10, seed=42)
>>> data['first_treat'].value_counts().sort_index()
0    30
3    24
5    23
7    23
Name: first_treat, dtype: int64
Use with Callaway-Sant’Anna estimator:
>>> from diff_diff import CallawaySantAnna
>>> cs = CallawaySantAnna()
>>> results = cs.fit(data, outcome='outcome', unit='unit',
...                  time='period', first_treat='first_treat')
>>> results.overall_att > 0
True
Example#
from diff_diff import generate_staggered_data
data = generate_staggered_data(
    n_units=200,
    n_periods=10,
    cohort_periods=[4, 6, 8],
    seed=42
)
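The dynamic-effect rule documented for effect_growth, effect * (1 + effect_growth * t), can be sketched in a few lines. The helper names below are hypothetical, and the sketch assumes that t counts periods since first treatment with t = 0 in the first treated period, which is how "base treatment effect at time of treatment" reads.

```python
def dynamic_effect(base_effect, effect_growth, periods_since_treatment):
    """Effect t periods after first treatment: base * (1 + growth * t)."""
    return base_effect * (1.0 + effect_growth * periods_since_treatment)


def true_effect(period, first_treat, base_effect=2.0, effect_growth=0.1):
    """Zero for never-treated units (first_treat == 0) and before treatment."""
    if first_treat == 0 or period < first_treat:
        return 0.0
    return dynamic_effect(base_effect, effect_growth, period - first_treat)


# Effect path for a cohort first treated in period 3 of a 10-period panel.
path = [round(true_effect(t, first_treat=3), 2) for t in range(10)]
print(path)
```

With the default base effect of 2.0 and growth of 0.1, the effect rises by 0.2 per period after adoption, so later cohorts contribute smaller effects at any given calendar period.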
generate_event_study_data#
Generate synthetic event study data for testing.
- diff_diff.generate_event_study_data(n_units=300, n_pre=5, n_post=5, treatment_fraction=0.5, treatment_effect=5.0, unit_fe_sd=2.0, noise_sd=2.0, seed=None)
Generate synthetic data for event study analysis.
Creates panel data with simultaneous treatment at period n_pre. Useful for testing MultiPeriodDiD, pre-trends power analysis, and HonestDiD sensitivity analysis.
- Parameters:
n_units (int, default=300) – Total number of units in the panel.
n_pre (int, default=5) – Number of pre-treatment periods.
n_post (int, default=5) – Number of post-treatment periods.
treatment_fraction (float, default=0.5) – Fraction of units that receive treatment.
treatment_effect (float, default=5.0) – True average treatment effect on the treated.
unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.
noise_sd (float, default=2.0) – Standard deviation of idiosyncratic noise.
seed (int, optional) – Random seed for reproducibility.
- Returns:
Synthetic event study data with columns:
- unit: Unit identifier
- period: Time period
- treated: Binary unit-level treatment indicator
- post: Binary post-treatment indicator
- outcome: Outcome variable
- event_time: Time relative to treatment (negative=pre, 0+=post)
- true_effect: The true treatment effect for this observation
- Return type:
pd.DataFrame
Examples
Generate event study data:
>>> data = generate_event_study_data(n_units=300, n_pre=5, n_post=5, seed=42)
>>> data['event_time'].unique()
array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])
Use with MultiPeriodDiD:
>>> from diff_diff import MultiPeriodDiD
>>> mp_did = MultiPeriodDiD()
>>> results = mp_did.fit(data, outcome='outcome', treatment='treated',
...                      time='period', post_periods=[5, 6, 7, 8, 9])
Notes
The event_time column is relative to treatment:
- Negative values: pre-treatment periods
- 0: first post-treatment period
- Positive values: subsequent post-treatment periods
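The relationship between period, event_time, and post described in the notes can be reproduced with plain pandas. This is only an illustrative sketch: it assumes periods are numbered from 0 and that treatment starts at period n_pre, as stated in the function description.

```python
import pandas as pd

n_pre, n_post = 5, 5
periods = pd.DataFrame({"period": range(n_pre + n_post)})

# Event time is calendar period minus the first treated period (n_pre).
periods["event_time"] = periods["period"] - n_pre
periods["post"] = (periods["event_time"] >= 0).astype(int)

# Calendar periods that count as post-treatment.
post_periods = periods.loc[periods["post"] == 1, "period"].tolist()
print(periods)
```

With n_pre=5 and n_post=5, post_periods comes out as periods 5 through 9, which matches the post_periods argument passed to MultiPeriodDiD in the example above.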
generate_ddd_data#
Generate synthetic Triple Difference data.
- diff_diff.generate_ddd_data(n_per_cell=100, treatment_effect=2.0, group_effect=2.0, partition_effect=1.0, time_effect=0.5, noise_sd=1.0, add_covariates=False, seed=None)
Generate synthetic data for Triple Difference (DDD) analysis.
Creates data following the DGP: Y = mu + G + P + T + G*P + G*T + P*T + tau*G*P*T + eps
where G=group, P=partition, T=time. The treatment effect (tau) only applies to units that are in the treated group (G=1), eligible partition (P=1), and post-treatment period (T=1).
- Parameters:
n_per_cell (int, default=100) – Number of observations per cell (8 cells total: 2x2x2).
treatment_effect (float, default=2.0) – True average treatment effect on the treated (G=1, P=1, T=1).
group_effect (float, default=2.0) – Main effect of being in treated group.
partition_effect (float, default=1.0) – Main effect of being in eligible partition.
time_effect (float, default=0.5) – Main effect of post-treatment period.
noise_sd (float, default=1.0) – Standard deviation of idiosyncratic noise.
add_covariates (bool, default=False) – If True, adds age and education covariates that affect outcome.
seed (int, optional) – Random seed for reproducibility.
- Returns:
Synthetic DDD data with columns:
- outcome: Outcome variable
- group: Group indicator (0=control, 1=treated)
- partition: Partition indicator (0=ineligible, 1=eligible)
- time: Time indicator (0=pre, 1=post)
- unit_id: Unique unit identifier
- true_effect: The true treatment effect for this observation
- age: Age covariate (if add_covariates=True)
- education: Education covariate (if add_covariates=True)
- Return type:
pd.DataFrame
Examples
Generate DDD data:
>>> data = generate_ddd_data(n_per_cell=100, treatment_effect=3.0, seed=42)
>>> data.shape
(800, 6)
>>> data.groupby(['group', 'partition', 'time']).size()
group  partition  time
0      0          0       100
                  1       100
       1          0       100
                  1       100
1      0          0       100
                  1       100
       1          0       100
                  1       100
dtype: int64
Use with TripleDifference estimator:
>>> from diff_diff import TripleDifference
>>> ddd = TripleDifference()
>>> results = ddd.fit(data, outcome='outcome', group='group',
...                   partition='partition', time='time')
>>> abs(results.att - 3.0) < 1.0
True
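A short worked example shows why the triple difference isolates tau in the DGP above: every main effect and two-way interaction cancels when the eight cell means are differenced. The coefficient values below are hypothetical (chosen for illustration, with explicit coefficients on each term), and the means are noise-free expectations rather than library output.

```python
from itertools import product

# Hypothetical coefficients; the two-way interactions are included to show they cancel.
MU, G_EFF, P_EFF, T_EFF = 1.0, 2.0, 1.0, 0.5
GP, GT, PT, TAU = 1.5, 1.0, 0.5, 3.0


def cell_mean(g, p, t):
    """Noise-free expected outcome for cell (G=g, P=p, T=t)."""
    return (MU + G_EFF * g + P_EFF * p + T_EFF * t
            + GP * g * p + GT * g * t + PT * p * t + TAU * g * p * t)


means = {cell: cell_mean(*cell) for cell in product((0, 1), repeat=3)}


def did(g):
    """Within group g: change over time for eligible minus change for ineligible."""
    return ((means[(g, 1, 1)] - means[(g, 1, 0)])
            - (means[(g, 0, 1)] - means[(g, 0, 0)]))


# Triple difference: DiD in the treated group minus DiD in the control group.
ddd = did(1) - did(0)
print(ddd)
```

Each within-group DiD equals PT + TAU * g, so subtracting the control-group DiD removes the partition-by-time term and leaves TAU.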
generate_ddd_panel_data#
Generate synthetic panel-structured Triple Difference data for power analysis.
- diff_diff.generate_ddd_panel_data(n_units=200, n_periods=8, treatment_period=4, group_frac=0.5, partition_frac=0.5, treatment_effect=2.0, group_effect=2.0, partition_effect=1.0, time_effect=0.5, group_time_interaction=1.0, partition_time_interaction=0.5, group_partition_interaction=1.5, unit_fe_sd=1.5, noise_sd=1.0, add_covariates=False, seed=None)
Generate synthetic panel data for Triple Difference (DDD) power analysis.
Creates a balanced panel of n_units observed over n_periods with two time-invariant binary dimensions (group and partition) and a derived binary post indicator. The triple-interaction effect (group * partition * post) is the identifying ATT under DDD-CPT.
The DGP equation is:
Y_{i,t} = unit_fe_i
          + group_i * group_effect
          + partition_i * partition_effect
          + post_t * time_effect
          + (group_i * partition_i) * group_partition_interaction
          + (group_i * post_t) * group_time_interaction
          + (partition_i * post_t) * partition_time_interaction
          + treatment_effect * group_i * partition_i * post_t
          + epsilon_{i,t}
where group_i and partition_i are unit-level (constant in t) and post_t = 1[period >= treatment_period]. DDD-CPT identification holds because group_partition_interaction enters only as a unit-level (time-invariant) effect, leaving the triple-interaction as the sole source of differential group × partition trend.
Unlike the cross-sectional generate_ddd_data, this DGP provides panel-realistic unit fixed effects and within-unit serial structure, making it suitable for panel-aware power-analysis simulations or sanity-checking estimators that ignore the panel dimension.
Warning
TripleDifference is a repeated-cross-section panel=FALSE estimator: its analytical default treats each row as an independent observation (df = n_obs - 8). When fitting against generate_ddd_panel_data output, the within-unit serial correlation makes unclustered SEs anti-conservative: they understate sampling variability and overstate power. Always pass cluster="unit" (Liang-Zeger CR1) when fitting on panel-generated data; the point estimate att is invariant to clustering but the inference contract is not. See the TripleDifference REGISTRY entry for the clustering contract.
- Parameters:
n_units (int, default=200) – Number of units in the panel.
n_periods (int, default=8) – Number of time periods.
treatment_period (int, default=4) – Period (0-indexed) at which post switches from 0 to 1. Must satisfy 1 <= treatment_period < n_periods.
group_frac (float, default=0.5) – Fraction of units with group=1. Must be in (0, 1). The partition split is then drawn stratified-by-group at the requested partition_frac so every (group, partition) cell receives at least one unit; a ValueError is raised when the rounded cell counts would leave any cell empty.
partition_frac (float, default=0.5) – Fraction of units with partition=1 within each group stratum. Must be in (0, 1). The stratified allocation is what makes TripleDifference.fit’s 2x2x2 surface populated for any valid (n_units, group_frac, partition_frac).
treatment_effect (float, default=2.0) – True ATT for the triple-interaction cell (group=1, partition=1, post=1).
group_effect (float, default=2.0) – Main effect of group=1 (unit-level).
partition_effect (float, default=1.0) – Main effect of partition=1 (unit-level).
time_effect (float, default=0.5) – Main effect of post=1 (time-level).
group_time_interaction (float, default=1.0) – Coefficient on group * post (differential trend for the group dimension).
partition_time_interaction (float, default=0.5) – Coefficient on partition * post (differential trend for the partition dimension).
group_partition_interaction (float, default=1.5) – Coefficient on the unit-level group * partition interaction. Must be time-invariant for DDD-CPT to hold.
unit_fe_sd (float, default=1.5) – Standard deviation of the unit fixed effect.
noise_sd (float, default=1.0) – Standard deviation of the idiosyncratic noise term.
add_covariates (bool, default=False) – If True, add unit-level covariates x1 (continuous) and x2 (binary) that affect the outcome.
seed (int, optional) – Random seed for reproducibility.
- Returns:
Long-format panel with columns:
- unit: integer unit ID.
- period: integer time period (0-indexed).
- outcome: outcome variable.
- group: unit-level binary group indicator (time-invariant).
- partition: unit-level binary partition indicator (time-invariant, orthogonal to group).
- post: binary indicator, 1 if period >= treatment_period.
- treated: group * partition * post (binary).
- true_effect: treatment_effect when treated, else 0.
- x1, x2: optional unit-level covariates (only if add_covariates=True).
- Return type:
pd.DataFrame
Examples
Generate a balanced panel with default parameters:
>>> data = generate_ddd_panel_data(n_units=200, n_periods=8, seed=42)
>>> data.shape
(1600, 8)
>>> data.groupby('unit')['period'].count().eq(8).all()
True
Fit with TripleDifference. Note time="post" (the derived binary indicator) and cluster="unit" (required for valid inference on panel-generated data; see the warning above):
>>> from diff_diff import TripleDifference
>>> result = TripleDifference(cluster="unit").fit(
...     data, outcome='outcome', group='group',
...     partition='partition', time='post',
... )
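The stratified allocation described for group_frac and partition_frac can be pictured with the sketch below. It is one plausible reading of the documented behaviour, not the library's implementation: the helper stratified_assignment is hypothetical, and the rounding rule (round to the nearest integer within each stratum) is an assumption.

```python
import numpy as np


def stratified_assignment(n_units, group_frac, partition_frac, seed=None):
    """Assign group and partition so all four (group, partition) cells are non-empty.

    Hypothetical helper for illustration; raises ValueError if a cell would be empty.
    """
    rng = np.random.default_rng(seed)
    n_group1 = int(round(n_units * group_frac))
    group = np.zeros(n_units, dtype=int)
    group[rng.permutation(n_units)[:n_group1]] = 1

    partition = np.zeros(n_units, dtype=int)
    for g in (0, 1):
        members = np.flatnonzero(group == g)
        # Draw the partition split separately inside each group stratum.
        n_part1 = int(round(members.size * partition_frac))
        if n_part1 == 0 or n_part1 == members.size:
            raise ValueError(f"empty (group, partition) cell in group={g} stratum")
        partition[rng.permutation(members)[:n_part1]] = 1
    return group, partition


group, partition = stratified_assignment(200, 0.5, 0.5, seed=42)
cell_counts = {(g, p): int(np.sum((group == g) & (partition == p)))
               for g in (0, 1) for p in (0, 1)}
print(cell_counts)
```

Drawing the partition inside each group stratum, rather than independently across all units, is what guarantees that none of the four unit-level cells is empty for small n_units or extreme fractions.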
generate_factor_data#
Generate synthetic data with factor structure for TROP testing.
- diff_diff.generate_factor_data(n_units=50, n_pre=10, n_post=5, n_treated=10, n_factors=2, treatment_effect=2.0, factor_strength=1.0, treated_loading_shift=0.5, unit_fe_sd=1.0, noise_sd=0.5, seed=None)
Generate synthetic panel data with interactive fixed effects (factor model).
Creates data following the DGP: Y_it = mu + alpha_i + beta_t + Lambda_i’F_t + tau*D_it + eps_it
where Lambda_i’F_t is the interactive fixed effects component. Useful for testing TROP (Triply Robust Panel) and comparing with SyntheticDiD.
- Parameters:
n_units (int, default=50) – Total number of units in the panel.
n_pre (int, default=10) – Number of pre-treatment periods.
n_post (int, default=5) – Number of post-treatment periods.
n_treated (int, default=10) – Number of treated units (assigned to first n_treated unit IDs).
n_factors (int, default=2) – Number of latent factors in the interactive fixed effects.
treatment_effect (float, default=2.0) – True average treatment effect on the treated.
factor_strength (float, default=1.0) – Scaling factor for interactive fixed effects.
treated_loading_shift (float, default=0.5) – Shift in factor loadings for treated units (creates confounding).
unit_fe_sd (float, default=1.0) – Standard deviation of unit fixed effects.
noise_sd (float, default=0.5) – Standard deviation of idiosyncratic noise.
seed (int, optional) – Random seed for reproducibility.
- Returns:
Synthetic factor model data with columns:
- unit: Unit identifier
- period: Time period
- outcome: Outcome variable
- treated: Binary indicator (1 if treated at this observation)
- treat: Binary unit-level ever-treated indicator
- true_effect: The true treatment effect for this observation
- Return type:
pd.DataFrame
Examples
Generate data with factor structure:
>>> data = generate_factor_data(n_units=50, n_factors=2, seed=42)
>>> data.shape
(750, 6)
Use with TROP estimator:
>>> from diff_diff import TROP
>>> trop = TROP(n_bootstrap=50, seed=42)
>>> results = trop.fit(data, outcome='outcome', treatment='treated',
...                    unit='unit', time='period',
...                    post_periods=list(range(10, 15)))
Notes
The treated units have systematically different factor loadings (shifted by treated_loading_shift), which creates confounding that standard DiD cannot address but TROP can handle.
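The confounding described in the note can be seen in a noise-free toy version of the factor model. The sketch assumes a single factor that follows a deterministic upward path (the generator presumably draws factors and loadings at random), so the numbers illustrate the mechanism rather than reproduce library output.

```python
import numpy as np

n_pre, n_post, tau = 10, 5, 2.0
periods = np.arange(n_pre + n_post)
post = periods >= n_pre

# Assumed single factor with a deterministic upward path (illustration only).
factor = 0.1 * periods
control_loading = 1.0
treated_loading = control_loading + 0.5  # treated_loading_shift = 0.5

# Noise-free group mean outcomes: loading * factor, plus tau for treated in post.
y_control = control_loading * factor
y_treated = treated_loading * factor + tau * post

did = ((y_treated[post].mean() - y_treated[~post].mean())
       - (y_control[post].mean() - y_control[~post].mean()))
bias = did - tau  # equals the loading shift times (mean post factor - mean pre factor)
print(f"DiD estimate: {did:.3f}, bias: {bias:.3f}")
```

Because treated units load more heavily on a factor that moves between the pre and post windows, the plain difference-in-differences picks up that movement as if it were treatment effect.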
generate_panel_data#
Generate generic synthetic panel data.
- diff_diff.generate_panel_data(n_units=100, n_periods=8, treatment_period=4, treatment_fraction=0.5, treatment_effect=5.0, parallel_trends=True, trend_violation=1.0, unit_fe_sd=2.0, noise_sd=0.5, seed=None)
Generate synthetic panel data for parallel trends testing.
Creates panel data with optional violation of parallel trends, useful for testing parallel trends diagnostics, placebo tests, and sensitivity analysis methods.
- Parameters:
n_units (int, default=100) – Total number of units in the panel.
n_periods (int, default=8) – Number of time periods.
treatment_period (int, default=4) – First post-treatment period (0-indexed).
treatment_fraction (float, default=0.5) – Fraction of units that receive treatment.
treatment_effect (float, default=5.0) – True average treatment effect on the treated.
parallel_trends (bool, default=True) – If True, treated and control groups have parallel pre-treatment trends. If False, treated group has a steeper pre-treatment trend.
trend_violation (float, default=1.0) – Size of the differential trend for treated group when parallel_trends=False. Treated units have trend = common_trend + trend_violation.
unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.
noise_sd (float, default=0.5) – Standard deviation of idiosyncratic noise.
seed (int, optional) – Random seed for reproducibility.
- Returns:
Synthetic panel data with columns:
- unit: Unit identifier
- period: Time period
- treated: Binary unit-level treatment indicator
- post: Binary post-treatment indicator
- outcome: Outcome variable
- true_effect: The true treatment effect for this observation
- Return type:
pd.DataFrame
Examples
Generate data with parallel trends:
>>> data_parallel = generate_panel_data(parallel_trends=True, seed=42)
>>> from diff_diff.utils import check_parallel_trends
>>> result = check_parallel_trends(data_parallel, outcome='outcome',
...                                time='period', treatment_group='treated',
...                                pre_periods=[0, 1, 2, 3])
>>> result['parallel_trends_plausible']
True
Generate data with trend violation:
>>> data_violation = generate_panel_data(parallel_trends=False, seed=42)
>>> result = check_parallel_trends(data_violation, outcome='outcome',
...                                time='period', treatment_group='treated',
...                                pre_periods=[0, 1, 2, 3])
>>> result['parallel_trends_plausible']
False
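The trend_violation rule (treated trend = common_trend + trend_violation) implies that the pre-period slopes of the two groups differ by exactly trend_violation in expectation. A noise-free sketch, with an assumed common_trend of 0.5 (the function signature does not expose this value):

```python
import numpy as np

treatment_period = 4
common_trend, trend_violation = 0.5, 1.0   # common_trend is an assumed value
pre = np.arange(treatment_period)          # pre-treatment periods 0..3

# Expected group paths before treatment, ignoring fixed effects and noise.
control_pre = common_trend * pre
treated_pre = (common_trend + trend_violation) * pre

# Slope of each group's pre-period path (degree-1 least-squares fit).
control_slope = np.polyfit(pre, control_pre, 1)[0]
treated_slope = np.polyfit(pre, treated_pre, 1)[0]
slope_gap = treated_slope - control_slope
print(f"pre-period slope gap: {slope_gap:.2f}")
```

A pre-trend diagnostic is in effect testing whether this slope gap is distinguishable from zero given the noise in the data.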
generate_continuous_did_data#
Generate synthetic continuous treatment DiD data with known dose-response.
- diff_diff.generate_continuous_did_data(n_units=500, n_periods=4, cohort_periods=None, never_treated_frac=0.3, dose_distribution='lognormal', dose_params=None, att_function='linear', att_slope=2.0, att_intercept=1.0, unit_fe_sd=2.0, time_trend=0.5, noise_sd=1.0, seed=None)
Generate synthetic data for continuous DiD analysis with known dose-response.
Creates a balanced panel with continuous treatment doses and known ATT(d) function, satisfying strong parallel trends by construction.
- Parameters:
n_units (int, default=500) – Number of units in the panel.
n_periods (int, default=4) – Number of time periods (1-indexed).
cohort_periods (list of int, optional) – Treatment cohort periods. Default: [2] (single cohort).
never_treated_frac (float, default=0.3) – Fraction of units that are never-treated.
dose_distribution (str, default="lognormal") – Distribution for dose: "lognormal", "uniform", "exponential".
dose_params (dict, optional) – Distribution-specific parameters. Defaults:
- lognormal: {"mean": 0.5, "sigma": 0.5}
- uniform: {"low": 0.5, "high": 5.0}
- exponential: {"scale": 2.0}
att_function (str, default="linear") – Functional form of ATT(d): "linear", "quadratic", "log".
att_slope (float, default=2.0) – Slope parameter for ATT function.
att_intercept (float, default=1.0) – Intercept parameter for ATT function.
unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.
time_trend (float, default=0.5) – Linear time trend coefficient.
noise_sd (float, default=1.0) – Standard deviation of idiosyncratic noise.
seed (int, optional) – Random seed for reproducibility.
- Returns:
Panel data with columns: unit, period, outcome, first_treat, dose, true_att.
- Return type:
pd.DataFrame
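The default dose_params keys match the keyword arguments of NumPy's random Generator methods, so the three dose distributions can be sketched directly. The linear ATT(d) form below, att_intercept + att_slope * d, is an assumption based on the parameter names; the exact quadratic and log forms are not documented here and are therefore left out.

```python
import numpy as np

rng = np.random.default_rng(42)
n_treated = 350  # e.g. 500 units with never_treated_frac=0.3

# Documented default dose_params, passed to NumPy's Generator methods.
doses = {
    "lognormal": rng.lognormal(mean=0.5, sigma=0.5, size=n_treated),
    "uniform": rng.uniform(low=0.5, high=5.0, size=n_treated),
    "exponential": rng.exponential(scale=2.0, size=n_treated),
}


def linear_att(dose, att_intercept=1.0, att_slope=2.0):
    """Assumed linear form: ATT(d) = att_intercept + att_slope * d."""
    return att_intercept + att_slope * dose


true_att = linear_att(doses["lognormal"])
for name, d in doses.items():
    print(f"{name}: mean dose {d.mean():.2f}")
```

Knowing ATT(d) in closed form is what lets a continuous-treatment estimator's dose-response curve be checked against the truth.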
generate_reversible_did_data#
Generate synthetic reversible-treatment panel data, in which treatment can switch on
and off over time. Use this with ChaisemartinDHaultfoeuille
to test the dCDH estimator on non-absorbing treatments.
- diff_diff.generate_reversible_did_data(n_groups=50, n_periods=6, pattern='single_switch', p_switch=0.2, initial_treat_frac=0.3, cycle_length=2, treatment_effect=2.0, heterogeneous_effects=False, effect_sd=0.5, group_fe_sd=2.0, time_trend=0.1, noise_sd=0.5, seed=None)
Generate synthetic panel data with reversible (non-absorbing) treatment.
Treatment can switch on and off over time, supporting designs where the canonical staggered-adoption assumption (once treated, always treated) does not hold. This is the only generator in the library that produces reversible-treatment data; intended for the ChaisemartinDHaultfoeuille (dCDH) estimator.
Seven patterns are supported. Four of them are guaranteed to keep every group as a “single-switch” group (each group switches treatment status at most once), so the dCDH drop_larger_lower=True filter is a no-op. The other three deliberately produce multi-switch groups for stress-testing the drop logic.
- Parameters:
n_groups (int, default=50) – Number of groups in the panel.
n_periods (int, default=6) – Number of time periods. Must be at least 2.
pattern (str, default="single_switch") – Treatment pattern. One of:
- "single_switch" (default, single-switch): each group switches exactly once at a uniform-random time. Mix of joiners and leavers determined by initial_treat_frac.
- "joiners_only" (single-switch): all groups start untreated and each switches to treated once. Pure staggered adoption.
- "leavers_only" (single-switch): mirror of joiners_only: all groups start treated and each switches to untreated once.
- "mixed_single_switch" (single-switch): deterministic 50/50 mix of joiners and leavers, each with exactly one switch. Useful for parity tests where you want a guaranteed split independent of seed.
- "random" (often multi-switch): each (g, t >= 1) flips treatment from the previous period with probability p_switch. Initial state drawn from Bernoulli(initial_treat_frac). With n_periods >= 4 and p_switch > 0, many groups will switch more than once and will be dropped under drop_larger_lower=True. Useful for stress-testing the drop filter.
- "cycles" (always multi-switch): deterministic on/off cycles of length cycle_length. Half the groups start in the “0” phase and half in the “1” phase, so the panel always contains both joiner and leaver transitions. Every group is multi-switch when n_periods > 2 * cycle_length.
- "marketing" (always multi-switch): seasonal “2 on, 1 off” pattern starting in the on phase, identical across groups. Mimics a marketing campaign with periodic breaks.
p_switch (float, default=0.2) – Per-period flip probability. Only used when pattern="random".
initial_treat_frac (float, default=0.3) – Fraction of groups starting in the treated state at period 0. Only used by "single_switch" and "random".
cycle_length (int, default=2) – Length of each on/off phase. Only used when pattern="cycles".
treatment_effect (float, default=2.0) – Average treatment effect on treated cells. With heterogeneous_effects=False, every treated cell has exactly this effect. With True, this is the mean of a Normal distribution.
heterogeneous_effects (bool, default=False) – If True, per-cell true effects are drawn independently from Normal(treatment_effect, effect_sd).
effect_sd (float, default=0.5) – Standard deviation of per-cell effects when heterogeneous_effects=True.
group_fe_sd (float, default=2.0) – Standard deviation of group fixed effects.
time_trend (float, default=0.1) – Linear time trend coefficient.
noise_sd (float, default=0.5) – Standard deviation of idiosyncratic noise.
seed (int, optional) – Random seed for reproducibility.
- Returns:
Synthetic balanced panel with one row per (group, period) cell and the following columns:
- group (int): group identifier in [0, n_groups)
- period (int): time period in [0, n_periods)
- treatment (int): per-cell binary treatment (0 or 1)
- outcome (float): outcome variable
- true_effect (float): per-cell true treatment effect (0 if untreated)
- d_lag (float): previous-period treatment, NaN at period 0
- switcher_type (object): one of "initial" (period 0), "joiner" (d_lag=0, treatment=1), "leaver" (d_lag=1, treatment=0), "stable_0" (d_lag=0, treatment=0), or "stable_1" (d_lag=1, treatment=1)
- Return type:
pd.DataFrame
Notes
The default pattern is "single_switch" so the generator’s happy path produces data that the dCDH estimator can use directly without dropping groups. The "random", "cycles", and "marketing" patterns are primarily for stress-testing the drop_larger_lower=True filter and will produce data where many or all groups are filtered out before estimation.
The default pattern="single_switch" is A5-safe by construction: every group has at most one transition, so no group can be a “crosser” that switches in and back out. The dCDH estimator’s drop_larger_lower=True filter (matching R DIDmultiplegtDYN) is a no-op on this pattern. Other patterns (random, cycles, marketing) ARE allowed to violate A5 and are useful primarily for stress-testing the multi-switch drop filter; passing them through the estimator with drop_larger_lower=True should drop a non-zero count of crosser groups, which is the intended check. The cohort-recentered variance formula in Web Appendix Section 3.7.3 of the dynamic companion paper is derived under A5, which is why the drop filter is on by default.
Examples
Default single-switch panel (mix of joiners and leavers, all groups survive drop_larger_lower=True):
>>> data = generate_reversible_did_data(n_groups=20, n_periods=6, seed=42)
>>> sorted(data.columns.tolist())
['d_lag', 'group', 'outcome', 'period', 'switcher_type', 'treatment', 'true_effect']
>>> set(data['switcher_type']).issubset(
...     {'initial', 'joiner', 'leaver', 'stable_0', 'stable_1'}
... )
True
Joiners-only (pure staggered adoption):
>>> data = generate_reversible_did_data(
...     n_groups=20, pattern="joiners_only", seed=1
... )
>>> set(data.query("period == 0")['treatment'].unique()) == {0}
True
Leavers-only:
>>> data = generate_reversible_did_data(
...     n_groups=20, pattern="leavers_only", seed=2
... )
>>> set(data.query("period == 0")['treatment'].unique()) == {1}
True
Example#
from diff_diff import generate_reversible_did_data, ChaisemartinDHaultfoeuille
data = generate_reversible_did_data(
    n_groups=80,
    n_periods=6,
    pattern="single_switch",  # or "joiners_only", "leavers_only", "mixed_single_switch"
    treatment_effect=2.0,
    seed=42,
)
est = ChaisemartinDHaultfoeuille()
results = est.fit(
    data, outcome="outcome", group="group",
    time="period", treatment="treatment",
)
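The d_lag and switcher_type columns follow mechanically from the treatment path, as the column definitions in the Returns section describe. The sketch below derives both with pandas on two hand-written groups; it illustrates those definitions and is not the generator's own code.

```python
import numpy as np
import pandas as pd

# Two hand-written groups: group 0 joins at period 2, group 1 leaves at period 3.
df = pd.DataFrame({
    "group": [0] * 4 + [1] * 4,
    "period": list(range(4)) * 2,
    "treatment": [0, 0, 1, 1, 1, 1, 1, 0],
})
df = df.sort_values(["group", "period"]).reset_index(drop=True)

# Previous-period treatment within each group; NaN at each group's first period.
df["d_lag"] = df.groupby("group")["treatment"].shift(1)

conditions = [
    df["d_lag"].isna(),
    (df["d_lag"] == 0) & (df["treatment"] == 1),
    (df["d_lag"] == 1) & (df["treatment"] == 0),
    (df["d_lag"] == 0) & (df["treatment"] == 0),
]
labels = ["initial", "joiner", "leaver", "stable_0"]
df["switcher_type"] = np.select(conditions, labels, default="stable_1")

# Number of treatment changes per group (single-switch groups have at most 1).
n_switches = df["switcher_type"].isin(["joiner", "leaver"]).groupby(df["group"]).sum()
print(df)
```

Counting joiner and leaver rows per group gives a quick check of whether a panel is single-switch before passing it to an estimator that drops multi-switch groups.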
Indicator Creation#
make_treatment_indicator#
Create binary treatment indicator from categorical or numeric columns.
- diff_diff.make_treatment_indicator(data, column, treated_values=None, threshold=None, above_threshold=True, new_column='treated')
Create a binary treatment indicator column from various input types.
- Parameters:
data (pd.DataFrame) – Input DataFrame.
column (str) – Name of the column to use for creating the treatment indicator.
treated_values (Any or list, optional) – Value(s) that indicate treatment. Units with these values get treatment=1, others get treatment=0.
threshold (float, optional) – Numeric threshold for creating treatment. Used when the treatment is based on a continuous variable (e.g., treat firms above median size).
above_threshold (bool, default=True) – If True, values >= threshold are treated. If False, values <= threshold are treated. Only used when threshold is specified.
new_column (str, default="treated") – Name of the new treatment indicator column.
- Returns:
DataFrame with the new treatment indicator column added.
- Return type:
pd.DataFrame
Examples
Create treatment from categorical variable:
>>> df = pd.DataFrame({'group': ['A', 'A', 'B', 'B'], 'y': [1, 2, 3, 4]})
>>> df = make_treatment_indicator(df, 'group', treated_values='A')
>>> df['treated'].tolist()
[1, 1, 0, 0]
Create treatment from numeric threshold:
>>> df = pd.DataFrame({'size': [10, 50, 100, 200], 'y': [1, 2, 3, 4]})
>>> df = make_treatment_indicator(df, 'size', threshold=75)
>>> df['treated'].tolist()
[0, 0, 1, 1]
Treat units below a threshold:
>>> df = make_treatment_indicator(df, 'size', threshold=75, above_threshold=False)
>>> df['treated'].tolist()
[1, 1, 0, 0]
Example#
from diff_diff import make_treatment_indicator
# From categorical
data = make_treatment_indicator(
    data,
    column='group',
    treated_values='treatment'
)
# From numeric threshold
data = make_treatment_indicator(
    data,
    column='exposure',
    threshold=0.5,
    new_column='high_exposure'
)
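For readers who want to see what the two rules amount to, the documented semantics can be written in plain pandas. This is an equivalent-behaviour sketch based on the parameter descriptions (membership in treated_values, or >= / <= threshold), not the function's source; the column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"group": ["A", "A", "B", "B"], "size": [10, 50, 100, 200]})

# Categorical rule: rows whose value is in the treated set get 1.
df["treated_cat"] = df["group"].isin(["A"]).astype(int)

# Threshold rule, above_threshold=True: values >= threshold get 1.
df["treated_above"] = (df["size"] >= 75).astype(int)

# Threshold rule, above_threshold=False: values <= threshold get 1.
df["treated_below"] = (df["size"] <= 75).astype(int)
print(df)
```

Note that both threshold directions include the boundary value, so a unit exactly at the threshold is treated under either setting.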
make_post_indicator#
Create post-treatment period indicator.
- diff_diff.make_post_indicator(data, time_column, post_periods=None, treatment_start=None, new_column='post')
Create a binary post-treatment indicator column.
- Parameters:
data (pd.DataFrame) – Input DataFrame.
time_column (str) – Name of the time/period column.
post_periods (Any or list, optional) – Specific period value(s) that are post-treatment. Periods matching these values get post=1, others get post=0.
treatment_start (Any, optional) – The first post-treatment period. All periods >= this value get post=1. Works with numeric periods, strings (sorted alphabetically), or dates.
new_column (str, default="post") – Name of the new post indicator column.
- Returns:
DataFrame with the new post indicator column added.
- Return type:
pd.DataFrame
Examples
Using specific post periods:
>>> df = pd.DataFrame({'year': [2018, 2019, 2020, 2021], 'y': [1, 2, 3, 4]})
>>> df = make_post_indicator(df, 'year', post_periods=[2020, 2021])
>>> df['post'].tolist()
[0, 0, 1, 1]
Using treatment start:
>>> df = make_post_indicator(df, 'year', treatment_start=2020)
>>> df['post'].tolist()
[0, 0, 1, 1]
Works with date columns:
>>> df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-06-01', '2021-01-01'])})
>>> df = make_post_indicator(df, 'date', treatment_start='2020-06-01')
Example#
from diff_diff import make_post_indicator
data = make_post_indicator(
    data,
    time_column='period',
    treatment_start=5
)
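The two ways of marking post-treatment periods (an explicit list, or everything from a start value onward) correspond to simple pandas comparisons. The sketch below restates the documented semantics with plain pandas; the column names post_list and post_start are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"year": [2018, 2019, 2020, 2021]})

# post_periods-style: only the listed periods are post.
df["post_list"] = df["year"].isin([2020, 2021]).astype(int)

# treatment_start-style: every period >= the start value is post.
df["post_start"] = (df["year"] >= 2020).astype(int)

# The same comparison works for datetime columns.
dates = pd.DataFrame({"date": pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01"])})
dates["post"] = (dates["date"] >= pd.Timestamp("2020-06-01")).astype(int)
print(df)
print(dates)
```

The list form is the safer choice when post-treatment periods are not contiguous, since the start-value form always marks every later period as post.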
Panel Data Utilities#
wide_to_long#
Reshape wide panel data to long format.
- diff_diff.wide_to_long(data, value_columns, id_column, time_name='period', value_name='value', time_values=None)
Convert wide-format panel data to long format for DiD analysis.
Wide format has one row per unit with multiple columns for each time period. Long format has one row per unit-period combination.
- Parameters:
data (pd.DataFrame) – Wide-format DataFrame with one row per unit.
value_columns (list of str) – Column names containing the outcome values for each period. These should be in chronological order.
id_column (str) – Column name for the unit identifier.
time_name (str, default="period") – Name for the new time period column.
value_name (str, default="value") – Name for the new value/outcome column.
time_values (list, optional) – Values to use for time periods. If None, uses 0, 1, 2, … Must have same length as value_columns.
- Returns:
Long-format DataFrame with one row per unit-period.
- Return type:
pd.DataFrame
Examples
>>> wide_df = pd.DataFrame({
...     'firm_id': [1, 2, 3],
...     'sales_2019': [100, 150, 200],
...     'sales_2020': [110, 160, 210],
...     'sales_2021': [120, 170, 220]
... })
>>> long_df = wide_to_long(
...     wide_df,
...     value_columns=['sales_2019', 'sales_2020', 'sales_2021'],
...     id_column='firm_id',
...     time_name='year',
...     value_name='sales',
...     time_values=[2019, 2020, 2021]
... )
>>> len(long_df)
9
>>> long_df.columns.tolist()
['firm_id', 'year', 'sales']
Example#
from diff_diff import wide_to_long
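# Wide format: one row per unit, one outcome column per time period
# unit_id, y_2019, y_2020, y_2021, y_2022
long_data = wide_to_long(
    wide_data,
    value_columns=['y_2019', 'y_2020', 'y_2021', 'y_2022'],  # chronological order
    id_column='unit_id',
    time_name='year',
    value_name='outcome',
    time_values=[2019, 2020, 2021, 2022]
)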
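The reshape described above can be approximated with pandas' own melt, which may help when checking what the long output should look like. This is a sketch in plain pandas, not the body of wide_to_long; the mapping from column names to time values is written out by hand as an assumption about how time_values lines up with value_columns.

```python
import pandas as pd

wide_df = pd.DataFrame({
    "firm_id": [1, 2, 3],
    "sales_2019": [100, 150, 200],
    "sales_2020": [110, 160, 210],
    "sales_2021": [120, 170, 220],
})
value_columns = ["sales_2019", "sales_2020", "sales_2021"]
time_values = [2019, 2020, 2021]

# Stack the period columns into rows: one row per firm-period
long_df = wide_df.melt(
    id_vars="firm_id",
    value_vars=value_columns,
    var_name="year",
    value_name="sales",
)

# Replace the original column names with the matching time values
long_df["year"] = long_df["year"].map(dict(zip(value_columns, time_values)))
long_df = long_df.sort_values(["firm_id", "year"]).reset_index(drop=True)

print(len(long_df))               # 9
print(long_df.columns.tolist())   # ['firm_id', 'year', 'sales']
```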
balance_panel#
Balance panel data by filling or dropping incomplete observations.
- diff_diff.balance_panel(data, unit_column, time_column, method='inner', fill_value=None)[source]#
Balance a panel dataset to ensure all units have all time periods.
- Parameters:
data (pd.DataFrame) – Unbalanced panel data.
unit_column (str) – Column name for unit identifier.
time_column (str) – Column name for time period.
method (str, default="inner") – Balancing method:
- "inner": Keep only units that appear in all periods (drops units)
- "outer": Include all unit-period combinations (creates NaN)
- "fill": Include all combinations and fill missing values
fill_value (float, optional) – Value to fill missing observations when method="fill". If None with method="fill", uses column-specific forward fill.
- Returns:
Balanced panel DataFrame.
- Return type:
pd.DataFrame
Examples
Keep only complete units:
>>> df = pd.DataFrame({
...     'unit': [1, 1, 1, 2, 2, 3, 3, 3],
...     'period': [1, 2, 3, 1, 2, 1, 2, 3],
...     'y': [10, 11, 12, 20, 21, 30, 31, 32]
... })
>>> balanced = balance_panel(df, 'unit', 'period', method='inner')
>>> balanced['unit'].unique().tolist()
[1, 3]
Include all combinations:
>>> balanced = balance_panel(df, 'unit', 'period', method='outer')
>>> len(balanced)
9
Example#
from diff_diff import balance_panel
# Include all unit-period combinations and fill the gaps (forward fill when fill_value is None)
balanced = balance_panel(
data,
unit_column='unit_id',
time_column='period',
method='fill'
)
# Or keep only units with all periods (default)
balanced = balance_panel(
data,
unit_column='unit_id',
time_column='period',
method='inner'
)
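The three balancing methods can be pictured with plain pandas operations. The sketch below is an illustration under stated assumptions, not the library's implementation: it fills with a constant rather than the forward fill that balance_panel uses when fill_value is None, and the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 3, 3, 3],
    "period": [1, 2, 3, 1, 2, 1, 2, 3],
    "y": [10, 11, 12, 20, 21, 30, 31, 32],
})
n_periods = df["period"].nunique()

# "inner": keep only units observed in every period
counts = df.groupby("unit")["period"].nunique()
inner = df[df["unit"].isin(counts[counts == n_periods].index)]

# "outer": build every unit-period pair; unobserved pairs get NaN
full_index = pd.MultiIndex.from_product(
    [df["unit"].unique(), df["period"].unique()], names=["unit", "period"]
)
outer = df.set_index(["unit", "period"]).reindex(full_index).reset_index()

# "fill" with a constant: same grid, NaN replaced by a chosen value
filled = outer.fillna({"y": 0.0})

print(inner["unit"].unique().tolist())  # [1, 3]
print(len(outer))                       # 9
```

Unit 2 is missing period 3, so "inner" drops it entirely while "outer" and "fill" add a ninth row for it.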
Staggered Adoption Utilities#
create_event_time#
Create event-time column for staggered adoption designs.
- diff_diff.create_event_time(data, time_column, treatment_time_column, new_column='event_time')[source]#
Create an event-time column relative to treatment timing.
Useful for event study designs where treatment occurs at different times for different units.
- Parameters:
data (pd.DataFrame) – Panel data.
time_column (str) – Name of the calendar time column.
treatment_time_column (str) – Name of the column indicating when each unit was treated. Units with NaN or infinity are considered never-treated.
new_column (str, default="event_time") – Name of the new event-time column.
- Returns:
DataFrame with event-time column added. Values are:
- Negative for pre-treatment periods
- 0 for the treatment period
- Positive for post-treatment periods
- NaN for never-treated units
- Return type:
pd.DataFrame
Examples
>>> df = pd.DataFrame({
...     'unit': [1, 1, 1, 2, 2, 2],
...     'year': [2018, 2019, 2020, 2018, 2019, 2020],
...     'treatment_year': [2019, 2019, 2019, 2020, 2020, 2020]
... })
>>> df = create_event_time(df, 'year', 'treatment_year')
>>> df['event_time'].tolist()
[-1, 0, 1, -2, -1, 0]
Example#
from diff_diff import create_event_time
data = create_event_time(
data,
time_column='period',
treatment_time_column='first_treat'
)
# event_time = period - first_treat
# Negative values: pre-treatment
# Zero: treatment period
# Positive values: post-treatment
# NaN for never-treated
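The rule in the comments above (event_time = period - first_treat, NaN for never-treated units) can be reproduced by hand to sanity-check the output. The following is a plain pandas/numpy sketch, not the function's source; it assumes never-treated units are encoded as NaN or infinity, as the parameter description states.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "period": [2018, 2019, 2020] * 3,
    # Unit 3 is never treated, encoded here as NaN
    "first_treat": [2019.0] * 3 + [2020.0] * 3 + [np.nan] * 3,
})

# NaN and +/-infinity both mean "never treated"
ever_treated = np.isfinite(df["first_treat"])

# Calendar time minus treatment time; forced to NaN where never treated
df["event_time"] = (df["period"] - df["first_treat"]).where(ever_treated)

print(df["event_time"].tolist()[:6])  # [-1.0, 0.0, 1.0, -2.0, -1.0, 0.0]
```

The explicit mask matters for infinity-coded units: without it, the subtraction would yield -inf instead of NaN.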
aggregate_to_cohorts#
Aggregate unit-level data to cohort means.
- diff_diff.aggregate_to_cohorts(data, unit_column, time_column, treatment_column, outcome, covariates=None)[source]#
Aggregate unit-level data to treatment cohort means.
Useful for visualization and cohort-level analysis.
- Parameters:
data (pd.DataFrame) – Unit-level panel data.
unit_column (str) – Name of unit identifier column.
time_column (str) – Name of time period column.
treatment_column (str) – Name of treatment indicator column.
outcome (str) – Name of outcome variable column.
covariates (list of str, optional) – Additional columns to aggregate (will compute means).
- Returns:
Cohort-level data with mean outcomes by treatment status and period.
- Return type:
pd.DataFrame
Examples
>>> df = pd.DataFrame({
...     'unit': [1, 1, 2, 2, 3, 3, 4, 4],
...     'period': [0, 1, 0, 1, 0, 1, 0, 1],
...     'treated': [1, 1, 1, 1, 0, 0, 0, 0],
...     'y': [10, 15, 12, 17, 8, 10, 9, 11]
... })
>>> cohort_df = aggregate_to_cohorts(df, 'unit', 'period', 'treated', 'y')
>>> len(cohort_df)
4
Example#
from diff_diff import aggregate_to_cohorts
cohort_data = aggregate_to_cohorts(
data,
unit_column='unit_id',
time_column='period',
treatment_column='first_treat',
outcome='outcome'
)
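Conceptually, cohort aggregation is a group-by over treatment status and period. The sketch below shows that idea in plain pandas; the output column names mean_y and n_units are hypothetical and need not match what aggregate_to_cohorts returns.

```python
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 2, 2, 3, 3, 4, 4],
    "period": [0, 1, 0, 1, 0, 1, 0, 1],
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "y": [10, 15, 12, 17, 8, 10, 9, 11],
})

# One row per (treatment status, period) cell with the mean outcome
cohort_means = (
    df.groupby(["treated", "period"], as_index=False)
      .agg(mean_y=("y", "mean"), n_units=("unit", "nunique"))
)

print(cohort_means["mean_y"].tolist())  # [8.5, 10.5, 11.0, 16.0]
```

Two treatment groups times two periods gives the four rows seen in the docstring example.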
Survey Aggregation#
aggregate_survey#
Aggregate survey microdata to geographic-period cells with design-based precision.
- diff_diff.aggregate_survey(data, by, outcomes, survey_design, covariates=None, min_n=2, lonely_psu=None, second_stage_weights='pweight')[source]#
Aggregate survey microdata to geographic-period cells with design-based precision.
Computes design-weighted cell means and their Taylor-linearized (or replicate-based) standard errors for each cell defined by the by columns. Returns a panel-ready DataFrame and a pre-configured SurveyDesign for second-stage DiD estimation.
Each cell is treated as a subpopulation/domain of the full survey design: influence function values are zero-padded outside the cell, preserving full strata/PSU structure for variance estimation per Lumley (2004) Section 3.4.
- Parameters:
data (pd.DataFrame) – Individual-level microdata.
by (str or list of str) – Columns defining cells (e.g., ["state", "year"]). The first element is used as the clustering variable in the returned SurveyDesign (geographic unit for second-stage inference).
outcomes (str or list of str) – Outcome variable(s) to aggregate with full precision tracking. Each outcome produces {name}_mean, {name}_se, {name}_n, and {name}_precision columns. When multiple outcomes are given, panel filtering (non-estimable cell removal, zero-weight PSU pruning) is based on the first outcome only, consistent with the returned SurveyDesign. For independent per-outcome support, call once per outcome.
survey_design (SurveyDesign) – Survey design specification for the microdata.
covariates (str or list of str, optional) – Additional variables to aggregate as design-weighted means only (no SE/precision columns).
min_n (int, default 2) – Minimum respondents per cell. Cells below this threshold use simple random sampling variance as a fallback.
lonely_psu (str, optional) – Override the survey design’s lonely_psu setting for within-cell computation. One of "remove", "certainty", "adjust".
second_stage_weights (str, default "pweight") – Weight type for the returned second-stage SurveyDesign:
- "pweight" (default): Population weights - the mean of per-cell survey weight sums within each geographic unit (first by column), constant across periods. Targets population-weighted second-stage estimation. Compatible with all survey-capable estimators including those that require unit-constant survey columns.
- "aweight": Precision weights - inverse variance (1 / V(y_bar)). Targets precision-weighted second-stage estimation via WLS. Compatible with estimators that accept aweight (DifferenceInDifferences, TwoWayFixedEffects, MultiPeriodDiD, SunAbraham, ContinuousDiD, EfficientDiD); rejected by pweight-only estimators.
- Returns:
panel_df (pd.DataFrame) – Aggregated panel with columns: grouping variables, {outcome}_mean, {outcome}_se, {outcome}_n, {outcome}_precision, {outcome}_weight, {covariate}_mean, cell_n, cell_n_eff, cell_sum_w, srs_fallback. The _weight column contains unit-constant population weights (mean of cell_sum_w within each geographic unit) in pweight mode, or cleaned precision (NaN/Inf mapped to 0.0) in aweight mode. cell_sum_w is always present as a diagnostic column containing the sum of normalized survey weights per cell (proportional to estimated population).
second_stage_design (SurveyDesign) – Pre-configured for second-stage estimation with the chosen weight_type, weights from the first outcome, and geographic clustering via psu.
- Return type:
tuple of (pd.DataFrame, SurveyDesign)
Examples
>>> design = SurveyDesign(weights="finalwt", strata="strat", psu="psu")
>>> panel, stage2 = aggregate_survey(
...     microdata, by=["state", "year"],
...     outcomes="smoking_rate", survey_design=design,
... )
>>> # stage2 has weight_type="pweight", which is compatible with all estimators.
>>> # Add treatment/time indicators at the panel level, then fit:
>>> # panel["first_treat"] = panel["state"].map(policy_year).fillna(0)
>>> # result = CallawaySantAnna().fit(
>>> #     panel, outcome="smoking_rate_mean",
>>> #     unit="state", time="year", first_treat="first_treat",
>>> #     survey_design=stage2,
>>> # )
Example#
from diff_diff import aggregate_survey, SurveyDesign, DifferenceInDifferences
# Define the survey design for the microdata
design = SurveyDesign(weights="finalwt", strata="strat", psu="psu")
# Aggregate to state-year panel with design-based SEs
panel, stage2 = aggregate_survey(
microdata,
by=["state", "year"],
outcomes="smoking_rate",
covariates=["age", "income"],
survey_design=design,
)
# panel has: state, year, smoking_rate_mean, smoking_rate_se,
# smoking_rate_n, smoking_rate_precision, smoking_rate_weight,
# age_mean, income_mean, cell_n, cell_n_eff, cell_sum_w, srs_fallback
#
# *_weight is fit-ready: unit-constant population weight (pweight, default)
# or cleaned precision with NaN/Inf -> 0.0 (aweight opt-in).
# cell_sum_w is a per-cell diagnostic (sum of survey weights per cell).
# Non-estimable cells and zero-weight geos are dropped automatically.
# stage2 is pre-configured: pweights + state-level clustering
# Add treatment/time indicators at the panel level, then fit:
# panel["treated"] = ... # from policy adoption data
# panel["post"] = (panel["year"] >= treatment_year).astype(int)
# result = DifferenceInDifferences().fit(
# panel, outcome="smoking_rate_mean",
# treatment="treated", time="post", survey_design=stage2,
# )
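To make the weight columns concrete, the sketch below computes a weighted cell mean, a per-cell weight sum, and a unit-constant population weight (the mean of the per-cell sums within each state) on a toy frame. It is a simplification under stated assumptions: it uses raw rather than normalized weights, ignores strata and PSUs, and computes no standard errors, so it does not reproduce aggregate_survey. All column names are made up.

```python
import pandas as pd

micro = pd.DataFrame({
    "state": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    "smoker": [1, 0, 1, 1, 0, 0, 1, 0],
    "finalwt": [1.0, 3.0, 2.0, 2.0, 1.0, 1.0, 3.0, 1.0],
})

# Weighted numerator for the cell mean
micro["wy"] = micro["finalwt"] * micro["smoker"]

cells = micro.groupby(["state", "year"], as_index=False).agg(
    cell_sum_w=("finalwt", "sum"),  # sum of survey weights in the cell
    sum_wy=("wy", "sum"),
    cell_n=("smoker", "size"),      # respondents in the cell
)
cells["smoker_mean"] = cells["sum_wy"] / cells["cell_sum_w"]

# pweight-style weight: mean of cell_sum_w within each state, constant over years
cells["smoker_weight"] = cells.groupby("state")["cell_sum_w"].transform("mean")

print(cells[["state", "year", "smoker_mean", "smoker_weight"]])
```

State B has weight sums of 2 and 4 across its two years, so its unit-constant weight is 3 in both rows.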
Data Validation#
validate_did_data#
Validate data structure for DiD analysis.
- diff_diff.validate_did_data(data, outcome, treatment, time, unit=None, raise_on_error=True)[source]#
Validate that data is properly formatted for DiD analysis.
Checks for common data issues and provides informative error messages.
- Parameters:
data (pd.DataFrame) – Data to validate.
outcome (str) – Name of outcome variable column.
treatment (str) – Name of treatment indicator column.
time (str) – Name of time/post indicator column.
unit (str, optional) – Name of unit identifier column (for panel data validation).
raise_on_error (bool, default=True) – If True, raises ValueError on validation failures. If False, returns validation results without raising.
- Returns:
Validation results with keys:
- valid: bool indicating if data passed all checks
- errors: list of error messages
- warnings: list of warning messages
- summary: dict with data summary statistics
- Return type:
dict
Examples
>>> df = pd.DataFrame({
...     'y': [1, 2, 3, 4],
...     'treated': [0, 0, 1, 1],
...     'post': [0, 1, 0, 1]
... })
>>> result = validate_did_data(df, 'y', 'treated', 'post', raise_on_error=False)
>>> result['valid']
True
Example#
from diff_diff import validate_did_data
result = validate_did_data(
    data,
    outcome='outcome',
    treatment='treated',
    time='period',
    unit='unit_id',
    raise_on_error=False  # return the results instead of raising ValueError
)
if not result['valid']:
    for error in result['errors']:
        print(f"Error: {error}")
for warning in result['warnings']:
    print(f"Warning: {warning}")
summarize_did_data#
Generate summary statistics for DiD data.
- diff_diff.summarize_did_data(data, outcome, treatment, time, unit=None)[source]#
Generate summary statistics by treatment group and time period.
- Parameters:
- Returns:
Summary statistics with columns for each treatment-time combination.
- Return type:
pd.DataFrame
Examples
>>> df = pd.DataFrame({
...     'y': [10, 11, 12, 13, 20, 21, 22, 23],
...     'treated': [0, 0, 1, 1, 0, 0, 1, 1],
...     'post': [0, 1, 0, 1, 0, 1, 0, 1]
... })
>>> summary = summarize_did_data(df, 'y', 'treated', 'post')
>>> print(summary)
Example#
from diff_diff import summarize_did_data
summary = summarize_did_data(
data,
outcome='outcome',
treatment='treated',
time='period',
unit='unit_id'
)
print(summary)
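A summary by treatment group and period is essentially a 2x2 table of cell means, from which the raw difference-in-differences can be read off. The sketch below builds that table in plain pandas; it does not claim to match the layout or statistics that summarize_did_data returns.

```python
import pandas as pd

df = pd.DataFrame({
    "y": [10, 11, 12, 13, 20, 21, 22, 23],
    "treated": [0, 0, 1, 1, 0, 0, 1, 1],
    "post": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Rows: treatment group; columns: pre (0) and post (1) period
cell_means = df.groupby(["treated", "post"])["y"].mean().unstack("post")

# Change for the treated group minus change for the control group
did = (cell_means.loc[1, 1] - cell_means.loc[1, 0]) - (
    cell_means.loc[0, 1] - cell_means.loc[0, 0]
)

print(cell_means)
print(did)  # 0.0
```

Both groups rise by one unit between periods in this toy data, so the raw estimate is zero.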
Control Unit Selection#
rank_control_units#
Rank control units by suitability for DiD or synthetic control.
- diff_diff.rank_control_units(data, unit_column, time_column, outcome_column, treatment_column=None, treated_units=None, pre_periods=None, covariates=None, outcome_weight=0.7, covariate_weight=0.3, exclude_units=None, require_units=None, n_top=None, suggest_treatment_candidates=False, n_treatment_candidates=5, lambda_reg=0.0)[source]#
Rank potential control units by their suitability for DiD analysis.
Evaluates control units based on pre-treatment outcome trend similarity and optional covariate matching to treated units. Returns a ranked list with quality scores.
- Parameters:
data (pd.DataFrame) – Panel data in long format.
unit_column (str) – Column name for unit identifier.
time_column (str) – Column name for time periods.
outcome_column (str) – Column name for outcome variable.
treatment_column (str, optional) – Column with binary treatment indicator (0/1). Used to identify treated units from data.
treated_units (list, optional) – Explicit list of treated unit IDs. Alternative to treatment_column.
pre_periods (list, optional) – Pre-treatment periods for comparison. If None, uses first half of periods.
covariates (list of str, optional) – Covariate columns for matching. Similarity is based on pre-treatment means.
outcome_weight (float, default=0.7) – Weight for pre-treatment outcome trend similarity (0-1).
covariate_weight (float, default=0.3) – Weight for covariate distance (0-1). Ignored if no covariates.
exclude_units (list, optional) – Units that cannot be in control group.
require_units (list, optional) – Units that must be in control group (will always appear in output).
n_top (int, optional) – Return only top N control units. If None, return all.
suggest_treatment_candidates (bool, default=False) – If True and no treated units specified, identify potential treatment candidates instead of ranking controls.
n_treatment_candidates (int, default=5) – Number of treatment candidates to suggest.
lambda_reg (float, default=0.0) – Regularization for synthetic weights. Higher values give more uniform weights across controls.
- Returns:
Ranked control units with columns:
unit: Unit identifier
quality_score: Combined quality score (0-1, higher is better)
outcome_trend_score: Pre-treatment outcome trend similarity
covariate_score: Covariate match score (NaN if no covariates)
synthetic_weight: Informational heuristic weight from a single-pass uncentered Frank-Wolfe solve; does NOT factor into quality_score (ranking) and is NOT the canonical SDID unit weight. For canonical SDID weights use SyntheticDiD.fit().
pre_trend_rmse: RMSE of pre-treatment outcome vs treated mean
is_required: Whether unit was in require_units
If suggest_treatment_candidates=True (and no treated units):
unit: Unit identifier
treatment_candidate_score: Suitability as treatment unit
avg_outcome_level: Pre-treatment outcome mean
outcome_trend: Pre-treatment trend slope
n_similar_controls: Count of similar potential controls
- Return type:
pd.DataFrame
Examples
Rank controls against treated units:
>>> data = generate_did_data(n_units=30, n_periods=6, seed=42)
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     n_top=10
... )
>>> ranking['quality_score'].is_monotonic_decreasing
True
With covariates:
>>> data['size'] = np.random.randn(len(data))
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     covariates=['size']
... )
Filter data for SyntheticDiD:
>>> top_controls = ranking['unit'].tolist()
>>> filtered = data[(data['treated'] == 1) | (data['unit'].isin(top_controls))]
Example#
from diff_diff import rank_control_units, generate_did_data
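# treatment_period=5 keeps periods 0-4 pre-treatment, matching pre_periods below
panel = generate_did_data(n_units=100, n_periods=10, treatment_effect=2.0, treatment_period=5)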
ranked = rank_control_units(
panel,
unit_column='unit',
time_column='period',
outcome_column='outcome',
treatment_column='treated',
pre_periods=[0, 1, 2, 3, 4]
)
# Select top 10 control units
best_controls = ranked.head(10)['unit'].tolist()
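The pre_trend_rmse column is described as the RMSE of each control's pre-treatment outcome against the treated mean. The sketch below computes that quantity by hand on a toy panel. It is an illustration of the documented definition only; how rank_control_units turns it into outcome_trend_score or quality_score is not shown here, and the data and names are hypothetical.

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    "unit": ["T"] * 3 + ["C1"] * 3 + ["C2"] * 3,
    "period": [0, 1, 2] * 3,
    "outcome": [1.0, 2.0, 3.0, 1.0, 2.0, 4.0, 5.0, 5.0, 5.0],
    "treated": [1] * 3 + [0] * 6,
})
pre_periods = [0, 1, 2]
pre = panel[panel["period"].isin(pre_periods)]

# Mean pre-treatment path of the treated units, indexed by period
treated_path = pre[pre["treated"] == 1].groupby("period")["outcome"].mean()

# One column per control unit, one row per pre-treatment period
controls = pre[pre["treated"] == 0].pivot(
    index="period", columns="unit", values="outcome"
)

# RMSE of each control's path against the treated mean path (lower is closer)
pre_trend_rmse = np.sqrt(
    (controls.sub(treated_path, axis=0) ** 2).mean()
).sort_values()

print(pre_trend_rmse.index.tolist())  # ['C1', 'C2']
```

C1 tracks the treated path except in the last period, so it ranks ahead of the flat C2 series.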