Data Preparation
Utilities for preparing and validating data for DiD analysis.
Data Generation
generate_did_data
Generate synthetic data with known treatment effects for testing.
- diff_diff.generate_did_data(n_units=100, n_periods=4, treatment_effect=5.0, treatment_fraction=0.5, treatment_period=2, unit_fe_sd=2.0, time_trend=0.5, noise_sd=1.0, seed=None)[source]
Generate synthetic data for DiD analysis with known treatment effect.
Creates a balanced panel dataset with realistic features including unit fixed effects, time trends, and a known treatment effect.
- Parameters:
n_units (int, default=100) – Number of units in the panel.
n_periods (int, default=4) – Number of time periods.
treatment_effect (float, default=5.0) – True average treatment effect on the treated.
treatment_fraction (float, default=0.5) – Fraction of units that receive treatment.
treatment_period (int, default=2) – First post-treatment period (0-indexed). Periods >= this are post.
unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.
time_trend (float, default=0.5) – Linear time trend coefficient.
noise_sd (float, default=1.0) – Standard deviation of idiosyncratic noise.
seed (int, optional) – Random seed for reproducibility.
- Returns:
Synthetic panel data with columns:
  - unit: Unit identifier
  - period: Time period
  - treated: Treatment indicator (0/1)
  - post: Post-treatment indicator (0/1)
  - outcome: Outcome variable
  - true_effect: The true treatment effect (for validation)
- Return type:
pd.DataFrame
Examples
Generate simple data for testing:
>>> data = generate_did_data(n_units=50, n_periods=4, treatment_effect=3.0, seed=42)
>>> len(data)
200
>>> data.columns.tolist()
['unit', 'period', 'treated', 'post', 'outcome', 'true_effect']
Verify treatment effect recovery:
>>> from diff_diff import DifferenceInDifferences
>>> did = DifferenceInDifferences()
>>> results = did.fit(data, outcome='outcome', treatment='treated', time='post')
>>> abs(results.att - 3.0) < 1.0  # Close to true effect
True
Example
from diff_diff import generate_did_data
# Generate basic 2x2 DiD data (treated/control groups, pre/post periods)
data = generate_did_data(
    n_units=100,
    n_periods=10,
    treatment_effect=5.0,
    treatment_period=5,  # periods 5-9 are post-treatment
    treatment_fraction=0.5,
    noise_sd=1.0
)
print(data.head())
# Columns: unit, period, treated, post, outcome, true_effect
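The parameters above suggest an additive data-generating process: unit fixed effect, plus linear time trend, plus the treatment effect for treated units in post periods, plus noise. The sketch below re-creates that design with NumPy and pandas to show why a 2x2 difference of means recovers the effect. The function name `simulate_did_panel`, the rule that the first units are the treated ones, and the exact additive form are assumptions for illustration, not the library's implementation.

```python
import numpy as np
import pandas as pd


def simulate_did_panel(n_units=100, n_periods=4, treatment_effect=5.0,
                       treatment_fraction=0.5, treatment_period=2,
                       unit_fe_sd=2.0, time_trend=0.5, noise_sd=1.0, seed=None):
    """Hypothetical re-creation of the documented design, not the library's code."""
    rng = np.random.default_rng(seed)
    n_treated = int(round(n_units * treatment_fraction))

    # One row per unit-period combination (balanced panel).
    unit = np.repeat(np.arange(n_units), n_periods)
    period = np.tile(np.arange(n_periods), n_units)

    treated = (unit < n_treated).astype(int)         # assumed: first units are treated
    post = (period >= treatment_period).astype(int)  # periods >= treatment_period are post
    true_effect = treatment_effect * treated * post

    unit_fe = rng.normal(0.0, unit_fe_sd, n_units)[unit]
    noise = rng.normal(0.0, noise_sd, n_units * n_periods)
    outcome = unit_fe + time_trend * period + true_effect + noise

    return pd.DataFrame({
        "unit": unit, "period": period, "treated": treated,
        "post": post, "outcome": outcome, "true_effect": true_effect,
    })


panel = simulate_did_panel(n_units=50, n_periods=4, treatment_effect=3.0,
                           noise_sd=0.0, seed=42)

# Difference of pre/post changes between treated and control cell means.
means = panel.groupby(["treated", "post"])["outcome"].mean()
did_estimate = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])
print(len(panel), round(did_estimate, 6))
```

With `noise_sd=0.0` the unit effects and the time trend cancel in the double difference, so the estimate equals the chosen effect up to floating-point error. With noise, it is only close to the true value.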
Indicator Creation
make_treatment_indicator
Create binary treatment indicator from categorical or numeric columns.
- diff_diff.make_treatment_indicator(data, column, treated_values=None, threshold=None, above_threshold=True, new_column='treated')[source]
Create a binary treatment indicator column from various input types.
- Parameters:
data (pd.DataFrame) – Input DataFrame.
column (str) – Name of the column to use for creating the treatment indicator.
treated_values (Any or list, optional) – Value(s) that indicate treatment. Units with these values get treatment=1, others get treatment=0.
threshold (float, optional) – Numeric threshold for creating treatment. Used when the treatment is based on a continuous variable (e.g., treat firms above median size).
above_threshold (bool, default=True) – If True, values >= threshold are treated. If False, values <= threshold are treated. Only used when threshold is specified.
new_column (str, default="treated") – Name of the new treatment indicator column.
- Returns:
DataFrame with the new treatment indicator column added.
- Return type:
pd.DataFrame
Examples
Create treatment from categorical variable:
>>> df = pd.DataFrame({'group': ['A', 'A', 'B', 'B'], 'y': [1, 2, 3, 4]})
>>> df = make_treatment_indicator(df, 'group', treated_values='A')
>>> df['treated'].tolist()
[1, 1, 0, 0]
Create treatment from numeric threshold:
>>> df = pd.DataFrame({'size': [10, 50, 100, 200], 'y': [1, 2, 3, 4]})
>>> df = make_treatment_indicator(df, 'size', threshold=75)
>>> df['treated'].tolist()
[0, 0, 1, 1]
Treat units below a threshold:
>>> df = make_treatment_indicator(df, 'size', threshold=75, above_threshold=False)
>>> df['treated'].tolist()
[1, 1, 0, 0]
Example
from diff_diff import make_treatment_indicator
# From categorical (the function returns the DataFrame with the new column added)
data = make_treatment_indicator(
    data,
    column='group',
    treated_values='treatment'
)
# From numeric threshold, written to a custom column name
data = make_treatment_indicator(
    data,
    column='exposure',
    threshold=0.5,
    new_column='high_exposure'
)
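The two rules described in the parameter list can be written directly in pandas, which makes the boundary behavior explicit: a value equal to the threshold counts as treated in both directions. This is a sketch of the documented semantics using plain pandas operations, not the function's actual source.

```python
import pandas as pd

df = pd.DataFrame({"group": ["A", "A", "B", "B"],
                   "size": [10, 50, 100, 200]})

# Categorical rule: rows whose value is among the treated values get 1.
treated_values = ["A"]
from_category = df["group"].isin(treated_values).astype(int)

# Threshold rule: >= threshold when above_threshold=True, <= threshold otherwise.
threshold = 75
above = (df["size"] >= threshold).astype(int)
below = (df["size"] <= threshold).astype(int)

print(from_category.tolist(), above.tolist(), below.tolist())
```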
make_post_indicator
Create post-treatment period indicator.
- diff_diff.make_post_indicator(data, time_column, post_periods=None, treatment_start=None, new_column='post')[source]
Create a binary post-treatment indicator column.
- Parameters:
data (pd.DataFrame) – Input DataFrame.
time_column (str) – Name of the time/period column.
post_periods (Any or list, optional) – Specific period value(s) that are post-treatment. Periods matching these values get post=1, others get post=0.
treatment_start (Any, optional) – The first post-treatment period. All periods >= this value get post=1. Works with numeric periods, strings (sorted alphabetically), or dates.
new_column (str, default="post") – Name of the new post indicator column.
- Returns:
DataFrame with the new post indicator column added.
- Return type:
pd.DataFrame
Examples
Using specific post periods:
>>> df = pd.DataFrame({'year': [2018, 2019, 2020, 2021], 'y': [1, 2, 3, 4]})
>>> df = make_post_indicator(df, 'year', post_periods=[2020, 2021])
>>> df['post'].tolist()
[0, 0, 1, 1]
Using treatment start:
>>> df = make_post_indicator(df, 'year', treatment_start=2020)
>>> df['post'].tolist()
[0, 0, 1, 1]
Works with date columns:
>>> df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-06-01', '2021-01-01'])})
>>> df = make_post_indicator(df, 'date', treatment_start='2020-06-01')
Example
from diff_diff import make_post_indicator
# Returns the DataFrame with a 'post' column added
data = make_post_indicator(
    data,
    time_column='period',
    treatment_start=5
)
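The difference between `post_periods` (an explicit list) and `treatment_start` (a cutoff) is easiest to see on dates and string periods. The following sketch reproduces both rules with plain pandas comparisons. It illustrates the documented behavior under the assumption that the cutoff is a simple `>=` comparison, and is not the library's code.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01"])})

# treatment_start rule: every period on or after the start date is post.
start = pd.Timestamp("2020-06-01")
from_start = (df["date"] >= start).astype(int)

# post_periods rule: only the listed periods are post.
post_periods = [pd.Timestamp("2021-01-01")]
from_list = df["date"].isin(post_periods).astype(int)

# String periods compare alphabetically, so labels must sort chronologically.
quarters = pd.Series(["2019Q4", "2020Q1", "2020Q2"])
from_strings = (quarters >= "2020Q1").astype(int)

print(from_start.tolist(), from_list.tolist(), from_strings.tolist())
```

Alphabetical ordering is the reason labels such as `"2020Q1"` work while labels such as `"Q1-2020"` would not.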
Panel Data Utilities
wide_to_long
Reshape wide panel data to long format.
- diff_diff.wide_to_long(data, value_columns, id_column, time_name='period', value_name='value', time_values=None)[source]
Convert wide-format panel data to long format for DiD analysis.
Wide format has one row per unit with multiple columns for each time period. Long format has one row per unit-period combination.
- Parameters:
data (pd.DataFrame) – Wide-format DataFrame with one row per unit.
value_columns (list of str) – Column names containing the outcome values for each period. These should be in chronological order.
id_column (str) – Column name for the unit identifier.
time_name (str, default="period") – Name for the new time period column.
value_name (str, default="value") – Name for the new value/outcome column.
time_values (list, optional) – Values to use for time periods. If None, uses 0, 1, 2, … Must have same length as value_columns.
- Returns:
Long-format DataFrame with one row per unit-period.
- Return type:
pd.DataFrame
Examples
>>> wide_df = pd.DataFrame({
...     'firm_id': [1, 2, 3],
...     'sales_2019': [100, 150, 200],
...     'sales_2020': [110, 160, 210],
...     'sales_2021': [120, 170, 220]
... })
>>> long_df = wide_to_long(
...     wide_df,
...     value_columns=['sales_2019', 'sales_2020', 'sales_2021'],
...     id_column='firm_id',
...     time_name='year',
...     value_name='sales',
...     time_values=[2019, 2020, 2021]
... )
>>> len(long_df)
9
>>> long_df.columns.tolist()
['firm_id', 'year', 'sales']
Example
from diff_diff import wide_to_long
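# Wide format: one row per unit, one column per time period
# unit_id, y_2019, y_2020, y_2021, y_2022
long_data = wide_to_long(
    wide_data,
    value_columns=['y_2019', 'y_2020', 'y_2021', 'y_2022'],
    id_column='unit_id',
    time_name='year',
    value_name='outcome',
    time_values=[2019, 2020, 2021, 2022]
)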
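For readers who know `DataFrame.melt`, the reshape can be pictured as a melt followed by relabeling the period column. The sketch below reproduces the docstring example that way. It assumes the result is ordered by unit and then by period, which the documentation does not state.

```python
import pandas as pd

wide_df = pd.DataFrame({
    "firm_id": [1, 2, 3],
    "sales_2019": [100, 150, 200],
    "sales_2020": [110, 160, 210],
    "sales_2021": [120, 170, 220],
})
value_columns = ["sales_2019", "sales_2020", "sales_2021"]
time_values = [2019, 2020, 2021]

# Stack the period columns into rows, one per unit-period pair.
long_df = wide_df.melt(id_vars="firm_id", value_vars=value_columns,
                       var_name="year", value_name="sales")

# Replace the original column names with the supplied time values.
long_df["year"] = long_df["year"].map(dict(zip(value_columns, time_values)))
long_df = long_df.sort_values(["firm_id", "year"]).reset_index(drop=True)

print(long_df)
```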
balance_panel
Balance panel data by filling or dropping incomplete observations.
- diff_diff.balance_panel(data, unit_column, time_column, method='inner', fill_value=None)[source]
Balance a panel dataset to ensure all units have all time periods.
- Parameters:
data (pd.DataFrame) – Unbalanced panel data.
unit_column (str) – Column name for unit identifier.
time_column (str) – Column name for time period.
method (str, default="inner") – Balancing method:
  - "inner": Keep only units that appear in all periods (drops units)
  - "outer": Include all unit-period combinations (creates NaN)
  - "fill": Include all combinations and fill missing values
fill_value (float, optional) – Value to fill missing observations when method="fill". If None with method="fill", uses column-specific forward fill.
- Returns:
Balanced panel DataFrame.
- Return type:
pd.DataFrame
Examples
Keep only complete units:
>>> df = pd.DataFrame({
...     'unit': [1, 1, 1, 2, 2, 3, 3, 3],
...     'period': [1, 2, 3, 1, 2, 1, 2, 3],
...     'y': [10, 11, 12, 20, 21, 30, 31, 32]
... })
>>> balanced = balance_panel(df, 'unit', 'period', method='inner')
>>> balanced['unit'].unique().tolist()
[1, 3]
Include all combinations:
>>> balanced = balance_panel(df, 'unit', 'period', method='outer')
>>> len(balanced)
9
Example
from diff_diff import balance_panel
# Include every unit-period combination; missing observations become NaN
balanced = balance_panel(
    data,
    unit_column='unit_id',
    time_column='period',
    method='outer'
)
# Or keep only units observed in every period
balanced = balance_panel(
    data,
    unit_column='unit_id',
    time_column='period',
    method='inner'
)
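The three balancing methods correspond to familiar pandas operations: filtering on a per-unit period count, reindexing onto the full unit-by-period grid, and filling the gaps that the grid exposes. The sketch below applies them to the docstring data. It is an illustration in plain pandas rather than the library's implementation, and it uses a constant fill of 0 as an example value.

```python
import pandas as pd

df = pd.DataFrame({
    "unit":   [1, 1, 1, 2, 2, 3, 3, 3],
    "period": [1, 2, 3, 1, 2, 1, 2, 3],
    "y":      [10, 11, 12, 20, 21, 30, 31, 32],
})
n_periods = df["period"].nunique()

# "inner": keep only units observed in every period.
counts = df.groupby("unit")["period"].transform("nunique")
inner = df[counts == n_periods]

# "outer": build the full unit x period grid; missing rows become NaN.
grid = pd.MultiIndex.from_product(
    [df["unit"].unique(), df["period"].unique()], names=["unit", "period"]
)
outer = df.set_index(["unit", "period"]).reindex(grid).reset_index()

# "fill" with a constant: same grid, then replace the NaN values.
filled = outer.fillna({"y": 0})

print(inner["unit"].unique().tolist(), len(outer), int(outer["y"].isna().sum()))
```

Dropping units changes the sample, while filling changes the data, so the choice should follow from why the observations are missing.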
Staggered Adoption Utilities
create_event_time
Create event-time column for staggered adoption designs.
- diff_diff.create_event_time(data, time_column, treatment_time_column, new_column='event_time')[source]
Create an event-time column relative to treatment timing.
Useful for event study designs where treatment occurs at different times for different units.
- Parameters:
data (pd.DataFrame) – Panel data.
time_column (str) – Name of the calendar time column.
treatment_time_column (str) – Name of the column indicating when each unit was treated. Units with NaN or infinity are considered never-treated.
new_column (str, default="event_time") – Name of the new event-time column.
- Returns:
DataFrame with event-time column added. Values are:
  - Negative for pre-treatment periods
  - 0 for the treatment period
  - Positive for post-treatment periods
  - NaN for never-treated units
- Return type:
pd.DataFrame
Examples
>>> df = pd.DataFrame({
...     'unit': [1, 1, 1, 2, 2, 2],
...     'year': [2018, 2019, 2020, 2018, 2019, 2020],
...     'treatment_year': [2019, 2019, 2019, 2020, 2020, 2020]
... })
>>> df = create_event_time(df, 'year', 'treatment_year')
>>> df['event_time'].tolist()
[-1, 0, 1, -2, -1, 0]
Example
from diff_diff import create_event_time
# Returns the DataFrame with an 'event_time' column added
data = create_event_time(
    data,
    time_column='period',
    treatment_time_column='first_treatment'
)
# event_time = period - first_treatment
# Negative values: pre-treatment
# Zero: treatment period
# Positive values: post-treatment
# NaN for never-treated
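The subtraction in the comments above can be checked by hand on a small panel that includes never-treated units. This sketch performs the calculation in plain pandas, first converting infinity-coded treatment times to NaN so that both never-treated encodings end up as NaN. That conversion step is an assumption about how the two encodings could be unified, not a description of the library's internals.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit": np.repeat([1, 2, 3, 4], 3),
    "year": [2018, 2019, 2020] * 4,
    # Units 3 and 4 are never treated, coded as NaN and infinity respectively.
    "treatment_year": [2019] * 3 + [2020] * 3 + [np.nan] * 3 + [np.inf] * 3,
})

# Treat infinity like NaN, then subtract; NaN propagates for never-treated units.
treat_time = df["treatment_year"].replace(np.inf, np.nan)
df["event_time"] = df["year"] - treat_time

print(df["event_time"].tolist())
```

Because the column contains NaN, the event times are stored as floats rather than integers in this sketch.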
aggregate_to_cohorts
Aggregate unit-level data to cohort means.
- diff_diff.aggregate_to_cohorts(data, unit_column, time_column, treatment_column, outcome, covariates=None)[source]
Aggregate unit-level data to treatment cohort means.
Useful for visualization and cohort-level analysis.
- Parameters:
data (pd.DataFrame) – Unit-level panel data.
unit_column (str) – Name of unit identifier column.
time_column (str) – Name of time period column.
treatment_column (str) – Name of treatment indicator column.
outcome (str) – Name of outcome variable column.
covariates (list of str, optional) – Additional columns to aggregate (will compute means).
- Returns:
Cohort-level data with mean outcomes by treatment status and period.
- Return type:
pd.DataFrame
Examples
>>> df = pd.DataFrame({
...     'unit': [1, 1, 2, 2, 3, 3, 4, 4],
...     'period': [0, 1, 0, 1, 0, 1, 0, 1],
...     'treated': [1, 1, 1, 1, 0, 0, 0, 0],
...     'y': [10, 15, 12, 17, 8, 10, 9, 11]
... })
>>> cohort_df = aggregate_to_cohorts(df, 'unit', 'period', 'treated', 'y')
>>> len(cohort_df)
4
Example
from diff_diff import aggregate_to_cohorts
cohort_data = aggregate_to_cohorts(
    data,
    unit_column='unit_id',
    time_column='period',
    treatment_column='treated',
    outcome='outcome'
)
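Cohort aggregation amounts to a group-by over treatment status and period. The sketch below computes the four cell means for the docstring data in plain pandas and then forms the 2x2 difference from them. The output layout (one row per treatment-period pair) is an assumption consistent with the documented length of 4, and the actual column names returned by the library may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "unit":    [1, 1, 2, 2, 3, 3, 4, 4],
    "period":  [0, 1, 0, 1, 0, 1, 0, 1],
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "y":       [10, 15, 12, 17, 8, 10, 9, 11],
})

# One row per (treatment status, period) with the mean outcome.
cohort_df = df.groupby(["treated", "period"], as_index=False)["y"].mean()

# The four cell means are all a 2x2 DiD needs.
m = cohort_df.set_index(["treated", "period"])["y"]
did = (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])

print(cohort_df)
print(did)
```

Here the treated group rises by 5 and the control group by 2, so the difference-in-differences is 3.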
Data Validation
validate_did_data
Validate data structure for DiD analysis.
- diff_diff.validate_did_data(data, outcome, treatment, time, unit=None, raise_on_error=True)[source]
Validate that data is properly formatted for DiD analysis.
Checks for common data issues and provides informative error messages.
- Parameters:
data (pd.DataFrame) – Data to validate.
outcome (str) – Name of outcome variable column.
treatment (str) – Name of treatment indicator column.
time (str) – Name of time/post indicator column.
unit (str, optional) – Name of unit identifier column (for panel data validation).
raise_on_error (bool, default=True) – If True, raises ValueError on validation failures. If False, returns validation results without raising.
- Returns:
Validation results with keys:
  - valid: bool indicating if data passed all checks
  - errors: list of error messages
  - warnings: list of warning messages
  - summary: dict with data summary statistics
- Return type:
dict
Examples
>>> df = pd.DataFrame({
...     'y': [1, 2, 3, 4],
...     'treated': [0, 0, 1, 1],
...     'post': [0, 1, 0, 1]
... })
>>> result = validate_did_data(df, 'y', 'treated', 'post', raise_on_error=False)
>>> result['valid']
True
Example
from diff_diff import validate_did_data
# With raise_on_error=False the function returns a results dict instead of raising
result = validate_did_data(
    data,
    outcome='outcome',
    treatment='treated',
    time='post',
    unit='unit_id',
    raise_on_error=False
)
if not result['valid']:
    for error in result['errors']:
        print(f"Issue: {error}")
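To make the shape of the result concrete, here is a small stand-alone checker that returns a dict with the same four keys. The function `check_did_frame` and the specific checks it runs (required columns, binary indicators, all four treatment-time cells present, missing outcomes as a warning) are hypothetical. The documentation does not list which checks the library performs.

```python
import pandas as pd


def check_did_frame(data, outcome, treatment, time):
    """Hypothetical checks; returns a dict shaped like the documented result."""
    errors, warnings = [], []

    missing = [c for c in (outcome, treatment, time) if c not in data.columns]
    if missing:
        errors.append(f"Missing columns: {missing}")
    else:
        # Both indicators should be binary 0/1.
        for col in (treatment, time):
            if not set(data[col].dropna().unique()) <= {0, 1}:
                errors.append(f"Column '{col}' must contain only 0 and 1")
        if data[outcome].isna().any():
            warnings.append(f"Column '{outcome}' has missing values")
        # A 2x2 design needs observations in all four treatment-time cells.
        if not errors and data.groupby([treatment, time]).ngroups < 4:
            errors.append("Not every treatment-time cell has observations")

    summary = {"n_obs": len(data)}
    return {"valid": not errors, "errors": errors,
            "warnings": warnings, "summary": summary}


good = pd.DataFrame({"y": [1, 2, 3, 4], "treated": [0, 0, 1, 1], "post": [0, 1, 0, 1]})
bad = good.assign(treated=[0, 0, 2, 2])

good_result = check_did_frame(good, "y", "treated", "post")
bad_result = check_did_frame(bad, "y", "treated", "post")
print(good_result["valid"], bad_result["errors"])
```

Returning a dict rather than raising lets a pipeline collect every problem in one pass, which is the use case `raise_on_error=False` appears to serve.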
summarize_did_data
Generate summary statistics for DiD data.
- diff_diff.summarize_did_data(data, outcome, treatment, time, unit=None)[source]
Generate summary statistics by treatment group and time period.
- Parameters:
data (pd.DataFrame) – Input data.
outcome (str) – Name of outcome variable column.
treatment (str) – Name of treatment indicator column.
time (str) – Name of time/post indicator column.
unit (str, optional) – Name of unit identifier column.
- Returns:
Summary statistics with columns for each treatment-time combination.
- Return type:
pd.DataFrame
Examples
>>> df = pd.DataFrame({
...     'y': [10, 11, 12, 13, 20, 21, 22, 23],
...     'treated': [0, 0, 1, 1, 0, 0, 1, 1],
...     'post': [0, 1, 0, 1, 0, 1, 0, 1]
... })
>>> summary = summarize_did_data(df, 'y', 'treated', 'post')
>>> print(summary)
Example
from diff_diff import summarize_did_data
summary = summarize_did_data(
    data,
    outcome='outcome',
    treatment='treated',
    time='post',
    unit='unit_id'
)
# The result is a DataFrame of statistics for each treatment-time combination
print(summary)
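A summary by treatment group and time period can be approximated with a pandas group-by, which also makes the raw difference-in-differences easy to read off. The statistics chosen below (count, mean, standard deviation) and the table layout are assumptions, since the documentation does not specify the exact columns the library returns.

```python
import pandas as pd

df = pd.DataFrame({
    "y":       [10, 11, 12, 13, 20, 21, 22, 23],
    "treated": [0, 0, 1, 1, 0, 0, 1, 1],
    "post":    [0, 1, 0, 1, 0, 1, 0, 1],
})

# Count, mean and standard deviation of the outcome in each 2x2 cell.
summary = df.groupby(["treated", "post"])["y"].agg(["count", "mean", "std"])
print(summary)

# Difference of pre/post changes between the two groups.
means = summary["mean"]
raw_did = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])
print(raw_did)
```

In this toy data both groups rise by one unit between periods, so the raw difference-in-differences is zero.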
Control Unit Selection
rank_control_units
Rank control units by suitability for DiD or synthetic control.
- diff_diff.rank_control_units(data, unit_column, time_column, outcome_column, treatment_column=None, treated_units=None, pre_periods=None, covariates=None, outcome_weight=0.7, covariate_weight=0.3, exclude_units=None, require_units=None, n_top=None, suggest_treatment_candidates=False, n_treatment_candidates=5, lambda_reg=0.0)[source]
Rank potential control units by their suitability for DiD analysis.
Evaluates control units based on pre-treatment outcome trend similarity and optional covariate matching to treated units. Returns a ranked list with quality scores.
- Parameters:
data (pd.DataFrame) – Panel data in long format.
unit_column (str) – Column name for unit identifier.
time_column (str) – Column name for time periods.
outcome_column (str) – Column name for outcome variable.
treatment_column (str, optional) – Column with binary treatment indicator (0/1). Used to identify treated units from data.
treated_units (list, optional) – Explicit list of treated unit IDs. Alternative to treatment_column.
pre_periods (list, optional) – Pre-treatment periods for comparison. If None, uses first half of periods.
covariates (list of str, optional) – Covariate columns for matching. Similarity is based on pre-treatment means.
outcome_weight (float, default=0.7) – Weight for pre-treatment outcome trend similarity (0-1).
covariate_weight (float, default=0.3) – Weight for covariate distance (0-1). Ignored if no covariates.
exclude_units (list, optional) – Units that cannot be in control group.
require_units (list, optional) – Units that must be in control group (will always appear in output).
n_top (int, optional) – Return only top N control units. If None, return all.
suggest_treatment_candidates (bool, default=False) – If True and no treated units specified, identify potential treatment candidates instead of ranking controls.
n_treatment_candidates (int, default=5) – Number of treatment candidates to suggest.
lambda_reg (float, default=0.0) – Regularization for synthetic weights. Higher values give more uniform weights across controls.
- Returns:
Ranked control units with columns:
  - unit: Unit identifier
  - quality_score: Combined quality score (0-1, higher is better)
  - outcome_trend_score: Pre-treatment outcome trend similarity
  - covariate_score: Covariate match score (NaN if no covariates)
  - synthetic_weight: Weight from synthetic control optimization
  - pre_trend_rmse: RMSE of pre-treatment outcome vs treated mean
  - is_required: Whether unit was in require_units
If suggest_treatment_candidates=True (and no treated units):
  - unit: Unit identifier
  - treatment_candidate_score: Suitability as treatment unit
  - avg_outcome_level: Pre-treatment outcome mean
  - outcome_trend: Pre-treatment trend slope
  - n_similar_controls: Count of similar potential controls
- Return type:
pd.DataFrame
Examples
Rank controls against treated units:
>>> data = generate_did_data(n_units=30, n_periods=6, seed=42)
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     n_top=10
... )
>>> ranking['quality_score'].is_monotonic_decreasing
True
With covariates:
>>> data['size'] = np.random.randn(len(data))
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     covariates=['size']
... )
Filter data for SyntheticDiD:
>>> top_controls = ranking['unit'].tolist()
>>> filtered = data[(data['treated'] == 1) | (data['unit'].isin(top_controls))]
Example
from diff_diff import rank_control_units
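ranked = rank_control_units(
    data,
    unit_column='unit_id',
    time_column='period',
    outcome_column='outcome',
    treatment_column='treated',
    pre_periods=[0, 1, 2, 3],  # list of pre-treatment periods (here assumed 0-indexed)
    n_top=10  # return only the 10 highest-ranked controls
)
# Unit IDs of the selected controls (the output column is named 'unit')
best_controls = ranked['unit'].tolist()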
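Of the output columns, `pre_trend_rmse` is the simplest to reproduce: the root-mean-square gap between each control's pre-treatment outcomes and the treated group's mean path. The sketch below computes it on a toy panel and ranks controls by it. It covers that one column only. How the library combines it with trend similarity, covariates, and synthetic weights into `quality_score` is not documented here, so nothing below should be read as that formula.

```python
import numpy as np
import pandas as pd

periods = [0, 1, 2, 3]
panel = pd.DataFrame({
    "unit":    np.repeat(["T", "C1", "C2", "C3"], len(periods)),
    "period":  periods * 4,
    "treated": [1] * 4 + [0] * 12,
    "outcome": [1.0, 2.0, 3.0, 4.0,    # treated unit
                1.1, 2.1, 3.1, 4.1,    # C1: close to the treated path
                5.0, 5.0, 5.0, 5.0,    # C2: flat
                2.0, 3.0, 4.0, 5.0],   # C3: parallel but shifted by 1
})
pre_periods = [0, 1, 2, 3]

pre = panel[panel["period"].isin(pre_periods)]
treated_mean = pre[pre["treated"] == 1].groupby("period")["outcome"].mean()

# Wide matrix of control outcomes: one row per control, one column per period.
controls = pre[pre["treated"] == 0].pivot(index="unit", columns="period", values="outcome")

# RMSE between each control's pre-treatment path and the treated mean.
rmse = np.sqrt(((controls - treated_mean) ** 2).mean(axis=1))
ranking = rmse.sort_values().rename("pre_trend_rmse").reset_index()

print(ranking)
```

Note that C3 has exactly the same trend as the treated unit yet ranks below C1 because of its level shift. A level-based RMSE therefore differs from a pure trend-similarity measure, which is presumably why the output reports `outcome_trend_score` separately.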