Data Preparation

Utilities for preparing and validating data for DiD analysis.

Data Generation

generate_did_data

Generate synthetic data with known treatment effects for testing.

diff_diff.generate_did_data(n_units=100, n_periods=4, treatment_effect=5.0, treatment_fraction=0.5, treatment_period=2, unit_fe_sd=2.0, time_trend=0.5, noise_sd=1.0, seed=None)[source]

Generate synthetic data for DiD analysis with known treatment effect.

Creates a balanced panel dataset with realistic features including unit fixed effects, time trends, and a known treatment effect.

Parameters:
  • n_units (int, default=100) – Number of units in the panel.

  • n_periods (int, default=4) – Number of time periods.

  • treatment_effect (float, default=5.0) – True average treatment effect on the treated.

  • treatment_fraction (float, default=0.5) – Fraction of units that receive treatment.

  • treatment_period (int, default=2) – First post-treatment period (0-indexed). Periods >= this are post.

  • unit_fe_sd (float, default=2.0) – Standard deviation of unit fixed effects.

  • time_trend (float, default=0.5) – Linear time trend coefficient.

  • noise_sd (float, default=1.0) – Standard deviation of idiosyncratic noise.

  • seed (int, optional) – Random seed for reproducibility.

Returns:

Synthetic panel data with columns:
  • unit – Unit identifier
  • period – Time period
  • treated – Treatment indicator (0/1)
  • post – Post-treatment indicator (0/1)
  • outcome – Outcome variable
  • true_effect – The true treatment effect (for validation)

Return type:

pd.DataFrame

Examples

Generate simple data for testing:

>>> data = generate_did_data(n_units=50, n_periods=4, treatment_effect=3.0, seed=42)
>>> len(data)
200
>>> data.columns.tolist()
['unit', 'period', 'treated', 'post', 'outcome', 'true_effect']

Verify treatment effect recovery:

>>> from diff_diff import DifferenceInDifferences
>>> did = DifferenceInDifferences()
>>> results = did.fit(data, outcome='outcome', treatment='treated', time='post')
>>> abs(results.att - 3.0) < 1.0  # Close to true effect
True
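The parameter list suggests an additive data-generating process: a unit fixed effect, a linear time trend, a treatment effect that switches on for treated units in post periods, and idiosyncratic noise. The following is a minimal sketch under that assumption. `sketch_did_panel` is a hypothetical stand-in written with NumPy and pandas, not the library's implementation, and the rule that the first units are the treated ones is also an assumption.

```python
import numpy as np
import pandas as pd


def sketch_did_panel(n_units=100, n_periods=4, treatment_effect=5.0,
                     treatment_fraction=0.5, treatment_period=2,
                     unit_fe_sd=2.0, time_trend=0.5, noise_sd=1.0, seed=None):
    """Hypothetical stand-in for generate_did_data (additive form is assumed)."""
    rng = np.random.default_rng(seed)
    unit = np.repeat(np.arange(n_units), n_periods)   # each unit repeated per period
    period = np.tile(np.arange(n_periods), n_units)   # 0 .. n_periods - 1 for every unit
    n_treated = int(n_units * treatment_fraction)
    treated = (unit < n_treated).astype(int)          # assumption: first units are treated
    post = (period >= treatment_period).astype(int)   # periods >= treatment_period are post
    unit_fe = rng.normal(0.0, unit_fe_sd, n_units)[unit]
    noise = rng.normal(0.0, noise_sd, unit.size)
    true_effect = treatment_effect * treated * post
    outcome = unit_fe + time_trend * period + true_effect + noise
    return pd.DataFrame({
        "unit": unit, "period": period, "treated": treated,
        "post": post, "outcome": outcome, "true_effect": true_effect,
    })


panel = sketch_did_panel(n_units=50, n_periods=4, treatment_effect=3.0, seed=42)

# 2x2 DiD by hand: (treated post - treated pre) - (control post - control pre)
means = panel.groupby(["treated", "post"])["outcome"].mean()
did_estimate = ((means.loc[(1, 1)] - means.loc[(1, 0)])
                - (means.loc[(0, 1)] - means.loc[(0, 0)]))
print(len(panel), round(did_estimate, 2))
```

Because the panel is balanced, unit fixed effects and the common time trend cancel in the double difference, so the hand-computed estimate should land near the chosen `treatment_effect`.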

Example

from diff_diff import generate_did_data

# Generate a balanced 100-unit, 10-period panel; periods 5-9 are post-treatment
data = generate_did_data(
    n_units=100,
    n_periods=10,
    treatment_effect=5.0,
    treatment_period=5,
    treatment_fraction=0.5,
    noise_sd=1.0
)

print(data.head())
# Columns: unit, period, treated, post, outcome, true_effect

Indicator Creation

make_treatment_indicator

Create binary treatment indicator from categorical or numeric columns.

diff_diff.make_treatment_indicator(data, column, treated_values=None, threshold=None, above_threshold=True, new_column='treated')[source]

Create a binary treatment indicator column from various input types.

Parameters:
  • data (pd.DataFrame) – Input DataFrame.

  • column (str) – Name of the column to use for creating the treatment indicator.

  • treated_values (Any or list, optional) – Value(s) that indicate treatment. Units with these values get treatment=1, others get treatment=0.

  • threshold (float, optional) – Numeric threshold for creating treatment. Used when the treatment is based on a continuous variable (e.g., treat firms above median size).

  • above_threshold (bool, default=True) – If True, values >= threshold are treated. If False, values <= threshold are treated. Only used when threshold is specified.

  • new_column (str, default="treated") – Name of the new treatment indicator column.

Returns:

DataFrame with the new treatment indicator column added.

Return type:

pd.DataFrame

Examples

Create treatment from categorical variable:

>>> df = pd.DataFrame({'group': ['A', 'A', 'B', 'B'], 'y': [1, 2, 3, 4]})
>>> df = make_treatment_indicator(df, 'group', treated_values='A')
>>> df['treated'].tolist()
[1, 1, 0, 0]

Create treatment from numeric threshold:

>>> df = pd.DataFrame({'size': [10, 50, 100, 200], 'y': [1, 2, 3, 4]})
>>> df = make_treatment_indicator(df, 'size', threshold=75)
>>> df['treated'].tolist()
[0, 0, 1, 1]

Treat units below a threshold:

>>> df = make_treatment_indicator(df, 'size', threshold=75, above_threshold=False)
>>> df['treated'].tolist()
[1, 1, 0, 0]
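The two rules described above (membership in `treated_values`, or comparison against `threshold`) can be sketched in plain pandas. This is an illustration of the documented behavior, not the library's code. The sample data is made up, and it includes a value exactly at the threshold to show that, as documented, it counts as treated in both directions.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "B", "C"],
    "size": [10, 75, 200],
})

# Categorical rule: rows whose value is in the treated set get 1
by_value = df["group"].isin(["A", "C"]).astype(int)

# Threshold rule with above_threshold=True: values >= threshold are treated
above = (df["size"] >= 75).astype(int)

# Threshold rule with above_threshold=False: values <= threshold are treated
below = (df["size"] <= 75).astype(int)

print(by_value.tolist(), above.tolist(), below.tolist())
```

Since both comparisons are inclusive, the `above_threshold=False` indicator is not the complement of the default one when a value equals the threshold.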

Example

from diff_diff import make_treatment_indicator

# From categorical: returns a DataFrame with a new 'treated' column
data = make_treatment_indicator(
    data,
    column='group',
    treated_values='treatment'
)

# From numeric threshold: write the indicator to a custom column
data = make_treatment_indicator(
    data,
    column='exposure',
    threshold=0.5,
    new_column='high_exposure'
)

make_post_indicator

Create post-treatment period indicator.

diff_diff.make_post_indicator(data, time_column, post_periods=None, treatment_start=None, new_column='post')[source]

Create a binary post-treatment indicator column.

Parameters:
  • data (pd.DataFrame) – Input DataFrame.

  • time_column (str) – Name of the time/period column.

  • post_periods (Any or list, optional) – Specific period value(s) that are post-treatment. Periods matching these values get post=1, others get post=0.

  • treatment_start (Any, optional) – The first post-treatment period. All periods >= this value get post=1. Works with numeric periods, strings (sorted alphabetically), or dates.

  • new_column (str, default="post") – Name of the new post indicator column.

Returns:

DataFrame with the new post indicator column added.

Return type:

pd.DataFrame

Examples

Using specific post periods:

>>> df = pd.DataFrame({'year': [2018, 2019, 2020, 2021], 'y': [1, 2, 3, 4]})
>>> df = make_post_indicator(df, 'year', post_periods=[2020, 2021])
>>> df['post'].tolist()
[0, 0, 1, 1]

Using treatment start:

>>> df = make_post_indicator(df, 'year', treatment_start=2020)
>>> df['post'].tolist()
[0, 0, 1, 1]

Works with date columns:

>>> df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-06-01', '2021-01-01'])})
>>> df = make_post_indicator(df, 'date', treatment_start='2020-06-01')
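Because string periods are compared alphabetically, labels without zero padding can sort differently from their intended chronological order. The sketch below reproduces the `>=` comparison in plain pandas, with made-up period labels, to show the pitfall. It illustrates string ordering in general and is not the library's code.

```python
import pandas as pd

# Unpadded labels: 'P10' sorts before 'P2' alphabetically
unpadded = pd.Series(["P1", "P2", "P10"])
post_unpadded = (unpadded >= "P2").astype(int)

# Zero-padded labels sort in the intended chronological order
padded = pd.Series(["P01", "P02", "P10"])
post_padded = (padded >= "P02").astype(int)

print(post_unpadded.tolist())  # the last period is wrongly marked as pre-treatment
print(post_padded.tolist())
```

Zero-padding string labels, or converting them to numeric or datetime values first, avoids the problem.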

Example

from diff_diff import make_post_indicator

# Returns a DataFrame with a new 'post' column (1 where period >= 5)
data = make_post_indicator(
    data,
    time_column='period',
    treatment_start=5
)

Panel Data Utilities

wide_to_long

Reshape wide panel data to long format.

diff_diff.wide_to_long(data, value_columns, id_column, time_name='period', value_name='value', time_values=None)[source]

Convert wide-format panel data to long format for DiD analysis.

Wide format has one row per unit with multiple columns for each time period. Long format has one row per unit-period combination.

Parameters:
  • data (pd.DataFrame) – Wide-format DataFrame with one row per unit.

  • value_columns (list of str) – Column names containing the outcome values for each period. These should be in chronological order.

  • id_column (str) – Column name for the unit identifier.

  • time_name (str, default="period") – Name for the new time period column.

  • value_name (str, default="value") – Name for the new value/outcome column.

  • time_values (list, optional) – Values to use for time periods. If None, uses 0, 1, 2, … Must have same length as value_columns.

Returns:

Long-format DataFrame with one row per unit-period.

Return type:

pd.DataFrame

Examples

>>> wide_df = pd.DataFrame({
...     'firm_id': [1, 2, 3],
...     'sales_2019': [100, 150, 200],
...     'sales_2020': [110, 160, 210],
...     'sales_2021': [120, 170, 220]
... })
>>> long_df = wide_to_long(
...     wide_df,
...     value_columns=['sales_2019', 'sales_2020', 'sales_2021'],
...     id_column='firm_id',
...     time_name='year',
...     value_name='sales',
...     time_values=[2019, 2020, 2021]
... )
>>> len(long_df)
9
>>> long_df.columns.tolist()
['firm_id', 'year', 'sales']
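For readers who want to see the reshape itself, roughly the same result can be produced with `pandas.DataFrame.melt`. This is a sketch of the transformation, not the library's implementation. The intermediate column name `source_column` is a placeholder chosen for this example.

```python
import pandas as pd

wide_df = pd.DataFrame({
    "firm_id": [1, 2, 3],
    "sales_2019": [100, 150, 200],
    "sales_2020": [110, 160, 210],
    "sales_2021": [120, 170, 220],
})
value_columns = ["sales_2019", "sales_2020", "sales_2021"]
time_values = [2019, 2020, 2021]

# Stack the per-period columns into one row per unit-period
long_df = wide_df.melt(
    id_vars="firm_id",
    value_vars=value_columns,
    var_name="source_column",
    value_name="sales",
)

# Map each original column name to its time value, in the order given
long_df["year"] = long_df["source_column"].map(dict(zip(value_columns, time_values)))
long_df = (
    long_df[["firm_id", "year", "sales"]]
    .sort_values(["firm_id", "year"])
    .reset_index(drop=True)
)
print(long_df)
```

The mapping step is why `value_columns` and `time_values` must have the same length and the same order.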

Example

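from diff_diff import wide_to_long

# Wide format: one row per unit, one outcome column per time period
# unit_id, y_2019, y_2020, y_2021, y_2022
long_data = wide_to_long(
    wide_data,
    value_columns=['y_2019', 'y_2020', 'y_2021', 'y_2022'],
    id_column='unit_id',
    time_name='year',
    value_name='outcome',
    time_values=[2019, 2020, 2021, 2022]
)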

balance_panel

Balance panel data by filling or dropping incomplete observations.

diff_diff.balance_panel(data, unit_column, time_column, method='inner', fill_value=None)[source]

Balance a panel dataset to ensure all units have all time periods.

Parameters:
  • data (pd.DataFrame) – Unbalanced panel data.

  • unit_column (str) – Column name for unit identifier.

  • time_column (str) – Column name for time period.

  • method (str, default="inner") – Balancing method:
      - "inner": Keep only units that appear in all periods (drops units).
      - "outer": Include all unit-period combinations (creates NaN).
      - "fill": Include all combinations and fill missing values.

  • fill_value (float, optional) – Value to fill missing observations when method="fill". If None with method="fill", uses column-specific forward fill.

Returns:

Balanced panel DataFrame.

Return type:

pd.DataFrame

Examples

Keep only complete units:

>>> df = pd.DataFrame({
...     'unit': [1, 1, 1, 2, 2, 3, 3, 3],
...     'period': [1, 2, 3, 1, 2, 1, 2, 3],
...     'y': [10, 11, 12, 20, 21, 30, 31, 32]
... })
>>> balanced = balance_panel(df, 'unit', 'period', method='inner')
>>> balanced['unit'].unique().tolist()
[1, 3]

Include all combinations:

>>> balanced = balance_panel(df, 'unit', 'period', method='outer')
>>> len(balanced)
9
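The three methods can be sketched in plain pandas as follows. This is an illustration of the documented behavior rather than the library's code. In particular, applying the forward fill within each unit is an assumption about how the "column-specific forward fill" works.

```python
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 3, 3, 3],
    "period": [1, 2, 3, 1, 2, 1, 2, 3],
    "y": [10, 11, 12, 20, 21, 30, 31, 32],
})

# "inner": keep only units observed in every period
n_periods = df["period"].nunique()
periods_per_unit = df.groupby("unit")["period"].nunique()
complete_units = periods_per_unit[periods_per_unit == n_periods].index
inner = df[df["unit"].isin(complete_units)]

# "outer": build every unit-period pair; missing rows get NaN
full_index = pd.MultiIndex.from_product(
    [df["unit"].unique(), df["period"].unique()], names=["unit", "period"]
)
outer = df.set_index(["unit", "period"]).reindex(full_index).reset_index()

# "fill" without fill_value: forward fill within each unit (assumed behavior)
filled = outer.copy()
filled["y"] = filled.groupby("unit")["y"].ffill()

print(inner["unit"].unique().tolist(), len(outer), filled["y"].tolist())
```

Note that forward filling carries the last observed outcome into the missing period, which may be inappropriate for outcomes that are expected to change after treatment.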

Example

from diff_diff import balance_panel

# Add every missing unit-period row, leaving NaN in the new rows
balanced = balance_panel(
    data,
    unit_column='unit_id',
    time_column='period',
    method='outer'
)

# Or drop units with missing periods
balanced = balance_panel(
    data,
    unit_column='unit_id',
    time_column='period',
    method='inner'
)

Staggered Adoption Utilities

create_event_time

Create event-time column for staggered adoption designs.

diff_diff.create_event_time(data, time_column, treatment_time_column, new_column='event_time')[source]

Create an event-time column relative to treatment timing.

Useful for event study designs where treatment occurs at different times for different units.

Parameters:
  • data (pd.DataFrame) – Panel data.

  • time_column (str) – Name of the calendar time column.

  • treatment_time_column (str) – Name of the column indicating when each unit was treated. Units with NaN or infinity are considered never-treated.

  • new_column (str, default="event_time") – Name of the new event-time column.

Returns:

DataFrame with event-time column added. Values are:
  • Negative for pre-treatment periods
  • 0 for the treatment period
  • Positive for post-treatment periods
  • NaN for never-treated units

Return type:

pd.DataFrame

Examples

>>> df = pd.DataFrame({
...     'unit': [1, 1, 1, 2, 2, 2],
...     'year': [2018, 2019, 2020, 2018, 2019, 2020],
...     'treatment_year': [2019, 2019, 2019, 2020, 2020, 2020]
... })
>>> df = create_event_time(df, 'year', 'treatment_year')
>>> df['event_time'].tolist()
[-1, 0, 1, -2, -1, 0]
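The doctest above has no never-treated units. The following plain-pandas sketch shows the documented rule (calendar time minus treatment time, with NaN or infinite treatment times mapped to NaN) on made-up data that includes both encodings of "never treated". It is an illustration, not the library's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 2, 2, 3, 3],
    "year": [2019, 2020, 2019, 2020, 2019, 2020],
    # unit 1 treated in 2020; units 2 and 3 never treated (NaN and inf encodings)
    "treatment_year": [2020, 2020, np.nan, np.nan, np.inf, np.inf],
})

# Treat infinite treatment times like missing ones, then subtract
treatment_time = df["treatment_year"].replace([np.inf, -np.inf], np.nan)
df["event_time"] = df["year"] - treatment_time

print(df["event_time"].tolist())
```

Because of the NaN values, the event-time column is stored as float in this sketch even though the treated rows hold whole numbers.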

Example

from diff_diff import create_event_time

# Returns a DataFrame with a new 'event_time' column
data = create_event_time(
    data,
    time_column='period',
    treatment_time_column='first_treatment'
)

# event_time = period - first_treatment
# Negative values: pre-treatment
# Zero: treatment period
# Positive values: post-treatment
# NaN for never-treated

aggregate_to_cohorts

Aggregate unit-level data to cohort means.

diff_diff.aggregate_to_cohorts(data, unit_column, time_column, treatment_column, outcome, covariates=None)[source]

Aggregate unit-level data to treatment cohort means.

Useful for visualization and cohort-level analysis.

Parameters:
  • data (pd.DataFrame) – Unit-level panel data.

  • unit_column (str) – Name of unit identifier column.

  • time_column (str) – Name of time period column.

  • treatment_column (str) – Name of treatment indicator column.

  • outcome (str) – Name of outcome variable column.

  • covariates (list of str, optional) – Additional columns to aggregate (will compute means).

Returns:

Cohort-level data with mean outcomes by treatment status and period.

Return type:

pd.DataFrame

Examples

>>> df = pd.DataFrame({
...     'unit': [1, 1, 2, 2, 3, 3, 4, 4],
...     'period': [0, 1, 0, 1, 0, 1, 0, 1],
...     'treated': [1, 1, 1, 1, 0, 0, 0, 0],
...     'y': [10, 15, 12, 17, 8, 10, 9, 11]
... })
>>> cohort_df = aggregate_to_cohorts(df, 'unit', 'period', 'treated', 'y')
>>> len(cohort_df)
4
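The four rows in the doctest correspond to the four treatment-by-period cells. A plain-pandas `groupby` gives the same cell means, which is a quick way to check the aggregated values by hand. This sketch shows the aggregation only, and the column layout of the library's output may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 2, 2, 3, 3, 4, 4],
    "period": [0, 1, 0, 1, 0, 1, 0, 1],
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "y": [10, 15, 12, 17, 8, 10, 9, 11],
})

# Mean outcome for each (treatment status, period) cell
cohort_df = df.groupby(["treated", "period"], as_index=False)["y"].mean()
print(cohort_df)

# The cell means are enough to compute the 2x2 DiD estimate
cell = cohort_df.set_index(["treated", "period"])["y"]
did = (cell.loc[(1, 1)] - cell.loc[(1, 0)]) - (cell.loc[(0, 1)] - cell.loc[(0, 0)])
print(did)
```

Here the treated mean rises from 11 to 16 and the control mean from 8.5 to 10.5, so the double difference is 3.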

Example

from diff_diff import aggregate_to_cohorts

# Mean outcome by treatment status and period
cohort_data = aggregate_to_cohorts(
    data,
    unit_column='unit_id',
    time_column='period',
    treatment_column='treated',
    outcome='outcome'
)

Data Validation

validate_did_data

Validate data structure for DiD analysis.

diff_diff.validate_did_data(data, outcome, treatment, time, unit=None, raise_on_error=True)[source]

Validate that data is properly formatted for DiD analysis.

Checks for common data issues and provides informative error messages.

Parameters:
  • data (pd.DataFrame) – Data to validate.

  • outcome (str) – Name of outcome variable column.

  • treatment (str) – Name of treatment indicator column.

  • time (str) – Name of time/post indicator column.

  • unit (str, optional) – Name of unit identifier column (for panel data validation).

  • raise_on_error (bool, default=True) – If True, raises ValueError on validation failures. If False, returns validation results without raising.

Returns:

Validation results with keys:
  • valid – bool indicating if data passed all checks
  • errors – list of error messages
  • warnings – list of warning messages
  • summary – dict with data summary statistics

Return type:

dict

Examples

>>> df = pd.DataFrame({
...     'y': [1, 2, 3, 4],
...     'treated': [0, 0, 1, 1],
...     'post': [0, 1, 0, 1]
... })
>>> result = validate_did_data(df, 'y', 'treated', 'post', raise_on_error=False)
>>> result['valid']
True
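The documentation does not list the individual checks, so the sketch below is only an illustration of how a validator with this return shape might be structured. `sketch_validate`, the three checks it runs, and the `n_obs` summary key are all assumptions made for this example. They are not the library's actual checks or keys.

```python
import pandas as pd


def sketch_validate(data, outcome, treatment, time):
    """Hypothetical validator returning the documented dict shape."""
    errors, warnings = [], []

    # Check 1 (assumed): all named columns must exist
    missing = [c for c in (outcome, treatment, time) if c not in data.columns]
    if missing:
        errors.append(f"Missing columns: {missing}")
    else:
        # Check 2 (assumed): indicator columns contain only 0/1
        for col in (treatment, time):
            if not set(data[col].dropna().unique()) <= {0, 1}:
                errors.append(f"Column '{col}' must be binary (0/1)")
        # Check 3 (assumed): missing outcomes produce a warning, not an error
        if data[outcome].isna().any():
            warnings.append(f"Column '{outcome}' contains missing values")

    return {
        "valid": not errors,
        "errors": errors,
        "warnings": warnings,
        "summary": {"n_obs": len(data)},  # hypothetical summary key
    }


good = pd.DataFrame({"y": [1, 2, 3, 4], "treated": [0, 0, 1, 1], "post": [0, 1, 0, 1]})
bad = pd.DataFrame({"y": [1, 2, 3, 4], "treated": [0, 0, 2, 2], "post": [0, 1, 0, 1]})

good_result = sketch_validate(good, "y", "treated", "post")
bad_result = sketch_validate(bad, "y", "treated", "post")
print(good_result["valid"], bad_result["errors"])
```

Collecting messages in lists and deciding at the end whether to raise is what makes a `raise_on_error` switch straightforward to support.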

Example

from diff_diff import validate_did_data

# Pass raise_on_error=False to inspect the results instead of raising ValueError
result = validate_did_data(
    data,
    outcome='outcome',
    treatment='treated',
    time='post',
    unit='unit_id',
    raise_on_error=False
)

if not result['valid']:
    for error in result['errors']:
        print(f"Issue: {error}")

summarize_did_data

Generate summary statistics for DiD data.

diff_diff.summarize_did_data(data, outcome, treatment, time, unit=None)[source]

Generate summary statistics by treatment group and time period.

Parameters:
  • data (pd.DataFrame) – Input data.

  • outcome (str) – Name of outcome variable column.

  • treatment (str) – Name of treatment indicator column.

  • time (str) – Name of time/period column.

  • unit (str, optional) – Name of unit identifier column.

Returns:

Summary statistics with columns for each treatment-time combination.

Return type:

pd.DataFrame

Examples

>>> df = pd.DataFrame({
...     'y': [10, 11, 12, 13, 20, 21, 22, 23],
...     'treated': [0, 0, 1, 1, 0, 0, 1, 1],
...     'post': [0, 1, 0, 1, 0, 1, 0, 1]
... })
>>> summary = summarize_did_data(df, 'y', 'treated', 'post')
>>> print(summary)
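Since the doctest does not show the printed output, a plain-pandas cross-tabulation of the same data can serve as a reference for what a treatment-by-time summary contains. The layout below (rows for treatment status, columns for period) is a choice made for this sketch, and the library's output format may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "y": [10, 11, 12, 13, 20, 21, 22, 23],
    "treated": [0, 0, 1, 1, 0, 0, 1, 1],
    "post": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Mean outcome in each treatment-by-period cell
cell_means = df.pivot_table(index="treated", columns="post", values="y", aggfunc="mean")
print(cell_means)

# Raw difference-in-differences implied by the cell means
raw_did = ((cell_means.loc[1, 1] - cell_means.loc[1, 0])
           - (cell_means.loc[0, 1] - cell_means.loc[0, 0]))
print(raw_did)
```

In this data both groups increase by exactly 1 between periods, so the raw double difference is zero.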

Example

from diff_diff import summarize_did_data

summary = summarize_did_data(
    data,
    outcome='outcome',
    treatment='treated',
    time='post',
    unit='unit_id'
)

# summary is a DataFrame of statistics by treatment group and time period
print(summary)

Control Unit Selection

rank_control_units

Rank control units by suitability for DiD or synthetic control.

diff_diff.rank_control_units(data, unit_column, time_column, outcome_column, treatment_column=None, treated_units=None, pre_periods=None, covariates=None, outcome_weight=0.7, covariate_weight=0.3, exclude_units=None, require_units=None, n_top=None, suggest_treatment_candidates=False, n_treatment_candidates=5, lambda_reg=0.0)[source]

Rank potential control units by their suitability for DiD analysis.

Evaluates control units based on pre-treatment outcome trend similarity and optional covariate matching to treated units. Returns a ranked list with quality scores.

Parameters:
  • data (pd.DataFrame) – Panel data in long format.

  • unit_column (str) – Column name for unit identifier.

  • time_column (str) – Column name for time periods.

  • outcome_column (str) – Column name for outcome variable.

  • treatment_column (str, optional) – Column with binary treatment indicator (0/1). Used to identify treated units from data.

  • treated_units (list, optional) – Explicit list of treated unit IDs. Alternative to treatment_column.

  • pre_periods (list, optional) – Pre-treatment periods for comparison. If None, uses first half of periods.

  • covariates (list of str, optional) – Covariate columns for matching. Similarity is based on pre-treatment means.

  • outcome_weight (float, default=0.7) – Weight for pre-treatment outcome trend similarity (0-1).

  • covariate_weight (float, default=0.3) – Weight for covariate distance (0-1). Ignored if no covariates.

  • exclude_units (list, optional) – Units that cannot be in control group.

  • require_units (list, optional) – Units that must be in control group (will always appear in output).

  • n_top (int, optional) – Return only top N control units. If None, return all.

  • suggest_treatment_candidates (bool, default=False) – If True and no treated units specified, identify potential treatment candidates instead of ranking controls.

  • n_treatment_candidates (int, default=5) – Number of treatment candidates to suggest.

  • lambda_reg (float, default=0.0) – Regularization for synthetic weights. Higher values give more uniform weights across controls.

Returns:

Ranked control units with columns:
  • unit – Unit identifier
  • quality_score – Combined quality score (0-1, higher is better)
  • outcome_trend_score – Pre-treatment outcome trend similarity
  • covariate_score – Covariate match score (NaN if no covariates)
  • synthetic_weight – Weight from synthetic control optimization
  • pre_trend_rmse – RMSE of pre-treatment outcome vs treated mean
  • is_required – Whether unit was in require_units

If suggest_treatment_candidates=True (and no treated units):
  • unit – Unit identifier
  • treatment_candidate_score – Suitability as treatment unit
  • avg_outcome_level – Pre-treatment outcome mean
  • outcome_trend – Pre-treatment trend slope
  • n_similar_controls – Count of similar potential controls

Return type:

pd.DataFrame

Examples

Rank controls against treated units:

>>> data = generate_did_data(n_units=30, n_periods=6, seed=42)
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     n_top=10
... )
>>> ranking['quality_score'].is_monotonic_decreasing
True

With covariates:

>>> data['size'] = np.random.randn(len(data))
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     covariates=['size']
... )

Filter data for SyntheticDiD:

>>> top_controls = ranking['unit'].tolist()
>>> filtered = data[(data['treated'] == 1) | (data['unit'].isin(top_controls))]
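Of the returned columns, `pre_trend_rmse` is the easiest to reproduce by hand: the root mean squared gap between a control unit's pre-treatment outcomes and the treated group's mean in the same periods. The sketch below computes that quantity on a small made-up panel using plain pandas. It illustrates the documented definition only and does not reproduce the library's scoring or synthetic weights.

```python
import pandas as pd

data = pd.DataFrame({
    "unit":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "period":  [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "treated": [1, 1, 1, 0, 0, 0, 0, 0, 0],
    "outcome": [10.0, 11.0, 20.0, 10.5, 11.5, 12.0, 5.0, 9.0, 13.0],
})
pre_periods = [0, 1]  # period 2 is post-treatment in this made-up example

pre = data[data["period"].isin(pre_periods)]

# Treated group's mean outcome in each pre-treatment period
treated_mean = pre[pre["treated"] == 1].groupby("period")["outcome"].mean()

# One row per control unit, one column per pre-treatment period
controls = pre[pre["treated"] == 0].pivot(index="unit", columns="period", values="outcome")

# RMSE of each control's pre-treatment path against the treated mean
rmse = ((controls - treated_mean) ** 2).mean(axis=1) ** 0.5
ranking = rmse.sort_values().rename("pre_trend_rmse").reset_index()
print(ranking)
```

Unit 2 tracks the treated unit within 0.5 in both pre-periods and ranks first, while unit 3 starts far lower and ranks second.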

Example

from diff_diff import rank_control_units

ranked = rank_control_units(
    data,
    unit_column='unit_id',
    time_column='period',
    outcome_column='outcome',
    treatment_column='treated',
    pre_periods=[0, 1, 2, 3]  # list of pre-treatment periods (assumes 0-indexed periods)
)

# Select top 10 control units (the result's identifier column is named 'unit')
best_controls = ranked.head(10)['unit'].tolist()