Data Preparation ================ Utilities for preparing and validating data for DiD analysis. .. module:: diff_diff.prep Data Generation --------------- generate_did_data ~~~~~~~~~~~~~~~~~ Generate synthetic data with known treatment effects for testing. .. autofunction:: diff_diff.generate_did_data Example ^^^^^^^ .. code-block:: python from diff_diff import generate_did_data # Generate basic 2x2 DiD data data = generate_did_data( n_units=100, n_periods=10, treatment_effect=5.0, treatment_period=5, treatment_fraction=0.5, noise_sd=1.0 ) print(data.head()) # Columns: unit_id, period, outcome, treated, post generate_staggered_data ~~~~~~~~~~~~~~~~~~~~~~~ Generate synthetic staggered adoption data for testing. .. autofunction:: diff_diff.generate_staggered_data Example ^^^^^^^ .. code-block:: python from diff_diff import generate_staggered_data data = generate_staggered_data( n_units=200, n_periods=10, cohort_periods=[4, 6, 8], seed=42 ) generate_event_study_data ~~~~~~~~~~~~~~~~~~~~~~~~~ Generate synthetic event study data for testing. .. autofunction:: diff_diff.generate_event_study_data generate_ddd_data ~~~~~~~~~~~~~~~~~ Generate synthetic Triple Difference data. .. autofunction:: diff_diff.generate_ddd_data generate_ddd_panel_data ~~~~~~~~~~~~~~~~~~~~~~~ Generate synthetic panel-structured Triple Difference data for power analysis. .. autofunction:: diff_diff.generate_ddd_panel_data generate_factor_data ~~~~~~~~~~~~~~~~~~~~ Generate synthetic data with factor structure for TROP testing. .. autofunction:: diff_diff.generate_factor_data generate_panel_data ~~~~~~~~~~~~~~~~~~~ Generate generic synthetic panel data. .. autofunction:: diff_diff.generate_panel_data generate_continuous_did_data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Generate synthetic continuous treatment DiD data with known dose-response. .. autofunction:: diff_diff.generate_continuous_did_data generate_reversible_did_data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Generate synthetic **reversible-treatment** panel data — treatment can switch on and off over time. Use this with :class:`~diff_diff.ChaisemartinDHaultfoeuille` for testing the dCDH estimator on non-absorbing treatments. .. autofunction:: diff_diff.generate_reversible_did_data Example ^^^^^^^ .. code-block:: python from diff_diff import generate_reversible_did_data, ChaisemartinDHaultfoeuille data = generate_reversible_did_data( n_groups=80, n_periods=6, pattern="single_switch", # or "joiners_only", "leavers_only", "mixed_single_switch" treatment_effect=2.0, seed=42, ) est = ChaisemartinDHaultfoeuille() results = est.fit( data, outcome="outcome", group="group", time="period", treatment="treatment", ) Indicator Creation ------------------ make_treatment_indicator ~~~~~~~~~~~~~~~~~~~~~~~~ Create binary treatment indicator from categorical or numeric columns. .. autofunction:: diff_diff.make_treatment_indicator Example ^^^^^^^ .. code-block:: python from diff_diff import make_treatment_indicator # From categorical data = make_treatment_indicator( data, column='group', treated_values='treatment' ) # From numeric threshold data = make_treatment_indicator( data, column='exposure', threshold=0.5, new_column='high_exposure' ) make_post_indicator ~~~~~~~~~~~~~~~~~~~ Create post-treatment period indicator. .. autofunction:: diff_diff.make_post_indicator Example ^^^^^^^ .. code-block:: python from diff_diff import make_post_indicator data = make_post_indicator( data, time_column='period', treatment_start=5 ) Panel Data Utilities -------------------- wide_to_long ~~~~~~~~~~~~ Reshape wide panel data to long format. .. autofunction:: diff_diff.wide_to_long Example ^^^^^^^ .. code-block:: python from diff_diff import wide_to_long # Wide format: each column is a time period # unit_id, y_2019, y_2020, y_2021, y_2022 long_data = wide_to_long( wide_data, id_col='unit_id', value_name='outcome', var_name='year' ) balance_panel ~~~~~~~~~~~~~ Balance panel data by filling or dropping incomplete observations. .. autofunction:: diff_diff.balance_panel Example ^^^^^^^ .. code-block:: python from diff_diff import balance_panel # Fill missing periods with NaN balanced = balance_panel( data, unit_column='unit_id', time_column='period', method='fill' ) # Or keep only units with all periods (default) balanced = balance_panel( data, unit_column='unit_id', time_column='period', method='inner' ) Staggered Adoption Utilities ---------------------------- create_event_time ~~~~~~~~~~~~~~~~~ Create event-time column for staggered adoption designs. .. autofunction:: diff_diff.create_event_time Example ^^^^^^^ .. code-block:: python from diff_diff import create_event_time data = create_event_time( data, time_column='period', treatment_time_column='first_treat' ) # event_time = period - first_treat # Negative values: pre-treatment # Zero: treatment period # Positive values: post-treatment # NaN for never-treated aggregate_to_cohorts ~~~~~~~~~~~~~~~~~~~~ Aggregate unit-level data to cohort means. .. autofunction:: diff_diff.aggregate_to_cohorts Example ^^^^^^^ .. code-block:: python from diff_diff import aggregate_to_cohorts cohort_data = aggregate_to_cohorts( data, unit_column='unit_id', time_column='period', treatment_column='first_treat', outcome='outcome' ) Survey Aggregation ------------------ aggregate_survey ~~~~~~~~~~~~~~~~ Aggregate survey microdata to geographic-period cells with design-based precision. .. autofunction:: diff_diff.aggregate_survey Example ^^^^^^^ .. code-block:: python from diff_diff import aggregate_survey, SurveyDesign, DifferenceInDifferences # Define the survey design for the microdata design = SurveyDesign(weights="finalwt", strata="strat", psu="psu") # Aggregate to state-year panel with design-based SEs panel, stage2 = aggregate_survey( microdata, by=["state", "year"], outcomes="smoking_rate", covariates=["age", "income"], survey_design=design, ) # panel has: state, year, smoking_rate_mean, smoking_rate_se, # smoking_rate_n, smoking_rate_precision, smoking_rate_weight, # age_mean, income_mean, cell_n, cell_n_eff, cell_sum_w, srs_fallback # # *_weight is fit-ready: unit-constant population weight (pweight, default) # or cleaned precision with NaN/Inf -> 0.0 (aweight opt-in). # cell_sum_w is a per-cell diagnostic (sum of survey weights per cell). # Non-estimable cells and zero-weight geos are dropped automatically. # stage2 is pre-configured: pweights + state-level clustering # Add treatment/time indicators at the panel level, then fit: # panel["treated"] = ... # from policy adoption data # panel["post"] = (panel["year"] >= treatment_year).astype(int) # result = DifferenceInDifferences().fit( # panel, outcome="smoking_rate_mean", # treatment="treated", time="post", survey_design=stage2, # ) Data Validation --------------- validate_did_data ~~~~~~~~~~~~~~~~~ Validate data structure for DiD analysis. .. autofunction:: diff_diff.validate_did_data Example ^^^^^^^ .. code-block:: python from diff_diff import validate_did_data result = validate_did_data( data, outcome='outcome', treatment='treated', time='period', unit='unit_id' ) if not result['valid']: for error in result['errors']: print(f"Error: {error}") for warning in result['warnings']: print(f"Warning: {warning}") summarize_did_data ~~~~~~~~~~~~~~~~~~ Generate summary statistics for DiD data. .. autofunction:: diff_diff.summarize_did_data Example ^^^^^^^ .. code-block:: python from diff_diff import summarize_did_data summary = summarize_did_data( data, outcome='outcome', treatment='treated', time='period', unit='unit_id' ) print(summary) Control Unit Selection ---------------------- rank_control_units ~~~~~~~~~~~~~~~~~~ Rank control units by suitability for DiD or synthetic control. .. autofunction:: diff_diff.rank_control_units Example ^^^^^^^ .. code-block:: python from diff_diff import rank_control_units, generate_did_data panel = generate_did_data(n_units=100, n_periods=10, treatment_effect=2.0) ranked = rank_control_units( panel, unit_column='unit', time_column='period', outcome_column='outcome', treatment_column='treated', pre_periods=[0, 1, 2, 3, 4] ) # Select top 10 control units best_controls = ranked.head(10)['unit'].tolist()