Data Preparation
================

Utilities for preparing and validating data for DiD analysis.

.. module:: diff_diff.prep

Data Generation
---------------

generate_did_data
~~~~~~~~~~~~~~~~~

Generate synthetic panel data with a known treatment effect, so that an estimator's output can be checked against the true value.

.. autofunction:: diff_diff.generate_did_data

Example
^^^^^^^

.. code-block:: python

   from diff_diff import generate_did_data

   # Generate two-group, pre/post DiD data (10 periods, treatment from period 5)
   data = generate_did_data(
       n_units=100,
       n_periods=10,
       treatment_effect=5.0,
       treatment_start=5,
       treatment_fraction=0.5,
       noise_sd=1.0
   )

   print(data.head())
   # Columns: unit_id, period, outcome, treated, post
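
Because the effect is known, data of this shape can serve as a sanity check for an estimator. The sketch below does not call ``generate_did_data``. It builds a small noise-free frame by hand, reusing the column names listed above, and recovers the effect from the four group-by-period cell means. The unit offsets, time trend, and sizes are illustrative assumptions.

```python
import pandas as pd

# Hand-built, noise-free panel: 4 units, 4 periods, treatment starts in period 2.
# Column names mirror the example above; the data-generating values are made up.
true_effect = 5.0
rows = []
for unit_id in range(4):
    treated = int(unit_id < 2)  # the first two units are treated
    for period in range(4):
        post = int(period >= 2)
        outcome = 10.0 + unit_id + 0.5 * period + true_effect * treated * post
        rows.append((unit_id, period, outcome, treated, post))

data = pd.DataFrame(rows, columns=["unit_id", "period", "outcome", "treated", "post"])

# 2x2 difference-in-differences from the four cell means.
means = data.groupby(["treated", "post"])["outcome"].mean()
did = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])
print(did)
```

With no noise, the common time trend cancels and the difference of differences equals ``true_effect``.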

Indicator Creation
------------------

make_treatment_indicator
~~~~~~~~~~~~~~~~~~~~~~~~

Create a binary treatment indicator from a categorical or numeric column.

.. autofunction:: diff_diff.make_treatment_indicator

Example
^^^^^^^

.. code-block:: python

   from diff_diff import make_treatment_indicator

   # From categorical
   data['treated'] = make_treatment_indicator(
       data,
       column='group',
       treated_value='treatment'
   )

   # From numeric threshold
   data['high_exposure'] = make_treatment_indicator(
       data,
       column='exposure',
       threshold=0.5
   )
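
For intuition, the two rules can be written directly in pandas. This is a rough equivalent, not the library's implementation; in particular, whether the threshold comparison is strict is an assumption here.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["treatment", "control", "treatment", "control"],
    "exposure": [0.9, 0.2, 0.5, 0.7],
})

# Categorical rule: 1 where the column equals the treated value.
df["treated"] = (df["group"] == "treatment").astype(int)

# Threshold rule: strict ">" is an assumption; the library may use ">=".
df["high_exposure"] = (df["exposure"] > 0.5).astype(int)
```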

make_post_indicator
~~~~~~~~~~~~~~~~~~~

Create a post-treatment period indicator.

.. autofunction:: diff_diff.make_post_indicator

Example
^^^^^^^

.. code-block:: python

   from diff_diff import make_post_indicator

   data['post'] = make_post_indicator(
       data,
       time_column='period',
       treatment_start=5
   )
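
A plain pandas sketch of the same idea follows. It assumes that the start period itself counts as post-treatment, which the function's documentation should confirm.

```python
import pandas as pd

df = pd.DataFrame({"period": [3, 4, 5, 6]})

# Assumption: periods at or after the start are "post" (inclusive comparison).
treatment_start = 5
df["post"] = (df["period"] >= treatment_start).astype(int)
```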

Panel Data Utilities
--------------------

wide_to_long
~~~~~~~~~~~~

Reshape wide panel data to long format.

.. autofunction:: diff_diff.wide_to_long

Example
^^^^^^^

.. code-block:: python

   from diff_diff import wide_to_long

   # Wide format: one outcome column per time period, e.g.
   # unit_id, y_2019, y_2020, y_2021, y_2022
   long_data = wide_to_long(
       wide_data,
       id_col='unit_id',
       value_name='outcome',
       var_name='year'
   )
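
The reshape itself corresponds to a pandas ``melt``. The sketch below shows that operation on a two-unit frame; the ``y_`` prefix handling is an illustrative assumption and may differ from what ``wide_to_long`` does.

```python
import pandas as pd

wide_data = pd.DataFrame({
    "unit_id": [1, 2],
    "y_2019": [1.0, 2.0],
    "y_2020": [1.5, 2.5],
})

# One row per unit-year instead of one column per year.
long_data = wide_data.melt(id_vars="unit_id", var_name="year", value_name="outcome")

# Strip the (assumed) "y_" prefix so that year becomes an integer.
long_data["year"] = long_data["year"].str.replace("y_", "", regex=False).astype(int)
long_data = long_data.sort_values(["unit_id", "year"]).reset_index(drop=True)
```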

balance_panel
~~~~~~~~~~~~~

Balance a panel by filling in missing unit-period rows or by dropping units with incomplete histories.

.. autofunction:: diff_diff.balance_panel

Example
^^^^^^^

.. code-block:: python

   from diff_diff import balance_panel

   # Fill missing periods with NaN
   balanced = balance_panel(
       data,
       unit='unit_id',
       time='period',
       method='fill'
   )

   # Or drop units with missing periods
   balanced = balance_panel(
       data,
       unit='unit_id',
       time='period',
       method='drop'
   )
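
To make the two strategies concrete, here is a pandas-only sketch of each on a panel where one unit is missing a period. It illustrates the concepts and is not necessarily how ``balance_panel`` is implemented.

```python
import pandas as pd

data = pd.DataFrame({
    "unit_id": [1, 1, 1, 2, 2],  # unit 2 is missing period 2
    "period": [0, 1, 2, 0, 1],
    "outcome": [1.0, 1.1, 1.2, 2.0, 2.1],
})

# "fill": reindex onto the full unit x period grid; missing cells become NaN.
full_index = pd.MultiIndex.from_product(
    [data["unit_id"].unique(), data["period"].unique()],
    names=["unit_id", "period"],
)
filled = data.set_index(["unit_id", "period"]).reindex(full_index).reset_index()

# "drop": keep only units observed in every period.
n_periods = data["period"].nunique()
dropped = data.groupby("unit_id").filter(lambda g: g["period"].nunique() == n_periods)
```

Filling keeps every unit at the cost of missing outcomes, while dropping keeps only complete units and may shrink the sample.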

Staggered Adoption Utilities
----------------------------

create_event_time
~~~~~~~~~~~~~~~~~

Create an event-time column for staggered adoption designs.

.. autofunction:: diff_diff.create_event_time

Example
^^^^^^^

.. code-block:: python

   from diff_diff import create_event_time

   data['event_time'] = create_event_time(
       data,
       time_col='period',
       first_treat_col='first_treatment'
   )

   # event_time = period - first_treatment
   # Negative values: pre-treatment
   # Zero: first treated period
   # Positive values: post-treatment
   # NaN for never-treated
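
The arithmetic in the comments above can be reproduced in pandas. This sketch assumes never-treated units carry ``NaN`` in ``first_treatment``; the library may encode them differently (for example with 0 or infinity).

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "unit_id": [1, 1, 1, 2, 2, 2],
    "period": [1, 2, 3, 1, 2, 3],
    # Assumption: NaN marks a never-treated unit (unit 2 here).
    "first_treatment": [2, 2, 2, np.nan, np.nan, np.nan],
})

# Periods relative to first treatment; NaN propagates for never-treated units.
data["event_time"] = data["period"] - data["first_treatment"]
```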

aggregate_to_cohorts
~~~~~~~~~~~~~~~~~~~~

Aggregate unit-level data to cohort means.

.. autofunction:: diff_diff.aggregate_to_cohorts

Example
^^^^^^^

.. code-block:: python

   from diff_diff import aggregate_to_cohorts

   cohort_data = aggregate_to_cohorts(
       data,
       outcome='outcome',
       time='period',
       cohort='first_treatment',
       agg_func='mean'
   )
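
Conceptually, this is a group-by over cohort and period. The pandas sketch below shows the mean aggregation on a three-unit panel; the shape and column order of the library's output are not guaranteed to match.

```python
import pandas as pd

data = pd.DataFrame({
    "unit_id": [1, 2, 3, 1, 2, 3],
    "period": [0, 0, 0, 1, 1, 1],
    "first_treatment": [1, 1, 2, 1, 1, 2],  # units 1 and 2 share a cohort
    "outcome": [1.0, 3.0, 5.0, 2.0, 6.0, 7.0],
})

# One row per cohort-period cell, holding the mean outcome of its units.
cohort_data = data.groupby(["first_treatment", "period"], as_index=False)["outcome"].mean()
```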

Data Validation
---------------

validate_did_data
~~~~~~~~~~~~~~~~~

Validate data structure for DiD analysis.

.. autofunction:: diff_diff.validate_did_data

Example
^^^^^^^

.. code-block:: python

   from diff_diff import validate_did_data

   is_valid, issues = validate_did_data(
       data,
       outcome='outcome',
       treated='treated',
       post='post',
       unit='unit_id',
       time='period'
   )

   if not is_valid:
       for issue in issues:
           print(f"Issue: {issue}")
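
The exact checks performed are described in the function reference above. As a rough idea of what such validation can involve, the sketch below defines a hypothetical helper, ``basic_did_checks``, that is not part of ``diff_diff`` and covers only a few common problems.

```python
import pandas as pd

def basic_did_checks(df, outcome, treated, post, unit, time):
    """Hypothetical helper: return a list of human-readable issues."""
    issues = []
    for col in (outcome, treated, post, unit, time):
        if col not in df.columns:
            issues.append(f"missing column: {col}")
    if issues:
        return issues  # later checks need every column to exist
    for col in (treated, post):
        if not set(df[col].dropna().unique()) <= {0, 1}:
            issues.append(f"{col} is not binary (0/1)")
    if df[outcome].isna().any():
        issues.append(f"{outcome} contains missing values")
    if df.duplicated([unit, time]).any():
        issues.append("duplicate unit-period rows")
    return issues

df = pd.DataFrame({
    "unit_id": [1, 1, 2, 2],
    "period": [0, 1, 0, 0],  # unit 2 repeats period 0
    "outcome": [1.0, 2.0, 1.5, 1.7],
    "treated": [1, 1, 0, 0],
    "post": [0, 1, 0, 2],  # 2 is not a valid indicator value
})
issues = basic_did_checks(df, "outcome", "treated", "post", "unit_id", "period")
for issue in issues:
    print(f"Issue: {issue}")
```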

summarize_did_data
~~~~~~~~~~~~~~~~~~

Generate summary statistics for DiD data.

.. autofunction:: diff_diff.summarize_did_data

Example
^^^^^^^

.. code-block:: python

   from diff_diff import summarize_did_data

   summary = summarize_did_data(
       data,
       outcome='outcome',
       treated='treated',
       post='post',
       unit='unit_id',
       time='period'
   )

   print(f"N units: {summary['n_units']}")
   print(f"N periods: {summary['n_periods']}")
   print(f"Treatment fraction: {summary['treatment_fraction']:.1%}")
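
The three quantities printed above can also be computed by hand, which is useful for cross-checking. In this sketch the treatment fraction is taken to be the share of units that are ever treated; that definition is an assumption and may not match the library's.

```python
import pandas as pd

data = pd.DataFrame({
    "unit_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "period": [0, 1] * 4,
    "treated": [1, 1, 0, 0, 0, 0, 0, 0],
})

n_units = data["unit_id"].nunique()
n_periods = data["period"].nunique()
# Assumed definition: share of units that are ever treated.
treatment_fraction = data.groupby("unit_id")["treated"].max().mean()

print(f"{n_units} units, {n_periods} periods, {treatment_fraction:.1%} treated")
```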

Control Unit Selection
----------------------

rank_control_units
~~~~~~~~~~~~~~~~~~

Rank control units by suitability for DiD or synthetic control.

.. autofunction:: diff_diff.rank_control_units

Example
^^^^^^^

.. code-block:: python

   from diff_diff import rank_control_units

   ranked = rank_control_units(
       data,
       outcome='outcome',
       unit='unit_id',
       time='period',
       treated='treated',
       pre_periods=4,
       method='correlation'  # or 'rmse'
   )

   # Select top 10 control units
   best_controls = ranked.head(10)['unit_id'].tolist()
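
To illustrate the idea behind a correlation-based ranking, the sketch below scores each control by how closely its pre-period outcomes track a single treated unit. It is a simplified stand-in and not the scoring used by ``rank_control_units``; the unit names and values are made up.

```python
import pandas as pd

# Pre-period outcomes in wide form: rows are periods, columns are units.
pre = pd.DataFrame(
    {
        "treated_unit": [1.0, 2.0, 3.0, 4.0],
        "c1": [2.0, 4.1, 5.9, 8.0],  # tracks the treated unit closely
        "c2": [4.0, 3.0, 2.0, 1.0],  # moves in the opposite direction
        "c3": [1.0, 3.0, 2.0, 4.0],  # loosely related
    },
    index=pd.Index([0, 1, 2, 3], name="period"),
)

target = pre["treated_unit"]
controls = pre.drop(columns="treated_unit")

# Pearson correlation of each control with the treated unit, best first.
ranked = controls.corrwith(target).sort_values(ascending=False)
best_controls = ranked.head(2).index.tolist()
```

Correlation rewards controls that move with the treated unit regardless of level, whereas an RMSE-style score would also penalize level differences.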