diff_diff.rank_control_units

diff_diff.rank_control_units(data, unit_column, time_column, outcome_column, treatment_column=None, treated_units=None, pre_periods=None, covariates=None, outcome_weight=0.7, covariate_weight=0.3, exclude_units=None, require_units=None, n_top=None, suggest_treatment_candidates=False, n_treatment_candidates=5, lambda_reg=0.0)

Rank potential control units by their suitability for DiD analysis.

Evaluates control units on how closely their pre-treatment outcome trends, and optionally their covariates, match those of the treated units. Returns a DataFrame of control units ranked by quality score.

Parameters:

data (pd.DataFrame) – Panel data in long format.
unit_column (str) – Column name for unit identifier.
time_column (str) – Column name for time periods.
outcome_column (str) – Column name for outcome variable.
treatment_column (str, optional) – Column with binary treatment indicator (0/1). Used to identify treated units from data.
treated_units (list, optional) – Explicit list of treated unit IDs. Alternative to treatment_column.
pre_periods (list, optional) – Pre-treatment periods for comparison. If None, uses first half of periods.
covariates (list of str, optional) – Covariate columns for matching. Similarity is based on pre-treatment means.
outcome_weight (float, default=0.7) – Weight for pre-treatment outcome trend similarity (0-1).
covariate_weight (float, default=0.3) – Weight for covariate distance (0-1). Ignored if no covariates.
exclude_units (list, optional) – Units that cannot be in control group.
require_units (list, optional) – Units that must be in control group (will always appear in output).
n_top (int, optional) – Return only top N control units. If None, return all.
suggest_treatment_candidates (bool, default=False) – If True and no treated units specified, identify potential treatment candidates instead of ranking controls.
n_treatment_candidates (int, default=5) – Number of treatment candidates to suggest.
lambda_reg (float, default=0.0) – Regularization for synthetic weights. Higher values give more uniform weights across controls.
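The pre_periods default can be pictured with a small pandas sketch. It assumes "first half" means the earliest floor(n/2) distinct values of time_column. How the library rounds an odd number of periods is not stated above, so treat the split point as an assumption. The helper name default_pre_periods is hypothetical and is not part of diff_diff.

```python
import pandas as pd


def default_pre_periods(data: pd.DataFrame, time_column: str) -> list:
    """Hypothetical helper: earliest half of the distinct periods.

    Assumption: an odd number of periods is rounded down.
    """
    periods = sorted(data[time_column].unique().tolist())
    return periods[: len(periods) // 2]


# Toy long-format panel: two units observed over four periods.
panel = pd.DataFrame({
    "unit": [1, 1, 1, 1, 2, 2, 2, 2],
    "period": [0, 1, 2, 3, 0, 1, 2, 3],
    "outcome": [1.0, 1.1, 1.3, 2.0, 0.9, 1.0, 1.2, 1.3],
})

pre = default_pre_periods(panel, "period")
print(pre)
```

If treatment does not begin at the midpoint of the panel, passing pre_periods explicitly avoids relying on this default.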

Returns:

Ranked control units with columns:

unit: Unit identifier
quality_score: Combined quality score (0-1, higher is better)
outcome_trend_score: Pre-treatment outcome trend similarity
covariate_score: Covariate match score (NaN if no covariates)
synthetic_weight: Informational heuristic weight from a single-pass uncentered Frank-Wolfe solve; does NOT factor into quality_score (ranking) and is NOT the canonical SDID unit weight. For canonical SDID weights use SyntheticDiD.fit().
pre_trend_rmse: RMSE of pre-treatment outcome vs treated mean
is_required: Whether unit was in require_units

If suggest_treatment_candidates=True and no treated units are specified, the returned columns are instead:

unit: Unit identifier
treatment_candidate_score: Suitability as treatment unit
avg_outcome_level: Pre-treatment outcome mean
outcome_trend: Pre-treatment trend slope
n_similar_controls: Count of similar potential controls

Return type:

pd.DataFrame
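To make the pre_trend_rmse column concrete, the following sketch computes an RMSE between each control unit's pre-treatment outcome path and the per-period mean of the treated units. It follows the one-line description above and is not the library's code. The toy panel, the unit labels, and the choice of pre-periods are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy panel: "T" is treated, "A" and "B" are potential controls.
panel = pd.DataFrame({
    "unit":    ["T", "T", "T", "A", "A", "A", "B", "B", "B"],
    "period":  [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "outcome": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "treated": [1, 1, 1, 0, 0, 0, 0, 0, 0],
})
pre_periods = [0, 1, 2]  # assumed pre-treatment window

pre = panel[panel["period"].isin(pre_periods)]

# Mean treated outcome in each pre-treatment period.
treated_mean = pre[pre["treated"] == 1].groupby("period")["outcome"].mean()

# One row per control unit, one column per period.
controls = pre[pre["treated"] == 0].pivot(
    index="unit", columns="period", values="outcome"
)

# Root-mean-square gap between each control's path and the treated mean path.
rmse = np.sqrt(((controls - treated_mean) ** 2).mean(axis=1))
print(rmse.sort_values())
```

In this toy panel, unit "A" tracks the treated path exactly and unit "B" diverges from it, so "A" gets the smaller RMSE.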

Examples

Rank controls against treated units:

>>> import numpy as np
>>> from diff_diff import generate_did_data, rank_control_units
>>> data = generate_did_data(n_units=30, n_periods=6, seed=42)
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     n_top=10
... )
>>> ranking['quality_score'].is_monotonic_decreasing
True

With covariates:

>>> data['size'] = np.random.randn(len(data))
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     covariates=['size']
... )

Filter data for SyntheticDiD:

>>> top_controls = ranking['unit'].head(10).tolist()  # ranking is sorted best-first
>>> filtered = data[(data['treated'] == 1) | (data['unit'].isin(top_controls))]
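The filtering step can be checked without the library by running it on hand-built data. In this sketch the panel and the ranking DataFrame are made-up stand-ins: the ranking only mimics the documented unit and quality_score columns, and its scores are invented for illustration.

```python
import pandas as pd

# Toy panel: unit 1 is treated, units 2-4 are potential controls.
data = pd.DataFrame({
    "unit":    [1, 1, 2, 2, 3, 3, 4, 4],
    "period":  [0, 1, 0, 1, 0, 1, 0, 1],
    "outcome": [1.0, 2.0, 1.1, 1.2, 0.9, 1.0, 5.0, 5.5],
    "treated": [1, 1, 0, 0, 0, 0, 0, 0],
})

# Hand-written stand-in for the function's output (scores are made up).
ranking = pd.DataFrame({
    "unit": [2, 3, 4],
    "quality_score": [0.9, 0.8, 0.1],
})

# Keep the two best-scoring controls plus every treated unit.
top_controls = ranking.nlargest(2, "quality_score")["unit"].tolist()
filtered = data[(data["treated"] == 1) | (data["unit"].isin(top_controls))]

kept_units = sorted(filtered["unit"].unique().tolist())
print(kept_units)
```

Selecting with nlargest rather than relying on row order keeps the sketch correct even if the ranking has been re-sorted in the meantime.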