diff_diff.rank_control_units

diff_diff.rank_control_units(data, unit_column, time_column, outcome_column, treatment_column=None, treated_units=None, pre_periods=None, covariates=None, outcome_weight=0.7, covariate_weight=0.3, exclude_units=None, require_units=None, n_top=None, suggest_treatment_candidates=False, n_treatment_candidates=5, lambda_reg=0.0)

Rank potential control units by their suitability for DiD analysis.

Evaluates control units based on pre-treatment outcome trend similarity and optional covariate matching to treated units. Returns a ranked list with quality scores.

Parameters:
  • data (pd.DataFrame) – Panel data in long format.

  • unit_column (str) – Column name for unit identifier.

  • time_column (str) – Column name for time periods.

  • outcome_column (str) – Column name for outcome variable.

  • treatment_column (str, optional) – Column with binary treatment indicator (0/1). Used to identify treated units from data.

  • treated_units (list, optional) – Explicit list of treated unit IDs. Alternative to treatment_column.

  • pre_periods (list, optional) – Pre-treatment periods for comparison. If None, uses first half of periods.

  • covariates (list of str, optional) – Covariate columns for matching. Similarity is based on pre-treatment means.

  • outcome_weight (float, default=0.7) – Weight for pre-treatment outcome trend similarity (0-1).

  • covariate_weight (float, default=0.3) – Weight for covariate distance (0-1). Ignored if no covariates.

  • exclude_units (list, optional) – Units that cannot be in control group.

  • require_units (list, optional) – Units that must be in control group (will always appear in output).

  • n_top (int, optional) – Return only top N control units. If None, return all.

  • suggest_treatment_candidates (bool, default=False) – If True and no treated units specified, identify potential treatment candidates instead of ranking controls.

  • n_treatment_candidates (int, default=5) – Number of treatment candidates to suggest.

  • lambda_reg (float, default=0.0) – Regularization for synthetic weights. Higher values give more uniform weights across controls.
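
The pre_periods default described above ("first half of periods") can be pictured with a small sketch. The helper name default_pre_periods is hypothetical and not part of diff_diff, and rounding down for an odd number of periods is an assumption; pass pre_periods explicitly when the exact split matters.

```python
import pandas as pd


def default_pre_periods(data: pd.DataFrame, time_column: str) -> list:
    """Hypothetical helper: first half of the sorted unique periods.

    Assumption: an odd number of periods is rounded down. The library's
    own rule may differ.
    """
    periods = sorted(data[time_column].unique().tolist())
    return periods[: len(periods) // 2]


# Tiny long-format panel: 3 units observed over 6 periods.
panel = pd.DataFrame({
    "unit": [u for u in range(3) for _ in range(6)],
    "period": list(range(6)) * 3,
})

pre = default_pre_periods(panel, "period")
```

With six periods numbered 0 to 5, this sketch selects periods 0, 1 and 2 as the pre-treatment window.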

Returns:

Ranked control units with columns:

  • unit: Unit identifier

  • quality_score: Combined quality score (0-1, higher is better)

  • outcome_trend_score: Pre-treatment outcome trend similarity

  • covariate_score: Covariate match score (NaN if no covariates)

  • synthetic_weight: Informational heuristic weight from a single-pass uncentered Frank-Wolfe solve; does NOT factor into quality_score (ranking) and is NOT the canonical SDID unit weight. For canonical SDID weights use SyntheticDiD.fit().

  • pre_trend_rmse: RMSE of pre-treatment outcome vs treated mean

  • is_required: Whether unit was in require_units
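
To make the pre_trend_rmse column concrete, here is a minimal sketch of one way such a value could be computed with pandas. The function below is illustrative and not the library's implementation; it assumes raw outcome levels are compared without centering or scaling, and that the panel is balanced over the pre-treatment periods.

```python
import numpy as np
import pandas as pd


def pre_trend_rmse(data, unit_column, time_column, outcome_column,
                   treated_units, pre_periods):
    """Illustrative only: RMSE of each control's pre-period outcome path
    against the mean path of the treated units (raw levels, no centering)."""
    pre = data[data[time_column].isin(pre_periods)]
    # Rows are periods, columns are units.
    wide = pre.pivot(index=time_column, columns=unit_column,
                     values=outcome_column)
    treated_mean = wide[treated_units].mean(axis=1)
    controls = wide.drop(columns=treated_units)
    # Squared gap per period, averaged over periods, then square-rooted.
    rmse = np.sqrt((controls.sub(treated_mean, axis=0) ** 2).mean(axis=0))
    return rmse.sort_values()


# Toy panel: A is treated; B follows A exactly; C sits one unit above A.
toy = pd.DataFrame({
    "unit": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "period": [0, 1, 2] * 3,
    "outcome": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
})

rmse = pre_trend_rmse(toy, "unit", "period", "outcome",
                      treated_units=["A"], pre_periods=[0, 1, 2])
```

In this toy panel, control B gets an RMSE of 0 and control C an RMSE of 1, so B sorts first.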

If suggest_treatment_candidates=True and no treated units are specified, the returned columns are instead:

  • unit: Unit identifier

  • treatment_candidate_score: Suitability as treatment unit

  • avg_outcome_level: Pre-treatment outcome mean

  • outcome_trend: Pre-treatment trend slope

  • n_similar_controls: Count of similar potential controls
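
The avg_outcome_level and outcome_trend columns can be read as per-unit summaries of the pre-treatment outcomes. The sketch below shows one plausible way to compute them, assuming the slope is an ordinary least-squares line fitted per unit over the pre-periods. summarize_units is a hypothetical helper, not a diff_diff function, and the library may compute these values differently.

```python
import numpy as np
import pandas as pd


def summarize_units(data, unit_column, time_column, outcome_column,
                    pre_periods):
    """Hypothetical helper: per-unit pre-period mean and OLS slope."""
    pre = data[data[time_column].isin(pre_periods)]
    rows = []
    for unit, grp in pre.groupby(unit_column):
        t = grp[time_column].to_numpy(dtype=float)
        y = grp[outcome_column].to_numpy(dtype=float)
        # Degree-1 polynomial fit; the first coefficient is the slope.
        slope = np.polyfit(t, y, 1)[0]
        rows.append({
            "unit": unit,
            "avg_outcome_level": y.mean(),
            "outcome_trend": slope,
        })
    return pd.DataFrame(rows)


# Toy panel: A rises by 2 per period, B is flat at 4.
toy = pd.DataFrame({
    "unit": ["A"] * 3 + ["B"] * 3,
    "period": [0, 1, 2] * 2,
    "outcome": [1.0, 3.0, 5.0, 4.0, 4.0, 4.0],
})

summary = summarize_units(toy, "unit", "period", "outcome",
                          pre_periods=[0, 1, 2]).set_index("unit")
```

Here unit A has a mean level of 3 with a slope of 2, while unit B has a mean level of 4 with a slope of 0.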

Return type:

pd.DataFrame

Examples

Rank controls against treated units:

>>> from diff_diff import generate_did_data, rank_control_units
>>> data = generate_did_data(n_units=30, n_periods=6, seed=42)
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     n_top=10
... )
>>> ranking['quality_score'].is_monotonic_decreasing
True

With covariates:

>>> import numpy as np
>>> data['size'] = np.random.randn(len(data))
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     covariates=['size']
... )

Filter data for SyntheticDiD:

>>> top_controls = ranking['unit'].head(10).tolist()  # keep the 10 best-ranked controls
>>> filtered = data[(data['treated'] == 1) | (data['unit'].isin(top_controls))]