diff_diff.rank_control_units
- diff_diff.rank_control_units(data, unit_column, time_column, outcome_column, treatment_column=None, treated_units=None, pre_periods=None, covariates=None, outcome_weight=0.7, covariate_weight=0.3, exclude_units=None, require_units=None, n_top=None, suggest_treatment_candidates=False, n_treatment_candidates=5, lambda_reg=0.0)
Rank potential control units by their suitability for DiD analysis.
Evaluates control units based on pre-treatment outcome trend similarity and optional covariate matching to treated units. Returns a ranked list with quality scores.
- Parameters:
data (pd.DataFrame) – Panel data in long format.
unit_column (str) – Column name for unit identifier.
time_column (str) – Column name for time periods.
outcome_column (str) – Column name for outcome variable.
treatment_column (str, optional) – Column with binary treatment indicator (0/1). Used to identify treated units from data.
treated_units (list, optional) – Explicit list of treated unit IDs. Alternative to treatment_column.
pre_periods (list, optional) – Pre-treatment periods for comparison. If None, uses first half of periods.
covariates (list of str, optional) – Covariate columns for matching. Similarity is based on pre-treatment means.
outcome_weight (float, default=0.7) – Weight for pre-treatment outcome trend similarity (0-1).
covariate_weight (float, default=0.3) – Weight for covariate distance (0-1). Ignored if no covariates.
exclude_units (list, optional) – Units that cannot be in control group.
require_units (list, optional) – Units that must be in control group (will always appear in output).
n_top (int, optional) – Return only top N control units. If None, return all.
suggest_treatment_candidates (bool, default=False) – If True and no treated units specified, identify potential treatment candidates instead of ranking controls.
n_treatment_candidates (int, default=5) – Number of treatment candidates to suggest.
lambda_reg (float, default=0.0) – Regularization for synthetic weights. Higher values give more uniform weights across controls.
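As a rough illustration of two of the parameters above, the sketch below shows one way the `pre_periods=None` default (first half of the periods) and the `outcome_weight` / `covariate_weight` blend could behave. The helper names `default_pre_periods` and `blend_scores`, the floor rounding for an odd number of periods, and the normalized weighted average are all assumptions made for illustration. They are not the library's confirmed implementation.

```python
import math


def default_pre_periods(periods):
    """Hypothetical helper: first half of the sorted unique periods."""
    ordered = sorted(set(periods))
    # Assumption: floor division for odd counts, keeping at least one period.
    return ordered[: max(1, len(ordered) // 2)]


def blend_scores(outcome_score, covariate_score=None,
                 outcome_weight=0.7, covariate_weight=0.3):
    """Hypothetical helper: weighted average of the two similarity scores."""
    if covariate_score is None or math.isnan(covariate_score):
        # Documented behaviour: covariate_weight is ignored without covariates.
        return outcome_score
    total = outcome_weight + covariate_weight
    # Assumption: weights are normalized so the result stays on the 0-1 scale.
    return (outcome_weight * outcome_score
            + covariate_weight * covariate_score) / total


pre = default_pre_periods([0, 1, 2, 3, 4, 5])
combined = blend_scores(0.9, 0.5)
outcome_only = blend_scores(0.9)
```

With six periods the sketch selects the first three as pre-treatment periods. When no covariate score is available, the outcome score passes through unchanged.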
- Returns:
Ranked control units with columns:
unit: Unit identifier
quality_score: Combined quality score (0-1, higher is better)
outcome_trend_score: Pre-treatment outcome trend similarity
covariate_score: Covariate match score (NaN if no covariates)
synthetic_weight: Informational heuristic weight from a single-pass uncentered Frank-Wolfe solve; does NOT factor into quality_score (ranking) and is NOT the canonical SDID unit weight. For canonical SDID weights use SyntheticDiD.fit().
pre_trend_rmse: RMSE of pre-treatment outcome vs treated mean
is_required: Whether unit was in require_units
If suggest_treatment_candidates=True and no treated units are specified, the returned columns are instead:
unit: Unit identifier
treatment_candidate_score: Suitability as treatment unit
avg_outcome_level: Pre-treatment outcome mean
outcome_trend: Pre-treatment trend slope
n_similar_controls: Count of similar potential controls
- Return type:
pd.DataFrame
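To make the `pre_trend_rmse` column more concrete, here is a minimal sketch of an RMSE between each control unit's pre-treatment outcomes and the treated mean, computed with plain pandas on a hand-built toy panel. The panel, the variable names, and the choice to compare raw levels (rather than demeaned or differenced series) are assumptions. The library's exact computation may differ.

```python
import numpy as np
import pandas as pd

# Toy long-format panel: unit "T" is treated, "A" and "B" are candidate controls.
panel = pd.DataFrame({
    "unit":    ["T", "T", "A", "A", "B", "B"],
    "period":  [0, 1, 0, 1, 0, 1],
    "outcome": [1.0, 2.0, 1.0, 2.0, 2.0, 4.0],
})
treated_units = ["T"]
pre_periods = [0, 1]

pre = panel[panel["period"].isin(pre_periods)]
is_treated = pre["unit"].isin(treated_units)

# Mean treated outcome in each pre-treatment period.
treated_mean = pre[is_treated].groupby("period")["outcome"].mean()

# Wide matrix of control outcomes: one row per unit, one column per period.
controls = pre[~is_treated].pivot(index="unit", columns="period", values="outcome")

# RMSE of each control's pre-treatment path against the treated mean path.
diff = controls.sub(treated_mean, axis=1)
pre_trend_rmse = np.sqrt((diff ** 2).mean(axis=1))
```

In this toy panel, control "A" tracks the treated unit exactly, so its RMSE is zero, while "B" sits above the treated path and gets a larger value.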
Examples
Rank controls against treated units:
>>> data = generate_did_data(n_units=30, n_periods=6, seed=42)
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     n_top=10
... )
>>> ranking['quality_score'].is_monotonic_decreasing
True
With covariates:
>>> import numpy as np
>>> data['size'] = np.random.randn(len(data))
>>> ranking = rank_control_units(
...     data,
...     unit_column='unit',
...     time_column='period',
...     outcome_column='outcome',
...     treatment_column='treated',
...     covariates=['size']
... )
Filter data for SyntheticDiD:
>>> top_controls = ranking['unit'].tolist()
>>> filtered = data[(data['treated'] == 1) | (data['unit'].isin(top_controls))]
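The filtering step can also be checked on a toy frame without calling the library. In the sketch below, `data` and `ranking` are hand-built stand-ins (assumptions) that reuse the column names from the examples above. The `ranking` frame only imitates what a call with `n_top=1` might return.

```python
import pandas as pd

# Toy panel: unit 1 is treated, units 2 and 3 are potential controls.
data = pd.DataFrame({
    "unit":    [1, 1, 2, 2, 3, 3],
    "period":  [0, 1, 0, 1, 0, 1],
    "treated": [1, 1, 0, 0, 0, 0],
    "outcome": [1.0, 2.0, 1.1, 2.1, 5.0, 9.0],
})

# Stand-in for a ranking result restricted to the single best control.
ranking = pd.DataFrame({"unit": [2], "quality_score": [0.95]})

# Keep every treated row plus the rows of the selected control units.
top_controls = ranking["unit"].tolist()
filtered = data[(data["treated"] == 1) | (data["unit"].isin(top_controls))]
kept_units = sorted(filtered["unit"].unique().tolist())
```

This sketch assumes `treated` is a unit-level indicator equal to 1 in every period for treated units. If the column switched on only after treatment began, the same filter would drop the treated units' pre-treatment rows.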