Panel Profiling
Pre-fit description of a DiD panel’s structural facts. profile_panel()
inspects a long-format panel and returns a PanelProfile dataclass
covering balance, treatment-type classification, outcome characteristics, and
a list of factual Alert observations.
The profile is descriptive, not opinionated: alerts report what is (e.g.
“smallest cohort has 7 units”), never what to do about it. Estimator
selection is the caller’s responsibility. For autonomous-agent consumption,
pair the profile output with the
autonomous-agent reference guide (also accessible
at runtime via diff_diff.get_llm_guide("autonomous")), which walks
through the estimator-support matrix and the per-design-feature reasoning
keyed off PanelProfile field values.
Note
PanelProfile and its three supporting dataclasses
(OutcomeShape, TreatmentDoseShape, Alert) are
re-exported at the top level of diff_diff so callers can construct
or pattern-match against them without dotted-module access.
profile_panel
- diff_diff.profile_panel(df, *, unit, time, treatment, outcome)
Describe the structure of a DiD panel.
Reports structural facts — balance, treatment-type classification, outcome characteristics, factual alerts. Descriptive, not opinionated: the profile says what is, never what to do about it. Estimator selection is up to the caller.
- Parameters:
df (pandas.DataFrame) – Long-format panel data containing the four named columns.
unit (str) – Column identifying the cross-sectional unit.
time (str) – Column identifying the time period.
treatment (str) – Column holding the treatment indicator or dose. See Notes for the classification rules.
outcome (str) – Column holding the outcome variable.
- Returns:
Frozen dataclass. Call .to_dict() for a JSON-serializable view.
- Return type:
PanelProfile
- Raises:
ValueError – If any of the four column names is not present in df.
Examples
>>> import pandas as pd
>>> from diff_diff import profile_panel
>>> df = pd.DataFrame({
...     "u": [1, 1, 2, 2],
...     "t": [0, 1, 0, 1],
...     "tr": [0, 0, 1, 1],
...     "y": [0.1, 0.2, 0.1, 0.9],
... })
>>> profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y")
>>> profile.is_balanced
True
>>> profile.treatment_type
'binary_absorbing'
Notes
Classification rules for treatment_type:
- "binary_absorbing": numeric treatment whose observed non-NaN values are a subset of \(\{0, 1\}\) (one or two distinct values) AND each unit’s treatment sequence (ordered by time) is weakly monotone non-decreasing. All-zero and all-one panels are valid degenerate cases.
- "binary_non_absorbing": values a subset of \(\{0, 1\}\) with at least two distinct values observed, where at least one unit switches from 1 back to 0.
- "continuous": numeric treatment with more than two distinct values, or a 2-valued numeric whose values are not in \(\{0, 1\}\) (matches the ContinuousDiD convention).
- "categorical": non-numeric dtype (object / category) or a column that is entirely NaN.
Bool-dtype columns (True / False) are classified the same way as numeric {0, 1}: the library’s binary estimators validate on value support via diff_diff.utils.validate_binary(), so True / False behave like 1 / 0 for absorbing / non-absorbing classification.
has_never_treated is computed across both binary and continuous numeric treatment types: some unit has treatment == 0 in every observed non-NaN row. For binary this flags the clean-control group; for continuous this flags zero-dose controls (required by ContinuousDiD). Always False for "categorical".
has_always_treated has binary-only semantics: some unit has treatment == 1 in every observed non-NaN row (no pre-treatment information in the DiD sense). For "continuous" and "categorical" treatment this field is always False regardless of dose positivity: pre-treatment periods in continuous DiD are determined by the separate first_treat column passed to ContinuousDiD.fit, not by whether the dose is strictly positive.
Rows with NaN in unit or time are dropped up front and surfaced via the missing_id_rows_dropped alert; all subsequent structural facts are computed on the non-missing subset, so observation_coverage is always in [0, 1]. Duplicate (unit, time) rows are surfaced separately via the duplicate_unit_time_rows alert.
The profile does not recommend an estimator. Consult diff_diff.get_llm_guide("autonomous") for the estimator-support matrix and per-design-feature reasoning.
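The classification rules above can be sketched in plain Python. This is an illustrative re-statement, not the library's implementation: the helpers classify_treatment, has_never_treated and has_always_treated are hypothetical names, the input is assumed to be a mapping from unit to its treatment path already ordered by time, and only numeric or bool values are handled (so "categorical" is reached here only through the all-NaN rule). Treating every numeric support outside {0, 1} as "continuous", and requiring at least one observed row per qualifying unit, are assumptions.

```python
import math
from typing import Dict, List, Sequence, Union

Number = Union[int, float, bool]
Paths = Dict[str, Sequence[Number]]


def _observed(values: Sequence[Number]) -> List[float]:
    """Drop NaN entries; bools become 1.0 / 0.0, matching the bool rule."""
    return [float(v) for v in values if not (isinstance(v, float) and math.isnan(v))]


def classify_treatment(paths: Paths) -> str:
    """Classify per-unit treatment paths (each already ordered by time)."""
    cleaned = [_observed(seq) for seq in paths.values()]
    support = {v for seq in cleaned for v in seq}
    if not support:
        return "categorical"  # column is entirely NaN
    if not support <= {0.0, 1.0}:
        return "continuous"  # assumption: any numeric support outside {0, 1}
    # Absorbing: no unit's path ever steps down (weakly non-decreasing).
    absorbing = all(
        all(a <= b for a, b in zip(seq, seq[1:])) for seq in cleaned
    )
    return "binary_absorbing" if absorbing else "binary_non_absorbing"


def has_never_treated(paths: Paths) -> bool:
    """Some unit is 0 in every observed row (binary or continuous only)."""
    if classify_treatment(paths) == "categorical":
        return False
    cleaned = [_observed(seq) for seq in paths.values()]
    return any(len(seq) > 0 and all(v == 0 for v in seq) for seq in cleaned)


def has_always_treated(paths: Paths) -> bool:
    """Some unit is 1 in every observed row; binary-only semantics."""
    if not classify_treatment(paths).startswith("binary"):
        return False
    cleaned = [_observed(seq) for seq in paths.values()]
    return any(len(seq) > 0 and all(v == 1 for v in seq) for seq in cleaned)


staggered = {"a": [0, 0, 1, 1], "b": [0, 0, 0, 0]}
reversal = {"a": [0, 1, 0], "b": [0, 0, 0]}
dosed = {"a": [2.5, 2.5], "b": [0.0, 0.0]}
```

With these inputs, staggered is "binary_absorbing" with a never-treated unit, reversal is "binary_non_absorbing" because unit "a" switches from 1 back to 0, and dosed is "continuous" with has_always_treated fixed at False.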
PanelProfile
- class diff_diff.PanelProfile
Bases: object
Structural facts about a DiD panel.
Returned by profile_panel(). Mirrors the BusinessContext frozen-dataclass pattern. Consume .to_dict() for a JSON-serializable representation and reason against the bundled llms-autonomous.txt guide.
- n_units: int
- n_periods: int
- n_obs: int
- is_balanced: bool
- observation_coverage: float
- treatment_type: str
- is_staggered: bool
- n_cohorts: int
- has_never_treated: bool
- has_always_treated: bool
- treatment_varies_within_unit: bool
- outcome_dtype: str
- outcome_is_binary: bool
- outcome_has_zeros: bool
- outcome_has_negatives: bool
- outcome_missing_fraction: float
- outcome_shape: OutcomeShape | None = None
- treatment_dose: TreatmentDoseShape | None = None
- to_dict()
Return a JSON-serializable dict representation of the profile.
- __init__(n_units, n_periods, n_obs, is_balanced, observation_coverage, treatment_type, is_staggered, n_cohorts, cohort_sizes, has_never_treated, has_always_treated, treatment_varies_within_unit, first_treatment_period, last_treatment_period, min_pre_periods, min_post_periods, outcome_dtype, outcome_is_binary, outcome_has_zeros, outcome_has_negatives, outcome_missing_fraction, outcome_summary, alerts, outcome_shape=None, treatment_dose=None)
- Parameters:
n_units (int)
n_periods (int)
n_obs (int)
is_balanced (bool)
observation_coverage (float)
treatment_type (str)
is_staggered (bool)
n_cohorts (int)
has_never_treated (bool)
has_always_treated (bool)
treatment_varies_within_unit (bool)
first_treatment_period (Any | None)
last_treatment_period (Any | None)
min_pre_periods (int | None)
min_post_periods (int | None)
outcome_dtype (str)
outcome_is_binary (bool)
outcome_has_zeros (bool)
outcome_has_negatives (bool)
outcome_missing_fraction (float)
outcome_shape (OutcomeShape | None)
treatment_dose (TreatmentDoseShape | None)
- Return type:
None
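The frozen-dataclass pattern with a .to_dict() view, as described for PanelProfile, can be sketched with the standard library alone. MiniProfile and MiniDoseShape are hypothetical stand-ins carrying a handful of the documented field names; the real PanelProfile.to_dict() may do additional conversion that this sketch does not attempt.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MiniDoseShape:
    """Hypothetical stand-in for a nested shape dataclass."""
    dose_min: float
    dose_max: float


@dataclass(frozen=True)
class MiniProfile:
    """Hypothetical stand-in using a subset of PanelProfile's field names."""
    n_units: int
    is_balanced: bool
    treatment_type: str
    treatment_dose: Optional[MiniDoseShape] = None

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses, so the result holds
        # only dicts, lists and scalars that json.dumps can serialize.
        return asdict(self)


binary = MiniProfile(n_units=2, is_balanced=True, treatment_type="binary_absorbing")
dosed = MiniProfile(
    n_units=3,
    is_balanced=True,
    treatment_type="continuous",
    treatment_dose=MiniDoseShape(dose_min=0.5, dose_max=2.0),
)

# Serialize and read back; the optional field stays None for the binary case.
payload = json.dumps(dosed.to_dict(), sort_keys=True)
restored = json.loads(payload)

# Frozen instances reject attribute assignment.
try:
    binary.n_units = 99
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Freezing the dataclass keeps a profile from drifting after it has been computed, and the optional fields defaulting to None mirrors how outcome_shape and treatment_dose are declared above.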
OutcomeShape
- class diff_diff.OutcomeShape
Bases: object
Distributional shape of a numeric outcome column.
Populated on PanelProfile when the outcome dtype is integer or float (np.dtype(...).kind in {"i", "u", "f"}); None otherwise. Descriptive only: these fields surface what is observed in the outcome distribution. They never recommend a specific estimator family.
- n_distinct_values: int
- pct_zeros: float
- value_min: float
- value_max: float
- is_integer_valued: bool
- is_count_like: bool
- is_bounded_unit: bool
- __init__(n_distinct_values, pct_zeros, value_min, value_max, skewness, excess_kurtosis, is_integer_valued, is_count_like, is_bounded_unit)
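A few of the OutcomeShape fields have names that pin down their meaning, and those can be sketched directly. describe_outcome is a hypothetical helper, not the library's code. It deliberately leaves out skewness, excess_kurtosis, is_count_like and is_bounded_unit, whose exact definitions are not given in this section, and it reports a zero_share fraction under its own name rather than claiming to match the scaling of pct_zeros. Dropping NaN values before describing is also an assumption.

```python
import math
from typing import Dict, Sequence, Union


def describe_outcome(values: Sequence[float]) -> Dict[str, Union[int, float, bool]]:
    """Describe a numeric outcome column using only unambiguous facts."""
    # Assumption: NaN entries are ignored rather than counted.
    observed = [float(v) for v in values if not math.isnan(float(v))]
    if not observed:
        raise ValueError("no non-NaN outcome values to describe")
    return {
        "n_distinct_values": len(set(observed)),
        "value_min": min(observed),
        "value_max": max(observed),
        # True when every observed value has no fractional part.
        "is_integer_valued": all(v.is_integer() for v in observed),
        # Hypothetical field: share of exact zeros, as a fraction in [0, 1].
        "zero_share": sum(v == 0 for v in observed) / len(observed),
    }


shape = describe_outcome([0, 2, 2, 5.0, float("nan")])
```

Like the dataclass itself, the helper only reports what is observed; it draws no conclusion about which estimator family suits the outcome.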
TreatmentDoseShape
- class diff_diff.TreatmentDoseShape
Bases: object
Distributional shape of a continuous treatment dose.
Populated on PanelProfile only when treatment_type == "continuous"; None otherwise. Most fields are descriptive distributional context.
profile_panel only sees the dose column, not the separate first_treat column ContinuousDiD.fit() consumes. In the canonical ContinuousDiD setup (Callaway, Goodman-Bacon, Sant’Anna 2024) the dose D_i is time-invariant per unit (D_i = 0 for never-treated, D_i > 0 constant across all periods for treated unit i) and first_treat is a separate column the caller supplies, not derived from the dose column. Under that canonical setup, several profile-side facts on the dose column predict ContinuousDiD.fit() outcomes:
1. PanelProfile.has_never_treated == True (some unit has dose 0 in every period). Predicts the estimator’s P(D=0) > 0 requirement under both control_group="never_treated" and control_group="not_yet_treated" (Remark 3.1 lowest-dose-as-control not yet implemented), because the canonical setup ties first_treat == 0 to D_i == 0. Failure means no never-treated controls exist on the dose column; see routing notes below.
2. PanelProfile.treatment_varies_within_unit == False (per-unit full-path dose constancy on the dose column). This IS the actual fit-time gate, matching ContinuousDiD.fit()’s df.groupby(unit)[dose].nunique() > 1 rejection at lines 222-228; holds regardless of first_treat. True rules ContinuousDiD out; for graded-adoption panels with dose changes use HeterogeneousAdoptionDiD.
3. PanelProfile.is_balanced == True. Actual fit-time gate (continuous_did.py:329-338); not first_treat-dependent.
4. Absence of the duplicate_unit_time_rows alert. The precompute path silently resolves duplicate (unit, time) cells via last-row-wins (continuous_did.py:818-823); not a fit-time raise. The agent must deduplicate before fit because ContinuousDiD will otherwise overwrite silently.
5. treatment_dose.dose_min > 0 (over non-zero doses). Predicts ContinuousDiD.fit()’s strictly-positive-treated-dose requirement (raises ValueError on negative dose for first_treat > 0 units, continuous_did.py:287-294). Failure means some treated units have negative dose; see routing notes below.
Routing alternatives when (1) or (5) fails:
- When (1) fails (no never-treated controls but all observed doses non-negative): ContinuousDiD does not apply (Remark 3.1 lowest-dose-as-control is not implemented). HeterogeneousAdoptionDiD IS a candidate for graded-adoption designs (HAD’s contract requires non-negative dose, satisfied here); linear DiD with the treatment as a continuous covariate is another.
- When (5) fails (negative treated doses): HeterogeneousAdoptionDiD is not a fallback either, because HAD raises on negative post-period dose (had.py:1450-1459, paper Section 2). Linear DiD with the treatment as a signed continuous covariate is the applicable routing alternative.
- Re-encoding the treatment column (shifting, absolute value, etc.) is an agent-side preprocessing choice that changes the estimand and is not documented in REGISTRY as a supported fallback; if the agent re-encodes to non-negative support, both ContinuousDiD and HeterogeneousAdoptionDiD become candidates again on the re-encoded scale.
- Do not relabel positive- or negative-dose units as first_treat == 0: that triggers ContinuousDiD.fit()’s force-zero coercion path, which is implementation behavior for inconsistent inputs (e.g., an accidentally-nonzero row on a never-treated unit), not a documented routing option.
The agent must still validate the supplied first_treat column independently: it must contain at least one first_treat == 0 unit (P(D=0) > 0), be non-negative integer-valued (or +inf / 0 for never-treated), and be consistent with the dose column on per-unit treated/untreated status. profile_panel does not see first_treat and cannot validate it.
has_zero_dose is a row-level fact (“at least one observation has dose == 0”); it is NOT a substitute for has_never_treated, which is the unit-level field. A panel can have has_zero_dose == True (pre-treatment zero rows) while has_never_treated == False (every unit eventually treated), in which case the standard-workflow agent would conclude no never-treated controls exist before calling ContinuousDiD.fit().
- n_distinct_doses: int
- has_zero_dose: bool
- dose_min: float
- dose_max: float
- dose_mean: float
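The five profile-side checks and the has_zero_dose / has_never_treated distinction described for this class can be sketched on a plain dict. Both helpers are hypothetical. The dict is assumed to resemble PanelProfile.to_dict() output, and the layout of each alert (a dict with a "code" key) is an assumption, since the Alert fields are not shown in this section. The sketch does not inspect first_treat, which still has to be validated separately.

```python
from typing import Any, Dict, List, Sequence, Tuple


def continuous_did_prefit_failures(profile: Dict[str, Any]) -> List[int]:
    """Return the numbers (1-5) of the documented checks that fail."""
    if profile["treatment_type"] != "continuous":
        raise ValueError("checks apply only when treatment_type == 'continuous'")
    failures: List[int] = []
    if not profile["has_never_treated"]:
        failures.append(1)  # no zero-dose control units
    if profile["treatment_varies_within_unit"]:
        failures.append(2)  # dose is not constant within unit
    if not profile["is_balanced"]:
        failures.append(3)  # unbalanced panel
    # Assumption: each alert is a dict carrying its identifier under "code".
    codes = {alert.get("code") for alert in profile.get("alerts", [])}
    if "duplicate_unit_time_rows" in codes:
        failures.append(4)  # duplicates would be overwritten silently
    if not profile["treatment_dose"]["dose_min"] > 0:
        failures.append(5)  # some non-zero dose is negative
    return failures


def zero_dose_facts(doses_by_unit: Dict[str, Sequence[float]]) -> Tuple[bool, bool]:
    """Return (row-level has_zero_dose, unit-level has_never_treated)."""
    has_zero_dose = any(d == 0 for seq in doses_by_unit.values() for d in seq)
    has_never_treated = any(
        len(seq) > 0 and all(d == 0 for d in seq) for seq in doses_by_unit.values()
    )
    return has_zero_dose, has_never_treated


example = {
    "treatment_type": "continuous",
    "has_never_treated": False,
    "treatment_varies_within_unit": False,
    "is_balanced": True,
    "alerts": [{"code": "duplicate_unit_time_rows"}],
    "treatment_dose": {"dose_min": -1.0},
}
failed = continuous_did_prefit_failures(example)

# Every unit is eventually dosed, yet pre-treatment rows contain zeros.
row_level, unit_level = zero_dose_facts({"a": [0, 0, 2, 2], "b": [0, 3, 3, 3]})
```

Returning check numbers instead of an estimator name keeps the sketch in line with the profile's descriptive stance: the caller still maps a failed (1) or (5) onto the routing alternatives listed above.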