Panel Profiling

Pre-fit description of a DiD panel’s structural facts. profile_panel() inspects a long-format panel and returns a PanelProfile dataclass covering balance, treatment-type classification, outcome characteristics, and a list of factual Alert observations.

The profile is descriptive, not opinionated: alerts report what is (e.g. “smallest cohort has 7 units”), never what to do about it. Estimator selection is the caller’s responsibility. For autonomous-agent consumption, pair the profile output with the autonomous-agent reference guide (also accessible at runtime via diff_diff.get_llm_guide("autonomous")), which walks through the estimator-support matrix and the per-design-feature reasoning keyed off PanelProfile field values.

Note

PanelProfile and its three supporting dataclasses (OutcomeShape, TreatmentDoseShape, Alert) are re-exported at the top level of diff_diff so callers can construct or pattern-match against them without dotted-module access.

profile_panel

diff_diff.profile_panel(df, *, unit, time, treatment, outcome)

Describe the structure of a DiD panel.

Reports structural facts — balance, treatment-type classification, outcome characteristics, factual alerts. Descriptive, not opinionated: the profile says what is, never what to do about it. Estimator selection is up to the caller.

Parameters:
  • df (pandas.DataFrame) – Long-format panel data containing the four named columns.

  • unit (str) – Column identifying the cross-sectional unit.

  • time (str) – Column identifying the time period.

  • treatment (str) – Column holding the treatment indicator or dose. See Notes for the classification rules.

  • outcome (str) – Column holding the outcome variable.

Returns:

Frozen dataclass. Call .to_dict() for a JSON-serializable view.

Return type:

PanelProfile

Raises:

ValueError – If any of the four column names is not present in df.

Examples

>>> import pandas as pd
>>> from diff_diff import profile_panel
>>> df = pd.DataFrame({
...     "u":  [1, 1, 2, 2],
...     "t":  [0, 1, 0, 1],
...     "tr": [0, 0, 1, 1],
...     "y":  [0.1, 0.2, 0.1, 0.9],
... })
>>> profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y")
>>> profile.is_balanced
True
>>> profile.treatment_type
'binary_absorbing'

Notes

Classification rules for treatment_type:

  • "binary_absorbing": numeric treatment whose observed non-NaN values are a subset of \(\{0, 1\}\) (one or two distinct values) AND each unit’s treatment sequence (ordered by time) is weakly monotone non-decreasing. All-zero and all-one panels are valid degenerate cases.

  • "binary_non_absorbing": values a subset of \(\{0, 1\}\) with at least two distinct values observed, where at least one unit switches from 1 back to 0.

  • "continuous": numeric treatment with more than two distinct values, or a 2-valued numeric whose values are not in \(\{0, 1\}\) (matches the ContinuousDiD convention).

  • "categorical": non-numeric dtype (object / category) or a column that is entirely NaN.

Bool-dtype columns (True / False) are classified the same way as numeric {0, 1}: the library’s binary estimators validate on value support via diff_diff.utils.validate_binary(), so True / False behave like 1 / 0 for absorbing / non-absorbing classification.
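The classification rules above can be sketched in plain Python. This is an illustrative reading of the documented rules, not the library's code: the helper name classify_treatment and the (unit, time, treatment) tuple layout are assumptions, and non-numeric Python values stand in for an object / category dtype.

```python
import math
from collections import defaultdict


def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def classify_treatment(rows):
    """Classify a treatment column given (unit, time, treatment) tuples.

    Hypothetical helper mirroring the documented rules; not the library's
    implementation.
    """
    rows = list(rows)
    values = [tr for _, _, tr in rows]
    # Non-numeric values stand in for an object / category dtype.
    # bool is a subclass of int, so True / False are treated like 1 / 0.
    if any(not isinstance(v, (int, float)) for v in values):
        return "categorical"
    observed = {v for v in values if not _is_nan(v)}
    if not observed:  # entirely NaN
        return "categorical"
    # Assumption: a single distinct value outside {0, 1} is not covered by
    # the stated rules; this sketch files it under "continuous".
    if not observed <= {0, 1}:
        return "continuous"
    paths = defaultdict(list)
    for unit, time, tr in rows:
        if not _is_nan(tr):
            paths[unit].append((time, tr))
    for path in paths.values():
        doses = [tr for _, tr in sorted(path)]  # order by time
        if any(later < earlier for earlier, later in zip(doses, doses[1:])):
            return "binary_non_absorbing"  # some unit switches 1 -> 0
    return "binary_absorbing"


absorbing = [(1, 0, 0), (1, 1, 0), (2, 0, 1), (2, 1, 1)]
switching = [(1, 0, 0), (1, 1, 1), (1, 2, 0)]
print(classify_treatment(absorbing))  # binary_absorbing
print(classify_treatment(switching))  # binary_non_absorbing
```

The first panel is the one from the Examples section, so the sketch agrees with the documented 'binary_absorbing' result there.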

has_never_treated is computed for both binary and continuous numeric treatment types: it is True when some unit has treatment == 0 in every observed non-NaN row. For binary treatment this flags the clean-control group; for continuous treatment it flags zero-dose controls (required by ContinuousDiD). It is always False for "categorical".

has_always_treated has binary-only semantics: it is True when some unit has treatment == 1 in every observed non-NaN row (no pre-treatment information in the DiD sense). For "continuous" and "categorical" treatment this field is always False regardless of dose positivity, because pre-treatment periods in continuous DiD are determined by the separate first_treat column passed to ContinuousDiD.fit, not by whether the dose is strictly positive.
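A minimal sketch of the two unit-level flags, assuming the treatment type has already been determined. The function name and the (unit, time, treatment) tuple layout are hypothetical; only the semantics come from the description above.

```python
import math
from collections import defaultdict


def never_and_always_treated(rows, treatment_type):
    """Return (has_never_treated, has_always_treated) for (unit, time, treatment) rows."""
    if treatment_type == "categorical":
        return False, False  # both flags are always False for categorical
    per_unit = defaultdict(list)
    for unit, _time, tr in rows:
        if isinstance(tr, float) and math.isnan(tr):
            continue  # only observed non-NaN rows count
        per_unit[unit].append(tr)
    has_never = any(all(v == 0 for v in vals) for vals in per_unit.values())
    # Binary-only semantics: continuous doses never set has_always_treated,
    # even when a unit's dose equals 1 in every period.
    is_binary = treatment_type in ("binary_absorbing", "binary_non_absorbing")
    has_always = is_binary and any(
        all(v == 1 for v in vals) for vals in per_unit.values()
    )
    return has_never, has_always


binary_rows = [(1, 0, 0), (1, 1, 0), (2, 0, 1), (2, 1, 1)]
dose_rows = [(1, 0, 0.0), (1, 1, 0.0), (2, 0, 1.0), (2, 1, 1.0), (3, 0, 2.5), (3, 1, 2.5)]
print(never_and_always_treated(binary_rows, "binary_absorbing"))  # (True, True)
print(never_and_always_treated(dose_rows, "continuous"))          # (True, False)
```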

Rows with NaN in unit or time are dropped up front and surfaced via the missing_id_rows_dropped alert; all subsequent structural facts are computed on the non-missing subset, so observation_coverage is always in [0, 1]. Duplicate (unit, time) rows are surfaced separately via the duplicate_unit_time_rows alert.
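The order of operations described above (drop rows with missing identifiers first, then compute everything else) can be illustrated as follows. The coverage formula here, distinct observed (unit, time) cells divided by n_units × n_periods, is an assumption chosen so that the result stays in [0, 1]; the source does not state the exact formula, and id_diagnostics is a hypothetical helper.

```python
import math
from collections import Counter


def _is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def id_diagnostics(rows):
    """Summarise identifier problems for a list of (unit, time) pairs."""
    kept = [(u, t) for u, t in rows if not _is_missing(u) and not _is_missing(t)]
    cells = Counter(kept)
    units = {u for u, _ in kept}
    periods = {t for _, t in kept}
    grid = len(units) * len(periods)
    return {
        "missing_id_rows_dropped": len(rows) - len(kept),
        "duplicate_unit_time_rows": sum(n - 1 for n in cells.values()),
        # Assumed definition: distinct observed cells over the full grid.
        "observation_coverage": len(cells) / grid if grid else 0.0,
    }


rows = [(1, 0), (1, 1), (2, 0), (2, 0), (float("nan"), 1), (3, None)]
print(id_diagnostics(rows))
```

Unit 3 never enters the grid because its only row has a missing time value, which is why the facts are computed on the non-missing subset.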

The profile does not recommend an estimator. Consult diff_diff.get_llm_guide("autonomous") for the estimator-support matrix and per-design-feature reasoning.

PanelProfile

class diff_diff.PanelProfile

Bases: object

Structural facts about a DiD panel.

Returned by profile_panel(). Mirrors the BusinessContext frozen-dataclass pattern. Consume .to_dict() for a JSON-serializable representation and reason against the bundled llms-autonomous.txt guide.

n_units: int
n_periods: int
n_obs: int
is_balanced: bool
observation_coverage: float
treatment_type: str
is_staggered: bool
n_cohorts: int
cohort_sizes: Mapping[Any, int]
has_never_treated: bool
has_always_treated: bool
treatment_varies_within_unit: bool
first_treatment_period: Any | None
last_treatment_period: Any | None
min_pre_periods: int | None
min_post_periods: int | None
outcome_dtype: str
outcome_is_binary: bool
outcome_has_zeros: bool
outcome_has_negatives: bool
outcome_missing_fraction: float
outcome_summary: Mapping[str, float]
alerts: Tuple[Alert, ...]
outcome_shape: OutcomeShape | None = None
treatment_dose: TreatmentDoseShape | None = None
to_dict()

Return a JSON-serializable dict representation of the profile.

Return type:

Dict[str, Any]

__init__(n_units, n_periods, n_obs, is_balanced, observation_coverage, treatment_type, is_staggered, n_cohorts, cohort_sizes, has_never_treated, has_always_treated, treatment_varies_within_unit, first_treatment_period, last_treatment_period, min_pre_periods, min_post_periods, outcome_dtype, outcome_is_binary, outcome_has_zeros, outcome_has_negatives, outcome_missing_fraction, outcome_summary, alerts, outcome_shape=None, treatment_dose=None)
Return type:

None

OutcomeShape

class diff_diff.OutcomeShape

Bases: object

Distributional shape of a numeric outcome column.

Populated on PanelProfile when the outcome dtype is integer or float (np.dtype(...).kind in {"i", "u", "f"}); None otherwise. Descriptive only — these fields surface what is observed in the outcome distribution. They never recommend a specific estimator family.

n_distinct_values: int
pct_zeros: float
value_min: float
value_max: float
skewness: float | None
excess_kurtosis: float | None
is_integer_valued: bool
is_count_like: bool
is_bounded_unit: bool
__init__(n_distinct_values, pct_zeros, value_min, value_max, skewness, excess_kurtosis, is_integer_valued, is_count_like, is_bounded_unit)
Parameters:
  • n_distinct_values (int)

  • pct_zeros (float)

  • value_min (float)

  • value_max (float)

  • skewness (float | None)

  • excess_kurtosis (float | None)

  • is_integer_valued (bool)

  • is_count_like (bool)

  • is_bounded_unit (bool)

Return type:

None

TreatmentDoseShape

class diff_diff.TreatmentDoseShape

Bases: object

Distributional shape of a continuous treatment dose.

Populated on PanelProfile only when treatment_type == "continuous"; None otherwise. Most fields are descriptive distributional context.

profile_panel only sees the dose column, not the separate first_treat column ContinuousDiD.fit() consumes. In the canonical ContinuousDiD setup (Callaway, Goodman-Bacon, Sant’Anna 2024) the dose D_i is time-invariant per unit (D_i = 0 for never-treated, D_i > 0 constant across all periods for treated unit i) and first_treat is a separate column the caller supplies — not derived from the dose column. Under that canonical setup, several profile-side facts on the dose column predict ContinuousDiD.fit() outcomes:

  1. PanelProfile.has_never_treated == True (some unit has dose 0 in every period). This predicts the estimator’s P(D=0) > 0 requirement under both control_group="never_treated" and control_group="not_yet_treated" (the Remark 3.1 lowest-dose-as-control option is not yet implemented), because the canonical setup ties first_treat == 0 to D_i == 0. Failure means no never-treated controls exist on the dose column; see the routing notes below.

  2. PanelProfile.treatment_varies_within_unit == False (per-unit full-path dose constancy on the dose column). This is the actual fit-time gate: it matches the df.groupby(unit)[dose].nunique() > 1 rejection in ContinuousDiD.fit() (lines 222-228) and holds regardless of first_treat. A value of True rules ContinuousDiD out; for graded-adoption panels with dose changes, use HeterogeneousAdoptionDiD.

  3. PanelProfile.is_balanced == True. This is also an actual fit-time gate (continuous_did.py:329-338) and does not depend on first_treat.

  4. Absence of the duplicate_unit_time_rows alert. The precompute path silently resolves duplicate (unit, time) cells via last-row-wins (continuous_did.py:818-823) rather than raising at fit time, so the agent must deduplicate before fitting; otherwise ContinuousDiD overwrites the earlier rows silently.

  5. treatment_dose.dose_min > 0 (over non-zero doses). This predicts the strictly-positive treated-dose requirement of ContinuousDiD.fit(), which raises ValueError on a negative dose for units with first_treat > 0 (continuous_did.py:287-294). Failure means some treated units have a negative dose; see the routing notes below.
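The five profile-side facts above can be collected into one pre-flight report. The sketch below is a hypothetical helper, not part of diff_diff; it only reads attribute names documented on this page, and the SimpleNamespace objects are stand-ins for a real PanelProfile so the example runs without the library.

```python
from types import SimpleNamespace


def continuous_did_preflight(profile):
    """Map each of the five documented checks to True (holds) or False."""
    alert_codes = {alert.code for alert in profile.alerts}
    dose = profile.treatment_dose  # None unless treatment_type == "continuous"
    return {
        "1_has_never_treated": profile.has_never_treated,
        "2_dose_constant_within_unit": not profile.treatment_varies_within_unit,
        "3_is_balanced": profile.is_balanced,
        "4_no_duplicate_cells": "duplicate_unit_time_rows" not in alert_codes,
        "5_positive_treated_dose": dose is not None and dose.dose_min > 0,
    }


# Stand-in for a PanelProfile; the values are illustrative only.
demo_profile = SimpleNamespace(
    has_never_treated=True,
    treatment_varies_within_unit=False,
    is_balanced=True,
    alerts=(),
    treatment_dose=SimpleNamespace(dose_min=0.5),
)
report = continuous_did_preflight(demo_profile)
print(all(report.values()))  # True
```

Passing all five checks does not replace validating the separate first_treat column, which the profile never sees.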

Routing alternatives when (1) or (5) fails:

  • When (1) fails (no never-treated controls but all observed doses non-negative): ContinuousDiD does not apply (Remark 3.1 lowest-dose-as-control is not implemented). HeterogeneousAdoptionDiD IS a candidate for graded-adoption designs (HAD’s contract requires non-negative dose, satisfied here); linear DiD with the treatment as a continuous covariate is another.

  • When (5) fails (negative treated doses): HeterogeneousAdoptionDiD is not a fallback either — HAD raises on negative post-period dose (had.py:1450-1459, paper Section 2). Linear DiD with the treatment as a signed continuous covariate is the applicable routing alternative.

  • Re-encoding the treatment column (shifting, absolute value, etc.) is an agent-side preprocessing choice that changes the estimand and is not documented in REGISTRY as a supported fallback; if the agent re-encodes to non-negative support, both ContinuousDiD and HeterogeneousAdoptionDiD become candidates again on the re-encoded scale.

  • Do not relabel positive- or negative-dose units as first_treat == 0: that triggers ContinuousDiD.fit()’s force-zero coercion path, which is implementation behavior for inconsistent inputs (e.g., an accidentally-nonzero row on a never-treated unit), not a documented routing option.
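As a compact restatement of the routing notes, the following sketch maps the outcome of checks (1) and (5) to the candidate approaches named above. It is a hypothetical helper written for this page: it lists candidates, it does not recommend one, and it treats the case where both checks fail like a failure of (5), which is an inference from the notes rather than something they state.

```python
def routing_candidates(has_never_treated, dose_min):
    """List candidate approaches given checks (1) and (5).

    dose_min is the minimum over non-zero doses, so it is never exactly 0.
    """
    if dose_min < 0:
        # (5) fails: ContinuousDiD and HeterogeneousAdoptionDiD both raise
        # on negative treated doses.
        return ["linear DiD with a signed continuous covariate"]
    if not has_never_treated:
        # (1) fails with non-negative doses.
        return [
            "HeterogeneousAdoptionDiD",
            "linear DiD with a continuous covariate",
        ]
    # (1) and (5) hold; checks (2)-(4) and first_treat still need validation.
    return ["ContinuousDiD"]


print(routing_candidates(has_never_treated=False, dose_min=0.2))
print(routing_candidates(has_never_treated=True, dose_min=-1.0))
```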

The agent must still validate the supplied first_treat column independently: it must contain at least one first_treat == 0 unit (P(D=0) > 0), be non-negative integer-valued (or +inf / 0 for never-treated), and be consistent with the dose column on per-unit treated/untreated status. profile_panel does not see first_treat and cannot validate it.
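One way an agent might run these first_treat checks before fitting is sketched below. The helper name and the per-unit dict inputs are assumptions, as is the choice to count both 0 and +inf as never-treated markers when testing P(D=0) > 0; the three conditions themselves come from the paragraph above.

```python
import math


def first_treat_problems(first_treat, dose):
    """Return a list of problems; first_treat and dose are dicts keyed by unit."""

    def never_treated(value):
        return value == 0 or value == math.inf

    problems = []
    if not any(never_treated(ft) for ft in first_treat.values()):
        problems.append("no never-treated unit, so P(D=0) > 0 fails")
    for unit, ft in first_treat.items():
        if not never_treated(ft) and (ft < 0 or not float(ft).is_integer()):
            problems.append(f"unit {unit!r}: first_treat is not a non-negative integer")
        # Consistency: a unit is untreated on the dose column iff its dose is 0.
        if never_treated(ft) != (dose[unit] == 0):
            problems.append(f"unit {unit!r}: first_treat and dose disagree")
    return problems


clean = first_treat_problems({"a": math.inf, "b": 3}, {"a": 0.0, "b": 1.2})
messy = first_treat_problems({"a": 0, "b": 3, "c": 2.5}, {"a": 0.0, "b": 1.2, "c": 0.0})
print(clean)       # []
print(len(messy))  # 2
```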

has_zero_dose is a row-level fact (“at least one observation has dose == 0”); it is NOT a substitute for has_never_treated, which is the unit-level field. A panel can have has_zero_dose == True (pre-treatment zero rows) while has_never_treated == False (every unit eventually treated), in which case an agent following the standard workflow should conclude, before calling ContinuousDiD.fit(), that no never-treated controls exist.
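A small illustrative panel makes the row-level versus unit-level distinction concrete. The per-unit dose lists are made-up data, and the two expressions are simplified readings of the field descriptions rather than the library's code.

```python
# Dose paths per unit, ordered by time. Every unit is eventually treated.
doses_by_unit = {
    "a": [0.0, 0.0, 2.0],
    "b": [0.0, 1.5, 1.5],
}

# Row-level: does any single observation have dose == 0?
has_zero_dose = any(d == 0 for doses in doses_by_unit.values() for d in doses)

# Unit-level: does some unit have dose == 0 in every observed row?
has_never_treated = any(
    all(d == 0 for d in doses) for doses in doses_by_unit.values()
)

print(has_zero_dose, has_never_treated)  # True False
```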

n_distinct_doses: int
has_zero_dose: bool
dose_min: float
dose_max: float
dose_mean: float
__init__(n_distinct_doses, has_zero_dose, dose_min, dose_max, dose_mean)
Return type:

None

Alert

class diff_diff.Alert

Bases: object

A factual observation about a panel.

severity is "info" (descriptive) or "warn" (descriptive and likely relevant to the caller’s estimator choice). Alerts never recommend a specific estimator.
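A caller would typically partition alerts by severity before reasoning about them. The sketch below uses a stand-in dataclass with the same four fields so that it runs without the library; the severities assigned here and the code "example_info_code" are illustrative assumptions, not values the library is known to emit.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AlertStandIn:
    """Stand-in mirroring the four documented Alert fields."""

    code: str
    severity: str
    message: str
    observed: Any


alerts = (
    AlertStandIn("duplicate_unit_time_rows", "warn", "2 duplicate (unit, time) rows", 2),
    AlertStandIn("example_info_code", "info", "smallest cohort has 7 units", 7),
)

# Keep only the alerts flagged as likely relevant to estimator choice.
warn_codes = [alert.code for alert in alerts if alert.severity == "warn"]
print(warn_codes)  # ['duplicate_unit_time_rows']
```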

code: str
severity: str
message: str
observed: Any
__init__(code, severity, message, observed)
Return type:

None