Panel Profiling

Pre-fit description of a DiD panel’s structural facts. profile_panel() inspects a long-format panel and returns a PanelProfile dataclass covering balance, treatment-type classification, outcome characteristics, and a list of factual Alert observations.

The profile is descriptive, not opinionated: alerts report what is (e.g. “smallest cohort has 7 units”), never what to do about it. Estimator selection is the caller’s responsibility. For autonomous-agent consumption, pair the profile output with the autonomous-agent reference guide (also accessible at runtime via diff_diff.get_llm_guide("autonomous")), which walks through the estimator-support matrix and the per-design-feature reasoning keyed off PanelProfile field values.

Note

PanelProfile and its three supporting dataclasses (OutcomeShape, TreatmentDoseShape, Alert) are re-exported at the top level of diff_diff so callers can construct or pattern-match against them without dotted-module access.

profile_panel

diff_diff.profile_panel(df, *, unit, time, treatment, outcome)

Describe the structure of a DiD panel.

Reports structural facts — balance, treatment-type classification, outcome characteristics, factual alerts. Descriptive, not opinionated: the profile says what is, never what to do about it. Estimator selection is up to the caller.

Parameters:

df (pandas.DataFrame) – Long-format panel data containing the four named columns.
unit (str) – Column identifying the cross-sectional unit.
time (str) – Column identifying the time period.
treatment (str) – Column holding the treatment indicator or dose. See Notes for the classification rules.
outcome (str) – Column holding the outcome variable.

Returns:

Frozen dataclass. Call .to_dict() for a JSON-serializable view.

Return type:

PanelProfile

Raises:

ValueError – If any of the four column names is not present in df.

Examples

>>> import pandas as pd
>>> from diff_diff import profile_panel
>>> df = pd.DataFrame({
...     "u":  [1, 1, 2, 2],
...     "t":  [0, 1, 0, 1],
...     "tr": [0, 0, 1, 1],
...     "y":  [0.1, 0.2, 0.1, 0.9],
... })
>>> profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y")
>>> profile.is_balanced
True
>>> profile.treatment_type
'binary_absorbing'

Notes

Classification rules for treatment_type:

"binary_absorbing": numeric treatment whose observed non-NaN values are a subset of \(\{0, 1\}\) (one or two distinct values) AND each unit’s treatment sequence (ordered by time) is weakly monotone non-decreasing. All-zero and all-one panels are valid degenerate cases.
"binary_non_absorbing": values a subset of \(\{0, 1\}\) with at least two distinct values observed, where at least one unit switches from 1 back to 0.
"continuous": numeric treatment with more than two distinct values, or a 2-valued numeric whose values are not in \(\{0, 1\}\) (matches the ContinuousDiD convention).
"categorical": non-numeric dtype (object / category) or a column that is entirely NaN.

Bool-dtype columns (True / False) are classified the same way as numeric {0, 1}: the library’s binary estimators validate on value support via diff_diff.utils.validate_binary(), so True / False behave like 1 / 0 for absorbing / non-absorbing classification.
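
The classification rules above can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the library's code: the helper name classify_treatment and the (unit, time, value) row layout are hypothetical, and a single-valued numeric column outside {0, 1} (a case the rules do not spell out) falls through to "continuous" here.

```python
import math
from collections import defaultdict


def classify_treatment(rows):
    """Classify a treatment column given a list of (unit, time, value) rows.

    Hypothetical helper that mirrors the documented rules; bool values
    count as numeric, so True / False behave like 1 / 0.
    """
    values = [v for _, _, v in rows]
    # Any non-numeric value stands in for an object / category dtype.
    if any(not isinstance(v, (int, float)) for v in values):
        return "categorical"
    observed = {float(v) for v in values if not math.isnan(v)}
    if not observed:  # column is entirely NaN
        return "categorical"
    if not observed <= {0.0, 1.0}:
        return "continuous"
    # Binary support: check per-unit monotonicity, ordered by time.
    by_unit = defaultdict(list)
    for unit, time, v in rows:
        if not math.isnan(v):
            by_unit[unit].append((time, float(v)))
    for seq in by_unit.values():
        ordered = [v for _, v in sorted(seq)]
        if any(a > b for a, b in zip(ordered, ordered[1:])):
            return "binary_non_absorbing"  # some unit switches 1 -> 0
    return "binary_absorbing"
```

Calling it on the four-row panel from the Examples section, passed as [(1, 0, 0), (1, 1, 0), (2, 0, 1), (2, 1, 1)], yields the same "binary_absorbing" label shown there.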

has_never_treated is computed across both binary and continuous numeric treatment types: some unit has treatment == 0 in every observed non-NaN row. For binary this flags the clean-control group; for continuous this flags zero-dose controls (required by ContinuousDiD). Always False for "categorical".

has_always_treated has binary-only semantics: some unit has treatment == 1 in every observed non-NaN row (no pre-treatment information in the DiD sense). For "continuous" and "categorical" treatment this field is always False regardless of dose positivity — pre-treatment periods on continuous DiD are determined by the separate first_treat column passed to ContinuousDiD.fit, not by whether the dose is strictly positive.
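
As a rough illustration of the two unit-level flags, the sketch below computes them from (unit, value) rows. The helper name treated_flags is an assumption made for this example, and the treatment type is passed in rather than derived.

```python
import math
from collections import defaultdict


def treated_flags(rows, treatment_type):
    """Return (has_never_treated, has_always_treated) for (unit, value) rows.

    Hypothetical helper; follows the documented semantics only.
    """
    by_unit = defaultdict(list)
    for unit, value in rows:
        if not math.isnan(value):  # only observed non-NaN rows count
            by_unit[unit].append(value)
    is_binary = treatment_type in {"binary_absorbing", "binary_non_absorbing"}
    is_numeric = is_binary or treatment_type == "continuous"
    # Never-treated: binary and continuous; always False for categorical.
    never = is_numeric and any(
        all(v == 0 for v in vals) for vals in by_unit.values()
    )
    # Always-treated: binary-only semantics.
    always = is_binary and any(
        all(v == 1 for v in vals) for vals in by_unit.values()
    )
    return never, always
```

A continuous panel in which one unit holds a constant dose of 1.0 therefore reports has_always_treated as False, matching the binary-only rule above.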

Rows with NaN in unit or time are dropped up front and surfaced via the missing_id_rows_dropped alert; all subsequent structural facts are computed on the non-missing subset, so observation_coverage is always in [0, 1]. Duplicate (unit, time) rows are surfaced separately via the duplicate_unit_time_rows alert.
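
The identifier handling described above might look like the following sketch. The helper name id_summary is hypothetical, and the coverage formula (distinct observed cells divided by units times periods) is an assumption about how observation_coverage is defined, chosen because it stays within [0, 1].

```python
import math
from collections import Counter


def _is_missing(x):
    return x is None or (isinstance(x, float) and math.isnan(x))


def id_summary(rows):
    """Summarize identifier quality for a list of (unit, time) pairs."""
    # Drop rows with a missing unit or time before computing anything else.
    kept = [(u, t) for u, t in rows if not (_is_missing(u) or _is_missing(t))]
    cells = Counter(kept)
    units = {u for u, _ in cells}
    periods = {t for _, t in cells}
    # Assumed definition: distinct cells over the full unit x period grid.
    grid = len(units) * len(periods)
    return {
        "missing_id_rows_dropped": len(rows) - len(kept),
        "duplicate_unit_time_rows": sum(n - 1 for n in cells.values()),
        "observation_coverage": len(cells) / grid if grid else 0.0,
    }


summary = id_summary([(1, 0), (1, 1), (2, 0), (2, 0), (None, 1), (3, float("nan"))])
```

Here two rows are dropped for missing identifiers, one duplicate cell is counted, and three of the four cells in the 2 x 2 grid are observed.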

The profile does not recommend an estimator. Consult diff_diff.get_llm_guide("autonomous") for the estimator-support matrix and per-design-feature reasoning.

PanelProfile

class diff_diff.PanelProfile

Bases: object

Structural facts about a DiD panel.

Returned by profile_panel(). Mirrors the BusinessContext frozen-dataclass pattern. Consume .to_dict() for a JSON-serializable representation and reason against the bundled llms-autonomous.txt guide.

n_units: int

n_periods: int

n_obs: int

is_balanced: bool

observation_coverage: float

treatment_type: str

is_staggered: bool

n_cohorts: int

cohort_sizes: Mapping[Any, int]

has_never_treated: bool

has_always_treated: bool

treatment_varies_within_unit: bool

first_treatment_period: Any | None

last_treatment_period: Any | None

min_pre_periods: int | None

min_post_periods: int | None

outcome_dtype: str

outcome_is_binary: bool

outcome_has_zeros: bool

outcome_has_negatives: bool

outcome_missing_fraction: float

outcome_summary: Mapping[str, float]

alerts: Tuple[Alert, ...]

outcome_shape: OutcomeShape | None = None

treatment_dose: TreatmentDoseShape | None = None

to_dict()

Return a JSON-serializable dict representation of the profile.

Return type:

Dict[str, Any]
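
To illustrate the frozen-dataclass pattern and the JSON-serializable view, here is a minimal sketch using stand-in classes. MiniProfile, MiniAlert, and the alert code "example_code" are hypothetical and carry only a few fields; the real to_dict() may be implemented differently.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class MiniAlert:  # stand-in, not diff_diff.Alert
    code: str
    severity: str


@dataclass(frozen=True)
class MiniProfile:  # stand-in, not diff_diff.PanelProfile
    n_units: int
    is_balanced: bool
    alerts: Tuple[MiniAlert, ...] = ()

    def to_dict(self):
        # asdict recurses into nested dataclasses and tuples.
        return asdict(self)


profile = MiniProfile(2, True, alerts=(MiniAlert("example_code", "warn"),))
payload = json.dumps(profile.to_dict(), sort_keys=True)

# Frozen instances reject attribute assignment.
try:
    profile.n_units = 3
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because the instance is frozen, downstream code can pass a profile around without worrying that a consumer has altered it.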

__init__(n_units, n_periods, n_obs, is_balanced, observation_coverage, treatment_type, is_staggered, n_cohorts, cohort_sizes, has_never_treated, has_always_treated, treatment_varies_within_unit, first_treatment_period, last_treatment_period, min_pre_periods, min_post_periods, outcome_dtype, outcome_is_binary, outcome_has_zeros, outcome_has_negatives, outcome_missing_fraction, outcome_summary, alerts, outcome_shape=None, treatment_dose=None)

Parameters:

n_units (int)
n_periods (int)
n_obs (int)
is_balanced (bool)
observation_coverage (float)
treatment_type (str)
is_staggered (bool)
n_cohorts (int)
cohort_sizes (Mapping[Any, int])
has_never_treated (bool)
has_always_treated (bool)
treatment_varies_within_unit (bool)
first_treatment_period (Any | None)
last_treatment_period (Any | None)
min_pre_periods (int | None)
min_post_periods (int | None)
outcome_dtype (str)
outcome_is_binary (bool)
outcome_has_zeros (bool)
outcome_has_negatives (bool)
outcome_missing_fraction (float)
outcome_summary (Mapping[str, float])
alerts (Tuple[Alert, ...])
outcome_shape (OutcomeShape | None)
treatment_dose (TreatmentDoseShape | None)

Return type:

None

OutcomeShape

class diff_diff.OutcomeShape

Bases: object

Distributional shape of a numeric outcome column.

Populated on PanelProfile when the outcome dtype is integer or float (np.dtype(...).kind in {"i", "u", "f"}); None otherwise. Descriptive only — these fields surface what is observed in the outcome distribution. They never recommend a specific estimator family.

n_distinct_values: int

pct_zeros: float

value_min: float

value_max: float

skewness: float | None

excess_kurtosis: float | None

is_integer_valued: bool

is_count_like: bool

is_bounded_unit: bool

__init__(n_distinct_values, pct_zeros, value_min, value_max, skewness, excess_kurtosis, is_integer_valued, is_count_like, is_bounded_unit)

Parameters:

n_distinct_values (int)
pct_zeros (float)
value_min (float)
value_max (float)
skewness (float | None)
excess_kurtosis (float | None)
is_integer_valued (bool)
is_count_like (bool)
is_bounded_unit (bool)

Return type:

None

TreatmentDoseShape

class diff_diff.TreatmentDoseShape

Bases: object

Distributional shape of a continuous treatment dose.

Populated on PanelProfile only when treatment_type == "continuous"; None otherwise. Most fields are descriptive distributional context.

profile_panel only sees the dose column, not the separate first_treat column ContinuousDiD.fit() consumes. In the canonical ContinuousDiD setup (Callaway, Goodman-Bacon, Sant’Anna 2024) the dose D_i is time-invariant per unit (D_i = 0 for never-treated, D_i > 0 constant across all periods for treated unit i) and first_treat is a separate column the caller supplies — not derived from the dose column. Under that canonical setup, several profile-side facts on the dose column predict ContinuousDiD.fit() outcomes:

(1) PanelProfile.has_never_treated == True (some unit has dose 0 in every period). Predicts the estimator’s P(D=0) > 0 requirement under the default control_group="never_treated" / control_group="not_yet_treated" (the canonical setup ties first_treat == 0 to D_i == 0). When it is False, control_group="lowest_dose" (Remark 3.1) is the route: the lowest-dose group becomes the comparison, which needs a mass point at the minimum dose. Failure of (1) therefore does not by itself rule ContinuousDiD out; see routing notes below.
(2) PanelProfile.treatment_varies_within_unit == False (per-unit full-path dose constancy on the dose column). This is the actual fit-time gate, matching ContinuousDiD.fit()’s df.groupby(unit)[dose].nunique() > 1 rejection at lines 222-228; it holds regardless of first_treat. True rules ContinuousDiD out; for graded-adoption panels with dose changes use HeterogeneousAdoptionDiD.
(3) PanelProfile.is_balanced == True. Actual fit-time gate (continuous_did.py:329-338); not first_treat-dependent.
(4) Absence of the duplicate_unit_time_rows alert. The precompute path silently resolves duplicate (unit, time) cells via last-row-wins (continuous_did.py:818-823) rather than raising at fit time. The agent must deduplicate before fit because ContinuousDiD will otherwise overwrite silently.
(5) treatment_dose.dose_min > 0 (over non-zero doses). Predicts ContinuousDiD.fit()’s strictly-positive-treated-dose requirement (raises ValueError on negative dose for first_treat > 0 units, continuous_did.py:287-294). Failure means some treated units have negative dose; see routing notes below.
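
One way an agent might evaluate the five profile-side facts, in the order listed above, is sketched below. The helper name continuous_did_prechecks is hypothetical, and the test object is a SimpleNamespace stand-in that borrows only the documented field names; it is not a real PanelProfile.

```python
from types import SimpleNamespace


def continuous_did_prechecks(profile):
    """Map each fact (1-5, in the order listed above) to pass / fail."""
    alert_codes = {alert.code for alert in profile.alerts}
    dose = profile.treatment_dose  # None unless treatment is continuous
    return {
        1: profile.has_never_treated,
        2: not profile.treatment_varies_within_unit,
        3: profile.is_balanced,
        4: "duplicate_unit_time_rows" not in alert_codes,
        5: dose is not None and dose.dose_min > 0,
    }


# Stand-in profile: no never-treated units and duplicate cells present.
stand_in = SimpleNamespace(
    has_never_treated=False,
    treatment_varies_within_unit=False,
    is_balanced=True,
    alerts=(SimpleNamespace(code="duplicate_unit_time_rows"),),
    treatment_dose=SimpleNamespace(dose_min=0.5),
)
checks = continuous_did_prechecks(stand_in)
failed = [number for number, passed in checks.items() if not passed]
```

The result only reports which facts hold; what to do about a failure is left to the caller, consistent with the descriptive stance of the profile.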

Routing alternatives when (1) or (5) fails:

When (1) fails (no never-treated controls but all observed doses non-negative): use control_group="lowest_dose" (Remark 3.1) if there is a mass point at the minimum dose (>= 2 units at d_L) — the lowest-dose group becomes the comparison and the estimand is ATT(d) - ATT(d_L). Otherwise HeterogeneousAdoptionDiD IS a candidate for graded-adoption designs (HAD’s contract requires non-negative dose, satisfied here); linear DiD with the treatment as a continuous covariate is another.
When (5) fails (negative treated doses): HeterogeneousAdoptionDiD is not a fallback either — HAD raises on negative post-period dose (had.py:1450-1459, paper Section 2). Linear DiD with the treatment as a signed continuous covariate is the applicable routing alternative.
Re-encoding the treatment column (shifting, absolute value, etc.) is an agent-side preprocessing choice that changes the estimand and is not documented in REGISTRY as a supported fallback; if the agent re-encodes to non-negative support, both ContinuousDiD and HeterogeneousAdoptionDiD become candidates again on the re-encoded scale.
Do not relabel positive- or negative-dose units as first_treat == 0: that triggers ContinuousDiD.fit()’s force-zero coercion path, which is implementation behavior for inconsistent inputs (e.g., an accidentally-nonzero row on a never-treated unit), not a documented routing option.
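
The routing notes above can be condensed into a small decision sketch. Both function names are hypothetical, the returned strings are plain labels rather than library calls, and the sketch assumes the remaining gates (dose constancy, balance, no duplicates) already hold.

```python
def has_mass_point_at_min(unit_dose):
    """True when at least two units share the minimum dose d_L."""
    d_low = min(unit_dose.values())
    return sum(1 for d in unit_dose.values() if d == d_low) >= 2


def routing_candidates(has_never_treated, treated_doses_positive, unit_dose):
    """Return candidate labels given a dict of unit -> time-invariant dose."""
    if not treated_doses_positive:
        # Negative treated doses: HAD is not a fallback either.
        return ["linear DiD (signed continuous covariate)"]
    if has_never_treated:
        return ["ContinuousDiD"]
    if has_mass_point_at_min(unit_dose):
        # Estimand becomes ATT(d) - ATT(d_L).
        return ['ContinuousDiD, control_group="lowest_dose"']
    return ["HeterogeneousAdoptionDiD", "linear DiD (continuous covariate)"]
```

For instance, with no never-treated units and doses {"a": 1.0, "b": 1.0, "c": 2.0}, two units sit at the minimum dose, so the lowest-dose comparison is the candidate returned.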

The agent must still validate the supplied first_treat column independently: it must contain at least one first_treat == 0 unit (P(D=0) > 0), be non-negative integer-valued (or +inf / 0 for never-treated), and be consistent with the dose column on per-unit treated/untreated status. profile_panel does not see first_treat and cannot validate it.
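
A caller-side check of first_treat along those three lines could be sketched as follows. The helper name first_treat_problems and the dict-per-unit inputs are assumptions for illustration; the library's own validation may differ.

```python
import math


def first_treat_problems(first_treat, dose):
    """List problems given dicts of unit -> first_treat and unit -> dose."""

    def never_treated(ft):
        return ft == 0 or ft == math.inf

    problems = []
    for unit, ft in first_treat.items():
        # Treated units need a finite, non-negative, integer-valued period.
        if not never_treated(ft) and (
            ft < 0 or not math.isfinite(ft) or ft != int(ft)
        ):
            problems.append(f"{unit}: first_treat {ft!r} is not a valid period")
        # Treated / untreated status must agree with the dose column.
        if never_treated(ft) != (dose[unit] == 0):
            problems.append(f"{unit}: first_treat and dose disagree")
    # Read literally from the text above: need a first_treat == 0 unit.
    if not any(ft == 0 for ft in first_treat.values()):
        problems.append("no unit has first_treat == 0")
    return problems
```

An empty list means the three conditions hold for the supplied columns; it says nothing about the other fit-time gates.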

has_zero_dose is a row-level fact (“at least one observation has dose == 0”); it is NOT a substitute for has_never_treated, which is the unit-level field. A panel can have has_zero_dose == True (pre-treatment zero rows) while has_never_treated == False (every unit eventually treated). In that case an agent following the standard workflow would conclude, before calling ContinuousDiD.fit(), that no never-treated controls exist.
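
A tiny worked panel makes the row-level versus unit-level distinction concrete. The data and variable names are made up for illustration; both flags are computed by hand rather than by the library.

```python
# (unit, time, dose): every unit starts at zero and is treated later.
rows = [
    ("a", 0, 0.0), ("a", 1, 2.0),
    ("b", 0, 0.0), ("b", 1, 3.0),
]

# Row-level: does any observation have dose == 0?
has_zero_dose = any(dose == 0 for _, _, dose in rows)

# Unit-level: does any unit have dose == 0 in every row?
doses_by_unit = {}
for unit, _, dose in rows:
    doses_by_unit.setdefault(unit, []).append(dose)
has_never_treated = any(
    all(dose == 0 for dose in doses) for doses in doses_by_unit.values()
)
```

Here has_zero_dose is True while has_never_treated is False, which is exactly the combination described above.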

n_distinct_doses: int

has_zero_dose: bool

dose_min: float

dose_max: float

dose_mean: float

__init__(n_distinct_doses, has_zero_dose, dose_min, dose_max, dose_mean)

Parameters:

n_distinct_doses (int)
has_zero_dose (bool)
dose_min (float)
dose_max (float)
dose_mean (float)

Return type:

None

Alert

class diff_diff.Alert

Bases: object

A factual observation about a panel.

severity is "info" (descriptive) or "warn" (descriptive and likely relevant to the caller’s estimator choice). Alerts never recommend a specific estimator.
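
Consumers will typically filter alerts by severity or look them up by code. The sketch below uses a stand-in class with the four documented fields; AlertStandIn, the "example_info" code, and the severities assigned here are illustrative assumptions, not values the library is known to emit.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AlertStandIn:  # mirrors the documented fields, not the library class
    code: str
    severity: str
    message: str
    observed: Any


alerts = (
    AlertStandIn("missing_id_rows_dropped", "warn", "3 rows had NaN unit or time", 3),
    AlertStandIn("example_info", "info", "illustrative descriptive note", None),
)

# Keep only alerts flagged as likely relevant to estimator choice.
warn_codes = [alert.code for alert in alerts if alert.severity == "warn"]

# Index the observed values by code for quick lookup.
observed_by_code = {alert.code: alert.observed for alert in alerts}
```

Keying decisions on code rather than on message text keeps downstream logic stable if the wording of a message changes.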

code: str

severity: str

message: str

observed: Any

__init__(code, severity, message, observed)

Parameters:

code (str)
severity (str)
message (str)
observed (Any)

Return type:

None