# diff-diff: Autonomous-agent reference guide

This guide is reference material for AI agents using diff-diff without human-in-the-loop supervision. It catalogs the library's estimators, names the design features each supports, explains how to read the `profile_panel()` output, and points at post-fit validation utilities and report schemas. It is a reference, not a decision tree. Multiple estimators usually fit a given panel; choosing between them involves trade-offs the cited literature discusses and that this guide does not pretend to resolve.

**Pair this guide with:**

- `get_llm_guide("practitioner")` - the Baker et al. (2025) 8-step validation workflow in workflow-prose form.
- `get_llm_guide("full")` - comprehensive API documentation for every public function and class.
- `profile_panel(df, unit=..., time=..., treatment=..., outcome=...)` - the pre-fit describe utility whose output fields this guide's sections §2 and §4 reason about.

## Table of contents

- §1. What this guide is (and is not)
- §2. PanelProfile field reference
- §3. Estimator-support matrix
- §4. Estimator-choice reasoning by design feature
- §5. Worked examples
- §6. Post-fit validation utilities
- §7. How to read BusinessReport / DiagnosticReport output
- §8. Glossary + citations
- §9. Intentional omissions

## §1. What this guide is (and is not)

**What it is.** A reference you consult after running `profile_panel()` and before calling any estimator's `fit()`. The matrix in §3 and the per-design-feature discussions in §4 tell you which estimators are well-suited to the panel shape reported by the profile; the worked examples in §5 walk through several end-to-end PanelProfile -> reasoning -> validation flows; the post-fit index in §6 tells you which diagnostics apply once you have a fitted result.

**What it is not.** A deterministic recommender. No function in diff-diff returns "pick estimator X." This guide does not either.
When several estimators fit a design, it enumerates them and names the trade-offs. The agent is responsible for weighing those trade-offs (often with the cited references in §8) and justifying the choice in the final write-up.

**Why this shape.** A rules-engine recommender would lock in a policy that ages poorly as new estimators land and as the applied-econometrics literature evolves. Static reference material + descriptive profiling is less brittle: when a new estimator is added it gets a row in §3 and a paragraph in §4, without rewriting a dispatcher.

## §2. PanelProfile field reference

`profile_panel(df, unit=..., time=..., treatment=..., outcome=...)` returns a frozen `PanelProfile` dataclass. Call `.to_dict()` for a JSON-serializable view. Every field below appears as a top-level key in that dict.

### Panel structure

- **`n_units: int`** - count of distinct values in the `unit` column.
- **`n_periods: int`** - count of distinct values in the `time` column.
- **`n_obs: int`** - total rows in the panel.
- **`is_balanced: bool`** - true iff every distinct `(unit, time)` cell appears at least once in the panel (i.e. the unique `(unit, time)` support equals `n_units * n_periods`). Duplicate rows do not affect balance but are surfaced via the `duplicate_unit_time_rows` alert.
- **`observation_coverage: float`** - ratio of unique `(unit, time)` keys to `n_units * n_periods`, always in `[0, 1]` (duplicates do not inflate). A value below `0.70` also triggers the `panel_highly_unbalanced` alert.

### Treatment variation

- **`treatment_type: str`** - classification of the treatment column. Exactly one of:
  - `"binary_absorbing"`: observed non-NaN values are a subset of {0, 1} (one or two distinct values, covering all-zero and all-one panels as valid degenerate cases) and each unit's treatment sequence (ordered by `time`) is weakly monotone non-decreasing. The canonical DiD setting.
  - `"binary_non_absorbing"`: values a subset of {0, 1} with at least two distinct values observed, where at least one unit switches from 1 back to 0. Only `ChaisemartinDHaultfoeuille` handles this natively; the other estimators assume absorbing treatment and would be misapplied.
  - `"continuous"`: numeric with more than two distinct values, or a two-valued numeric column whose values are not in {0, 1} (e.g., a dose, a discrete-integer partial-adoption score). Use `ContinuousDiD` or `HeterogeneousAdoptionDiD`.
  - `"categorical"`: non-numeric dtype (object / category), or a column that is entirely NaN. Often indicates a treatment arm. Encode each arm as a binary indicator and fit separately, or use a multi-treatment workflow outside the current estimator suite.

  Bool-dtype treatment columns (`True` / `False`) are classified the same way as numeric `{0, 1}`: the library's binary estimators validate on value support rather than dtype, so `True` and `False` behave like `1` and `0` for absorbing / non-absorbing classification.
- **`is_staggered: bool`** - true iff treatment is `binary_absorbing` and at least two distinct first-treatment periods are observed. Drives the choice between classic DiD/TWFE and staggered-robust estimators.
- **`n_cohorts: int`** - for `binary_absorbing`, the number of distinct first-treatment periods (cohorts). Zero for other `treatment_type` values.
- **`cohort_sizes: Mapping[Any, int]`** - map from first-treatment period to cohort size (number of units adopting at that time). Empty for non-absorbing / continuous / categorical treatments.
- **`has_never_treated: bool`** - at least one unit has `treatment == 0` in every observed non-NaN row (applies to both binary and continuous treatment columns; for continuous this flags zero-dose control units).
  Required by `SyntheticDiD`, `SunAbraham`, `EfficientDiD` under both `assumption="PT-All"` and `assumption="PT-Post"` (unless `control_group="last_cohort"` is passed), and `ContinuousDiD` (which requires `P(D=0) > 0` - Remark 3.1 lowest-dose-as-control is not yet implemented). Preferred-but-optional by `CallawaySantAnna` and `ChaisemartinDHaultfoeuille`. Always `False` for `"categorical"`.
- **`has_always_treated: bool`** - at least one binary-treatment unit has `treatment == 1` in every observed non-NaN row (no pre-treatment information for that unit in the DiD sense). Binary-only semantics: for `"continuous"` panels this field is always `False` because pre-treatment periods are determined by the `first_treat` column supplied to `ContinuousDiD.fit()`, not by whether the dose is positive - a unit with a constant positive dose can still have well-defined pre-treatment periods. Always `False` for `"categorical"` too.
- **`treatment_varies_within_unit: bool`** - at least one unit has more than one distinct non-NaN treatment value across its observed rows. For binary panels this is normally `True` (pre vs. post the adoption period), and for continuous panels this flags time-varying dose. `ContinuousDiD.fit()` requires this to be `False` (dose must be time-invariant per unit, per Callaway et al. 2024); a `True` value on a continuous panel rules the estimator out. Always `False` for `"categorical"`.

### Timing

- **`first_treatment_period: Optional[Any]`** - earliest first-treatment period observed (for `binary_absorbing`); `None` otherwise.
- **`last_treatment_period: Optional[Any]`** - latest first-treatment period observed; `None` otherwise.
- **`min_pre_periods: Optional[int]`** - across treated units, the smallest number of observed pre-treatment periods (each treated unit's observed `(unit, time)` support is counted independently, so this reflects the least-supported treated unit on unbalanced panels).
  Low values (< 3) fire the `short_pre_panel` alert and limit power for parallel-trends tests.
- **`min_post_periods: Optional[int]`** - across treated units, the smallest number of observed post-treatment periods; same per-unit support semantics as above. Low values limit event-study dynamics.

### Outcome

- **`outcome_dtype: str`** - the pandas dtype name (e.g. `"float64"`, `"int64"`, `"bool"`).
- **`outcome_is_binary: bool`** - outcome has exactly two distinct non-NaN values, both in {0, 1}. For binary outcomes the linear parallel-trends assumption is restrictive; consider the logit/log-odds alternative in the Roth/Sant'Anna (2023) survey.
- **`outcome_has_zeros: bool`** - any non-NaN outcome equals zero. Relevant for log-transform diagnostics.
- **`outcome_has_negatives: bool`** - any non-NaN outcome is negative. Relevant for log-transform diagnostics.
- **`outcome_missing_fraction: float`** - share of rows where the outcome column is NaN, in `[0, 1]`.
- **`outcome_summary: Mapping[str, float]`** - `{min, max, mean, std}` computed with NaN-skipping; empty for non-numeric outcomes.
- **`outcome_shape: Optional[OutcomeShape]`** - distributional facts for numeric outcomes; `None` when the outcome dtype is non-numeric. Sub-fields:
  - `n_distinct_values: int` - count of distinct non-NaN outcome values.
  - `pct_zeros: float` - share of non-NaN observations equal to zero, in `[0, 1]`.
  - `value_min: float`, `value_max: float` - range of observed values.
  - `skewness: Optional[float]` - sample skewness via the canonical `m3 / std^3` form. `None` when `n_distinct_values < 3` or variance is zero.
  - `excess_kurtosis: Optional[float]` - `m4 / m2^2 - 3`, gated the same way as `skewness`.
  - `is_integer_valued: bool` - all non-NaN values are whole numbers (covers integer dtype and floats that happen to be integer-valued).
  - `is_count_like: bool` - heuristic for count-shaped outcomes: `is_integer_valued AND pct_zeros > 0 AND skewness > 0.5 AND n_distinct_values > 2 AND value_min >= 0`. When `True`, OLS DiD imposes an additive functional form on a non-negative count outcome (cluster-robust SEs are still calibrated, but the model can be inefficient and may produce counterfactual predictions outside the non-negative support); `WooldridgeDiD(method="poisson")` (QMLE) is the multiplicative (log-link) ETWFE alternative that respects the non-negative support and matches the typical generative process for count data, with QMLE sandwich SEs robust to distributional misspecification. The Poisson fitter rejects negative outcomes at fit time, which is why the heuristic gates on `value_min >= 0`. See §5.3 for a worked example.
  - `is_bounded_unit: bool` - all non-NaN values lie in `[0, 1]`. When `True` and the linear-DiD point estimate is near the boundary of feasible support, interpret with care (the linear model can predict outside `[0, 1]`).
- **`treatment_dose: Optional[TreatmentDoseShape]`** - distributional facts for continuous-treatment dose columns; `None` unless `treatment_type == "continuous"`. Most sub-fields are descriptive distributional context. **`profile_panel` does not see the separate `first_treat` column** that `ContinuousDiD.fit()` consumes: the estimator's actual fit-time gates key off `first_treat` (defines never-treated controls as `first_treat == 0`, force-zeroes nonzero `dose` on those rows with a `UserWarning`, drops units where `first_treat > 0` AND `dose == 0`, and rejects negative dose only among treated units where `first_treat > 0`; see `continuous_did.py:276-327` and `:348-360`).
  In the canonical `ContinuousDiD` setup (Callaway, Goodman-Bacon, Sant'Anna 2024), the dose `D_i` is **time-invariant per unit** (`D_i = 0` for never-treated, `D_i > 0` constant across all periods for treated unit `i`) and `first_treat` is a **separate column** the caller supplies; it is NOT derived from the dose column (the dose column has no within-unit time variation in this setup, so it cannot encode timing). Under the canonical setup, several facts on the dose column predict `ContinuousDiD.fit()` outcomes: `has_never_treated == True` (proxy for `P(D=0) > 0` under both `control_group="never_treated"` and `control_group="not_yet_treated"`; Remark 3.1 lowest-dose-as-control is not yet implemented); `treatment_varies_within_unit == False` (the actual fit-time gate, matching `ContinuousDiD.fit()`'s `df.groupby(unit)[dose].nunique() > 1` rejection at lines 222-228; holds regardless of `first_treat`); `is_balanced == True` (the actual fit-time gate at lines 329-338); absence of the `duplicate_unit_time_rows` alert (the precompute path silently resolves duplicate cells via last-row-wins); and `treatment_dose.dose_min > 0` (predicts the strictly-positive-treated-dose requirement at lines 287-294; treated units carry their constant dose across all periods so `dose_min` over non-zero values is the smallest treated dose).

  When `has_never_treated == False` (no zero-dose controls but all observed doses non-negative), `ContinuousDiD` as currently implemented does not apply (Remark 3.1 lowest-dose-as-control is not implemented). Routing alternatives that do not require `P(D=0) > 0`: `HeterogeneousAdoptionDiD` for graded-adoption designs (HAD's own contract requires non-negative dose, which this branch satisfies), or linear DiD with the treatment as a continuous covariate.
  When `dose_min <= 0` (negative treated doses), the situation is different: `ContinuousDiD` does not apply, and `HeterogeneousAdoptionDiD` is **not** a fallback either, because HAD raises on negative post-period dose (`had.py:1450-1459`). The applicable routing alternative on the negative-dose branch is linear DiD with the treatment as a signed continuous covariate. Re-encoding the treatment column to a non-negative scale (shifting, absolute value, etc.) is an agent-side preprocessing choice that changes the estimand and is not documented in REGISTRY as a supported fallback; if the agent does re-encode, both `ContinuousDiD` and `HeterogeneousAdoptionDiD` become candidates again on the re-encoded scale.

  Do not relabel positive- or negative-dose units as `first_treat == 0`: that triggers the force-zero coercion path, which is implementation behavior for inconsistent inputs (e.g., an accidentally-nonzero row on a never-treated unit), not a documented routing option.

  The agent must still validate the supplied `first_treat` column independently: it must contain at least one `first_treat == 0` unit (`P(D=0) > 0`), be non-negative integer-valued (or `+inf` / 0 for never-treated), and be consistent with the dose column on per-unit treated/untreated status. `profile_panel` does not see `first_treat` and cannot validate it. See §5.2 for the worked example.

  Sub-fields:
  - `n_distinct_doses: int` - count of distinct non-NaN dose values (including zero if observed). Useful supplement to the gate checks for understanding the dose support.
  - `has_zero_dose: bool` - at least one unit-period has dose exactly zero. **Row-level fact**: a panel can have `has_zero_dose == True` (some pre-treatment rows are zero) while `has_never_treated == False` (every unit eventually treated), in which case the panel still fails the ContinuousDiD never-treated gate. Consult `has_never_treated` for the unit-level gate.
  - `dose_min: float`, `dose_max: float`, `dose_mean: float` - computed over the strictly non-zero doses; useful for effect-size context and dose-response interpretation. As noted above, under the standard workflow `dose_min > 0` is the profile-side proxy for ContinuousDiD's strictly-positive-treated-dose requirement. A continuous panel with negative non-zero doses (e.g. `dose_min == -1.5`) labeled as `first_treat > 0` would be rejected at fit time (`continuous_did.py:287-294`); the same negative-dose units labeled as `first_treat == 0` would be coerced to dose=0 with a `UserWarning` instead. See §5.2 for the standard-workflow walkthrough.

### Alerts

`alerts: tuple[Alert, ...]` is a tuple of factual observations. Each `Alert` has `code`, `severity` (`"info"` or `"warn"`), `message`, and `observed` (the numerical or boolean value that tripped the alert). The v1 alert catalogue is listed below. Alerts never name a specific estimator. Severity `"warn"` means the observation is likely relevant to estimator choice or to the interpretation of diagnostics; `"info"` means it is descriptive context.
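As a rough illustration of how an agent might triage these observations, the sketch below groups alert codes by severity. It runs on a hand-written list of dicts standing in for the `alerts` entry of `PanelProfile.to_dict()`. That serialized layout (one dict per alert with the four keys named above), the sample `message` and `observed` values, and the helper name `split_alerts` are assumptions made for the example, not confirmed library behavior.

```python
from collections import defaultdict

# Hypothetical stand-in for profile.to_dict()["alerts"]; the real serialized
# layout may differ. Codes and severities are taken from the v1 catalogue.
alerts = [
    {"code": "short_pre_panel", "severity": "warn",
     "message": "illustrative message", "observed": 2},
    {"code": "only_one_cohort", "severity": "info",
     "message": "illustrative message", "observed": 1},
    {"code": "panel_highly_unbalanced", "severity": "warn",
     "message": "illustrative message", "observed": 0.64},
]


def split_alerts(alerts):
    """Group alert codes by severity, preserving their original order."""
    by_severity = defaultdict(list)
    for alert in alerts:
        by_severity[alert["severity"]].append(alert["code"])
    return dict(by_severity)


grouped = split_alerts(alerts)
print(grouped["warn"])  # observations to weigh when choosing an estimator
print(grouped["info"])  # descriptive context
```

Keeping the codes rather than the messages makes the result easy to compare against the catalogue.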

| Alert code | Severity | Fires when |
|---|---|---|
| `missing_id_rows_dropped` | warn | rows with NaN `unit` or `time` were dropped before computing structural facts |
| `duplicate_unit_time_rows` | warn | panel contains more than one row per (unit, time) |
| `min_cohort_size_below_10` | warn | smallest cohort has fewer than 10 units |
| `only_one_cohort` | info | all treated units adopt simultaneously |
| `short_pre_panel` | warn | `min_pre_periods < 3` |
| `short_post_panel` | info | `min_post_periods < 3` |
| `no_never_treated` | info | every unit is eventually treated |
| `has_always_treated_units` | info | some units are treated in every observed period |
| `all_units_treated_simultaneously` | info | single cohort and no never-treated group |
| `panel_highly_unbalanced` | warn | `observation_coverage < 0.70` |
| `only_two_periods` | info | `n_periods == 2` |
| `outcome_looks_binary_but_dtype_float` | info | outcome takes {0, 1} values but is stored as float |

## §3. Estimator-support matrix

Rows are estimator classes exported from `diff_diff`. Columns are design features derivable from `PanelProfile`. Cells: `✓` supported; `✗` not supported / out of scope; `warn` supported but with documented caveats; `partial` supported subject to restrictions discussed in §4.
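One way an agent might hold the matrix in memory is a mapping from estimator name to per-feature cell values. The sketch below transcribes four rows and three columns from the table that follows; the dict layout and the helper name `candidates_for` are illustrative and not part of diff-diff. It lists every estimator whose cell for a feature is not `✗` and keeps the cell value, so `warn` and `partial` caveats stay visible. It enumerates candidates only and does not pick one.

```python
# Four rows transcribed from the support matrix, restricted to three of its
# columns. The dict layout is illustrative, not a diff-diff data structure.
MATRIX = {
    "DifferenceInDifferences": {
        "binary absorbing": "✓", "staggered": "✗", "continuous": "✗"},
    "TwoWayFixedEffects": {
        "binary absorbing": "✓", "staggered": "warn", "continuous": "✗"},
    "CallawaySantAnna": {
        "binary absorbing": "✓", "staggered": "✓", "continuous": "✗"},
    "ContinuousDiD": {
        "binary absorbing": "✗", "staggered": "✓", "continuous": "✓"},
}


def candidates_for(feature, matrix=MATRIX):
    """Return {estimator: cell} for every row whose cell is not '✗'."""
    return {
        name: cells[feature]
        for name, cells in matrix.items()
        if cells[feature] != "✗"
    }


print(candidates_for("staggered"))
print(candidates_for("continuous"))
```

The "never-treated required" column is left out on purpose: there `✓` marks a requirement rather than support, so it needs a different filter.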

| Estimator | binary absorbing | staggered | continuous | triple-diff | never-treated required | covariate adjustment | few-treated (synthetic) | heterogeneous adoption | clustered SE |
|---|---|---|---|---|---|---|---|---|---|
| `DifferenceInDifferences` | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| `MultiPeriodDiD` | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| `TwoWayFixedEffects` | ✓ | warn | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| `CallawaySantAnna` | ✓ | ✓ | ✗ | ✗ | partial | ✓ | ✗ | ✗ | ✓ |
| `SunAbraham` | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ |
| `ChaisemartinDHaultfoeuille` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| `ImputationDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| `TwoStageDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| `StackedDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| `WooldridgeDiD` (ETWFE) | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| `EfficientDiD` | ✓ | ✓ | ✗ | ✗ | partial | ✓ | ✗ | ✗ | ✓ |
| `SyntheticDiD` | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | partial |
| `TROP` | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | partial |
| `TripleDifference` | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ |
| `StaggeredTripleDifference` | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ |
| `ContinuousDiD` | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ |
| `HeterogeneousAdoptionDiD` | ✗ | partial | partial | ✗ | ✗ | ✗ | ✗ | ✓ | warn |

**Footnotes.**

- `TwoWayFixedEffects` + staggered: fits but mixes positive and negative cohort-weights that violate the ATT interpretation; consult `BaconDecomposition` to quantify. Prefer any staggered-robust estimator (CS, SA, dCDH, Imputation, TwoStage, ETWFE) for a staggered design.
- `CallawaySantAnna` + never-treated: the "never-treated" control group is one option; "not-yet-treated" is the other. Pick via the `control_group` argument. If `has_never_treated == False`, use `control_group="not_yet_treated"`.
- `EfficientDiD` + never-treated: both `assumption="PT-All"` and `assumption="PT-Post"` require actual never-treated units - PT-Post is the weaker parallel-trends assumption but still uses never-treated as the comparison group (REGISTRY.md `EfficientDiD` "Parallel Trends -- two variants"). To admit an all-eventually-treated panel, pass `control_group="last_cohort"` to reclassify the latest treatment cohort as a pseudo-never-treated control and trim post-treatment periods at/after its adoption. The `EfficientDiD.hausman_pretest` classmethod picks between `PT-All` and `PT-Post` on panels that do have never-treated units.
- `SyntheticDiD` + staggered: not supported. `fit()` raises `ValueError` on within-unit treatment variation; SDiD requires block treatment (all treated units adopt at the same time). For staggered designs use a cohort-level fit loop externally or pick a staggered-robust estimator above.
- `TROP` staggered support: treatment is an absorbing-state indicator, so staggered adoption is handled via the D matrix. TROP `fit()` has no covariate surface; its local method uses every unit untreated at period `t` as the donor pool (not a never-treated-only set).
- `HeterogeneousAdoptionDiD` covariate adjustment: identification with covariates (paper Appendix B.1, Equation 19) is deferred to future work; `fit(covariates=...)` is not yet implemented.
- `HeterogeneousAdoptionDiD` clustered SE: `cluster=` is honored on the mass-point / CR1 path; on the continuous nonparametric paths the kwarg emits a `UserWarning` and is ignored (Phase 2a scope). Use `bias_corrected_local_linear` directly for cluster-robust inference on the nonparametric path.
- `HeterogeneousAdoptionDiD` continuous: supports partial-adoption intensity as a continuous first-stage variable; not a pure dose-response estimator - use `ContinuousDiD` for that.
- `HeterogeneousAdoptionDiD` staggered support is `partial`, not general.
  Paper Appendix B.2 restricts staggered use to the **last treatment cohort plus never-treated units**. With `aggregate="event_study"` and a `first_treat_col` kwarg, `fit()` auto-filters to `F_last = max(cohorts)` and emits a `UserWarning` naming kept/dropped counts; earlier-cohort units are dropped. Without `first_treat_col`, a multi-cohort panel raises `ValueError`. For full staggered support that retains every cohort, use `ChaisemartinDHaultfoeuille` instead.

**Balanced-panel eligibility.** The following estimators require exactly one observation per `(unit, time)` cell with every unit observed in every period: `ContinuousDiD`, `EfficientDiD`, `SyntheticDiD`, `HeterogeneousAdoptionDiD`, `StaggeredTripleDifference`. Gate these on BOTH `PanelProfile.is_balanced == True` AND the absence of the `duplicate_unit_time_rows` alert (`is_balanced` is computed from the unique-key support and stays `True` when duplicates exist; the alert is the separate signal for duplicates). Treat both conditions as hard gates: `EfficientDiD` and `HeterogeneousAdoptionDiD` raise `ValueError` at `fit()` on duplicate cells, and `ContinuousDiD`'s precompute path resolves duplicates with last-row-wins (silent overwrite that can change the estimand). If either condition fails, pre-process with `diff_diff.prep.balance_panel()` and a `drop_duplicates([unit, time])` pass, or pick a balance-tolerant estimator from the remaining rows (CS/SA/dCDH/Imputation/TwoStage/Stacked/ETWFE all accept unbalanced input, with some caveats in their own docs).

For two common reasoning patterns walked through end-to-end (continuous dose checked against the existing `has_never_treated` / `treatment_varies_within_unit` / `is_balanced` gates with `treatment_dose` providing descriptive context, and count-shaped outcome with `outcome_shape` introspection), see §5.2 and §5.3.

## §4. Estimator-choice reasoning by design feature

Each subsection names a design feature and lists estimators applicable to it with the most important trade-offs. Multiple paths are always explicit; no subsection says "pick estimator X."

### §4.1 Classic 2×2 DiD (binary absorbing, two periods, no staggering)

When `treatment_type == "binary_absorbing"`, `n_periods == 2`, and `is_staggered == False`, the classic Card-and-Krueger 2×2 design applies. Most estimators in the library produce the same point estimate in this case; the choice between them is mostly about output shape:

- `DifferenceInDifferences` for a minimal results object.
- `TwoWayFixedEffects` if you want the equivalent two-way-FE regression output (coefficient table, VCV, etc.). Identical to DiD in the 2×2 case.
- `TripleDifference` if a second comparison dimension is available (DDD) - see §4.6.

### §4.2 Multi-period single-cohort (event-study without staggering)

When `is_staggered == False` and `n_periods > 2`, event-study dynamics can be estimated but cohort-mixing bias is moot:

- `MultiPeriodDiD` - per-period effect, standard event-study plot.
- `TwoWayFixedEffects` with event-time dummies - similar output, no forbidden comparisons because there is only one cohort.

### §4.3 Staggered adoption (multi-cohort binary absorbing)

When `is_staggered == True`, classic TWFE mixes positive- and negative-weighted cohort comparisons (Goodman-Bacon 2021, de Chaisemartin & d'Haultfoeuille 2020). Use one of the staggered-robust estimators:

- `CallawaySantAnna` - group-time ATTs aggregated to ES / overall / cohort dimensions. Flexible control-group choice (never-treated vs. not-yet-treated). Covariate adjustment via doubly-robust (DR), IPW, or regression-adjustment (RA).
- `SunAbraham` - interaction-weighted estimator; closely tied to two-way-FE output, computationally cheap, produces event-time coefficients. Requires a never-treated cohort (`fit` raises a `ValueError` when none exists).
- `ChaisemartinDHaultfoeuille` - DID_M / DID_l estimators robust to non-absorbing / reversible treatment (see §4.5). Interference / between-unit spillovers are not supported natively - SUTVA is assumed like every other DiD estimator in the suite.
- `ImputationDiD` (Borusyak, Jaravel, Spiess) - imputation-based, efficient under homoskedasticity, produces an imputation-based residual at the observation level.
- `TwoStageDiD` (Gardner) - two-stage residualize-then-regress.
- `StackedDiD` - stacked event-study regressions, one subpanel per cohort. Conservative interpretation.
- `WooldridgeDiD` (ETWFE) - extended-TWFE with cohort-by-time-by-covariates interactions; heterogeneous covariate-by-cohort effects.
- `EfficientDiD` (Chen, Sant'Anna, Xie 2025) - asymptotically efficient under either `PT-All` or `PT-Post`; use `EfficientDiD.hausman_pretest` to pick. Requires a balanced panel (`PanelProfile.is_balanced == True`); `fit()` raises `ValueError` on unbalanced input.

Diagnostic: `bacon_decompose(df, ...)` shows the weight allocation of a TWFE fit to 2×2 comparison types. Forbidden-comparison weight > 10% is a strong signal that the TWFE estimate is biased.

### §4.4 No never-treated group

When `has_never_treated == False`:

- `SyntheticDiD` requires a never-treated donor pool - not applicable.
- `TROP` does not require a strict never-treated partition: its donor pool is every unit untreated at the current period `t` (via the absorbing D matrix). When every unit is eventually treated TROP can still fit, with the donor pool shrinking over time - check the pre-treatment coverage of the factor-model fit in the results diagnostics.
- `EfficientDiD` requires never-treated comparisons under both `assumption="PT-All"` and `assumption="PT-Post"`. To admit an all-treated panel, pass `control_group="last_cohort"` to use the latest treatment cohort as a pseudo-never-treated control (post-treatment periods at/after that cohort's adoption are trimmed).
  Distinct from CallawaySantAnna's `not_yet_treated` option.
- `ContinuousDiD` requires zero-dose control units (`P(D=0) > 0`). Remark 3.1 of the paper (lowest-dose-as-control) is not yet implemented; `fit()` raises `ValueError` when no `D=0` units exist.
- `CallawaySantAnna` - use `control_group="not_yet_treated"` to use not-yet-treated units as the control pool.
- `ChaisemartinDHaultfoeuille` - constructs switchers vs. non-switchers directly; no never-treated requirement.
- TWFE / `MultiPeriodDiD` / `ImputationDiD` / `TwoStageDiD` / `StackedDiD` / `WooldridgeDiD` - use the last-treated or untreated-until-late units as implicit controls; estimators do not error, but consider whether the implicit control structure is what you want.

### §4.5 Non-absorbing binary treatment (treatment switches back to 0)

When `treatment_type == "binary_non_absorbing"`:

- `ChaisemartinDHaultfoeuille` is the only estimator in the library that treats this natively. Switcher / non-switcher comparisons are its primitive object.
- Other estimators assume absorbing treatment and will produce estimates whose interpretation is unclear. Do not use them without a well-argued reason.

### §4.6 Triple-difference design (DDD)

When a second cross-cutting comparison axis exists (e.g., policy hits some states and some demographic subgroups within states):

- `TripleDifference` - classic two-period DDD.
- `StaggeredTripleDifference` - staggered DDD, robust to cohort-mixing.

Triple-difference is not automatically detected by `profile_panel`; it requires the caller to identify the third comparison axis. If a `group` covariate in the panel drives differential exposure, DDD is worth considering.

### §4.7 Continuous / dose-response treatment

When `treatment_type == "continuous"`:

- `ContinuousDiD` (Callaway, Goodman-Bacon, Sant'Anna 2024) - continuous / dose-response treatment.
  The estimator's canonical setup expects a **time-invariant unit dose** `D_i` (constant across all periods for each unit, `0` for never-treated, `> 0` for treated) and a **separate `first_treat` column** carrying timing information; the dose column does not encode timing. Under that canonical setup, five facts on the dose column predict `fit()` outcomes (full discussion in the paragraph immediately below): (a) zero-dose control units must exist (`PanelProfile.has_never_treated == True`, proxying `ContinuousDiD`'s `P(D=0) > 0` requirement under both `control_group` options because Remark 3.1 lowest-dose-as-control is not yet implemented); (b) dose must be time-invariant per unit (rule out panels where `PanelProfile.treatment_varies_within_unit == True`); (c) the panel must be balanced (`PanelProfile.is_balanced == True`); (d) no `duplicate_unit_time_rows` alert (the precompute path silently resolves duplicate cells via last-row-wins); and (e) strictly positive treated doses (`treatment_dose.dose_min > 0`). `fit()` raises `ValueError` on (b) and (c) regardless of how `first_treat` is constructed; duplicate rows in (d) are silently overwritten with last-row-wins (a hard preflight veto, not a fit-time raise; the agent must deduplicate before fitting); (a) and (e) hold under the canonical setup. When (a) or (e) fails, see §2 for the full routing-alternatives discussion (the two branches differ: HAD applies on the no-never-treated branch but not on the negative-dose branch, since HAD requires non-negative dose support per `had.py:1450-1459`). Note that staggered adoption IS supported natively (adoption timing is expressed via the `first_treat` column, not via within-unit dose variation), and `ContinuousDiD.fit()` applies additional validation on the `first_treat` column itself; see the paragraph below and §2 for the full list.
  The estimator exposes several dose-indexed targets that require different assumptions: `ATT(d|d)` (effect of dose `d` on units that received `d`) and `ATT^{loc}` (binarized overall ATT) are identified under Parallel Trends; `ATT(d)` (full dose-response curve), `ACRT(d)` (marginal effect, i.e. the average causal response), and `ACRT^{glob}` require the stronger Strong Parallel Trends assumption. The BusinessReport (BR) headline scalar is the overall ATT; ACR and dose-response tables are available in the result object. Supports B-spline basis construction.
- `HeterogeneousAdoptionDiD` - partial-adoption intensity, with a scalar first-stage adoption summary. Useful when adoption is graded rather than binary.

See the `TreatmentDoseShape` field reference in §2 for the full preflight-vs-gate breakdown and the routing-alternative discussion when (a) or (e) fails. The remaining `treatment_dose` sub-fields are descriptive context only; §5.2 walks through the screen -> fit -> validation flow.

### §4.8 Few treated units (one or a handful)

When few treated units exist (not a separate `PanelProfile` field yet, but derivable from `cohort_sizes` + `has_never_treated`):

- `SyntheticDiD` - synthetic-control-meets-DiD. Requires never-treated donors and sufficient pre-treatment periods (Arkhangelsky et al. 2021). Block treatment only: all treated units must adopt at the same time. Requires a balanced panel (`PanelProfile.is_balanced == True`); on unbalanced input `fit()` raises `ValueError` and points at `balance_panel()`.
- `TROP` - factor-model-based generalized synthetic control. Uses every unit untreated at period `t` as the donor pool (via the absorbing-state D matrix); supports staggered adoption and more complex factor structures. No covariate-adjustment surface on `fit()`.

Classical DiD estimators will still produce estimates, but inference is unreliable with very small treated groups; cluster-robust SE relies on the number of clusters, not the number of treated units.
Bootstrap methods in the library are preferred. ### §4.9 Heterogeneous adoption intensity When adoption varies in strength across units (partial-adoption settings, intensity of exposure differs): - `HeterogeneousAdoptionDiD` - requires a balanced panel (`PanelProfile.is_balanced == True`; `fit()` raises `ValueError` when any unit is missing a period). Targets a Weighted Average Slope (WAS) on single-period Heterogeneous Adoption Designs where no genuinely untreated group exists (paper Equation 2 / Theorem 1). The `target_parameter` attribute on the results object is literally `"WAS"` for Design 1' and `"WAS_d_lower"` for Design 1 with lower-dose comparison under Assumption 6. `fit(aggregate="overall")` (Phase 2a) returns a single scalar WAS; `fit(aggregate="event_study")` (Phase 2b) returns per-event-time WAS estimates. `did_had_pretest_workflow()` runs the paper's three-step TWFE-suitability battery: (1) QUG null via `qug_test`, (2) Assumption 7 pre-trends via `stute_test` / `stute_joint_pretest` (event-study path only; the two-period overall path flags this step as deferred), and (3) linearity of `E[ΔY | D_2]` via `stute_test` / `yatchew_hr_test`. Assumption 3 (uniform continuity / no extensive-margin jump) is not testable; the pre-test battery does not and cannot validate it. Not ATT-shaped; do not relabel the headline as ATT in report text. **Staggered-timing scope is last-cohort-only (Appendix B.2).** HAD's staggered support is the `partial` cell in §3: on a multi-cohort panel passed to `aggregate="event_study"`, `fit()` auto-filters to the last treatment cohort (`F_last = max(cohorts)`) plus never-treated units and emits a `UserWarning` naming kept/dropped counts; earlier treated cohorts are dropped. The `first_treat_col` kwarg is **required** for the auto-filter to activate; without it a multi-cohort panel raises `ValueError` pointing the caller at `ChaisemartinDHaultfoeuille` for full staggered support. 
The resulting estimand is a **last-cohort-only WAS**, not a multi-cohort average — report it as such. ### §4.10 Repeated cross-sections (no panel structure) `profile_panel` assumes long-format panel data. When the same units are not observed across time (true repeated cross-sections), only the estimators whose documented contract explicitly admits RCS are applicable. Do not route RCS data to any other estimator in the suite - most of them are panel-only by construction and will either raise at fit time or estimate under a misspecified identifying assumption. Explicit RCS support in this library: - `CallawaySantAnna(panel=False)` - repeated-cross-section mode per REGISTRY.md §CallawaySantAnna; use this variant on RCS data. - `TripleDifference` - DDD cross-sectional use cases are documented in `docs/choosing_estimator.rst`; the two-period DDD estimator does not require within-unit tracking when the third comparison axis carries the identification. The staggered DDD variant is panel-only and listed separately below. Explicitly rejected for RCS (panel-only): - `EfficientDiD` - REGISTRY notes "does not handle ... repeated cross-sections." - `HeterogeneousAdoptionDiD` - panel-only (requires a balanced panel with per-unit adoption timing). - `SyntheticDiD` - requires balanced panel with per-unit donor matching. - `ContinuousDiD` - requires balanced panel with per-unit constant dose. - `StaggeredTripleDifference` - panel-only; `fit()` has no `panel=False` mode and rejects duplicate / unbalanced `(unit, time)` structure. For cross-sectional DDD data use `TripleDifference` instead. Treat other estimators in this guide as panel-only unless their own docs explicitly say otherwise. When routing, also: - Cluster SE on the unit proxy (state, region) rather than the individual cross-section respondent. - Confirm the treatment assignment is at the cluster level, not at the individual-respondent level, before interpreting the estimate as a group-time ATT. 
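A quick structural check can help decide whether the repeated-cross-section routing above applies. The sketch below is plain Python, not a diff-diff API. The helper name `looks_like_repeated_cross_section` and the column names `unit`, `time`, and `state` are hypothetical. The rule it applies (no unit identifier observed in more than one period) is a simplifying assumption: a survey that re-interviews even one respondent returns `False`, so treat the result as a prompt for inspection rather than a verdict.

```python
from collections import defaultdict


def looks_like_repeated_cross_section(rows, unit_key="unit", time_key="time"):
    """Return True when no unit identifier appears in more than one period.

    Hypothetical helper for illustration; not part of diff-diff.
    An empty input also returns True, so check for data separately.
    """
    periods_by_unit = defaultdict(set)
    for row in rows:
        periods_by_unit[row[unit_key]].add(row[time_key])
    return all(len(periods) == 1 for periods in periods_by_unit.values())


# Hypothetical survey data: fresh respondents are sampled every year.
rcs_rows = [
    {"unit": "r1", "time": 2019, "state": "A"},
    {"unit": "r2", "time": 2019, "state": "B"},
    {"unit": "r3", "time": 2020, "state": "A"},
    {"unit": "r4", "time": 2020, "state": "B"},
]

# Hypothetical panel: the same two stores are observed in both years.
panel_rows = [
    {"unit": "s1", "time": 2019, "state": "A"},
    {"unit": "s1", "time": 2020, "state": "A"},
    {"unit": "s2", "time": 2019, "state": "B"},
    {"unit": "s2", "time": 2020, "state": "B"},
]

rcs_flag = looks_like_repeated_cross_section(rcs_rows)
panel_flag = looks_like_repeated_cross_section(panel_rows)
print(rcs_flag, panel_flag)
```

When the check flags repeated cross-sections, the `state` column in this toy data plays the role of the unit proxy on which the guidance above says to cluster.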
### §4.11 Outcome-shape considerations The matrix in §3 routes by treatment shape. Outcome shape is a separate axis: a panel that is binary-absorbing and staggered may still have a count-shaped outcome (e.g., number of incidents per unit-period), and on such an outcome linear DiD imposes an additive functional form that can be inefficient and may produce counterfactual predictions outside the non-negative support — even though the cluster-robust SEs remain calibrated. The functional-form choice is what matters here, not SE calibration. `PanelProfile.outcome_shape` exposes the relevant facts: - `outcome_shape.is_count_like == True` (integer-valued, non-negative, has zeros, right-skewed, more than two distinct values) - linear-OLS DiD imposes an additive functional form on a non-negative count outcome: estimates are unbiased for the linear ATT (and the estimator already uses cluster-robust SEs that do not assume Gaussian errors), but the linear model can be inefficient on count data and can produce counterfactual predictions outside the non-negative support. `WooldridgeDiD(method="poisson")` is the multiplicative (log-link) ETWFE alternative — it respects the non-negative support, matches the typical generative process for count data, and uses QMLE sandwich SEs that are robust to distributional misspecification (Wooldridge 2023). It estimates the overall ATT as an ASF-based outcome-scale difference (per-cell average of `E[exp(η_1)] - E[exp(η_0)]`; see REGISTRY.md §WooldridgeDiD nonlinear / ASF path). The headline `overall_att` is a difference on the outcome scale, NOT a multiplicative ratio; a proportional interpretation can be derived post-hoc as `overall_att / E[Y_0]` if desired but is not the estimator's reported scalar. The choice between linear-OLS DiD and Poisson ETWFE is about which functional form (additive vs. multiplicative) best summarizes the treatment effect on the count outcome, not about whether SEs are calibrated. 
The shape field flags the consideration; §5.3 walks through this pattern with a concrete profile. - `outcome_shape.is_bounded_unit == True` (values in `[0, 1]`, e.g. a proportion) - linear DiD can produce predictions outside `[0, 1]` and inference at the boundary is questionable. No estimator in the suite handles this differently from a numeric outcome; flag the consideration in the write-up. - `outcome_shape.is_integer_valued == True` without `is_count_like` (e.g., 0/1 binary, ordinal Likert) - the binary case has its own caveats (logit/log-odds alternative per Roth and Sant'Anna 2023). Ordinal outcomes generally need a domain-specific design that the current suite does not provide. ## §5. Worked examples Each subsection shows a realistic `profile_panel` output, traces the agent reasoning that maps it to an estimator (or rules estimators out), and points at the validation step. Examples are illustrative: they do not exhaust the design space and they do not collapse a multi-path choice to a single mandated answer. ### §5.1 Binary staggered panel with never-treated controls A long panel of 200 stores observed across 20 quarters, with treatment applied to subsets of stores in three different quarters and a fourth group never treated. 
`profile_panel(...)` returns:

```
PanelProfile(
    n_units=200,
    n_periods=20,
    n_obs=4000,
    is_balanced=True,
    observation_coverage=1.0,
    treatment_type="binary_absorbing",
    is_staggered=True,
    n_cohorts=3,
    cohort_sizes={5: 40, 9: 35, 13: 45},
    has_never_treated=True,
    has_always_treated=False,
    treatment_varies_within_unit=True,
    first_treatment_period=5,
    last_treatment_period=13,
    min_pre_periods=4,
    min_post_periods=7,
    outcome_dtype="float64",
    outcome_is_binary=False,
    outcome_has_zeros=False,
    outcome_has_negatives=False,
    outcome_missing_fraction=0.0,
    outcome_summary={"min": 12.4, "max": 88.1, "mean": 47.3, "std": 14.2},
    outcome_shape=OutcomeShape(
        n_distinct_values=2841,
        pct_zeros=0.0,
        value_min=12.4,
        value_max=88.1,
        skewness=0.18,
        excess_kurtosis=-0.41,
        is_integer_valued=False,
        is_count_like=False,
        is_bounded_unit=False,
    ),
    treatment_dose=None,
    alerts=(),
)
```

Reasoning chain:

1. `treatment_type == "binary_absorbing"` and `is_staggered == True` -> §3 row narrows to the staggered-robust set: `CallawaySantAnna`, `SunAbraham`, `ChaisemartinDHaultfoeuille`, `ImputationDiD`, `TwoStageDiD`, `StackedDiD`, `WooldridgeDiD` (ETWFE), `EfficientDiD`. `TwoWayFixedEffects` is `warn` per the §3 footnote on cohort weights. `SyntheticDiD` is `✗` on staggered (block treatment only). `ContinuousDiD` and `HeterogeneousAdoptionDiD` are out (binary).
2. `has_never_treated == True` AND `n_cohorts == 3` (multi-cohort) -> `CallawaySantAnna(control_group="never_treated")` and `SunAbraham` are both well-suited; the never-treated controls preserve power. `EfficientDiD` (Hausman-pretested between PT-All and PT-Post) is another applicable path with the same control set.
3. `min_pre_periods == 4` -> parallel-trends and event-study pretests have meaningful power; no `short_pre_panel` alert fires.
4.
Pick `CallawaySantAnna(control_group="never_treated")` for the group-time ATT decomposition; fit; then validate via `compute_pretrends_power(results)` and `compute_honest_did(results)` before reporting through `BusinessReport(results, data=df)`.

### §5.2 Continuous-dose panel with zero-dose controls

A panel of 100 firms observed across 6 years, with 20 untreated firms (dose 0 in every period), 30 firms at dose 1.0 (in every period), 30 at dose 2.5 (in every period), and 20 at dose 4.0 (in every period). The dose column is time-invariant per unit; adoption timing is carried separately via the `first_treat` column passed to `ContinuousDiD.fit()` (e.g. `first_treat=3` for treated firms, `first_treat=0` for the never-treated). `profile_panel(...)` returns the relevant facts:

```
PanelProfile(
    treatment_type="continuous",
    treatment_varies_within_unit=False,
    has_never_treated=True,
    is_balanced=True,
    treatment_dose=TreatmentDoseShape(
        n_distinct_doses=4,
        has_zero_dose=True,
        dose_min=1.0,
        dose_max=4.0,
        dose_mean=2.4,
    ),
    outcome_shape=OutcomeShape(
        is_count_like=False,
        is_bounded_unit=False,
        ...
    ),
    alerts=(),
)
```

Reasoning chain:

1. `treatment_type == "continuous"` -> §3 row narrows to `ContinuousDiD` (`✓`) and `HeterogeneousAdoptionDiD` (`partial`, for graded adoption). All other estimators are `✗` on continuous.
2. The example matches the canonical `ContinuousDiD` setup (per-unit time-invariant `D_i`; `first_treat` will be a separate column the caller supplies, NOT derived from the dose column).
On the dose column alone, `profile_panel` exposes five facts that predict `fit()` outcomes under that canonical setup: `has_never_treated == True` (proxy for `P(D=0) > 0` under both `control_group` options, since Remark 3.1 lowest-dose-as-control is not yet implemented), `treatment_varies_within_unit == False` (the actual fit-time gate matching `ContinuousDiD.fit()`'s `df.groupby(unit)[dose].nunique() > 1` rejection at lines 222-228; not first_treat-dependent), `is_balanced == True` (actual fit-time gate at lines 329-338), absence of a `duplicate_unit_time_rows` alert (the precompute path silently resolves duplicate cells via last-row-wins; the agent must deduplicate before fit), and `treatment_dose.dose_min > 0` (predicts the strictly-positive-treated-dose requirement at lines 287-294 because treated units carry their constant dose across all periods so `dose_min` over non-zero values is the smallest treated dose). All five pass (`dose_min == 1.0 > 0`), so `ContinuousDiD` is a candidate. The remaining `treatment_dose` sub-fields (`n_distinct_doses`, `has_zero_dose`, `dose_max`, `dose_mean`) provide descriptive context: they are useful for reasoning about dose support and the eventual dose-response interpretation, but are not themselves preflight checks. See §2 `TreatmentDoseShape` for the full preflight-vs-gate breakdown and the explicit warning against relabeling-to-manufacture-controls. `fit()` also rejects NaN `first_treat` rows, recodes `+inf` to 0 with a `UserWarning`, rejects negative `first_treat`, and drops units with `first_treat > 0` AND `dose == 0`.
3. Counter-example: had `treatment_varies_within_unit == True` (any unit's full dose path - including pre-treatment zeros - has more than one distinct value, e.g., a `0,0,d,d` adoption path with varying nonzero `d`), `ContinuousDiD` would not apply.
The two paths from there are (a) `HeterogeneousAdoptionDiD` if a scalar adoption summary fits, or (b) aggregate the dose to a binary indicator and fall back to a binary staggered estimator. 4. Counter-example: had `has_never_treated == False` (every unit eventually treated, even if some pre-treatment rows have zero dose so `treatment_dose.has_zero_dose == True`), `ContinuousDiD.fit()` would reject the panel under both `control_group="never_treated"` and `control_group="not_yet_treated"` because Remark 3.1 lowest-dose-as-control is not yet implemented. On this branch (no never-treated controls but doses still non-negative), `HeterogeneousAdoptionDiD` IS a routing alternative for graded-adoption designs, and linear DiD with the treatment as a continuous covariate is another; see §2 for the full routing discussion. 5. Counter-example: had `treatment_dose.dose_min < 0` (continuous panel with some negative-valued treated doses, e.g. a centered-around-zero treatment encoding), with a `first_treat` column consistent with the dose column, `ContinuousDiD.fit()` would raise at line 287-294 ("Dose must be strictly positive for treated units"). `HeterogeneousAdoptionDiD` is **not** a routing alternative here either — HAD requires non-negative dose support (`had.py:1450-1459`, paper Section 2). The applicable alternative is linear DiD with the treatment as a signed continuous covariate; see §2 for the full routing discussion. 6. Fit `ContinuousDiD`; the result object exposes the dose-response curve (`ATT(d)`) and average causal response (`ACRT(d)`); choose the headline estimand based on the business question (overall ATT under PT, or the dose-response curve under Strong PT). ### §5.3 Count-shaped outcome on a binary-staggered panel A panel of 300 retail outlets observed across 12 months, with three adoption cohorts. The outcome is "number of customer complaints per outlet per month" - integer-valued, lots of zero months, right-skewed. 
`profile_panel(...)` returns:

```
PanelProfile(
    treatment_type="binary_absorbing",
    is_staggered=True,
    has_never_treated=True,
    n_cohorts=3,
    outcome_dtype="int64",
    outcome_is_binary=False,
    outcome_has_zeros=True,
    outcome_has_negatives=False,
    outcome_summary={"min": 0, "max": 47, "mean": 1.8, "std": 3.2},
    outcome_shape=OutcomeShape(
        n_distinct_values=18,
        pct_zeros=0.43,
        value_min=0,
        value_max=47,
        skewness=2.1,
        excess_kurtosis=6.4,
        is_integer_valued=True,
        is_count_like=True,
        is_bounded_unit=False,
    ),
    treatment_dose=None,
    alerts=(),
)
```

Reasoning chain:

1. Same staggered-binary narrowing as §5.1 (CS/SA/dCDH/Imputation/TwoStage/Stacked/ETWFE/EfficientDiD applicable).
2. `outcome_shape.is_count_like == True` AND `outcome_shape.value_min >= 0` -> linear-OLS DiD imposes an additive functional form on a non-negative count outcome: estimates are unbiased for the linear ATT and the implementation already uses cluster-robust SEs that do not assume Gaussian errors. The trade-off is functional-form / efficiency, not inference calibration: the linear model can be inefficient on count data and may produce counterfactual predictions outside the non-negative support, while `WooldridgeDiD(method="poisson")` (QMLE) imposes a multiplicative (log-link) functional form that respects the non-negative support and matches the typical generative process for count data. It estimates the overall ATT as an ASF-based outcome-scale difference: the per-cell average of `E[exp(η_1)] - E[exp(η_0)]` (Wooldridge 2023; see REGISTRY.md §WooldridgeDiD nonlinear / ASF path), with QMLE sandwich SEs that are robust to distributional misspecification. The Poisson fitter hard-rejects negative outcomes (`y < 0` raises `ValueError` at line ~1105 of `wooldridge.py`), which is why `is_count_like` gates on `value_min >= 0`.
3.
Decision: fit `WooldridgeDiD(method="poisson")` if you want a multiplicative effect summary (with the outcome-scale headline reported as a difference; a percent-change reading can be derived post-hoc as `overall_att / E[Y_0]`). Fit linear-OLS DiD if the additive ATT is the right summary and counterfactual predictions stay safely within the non-negative support; the cluster-robust SEs are calibrated either way. Document which functional form the headline reflects; the two estimands are on the same outcome scale but parameterize the treatment effect differently.
4. Caveat: when the outcome has structural zeros or overdispersion that a Poisson specification fits poorly, consider negative binomial QMLE or a hurdle model; the current suite does not provide these natively, but the linear DiD with cluster-robust SEs remains defensible at sufficient sample size. The shape field flags the consideration; the choice is yours.

## §6. Post-fit validation utilities

After any `fit()`, the Baker et al. (2025) 8-step workflow recommends a diagnostic sequence. The library exposes utilities covering each step. Consult `get_llm_guide("practitioner")` for the workflow-prose form; this section is the API-reference index.

### Parallel-trends and pre-trends

- `check_parallel_trends(df, ...)` - exported from `diff_diff`. Regression-based visual-plus-numeric test on pre-treatment periods. Returns a structured result with p-value and per-period coefficients.
- `check_parallel_trends_robust(df, ...)` - Roth (2022) power-adjusted version; adds a "believable-magnitude" check against a power curve.
- `equivalence_test_trends(df, ...)` - Bilinski-Hatfield-style equivalence test (alternative framing of the PT test).
- `compute_pretrends_power(results, ...)` - standalone power analysis for the PT test; takes a fitted `MultiPeriodDiDResults` (or compatible event-study results object), not a raw DataFrame. Useful when `min_pre_periods` is small.
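The intuition behind the pre-trends utilities listed above can be sketched with plain arithmetic. The block below is an illustration only and is not how `check_parallel_trends` or any other diff-diff function is implemented. The helper names `pre_period_gap_changes` and `make_rows`, the column names `time`, `ever_treated`, and `y`, and the toy data are all hypothetical. It computes the treated-minus-control gap in mean outcomes for each pre-treatment period and reports how that gap changes from one period to the next; under parallel pre-trends those changes are zero.

```python
from statistics import mean


def pre_period_gap_changes(rows, first_treated_period):
    """Period-over-period change in the treated-minus-control mean gap.

    Only rows strictly before `first_treated_period` are used. Assumes every
    pre-period contains at least one row from each group.
    """
    by_cell = {}
    for row in rows:
        if row["time"] < first_treated_period:
            key = (row["time"], row["ever_treated"])
            by_cell.setdefault(key, []).append(row["y"])
    periods = sorted({time for time, _ in by_cell})
    gaps = [mean(by_cell[(t, 1)]) - mean(by_cell[(t, 0)]) for t in periods]
    return [later - earlier for earlier, later in zip(gaps, gaps[1:])]


def make_rows(treated_slope):
    """Hypothetical data: two ever-treated and two never-treated units."""
    rows = []
    for t in (1, 2, 3, 4):
        for base in (10, 12):  # ever-treated units
            rows.append({"time": t, "ever_treated": 1, "y": base + treated_slope * t})
        for base in (7, 9):  # never-treated units, slope fixed at 1
            rows.append({"time": t, "ever_treated": 0, "y": base + t})
    return rows


# Same slope in both groups: the gap is constant across pre-periods.
parallel_changes = pre_period_gap_changes(make_rows(treated_slope=1), first_treated_period=4)
# Steeper slope in the treated group: the gap widens every period.
diverging_changes = pre_period_gap_changes(make_rows(treated_slope=2), first_treated_period=4)
print(parallel_changes, diverging_changes)
```

Real pre-trends tests add standard errors and, per the power-aware utilities above, a check on what size of violation the test could have detected; this sketch shows only the quantity being tested.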
### Sensitivity / robustness

- `compute_honest_did(results, ...)` - Rambachan-Roth (2023) honest DiD. Quantifies the sensitivity of ATT to parallel-trends violations. Outputs sensitivity bounds under smoothness restrictions.
- `compute_pretrends_power(results, ...)` - complementary tool for power-aware pre-trends interpretation (same fitted-results-first signature as above).

### Placebo tests

- `run_placebo_test(df, ...)` - generic placebo runner.
- `run_all_placebo_tests(df, ...)` - batch runner over predefined placebos.
- `placebo_timing_test(df, ...)` - false placebo-treatment time.
- `placebo_group_test(df, ...)` - placebo treatment-group assignment.
- `permutation_test(df, ...)` - Fisher-style exact permutation.
- `leave_one_out_test(df, ...)` - refit dropping one unit at a time.

### Estimator-native diagnostics

Some estimators expose diagnostics as methods on the result object:

- `SyntheticDiDResults.in_time_placebo()` - placebo treatment applied in a pre-treatment period.
- `SyntheticDiDResults.sensitivity_to_zeta_omega()` - regularization-hyperparameter sensitivity.
- `SyntheticDiDResults.get_weight_concentration()` - donor-weight concentration summary.
- `CallawaySantAnna.diagnose_propensity(df, ...)` - propensity-score overlap check when using DR / IPW controls.
- `EfficientDiD.hausman_pretest(df, ...)` - chooses between `PT-All` and `PT-Post` for `EfficientDiD`.
- `did_had_pretest_workflow(df, ...)` - bundled QUG / Stute / Yatchew-Härdle pre-test battery for `HeterogeneousAdoptionDiD`.

### Decomposition and weight auditing

- `bacon_decompose(df, ...)` - Goodman-Bacon (2021) TWFE weight decomposition. Returns a `BaconDecompositionResults` with the weight on forbidden (later-vs-earlier) comparisons. Run before interpreting any TWFE staggered fit.
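To see why the weight on forbidden comparisons matters, consider a single later-vs-earlier 2x2 comparison worked out by hand. The sketch below is plain arithmetic under stated assumptions and does not call `bacon_decompose`. The effect path (1 at adoption, growing by 2 each period), the zero untreated baseline, and the adoption periods are hypothetical values chosen so that parallel trends holds exactly and the only source of bias is the already-treated control group.

```python
def treatment_effect(event_time):
    """Hypothetical dynamic effect: 1 at adoption, growing by 2 per period."""
    return 1 + 2 * event_time if event_time >= 0 else 0


def outcome(cohort, period):
    """Outcome with a zero untreated baseline, so parallel trends holds exactly."""
    return treatment_effect(period - cohort)


EARLY, LATE = 2, 4  # adoption periods of the two cohorts
PRE, POST = 3, 4    # window around the late cohort's adoption

# Later-vs-earlier 2x2: the late cohort is "treated", and the early cohort
# (already treated in both periods) stands in as the "control".
late_change = outcome(LATE, POST) - outcome(LATE, PRE)
early_change = outcome(EARLY, POST) - outcome(EARLY, PRE)
forbidden_did = late_change - early_change

# The quantity the comparison is supposed to recover.
true_att_late = treatment_effect(POST - LATE)

print(forbidden_did, true_att_late)
```

The early cohort's effect is still growing inside the window, so its outcome change is subtracted as if it were an untreated trend. In this toy setup the 2x2 estimate comes out negative while the true effect on the late cohort is positive.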
### Event-study plotting - `plot_event_study(results, ...)` - `plot_group_effects(results, ...)` - `plot_group_time_heatmap(results, ...)` - `plot_staircase(results, ...)` - `plot_honest_event_study(honest_results, ...)` - takes a `HonestDiDResults` returned by `compute_honest_did`, not a fit result directly. - `plot_sensitivity(sensitivity_results, ...)` - takes a `SensitivityResults` object (the result of honest-DiD sensitivity analysis), not a fit result directly. - `plot_synth_weights(results, ...)` - `plot_dose_response(results, ...)` - `plot_power_curve(...)` Event-study plots are also a diagnostic - pre-treatment coefficients close to zero support parallel trends. ## §7. How to read BusinessReport / DiagnosticReport output `BusinessReport(results)` and `DiagnosticReport(results)` are experimental in the 3.2 line. Their schema is versioned (`BUSINESS_REPORT_SCHEMA_VERSION` and `DIAGNOSTIC_REPORT_SCHEMA_VERSION`, both `"2.0"` at time of writing) and expected to evolve. Treat `.to_dict()` output as the agent-legible contract; the prose renderers (`summary()`, `full_report()`) are derived from it. ### BusinessReport `to_dict()` schema (v2.0) Top-level keys emitted by `BusinessReport.to_dict()` (source: `diff_diff/business_report.py`): - `schema_version: str` - `BUSINESS_REPORT_SCHEMA_VERSION`, e.g. `"2.0"`. - `estimator: dict` - `class_name` (the fitted result class) and a human-friendly `display_name`. - `context: dict` - the `BusinessContext` bundle: `outcome_label`, `outcome_unit`, `outcome_direction`, `business_question`, `treatment_label`, `alpha`. - `headline: dict` - the main point estimate plus framing fields. - `target_parameter: dict` - what the headline scalar represents. Fields: `name` (e.g. `"ATT"`, `"DID_M"`, `"dose-response"`, `"WAS"`), `definition` (plain-English description), `aggregation` (machine tag), `headline_attribute` (raw result attribute), and `reference` (REGISTRY.md citation string). 
- `assumption: dict` - named assumptions relied on (parallel trends, no anticipation, SUTVA, ...). Note: singular `"assumption"`, not `"assumptions"`. - `pre_trends: dict` - pre-trends test result with verdict string (e.g. `"clean"`, `"inconclusive"`, `"violated"`), p-value, and power assessment if available. Note: underscore-split `"pre_trends"`. - `sensitivity: dict` - HonestDiD sensitivity summary when available. - `sample: dict` - sample size and coverage details. Note: bare `"sample"`, not `"sample_summary"`. - `heterogeneity: dict` - heterogeneity summary if applicable. - `robustness: dict` - placebo / robustness summaries if available. - `diagnostics: dict` - a wrapper around the auto-constructed `DiagnosticReport`. Always has a `status` field: `"skipped"` with a `reason` when `auto_diagnostics=False`, otherwise `"ran"` with the full DR `to_dict()` payload under `diagnostics["schema"]` and a mirrored `overall_interpretation` string. Parse `schema` (not `diagnostics` directly) to access the DR sections documented below. - `next_steps: list[dict]` - Baker et al. next-step guidance from `practitioner_next_steps`. - `caveats: list[str]` - free-text caveats generated from failed checks. - `references: list[dict]` - citations relevant to the estimator. ### DiagnosticReport `to_dict()` schema (v2.0) Top-level keys (source: `diff_diff/diagnostic_report.py`): - `schema_version: str` - `DIAGNOSTIC_REPORT_SCHEMA_VERSION`. - `estimator: str` - the fitted result class name. - `headline_metric: dict` - the main scalar the report headlines. - `target_parameter: dict` - same shape as the BR field above. - `parallel_trends: dict` - PT test result. - `pretrends_power: dict` - power-aware pre-trends assessment when applicable. - `sensitivity: dict` - HonestDiD sensitivity summary. - `placebo: dict` - placebo-test results. - `bacon: dict` - Goodman-Bacon decomposition when applicable. - `design_effect: dict` - survey / clustering design-effect summary. 
- `heterogeneity: dict` - group-time heterogeneity summary. - `epv: dict` - events-per-variable / sample-adequacy. - `estimator_native_diagnostics: dict` - estimator-specific diagnostics (e.g. SDiD weight concentration, TROP factor-model fit). - `skipped: dict` - checks skipped on this estimator type, with the reason. - `warnings: list[str]` - top-level aggregated warnings. - `overall_interpretation: str` - rendered prose summary of the sections. - `next_steps: list[dict]` - same shape as the BR field. Each section value is a dict. Parse it in two layers: 1. `status: str` — execution state, not qualitative interpretation. The values actually emitted by `DiagnosticReport.to_dict()` are: `"ran"` (section executed), `"not_applicable"` (check does not apply to this estimator or design), `"not_run"` (implementation pending), `"no_scalar_by_design"` (for estimators that return a table instead of a scalar headline, e.g. dCDH with `trends_linear=True, L_max>=2`), and `"skipped"` (auto-diagnostics disabled or the section was short-circuited at top level). 2. `verdict: str` (only present when `status == "ran"`) — qualitative interpretation of the executed check. Candidate values include `"clean"`, `"inconclusive"`, `"violated"`, and section-specific labels. `reason: str` is an optional free-text explanation that usually accompanies non-`"ran"` statuses; it may also appear on `"ran"` sections as supplementary context. The rest of each section dict is section-specific payload (e.g. p-values, coefficients, cohort tables). Forthcoming schema additions (not yet shipped): a top-level `sanity_checks` block (machine-legible pass/warn/fail summary) and a `mismatch_warnings` list (post-hoc estimator-mismatch detection) are queued for a later wave. Treat their current absence as expected. ## §8. Glossary + citations **ATT**: Average Treatment Effect on the Treated. The target parameter of most DiD estimators. 
**Parallel trends**: counterfactual trends in treated and control outcomes would have moved together absent treatment. Untestable directly; pre-treatment dynamics are a necessary (not sufficient) indicator.

**No anticipation**: units do not respond to treatment before it occurs. If plausible, test via pre-treatment event-study coefficients.

**SUTVA**: Stable Unit Treatment Value Assumption. Rules out spillovers and interference between units.

**Forbidden comparison**: in TWFE, a comparison where already-treated units serve as controls for later-treated units. Weights can be negative and the resulting estimate can flip sign vs. the true ATT.

**Cohort / treatment timing**: first-treatment period for an absorbing-treatment unit. Units sharing a cohort share an adoption date.

**Staggered adoption**: two or more cohorts present in the panel.

**Doubly-robust (DR) / IPW / RA**: three covariate-adjustment strategies in `CallawaySantAnna`. DR is consistent if either the propensity model or the outcome model is correctly specified.

### Primary references

- **Baker, Andrew, Brantly Callaway, Scott Cunningham, Andrew Goodman-Bacon, and Pedro H. C. Sant'Anna (2025).** "Difference-in-Differences Designs: A Practitioner's Guide." arXiv:2503.13323. The 8-step workflow and best-practice framing. Ships as `get_llm_guide("practitioner")`.
- **Roth, Jonathan, Pedro H. C. Sant'Anna, Alyssa Bilinski, and John Poe (2023).** "What's Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature." Journal of Econometrics 235(2): 2218-2244. Canonical-assumption framing; classification of estimator relaxations.
- **Goodman-Bacon, Andrew (2021).** "Difference-in-Differences with Variation in Treatment Timing." Journal of Econometrics 225(2): 254-277. TWFE weight decomposition; `bacon_decompose` implements this.
- **Callaway, Brantly, and Pedro H. C. Sant'Anna (2021).** "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics 225(2): 200-230.
Group-time ATT.
- **Sun, Liyang, and Sarah Abraham (2021).** "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects." Journal of Econometrics 225(2): 175-199. IW estimator.
- **de Chaisemartin, Clément, and Xavier d'Haultfoeuille (2020).** "Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects." American Economic Review 110(9): 2964-2996. DID_M estimator.
- **Borusyak, Kirill, Xavier Jaravel, and Jann Spiess (2024).** "Revisiting Event-Study Designs: Robust and Efficient Estimation." Review of Economic Studies 91(6): 3253-3285. Imputation estimator.
- **Gardner, John (2022).** "Two-Stage Differences in Differences." arXiv:2207.05943. Two-stage estimator.
- **Wooldridge, Jeffrey M. (2021).** "Two-Way Fixed Effects, the Two-Way Mundlak Regression, and Difference-in-Differences Estimators." ETWFE formulation.
- **Arkhangelsky, Dmitry, Susan Athey, David Hirshberg, Guido Imbens, and Stefan Wager (2021).** "Synthetic Difference-in-Differences." American Economic Review 111(12): 4088-4118. SDiD estimator.
- **Rambachan, Ashesh, and Jonathan Roth (2023).** "A More Credible Approach to Parallel Trends." Review of Economic Studies 90(5): 2555-2591. HonestDiD sensitivity.
- **Bilinski, Alyssa, and Laura A. Hatfield (2019).** "Nothing to See Here? Non-Inferiority Approaches to Parallel Trends and Other Model Assumptions." arXiv:1805.03273. Equivalence test.
- **Sant'Anna, Pedro H. C., and Jun Zhao (2020).** "Doubly Robust Difference-in-Differences Estimators." Journal of Econometrics 219(1): 101-122. DR adjustment.
- **Chen, Xiaohong, Pedro H. C. Sant'Anna, and Haitian Xie (2025).** "Efficient Difference-in-Differences and Event Study Estimators." Primary source for the `EfficientDiD` estimator (PT-All / PT-Post framing and efficient combination weights).
- **Callaway, Brantly, Andrew Goodman-Bacon, and Pedro H. C. Sant'Anna (2024).** "Difference-in-Differences with a Continuous Treatment."
Primary source for `ContinuousDiD`; introduces the Parallel Trends vs Strong Parallel Trends distinction underlying `ATT(d|d)`, `ATT(d)`, `ACRT(d)`, and `ACRT^{glob}`.

### Online resources

- **psantanna.com/did-resources** - practitioner checklist + reading list maintained by Pedro Sant'Anna.
- **bcallaway11.github.io/did** - `did` R package tutorials (Callaway-Sant'Anna).

## §9. Intentional omissions

This guide does **not**:

- Recommend a specific estimator for a specific dataset. When multiple estimators fit, §4 lists them and names the trade-offs; the choice is the agent's.
- Enumerate every possible design edge case. The literature cited in §8 covers them; this guide is a navigation aid, not a substitute.
- Promise forward-compatibility of the BR / DR schema or the alert catalogue. Treat these as experimental until the 12-item foundation-gap list closes.
- Replace `bacon_decompose()`, `compute_honest_did()`, or any of the estimator-native diagnostics. Post-fit validation is mandatory, not optional, and belongs in the final write-up.
- Cover methods outside diff-diff's estimator suite (e.g., instrumental variables, regression discontinuity, synthetic control for a single treated unit). When those apply, point the user at dedicated libraries.

**If in doubt, consult the primary references in §8 and use `get_llm_guide("practitioner")` for the Baker et al. workflow.**