Bacon Decomposition (Goodman-Bacon 2021)#

Diagnostic decomposition of Two-Way Fixed Effects (TWFE) estimators for staggered treatment designs.

This module implements the Goodman-Bacon (2021) decomposition, which reveals that a TWFE estimate with variation in treatment timing is a weighted average of all possible 2x2 Difference-in-Differences comparisons. The decomposition exposes the implicit comparisons that drive the TWFE estimate – including potentially problematic “forbidden comparisons” where already-treated units serve as controls – and quantifies their relative importance.

When to use BaconDecomposition:

  • You have a staggered adoption design and want to diagnose whether the TWFE estimate is driven by clean or problematic comparisons

  • You need to assess the severity of heterogeneous treatment effect bias in existing TWFE results

  • You want to understand why TWFE and robust estimators (e.g., Callaway-Sant’Anna) produce different estimates

  • You are deciding whether a simple TWFE specification is adequate or whether a robust staggered estimator is needed

Reference: Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254-277.

BaconDecomposition#

Main estimator class for the Goodman-Bacon decomposition.

class diff_diff.BaconDecomposition[source]

Bases: object

Goodman-Bacon (2021) decomposition of Two-Way Fixed Effects estimator.

This class decomposes a TWFE estimate into a weighted average of all possible 2x2 DiD comparisons, revealing the implicit comparisons that drive the TWFE estimate and their relative importance.
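As a reading aid, the weighted-average identity can be written out in notation close to that of Goodman-Bacon (2021), Theorem 1. Here U denotes the untreated group, k and ℓ index timing groups with k treated before ℓ, and the s terms are the decomposition weights; the symbols are a notational sketch and are not identifiers in this library.

```latex
\hat{\beta}^{TWFE}
  = \sum_{k \neq U} s_{kU}\,\hat{\beta}^{2\times 2}_{kU}
  + \sum_{k \neq U}\sum_{\ell > k}
    \left[
      s^{k}_{k\ell}\,\hat{\beta}^{2\times 2,\,k}_{k\ell}
      + s^{\ell}_{k\ell}\,\hat{\beta}^{2\times 2,\,\ell}_{k\ell}
    \right],
\qquad
\sum_{k \neq U} s_{kU}
  + \sum_{k \neq U}\sum_{\ell > k}\left( s^{k}_{k\ell} + s^{\ell}_{k\ell} \right) = 1 .
```

The first sum collects the treated vs never-treated comparisons, the terms weighted by s^k the earlier vs later comparisons, and the terms weighted by s^ℓ the later vs earlier (forbidden) comparisons.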

The decomposition identifies three types of comparisons:

  1. Treated vs Never-treated: Uses never-treated units as controls. These are “clean” comparisons without bias concerns.

  2. Earlier vs Later treated: Units treated earlier are compared to units treated later, using the later group as controls before they are treated. These are valid comparisons.

  3. Later vs Earlier treated: Units treated later are compared to units treated earlier, using the earlier group as controls AFTER they are already treated. These are “forbidden comparisons” that can introduce bias when treatment effects vary over time.
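The three comparison types can be illustrated by enumerating every ordered (treated, control) pair of timing groups. The following is a minimal sketch, not the library's implementation: the helper names `is_never_treated` and `classify_comparison` are hypothetical, and the cohort values are made up. It assumes the never-treated sentinels 0 and infinity described for `first_treat` in this document.

```python
import math


def is_never_treated(first_treat):
    """Return True for the never-treated sentinels (0 or infinity)."""
    return first_treat == 0 or math.isinf(first_treat)


def classify_comparison(treated_first, control_first):
    """Label one 2x2 comparison by the timing of its treated and control groups.

    Hypothetical helper for illustration only.
    """
    if is_never_treated(treated_first):
        raise ValueError("the treated group needs a real treatment period")
    if is_never_treated(control_first):
        return "treated_vs_never"
    if treated_first == control_first:
        raise ValueError("a timing group cannot be its own control")
    if treated_first < control_first:
        # Control group is not yet treated during the comparison window.
        return "earlier_vs_later"
    # Control group is already treated: the "forbidden" comparison.
    return "later_vs_earlier"


cohorts = [4, 6, 8]          # made-up first-treatment periods
controls = cohorts + [0]     # 0 marks the never-treated group

pairs = [
    (treated, control, classify_comparison(treated, control))
    for treated in cohorts
    for control in controls
    if treated != control
]

for treated, control, kind in pairs:
    print(f"treated={treated} control={control} -> {kind}")
```

With three timing groups and a never-treated group, each of the three comparison types appears three times in this enumeration.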

Parameters:

weights (str, default="exact") –

Weight calculation method:

  • “exact” (default): Variance-based weights from Goodman-Bacon (2021) Theorem 1, Eqs. 7-9 and 10e-g. Produces the paper-faithful decomposition where the weighted sum matches the TWFE estimate to machine precision. Use for publication-quality work and the standard methodology contract.

  • “approximate”: Fast simplified formula using group shares and treatment variance, with post-hoc sum-to-1 normalization. Opt in for speed-sensitive diagnostic loops where the relative weight structure is sufficient. Approximate-mode results may differ numerically from R bacondecomp::bacon().
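The post-hoc sum-to-1 normalization mentioned for approximate mode amounts to dividing each raw weight by the total. The sketch below shows only that final step; the function name `normalize_weights` and the raw weights are made up for illustration, and the way the library computes the raw weights themselves is not shown here.

```python
def normalize_weights(raw_weights):
    """Rescale non-negative raw weights so that they sum to 1."""
    total = sum(raw_weights)
    if total <= 0:
        raise ValueError("raw weights must sum to a positive number")
    return [w / total for w in raw_weights]


raw = [0.12, 0.30, 0.18]            # made-up raw weights for three comparisons
weights = normalize_weights(raw)

print([round(w, 3) for w in weights])
print(round(sum(weights), 12))
```

Normalization preserves the relative ordering of the weights, which is why the approximate mode can still be adequate when only the relative weight structure matters.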

weights

The weight calculation method.

Type:

str

results_

Decomposition results after calling fit().

Type:

BaconDecompositionResults

is_fitted_

Whether the model has been fitted.

Type:

bool

Examples

Basic usage:

>>> import pandas as pd
>>> from diff_diff import BaconDecomposition
>>>
>>> # Panel data with staggered treatment
>>> data = pd.DataFrame({
...     'unit': [...],
...     'time': [...],
...     'outcome': [...],
...     'first_treat': [...]  # 0 for never-treated
... })
>>>
>>> bacon = BaconDecomposition()
>>> results = bacon.fit(data, outcome='outcome', unit='unit',
...                     time='time', first_treat='first_treat')
>>> results.print_summary()

Visualizing the decomposition:

>>> from diff_diff import plot_bacon
>>> plot_bacon(results)

Notes

The key insight from Goodman-Bacon (2021) is that TWFE with staggered treatment timing implicitly makes comparisons using already-treated units as controls. When treatment effects are dynamic (changing over time since treatment), these “forbidden comparisons” can bias the TWFE estimate, potentially even reversing its sign.

The decomposition helps diagnose this issue by showing:

  • How much weight is on each type of comparison

  • Whether forbidden comparisons contribute significantly to the estimate

  • How the 2x2 estimates vary across comparison types

If forbidden comparisons have substantial weight and different estimates than clean comparisons, consider using robust estimators like Callaway-Sant’Anna that avoid these problematic comparisons.

References

Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254-277.

See also

CallawaySantAnna

Robust estimator for staggered DiD

TwoWayFixedEffects

The TWFE estimator being decomposed

Methods

fit(data, outcome, unit, time, first_treat, survey_design=None)

Perform the Goodman-Bacon decomposition.

get_params()

Get estimator parameters (sklearn-compatible).

set_params(**params)

Set estimator parameters (sklearn-compatible).

__init__(weights='exact')[source]

Initialize BaconDecomposition.

Parameters:

weights (str, default="exact") –

Weight calculation method:

  • “exact” (default): Variance-based weights from Goodman-Bacon (2021) Theorem 1 (paper-faithful Eqs. 7-9 and 10e-g).

  • “approximate”: Fast simplified formula. Opt in for speed.

results_: BaconDecompositionResults | None
is_fitted_: bool
fit(data, outcome, unit, time, first_treat, survey_design=None)[source]

Perform the Goodman-Bacon decomposition.

Parameters:
  • data (pd.DataFrame) – Panel data with unit and time identifiers.

  • outcome (str) – Name of outcome variable column.

  • unit (str) – Name of unit identifier column.

  • time (str) – Name of time period column.

  • first_treat (str) – Name of the column indicating when each unit was first treated. The values 0 and np.inf are reserved as never-treated sentinels (not currently configurable); a real treatment cohort with first_treat == 0 would be folded into U and should be re-labeled to a non-sentinel value before fitting. Units whose first_treat is at or before the first observable period (first_treat <= min(time), excluding the never-treated sentinels 0 and np.inf) are automatically remapped to the U (untreated) bucket per Goodman-Bacon (2021) footnote 11, and a UserWarning is issued. This rule extends the paper’s boundary: the paper uses the strict inequality t_i < 1 (units treated before the first observable period), whereas the library uses the inclusive rule first_treat <= min(time), which also folds units treated at the first observable period (first_treat == min(time)) into U because such units have no untreated cell in the panel. See the “Deviation (first-period boundary extension on always-treated remap)” block in REGISTRY for the full contract. Detection uses ordered-time logic on the time axis, so panels whose time column contains negative or zero-crossing labels (e.g. event time in [-2, ..., 3]) are handled correctly; the 0 sentinel restriction applies only to first_treat, not to time. The original first_treat column on data is left unchanged; remapping happens in an internal column. The number of remapped units is exposed on the result as BaconDecompositionResults.n_always_treated_remapped.

  • survey_design (SurveyDesign, optional) – Survey design specification for weighted estimation. When provided, all means and group shares use survey weights. The decomposition remains diagnostic (no survey vcov needed).

Returns:

Object containing decomposition results.

Return type:

BaconDecompositionResults

Raises:

ValueError – If required columns are missing or data validation fails.
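The always-treated remapping rule described for first_treat can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the library's code: the function name `remap_always_treated` is hypothetical, the data are made up, and representing the U bucket internally as infinity is an assumption of this sketch.

```python
import math


def remap_always_treated(first_treat_by_unit, periods):
    """Fold units treated at or before the first observed period into U.

    Returns a new mapping plus the number of remapped units; the input
    mapping is left unchanged.
    """
    first_period = min(periods)  # ordered-time logic: works for negative labels
    remapped = {}
    n_remapped = 0
    for unit, first_treat in first_treat_by_unit.items():
        is_sentinel = first_treat == 0 or math.isinf(first_treat)
        if not is_sentinel and first_treat <= first_period:
            remapped[unit] = math.inf   # assumption: U is stored as infinity
            n_remapped += 1
        else:
            remapped[unit] = first_treat
    return remapped, n_remapped


periods = range(-2, 4)                  # event-time labels -2, ..., 3
original = {
    "A": -2,        # treated at the first observed period -> remapped
    "B": -3,        # treated before the panel starts -> remapped
    "C": 1,         # ordinary timing group -> kept
    "D": 0,         # never-treated sentinel, even though 0 is a valid period
    "E": math.inf,  # never-treated sentinel
}

remapped, n_remapped = remap_always_treated(original, periods)
print(remapped)
print("remapped units:", n_remapped)
```

Unit D shows why the 0 sentinel applies only to first_treat: period 0 exists on the time axis, yet first_treat == 0 is still read as never-treated and is not counted as remapped.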

get_params()[source]

Get estimator parameters (sklearn-compatible).

Return type:

Dict[str, Any]

set_params(**params)[source]

Set estimator parameters (sklearn-compatible).

Return type:

BaconDecomposition

summary()[source]

Get summary of decomposition results.

Return type:

str

print_summary()[source]

Print summary to stdout.

Return type:

None

BaconDecompositionResults#

Results container for the Bacon decomposition.

class diff_diff.bacon.BaconDecompositionResults[source]

Bases: object

Results from Goodman-Bacon decomposition of TWFE.

This decomposition shows that the TWFE estimate equals a weighted average of all possible 2x2 DiD comparisons between timing groups.

twfe_estimate

The overall TWFE coefficient (should equal the weighted sum of the 2x2 estimates).

Type:

float

comparisons

List of all 2x2 comparisons with their estimates and weights.

Type:

List[Comparison2x2]

total_weight_treated_vs_never

Total weight on treated vs never-treated comparisons.

Type:

float

total_weight_earlier_vs_later

Total weight on earlier vs later treated comparisons.

Type:

float

total_weight_later_vs_earlier

Total weight on later vs earlier treated comparisons (forbidden).

Type:

float

weighted_avg_treated_vs_never

Weighted average effect from treated vs never-treated comparisons.

Type:

float | None

weighted_avg_earlier_vs_later

Weighted average effect from earlier vs later comparisons.

Type:

float | None

weighted_avg_later_vs_earlier

Weighted average effect from later vs earlier comparisons.

Type:

float | None

n_timing_groups

Number of distinct treatment timing groups.

Type:

int

n_never_treated

Number of never-treated units.

Type:

int

n_always_treated_remapped

Number of units whose first_treat was at or before the first observable period (first_treat <= min(time), excluding the never-treated sentinels 0 and np.inf) and which were automatically remapped to the U (untreated) bucket per Goodman-Bacon (2021) footnote 11. Detection uses ordered-time logic so negative or zero-crossing period labels work correctly. The count is zero when no unit meets this condition, for example when never-treated units are marked only with the 0 or np.inf sentinels and every treated cohort starts after min(time). The user’s original first_treat column is preserved unchanged on the input data frame; remapping happens in an internal column.

Type:

int

timing_groups

List of treatment timing cohorts.

Type:

List[Any]

Methods

summary()

Generate a formatted summary of the decomposition.

print_summary()

Print the summary to stdout.

to_dataframe()

Convert comparisons to a DataFrame.

weight_by_type()

Get total weight by comparison type.

effect_by_type()

Get weighted average effect by comparison type.

twfe_estimate: float
comparisons: List[Comparison2x2]
total_weight_treated_vs_never: float
total_weight_earlier_vs_later: float
total_weight_later_vs_earlier: float
weighted_avg_treated_vs_never: float | None
weighted_avg_earlier_vs_later: float | None
weighted_avg_later_vs_earlier: float | None
n_timing_groups: int
n_never_treated: int
timing_groups: List[Any]
n_obs: int = 0
decomposition_error: float = 0.0
n_always_treated_remapped: int = 0
survey_metadata: Any | None = None
summary()[source]

Generate a formatted summary of the decomposition.

Returns:

Formatted summary table.

Return type:

str

print_summary()[source]

Print the summary to stdout.

Return type:

None

to_dataframe()[source]

Convert comparisons to a DataFrame.

Returns:

DataFrame with one row per 2x2 comparison.

Return type:

pd.DataFrame

weight_by_type()[source]

Get total weight by comparison type.

Returns:

Dictionary mapping comparison type to total weight.

Return type:

Dict[str, float]

effect_by_type()[source]

Get weighted average effect by comparison type.

Returns:

Dictionary mapping comparison type to weighted average effect.

Return type:

Dict[str, Optional[float]]
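The per-type aggregates relate to the individual comparisons in a simple way: weights are summed within each type, and effects are averaged using those weights. The sketch below reproduces that arithmetic on made-up numbers. The `sketch_` functions are hypothetical stand-ins, not the library's methods, and returning None for a type with zero total weight is an assumption suggested by the Optional return type.

```python
from collections import defaultdict

# (comparison_type, estimate, weight) with made-up values; weights sum to 1.
comparisons = [
    ("treated_vs_never", 2.0, 0.50),
    ("earlier_vs_later", 1.8, 0.20),
    ("earlier_vs_later", 2.2, 0.10),
    ("later_vs_earlier", 0.5, 0.20),
]


def sketch_weight_by_type(rows):
    """Sum the weights of all comparisons of each type."""
    totals = defaultdict(float)
    for kind, _, weight in rows:
        totals[kind] += weight
    return dict(totals)


def sketch_effect_by_type(rows):
    """Weight-averaged 2x2 estimate within each comparison type."""
    totals = sketch_weight_by_type(rows)
    weighted = defaultdict(float)
    for kind, estimate, weight in rows:
        weighted[kind] += estimate * weight
    return {
        kind: (weighted[kind] / total if total > 0 else None)
        for kind, total in totals.items()
    }


weights = sketch_weight_by_type(comparisons)
effects = sketch_effect_by_type(comparisons)

# The weighted sum over all comparisons is what the TWFE estimate should equal.
weighted_sum = sum(estimate * weight for _, estimate, weight in comparisons)

print(weights)
print(effects)
print(round(weighted_sum, 4))
```

In this made-up example the forbidden comparisons carry a fifth of the weight and have a much smaller estimate than the clean ones, which pulls the overall weighted sum below the clean-comparison effects.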

__init__(twfe_estimate, comparisons, total_weight_treated_vs_never, total_weight_earlier_vs_later, total_weight_later_vs_earlier, weighted_avg_treated_vs_never, weighted_avg_earlier_vs_later, weighted_avg_later_vs_earlier, n_timing_groups, n_never_treated, timing_groups, n_obs=0, decomposition_error=0.0, n_always_treated_remapped=0, survey_metadata=None)
Parameters:
  • twfe_estimate (float)

  • comparisons (List[Comparison2x2])

  • total_weight_treated_vs_never (float)

  • total_weight_earlier_vs_later (float)

  • total_weight_later_vs_earlier (float)

  • weighted_avg_treated_vs_never (float | None)

  • weighted_avg_earlier_vs_later (float | None)

  • weighted_avg_later_vs_earlier (float | None)

  • n_timing_groups (int)

  • n_never_treated (int)

  • timing_groups (List[Any])

  • n_obs (int)

  • decomposition_error (float)

  • n_always_treated_remapped (int)

  • survey_metadata (Any | None)

Return type:

None

Comparison2x2#

Container for an individual 2x2 DiD comparison within the decomposition.

class diff_diff.bacon.Comparison2x2[source]

Bases: object

A single 2x2 DiD comparison in the Bacon decomposition.

treated_group

The timing group used as “treated” in this comparison.

Type:

Any

control_group

The timing group used as “control” in this comparison. For comparison_type="treated_vs_never", this is the literal string "never_treated", which refers to the post-remap U bucket (the paper’s U per Goodman-Bacon 2021 footnote 11). On inputs with no remapped always-treated units this is exactly the true never-treated set; with remapping it is the broader U bucket. Check BaconDecompositionResults.n_never_treated and n_always_treated_remapped for the precise composition.

Type:

Any

comparison_type

Type of comparison: “treated_vs_never”, “earlier_vs_later”, or “later_vs_earlier”.

Type:

str

estimate

The 2x2 DiD estimate for this comparison.

Type:

float

weight

The weight assigned to this comparison in the TWFE average.

Type:

float

n_treated

Number of treated observations in this comparison.

Type:

int

n_control

Number of control observations in this comparison.

Type:

int

time_window

The (start, end) time period for this comparison.

Type:

Tuple[float, float]

treated_group: Any
control_group: Any
comparison_type: str
estimate: float
weight: float
n_treated: int
n_control: int
time_window: Tuple[float, float]
__init__(treated_group, control_group, comparison_type, estimate, weight, n_treated, n_control, time_window)
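Parameters:
  • treated_group (Any)

  • control_group (Any)

  • comparison_type (str)

  • estimate (float)

  • weight (float)

  • n_treated (int)

  • n_control (int)

  • time_window (Tuple[float, float])

Return type:

None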
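Each Comparison2x2 estimate is a standard two-group, two-period difference-in-differences. As a minimal sketch of that calculation (the function name `did_2x2` and all outcome values are made up, and the library's own computation may differ in details such as weighting), the estimate is the change in the treated group's mean outcome minus the change in the control group's mean outcome:

```python
from statistics import mean


def did_2x2(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences from four lists of outcomes."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


estimate = did_2x2(
    treated_pre=[10.0, 11.0],
    treated_post=[15.0, 16.0],
    control_pre=[9.0, 10.0],
    control_post=[11.0, 12.0],
)
print(estimate)
```

In a later vs earlier comparison the control lists would come from an already-treated group, so any change in that group's treatment effect over the window is subtracted from the estimate; this is the source of the bias discussed above.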

Convenience Function#

diff_diff.bacon_decompose(data, outcome, unit, time, first_treat, weights='exact', survey_design=None)[source]#

Convenience function for Goodman-Bacon decomposition.

Decomposes a TWFE estimate into weighted 2x2 DiD comparisons, showing which comparisons drive the estimate and whether problematic “forbidden comparisons” are involved.

Parameters:
  • data (pd.DataFrame) – Panel data with unit and time identifiers.

  • outcome (str) – Name of outcome variable column.

  • unit (str) – Name of unit identifier column.

  • time (str) – Name of time period column.

  • first_treat (str) – Name of column indicating when unit was first treated. The values 0 and np.inf are reserved as never-treated sentinels; a real treatment cohort with first_treat == 0 would be folded into U and should be re-labeled to a non-sentinel value before fitting. Units whose first_treat is at or before the first observable period (first_treat <= min(time), excluding the sentinels) are automatically remapped to the U (untreated) bucket per Goodman-Bacon (2021) footnote 11, with a UserWarning. See BaconDecomposition.fit() for the full contract and BaconDecompositionResults.n_always_treated_remapped for the count. The user’s original first_treat column is preserved unchanged.

  • weights (str, default="exact") –

    Weight calculation method:

    • “exact” (default): Variance-based weights from Goodman-Bacon (2021) Theorem 1, Eqs. 7-9 and 10e-g. Paper-faithful.

    • “approximate”: Fast simplified formula. Opt in for speed-sensitive diagnostic loops; numerical output may differ from R bacondecomp::bacon().

  • survey_design (SurveyDesign, optional) –

    Survey design specification for weighted estimation. When provided, cell means, group shares, and the within transformation all use survey weights. The decomposition remains diagnostic (no survey vcov needed).

    Default-flip caveat (PR-B, 2026-05-16): the new weights="exact" default routes through _validate_unit_constant_survey, which rejects survey designs whose weights / strata / PSU / FPC columns vary within a unit across periods (the exact path collapses to per-unit aggregation via groupby().first()). Users whose survey design has time-varying within-unit columns must either (a) collapse the columns to be unit-constant or (b) pass explicit weights="approximate" to retain the legacy observation-level weighted-means path.

Returns:

Object containing decomposition results with:

  • twfe_estimate: The overall TWFE coefficient

  • comparisons: List of all 2x2 comparisons with estimates and weights

  • Weight totals by comparison type

  • Methods for visualization and export

Return type:

BaconDecompositionResults

Examples

>>> from diff_diff import bacon_decompose
>>>
>>> # Default: paper-faithful Goodman-Bacon (2021) Theorem 1 weights
>>> # (weights="exact"); matches R bacondecomp::bacon() at atol=1e-6 on
>>> # the aggregate (TWFE coefficient + weights-sum) across all panels,
>>> # and on the per-component breakdown when there are no
>>> # always-treated / first-period-treated cohorts (i.e. all
>>> # non-sentinel first_treat values are strictly greater than
>>> # min(time)). For panels with always-treated units, the
>>> # per-component breakdown diverges by convention (Python remaps
>>> # to U per paper footnote 11; R emits `Later vs Always Treated`);
>>> # see REGISTRY note on R parity convention divergence. Validated
>>> # via tests/test_methodology_bacon.py::TestBaconParityR.
>>> results = bacon_decompose(
...     data=panel_df,
...     outcome='earnings',
...     unit='state',
...     time='year',
...     first_treat='treatment_year'
... )
>>>
>>> # Opt-in: simplified-variance fast path for diagnostic loops
>>> # (numerical output may differ from R; sum-to-1 still holds).
>>> results_approx = bacon_decompose(
...     data=panel_df,
...     outcome='earnings',
...     unit='state',
...     time='year',
...     first_treat='treatment_year',
...     weights='approximate'
... )
>>>
>>> # View summary
>>> results.print_summary()
>>>
>>> # Check weight on forbidden comparisons
>>> print(f"Forbidden weight: {results.total_weight_later_vs_earlier:.1%}")
>>>
>>> # Convert to DataFrame for analysis
>>> df = results.to_dataframe()
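The survey_design caveat above requires survey columns to be constant within each unit when weights="exact" is used. Before fitting, a check along the following lines can reveal which units would violate that requirement. This is a hypothetical pre-check written for illustration, not the library's _validate_unit_constant_survey routine; the function name, column names, and data are all made up.

```python
def units_with_varying_values(rows, unit_key, column):
    """Return the sorted units whose `column` value changes across rows."""
    first_seen = {}
    varying = set()
    for row in rows:
        unit, value = row[unit_key], row[column]
        if unit in first_seen and first_seen[unit] != value:
            varying.add(unit)
        first_seen.setdefault(unit, value)
    return sorted(varying)


# Made-up long-format panel: one dict per unit-period observation.
panel = [
    {"state": "A", "year": 2000, "svy_weight": 1.5},
    {"state": "A", "year": 2001, "svy_weight": 1.5},
    {"state": "B", "year": 2000, "svy_weight": 2.0},
    {"state": "B", "year": 2001, "svy_weight": 2.4},
]

varying = units_with_varying_values(panel, "state", "svy_weight")
print("units with time-varying survey weights:", varying)
```

If the returned list is non-empty, the two options described in the caveat apply: make the column unit-constant, or pass weights="approximate" explicitly.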

See also

BaconDecomposition

Class-based interface with more options

plot_bacon

Visualize the decomposition

CallawaySantAnna

Robust estimator that avoids forbidden comparisons

Example Usage#

Basic usage:

from diff_diff import BaconDecomposition, generate_staggered_data

data = generate_staggered_data(n_units=200, n_periods=12,
                                cohort_periods=[4, 6, 8], seed=42)

bacon = BaconDecomposition()
results = bacon.fit(data, outcome='outcome', unit='unit',
                    time='period', first_treat='first_treat')
results.print_summary()

Visualizing with plot_bacon:

from diff_diff import plot_bacon

# Scatter plot of 2x2 estimates vs weights, colored by comparison type
ax = plot_bacon(results)
ax.figure.show()

Interpreting the decomposition:

# Convert to DataFrame for detailed inspection
df = results.to_dataframe()
print(df[['treated_group', 'control_group', 'comparison_type',
          'estimate', 'weight']])

# Check weight breakdown by comparison type
weights = results.weight_by_type()
print(f"Treated vs Never-treated: {weights['treated_vs_never']:.1%}")
print(f"Earlier vs Later:         {weights['earlier_vs_later']:.1%}")
print(f"Later vs Earlier:         {weights['later_vs_earlier']:.1%}")

# Compare weighted average effects across comparison types
effects = results.effect_by_type()
for comp_type, effect in effects.items():
    if effect is not None:
        print(f"  {comp_type}: {effect:.4f}")

Using exact weights (the default, shown explicitly here) for publication-quality results:

bacon = BaconDecomposition(weights='exact')
results = bacon.fit(data, outcome='outcome', unit='unit',
                    time='period', first_treat='first_treat')

# Verify the weighted sum closely matches the TWFE estimate
print(f"TWFE estimate:       {results.twfe_estimate:.4f}")
print(f"Decomposition error: {results.decomposition_error:.6f}")

When Is TWFE Reliable?#

The Bacon decomposition helps answer whether a standard TWFE regression is adequate for a particular dataset. As a rule of thumb:

  • TWFE is likely reliable when the weight on “later vs earlier” (forbidden) comparisons is small, or when 2x2 estimates are similar across all comparison types. This suggests treatment effect heterogeneity is not meaningfully biasing the TWFE estimate.

  • TWFE may be unreliable when forbidden comparisons carry substantial weight and their estimates differ markedly from the clean comparisons. In this case, consider using a robust staggered estimator such as CallawaySantAnna, SunAbraham, or StackedDiD.
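The rule of thumb above can be turned into a small screening function that operates on dictionaries shaped like the outputs of weight_by_type() and effect_by_type(). This is a sketch only: the function name `twfe_looks_reliable` is hypothetical, and both thresholds are arbitrary placeholders chosen for illustration, since the document does not prescribe numeric cut-offs.

```python
def twfe_looks_reliable(weights, effects,
                        max_forbidden_weight=0.10,
                        max_effect_gap=0.25):
    """Apply the rule of thumb: small forbidden weight OR similar estimates.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    forbidden_weight = weights.get("later_vs_earlier", 0.0)
    if forbidden_weight <= max_forbidden_weight:
        return True
    # Otherwise require the per-type average effects to be close together.
    available = [e for e in effects.values() if e is not None]
    if len(available) < 2:
        return False  # nothing to compare against: stay conservative
    return max(available) - min(available) <= max_effect_gap


# Made-up inputs shaped like weight_by_type() / effect_by_type() results.
weights = {"treated_vs_never": 0.5, "earlier_vs_later": 0.3,
           "later_vs_earlier": 0.2}
effects = {"treated_vs_never": 2.0, "earlier_vs_later": 1.9,
           "later_vs_earlier": 0.5}

verdict = twfe_looks_reliable(weights, effects)
print("TWFE looks reliable:", verdict)
```

A False verdict is a prompt to inspect the individual comparisons and to consider a robust staggered estimator; it is not a formal test.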