Interactive notebook

This tutorial is a Jupyter notebook. You can view it on GitHub or download it to run locally.

Spillover-aware DiD with SpilloverDiD — a TVA-style worked example#

Use this notebook when: your treatment plausibly affects nearby untreated units through geographic externalities, and your standard DiD specification treats those neighbors as part of the control group.

Place-based interventions (industrial policy, public-works investment, vaccination campaigns, retail rollouts, technology subsidies) routinely fit this pattern. The canonical reference is Kline & Moretti (2014)’s study of the Tennessee Valley Authority (TVA): treated counties received federal investment, but adjacent untreated counties absorbed positive spillovers via labor flows, supply chains, and population movement. Naive DiD that lumps adjacent counties into the control group then underestimates the direct effect on treated counties, because the controls are contaminated by spillover.

Butts (2021) formalizes this with a single-stage regression that explicitly models per-distance-band spillover effects on near-controls while keeping far-controls as the clean comparison group:

\[y_{it} = \mu_i + \lambda_t + \tau\,D_{it} + \sum_j \delta_j\,(1 - D_{it})\,\text{Ring}_{it,j} + \varepsilon_{it}\]

The diff-diff library’s SpilloverDiD estimator (Butts 2021 + Gardner 2022 two-stage) implements this on panels with one or more spatial rings, recovers the direct effect \(\tau\) and the spillover coefficients \(\delta_j\), and supports Conley (1999) spatial-HAC standard errors out of the box.

This tutorial walks through the practitioner workflow:

  1. Build a synthetic TVA-style panel with known \(\tau\) and \(\delta_1\).

  2. Fit a naive multi-period TWFE and observe the bias.

  3. Choose a spillover bandwidth via theory + a sensitivity grid.

  4. Fit SpilloverDiD; verify recovery of \(\tau\) and \(\delta_1\).

  5. Add Conley spatial-HAC variance for robust inference.

References: Butts (2021) — arXiv:2105.03737; Kline & Moretti

  1. QJE 129(1); Conley (1999) — Journal of Econometrics 92(1).

2. The synthetic panel#

We build a 4-period, 200-unit panel laid out as three geographic bands:

Band

Count

Location

Treatment

Treated

25

clustered around (0,0); max ~12 km from origin, cluster diameter ~22 km at seed 23

Yes, from period 3

Near-control

120

~12-82 km north (0.1°-0.7° latitude)

No (but absorbs spillover)

Far-control

55

~224-331 km north (2.0°-3.0° latitude)

No (clean control)

True parameters: \(\tau_{\text{total}} = -7.4\) (the direct effect on treated, matching the Butts §4 agriculture-employment magnitude), \(\delta_1 = -4.5\) (the per-period spillover on near-controls), and a locked random seed of 23. The data-generating equation is

\[y_{it} = \alpha_i + \lambda_t + \tau_{\text{total}}\,D_{it} + \delta_1\,(1 - D_{it})\,\text{Ring}_{it,1} + \mathcal{N}(0, 0.5)\]

We tune the band sizes and \(\delta_1\) so the naive TWFE coefficient lands at ~58% of \(\tau_{\text{total}}\) — i.e., the ~40% understatement direction documented in Butts (2021) §4 Table 1 Panel A.

[ ]:
import warnings

import numpy as np
import pandas as pd

from diff_diff import MultiPeriodDiD, SpilloverDiD

MAIN_SEED = 23
N_TREATED = 25
N_NEAR = 120
N_FAR = 55
T_PERIODS = 4
FIRST_TREAT = 3
TAU_TOTAL = -7.4
DELTA_1 = -4.5
D_BAR_KM = 100.0
NOISE_SD = 0.5


def build_t23_panel(seed: int = MAIN_SEED) -> pd.DataFrame:
    """4-period TVA-style synthetic panel with known direct + spillover.

    Layout: treated cluster + near-control band (within D_BAR_KM) + far-control
    band (clean). Outcomes: y_it = alpha_i + lambda_t + TAU_TOTAL * D_it +
    DELTA_1 * Ring1_it * (1 - D_it) + N(0, NOISE_SD).
    """
    rng = np.random.default_rng(seed)
    n_units = N_TREATED + N_NEAR + N_FAR
    units = [f"u{i:04d}" for i in range(n_units)]
    alpha = rng.normal(0.0, 1.0, size=n_units)
    lambda_t = np.array([0.0, 0.5, 1.0, 1.5])[:T_PERIODS]

    coords = np.empty((n_units, 2))
    is_treated_unit = np.zeros(n_units, dtype=bool)
    is_near_unit = np.zeros(n_units, dtype=bool)
    for i in range(N_TREATED):
        coords[i] = (rng.normal(0, 0.05), rng.normal(0, 0.05))
        is_treated_unit[i] = True
    for i in range(N_TREATED, N_TREATED + N_NEAR):
        coords[i] = (rng.uniform(0.1, 0.7), rng.uniform(-0.3, 0.3))
        is_near_unit[i] = True
    for i in range(N_TREATED + N_NEAR, n_units):
        coords[i] = (rng.uniform(2.0, 3.0), rng.uniform(-0.5, 0.5))

    rows = []
    for i, u in enumerate(units):
        for t in range(1, T_PERIODS + 1):
            D_it = int(is_treated_unit[i] and t >= FIRST_TREAT)
            Ring1_it = int(is_near_unit[i] and t >= FIRST_TREAT)
            y = (
                alpha[i]
                + lambda_t[t - 1]
                + TAU_TOTAL * D_it
                + DELTA_1 * Ring1_it * (1 - D_it)
                + rng.normal(0, NOISE_SD)
            )
            rows.append(
                {
                    "unit": u,
                    "time": t,
                    "lat": coords[i, 0],
                    "lon": coords[i, 1],
                    "ever_treated": int(is_treated_unit[i]),
                    "D": D_it,
                    "y": y,
                }
            )
    return pd.DataFrame(rows)


df = build_t23_panel()
print(f"Panel: {len(df)} rows, {df['unit'].nunique()} units, T={df['time'].nunique()}")
df.head()

[ ]:
# Quick geographic check: treated cluster, near-control band, far-control band
band_counts = (
    df.drop_duplicates("unit")
    .assign(
        band=lambda d: np.select(
            [d["ever_treated"] == 1, d["lat"] <= 1.0],
            ["treated", "near"],
            default="far",
        )
    )
    .groupby("band")
    .size()
    .reindex(["treated", "near", "far"])
)
print(band_counts)

3. The naive headline — multi-period TWFE on the full sample#

A practitioner reaching for a default DiD specification fits a multi-period TWFE on the full panel, treating ALL non-treated-cluster units as controls — both the near-controls (which absorb spillover) and the far-controls (which don’t).

[ ]:
naive = MultiPeriodDiD()
with warnings.catch_warnings():
    # absorb=['unit'] makes the unit-invariant 'ever_treated' indicator
    # perfectly collinear with unit FE; MultiPeriodDiD drops it (with a
    # UserWarning) and identifies the ATT through the ever_treated × post
    # interaction columns. This is the expected TWFE specification.
    warnings.filterwarnings("ignore", category=UserWarning, message="Rank-deficient design matrix")
    naive_res = naive.fit(
        df,
        outcome="y",
        treatment="ever_treated",
        time="time",
        post_periods=[3, 4],
        unit="unit",
        absorb=["unit"],
        reference_period=2,  # explicit pre-period; matches the current MPD default
    )

print(f"Naive TWFE ATT: {naive_res.att:.4f}  (true tau_total = {TAU_TOTAL})")
print(f"  SE:           {naive_res.se:.4f}")
print(f"  Ratio to true: {naive_res.att / TAU_TOTAL:.3f}")
print(f"  Understatement: {(1 - naive_res.att / TAU_TOTAL) * 100:.1f}%")

The naive estimate is roughly -4.29, about 58% of the true \(\tau_{\text{total}} = -7.4\) — a ~42% understatement that matches the Butts (2021) §4 agriculture direction. The mechanism is straightforward: with \(D_{i,t} = 1\) only for the treated cluster, the near-control band contributes negative outcomes in the post-period (via \(\delta_1 = -4.5\)) into the control arm. The TWFE within-comparison then sees a smaller pre/post gap on the treatment side relative to the contaminated control side, and reports a deflated direct effect.

This is exactly the bias SpilloverDiD is designed to remove.

4. Choosing the spillover bandwidth#

SpilloverDiD is parameterized by rings=[r_0, r_1, ..., r_K] where the \(j\)-th ring covers distance band \([r_{j-1}, r_j]\) km. The outermost ring edge max(rings) defines the spillover horizon (d_bar); units outside that horizon are treated as the clean control group. For a single near-control band, the simplest spec is rings=[0.0, d_bar] — one ring, one \(\delta_1\) coefficient.

Choosing d_bar is a design decision driven by (i) theory about the plausible spillover range and (ii) a sensitivity check that you have not under- or over-stated it. Below we vary the outer edge across 50, 100, 150, 200 km and report the recovered \(\tau\) and \(\delta_1\) at each.

API note: SpilloverDiD accepts an optional explicit d_bar kwarg, but if set it must equal max(rings). We omit it and let the default (d_bar = max(rings)) apply.

[ ]:
sensitivity = []
for outer in (50.0, 100.0, 150.0, 200.0):
    est = SpilloverDiD(rings=[0.0, outer], conley_coords=("lat", "lon"))
    with warnings.catch_warnings():
        # Narrow filter for the Apple Silicon M4 + numpy<2.3 Accelerate
        # BLAS RuntimeWarnings ("divide by zero" / "overflow" / "invalid"
        # in matmul) documented at TODO.md "RuntimeWarnings in Linear
        # Algebra Operations". These do NOT fire on M3 / Intel / Linux
        # or numpy>=2.3, so this filter is a no-op there. Values are
        # recovered correctly on every platform.
        warnings.filterwarnings(
            "ignore", category=RuntimeWarning, message=".*encountered in matmul"
        )
        res = est.fit(df, outcome="y", unit="unit", time="time", treatment="D")
    delta1 = float(res.spillover_effects.iloc[0]["coef"])
    sensitivity.append({"d_bar_km": outer, "tau_total": res.att, "delta_1": delta1})

sens_df = pd.DataFrame(sensitivity)
print(sens_df.to_string(index=False, float_format=lambda x: f"{x:>8.4f}"))

At d_bar = 50 km the ring is too narrow: near-controls in the 50-78 km band ARE exposed in the DGP (they’re within the true 100 km spillover horizon and carry \(\delta_1 = -4.5\)), but the ring spec misclassifies them as far-away clean controls. Both estimates suffer from the misspecification — \(\tau\) deflates to ~-5.4 because the “clean control” arm now contains genuinely-affected units, and the spillover coefficient \(\delta_1\) attenuates to ~-2.6 because the \(S = 1\) ring averages 50-78 km exposure into the cleaner \(S = 0\) comparison. This is the registry-documented failure mode for undershooting d_bar. (The exact values are quoted to one decimal here rather than two because the d_bar = 50 design is borderline- rank-deficient — the two-decimal value can shift slightly across BLAS paths even at the same locked seed.)

At d_bar = 100 km the ring covers the entire DGP near-control band (0.1°-0.7° latitude ≈ 11-78 km). \(\tau\) recovers to -7.34 and \(\delta_1\) to -4.53 — both within 1% of the truth.

At d_bar = 150 km and d_bar = 200 km, the estimates are identical to d_bar = 100. This is by DGP design: there are no units in the 80-200 km band, so widening the ring adds zero observations and zero new information. Once d_bar covers the true spillover horizon, further widening is benign on this panel.

Verdict: d_bar = 100 km is the right choice here, balancing “capture all spillover” against “don’t sweep clean controls into the spillover bin”. On real data, a similar sensitivity grid plus paper / theory guidance is the practitioner workflow Butts (2021) §3.1 recommends.

5. Fit SpilloverDiD and interpret#

With d_bar = 100, the headline fit recovers both the direct effect and the spillover. We compare side-by-side with the naive TWFE from §3.

[ ]:
est = SpilloverDiD(rings=[0.0, D_BAR_KM], conley_coords=("lat", "lon"))
with warnings.catch_warnings():
    # See §4 for the rationale on this narrow matmul-RuntimeWarning filter.
    warnings.filterwarnings(
        "ignore", category=RuntimeWarning, message=".*encountered in matmul"
    )
    spill = est.fit(df, outcome="y", unit="unit", time="time", treatment="D")

delta_1_row = spill.spillover_effects.iloc[0]

comparison = pd.DataFrame(
    {
        "estimate": [naive_res.att, spill.att, delta_1_row["coef"]],
        "se": [naive_res.se, spill.se, delta_1_row["se"]],
        "true value": [TAU_TOTAL, TAU_TOTAL, DELTA_1],
    },
    index=["naive TWFE (direct, biased)", "SpilloverDiD tau_total", "SpilloverDiD delta_1"],
)
print(comparison.to_string(float_format=lambda x: f"{x:>8.4f}"))

With the spillover term in the regression, SpilloverDiD cleanly separates the direct effect (-7.34, very close to true -7.4) from the spillover on near-controls (-4.53, very close to true -4.5). The spillover-aware estimator removes the bias the naive TWFE suffered.

The interpretation is straightforward: every period in the post window, treated units lose 7.4 units of \(y\) on average from the policy, AND every near-control unit (within 100 km of any treated unit) loses 4.5 units of \(y\) from the spillover. A standard DiD that ignored the spillover would report only the contaminated -4.3 and miss the spillover-on-controls entirely.

6. Robust inference with Conley spatial-HAC#

The HC1 standard error in §5 treats each observation’s score as independent of every other. On spatially-correlated panels — and this is the canonical case for place-based interventions — that independence assumption is violated. Conley (1999) provides a spatial-HAC variance estimator that downweights pairs by their geographic distance through a Bartlett kernel out to a conley_cutoff_km bandwidth. Newey-West (1987) provides the analogous serial term via conley_lag_cutoff.

Per Butts (2021) §3.1, the Conley bandwidth is typically chosen to match the spillover horizon d_bar. Below we compare the §5 HC1 SE against Conley with conley_cutoff_km = d_bar = 100, both with conley_lag_cutoff = 0 (spatial only) and conley_lag_cutoff = 1 (add within-unit serial correlation across consecutive periods).

[ ]:
def fit_with(vcov_type, lag=None):
    kwargs = dict(rings=[0.0, D_BAR_KM], conley_coords=("lat", "lon"), vcov_type=vcov_type)
    if vcov_type == "conley":
        kwargs["conley_cutoff_km"] = D_BAR_KM
        kwargs["conley_lag_cutoff"] = lag
    est = SpilloverDiD(**kwargs)
    with warnings.catch_warnings():
        # See §4 for the rationale on this narrow matmul-RuntimeWarning filter.
        warnings.filterwarnings(
            "ignore", category=RuntimeWarning, message=".*encountered in matmul"
        )
        return est.fit(df, outcome="y", unit="unit", time="time", treatment="D")


res_hc1 = fit_with("hc1")
res_c0 = fit_with("conley", lag=0)
res_c1 = fit_with("conley", lag=1)

se_table = pd.DataFrame(
    {
        "tau_total": [res_hc1.att, res_c0.att, res_c1.att],
        "SE": [res_hc1.se, res_c0.se, res_c1.se],
    },
    index=["HC1", "Conley (lag=0, spatial only)", "Conley (lag=1, spatial + serial)"],
)
print(se_table.to_string(float_format=lambda x: f"{x:>8.4f}"))

Point estimates are identical across all three rows — the variance choice doesn’t move the coefficients. But the standard errors differ: on this DGP, the Conley spatial-HAC SE comes in lower than HC1.

The mechanism is the pairwise covariance structure of the influence-function score matrix. With conley_cutoff_km = 100, the spatial term in the no-survey Conley meat sums Bartlett-weighted score cross-products over every within-period observation pair whose units lie within 100 km of each other (support is pairwise, not relative to the treated cluster — no survey-design PSU aggregation here, the no-survey path operates on the observation-level Wave D Gardner GMM-corrected influence rows \(\psi_i\)). On this DGP that means non-zero contribution from:

  • treated × treated pairs (cluster diameter ~22 km at seed 23),

  • treated × near-control pairs (max pairwise separation ~90 km),

  • near × near pairs (100% of within-band pairs are within 100 km), and

  • far × far pairs within the far-control band (max within-band pairwise distance is ~131 km, but ~95% of within-band pair distances are within 100 km of each other — median far/far pairwise distance is ~56 km).

Cross-band pairs at long distance — treated × far and near × far — sit outside the 100 km support and contribute zero kernel weight. The Conley meat is the sum of the observation-level diagonal self-product terms \(\psi_i \psi_i'\) PLUS the Bartlett-weighted off-diagonal score cross-products \(K(d_{ij}/h)\,\psi_i \psi_j'\) from all the within-support pair classes enumerated above; the ATT variance is then the corresponding sandwich entry. (Note: the no-survey Conley path does NOT apply an HC1 n/(n-p) finite-sample multiplier — that lives on the HC1 path only. Whether the resulting sandwich ends up above or below HC1 depends on both the missing multiplier AND the net sign and magnitude of the weighted off-diagonal cross-products under the specific DGP and design.) On this seed they net out to a smaller ATT variance, so Conley < HC1, but the direction is a feature of the data and the sign mechanism is not easily attributable to any one pair class. Adding the serial term (lag = 1) further reshapes the SE through a separate within-unit cross-period score covariance sum.

The takeaway: Conley vcov is a correction. It can move the SE in either direction relative to HC1 depending on the residual covariance structure within the kernel’s support, but it’s the methodologically correct choice when scores are spatially or serially correlated. On real applications the correction often inflates the SE (and the t-stat shrinks), but the direction is a feature of the data, not a guarantee. See the ConleySpatialHAC and SpilloverDiD sections of docs/methodology/REGISTRY.md for the full variance formula.

7. Practitioner takeaways and where to go next#

When to reach for ``SpilloverDiD``. Whenever your treatment has a plausible geographic externality — labor-market spillovers, agglomeration effects, disease transmission, supply-chain ripples, peer effects in retail or technology adoption — a naive DiD that treats nearby units as controls will give you a biased estimate of the direct effect. The magnitude of the bias is roughly \((\text{near share}) \times \delta_1\) in absolute value (here: \((120/175) \times 4.5 \approx 3.1\), matching the observed -4.3 vs -7.4 gap).

How to choose ``d_bar``. Start from theory: what’s the plausible spatial range of the spillover mechanism? Triangulate with a sensitivity grid (§4 above). The estimate stabilizes once d_bar covers the true horizon; further widening is benign unless it sweeps clean controls into the spillover bin.

When Conley matters. Whenever your residuals have spatial or serial correlation. On clustered-treatment designs (treatment radius \(\ll\) panel extent) the spatial-HAC adjustment can move the SE in either direction relative to HC1; the direction is a feature of the DGP. Per Butts (2021) §3.1, set conley_cutoff_km = d_bar as the default and run a sensitivity grid on the cutoff if the SE matters for your inference.

Extensions not covered here.

  • Multiple rings — pass rings=[0, 50, 100] for separate near-vs-mid spillover coefficients \(\delta_1, \delta_2\) when theory or diagnostics suggest distance-graded externalities.

  • Staggered treatmentSpilloverDiD automatically handles unit-specific treatment onsets via the first_treat column.

  • Event study — pass event_study=True and horizon_max for per-event-time direct and spillover effects (Wave C; see tutorials/12_two_stage_did.ipynb for the TwoStageDiD event-study pattern this mirrors).

  • Survey weightssurvey_design=SurveyDesign(...) supports Hájek-weighted point estimates with Binder TSL variance. Combined with vcov_type="conley" and conley_lag_cutoff=0 (cross- sectional spatial only) this is the Wave E.2 stratified-Conley sandwich on PSU totals. The panel-block extension conley_lag_cutoff > 0 (spatial + within-PSU serial Bartlett HAC) is the Wave E.2 follow-up library synthesis, and it requires an effective PSU on the survey design (either explicit survey_design.psu= or cluster=<col> injected per Wave E.1’s warn-and-use-PSU pattern). A dedicated SpilloverDiD survey-design tutorial is on the roadmap.

References:

  • Butts, K. (2021). Difference-in-Differences with Spatial Spillovers. arXiv:2105.03737.

  • Kline, P., & Moretti, E. (2014). Local Economic Development, Agglomeration Economies, and the Big Push: 100 Years of Evidence from the Tennessee Valley Authority. Quarterly Journal of Economics 129(1), 275-331.

  • Conley, T. G. (1999). GMM Estimation with Cross-Sectional Dependence. Journal of Econometrics 92(1), 1-45.

  • Newey, W. K., & West, K. D. (1987). A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica 55(3), 703-708.

  • Gerber, I. (2026). Design-Based Standard Errors for Two-Stage Difference-in-Differences. arXiv:2605.04124.