Local-Linear Infrastructure#
Kernels, kernel-moment constants, univariate local-linear regression at a boundary, the MSE-optimal bandwidth selector, and the bias-corrected local-linear fit.
This module ships the nonparametric building blocks that
HeterogeneousAdoptionDiD composes for its continuous-dose
fit paths (continuous_at_zero and continuous_near_d_lower); see
Heterogeneous Adoption Difference-in-Differences for the estimator that consumes them.
Two scope tiers are exposed:
Generic helpers (
local_linear_fit, the three kernels,kernel_moments,KERNELS) - usable on their own for any one-sided boundary local-linear regression problem with a strictly nonnegative running variable.HAD-scoped public wrappers (
mse_optimal_bandwidth,bias_corrected_local_linear) - the configuration is hard-coded to HAD Phase 1b/1c (p=1,deriv=0,interior=False,vce="nn") and only those settings are parity-tested against Rnprobust. Other settings (hc0/hc1/hc2/hc3variance, interior evaluation, higher polynomial orders) require dropping to the privatediff_diff._nprobust_portmodule and accepting that parity has not been verified outside the HAD configuration. Seedocs/methodology/REGISTRY.mdHeterogeneousAdoptionDiDsection for the full Phase 1b/1c contract.
The selector and bias correction are ports of the Calonico-Cattaneo-Farrell
(2018) plug-in bandwidth and Calonico-Cattaneo-Titiunik (2014) robust-bias
correction from the R nprobust package; methodology context lives in
Heterogeneous Adoption Difference-in-Differences and docs/methodology/REGISTRY.md.
Kernels#
Bounded one-sided kernels on [0, 1] for boundary-point nonparametric
regression. The library normalizes the Epanechnikov and triangular kernels
to int_0^1 k(u) du = 1/2; the uniform kernel uses
int_0^1 k(u) du = 1.
- diff_diff.epanechnikov_kernel(u)[source]#
Epanechnikov kernel on
[0, 1].k(u) = (3/4)(1 - u^2)foru in [0, 1], zero elsewhere.- Parameters:
u (np.ndarray) – Points on the scaled domain
u = (d - d0) / h.- Returns:
Kernel values, same shape as
u.- Return type:
np.ndarray
- diff_diff.triangular_kernel(u)[source]#
Triangular kernel on
[0, 1].k(u) = 1 - uforu in [0, 1], zero elsewhere.Using the convention
int_0^1 k(u) du = 1/2to match Epanechnikov’s one-sided normalization.
- diff_diff.uniform_kernel(u)[source]#
Uniform (rectangular) kernel on
[0, 1].k(u) = 1foru in [0, 1], zero elsewhere.Already normalized to
int_0^1 k(u) du = 1(no factor of 1/2 like the other two).
- diff_diff.KERNELS: dict[str, Callable[[np.ndarray], np.ndarray]]#
Mapping from kernel name (
"epanechnikov"/"triangular"/"uniform") to its callable evaluator on[0, 1]. Pass the name string (not the callable) tolocal_linear_fitandmse_optimal_bandwidthvia theirkernel=argument.
- diff_diff.kernel_moments(kernel='epanechnikov')[source]#
Return kernel-moment constants used in boundary local-linear asymptotics.
The returned dict contains five raw moments
kappa_kfork in {0, 1, 2, 3, 4}, plus two derived constants:"C": the paper’s boundary-kernel constant used in the asymptotic bias termh^2 * C * m''(0). Per de Chaisemartin, Ciccia, D’Haultfoeuille & Knau (2026, Section 3.1.3),C = (kappa_2^2 - kappa_1 * kappa_3) / (kappa_0 * kappa_2 - kappa_1^2)."kstar_L2_norm": the asymptotic-variance constantint_0^1 k*(t)^2 dtwherek*(t) = (kappa_2 - kappa_1 * t) / (kappa_0 * kappa_2 - kappa_1^2) * k(t)is the equivalent kernel for local-linear at a boundary. Computed by numerical integration.
- Parameters:
kernel (str) – One of
"epanechnikov","triangular","uniform".- Returns:
dict of {str – Keys
kappa_0,kappa_1,kappa_2,kappa_3,kappa_4,C,kstar_L2_norm.- Return type:
float}
- Raises:
ValueError – If
kernelis not a recognized name.
Boundary local-linear fit#
Kernel-weighted OLS estimator of the conditional mean
m(d0) := E[Y | D = d0] at the boundary of D’s support.
- diff_diff.local_linear_fit(d, y, bandwidth, boundary=0.0, kernel='epanechnikov', weights=None)[source]#
Local-linear regression of
yondat a boundary.Fits
y ~ a + b * (d - boundary)using kernel weightsk((d - boundary) / h)on observations withd in [boundary, boundary + h]. Returns the intercept (the boundary estimatemu_hat) and slope.- Parameters:
d (np.ndarray, shape (n,)) – Regressor values. For the HAD application,
dis the period-2 doseD_{g,2}and the boundary is 0.y (np.ndarray, shape (n,)) – Outcome values. For the HAD application,
yis the first-differenceDelta Y_g.bandwidth (float) – Bandwidth
h > 0.boundary (float, default=0.0) – Evaluation point
d0. Observations withd < d0are excluded (one-sided boundary estimation).kernel (str, default="epanechnikov") – One of
"epanechnikov","triangular","uniform".weights (np.ndarray or None, optional) – Optional per-observation weights
w_i >= 0multiplied into the kernel weights. Useful for survey weighting; whenNone, treated as unit weights.
- Returns:
Named container with
intercept,slope,n_effective, and diagnostics needed by downstream bias-correction phases.- Return type:
- Raises:
ValueError – If
bandwidth <= 0,kernelis unknown,dandydiffer in length, or the bandwidth window retains fewer than 2 observations.
- class diff_diff.LocalLinearFit[source]
Bases:
objectResult of a local-linear regression at a boundary.
- intercept
Estimated conditional mean at the boundary,
mu_hat_h(d0).- Type:
- slope
Estimated slope of the local linear fit (coefficient on
d - d0).- Type:
- n_effective
Count of observations with strictly positive kernel weight (within
[d0, d0 + h]for the one-sided kernels shipped here).- Type:
- bandwidth
Bandwidth
hused.- Type:
- kernel
Kernel name.
- Type:
- boundary
Evaluation point
d0.- Type:
- residuals
Residuals from the weighted OLS fit, in the order of the retained observations.
- Type:
np.ndarray, shape (n_effective,)
- kernel_weights
Kernel weights
k((d_i - d0) / h). These are the pre-scaled weights; the1/hscaling cancels out of the weighted-OLS estimator (a constant factor on all weights does not change the point estimate).- Type:
np.ndarray, shape (n_effective,)
- design_matrix
Design matrix
X = [1, d_i - d0]used in the fit. Preserved for Phase 1c bias-correction machinery.- Type:
np.ndarray, shape (n_effective, 2)
- intercept: float
- slope: float
- n_effective: int
- bandwidth: float
- kernel: str
- boundary: float
- residuals: ndarray
- kernel_weights: ndarray
- design_matrix: ndarray
- __init__(intercept, slope, n_effective, bandwidth, kernel, boundary, residuals, kernel_weights, design_matrix)
MSE-optimal bandwidth selector#
Plug-in MSE-optimal bandwidth (Calonico-Cattaneo-Farrell 2018) for the
boundary local-linear fit. Returns BandwidthResult carrying the final
bandwidth h_mse plus the per-stage diagnostics needed to audit the
selector against the R nprobust reference.
- diff_diff.mse_optimal_bandwidth(d, y, boundary=0.0, kernel='epanechnikov', weights=None, bwcheck=21, bwregul=1.0, return_diagnostics=False)[source]#
MSE-optimal bandwidth for local-linear regression at a boundary.
Port of
nprobust::lpbwselect(bwselect="mse-dpi")(R packagenprobust0.5.0, SHA36e4e53). Implements the Calonico, Cattaneo, and Farrell (2018, JASA 113(522)) direct-plug-in DPI bandwidth selector for the local-linear boundary estimator used in Design 1’ of de Chaisemartin et al. (2026) (HAD).The three-stage algorithm produces five bandwidths (
c_bw,bw_mp2,bw_mp3,b_mse,h_mse); seeBandwidthResultfor full diagnostics.Public API scope (Phase 1b of HAD). This wrapper is intentionally restricted to the HAD configuration:
p=1(local-linear),deriv=0(mean regression),interior=False(boundary eval point),vce="nn"(nearest-neighbor variance), andnnmatch=3are hard-coded. The underlyingdiff_diff._nprobust_portsupports additionalvcemodes (hc0/hc1/hc2/hc3), interior evaluation, and higher polynomial orders, but those paths are NOT parity-tested againstnprobustand are deferred to Phase 1c / Phase 2. Do not rely on this public wrapper for anything outside HAD Phase 1b; use the port directly if you need the broader surface and accept that parity has not been separately verified.- Parameters:
d (np.ndarray, shape (G,)) – Regressor values (the dose
D_{g,2}in HAD).y (np.ndarray, shape (G,)) – Outcome values (the first-difference
Delta Y_gin HAD).boundary (float, default=0.0) – Evaluation point
d_0. The Phase 1b wrapper accepts only two values (within float tolerance):boundary = 0for Design 1’ orboundary = float(d.min())for Design 1 continuous-near-d_lower. Use the sample minimumd.min()(not a known theoretical lower bound of the support), because the downstream selector operates on the realized data. Any other value – includingboundary < d.min(), interior points, orboundary > d.min()– raisesValueError.kernel (str, default="epanechnikov") – One of
"epanechnikov","triangular","uniform".weights (np.ndarray or None, default=None) – Not supported in Phase 1b (raises
NotImplementedError).nprobust::lpbwselecthas no weight argument and thus no parity anchor. Weighted-data support is queued for Phase 2+.bwcheck (int or None, default=21) – If set, clip
c_bw(and all downstream bandwidths) below the distance to thebwcheck-th nearest neighbor ofboundary. Matchesnprobust::lpbwselectdefault.bwregul (float, default=1.0) – Bias-regularization scale used in Stage 3.
bwregul=0disables the regularization term;bwregul=1matchesnprobustdefault.return_diagnostics (bool, default=False) – When
False(default) return the final bandwidth as afloat. WhenTruereturn aBandwidthResultwith all five stage bandwidths plus per-stage(V, B1, B2, R)diagnostics.
- Returns:
The MSE-optimal bandwidth
h*, or (ifreturn_diagnostics=True) aBandwidthResultdataclass.- Return type:
- Raises:
ValueError – Raised on: shape mismatch between
dandy; non-finite values ind,y, orboundary; unknownkernelname;bwcheckoutside[1, len(d)];boundarythat is not approximately 0 or approximatelyd.min()(the only two supported HAD estimands in Phase 1b); or a rank-deficient / under-determined pilot fit inside the DPI port (surfaced fromqrXXinvor the per-stage count guards inlprobust_bw).NotImplementedError – Raised on:
weights=passed (no nprobust parity anchor); detected Design 1 mass-point design (d.min() > 0and modal fraction atd.min()exceeds 2%, per the paper’s Section 3.2.4 redirection to the 2SLS sample-average estimator, queued for Phase 2).
Notes
The port parity-tests at 1% relative error against R
nprobust::lpbwselect(bwselect="mse-dpi", vce="nn")on three deterministic DGPs (seebenchmarks/R/generate_nprobust_golden.R). In practice the agreement is at machine precision (0.0000% on the golden tests); the 1% tolerance is the hard gate before a deviation is considered a regression.
- class diff_diff.BandwidthResult[source]
Bases:
objectMSE-optimal bandwidth selector output plus per-stage diagnostics.
Returned by
mse_optimal_bandwidth(..., return_diagnostics=True). Mirrors the five-bandwidth + four-stage structure ofnprobust::lpbwselect.mse.dpi; seediff_diff/_nprobust_port.pyfor the source mapping.- h_mse
Final MSE-optimal bandwidth
h*for local-linear estimation atboundary. The argument to pass tolocal_linear_fit(..., bandwidth=h_mse).- Type:
- b_mse
Bias-correction bandwidth. Consumed by Phase 1c for the bias-corrected confidence interval (CCF 2018 Equation 8).
- Type:
- c_bw
Stage 1 preliminary bandwidth used as
h.Vin everylprobust.bwcall downstream:C_kernel * min(sd(d), IQR(d)/1.349) * G^{-1/5}. Kernel constants:epa=2.34,uni=1.843,tri=2.576.- Type:
- bw_mp2
Stage 2 pilot bandwidth for the
m^{(q+1)}derivative estimator.- Type:
- bw_mp3
Stage 2 pilot bandwidth for the
m^{(q+2)}derivative estimator.- Type:
- stage_d1_V, stage_d1_B1, stage_d1_B2, stage_d1_R
Variance and bias coefficients from the first Stage-2
lprobust.bwcall (orderq+1, reading the(q+1)-th derivative). Parity-checked to 1% against R.- Type:
- stage_d2_V, stage_d2_B1, stage_d2_B2, stage_d2_R
Same for the second Stage-2 call (order
q+2).- Type:
- stage_b_V, stage_b_B1, stage_b_B2, stage_b_R
Same for the Stage-3 bias-bandwidth call (order
q, nup+1).- Type:
- stage_h_V, stage_h_B1, stage_h_B2, stage_h_R
Same for the final Stage-3 main-bandwidth call (order
p, nuderiv).- Type:
- n
Sample size.
- Type:
- kernel
Kernel name as user supplied it (“epanechnikov” / “triangular” / “uniform”).
- Type:
- boundary
Evaluation point
d_0.- Type:
- h_mse: float
- b_mse: float
- c_bw: float
- bw_mp2: float
- bw_mp3: float
- stage_d1_V: float
- stage_d1_B1: float
- stage_d1_B2: float
- stage_d1_R: float
- stage_d2_V: float
- stage_d2_B1: float
- stage_d2_B2: float
- stage_d2_R: float
- stage_b_V: float
- stage_b_B1: float
- stage_b_B2: float
- stage_b_R: float
- stage_h_V: float
- stage_h_B1: float
- stage_h_B2: float
- stage_h_R: float
- n: int
- kernel: str
- boundary: float
- __init__(h_mse, b_mse, c_bw, bw_mp2, bw_mp3, stage_d1_V, stage_d1_B1, stage_d1_B2, stage_d1_R, stage_d2_V, stage_d2_B1, stage_d2_B2, stage_d2_R, stage_b_V, stage_b_B1, stage_b_B2, stage_b_R, stage_h_V, stage_h_B1, stage_h_B2, stage_h_R, n, kernel, boundary)
- Parameters:
h_mse (float)
b_mse (float)
c_bw (float)
bw_mp2 (float)
bw_mp3 (float)
stage_d1_V (float)
stage_d1_B1 (float)
stage_d1_B2 (float)
stage_d1_R (float)
stage_d2_V (float)
stage_d2_B1 (float)
stage_d2_B2 (float)
stage_d2_R (float)
stage_b_V (float)
stage_b_B1 (float)
stage_b_B2 (float)
stage_b_R (float)
stage_h_V (float)
stage_h_B1 (float)
stage_h_B2 (float)
stage_h_R (float)
n (int)
kernel (str)
boundary (float)
- Return type:
None
Bias-corrected local-linear fit#
Calonico-Cattaneo-Titiunik (2014) robust-bias-corrected boundary
estimator. Composes the bandwidth selector and the local-linear fit and
returns a BiasCorrectedFit with point estimate, robust standard error,
and 95% confidence interval.
- diff_diff.bias_corrected_local_linear(d, y, boundary=0.0, kernel='epanechnikov', h=None, b=None, alpha=0.05, vce='nn', cluster=None, nnmatch=3, weights=None, return_influence=False)[source]#
Bias-corrected local-linear fit with robust CI at a boundary.
Ports the single-eval-point path of
nprobust::lprobust(Calonico, Cattaneo, and Titiunik 2014) and produces the mu-scale outputs of Equation 8 in de Chaisemartin, Ciccia, D’Haultfoeuille, and Knau (2026). Phase 2’sHeterogeneousAdoptionDiDclass applies the beta-scale rescaling to return the final estimand.Public API scope (Phase 1c of HAD). Hard-coded for the HAD configuration:
p=1(local-linear),q=2(bias-correction order),deriv=0(level),interior=False(boundary eval point),bwcheck=21,bwregul=1, andvce="nn"(the only variance mode golden-parity-tested against R in this phase). The underlyingdiff_diff._nprobust_port.lprobustaccepts the broader surface (hc0/hc1/hc2/hc3, higherp, interior eval), but those paths are not separately parity-tested and remain private until Phase 2+ ships dedicated goldens.- Parameters:
d (np.ndarray, shape (G,)) – Regressor (dose
D_{g,2}in HAD).y (np.ndarray, shape (G,)) – Outcome (first-difference
Delta Y_gin HAD).boundary (float, default=0.0) – Evaluation point
d_0. Accepts onlyboundary ~ 0(Design 1’) orboundary ~ float(d.min())(Design 1 continuous-near-d_lower); see Phase 1b’s input contract.kernel ({"epanechnikov", "triangular", "uniform"}, default="epanechnikov")
h (float or None, default=None) – Main bandwidth.
Noneauto-selectshby callingdiff_diff._nprobust_port.lpbwselect_mse_dpidirectly with the suppliedcluster,vce, andnnmatchso the selector and the final fit use the same estimator.bis then set tohper nprobust’srho=1default; the selector’s distinctb_mseis surfaced throughBiasCorrectedFit.bandwidth_diagnosticsfor inspection but not applied. Ifhis provided andbisNone,b = hlikewise.b (float or None, default=None) – Bias-correction bandwidth. Pairs with
h(see above).bprovided withouthraisesValueError.alpha (float, default=0.05) – CI level;
0.05gives a 95% CI. Must be in(0, 1).vce ({"nn"}, default="nn") – Variance-estimation method. Only
"nn"is supported in Phase 1c; hc0/hc1/hc2/hc3 are queued for Phase 2+ pending dedicated R parity goldens. Passing anything else raisesNotImplementedError.cluster (np.ndarray or None) – Per-observation cluster IDs for cluster-robust variance. Missing (NaN) cluster IDs raise
ValueErrorrather than silently dropping rows.nnmatch (int, default=3) – Number of nearest neighbors for
vce="nn"residuals.weights (np.ndarray or None, default=None) – Per-unit non-negative weights (e.g., survey sampling weights). Forwarded to the final
lprobustfit; propagates through kernel composition, design matrices, Q.q bias correction, and variance matrices. Whenweights=np.ones(G)the output is bit-identical to the unweighted path. Known methodology gap: the auto-bandwidth MSE-optimal DPI (Phase 1b) remains unweighted in Phase 4.5; passh/bexplicitly for a weight-aware bandwidth. See REGISTRY “Weighted extension (Phase 4.5)” for the analytic derivation + parity-ceiling note.return_influence (bool)
- Return type:
- Raises:
ValueError – Shape mismatch, non-finite inputs, off-support boundary, negative doses,
alphaoutside(0, 1), unknownkernel, NaN / None cluster IDs,bsupplied withouth, negative or non-finiteweights, or a rank-deficient window.NotImplementedError –
vce != "nn"(hc0/hc1/hc2/hc3 deferred to Phase 2+ pending dedicated R parity goldens); a Design 1 mass-point sample (redirects to Phase 2’s 2SLS sample-average path per the paper’s Section 3.2.4).
Notes
Parity against
nprobust::lprobust(..., bwselect="mse-dpi")is asserted atatol=1e-12ontau_cl,tau_bc,se_cl,se_rb,ci_low, andci_highacross the three unclustered golden DGPs; DGP 1 and DGP 3 typically land closer to1e-13. The Python wrapper computes its ownz_{1-alpha/2}viascipy.stats.norm.ppfinsidesafe_inference(); R’sqnormvalue is stored in the golden JSON for audit, and the parity harness compares Python’s CI bounds to R’s pre-computed CI bounds, so any residual drift is purely the floating-point arithmetic intau.bc +/- z * se.rb, not a critical-value disagreement. Clustered DGP 4 achieves bit-parity (atol=1e-14) when cluster IDs are in first-appearance order; otherwise BLAS reduction ordering can drift toatol=1e-10.
- class diff_diff.BiasCorrectedFit[source]
Bases:
objectBias-corrected local-linear fit at a boundary (Phase 1c).
Output of
bias_corrected_local_linear(). Produces the mu-scale quantities needed by Equation 8 of de Chaisemartin, Ciccia, D’Haultfoeuille, and Knau (2026). Phase 2’sHeterogeneousAdoptionDiDclass applies the beta-scale(1/G) * sum(D_{g,2})rescaling.- estimate_classical
Classical point estimate
tau.clfromnprobust::lprobust(local-linear boundary intercept ath; no bias correction).- Type:
- estimate_bias_corrected
Bias-corrected point estimate
tau.bc = mu_hat + M_hatfrom the Calonico-Cattaneo-Titiunik (2014) combined design-matrix statistic.- Type:
- se_classical
Naive plug-in standard error.
- Type:
- se_robust
Robust standard error accounting for the additional variability introduced by the bias-correction term (CCT 2014).
- Type:
- ci_low, ci_high
Endpoints of the bias-corrected CI:
tau.bc +/- z_{1-alpha/2} * se.rb.- Type:
- alpha
CI level (
0.05gives a 95% CI).- Type:
- h, b
Main and bias-correction bandwidths actually used (post-
bwcheckfloor).- Type:
- bandwidth_source
"auto"when the wrapper called the Phase 1b DPI selector (_nprobust_port.lpbwselect_mse_dpi) internally with the caller’scluster/vce/nnmatch;"user"when the caller passed explicit bandwidths. Auto mode then enforces nprobust’srho=1default by settingb = h; the selector’s distinctb_mseis surfaced viabandwidth_diagnosticsbut not applied.- Type:
{“auto”, “user”}
- bandwidth_diagnostics
Full Phase 1b selector output when
bandwidth_source == "auto";Nonewhen the user supplied bandwidths (to avoid a redundant selector call).- Type:
BandwidthResult or None
- n_used
Observations retained in the active kernel window (
sum(ind.b)whenh <= bandsum(ind.h)whenh > b; with therho=1default the two coincide).- Type:
- n_total
Total observations passed in (before kernel filtering).
- Type:
- kernel
Kernel name as supplied by the caller.
- Type:
- boundary
Evaluation point
c.- Type:
Notes
p=1,q=2,deriv=0are hard-coded for HAD Phase 1c and are not exposed as fields. Phase 2 may surface them on the estimator-level result class if a use case materializes.- estimate_classical: float
- estimate_bias_corrected: float
- se_classical: float
- se_robust: float
- ci_low: float
- ci_high: float
- alpha: float
- h: float
- b: float
- bandwidth_source: str
- bandwidth_diagnostics: BandwidthResult | None
- n_used: int
- n_total: int
- kernel: str
- boundary: float
- influence_function: ndarray | None = None
Per-observation influence function of the BIAS-CORRECTED point estimate
tau.bc(Phase 4.5 survey composition). Aligned with the original caller-suppliedd/yordering; observations outside the active kernel window have IF=0. Populated only whenreturn_influence=True;Noneotherwise.Derived from
Q_q+res_bso the variance self-check issum(IF^2) == V_Y_bc[0, 0]under unclustered HC0 — matching the bias-corrected scale ofestimate_bias_corrected. Using the classical IF here would silently under-estimate survey SE by ignoring the CCT-2014 bias-correction variance inflation.
- __init__(estimate_classical, estimate_bias_corrected, se_classical, se_robust, ci_low, ci_high, alpha, h, b, bandwidth_source, bandwidth_diagnostics, n_used, n_total, kernel, boundary, influence_function=None)
- Parameters:
estimate_classical (float)
estimate_bias_corrected (float)
se_classical (float)
se_robust (float)
ci_low (float)
ci_high (float)
alpha (float)
h (float)
b (float)
bandwidth_source (str)
bandwidth_diagnostics (BandwidthResult | None)
n_used (int)
n_total (int)
kernel (str)
boundary (float)
influence_function (ndarray | None)
- Return type:
None