Benchmarks#
This document presents validation benchmarks comparing diff-diff against established R packages for difference-in-differences analysis. As of v2.0.0, diff-diff includes an optional Rust backend for accelerated computation.
Overview#
diff-diff is validated against the following R packages:
| diff-diff Estimator | R Package | Reference |
|---|---|---|
|  |  | Standard OLS with interaction |
|  |  | Callaway & Sant’Anna (2021) |
|  |  | Event study with treatment × period interactions |
|  |  | Arkhangelsky et al. (2021) |
Methodology#
Validation Approach#
Synthetic Data: Generate data with known true effects using generate_did_data() from diff_diff.prep
Identical Inputs: Both Python and R estimators receive the same CSV data
JSON Interchange: R scripts output JSON for comparison
Automated Comparison: Python script validates numerical equivalence
Multiple Scales: Test at small (400-1,600 obs), 1k, 5k, 10k, and 20k unit scales
Replicated Timing: 3 replications per benchmark to report mean ± std
Reproducible Seed: Benchmarks use seed 42 for data generation
Three-Way Comparison: Compare R, Python (pure NumPy/SciPy), and Python (Rust backend)
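The replicated-timing step can be sketched as below. This is a minimal illustration rather than the benchmark harness itself: `time_replications` is a hypothetical helper, and the workload passed to it stands in for an estimator fit.

```python
import statistics
import time


def time_replications(fn, replications=3):
    """Run fn() several times and return (mean, std) of wall-clock seconds."""
    durations = []
    for _ in range(replications):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    # stdev needs at least two samples; the benchmarks above use three.
    return statistics.mean(durations), statistics.stdev(durations)


# Placeholder workload standing in for an estimator fit.
mean_s, std_s = time_replications(lambda: sum(range(100_000)), replications=3)
print(f"{mean_s:.4f} ± {std_s:.4f} s")
```

With only three replications the standard deviation is a rough indicator, which is why sub-millisecond timings often report ± 0.000.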
Tolerance Thresholds#
Point estimates (ATT): Absolute difference < 1e-4 or relative < 1%
Standard errors: Relative difference < 10%
Confidence intervals: Must overlap
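The three thresholds can be expressed as one check per estimator pair. The sketch below is an assumption about how such a comparison might look, not the project's comparison script; `Estimate` and `within_tolerance` are hypothetical names, and the confidence intervals in the usage example are illustrative normal-approximation intervals built from the SyntheticDiD ATT and SE values reported in this document.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    att: float
    se: float
    ci_lower: float
    ci_upper: float


def within_tolerance(py: Estimate, r: Estimate) -> dict:
    """Apply the ATT, SE, and CI thresholds to one Python/R result pair."""
    att_abs = abs(py.att - r.att)
    att_rel = att_abs / abs(r.att) if r.att != 0 else float("inf")
    se_rel = abs(py.se - r.se) / r.se
    return {
        # ATT passes on either the absolute or the relative criterion.
        "att_ok": att_abs < 1e-4 or att_rel < 0.01,
        "se_ok": se_rel < 0.10,
        # Two intervals overlap when each starts before the other ends.
        "ci_ok": py.ci_lower <= r.ci_upper and r.ci_lower <= py.ci_upper,
    }


# Illustrative inputs: ATT ± 1.96 * SE used as stand-in intervals.
py = Estimate(att=3.840, se=0.099, ci_lower=3.646, ci_upper=4.034)
r = Estimate(att=3.840, se=0.105, ci_lower=3.634, ci_upper=4.046)
checks = within_tolerance(py, r)
print(checks)
```

The SE check here passes because a 0.006 gap on 0.105 is about 5.7%, inside the 10% threshold.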
Benchmark Results#
Summary Table#
| Estimator | ATT Diff | SE Rel Diff | CI Overlap | Status |
|---|---|---|---|---|
| BasicDiD/TWFE | < 1e-10 | 0.0% | Yes | PASS |
| MultiPeriodDiD | < 1e-11 | 0.0% | Yes | PASS |
| CallawaySantAnna | < 1e-10 | 0.0% | Yes | PASS |
| SyntheticDiD | < 1e-10 | 0.3% | Yes | PASS |
Basic DiD Results#
Data: 100 units, 4 periods, true ATT = 5.0 (small scale)
| Metric | diff-diff (Pure) | diff-diff (Rust) | R fixest | Difference |
|---|---|---|---|---|
| ATT | 5.112 | 5.112 | 5.112 | < 1e-10 |
| SE | 0.183 | 0.183 | 0.183 | 0.0% |
| Time (s) | 0.002 | 0.002 | 0.041 | 22x faster |
Validation: PASS - Results are numerically identical across all implementations.
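As background for why independent implementations can agree to machine precision here: in a saturated 2×2 design, the OLS interaction coefficient is algebraically the difference of mean differences. The sketch below uses plain NumPy on simulated data (it does not call diff-diff or fixest, and the data-generating values are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400
treated = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
y = 1.0 + 0.5 * treated + 0.3 * post + 5.0 * treated * post + rng.normal(0, 1, n)

# OLS with intercept, treated, post, and their interaction.
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
att_ols = beta[3]


def cell_mean(d, p):
    """Mean outcome in the (treated=d, post=p) cell."""
    return y[(treated == d) & (post == p)].mean()


# Difference of mean differences computed directly from the four cells.
att_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(att_ols, att_means)
```

Both numbers come from the same four cell means, so any remaining gap is floating-point rounding.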
MultiPeriodDiD Results#
Data: 200 units, 8 periods (4 pre, 4 post), true ATT = 3.0 (small scale)
| Metric | diff-diff (Pure) | diff-diff (Rust) | R fixest | Difference |
|---|---|---|---|---|
| ATT | 2.912 | 2.912 | 2.912 | < 1e-11 |
| SE | 0.158 | 0.158 | 0.158 | 0.0% |
| Period corr. | 1.000 | 1.000 | (ref) | Period max diff < 3e-11 |
| Time (s) | 0.005 | 0.035 | 0.035 | 7x faster (pure) |
Validation: PASS - Both average ATT and all period-level effects match R’s
fixest::feols(outcome ~ treated * time_f | unit) to machine precision. The
regression includes unit fixed effects (absorbed via | unit in R, within-
transformation via absorb=["unit"] in Python) and treatment × period
interactions with cluster-robust SEs.
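The equivalence between absorbing unit fixed effects and the within-transformation can be checked directly. The following is a minimal NumPy sketch on simulated data with a single regressor; `demean_by` is a hypothetical helper and nothing here calls diff-diff or fixest.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 30, 4
unit = np.repeat(np.arange(n_units), n_periods)
x = rng.normal(size=n_units * n_periods)
alpha = rng.normal(size=n_units)  # unit fixed effects
y = 2.0 * x + alpha[unit] + rng.normal(scale=0.1, size=x.size)


def demean_by(values, groups):
    """Subtract each group's mean from its members (within-transformation)."""
    sums = np.bincount(groups, weights=values)
    counts = np.bincount(groups)
    return values - (sums / counts)[groups]


# Route 1: regress demeaned y on demeaned x.
x_w, y_w = demean_by(x, unit), demean_by(y, unit)
beta_within = (x_w @ y_w) / (x_w @ x_w)

# Route 2: regress y on x plus one dummy column per unit.
dummies = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
beta_dummy = np.linalg.lstsq(np.column_stack([x, dummies]), y, rcond=None)[0][0]
print(beta_within, beta_dummy)
```

The two slopes coincide by the Frisch-Waugh-Lovell theorem; the demeaned route avoids building the dummy matrix, which matters as the number of units grows.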
Synthetic DiD Results#
Data: 50 units (40 control, 10 treated), 20 periods, true ATT = 4.0
| Metric | diff-diff (Pure) | diff-diff (Rust) | R synthdid | Difference |
|---|---|---|---|---|
| ATT | 3.840 | 3.840 | 3.840 | < 1e-10 |
| SE | 0.105 | 0.099 | 0.105 | 0.3% (pure) |
| Time (s) | 3.41 | 1.65 | 8.19 | 2.4x faster (pure) |
Validation: PASS - ATT estimates are numerically identical across all
implementations. Both diff-diff and R’s synthdid use Frank-Wolfe optimization
with two-pass sparsification and auto-computed regularization (zeta_omega,
zeta_lambda), producing identical unit and time weights. Both use
placebo-based variance estimation (Algorithm 4 from Arkhangelsky et al. 2021).
The small SE difference (0.3% at small scale, up to ~7% at larger scales) is due to Monte Carlo variance in the placebo procedure, which randomly permutes control units to construct pseudo-treated groups. Different random seeds across implementations produce slightly different placebo samples.
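The seed dependence of the placebo SE can be illustrated with a small sketch. To keep it short, a plain DiD estimator is swapped in for the synthetic DiD estimator, so only the structure of the placebo loop follows the description above; `did_estimate`, `placebo_se`, the number of pre-periods, and the simulated control panel are all assumptions for illustration.

```python
import numpy as np


def did_estimate(Y, treated_mask, n_pre):
    """Stand-in estimator: plain DiD on a units x periods outcome matrix."""
    change = Y[:, n_pre:].mean(axis=1) - Y[:, :n_pre].mean(axis=1)
    return change[treated_mask].mean() - change[~treated_mask].mean()


def placebo_se(Y_control, n_treated, n_pre, n_reps, seed):
    """Assign pseudo-treatment to random control units and take the spread."""
    rng = np.random.default_rng(seed)
    n_co = Y_control.shape[0]
    estimates = np.empty(n_reps)
    for b in range(n_reps):
        mask = np.zeros(n_co, dtype=bool)
        mask[rng.choice(n_co, size=n_treated, replace=False)] = True
        estimates[b] = did_estimate(Y_control, mask, n_pre)
    # Root mean squared deviation of the placebo estimates from their mean.
    return float(np.sqrt(np.mean((estimates - estimates.mean()) ** 2)))


# 40 control units and 20 periods as in the small-scale data; noise is simulated.
Y_control = np.random.default_rng(42).normal(size=(40, 20))
se_a = placebo_se(Y_control, n_treated=10, n_pre=15, n_reps=200, seed=1)
se_b = placebo_se(Y_control, n_treated=10, n_pre=15, n_reps=200, seed=2)
print(se_a, se_b)
```

The two calls see identical data and differ only in which control units are drawn as pseudo-treated, which is the source of the small SE gaps between implementations.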
Callaway-Sant’Anna Results#
Data: 200 units, 8 periods, 3 treatment cohorts, dynamic effects (small scale)
| Metric | diff-diff (Pure) | diff-diff (Rust) | R did | Difference |
|---|---|---|---|---|
| ATT | 2.519 | 2.519 | 2.519 | < 1e-10 |
| SE | 0.063 | 0.063 | 0.063 | 0.0% |
| Time (s) | 0.007 ± 0.000 | 0.007 ± 0.000 | 0.070 ± 0.001 | 10x faster |
Validation: PASS - Both point estimates and standard errors match R exactly.
Key findings from investigation:
Individual ATT(g,t) effects match perfectly (~1e-11 difference)
Never-treated coding: R’s did package requires first_treat=Inf for never-treated units. diff-diff accepts first_treat=0. The benchmark converts 0 to Inf for R compatibility.
Standard errors: As of v2.0.2, analytical SEs match R’s did package exactly (0.0% difference). The weight influence function (wif) formula was corrected to match R’s implementation, achieving numerical equivalence across all dataset scales.
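The never-treated recoding amounts to a single column transformation. The sketch below shows one way to do it in pandas; the example frame is hypothetical, and whether the benchmark performs the conversion on the Python or the R side is not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical panel metadata: 0 marks never-treated units (diff-diff convention).
df = pd.DataFrame({"unit": [1, 2, 3, 4], "first_treat": [0, 3, 0, 5]})

# Copy for the R side: never-treated units get first_treat = Inf.
df_r = df.copy()
df_r["first_treat"] = np.where(df["first_treat"] == 0, np.inf, df["first_treat"])
print(df_r)
```

The column becomes floating point after the conversion, since infinity has no integer representation.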
Event study per-event-time validation:
As of v2.1.0, event study SEs include the WIF adjustment matching R’s did::aggte(..., type="dynamic"). Validation targets:
Per-event-time point estimates: match R’s aggte(..., type="dynamic") to <1e-10
Per-event-time analytical SEs (bstrap=FALSE): match R with WIF included
Per-event-time bootstrap SEs (bstrap=TRUE): consistent with analytical
Simultaneous confidence bands (cband=TRUE): sup-t critical value matches R
Performance Comparison#
We benchmarked performance across multiple dataset scales with 3 replications each to provide mean ± std timing statistics. As of v2.0.0, we compare three implementations:
R: Reference implementation (fixest, did packages)
Python (Pure): diff-diff with NumPy/SciPy only (no Rust backend)
Python (Rust): diff-diff with optional Rust backend enabled
Note
v2.0.0 Rust Backend: diff-diff v2.0.0 introduces an optional Rust backend for accelerated computation. Its main target is SyntheticDiD, which uses custom Rust implementations for synthetic weight computation and simplex projection; the measured speedup over pure Python depends on scale (see the SyntheticDiD results table below). For BasicDiD and CallawaySantAnna, the Rust backend provides minimal additional speedup since these estimators primarily use OLS and variance computations that are already highly optimized in NumPy/SciPy via BLAS/LAPACK.
As of v2.5.0, pre-built wheels on macOS and Linux link platform-optimized BLAS libraries (Apple Accelerate and OpenBLAS respectively) for matrix-vector and matrix-matrix products across all Rust-accelerated code paths. Windows wheels continue to use pure Rust with no external dependencies.
Three-Way Performance Summary#
BasicDiD/TWFE Results:
| Scale | R (s) | Python Pure (s) | Python Rust (s) | Rust/R | Rust/Pure |
|---|---|---|---|---|---|
| small | 0.034 | 0.002 | 0.002 | 17x | 1.1x |
| 1k | 0.036 | 0.003 | 0.003 | 13x | 1.0x |
| 5k | 0.042 | 0.005 | 0.006 | 7x | 0.8x |
| 10k | 0.043 | 0.010 | 0.012 | 4x | 0.8x |
| 20k | 0.050 | 0.022 | 0.025 | 2x | 0.9x |
CallawaySantAnna Results:
| Scale | R (s) | Python Pure (s) | Python Rust (s) | Pure/R | Rust/Pure |
|---|---|---|---|---|---|
| small | 0.069 | 0.006 | 0.007 | 11x | 1.0x |
| 1k | 0.119 | 0.014 | 0.013 | 9x | 1.0x |
| 5k | 0.363 | 0.055 | 0.055 | 7x | 1.0x |
| 10k | 0.771 | 0.146 | 0.145 | 5x | 1.0x |
| 20k | 1.559 | 0.366 | 0.373 | 4x | 1.0x |
SyntheticDiD Results:
| Scale | R (s) | Python Pure (s) | Python Rust (s) | Pure/R | Rust/Pure |
|---|---|---|---|---|---|
| small | 8.19 | 3.41 | 1.65 | 2.4x | 2.1x |
| 1k | 111.7 | 24.0 | 76.1 | 4.7x | 0.3x |
| 5k | 524.2 | 31.7 | 307.5 | 16.5x | 0.1x |
Note
SyntheticDiD Performance: diff-diff’s pure Python backend is 2.4x to 16.5x faster than R’s synthdid package using the same Frank-Wolfe optimization algorithm. At 5k scale, R takes ~9 minutes while pure Python completes in 32 seconds. ATT estimates are numerically identical (< 1e-10 difference) since both implementations use the same Frank-Wolfe optimizer with two-pass sparsification. The Rust backend uses a Gram-accelerated Frank-Wolfe solver for time weights (reducing per-iteration cost from O(N×T0) to O(T0)) and an allocation-free solver for unit weights (1 GEMV per iteration instead of 3, zero heap allocations). In the timings above, the Rust backend is faster than pure Python at small scale (2.1x) and slower at 1k and 5k (0.3x and 0.1x).
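The speedup columns are ratios of the mean timings, with the slower implementation's time in the numerator, so a value above 1 means the first-named implementation is faster. A short sketch reproducing the SyntheticDiD ratios from the reported seconds (the dictionary layout is just for illustration):

```python
# (R, Python Pure, Python Rust) mean seconds from the SyntheticDiD table.
timings = {
    "small": (8.19, 3.41, 1.65),
    "1k": (111.7, 24.0, 76.1),
    "5k": (524.2, 31.7, 307.5),
}

ratios = {}
for scale, (r_s, pure_s, rust_s) in timings.items():
    # Pure/R: how many times faster pure Python is than R.
    # Rust/Pure: how many times faster Rust is than pure Python.
    ratios[scale] = (round(r_s / pure_s, 1), round(pure_s / rust_s, 1))
    print(f"{scale}: Pure/R = {ratios[scale][0]}x, Rust/Pure = {ratios[scale][1]}x")
```

A Rust/Pure value below 1, as at 1k and 5k, means the Rust run took longer than the pure Python run.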
Dataset Sizes#
| Scale | BasicDiD | MultiPeriodDiD | CallawaySantAnna | SyntheticDiD | Observations |
|---|---|---|---|---|---|
| small | 100 × 4 | 200 × 8 | 200 × 8 | 50 × 20 | 400 - 1,600 |
| 1k | 1,000 × 6 | 1,000 × 10 | 1,000 × 10 | 1,000 × 30 | 6,000 - 30,000 |
| 5k | 5,000 × 8 | 5,000 × 12 | 5,000 × 12 | 5,000 × 40 | 40,000 - 200,000 |
| 10k | 10,000 × 10 | 10,000 × 12 | 10,000 × 15 | 10,000 × 50 | 100,000 - 500,000 |
| 20k | 20,000 × 12 | 20,000 × 16 | 20,000 × 18 | 20,000 × 60 | 240,000 - 1,200,000 |
Key Observations#
Performance varies by estimator and scale:
BasicDiD/TWFE: 2-17x faster than R at all scales
CallawaySantAnna: 4-11x faster than R at all scales (vectorized WIF computation)
SyntheticDiD: 2.4-16.5x faster than R (pure Python), with both implementations using the same Frank-Wolfe algorithm
Rust backend benefit depends on the estimator:
SyntheticDiD: Rust provides speedup at small scale (2.1x) but is slower at larger scales due to placebo variance loop overhead
BasicDiD/CallawaySantAnna: Rust provides minimal benefit (~1x) since these estimators use OLS/variance computations already optimized in NumPy/SciPy
When to use Rust backend:
SyntheticDiD at small scale: Rust is ~2x faster than pure Python
Bootstrap inference: May help with parallelized iterations
BasicDiD/CallawaySantAnna: Optional - pure Python is equally fast
Scaling behavior: The speed advantage over R narrows with scale for BasicDiD (17x at small scale to 2x at 20k) and CallawaySantAnna (11x to 4x), and widens for SyntheticDiD pure Python (2.4x at small scale to 16.5x at 5k). CallawaySantAnna achieves exact SE accuracy (0.0% difference) while being 4-11x faster than R through vectorized NumPy operations.
No Rust required for most use cases: Users without Rust/maturin can install diff-diff and get full functionality with excellent performance. Pure Python is the fastest option for SyntheticDiD at 1k+ scales.
CallawaySantAnna accuracy and speed: As of v2.0.3, CallawaySantAnna achieves both exact numerical accuracy (0.0% SE difference from R) and a 4-11x speedup over R through vectorized weight influence function (WIF) computation using NumPy matrix operations.
Performance Optimization Details#
The performance improvements come from:
Unified linalg.py backend: Single optimized OLS/SE implementation using scipy’s gelsy LAPACK driver (QR-based, faster than SVD)
Vectorized cluster-robust SE: Eliminated O(n × clusters) loop with pandas groupby aggregation
Pre-computed data structures (CallawaySantAnna): Wide-format outcome matrix and cohort masks computed once, reused across all ATT(g,t) calculations
Vectorized bootstrap (CallawaySantAnna): Matrix operations instead of nested loops, batch weight generation
Vectorized WIF computation (CallawaySantAnna, v2.0.3): Weight influence function computation uses NumPy matrix operations instead of O(n_units × n_keepers) nested loops. The indicator matrix, if1/if2 matrices, and wif contribution are computed using broadcasting and matrix multiplication: wif_contrib = wif_matrix @ effects
Optional Rust backend (v2.0.0): PyO3-based Rust extension for compute-intensive operations (OLS, robust variance, bootstrap weights, simplex projection)
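Two of the techniques listed above, the gelsy least-squares driver and groupby-based cluster aggregation, can be sketched together. This is not the contents of linalg.py: `ols_cluster_se` is a hypothetical function, and it computes the basic sandwich estimator without the small-sample corrections that packages such as fixest apply, so its output is not expected to match the benchmark tables.

```python
import numpy as np
import pandas as pd
from scipy.linalg import lstsq


def ols_cluster_se(X, y, clusters):
    """OLS via the gelsy driver with cluster-robust (sandwich) standard errors."""
    beta, *_ = lstsq(X, y, lapack_driver="gelsy")
    resid = y - X @ beta
    # Per-observation scores x_i * u_i, summed within each cluster in one pass
    # instead of looping over clusters.
    scores = X * resid[:, None]
    cluster_scores = pd.DataFrame(scores).groupby(np.asarray(clusters)).sum().to_numpy()
    meat = cluster_scores.T @ cluster_scores
    bread = np.linalg.inv(X.T @ X)
    vcov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(vcov))


# Simulated example: 200 observations in 20 clusters.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
clusters = rng.integers(0, 20, n)
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
beta, se = ols_cluster_se(X, y, clusters)
print(beta, se)
```

The groupby sum replaces a Python-level loop over clusters with a single aggregation, which is where the time goes when the number of clusters is large.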
Why is diff-diff Fast?#
Optimized LAPACK: scipy’s gelsy driver for least squares
Vectorized operations: NumPy/pandas for matrix operations and aggregations
Efficient memory access: Pre-computed structures avoid repeated data reshaping
Pure Python overhead minimized: Hot paths use compiled NumPy/scipy routines
Optional Rust acceleration: Native code for bootstrap and optimization algorithms
Real-World Data Validation#
In addition to synthetic data benchmarks, we validate diff-diff against the
MPDTA (Minimum Wage and Teen Employment) dataset - the canonical benchmark
used in Callaway & Sant’Anna (2021) and the R did package.
MPDTA Dataset#
The MPDTA dataset contains county-level teen employment data with staggered minimum wage policy changes:
500 counties across 5 years (2003-2007)
2,500 observations total
4 treatment cohorts: Never-treated (309), 2004 (20), 2006 (40), 2007 (131)
Outcome: Log teen employment (lemp)
Source: Built into R’s did package
Results Comparison#
| Metric | diff-diff | R did | Difference |
|---|---|---|---|
| ATT | -0.039951 | -0.039951 | 0 (exact match) |
| SE (analytical) | 0.0117 | 0.0118 | < 1% |
| Time (10 reps) | 0.003s ± 0.000s | 0.039s ± 0.006s | 14.4x faster |
Key Findings:
Point estimates match exactly: The overall ATT of -0.039951 is identical between diff-diff and R’s did package, validating the core estimation logic.
Standard errors agree to within 1%: As of v2.0.2, analytical SEs use the corrected weight influence function formula; the values in the table above are 0.0117 (diff-diff) and 0.0118 (R did).
Performance: diff-diff is ~14x faster than R on this real-world dataset at small scale. Performance scales differently at larger sizes (see performance tables above).
This validation on real-world data with known published results confirms that diff-diff produces correct estimates that match the reference R implementation.
Survey Real-Data Validation#
In addition to synthetic-data survey cross-validation (see
test_survey_r_crossvalidation.py), diff-diff’s survey variance is validated
against R’s survey package using three real federal survey datasets. All
comparisons match to machine precision (differences below 1e-9; the largest observed gap is 3.8e-10).
Datasets:
| Dataset | Source | Size | Survey Design | Policy Context |
|---|---|---|---|---|
| API (apistrat) | R | 200 schools | Strata + FPC + weights | California school accountability (PSAA 1999) |
| NHANES | CDC/NCHS | 2,946 adults | Strata + PSU + weights (nest) | ACA young adult coverage provision (2010) |
| RECS 2020 | U.S. EIA | 2,000 households | 60 JK1 replicate weights | Residential energy consumption survey |
Suite A — API Dataset (TSL Variance):
| Test | Design Variant | ATT Gap | SE Gap | df |
|---|---|---|---|---|
| A1 | Strata + FPC + weights | 2.1e-12 | 3.2e-11 (0.0000%) | Exact |
| A2 | Strata + weights (no FPC) | 2.1e-12 | 5.3e-11 (0.0000%) | Exact |
| A3 | Weights only | 2.1e-12 | 2.7e-11 (0.0000%) | Exact |
| A4 | TWFE (strata + FPC + weights) | 2.1e-12 (ATT only) | n/a (TWFE absorbs unit FE) | Exact |
| A5 | Subpopulation (elementary) | 1.5e-11 | 7.5e-12 (0.0000%) | Differs (see note) |
| A6 | Covariates (meals, ell) | 2.2e-12 | 8.4e-12 (0.0000%) | Exact |
| A7 | Fay’s BRR replicates (rho=0.3) | 2.1e-12 | 7.7e-11 (0.0000%) | Exact |
Suite B — NHANES (TSL with Strata + PSU + nest=TRUE):
| Test | Design Variant | ATT Gap | SE Gap | df |
|---|---|---|---|---|
| B1 | Strata + PSU + weights | 4.6e-13 | 2.3e-14 (0.0000%) | Exact (31) |
| B2 | Covariates (gender, poverty) | 4.9e-13 | 2.3e-13 (0.0000%) | Exact |
| B3 | Weights only | 4.6e-13 | 2.6e-13 (0.0000%) | Exact |
| B4 | Subpopulation (female) | 1.1e-13 | 8.7e-14 (0.0000%) | Exact |
Suite C — RECS 2020 (JK1 Replicate Weights):
| Test | Model | Coef Gap | SE Gap | df |
|---|---|---|---|---|
| C1 | TOTALBTU ~ KOWNRENT | 1.5e-11 | 2.0e-11 (0.0000%) | Exact (59) |
| C2 |  | 3.8e-10 | 2.9e-11 (0.0000%) | Exact (59) |
Key Findings:
Machine-precision agreement on ATT, SE, df, and CI wherever directly comparable: all gaps are below 1e-9 (largest 3.8e-10, floating-point rounding only). Tolerances are set to 1e-8 in the test suite.
All survey design features validated with real data: stratification, PSU clustering, FPC corrections, probability weight normalization, nested PSU handling (nest=TRUE), subpopulation analysis, covariate adjustment, Fay’s BRR (212 replicates), and JK1 replicate weight variance.
Known differences: A4 (TWFE) validates ATT only; SE differs because TWFE absorbs unit fixed effects. A5 (subpopulation) validates ATT/SE but df differs: subpopulation() preserves all strata (df=397) while R’s subset() drops empty strata (df=199). This is a documented deviation (see REGISTRY.md); the diff-diff approach is conservative per Lumley (2004).
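For readers unfamiliar with JK1 replicate weights, the variance formula behind Suite C is short enough to sketch. `jk1_variance` is a hypothetical helper, not diff-diff's implementation; the `center` argument reflects that software differs on whether deviations are taken from the replicate mean or from the full-sample estimate, and which convention each package uses is not asserted here.

```python
import numpy as np


def jk1_variance(theta_full, theta_reps, center="mean"):
    """JK1 variance: (R - 1) / R times the sum of squared replicate deviations."""
    theta_reps = np.asarray(theta_reps, dtype=float)
    n_reps = theta_reps.size
    centre = theta_reps.mean() if center == "mean" else theta_full
    return (n_reps - 1) / n_reps * np.sum((theta_reps - centre) ** 2)


# Sanity check with delete-one replicates of a sample mean: the JK1 variance
# reproduces the textbook variance of the mean, s^2 / n.
rng = np.random.default_rng(42)
y = rng.normal(size=60)
theta_full = y.mean()
theta_reps = np.array([np.delete(y, i).mean() for i in range(y.size)])
var_jk1 = jk1_variance(theta_full, theta_reps)
var_textbook = y.var(ddof=1) / y.size
print(var_jk1, var_textbook)
```

With R replicates the associated degrees of freedom are R - 1, which is consistent with df = 59 for the 60 RECS replicate weights.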
Survey Estimator Validation#
Four additional estimators are validated against R’s survey::svyglm() using
synthetic staggered-adoption and DDD datasets. Each estimator reduces to a WLS
regression under survey weights, so the R comparison fits the equivalent
svyglm() model and compares coefficients and standard errors.
Data: 150-unit staggered panel (5 periods, 4 strata, 10 PSUs, FPC) with cohorts at t = 3 and t = 4; 200-observation DDD cross-section (4 strata, 10 PSUs, FPC). Both generated with seed 42.
| Test | Estimator | R Comparison | Coef Gap | SE Gap | Tolerance |
|---|---|---|---|---|---|
| S1 |  |  | < 1e-10 | 0.00% | 1.5% |
| S2 |  |  | < 1e-10 | 0.77% | 1.5% |
| S3 |  |  | < 1e-11 | 0.00% | 1.5% |
| S4 |  |  | < 1e-10 | 0.36% | 1.5% |
Key details:
S1 validates the WLS building block that ImputationDiD uses internally (control-only regression with absorbed unit + time FE and time-varying covariates). A companion smoke test confirms ImputationDiD.fit() produces finite ATT/SE under survey weights.
S2 replicates the full stacking pipeline in R: sub-experiment construction, sample-share Q-weight computation, Q x survey weight composition with normalization, then svyglm() on the stacked data with strata/PSU structure. The 0.77% SE gap arises because R omits FPC on the stacked data while Python re-resolves the full survey design.
S3 compares both individual cohort x relative-time effects and the IW-aggregated overall ATT (with survey-weighted cohort masses and delta-method SE via the vcov submatrix).
S4 exploits the algebraic equivalence between the pairwise DDD decomposition (estimation_method="reg", no covariates) and the three-way interaction coefficient from a single OLS regression.
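The algebraic equivalence used in S4 can be demonstrated without survey weights. The sketch below is an unweighted NumPy illustration on simulated data, under the assumption that the pairwise decomposition is the DiD in one partition minus the DiD in the other; the variable names `g`, `b`, `t` and the helper `did` are hypothetical, and no diff-diff or svyglm call is involved.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
g = rng.integers(0, 2, n)  # treatment group
b = rng.integers(0, 2, n)  # eligibility partition
t = rng.integers(0, 2, n)  # post period
y = (1 + 0.4 * g + 0.2 * b + 0.3 * t + 0.5 * g * b + 0.1 * g * t
     + 0.2 * b * t + 2.0 * g * b * t + rng.normal(size=n))


def did(mask):
    """2x2 DiD of cell means within the observations selected by mask."""
    def m(gg, tt):
        return y[mask & (g == gg) & (t == tt)].mean()
    return (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))


# Pairwise route: DiD among b == 1 minus DiD among b == 0.
ddd_pairwise = did(b == 1) - did(b == 0)

# Regression route: coefficient on g*b*t in the fully saturated model.
X = np.column_stack([np.ones(n), g, b, t, g * b, g * t, b * t, g * b * t])
ddd_ols = np.linalg.lstsq(X, y, rcond=None)[0][-1]
print(ddd_pairwise, ddd_ols)
```

Because the saturated model fits all eight cell means exactly, the triple-interaction coefficient is the same linear combination of cell means as the pairwise difference.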
Reproducing Survey Estimator Validation#
# Generate golden values
Rscript benchmarks/R/benchmark_survey_estimators.R
# Run validation tests
pytest tests/test_survey_estimator_validation.py -v
Reproducing Survey Real-Data Validation#
# 1. Generate API golden values (no download needed — data ships with R)
Rscript benchmarks/R/benchmark_realdata_api.R
# 2. Download and process NHANES data from CDC
python benchmarks/scripts/download_nhanes.py
Rscript benchmarks/R/benchmark_realdata_nhanes.R
# 3. Download and subset RECS 2020 from EIA
python benchmarks/scripts/download_recs.py
Rscript benchmarks/R/benchmark_realdata_recs.R
# 4. Run validation tests
pytest tests/test_survey_real_data.py -v
Reproducing Benchmarks#
Prerequisites#
Install R (>= 4.0):
# macOS
brew install r
Install R packages:
Rscript benchmarks/R/requirements.R
Install diff-diff:
pip install -e ".[dev]"
Running Benchmarks#
# Run all benchmarks at small scale
python benchmarks/run_benchmarks.py --all
# Run all benchmarks at all scales with 3 replications
python benchmarks/run_benchmarks.py --all --scale all --replications 3
# Run specific estimator at specific scale
python benchmarks/run_benchmarks.py --estimator callaway --scale 1k --replications 3
python benchmarks/run_benchmarks.py --estimator synthdid --scale small --replications 3
python benchmarks/run_benchmarks.py --estimator basic --scale 20k --replications 3
python benchmarks/run_benchmarks.py --estimator multiperiod --scale small --replications 3
# Available scales: small, 1k, 5k, 10k, 20k, all
# Default: small (backward compatible)
# Generate synthetic data only
python benchmarks/run_benchmarks.py --generate-data-only --scale all
The benchmarks run both pure Python and Rust backends automatically, producing a three-way comparison table (R vs Python Pure vs Python Rust).
Output#
Results are saved to:
benchmarks/results/accuracy/ - JSON files with estimates
benchmarks/results/comparison_report.txt - Summary report
Interpretation Notes#
When to Trust Results#
BasicDiD/TWFE: Results are identical to R. Use with confidence.
MultiPeriodDiD: Results are identical to R’s fixest::feols with treated * time_f | unit interaction syntax (unit FE absorbed). Both average ATT and all period-level effects match to machine precision. Use with confidence.
SyntheticDiD: Point estimates are numerically identical (< 1e-10 diff) and standard errors match closely (0.3% diff at small scale). Both implementations use Frank-Wolfe optimization with identical weights. Use variance_method="placebo" (default) to match R’s inference. Results are fully validated.
CallawaySantAnna: Both group-time effects (ATT(g,t)) and overall ATT aggregation match R exactly. Standard errors are numerically equivalent (0.0% difference) as of v2.0.2.
Known Differences#
Inference Methods: diff-diff defaults to analytical SEs; R did defaults to multiplier bootstrap. Enable bootstrap in diff-diff for direct comparison.
Aggregation Weights: Overall ATT is a weighted average of ATT(g,t). Weighting schemes may differ between implementations.
Placebo Variance: SyntheticDiD SE estimates differ slightly (0.3-7%) across implementations due to Monte Carlo variance in the placebo procedure. Point estimates and unit/time weights are numerically identical since both implementations use the same Frank-Wolfe optimizer.