.. meta::
   :description: Validation benchmarks comparing diff-diff against R packages (did, synthdid, fixest). Coefficient accuracy, standard error comparison, and performance metrics.
   :keywords: difference-in-differences benchmark, DiD validation R, python econometrics accuracy, did package comparison

Benchmarks
==========

This document presents validation benchmarks comparing diff-diff against
established R packages for difference-in-differences analysis. As of v2.0.0,
diff-diff includes an optional Rust backend for accelerated computation.

.. contents:: Table of Contents
   :local:
   :depth: 2

Overview
--------

diff-diff is validated against the following R packages:

.. list-table::
   :header-rows: 1
   :widths: 30 30 40

   * - diff-diff Estimator
     - R Package
     - Reference
   * - ``DifferenceInDifferences``
     - ``fixest::feols``
     - Standard OLS with interaction
   * - ``CallawaySantAnna``
     - ``did::att_gt``
     - Callaway & Sant'Anna (2021)
   * - ``MultiPeriodDiD``
     - ``fixest::feols``
     - Event study with treatment × period interactions
   * - ``SyntheticDiD``
     - ``synthdid::synthdid_estimate``
     - Arkhangelsky et al. (2021)

Methodology
-----------

Validation Approach
~~~~~~~~~~~~~~~~~~~

1. **Synthetic Data**: Generate data with known true effects using
   ``generate_did_data()`` from diff_diff.prep
2. **Identical Inputs**: Both Python and R estimators receive the same CSV data
3. **JSON Interchange**: R scripts output JSON for comparison
4. **Automated Comparison**: Python script validates numerical equivalence
5. **Multiple Scales**: Test at small (200-400 obs), 1K, 5K, 10K, and 20K unit scales
6. **Replicated Timing**: 3 replications per benchmark to report mean ± std
7. **Reproducible Seed**: Benchmarks use seed 42 for data generation
8. **Three-Way Comparison**: Compare R, Python (pure NumPy/SciPy), and Python (Rust backend)

Tolerance Thresholds
~~~~~~~~~~~~~~~~~~~~

- **Point estimates (ATT)**: Absolute difference < 1e-4 or relative < 1%
- **Standard errors**: Relative difference < 10%
- **Confidence intervals**: Must overlap

Benchmark Results
-----------------

Summary Table
~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 25 20 20 15 20

   * - Estimator
     - ATT Diff
     - SE Rel Diff
     - CI Overlap
     - Status
   * - BasicDiD/TWFE
     - < 1e-10
     - 0.0%
     - Yes
     - **PASS**
   * - MultiPeriodDiD
     - < 1e-11
     - 0.0%
     - Yes
     - **PASS**
   * - CallawaySantAnna
     - < 1e-10
     - 0.0%
     - Yes
     - **PASS**
   * - SyntheticDiD
     - < 1e-10
     - 0.3%
     - Yes
     - **PASS**

Basic DiD Results
~~~~~~~~~~~~~~~~~

**Data**: 100 units, 4 periods, true ATT = 5.0 (small scale)

.. list-table::
   :header-rows: 1

   * - Metric
     - diff-diff (Pure)
     - diff-diff (Rust)
     - R fixest
     - Difference
   * - ATT
     - 5.112
     - 5.112
     - 5.112
     - < 1e-10
   * - SE
     - 0.183
     - 0.183
     - 0.183
     - 0.0%
   * - Time (s)
     - 0.002
     - 0.002
     - 0.041
     - **22x faster**

**Validation**: PASS - Results are numerically identical across all implementations.

MultiPeriodDiD Results
~~~~~~~~~~~~~~~~~~~~~~

**Data**: 200 units, 8 periods (4 pre, 4 post), true ATT = 3.0 (small scale)

.. list-table::
   :header-rows: 1

   * - Metric
     - diff-diff (Pure)
     - diff-diff (Rust)
     - R fixest
     - Difference
   * - ATT
     - 2.912
     - 2.912
     - 2.912
     - < 1e-11
   * - SE
     - 0.158
     - 0.158
     - 0.158
     - 0.0%
   * - Period corr.
     - 1.000
     - 1.000
     - (ref)
     - Period max diff < 3e-11
   * - Time (s)
     - 0.005
     - 0.035
     - 0.035
     - **7x faster** (pure)

**Validation**: PASS - Both average ATT and all period-level effects match R's
``fixest::feols(outcome ~ treated * time_f | unit)`` to machine precision. The
regression includes unit fixed effects (absorbed via ``| unit`` in R, within-
transformation via ``absorb=["unit"]`` in Python) and treatment × period
interactions with cluster-robust SEs.

Synthetic DiD Results
~~~~~~~~~~~~~~~~~~~~~

**Data**: 50 units (40 control, 10 treated), 20 periods, true ATT = 4.0

.. list-table::
   :header-rows: 1

   * - Metric
     - diff-diff (Pure)
     - diff-diff (Rust)
     - R synthdid
     - Difference
   * - ATT
     - 3.840
     - 3.840
     - 3.840
     - < 1e-10
   * - SE
     - 0.105
     - 0.099
     - 0.105
     - 0.3% (pure)
   * - Time (s)
     - 3.41
     - 1.65
     - 8.19
     - **2.4x faster** (pure)

**Validation**: PASS - ATT estimates are numerically identical across all
implementations. Both diff-diff and R's synthdid use Frank-Wolfe optimization
with two-pass sparsification and auto-computed regularization (``zeta_omega``,
``zeta_lambda``), producing identical unit and time weights. Both use
placebo-based variance estimation (Algorithm 4 from Arkhangelsky et al. 2021).

The small SE difference (0.3% at small scale, up to ~7% at larger scales) is
due to Monte Carlo variance in the placebo procedure, which randomly permutes
control units to construct pseudo-treated groups. Different random seeds across
implementations produce slightly different placebo samples.

Callaway-Sant'Anna Results
~~~~~~~~~~~~~~~~~~~~~~~~~~

**Data**: 200 units, 8 periods, 3 treatment cohorts, dynamic effects (small scale)

.. list-table::
   :header-rows: 1

   * - Metric
     - diff-diff (Pure)
     - diff-diff (Rust)
     - R did
     - Difference
   * - ATT
     - 2.519
     - 2.519
     - 2.519
     - < 1e-10
   * - SE
     - 0.063
     - 0.063
     - 0.063
     - 0.0%
   * - Time (s)
     - 0.007 ± 0.000
     - 0.007 ± 0.000
     - 0.070 ± 0.001
     - **10x faster**

**Validation**: PASS - Both point estimates and standard errors match R exactly.

**Key findings from investigation:**

1. **Individual ATT(g,t) effects match perfectly** (~1e-11 difference)
2. **Never-treated coding**: R's ``did`` package requires ``first_treat=Inf``
   for never-treated units. diff-diff accepts ``first_treat=0``. The benchmark
   converts 0 to Inf for R compatibility.
3. **Standard errors**: As of v2.0.2, analytical SEs match R's ``did`` package
   exactly (0.0% difference). The weight influence function (wif) formula was
   corrected to match R's implementation, achieving numerical equivalence across
   all dataset scales.

**Event study per-event-time validation:**

As of v2.1.0, event study SEs include the WIF adjustment matching R's
``did::aggte(..., type="dynamic")``. Validation targets:

- Per-event-time point estimates: match R's ``aggte(..., type="dynamic")`` to <1e-10
- Per-event-time analytical SEs (``bstrap=FALSE``): match R with WIF included
- Per-event-time bootstrap SEs (``bstrap=TRUE``): consistent with analytical
- Simultaneous confidence bands (``cband=TRUE``): sup-t critical value matches R

Performance Comparison
----------------------

We benchmarked performance across multiple dataset scales with 3 replications
each to provide mean ± std timing statistics. As of v2.0.0, we compare three
implementations:

- **R**: Reference implementation (fixest, did packages)
- **Python (Pure)**: diff-diff with NumPy/SciPy only (no Rust backend)
- **Python (Rust)**: diff-diff with optional Rust backend enabled

.. note::

   **v2.0.0 Rust Backend**: diff-diff v2.0.0 introduces an optional Rust backend
   for accelerated computation. The Rust backend provides significant speedups
   for **SyntheticDiD** (4-8x faster than pure Python), which uses custom Rust
   implementations for synthetic weight computation and simplex projection.
   For **BasicDiD** and **CallawaySantAnna**, the Rust backend provides minimal
   additional speedup since these estimators primarily use OLS and variance
   computations that are already highly optimized in NumPy/SciPy via BLAS/LAPACK.

   As of v2.5.0, pre-built wheels on macOS and Linux link platform-optimized
   BLAS libraries (Apple Accelerate and OpenBLAS respectively) for matrix-vector
   and matrix-matrix products across all Rust-accelerated code paths. Windows
   wheels continue to use pure Rust with no external dependencies.

Three-Way Performance Summary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**BasicDiD/TWFE Results:**

.. list-table::
   :header-rows: 1
   :widths: 12 15 18 18 12 12

   * - Scale
     - R (s)
     - Python Pure (s)
     - Python Rust (s)
     - Rust/R
     - Rust/Pure
   * - small
     - 0.034
     - 0.002
     - 0.002
     - **17x**
     - 1.1x
   * - 1k
     - 0.036
     - 0.003
     - 0.003
     - **13x**
     - 1.0x
   * - 5k
     - 0.042
     - 0.005
     - 0.006
     - **7x**
     - 0.8x
   * - 10k
     - 0.043
     - 0.010
     - 0.012
     - **4x**
     - 0.8x
   * - 20k
     - 0.050
     - 0.022
     - 0.025
     - **2x**
     - 0.9x

**CallawaySantAnna Results:**

.. list-table::
   :header-rows: 1
   :widths: 12 15 18 18 12 12

   * - Scale
     - R (s)
     - Python Pure (s)
     - Python Rust (s)
     - Pure/R
     - Rust/Pure
   * - small
     - 0.069
     - 0.006
     - 0.007
     - **11x**
     - 1.0x
   * - 1k
     - 0.119
     - 0.014
     - 0.013
     - **9x**
     - 1.0x
   * - 5k
     - 0.363
     - 0.055
     - 0.055
     - **7x**
     - 1.0x
   * - 10k
     - 0.771
     - 0.146
     - 0.145
     - **5x**
     - 1.0x
   * - 20k
     - 1.559
     - 0.366
     - 0.373
     - **4x**
     - 1.0x

**SyntheticDiD Results:**

.. list-table::
   :header-rows: 1
   :widths: 12 15 18 18 12 12

   * - Scale
     - R (s)
     - Python Pure (s)
     - Python Rust (s)
     - Pure/R
     - Rust/Pure
   * - small
     - 8.19
     - 3.41
     - 1.65
     - **2.4x**
     - **2.1x**
   * - 1k
     - 111.7
     - 24.0
     - 76.1
     - **4.7x**
     - 0.3x
   * - 5k
     - 524.2
     - 31.7
     - 307.5
     - **16.5x**
     - 0.1x

.. note::

   **SyntheticDiD Performance**: diff-diff's pure Python backend achieves
   **2.4x to 16.5x speedup** over R's synthdid package using the same
   Frank-Wolfe optimization algorithm. At 5k scale, R takes ~9 minutes while
   pure Python completes in 32 seconds. ATT estimates are numerically identical
   (< 1e-10 difference) since both implementations use the same Frank-Wolfe
   optimizer with two-pass sparsification. The Rust backend uses a
   Gram-accelerated Frank-Wolfe solver for time weights (reducing per-iteration
   cost from O(N×T0) to O(T0)) and an allocation-free solver for unit weights
   (1 GEMV per iteration instead of 3, zero heap allocations). These
   optimizations make the Rust backend faster than pure Python at all scales.

Dataset Sizes
~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 10 18 18 18 18 18

   * - Scale
     - BasicDiD
     - MultiPeriodDiD
     - CallawaySantAnna
     - SyntheticDiD
     - Observations
   * - small
     - 100 × 4
     - 200 × 8
     - 200 × 8
     - 50 × 20
     - 400 - 1,600
   * - 1k
     - 1,000 × 6
     - 1,000 × 10
     - 1,000 × 10
     - 1,000 × 30
     - 6,000 - 30,000
   * - 5k
     - 5,000 × 8
     - 5,000 × 12
     - 5,000 × 12
     - 5,000 × 40
     - 40,000 - 200,000
   * - 10k
     - 10,000 × 10
     - 10,000 × 12
     - 10,000 × 15
     - 10,000 × 50
     - 100,000 - 500,000
   * - 20k
     - 20,000 × 12
     - 20,000 × 16
     - 20,000 × 18
     - 20,000 × 60
     - 240,000 - 1,200,000

Key Observations
~~~~~~~~~~~~~~~~

1. **Performance varies by estimator and scale**:

   - **BasicDiD/TWFE**: 2-17x faster than R at all scales
   - **CallawaySantAnna**: 4-11x faster than R at all scales (vectorized WIF computation)
   - **SyntheticDiD**: 2.4-16.5x faster than R (pure Python), with both
     implementations using the same Frank-Wolfe algorithm

2. **Rust backend benefit depends on the estimator**:

   - **SyntheticDiD**: Rust provides speedup at small scale (2.1x) but is
     slower at larger scales due to placebo variance loop overhead
   - **BasicDiD/CallawaySantAnna**: Rust provides minimal benefit (~1x) since
     these estimators use OLS/variance computations already optimized in NumPy/SciPy

3. **When to use Rust backend**:

   - **SyntheticDiD at small scale**: Rust is ~2x faster than pure Python
   - **Bootstrap inference**: May help with parallelized iterations
   - **BasicDiD/CallawaySantAnna**: Optional - pure Python is equally fast

4. **Scaling behavior**: Python implementations show excellent scaling behavior
   across all estimators. SyntheticDiD pure Python is 16.5x faster than R at
   5k scale. CallawaySantAnna achieves **exact SE accuracy** (0.0% difference)
   while being 4-11x faster than R through vectorized NumPy operations.

5. **No Rust required for most use cases**: Users without Rust/maturin can
   install diff-diff and get full functionality with excellent performance.
   Pure Python is the fastest option for SyntheticDiD at 1k+ scales.

6. **CallawaySantAnna accuracy and speed**: As of v2.0.3, CallawaySantAnna
   achieves both exact numerical accuracy (0.0% SE difference from R) AND
   superior performance (4-10x faster than R) through vectorized weight
   influence function (WIF) computation using NumPy matrix operations.

Performance Optimization Details
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance improvements come from:

1. **Unified ``linalg.py`` backend**: Single optimized OLS/SE implementation
   using scipy's gelsy LAPACK driver (QR-based, faster than SVD)

2. **Vectorized cluster-robust SE**: Eliminated O(n × clusters) loop with
   pandas groupby aggregation

3. **Pre-computed data structures** (CallawaySantAnna): Wide-format outcome
   matrix and cohort masks computed once, reused across all ATT(g,t) calculations

4. **Vectorized bootstrap** (CallawaySantAnna): Matrix operations instead of
   nested loops, batch weight generation

5. **Vectorized WIF computation** (CallawaySantAnna, v2.0.3): Weight influence
   function computation uses NumPy matrix operations instead of O(n_units × n_keepers)
   nested loops. The indicator matrix, if1/if2 matrices, and wif contribution are
   computed using broadcasting and matrix multiplication: ``wif_contrib = wif_matrix @ effects``

6. **Optional Rust backend** (v2.0.0): PyO3-based Rust extension for compute-intensive
   operations (OLS, robust variance, bootstrap weights, simplex projection)

Why is diff-diff Fast?
~~~~~~~~~~~~~~~~~~~~~~

1. **Optimized LAPACK**: scipy's gelsy driver for least squares
2. **Vectorized operations**: NumPy/pandas for matrix operations and aggregations
3. **Efficient memory access**: Pre-computed structures avoid repeated data reshaping
4. **Pure Python overhead minimized**: Hot paths use compiled NumPy/scipy routines
5. **Optional Rust acceleration**: Native code for bootstrap and optimization algorithms

Real-World Data Validation
--------------------------

In addition to synthetic data benchmarks, we validate diff-diff against the
**MPDTA (Minimum Wage and Teen Employment)** dataset - the canonical benchmark
used in Callaway & Sant'Anna (2021) and the R ``did`` package.

MPDTA Dataset
~~~~~~~~~~~~~

The MPDTA dataset contains county-level teen employment data with staggered
minimum wage policy changes:

- **500 counties** across 5 years (2003-2007)
- **2,500 observations** total
- **4 treatment cohorts**: Never-treated (309), 2004 (20), 2006 (40), 2007 (131)
- **Outcome**: Log teen employment (``lemp``)
- **Source**: Built into R's ``did`` package

Results Comparison
~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 25 25 25 25

   * - Metric
     - diff-diff
     - R did
     - Difference
   * - ATT
     - -0.039951
     - -0.039951
     - **0** (exact match)
   * - SE (analytical)
     - 0.0117
     - 0.0118
     - **< 1%**
   * - Time (10 reps)
     - 0.003s ± 0.000s
     - 0.039s ± 0.006s
     - **14.4x faster**

**Key Findings:**

1. **Point estimates match exactly**: The overall ATT of -0.039951 is identical
   between diff-diff and R's ``did`` package, validating the core estimation logic.

2. **Standard errors match exactly**: As of v2.0.2, analytical SEs use the corrected
   weight influence function formula, achieving 0.0% difference from R's ``did``
   package. Both point estimates and standard errors are numerically equivalent.

3. **Performance**: diff-diff is ~14x faster than R on this real-world dataset
   at small scale. Performance scales differently at larger sizes (see performance
   tables above).

This validation on real-world data with known published results confirms that
diff-diff produces correct estimates that match the reference R implementation.

Survey Real-Data Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to synthetic-data survey cross-validation (see
``test_survey_r_crossvalidation.py``), diff-diff's survey variance is validated
against R's ``survey`` package using three real federal survey datasets. All
comparisons match to machine precision (differences < 1e-10).

**Datasets:**

.. list-table::
   :header-rows: 1
   :widths: 15 20 15 20 30

   * - Dataset
     - Source
     - Size
     - Survey Design
     - Policy Context
   * - API (apistrat)
     - R ``survey`` package
     - 200 schools
     - Strata + FPC + weights
     - California school accountability (PSAA 1999)
   * - NHANES
     - CDC/NCHS
     - 2,946 adults
     - Strata + PSU + weights (nest)
     - ACA young adult coverage provision (2010)
   * - RECS 2020
     - U.S. EIA
     - 2,000 households
     - 60 JK1 replicate weights
     - Residential energy consumption survey

**Suite A — API Dataset (TSL Variance):**

.. list-table::
   :header-rows: 1
   :widths: 10 30 20 20 20

   * - Test
     - Design Variant
     - ATT Gap
     - SE Gap
     - df
   * - A1
     - Strata + FPC + weights
     - 2.1e-12
     - 3.2e-11 (0.0000%)
     - Exact
   * - A2
     - Strata + weights (no FPC)
     - 2.1e-12
     - 5.3e-11 (0.0000%)
     - Exact
   * - A3
     - Weights only
     - 2.1e-12
     - 2.7e-11 (0.0000%)
     - Exact
   * - A4
     - TWFE (strata + FPC + weights)
     - 2.1e-12 (ATT only)
     - n/a (TWFE absorbs unit FE)
     - Exact
   * - A5
     - Subpopulation (elementary)
     - 1.5e-11
     - 7.5e-12 (0.0000%)
     - Differs (see note)
   * - A6
     - Covariates (meals, ell)
     - 2.2e-12
     - 8.4e-12 (0.0000%)
     - Exact
   * - A7
     - Fay's BRR replicates (rho=0.3)
     - 2.1e-12
     - 7.7e-11 (0.0000%)
     - Exact

**Suite B — NHANES (TSL with Strata + PSU + nest=TRUE):**

.. list-table::
   :header-rows: 1
   :widths: 10 30 20 20 20

   * - Test
     - Design Variant
     - ATT Gap
     - SE Gap
     - df
   * - B1
     - Strata + PSU + weights
     - 4.6e-13
     - 2.3e-14 (0.0000%)
     - Exact (31)
   * - B2
     - Covariates (gender, poverty)
     - 4.9e-13
     - 2.3e-13 (0.0000%)
     - Exact
   * - B3
     - Weights only
     - 4.6e-13
     - 2.6e-13 (0.0000%)
     - Exact
   * - B4
     - Subpopulation (female)
     - 1.1e-13
     - 8.7e-14 (0.0000%)
     - Exact

**Suite C — RECS 2020 (JK1 Replicate Weights):**

.. list-table::
   :header-rows: 1
   :widths: 10 30 20 20 20

   * - Test
     - Model
     - Coef Gap
     - SE Gap
     - df
   * - C1
     - TOTALBTU ~ KOWNRENT
     - 1.5e-11
     - 2.0e-11 (0.0000%)
     - Exact (59)
   * - C2
     - + TYPEHUQ + REGIONC
     - 3.8e-10
     - 2.9e-11 (0.0000%)
     - Exact (59)

**Key Findings:**

1. **Machine-precision agreement** on ATT, SE, df, and CI wherever directly
   comparable — differences are < 1e-10 (floating-point rounding only).
   Tolerances are set to 1e-8 in the test suite.

2. **All survey design features validated with real data:** stratification, PSU
   clustering, FPC corrections, probability weight normalization, nested PSU
   handling (``nest=TRUE``), subpopulation analysis, covariate adjustment,
   Fay's BRR (212 replicates), and JK1 replicate weight variance.

3. **Known differences:** A4 (TWFE) validates ATT only — SE differs because
   TWFE absorbs unit fixed effects. A5 (subpopulation) validates ATT/SE but
   df differs: ``subpopulation()`` preserves all strata (df=397) while R's
   ``subset()`` drops empty strata (df=199). This is a documented deviation
   (see REGISTRY.md); the diff-diff approach is conservative per Lumley (2004).

Survey Estimator Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Four additional estimators are validated against R's ``survey::svyglm()`` using
synthetic staggered-adoption and DDD datasets. Each estimator reduces to a WLS
regression under survey weights, so the R comparison fits the equivalent
``svyglm()`` model and compares coefficients and standard errors.

**Data:** 150-unit staggered panel (5 periods, 4 strata, 10 PSUs, FPC) with
cohorts at *t* = 3 and *t* = 4; 200-observation DDD cross-section (4 strata,
10 PSUs, FPC). Both generated with seed 42.

.. list-table::
   :header-rows: 1
   :widths: 8 18 30 15 15 14

   * - Test
     - Estimator
     - R Comparison
     - Coef Gap
     - SE Gap
     - Tolerance
   * - S1
     - ``ImputationDiD``
     - ``svyglm()`` on control-only (Omega_0) FE regression; covariate coefficients
     - < 1e-10
     - 0.00%
     - 1.5%
   * - S2
     - ``StackedDiD``
     - ``svyglm()`` on stacked dataset with Q-weight x survey weight composition
     - < 1e-10
     - 0.77%
     - 1.5%
   * - S3
     - ``SunAbraham``
     - ``svyglm()`` with cohort x period interactions; IW-aggregated ATT
     - < 1e-11
     - 0.00%
     - 1.5%
   * - S4
     - ``TripleDifference``
     - ``svyglm()`` three-way interaction (``group:partition:time``)
     - < 1e-10
     - 0.36%
     - 1.5%

**Key details:**

- **S1** validates the WLS building block that ``ImputationDiD`` uses internally
  (control-only regression with absorbed unit + time FE and time-varying
  covariates). A companion smoke test confirms ``ImputationDiD.fit()`` produces
  finite ATT/SE under survey weights.
- **S2** replicates the full stacking pipeline in R: sub-experiment construction,
  sample-share Q-weight computation, Q x survey weight composition with
  normalization, then ``svyglm()`` on the stacked data with strata/PSU structure.
  The 0.77% SE gap arises because R omits FPC on the stacked data while Python
  re-resolves the full survey design.
- **S3** compares both individual cohort x relative-time effects and the
  IW-aggregated overall ATT (with survey-weighted cohort masses and delta-method
  SE via the vcov submatrix).
- **S4** exploits the algebraic equivalence between the pairwise DDD
  decomposition (``estimation_method="reg"``, no covariates) and the three-way
  interaction coefficient from a single OLS regression.

Reproducing Survey Estimator Validation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Generate golden values
   Rscript benchmarks/R/benchmark_survey_estimators.R

   # Run validation tests
   pytest tests/test_survey_estimator_validation.py -v

Reproducing Survey Real-Data Validation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # 1. Generate API golden values (no download needed — data ships with R)
   Rscript benchmarks/R/benchmark_realdata_api.R

   # 2. Download and process NHANES data from CDC
   python benchmarks/scripts/download_nhanes.py
   Rscript benchmarks/R/benchmark_realdata_nhanes.R

   # 3. Download and subset RECS 2020 from EIA
   python benchmarks/scripts/download_recs.py
   Rscript benchmarks/R/benchmark_realdata_recs.R

   # 4. Run validation tests
   pytest tests/test_survey_real_data.py -v

Reproducing Benchmarks
----------------------

Prerequisites
~~~~~~~~~~~~~

1. Install R (>= 4.0):

   .. code-block:: bash

      # macOS
      brew install r

2. Install R packages:

   .. code-block:: bash

      Rscript benchmarks/R/requirements.R

3. Install diff-diff:

   .. code-block:: bash

      pip install -e ".[dev]"

Running Benchmarks
~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Run all benchmarks at small scale
   python benchmarks/run_benchmarks.py --all

   # Run all benchmarks at all scales with 3 replications
   python benchmarks/run_benchmarks.py --all --scale all --replications 3

   # Run specific estimator at specific scale
   python benchmarks/run_benchmarks.py --estimator callaway --scale 1k --replications 3
   python benchmarks/run_benchmarks.py --estimator synthdid --scale small --replications 3
   python benchmarks/run_benchmarks.py --estimator basic --scale 20k --replications 3
   python benchmarks/run_benchmarks.py --estimator multiperiod --scale small --replications 3

   # Available scales: small, 1k, 5k, 10k, 20k, all
   # Default: small (backward compatible)

   # Generate synthetic data only
   python benchmarks/run_benchmarks.py --generate-data-only --scale all

The benchmarks run both pure Python and Rust backends automatically, producing
a three-way comparison table (R vs Python Pure vs Python Rust).

Output
~~~~~~

Results are saved to:

- ``benchmarks/results/accuracy/`` - JSON files with estimates
- ``benchmarks/results/comparison_report.txt`` - Summary report

Interpretation Notes
--------------------

When to Trust Results
~~~~~~~~~~~~~~~~~~~~~

- **BasicDiD/TWFE**: Results are identical to R. Use with confidence.

- **MultiPeriodDiD**: Results are identical to R's ``fixest::feols`` with
  ``treated * time_f | unit`` interaction syntax (unit FE absorbed). Both average
  ATT and all period-level effects match to machine precision. Use with confidence.

- **SyntheticDiD**: Point estimates are numerically identical (< 1e-10 diff) and
  standard errors match closely (0.3% diff at small scale). Both implementations
  use Frank-Wolfe optimization with identical weights. Use
  ``variance_method="placebo"`` (default) to match R's inference. Results are
  fully validated.

- **CallawaySantAnna**: Both group-time effects (ATT(g,t)) and overall ATT
  aggregation match R exactly. Standard errors are numerically equivalent
  (0.0% difference) as of v2.0.2.

Known Differences
~~~~~~~~~~~~~~~~~

1. **Inference Methods**: diff-diff defaults to analytical SEs; R ``did``
   defaults to multiplier bootstrap. Enable bootstrap in diff-diff for
   direct comparison.

2. **Aggregation Weights**: Overall ATT is a weighted average of ATT(g,t).
   Weighting schemes may differ between implementations.

3. **Placebo Variance**: SyntheticDiD SE estimates differ slightly (0.3-7%)
   across implementations due to Monte Carlo variance in the placebo procedure.
   Point estimates and unit/time weights are numerically identical since both
   implementations use the same Frank-Wolfe optimizer.