.. meta::
   :description: Validation benchmarks comparing diff-diff against R packages (did, synthdid, fixest). Coefficient accuracy, standard error comparison, and performance metrics.
   :keywords: difference-in-differences benchmark, DiD validation R, python econometrics accuracy, did package comparison

Benchmarks
==========

This document presents validation benchmarks comparing diff-diff against
established R packages for difference-in-differences analysis. As of v2.0.0,
diff-diff includes an optional Rust backend for accelerated computation.

.. contents:: Table of Contents
   :local:
   :depth: 2

Overview
--------

diff-diff is validated against the following R packages:

.. list-table::
   :header-rows: 1
   :widths: 30 30 40

   * - diff-diff Estimator
     - R Package
     - Reference
   * - ``DifferenceInDifferences``
     - ``fixest::feols``
     - Standard OLS with interaction
   * - ``CallawaySantAnna``
     - ``did::att_gt``
     - Callaway & Sant'Anna (2021)
   * - ``MultiPeriodDiD``
     - ``fixest::feols``
     - Event study with treatment × period interactions
   * - ``SyntheticDiD``
     - ``synthdid::synthdid_estimate``
     - Arkhangelsky et al. (2021)

Methodology
-----------

Validation Approach
~~~~~~~~~~~~~~~~~~~

1. **Synthetic Data**: Generate data with known true effects using
   ``generate_did_data()`` from ``diff_diff.prep``
2. **Identical Inputs**: Both Python and R estimators receive the same CSV data
3. **JSON Interchange**: R scripts output JSON for comparison
4. **Automated Comparison**: Python script validates numerical equivalence
5. **Multiple Scales**: Test at small (400-1,600 obs), 1K, 5K, 10K, and 20K unit scales
6. **Replicated Timing**: 3 replications per benchmark to report mean ± std
7. **Reproducible Seed**: Benchmarks use seed 42 for data generation
8.
   **Three-Way Comparison**: Compare R, Python (pure NumPy/SciPy), and Python
   (Rust backend)

Tolerance Thresholds
~~~~~~~~~~~~~~~~~~~~

- **Point estimates (ATT)**: Absolute difference < 1e-4 or relative < 1%
- **Standard errors**: Relative difference < 10%
- **Confidence intervals**: Must overlap

Benchmark Results
-----------------

Summary Table
~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 25 20 20 15 20

   * - Estimator
     - ATT Diff
     - SE Rel Diff
     - CI Overlap
     - Status
   * - BasicDiD/TWFE
     - < 1e-10
     - 0.0%
     - Yes
     - **PASS**
   * - MultiPeriodDiD
     - < 1e-11
     - 0.0%
     - Yes
     - **PASS**
   * - CallawaySantAnna
     - < 1e-10
     - 0.0%
     - Yes
     - **PASS**
   * - SyntheticDiD
     - < 1e-10
     - 0.3%
     - Yes
     - **PASS**

Basic DiD Results
~~~~~~~~~~~~~~~~~

**Data**: 100 units, 4 periods, true ATT = 5.0 (small scale)

.. list-table::
   :header-rows: 1

   * - Metric
     - diff-diff (Pure)
     - diff-diff (Rust)
     - R fixest
     - Difference
   * - ATT
     - 5.112
     - 5.112
     - 5.112
     - < 1e-10
   * - SE
     - 0.183
     - 0.183
     - 0.183
     - 0.0%
   * - Time (s)
     - 0.002
     - 0.002
     - 0.041
     - **22x faster**

**Validation**: PASS - Results are numerically identical across all
implementations.

MultiPeriodDiD Results
~~~~~~~~~~~~~~~~~~~~~~

**Data**: 200 units, 8 periods (4 pre, 4 post), true ATT = 3.0 (small scale)

.. list-table::
   :header-rows: 1

   * - Metric
     - diff-diff (Pure)
     - diff-diff (Rust)
     - R fixest
     - Difference
   * - ATT
     - 2.912
     - 2.912
     - 2.912
     - < 1e-11
   * - SE
     - 0.158
     - 0.158
     - 0.158
     - 0.0%
   * - Period corr.
     - 1.000
     - 1.000
     - (ref)
     - Period max diff < 3e-11
   * - Time (s)
     - 0.005
     - 0.035
     - 0.035
     - **7x faster** (pure)

**Validation**: PASS - Both average ATT and all period-level effects match R's
``fixest::feols(outcome ~ treated * time_f | unit)`` to machine precision. The
regression includes unit fixed effects (absorbed via ``| unit`` in R,
within-transformation via ``absorb=["unit"]`` in Python) and treatment × period
interactions with cluster-robust SEs.
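
The equivalence between absorbing unit fixed effects and applying the within
transformation can be checked on toy data. The sketch below uses plain NumPy
rather than diff-diff or fixest; the panel dimensions, variable names, and the
true effect of 2.0 are all hypothetical. It fits the same treatment × period
specification twice, once on unit-demeaned data and once with explicit unit
dummies, and compares the shared coefficients.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(42)
   n_units, n_periods = 50, 4  # hypothetical balanced panel
   unit = np.repeat(np.arange(n_units), n_periods)
   period = np.tile(np.arange(n_periods), n_units)
   treated = (unit < n_units // 2).astype(float)

   # Outcome: unit effect + time trend + effect of 2.0 for treated units in periods >= 2.
   y = (
       rng.normal(size=n_units)[unit]
       + 0.5 * period
       + 2.0 * treated * (period >= 2)
       + rng.normal(scale=0.1, size=unit.size)
   )

   # Period dummies and treated x period interactions, with period 0 as reference.
   period_dummies = np.column_stack(
       [(period == t).astype(float) for t in range(1, n_periods)]
   )
   X = np.column_stack([period_dummies, period_dummies * treated[:, None]])


   def demean_by_unit(a):
       """Subtract each unit's mean (the within transformation)."""
       a2 = a.reshape(a.shape[0], -1)
       sums = np.zeros((n_units, a2.shape[1]))
       np.add.at(sums, unit, a2)
       counts = np.bincount(unit, minlength=n_units)[:, None]
       return (a2 - (sums / counts)[unit]).reshape(a.shape)


   # Fit 1: within transformation, no unit dummies.
   beta_within, *_ = np.linalg.lstsq(demean_by_unit(X), demean_by_unit(y), rcond=None)

   # Fit 2: explicit unit dummies (least-squares dummy variables).
   unit_dummies = (unit[:, None] == np.arange(n_units)).astype(float)
   beta_lsdv, *_ = np.linalg.lstsq(np.column_stack([unit_dummies, X]), y, rcond=None)

   max_gap = np.max(np.abs(beta_within - beta_lsdv[n_units:]))
   print(f"max coefficient gap: {max_gap:.2e}")

By the Frisch-Waugh-Lovell theorem the two fits agree up to floating-point
error, which is why the two ways of handling unit fixed effects can be compared
coefficient by coefficient.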

Synthetic DiD Results
~~~~~~~~~~~~~~~~~~~~~

**Data**: 50 units (40 control, 10 treated), 20 periods, true ATT = 4.0

.. list-table::
   :header-rows: 1

   * - Metric
     - diff-diff (Pure)
     - diff-diff (Rust)
     - R synthdid
     - Difference
   * - ATT
     - 3.840
     - 3.840
     - 3.840
     - < 1e-10
   * - SE
     - 0.105
     - 0.099
     - 0.105
     - 0.3% (pure)
   * - Time (s)
     - 3.41
     - 1.65
     - 8.19
     - **2.4x faster** (pure)

**Validation**: PASS - ATT estimates are numerically identical across all
implementations. Both diff-diff and R's synthdid use Frank-Wolfe optimization
with two-pass sparsification and auto-computed regularization (``zeta_omega``,
``zeta_lambda``), producing identical unit and time weights. Both use
placebo-based variance estimation (Algorithm 4 from Arkhangelsky et al. 2021).

The small SE difference (0.3% at small scale, up to ~7% at larger scales) is
due to Monte Carlo variance in the placebo procedure, which randomly permutes
control units to construct pseudo-treated groups. Different random seeds across
implementations produce slightly different placebo samples.

Callaway-Sant'Anna Results
~~~~~~~~~~~~~~~~~~~~~~~~~~

**Data**: 200 units, 8 periods, 3 treatment cohorts, dynamic effects (small scale)

.. list-table::
   :header-rows: 1

   * - Metric
     - diff-diff (Pure)
     - diff-diff (Rust)
     - R did
     - Difference
   * - ATT
     - 2.519
     - 2.519
     - 2.519
     - < 1e-10
   * - SE
     - 0.063
     - 0.063
     - 0.063
     - 0.0%
   * - Time (s)
     - 0.007 ± 0.000
     - 0.007 ± 0.000
     - 0.070 ± 0.001
     - **10x faster**

**Validation**: PASS - Both point estimates and standard errors match R exactly.

**Key findings from investigation:**

1. **Individual ATT(g,t) effects match perfectly** (~1e-11 difference)
2. **Never-treated coding**: R's ``did`` package requires ``first_treat=Inf``
   for never-treated units. diff-diff accepts ``first_treat=0``. The benchmark
   converts 0 to Inf for R compatibility.
3. **Standard errors**: As of v2.0.2, analytical SEs match R's ``did`` package
   exactly (0.0% difference).
   The weight influence function (wif) formula was corrected to match R's
   implementation, achieving numerical equivalence across all dataset scales.

**Event study per-event-time validation:**

As of v2.1.0, event study SEs include the WIF adjustment matching R's
``did::aggte(..., type="dynamic")``. Validation targets:

- Per-event-time point estimates: match R's ``aggte(..., type="dynamic")`` to <1e-10
- Per-event-time analytical SEs (``bstrap=FALSE``): match R with WIF included
- Per-event-time bootstrap SEs (``bstrap=TRUE``): consistent with analytical
- Simultaneous confidence bands (``cband=TRUE``): sup-t critical value matches R

Performance Comparison
----------------------

We benchmarked performance across multiple dataset scales with 3 replications
each to provide mean ± std timing statistics. As of v2.0.0, we compare three
implementations:

- **R**: Reference implementation (fixest, did packages)
- **Python (Pure)**: diff-diff with NumPy/SciPy only (no Rust backend)
- **Python (Rust)**: diff-diff with optional Rust backend enabled

.. note::

   **v2.0.0 Rust Backend**: diff-diff v2.0.0 introduces an optional Rust
   backend for accelerated computation. The Rust backend targets
   **SyntheticDiD**, which uses custom Rust implementations for synthetic
   weight computation and simplex projection. In the SyntheticDiD table below
   it is 2.1x faster than pure Python at small scale and slower at the 1k and
   5k scales. For **BasicDiD** and **CallawaySantAnna**, the Rust backend
   provides minimal additional speedup since these estimators primarily use
   OLS and variance computations that are already highly optimized in
   NumPy/SciPy via BLAS/LAPACK.

   As of v2.5.0, pre-built wheels on macOS and Linux link platform-optimized
   BLAS libraries (Apple Accelerate and OpenBLAS respectively) for
   matrix-vector and matrix-matrix products across all Rust-accelerated code
   paths. Windows wheels continue to use pure Rust with no external
   dependencies.

Three-Way Performance Summary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**BasicDiD/TWFE Results:**

.. list-table::
   :header-rows: 1
   :widths: 12 15 18 18 12 12

   * - Scale
     - R (s)
     - Python Pure (s)
     - Python Rust (s)
     - Rust/R
     - Rust/Pure
   * - small
     - 0.034
     - 0.002
     - 0.002
     - **17x**
     - 1.1x
   * - 1k
     - 0.036
     - 0.003
     - 0.003
     - **13x**
     - 1.0x
   * - 5k
     - 0.042
     - 0.005
     - 0.006
     - **7x**
     - 0.8x
   * - 10k
     - 0.043
     - 0.010
     - 0.012
     - **4x**
     - 0.8x
   * - 20k
     - 0.050
     - 0.022
     - 0.025
     - **2x**
     - 0.9x

**CallawaySantAnna Results:**

.. list-table::
   :header-rows: 1
   :widths: 12 15 18 18 12 12

   * - Scale
     - R (s)
     - Python Pure (s)
     - Python Rust (s)
     - Pure/R
     - Rust/Pure
   * - small
     - 0.069
     - 0.006
     - 0.007
     - **11x**
     - 1.0x
   * - 1k
     - 0.119
     - 0.014
     - 0.013
     - **9x**
     - 1.0x
   * - 5k
     - 0.363
     - 0.055
     - 0.055
     - **7x**
     - 1.0x
   * - 10k
     - 0.771
     - 0.146
     - 0.145
     - **5x**
     - 1.0x
   * - 20k
     - 1.559
     - 0.366
     - 0.373
     - **4x**
     - 1.0x

**SyntheticDiD Results:**

.. list-table::
   :header-rows: 1
   :widths: 12 15 18 18 12 12

   * - Scale
     - R (s)
     - Python Pure (s)
     - Python Rust (s)
     - Pure/R
     - Rust/Pure
   * - small
     - 8.19
     - 3.41
     - 1.65
     - **2.4x**
     - **2.1x**
   * - 1k
     - 111.7
     - 24.0
     - 76.1
     - **4.7x**
     - 0.3x
   * - 5k
     - 524.2
     - 31.7
     - 307.5
     - **16.5x**
     - 0.1x

.. note::

   **SyntheticDiD Performance**: diff-diff's pure Python backend achieves
   **2.4x to 16.5x speedup** over R's synthdid package using the same
   Frank-Wolfe optimization algorithm. At 5k scale, R takes ~9 minutes while
   pure Python completes in 32 seconds. ATT estimates are numerically
   identical (< 1e-10 difference) since both implementations use the same
   Frank-Wolfe optimizer with two-pass sparsification.

   The Rust backend uses a Gram-accelerated Frank-Wolfe solver for time
   weights (reducing per-iteration cost from O(N×T0) to O(T0)) and an
   allocation-free solver for unit weights (1 GEMV per iteration instead of 3,
   zero heap allocations). In the table above, the Rust backend is faster than
   pure Python at small scale (2.1x) and slower at the 1k and 5k scales.

Dataset Sizes
~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 10 18 18 18 18 18

   * - Scale
     - BasicDiD
     - MultiPeriodDiD
     - CallawaySantAnna
     - SyntheticDiD
     - Observations
   * - small
     - 100 × 4
     - 200 × 8
     - 200 × 8
     - 50 × 20
     - 400 - 1,600
   * - 1k
     - 1,000 × 6
     - 1,000 × 10
     - 1,000 × 10
     - 1,000 × 30
     - 6,000 - 30,000
   * - 5k
     - 5,000 × 8
     - 5,000 × 12
     - 5,000 × 12
     - 5,000 × 40
     - 40,000 - 200,000
   * - 10k
     - 10,000 × 10
     - 10,000 × 12
     - 10,000 × 15
     - 10,000 × 50
     - 100,000 - 500,000
   * - 20k
     - 20,000 × 12
     - 20,000 × 16
     - 20,000 × 18
     - 20,000 × 60
     - 240,000 - 1,200,000

Key Observations
~~~~~~~~~~~~~~~~

1. **Performance varies by estimator and scale**:

   - **BasicDiD/TWFE**: 2-17x faster than R at all scales
   - **CallawaySantAnna**: 4-11x faster than R at all scales (vectorized WIF
     computation)
   - **SyntheticDiD**: 2.4-16.5x faster than R (pure Python), with both
     implementations using the same Frank-Wolfe algorithm

2. **Rust backend benefit depends on the estimator**:

   - **SyntheticDiD**: Rust provides speedup at small scale (2.1x) but is
     slower at larger scales due to placebo variance loop overhead
   - **BasicDiD/CallawaySantAnna**: Rust provides minimal benefit (~1x) since
     these estimators use OLS/variance computations already optimized in
     NumPy/SciPy

3. **When to use Rust backend**:

   - **SyntheticDiD at small scale**: Rust is ~2x faster than pure Python
   - **Bootstrap inference**: May help with parallelized iterations
   - **BasicDiD/CallawaySantAnna**: Optional - pure Python is equally fast

4. **Scaling behavior**: Python implementations show excellent scaling behavior
   across all estimators. SyntheticDiD pure Python is 16.5x faster than R at 5k
   scale. CallawaySantAnna achieves **exact SE accuracy** (0.0% difference)
   while being 4-11x faster than R through vectorized NumPy operations.

5. **No Rust required for most use cases**: Users without Rust/maturin can
   install diff-diff and get full functionality with excellent performance.
   Pure Python is the fastest option for SyntheticDiD at 1k+ scales.

6.
   **CallawaySantAnna accuracy and speed**: As of v2.0.3, CallawaySantAnna
   achieves both exact numerical accuracy (0.0% SE difference from R) and
   superior performance (4-11x faster than R) through vectorized weight
   influence function (WIF) computation using NumPy matrix operations.

Performance Optimization Details
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance improvements come from:

1. **Unified ``linalg.py`` backend**: Single optimized OLS/SE implementation
   using scipy's gelsy LAPACK driver (QR-based, faster than SVD)

2. **Vectorized cluster-robust SE**: Eliminated O(n × clusters) loop with
   pandas groupby aggregation

3. **Pre-computed data structures** (CallawaySantAnna): Wide-format outcome
   matrix and cohort masks computed once, reused across all ATT(g,t)
   calculations

4. **Vectorized bootstrap** (CallawaySantAnna): Matrix operations instead of
   nested loops, batch weight generation

5. **Vectorized WIF computation** (CallawaySantAnna, v2.0.3): Weight influence
   function computation uses NumPy matrix operations instead of
   O(n_units × n_keepers) nested loops. The indicator matrix, if1/if2 matrices,
   and wif contribution are computed using broadcasting and matrix
   multiplication: ``wif_contrib = wif_matrix @ effects``

6. **Optional Rust backend** (v2.0.0): PyO3-based Rust extension for
   compute-intensive operations (OLS, robust variance, bootstrap weights,
   simplex projection)

Why is diff-diff Fast?
~~~~~~~~~~~~~~~~~~~~~~

1. **Optimized LAPACK**: scipy's gelsy driver for least squares
2. **Vectorized operations**: NumPy/pandas for matrix operations and aggregations
3. **Efficient memory access**: Pre-computed structures avoid repeated data reshaping
4. **Pure Python overhead minimized**: Hot paths use compiled NumPy/scipy routines
5.
   **Optional Rust acceleration**: Native code for bootstrap and optimization
   algorithms

Real-World Data Validation
--------------------------

In addition to synthetic data benchmarks, we validate diff-diff against the
**MPDTA (Minimum Wage and Teen Employment)** dataset, the canonical benchmark
used in Callaway & Sant'Anna (2021) and the R ``did`` package.

MPDTA Dataset
~~~~~~~~~~~~~

The MPDTA dataset contains county-level teen employment data with staggered
minimum wage policy changes:

- **500 counties** across 5 years (2003-2007)
- **2,500 observations** total
- **4 treatment cohorts**: Never-treated (309), 2004 (20), 2006 (40), 2007 (131)
- **Outcome**: Log teen employment (``lemp``)
- **Source**: Built into R's ``did`` package

Results Comparison
~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 25 25 25 25

   * - Metric
     - diff-diff
     - R did
     - Difference
   * - ATT
     - -0.039951
     - -0.039951
     - **0** (exact match)
   * - SE (analytical)
     - 0.0117
     - 0.0118
     - **< 1%**
   * - Time (10 reps)
     - 0.003s ± 0.000s
     - 0.039s ± 0.006s
     - **14.4x faster**

**Key Findings:**

1. **Point estimates match exactly**: The overall ATT of -0.039951 is identical
   between diff-diff and R's ``did`` package, validating the core estimation
   logic.

2. **Standard errors match exactly**: As of v2.0.2, analytical SEs use the
   corrected weight influence function formula, achieving 0.0% difference from
   R's ``did`` package. Both point estimates and standard errors are
   numerically equivalent.

3. **Performance**: diff-diff is ~14x faster than R on this real-world dataset
   at small scale. Performance scales differently at larger sizes (see
   performance tables above).

This validation on real-world data with known published results confirms that
diff-diff produces correct estimates that match the reference R implementation.
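
As an illustration, the tolerance thresholds from the Methodology section can
be applied to the MPDTA figures in the Results Comparison table. The helper
below is a minimal sketch, not the benchmark suite's own comparison script: the
function name ``check_tolerances`` is hypothetical, and the 95% normal
confidence interval (``z=1.96``) is an assumption, since this document does not
state how the compared intervals are constructed.

.. code-block:: python

   def check_tolerances(att_py, se_py, att_r, se_r, z=1.96):
       """Apply the documented thresholds to one Python-vs-R comparison."""
       att_abs = abs(att_py - att_r)
       att_rel = att_abs / abs(att_r) if att_r != 0 else float("inf")
       se_rel = abs(se_py - se_r) / abs(se_r)

       # Assumed normal-approximation intervals: estimate +/- z * SE.
       ci_py = (att_py - z * se_py, att_py + z * se_py)
       ci_r = (att_r - z * se_r, att_r + z * se_r)

       return {
           # Point estimates: absolute difference < 1e-4 or relative < 1%.
           "att_ok": att_abs < 1e-4 or att_rel < 0.01,
           # Standard errors: relative difference < 10%.
           "se_ok": se_rel < 0.10,
           # Confidence intervals: must overlap.
           "ci_overlap": ci_py[0] <= ci_r[1] and ci_r[0] <= ci_py[1],
       }


   # MPDTA values as reported in the Results Comparison table.
   checks = check_tolerances(att_py=-0.039951, se_py=0.0117, att_r=-0.039951, se_r=0.0118)
   print(checks)

With the rounded table values the relative SE gap is about 0.85%, well inside
the 10% threshold.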

Survey Real-Data Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to synthetic-data survey cross-validation (see
``test_survey_r_crossvalidation.py``), diff-diff's survey variance is validated
against R's ``survey`` package using three real federal survey datasets. All
comparisons match to machine precision (differences below 1e-9).

**Datasets:**

.. list-table::
   :header-rows: 1
   :widths: 15 20 15 20 30

   * - Dataset
     - Source
     - Size
     - Survey Design
     - Policy Context
   * - API (apistrat)
     - R ``survey`` package
     - 200 schools
     - Strata + FPC + weights
     - California school accountability (PSAA 1999)
   * - NHANES
     - CDC/NCHS
     - 2,946 adults
     - Strata + PSU + weights (nest)
     - ACA young adult coverage provision (2010)
   * - RECS 2020
     - U.S. EIA
     - 2,000 households
     - 60 JK1 replicate weights
     - Residential energy consumption survey

**Suite A: API Dataset (TSL Variance)**

.. list-table::
   :header-rows: 1
   :widths: 10 30 20 20 20

   * - Test
     - Design Variant
     - ATT Gap
     - SE Gap
     - df
   * - A1
     - Strata + FPC + weights
     - 2.1e-12
     - 3.2e-11 (0.0000%)
     - Exact
   * - A2
     - Strata + weights (no FPC)
     - 2.1e-12
     - 5.3e-11 (0.0000%)
     - Exact
   * - A3
     - Weights only
     - 2.1e-12
     - 2.7e-11 (0.0000%)
     - Exact
   * - A4
     - TWFE (strata + FPC + weights)
     - 2.1e-12 (ATT only)
     - n/a (TWFE absorbs unit FE)
     - Exact
   * - A5
     - Subpopulation (elementary)
     - 1.5e-11
     - 7.5e-12 (0.0000%)
     - Differs (see note)
   * - A6
     - Covariates (meals, ell)
     - 2.2e-12
     - 8.4e-12 (0.0000%)
     - Exact
   * - A7
     - Fay's BRR replicates (rho=0.3)
     - 2.1e-12
     - 7.7e-11 (0.0000%)
     - Exact

**Suite B: NHANES (TSL with Strata + PSU + nest=TRUE)**

.. list-table::
   :header-rows: 1
   :widths: 10 30 20 20 20

   * - Test
     - Design Variant
     - ATT Gap
     - SE Gap
     - df
   * - B1
     - Strata + PSU + weights
     - 4.6e-13
     - 2.3e-14 (0.0000%)
     - Exact (31)
   * - B2
     - Covariates (gender, poverty)
     - 4.9e-13
     - 2.3e-13 (0.0000%)
     - Exact
   * - B3
     - Weights only
     - 4.6e-13
     - 2.6e-13 (0.0000%)
     - Exact
   * - B4
     - Subpopulation (female)
     - 1.1e-13
     - 8.7e-14 (0.0000%)
     - Exact

**Suite C: RECS 2020 (JK1 Replicate Weights)**

.. list-table::
   :header-rows: 1
   :widths: 10 30 20 20 20

   * - Test
     - Model
     - Coef Gap
     - SE Gap
     - df
   * - C1
     - TOTALBTU ~ KOWNRENT
     - 1.5e-11
     - 2.0e-11 (0.0000%)
     - Exact (59)
   * - C2
     - \+ TYPEHUQ + REGIONC
     - 3.8e-10
     - 2.9e-11 (0.0000%)
     - Exact (59)

**Key Findings:**

1. **Machine-precision agreement** on ATT, SE, df, and CI wherever directly
   comparable: differences are below 1e-9 (floating-point rounding only; the
   largest gap in the suite tables is 3.8e-10). Tolerances are set to 1e-8 in
   the test suite.

2. **All survey design features validated with real data:** stratification,
   PSU clustering, FPC corrections, probability weight normalization, nested
   PSU handling (``nest=TRUE``), subpopulation analysis, covariate adjustment,
   Fay's BRR (212 replicates), and JK1 replicate weight variance.

3. **Known differences:** A4 (TWFE) validates ATT only; SE differs because
   TWFE absorbs unit fixed effects. A5 (subpopulation) validates ATT/SE but df
   differs: ``subpopulation()`` preserves all strata (df=397) while R's
   ``subset()`` drops empty strata (df=199). This is a documented deviation
   (see REGISTRY.md); the diff-diff approach is conservative per Lumley (2004).

Survey Estimator Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Four additional estimators are validated against R's ``survey::svyglm()`` using
synthetic staggered-adoption and DDD datasets. Each estimator reduces to a WLS
regression under survey weights, so the R comparison fits the equivalent
``svyglm()`` model and compares coefficients and standard errors.

**Data:** 150-unit staggered panel (5 periods, 4 strata, 10 PSUs, FPC) with
cohorts at *t* = 3 and *t* = 4; 200-observation DDD cross-section (4 strata,
10 PSUs, FPC). Both generated with seed 42.

.. list-table::
   :header-rows: 1
   :widths: 8 18 30 15 15 14

   * - Test
     - Estimator
     - R Comparison
     - Coef Gap
     - SE Gap
     - Tolerance
   * - S1
     - ``ImputationDiD``
     - ``svyglm()`` on control-only (Omega_0) FE regression; covariate coefficients
     - < 1e-10
     - 0.00%
     - 1.5%
   * - S2
     - ``StackedDiD``
     - ``svyglm()`` on stacked dataset with Q-weight x survey weight composition
     - < 1e-10
     - 0.77%
     - 1.5%
   * - S3
     - ``SunAbraham``
     - ``svyglm()`` with cohort x period interactions; IW-aggregated ATT
     - < 1e-11
     - 0.00%
     - 1.5%
   * - S4
     - ``TripleDifference``
     - ``svyglm()`` three-way interaction (``group:partition:time``)
     - < 1e-10
     - 0.36%
     - 1.5%

**Key details:**

- **S1** validates the WLS building block that ``ImputationDiD`` uses
  internally (control-only regression with absorbed unit + time FE and
  time-varying covariates). A companion smoke test confirms
  ``ImputationDiD.fit()`` produces finite ATT/SE under survey weights.
- **S2** replicates the full stacking pipeline in R: sub-experiment
  construction, sample-share Q-weight computation, Q x survey weight
  composition with normalization, then ``svyglm()`` on the stacked data with
  strata/PSU structure. The 0.77% SE gap arises because R omits FPC on the
  stacked data while Python re-resolves the full survey design.
- **S3** compares both individual cohort x relative-time effects and the
  IW-aggregated overall ATT (with survey-weighted cohort masses and
  delta-method SE via the vcov submatrix).
- **S4** exploits the algebraic equivalence between the pairwise DDD
  decomposition (``estimation_method="reg"``, no covariates) and the three-way
  interaction coefficient from a single OLS regression.

Reproducing Survey Estimator Validation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Generate golden values
   Rscript benchmarks/R/benchmark_survey_estimators.R

   # Run validation tests
   pytest tests/test_survey_estimator_validation.py -v

Reproducing Survey Real-Data Validation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # 1. Generate API golden values (no download needed; data ships with R)
   Rscript benchmarks/R/benchmark_realdata_api.R

   # 2. Download and process NHANES data from CDC
   python benchmarks/scripts/download_nhanes.py
   Rscript benchmarks/R/benchmark_realdata_nhanes.R

   # 3. Download and subset RECS 2020 from EIA
   python benchmarks/scripts/download_recs.py
   Rscript benchmarks/R/benchmark_realdata_recs.R

   # 4. Run validation tests
   pytest tests/test_survey_real_data.py -v

Reproducing Benchmarks
----------------------

Prerequisites
~~~~~~~~~~~~~

1. Install R (>= 4.0):

   .. code-block:: bash

      # macOS
      brew install r

2. Install R packages:

   .. code-block:: bash

      Rscript benchmarks/R/requirements.R

3. Install diff-diff:

   .. code-block:: bash

      pip install -e ".[dev]"

Running Benchmarks
~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Run all benchmarks at small scale
   python benchmarks/run_benchmarks.py --all

   # Run all benchmarks at all scales with 3 replications
   python benchmarks/run_benchmarks.py --all --scale all --replications 3

   # Run specific estimator at specific scale
   python benchmarks/run_benchmarks.py --estimator callaway --scale 1k --replications 3
   python benchmarks/run_benchmarks.py --estimator synthdid --scale small --replications 3
   python benchmarks/run_benchmarks.py --estimator basic --scale 20k --replications 3
   python benchmarks/run_benchmarks.py --estimator multiperiod --scale small --replications 3

   # Available scales: small, 1k, 5k, 10k, 20k, all
   # Default: small (backward compatible)

   # Generate synthetic data only
   python benchmarks/run_benchmarks.py --generate-data-only --scale all

The benchmarks run both pure Python and Rust backends automatically, producing
a three-way comparison table (R vs Python Pure vs Python Rust).
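
The "mean ± std over replications" figures reported throughout this document
can be reproduced in spirit with a few lines of standard-library Python. This
is a minimal sketch and not the code in ``run_benchmarks.py``: the helper name
``time_replications`` and the placeholder workload are hypothetical, and the
use of the sample standard deviation is an assumption.

.. code-block:: python

   import statistics
   import time


   def time_replications(fn, replications=3):
       """Run ``fn`` several times and return (mean, std) of wall-clock seconds."""
       durations = []
       for _ in range(replications):
           start = time.perf_counter()
           fn()
           durations.append(time.perf_counter() - start)
       # Sample standard deviation; needs at least two replications.
       return statistics.mean(durations), statistics.stdev(durations)


   def workload():
       """Placeholder for an estimator fit (hypothetical)."""
       sum(i * i for i in range(100_000))


   mean_s, std_s = time_replications(workload, replications=3)
   print(f"{mean_s:.3f} ± {std_s:.3f} s")

With only 3 replications the standard deviation is a rough indicator of
run-to-run noise rather than a precise estimate.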

Output
~~~~~~

Results are saved to:

- ``benchmarks/results/accuracy/`` - JSON files with estimates
- ``benchmarks/results/comparison_report.txt`` - Summary report

Interpretation Notes
--------------------

When to Trust Results
~~~~~~~~~~~~~~~~~~~~~

- **BasicDiD/TWFE**: Results are identical to R. Use with confidence.

- **MultiPeriodDiD**: Results are identical to R's ``fixest::feols`` with
  ``treated * time_f | unit`` interaction syntax (unit FE absorbed). Both
  average ATT and all period-level effects match to machine precision. Use
  with confidence.

- **SyntheticDiD**: Point estimates are numerically identical (< 1e-10 diff)
  and standard errors match closely (0.3% diff at small scale). Both
  implementations use Frank-Wolfe optimization with identical weights. Use
  ``variance_method="placebo"`` (default) to match R's inference. Results are
  fully validated.

- **CallawaySantAnna**: Both group-time effects (ATT(g,t)) and overall ATT
  aggregation match R exactly. Standard errors are numerically equivalent
  (0.0% difference) as of v2.0.2.

Known Differences
~~~~~~~~~~~~~~~~~

1. **Inference Methods**: diff-diff defaults to analytical SEs; R ``did``
   defaults to multiplier bootstrap. Enable bootstrap in diff-diff for direct
   comparison.

2. **Aggregation Weights**: Overall ATT is a weighted average of ATT(g,t).
   Weighting schemes may differ between implementations.

3. **Placebo Variance**: SyntheticDiD SE estimates differ slightly (0.3-7%)
   across implementations due to Monte Carlo variance in the placebo
   procedure. Point estimates and unit/time weights are numerically identical
   since both implementations use the same Frank-Wolfe optimizer.
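
The seed dependence described under Placebo Variance can be illustrated with a
small simulation. The sketch below is not the SyntheticDiD implementation: it
substitutes a plain difference-in-differences estimate for the synthetic DiD
estimator, and the panel shape, the number of post periods, the replication
count, and all function names are hypothetical. It only shows that permuting
control units into pseudo-treated groups yields an SE that varies with the
random seed.

.. code-block:: python

   import numpy as np


   def did_estimate(Y, treated_idx, control_idx, n_post):
       """Plain DiD on a units x periods matrix (stand-in for the SDID estimator)."""
       change = Y[:, -n_post:].mean(axis=1) - Y[:, :-n_post].mean(axis=1)
       return change[treated_idx].mean() - change[control_idx].mean()


   def placebo_se(Y_control, n_treated, n_post, n_reps, seed):
       """Std. dev. of estimates over random pseudo-treated draws from the controls."""
       rng = np.random.default_rng(seed)
       estimates = np.empty(n_reps)
       for b in range(n_reps):
           perm = rng.permutation(Y_control.shape[0])
           estimates[b] = did_estimate(
               Y_control, perm[:n_treated], perm[n_treated:], n_post
           )
       return estimates.std()


   # Hypothetical control panel: 40 units x 20 periods of pure noise.
   Y_control = np.random.default_rng(0).normal(size=(40, 20))

   se_seed_1 = placebo_se(Y_control, n_treated=10, n_post=5, n_reps=200, seed=1)
   se_seed_2 = placebo_se(Y_control, n_treated=10, n_post=5, n_reps=200, seed=2)
   rel_gap = abs(se_seed_1 - se_seed_2) / se_seed_2
   print(f"SE (seed 1) = {se_seed_1:.4f}, SE (seed 2) = {se_seed_2:.4f}, gap = {rel_gap:.1%}")

The two SEs are computed from identical data and differ only through the
placebo draws, which is the same mechanism behind the small SE gaps between
implementations that use different random number generators.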