Interactive notebook
This tutorial is a Jupyter notebook. You can view it on GitHub or download it to run locally.
Survey-Aware Difference-in-Differences#
You’re evaluating a state-level preventive care program using data from a stratified household health survey. States adopted the program at different times, and you need to account for the survey’s complex sampling design when estimating treatment effects.
Most large-scale surveys – the American Community Survey (ACS), Behavioral Risk Factor Surveillance System (BRFSS), Current Population Survey (CPS), Medical Expenditure Panel Survey (MEPS) – use stratified multi-stage sampling. Ignoring this structure when running DiD leads to incorrect standard errors and potentially misleading conclusions.
diff-diff is the only package (R or Python) with full design-based variance estimation for modern heterogeneity-robust DiD estimators.
This tutorial covers:
Why survey design matters for DiD
The
SurveyDesignobjectBasic DiD with survey design
Staggered DiD with survey weights (Callaway-Sant’Anna)
Replicate weights workflow
Subpopulation analysis
Design effect (DEFF) diagnostics
Repeated cross-sections with survey design
Which estimators support survey design
Prerequisites: Familiarity with basic DiD (Tutorial 01) and staggered DiD (Tutorial 02).
[ ]:
import numpy as np
import pandas as pd
from diff_diff import (
CallawaySantAnna,
DifferenceInDifferences,
SurveyDesign,
generate_survey_did_data,
)
from diff_diff.linalg import LinearRegression
# For nicer plots (optional)
try:
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
HAS_MATPLOTLIB = True
except ImportError:
HAS_MATPLOTLIB = False
print("matplotlib not installed - visualization examples will be skipped")
1. Why Survey Design Matters for DiD#
Our health survey, like most large-scale surveys, uses a stratified multi-stage design. Ignoring this structure leads to three problems:
Clustering inflates variance. Respondents in the same census tract (Primary Sampling Unit, or PSU) share local health infrastructure, economic conditions, and provider networks. Their outcomes are correlated, so there’s less independent information than the raw sample size suggests. Naive standard errors are too small, leading to over-rejection of null hypotheses.
Unequal weights shift point estimates. Rural areas are oversampled for adequate representation, so rural respondents get lower sampling weights. Without weights, point estimates are biased toward urban respondents (Solon, Haider & Wooldridge 2015).
Stratification and FPC affect variance. The survey sampled a known fraction of census tracts per geographic region. Ignoring the finite population correction (FPC) overstates variance; ignoring stratification misses efficiency gains.
Let’s see this in practice. We’ll prove all three using synthetic data where we know the true answer – so we can measure exactly how wrong the naive approach is.
[ ]:
# Generate synthetic survey data: a state-level preventive care program
# evaluated with a stratified household health survey
df = generate_survey_did_data(
n_units=200,
n_periods=8,
cohort_periods=[3, 5], # States adopted at periods 3 and 5
never_treated_frac=0.3, # 30% of respondents in never-adopting states
treatment_effect=2.0, # Base TE = 2.0 (realized population ATT differs
# due to heterogeneous effects + informative weights)
n_strata=5, # 5 geographic regions (large metro, small metro, etc.)
psu_per_stratum=8, # 8 census tracts sampled per region
fpc_per_stratum=200.0, # 200 total tracts per region
weight_variation='moderate', # Weights vary by stratum (~1.0 to 2.0)
psu_re_sd=2.0, # Strong PSU-level clustering
psu_period_factor=1.0, # Time-varying PSU shocks survive DiD differencing
informative_sampling=True, # Weights correlate with outcomes (realistic)
heterogeneous_te_by_strata=True, # TE varies by stratum (0.6 to 3.4)
return_true_population_att=True, # Store ground truth in df.attrs
seed=42,
)
print(f"Dataset: {len(df)} observations, {df['unit'].nunique()} respondents, {df['period'].nunique()} periods")
print(f"\nSurvey structure:")
print(f" Strata (regions): {df['stratum'].nunique()}")
print(f" PSUs (census tracts): {df['psu'].nunique()}")
print(f" Weight range: [{df['weight'].min():.1f}, {df['weight'].max():.1f}]")
print(f" FPC per stratum: {df['fpc'].iloc[0]:.0f}")
print(f"\nTreatment cohorts:")
print(df.groupby('unit')['first_treat'].first().value_counts().sort_index())
# Ground truth from the DGP
truth = df.attrs["dgp_truth"]
print(f"\nGround truth (known from the data-generating process):")
print(f" Population ATT: {truth['population_att']:.3f}")
print(f" Stratum TEs: {', '.join(f'{v:.2f}' for v in truth['base_stratum_effects'].values())}")
print(f" ICC (realized): {truth['icc_realized']:.3f}")
print(f" DEFF (Kish): {truth['deff_kish']:.3f}")
print()
df.head(10)
[ ]:
import warnings
# Suppress RuntimeWarnings from internal matrix operations (numpy matmul overflow/divide-by-zero)
# These are numerical noise, not data issues. UserWarnings from the library are kept visible.
warnings.filterwarnings('ignore', category=RuntimeWarning, message=r'.*matmul.*')
# Compare Callaway-Sant'Anna: naive vs. survey-aware
sd = SurveyDesign(weights='weight', strata='stratum', psu='psu', fpc='fpc')
cs_naive = CallawaySantAnna(control_group='never_treated')
results_naive = cs_naive.fit(
df, outcome='outcome', unit='unit', time='period',
first_treat='first_treat', aggregate='event_study'
)
cs_survey = CallawaySantAnna(control_group='never_treated')
results_survey = cs_survey.fit(
df, outcome='outcome', unit='unit', time='period',
first_treat='first_treat', aggregate='event_study',
survey_design=sd,
)
true_att = df.attrs["dgp_truth"]["population_att"]
print(f"Overall ATT:")
print(f" True (population): {true_att:.4f}")
print(f" Naive estimate: {results_naive.overall_att:.4f} (bias: {(results_naive.overall_att - true_att)/true_att:+.1%})")
print(f" Survey estimate: {results_survey.overall_att:.4f} (bias: {(results_survey.overall_att - true_att)/true_att:+.1%})")
print(f"\nStandard errors:")
print(f" Naive SE: {results_naive.overall_se:.4f}")
print(f" Survey SE: {results_survey.overall_se:.4f} ({results_survey.overall_se / results_naive.overall_se:.1f}x larger)")
ci_n = results_naive.overall_conf_int
ci_s = results_survey.overall_conf_int
covers_n = "YES" if ci_n[0] <= true_att <= ci_n[1] else "NO"
covers_s = "YES" if ci_s[0] <= true_att <= ci_s[1] else "NO"
print(f"\n95% confidence intervals:")
print(f" Naive: [{ci_n[0]:.3f}, {ci_n[1]:.3f}] width={ci_n[1]-ci_n[0]:.3f} covers truth: {covers_n}")
print(f" Survey: [{ci_s[0]:.3f}, {ci_s[1]:.3f}] width={ci_s[1]-ci_s[0]:.3f} covers truth: {covers_s}")
About the normalization warning: You’ll see pweight weights normalized to mean=1 throughout this tutorial. Survey weights are inverse selection probabilities – they rarely have mean=1 out of the box. The library rescales them internally so that weighted estimators are numerically stable. This is standard practice (Lumley 2004, §2.2). The warning confirms rescaling occurred; it is not an error.
[ ]:
if HAS_MATPLOTLIB:
# Event study comparison: naive vs. survey-corrected confidence intervals
fig, ax = plt.subplots(figsize=(10, 6))
event_times = sorted(results_naive.event_study_effects.keys())
offset = 0.1 # Horizontal offset for readability
for label, res, color, xoff in [
('Naive', results_naive, '#cccccc', -offset),
('Survey-aware', results_survey, '#2563eb', offset),
]:
effs = [res.event_study_effects[e]['effect'] for e in event_times]
# Use the returned conf_int (respects survey df t-critical values)
ci_lo = [res.event_study_effects[e]['conf_int'][0] for e in event_times]
ci_hi = [res.event_study_effects[e]['conf_int'][1] for e in event_times]
yerr_lo = [e - lo for e, lo in zip(effs, ci_lo)]
yerr_hi = [hi - e for e, hi in zip(effs, ci_hi)]
xs = [e + xoff for e in event_times]
ax.errorbar(xs, effs, yerr=[yerr_lo, yerr_hi],
fmt='o', capsize=4, label=label, color=color,
markersize=6, linewidth=1.5)
# True ATT(e) by horizon: 0 for pre-treatment, weighted mean of true_effect
# for post-treatment (varies by horizon due to cohort composition)
df_treated = df[df['treated'] == 1].copy()
df_treated['_rel_time'] = df_treated['period'] - df_treated['first_treat']
true_att_e = []
for e in event_times:
if e < 0:
true_att_e.append(0.0)
else:
mask = df_treated['_rel_time'] == e
if mask.any():
true_att_e.append(
np.average(df_treated.loc[mask, 'true_effect'],
weights=df_treated.loc[mask, 'weight']))
else:
true_att_e.append(np.nan)
ax.plot(event_times, true_att_e, 's--', color='red', markersize=5,
linewidth=1.5, alpha=0.7, label='True ATT(e)')
ax.axhline(y=0, color='black', linewidth=0.5, linestyle='-')
ax.axvline(x=-0.5, color='gray', linewidth=0.5, linestyle='--')
ax.set_xlabel('Periods Relative to Treatment')
ax.set_ylabel('ATT')
ax.set_title('Event Study: Naive vs. Survey-Aware Confidence Intervals')
ax.legend()
plt.tight_layout()
plt.show()
What just happened: The naive ATT is biased low – unweighted estimation lets over-represented strata (which happen to have smaller treatment effects in this DGP) dominate the average. The confidence intervals that look reassuringly tight? They’re based on standard errors that are roughly half what they should be, because clustering within census tracts is ignored.
But one random draw could be a fluke. The next two subsections quantify how badly naive inference fails on average.
How Much Are Your Standard Errors Wrong?#
The SE ratio quantifies how much ignoring the survey design understates uncertainty:
Two sources drive this gap:
Clustering (dominant): respondents in the same PSU share local conditions, so their outcomes are correlated (ICC = 0.84 in this DGP). Ignoring this makes standard errors far too small.
Unequal weights: oversampled groups get higher weights, adding variance. The Kish design effect from weight dispersion alone is ~1.4.
Note: this SE ratio compares two different estimators (survey-weighted vs. unweighted), so it is not a formal design effect (DEFF). The library’s compute_deff() computes the proper DEFF against an SRS baseline for the same weighted estimator – Section 7 covers that API.
[ ]:
# How much does ignoring survey design understate standard errors?
se_ratio = results_survey.overall_se / results_naive.overall_se
truth = df.attrs["dgp_truth"]
print(f"SE inflation (survey / naive): {se_ratio:.2f}x")
print(f" Naive SEs are {1/se_ratio:.0%} of what they should be")
print(f"\nContributing factors (from DGP):")
print(f" ICC (realized): {truth['icc_realized']:.3f} (clustering strength)")
print(f" DEFF (Kish): {truth['deff_kish']:.3f} (weight dispersion alone)")
Simulation Evidence: Does the CI Actually Work?#
A single draw is suggestive but not proof. The definitive test: across 200 independent datasets drawn from the same DGP, does the 95% CI cover the true ATT 95% of the time? And how often does naive inference falsely detect pre-trends that don’t exist?
[ ]:
# Monte Carlo simulation: 200 draws (~5 seconds)
import warnings as _w
_w.filterwarnings('ignore', category=UserWarning, message=r'.*pweight.*')
_w.filterwarnings('ignore', category=RuntimeWarning)
n_sim = 200
covers_naive, covers_survey = 0, 0
false_pretrend_naive, false_pretrend_survey = 0, 0
per_coef_naive = {}
per_coef_survey = {}
for sim in range(n_sim):
df_sim = generate_survey_did_data(
n_units=200, n_periods=8, cohort_periods=[3, 5],
never_treated_frac=0.3, treatment_effect=2.0,
n_strata=5, psu_per_stratum=8, fpc_per_stratum=200.0,
weight_variation="moderate", psu_re_sd=2.0, psu_period_factor=1.0,
informative_sampling=True, heterogeneous_te_by_strata=True,
return_true_population_att=True, seed=sim,
)
true_att_sim = df_sim.attrs["dgp_truth"]["population_att"]
sd_sim = SurveyDesign(weights="weight", strata="stratum", psu="psu", fpc="fpc")
res_n = CallawaySantAnna(control_group="never_treated").fit(
df_sim, outcome="outcome", unit="unit", time="period",
first_treat="first_treat", aggregate="event_study")
res_s = CallawaySantAnna(control_group="never_treated").fit(
df_sim, outcome="outcome", unit="unit", time="period",
first_treat="first_treat", aggregate="event_study",
survey_design=sd_sim)
# CI coverage (using proper critical values from result objects)
ci_n = res_n.overall_conf_int
ci_s = res_s.overall_conf_int
if ci_n[0] <= true_att_sim <= ci_n[1]:
covers_naive += 1
if ci_s[0] <= true_att_sim <= ci_s[1]:
covers_survey += 1
# False pre-trends (3 pre-treatment coefficients: e=-3, -2, -1)
pre_periods = [e for e in res_n.event_study_effects if e < 0]
if any(res_n.event_study_effects[e]["p_value"] < 0.05 for e in pre_periods):
false_pretrend_naive += 1
if any(res_s.event_study_effects[e]["p_value"] < 0.05 for e in pre_periods):
false_pretrend_survey += 1
for e in pre_periods:
per_coef_naive.setdefault(e, 0)
per_coef_survey.setdefault(e, 0)
if res_n.event_study_effects[e]["p_value"] < 0.05:
per_coef_naive[e] += 1
if res_s.event_study_effects[e]["p_value"] < 0.05:
per_coef_survey[e] += 1
# Results
print(f"{'Metric':<40s} {'Naive':>10s} {'Survey':>10s} {'Target':>10s}")
print("-" * 72)
print(f"{'95% CI coverage of true ATT':<40s} {100*covers_naive/n_sim:>9.1f}% {100*covers_survey/n_sim:>9.1f}% {'95.0%':>10s}")
n_pre = len(per_coef_naive)
target_any = 100 * (1 - 0.95**n_pre)
print(f"{'False pre-trend (any coef, p<0.05)':<40s} {100*false_pretrend_naive/n_sim:>9.1f}% {100*false_pretrend_survey/n_sim:>9.1f}% {target_any:>9.1f}%")
print()
print("Per-coefficient false positive rates (should be ~5% each):")
for e in sorted(per_coef_naive):
print(f" e={e:+d}: Naive {100*per_coef_naive[e]/n_sim:5.1f}% Survey {100*per_coef_survey[e]/n_sim:5.1f}%")
Bottom line: The naive approach produces 95% CIs that cover the truth far less than 95% of the time – roughly one in three studies using flat weights would exclude the true effect. Even worse, naive inference falsely detects pre-trends (which don’t exist in the DGP) in the majority of draws, potentially leading researchers to wrongly doubt the parallel trends assumption.
The survey-aware estimator isn’t perfectly calibrated either – finite-sample approximations with only 200 units and complex survey structure make that hard to achieve – but it is dramatically closer to correct inference.
The rest of this tutorial shows the mechanics: how to specify SurveyDesign, pass it to every estimator, handle replicate weights, subpopulations, and DEFF diagnostics. The evidence above is why it matters.
2. The Survey Design Object#
Our health survey has four design elements that map directly to SurveyDesign parameters:
Survey Element |
In Our Data |
|
What It Represents |
|---|---|---|---|
Sampling weights |
|
|
Inverse selection probability (how many population units each respondent represents) |
Geographic strata |
|
|
Region types (large metro, small metro, rural, etc.) |
Census tracts |
|
|
Primary Sampling Units – clusters of respondents |
Tracts per stratum |
|
|
Finite population correction (total PSUs in the population per stratum) |
[ ]:
# Create the survey design object
sd = SurveyDesign(
weights='weight', # Column name containing sampling weights
strata='stratum', # Column name for stratification variable
psu='psu', # Column name for primary sampling units (clusters)
fpc='fpc', # Column name for finite population correction
)
print(sd)
Additional options#
``weight_type``: Default
"pweight"(inverse probability weights). Use"fweight"for frequency weights or"aweight"for analytical weights.``nest``: Set
Truewhen PSU IDs are only unique within strata (e.g., PSU 1 in stratum A is different from PSU 1 in stratum B). DefaultFalseassumes globally unique PSU IDs.``lonely_psu``: How to handle strata with a single PSU (“singleton strata”):
"remove"(default),"certainty"(zero variance contribution), or"adjust"(average across other strata).
3. Basic DiD with Survey Design#
Let’s start simple: a 2x2 comparison for respondents in states that adopted the program at period 3, vs. respondents in states that never adopted.
[ ]:
# Filter to cohort 3 vs. never-treated, periods 2 (pre) and 3 (post)
cohort3 = df[(df['first_treat'].isin([0, 3])) & (df['period'].isin([2, 3]))].copy()
cohort3['post'] = (cohort3['period'] == 3).astype(int)
cohort3['treat'] = (cohort3['first_treat'] == 3).astype(int)
print(f"2x2 subset: {len(cohort3)} observations, {cohort3['unit'].nunique()} respondents")
[ ]:
# Fit without survey design (naive)
did_naive = DifferenceInDifferences()
results_no_survey = did_naive.fit(
cohort3, outcome='outcome', treatment='treat', time='post'
)
print("=== Naive DiD (no survey design) ===")
print(results_no_survey.summary())
[ ]:
# Fit with survey design
results_with_survey = did_naive.fit(
cohort3, outcome='outcome', treatment='treat', time='post',
survey_design=sd,
)
print("=== DiD with Survey Design ===")
print(results_with_survey.summary())
Understanding the Survey Design Block#
The summary now includes a Survey Design section with key diagnostics:
Weight type:
pweight= inverse probability weights (the default). Each respondent’s weight indicates how many population members they represent.Effective sample size: The Kish (1965) formula \((\sum w_i)^2 / \sum w_i^2\) adjusts the nominal sample size for weight variation. If all weights are equal, effective \(n\) = nominal \(n\). Unequal weights reduce the effective sample size.
Kish DEFF (weights): The design effect from weight variation alone. DEFF of 1.2 means weights add 20% more variance than equal weights.
Survey d.f.: Degrees of freedom = \(n_{PSU} - n_{strata}\). This controls the \(t\)-distribution critical values. With few PSUs, confidence intervals are wider than with normal critical values.
[ ]:
# Side-by-side comparison
print(f"{'':25s} {'Naive':>12s} {'Survey':>12s}")
print("-" * 50)
print(f"{'ATT':25s} {results_no_survey.att:>12.4f} {results_with_survey.att:>12.4f}")
print(f"{'SE':25s} {results_no_survey.se:>12.4f} {results_with_survey.se:>12.4f}")
ci_n = results_no_survey.conf_int
ci_s = results_with_survey.conf_int
print(f"{'CI width':25s} {ci_n[1]-ci_n[0]:>12.4f} {ci_s[1]-ci_s[0]:>12.4f}")
print(f"{'p-value':25s} {results_no_survey.p_value:>12.4f} {results_with_survey.p_value:>12.4f}")
print()
print("Note: ATT changes because survey weights shift the point estimate.")
print("The survey SE accounts for the complex sampling design (TSL variance).")
4. Staggered DiD with Survey Design#
In practice, states adopted the preventive care program at different times. This is a staggered adoption design. CallawaySantAnna handles both the staggered timing and the survey design.
When survey_design is provided, CallawaySantAnna uses:
Survey weights for point estimation
Taylor Series Linearization (TSL) for variance estimation, accounting for strata, PSU clustering, and FPC
Survey degrees of freedom (\(n_{PSU} - n_{strata}\)) for \(t\)-distribution critical values
[ ]:
# CS without survey design
cs_naive = CallawaySantAnna(control_group='never_treated')
results_cs_naive = cs_naive.fit(
df, outcome='outcome', unit='unit', time='period',
first_treat='first_treat', aggregate='all'
)
print("=== CS without survey design ===")
print(f"Overall ATT: {results_cs_naive.overall_att:.4f} (SE: {results_cs_naive.overall_se:.4f})")
[ ]:
# CS with survey design
cs_survey = CallawaySantAnna(control_group='never_treated')
results_cs_survey = cs_survey.fit(
df, outcome='outcome', unit='unit', time='period',
first_treat='first_treat', aggregate='all',
survey_design=sd,
)
print("=== CS with survey design ===")
print(results_cs_survey.summary())
[ ]:
if HAS_MATPLOTLIB:
# Event study overlay: naive vs. survey-corrected
fig, ax = plt.subplots(figsize=(10, 6))
event_times = sorted(results_cs_naive.event_study_effects.keys())
offset = 0.1
for label, res, color, xoff, alpha in [
('Naive (ignoring design)', results_cs_naive, '#cccccc', -offset, 0.8),
('Survey-aware (TSL)', results_cs_survey, '#2563eb', offset, 1.0),
]:
effs = [res.event_study_effects[e]['effect'] for e in event_times]
# Use returned conf_int (respects survey df t-critical values)
ci_lo = [res.event_study_effects[e]['conf_int'][0] for e in event_times]
ci_hi = [res.event_study_effects[e]['conf_int'][1] for e in event_times]
yerr_lo = [e - lo for e, lo in zip(effs, ci_lo)]
yerr_hi = [hi - e for e, hi in zip(effs, ci_hi)]
xs = [e + xoff for e in event_times]
ax.errorbar(xs, effs, yerr=[yerr_lo, yerr_hi],
fmt='o', capsize=4, label=label, color=color,
markersize=6, linewidth=1.5, alpha=alpha)
ax.axhline(y=0, color='black', linewidth=0.5)
ax.axvline(x=-0.5, color='gray', linewidth=0.5, linestyle='--')
ax.set_xlabel('Periods Relative to Treatment')
ax.set_ylabel('ATT')
ax.set_title('Event Study: Naive vs. Survey-Aware Inference')
ax.legend()
plt.tight_layout()
plt.show()
The survey-corrected confidence intervals are consistently wider. An effect that appears statistically significant when ignoring the survey design may not survive proper inference – a real consequence of ignoring the sampling structure.
[ ]:
# Event study: SE comparison table
print(f"{'Event Time':>12s} {'Naive SE':>12s} {'Survey SE':>12s} {'Ratio':>8s}")
print("-" * 46)
for e in sorted(results_cs_naive.event_study_effects.keys()):
se_n = results_cs_naive.event_study_effects[e]['se']
se_s = results_cs_survey.event_study_effects[e]['se']
print(f"{e:>+12d} {se_n:>12.4f} {se_s:>12.4f} {se_s/se_n:>7.2f}x")
[ ]:
# Group-level ATT with survey design
print("Group-level ATT (by treatment cohort):")
print(f"{'Cohort':>10s} {'ATT':>10s} {'SE':>10s} {'95% CI':>25s}")
print("-" * 58)
for g, eff in results_cs_survey.group_effects.items():
ci = eff['conf_int']
print(f"{g:>10d} {eff['effect']:>10.4f} {eff['se']:>10.4f} [{ci[0]:>8.4f}, {ci[1]:>8.4f}]")
5. Replicate Weights#
Some surveys (MEPS, ACS PUMS) publish replicate weights instead of strata/PSU/FPC columns. The design information is encoded in the replicates themselves – each replicate represents a perturbed version of the full sample.
Replicate weight methods:
JK1 (Jackknife delete-1): Drop one PSU at a time, rescale remaining weights. Simplest method.
JKn (Jackknife delete-n): Stratified jackknife, drops one PSU per stratum.
BRR (Balanced Repeated Replication): Halve each stratum, reweight.
Fay’s BRR: Modified BRR with a damping factor (0 < rho < 1).
SurveyDesign accepts replicate weights as an alternative to strata/PSU/FPC. They are mutually exclusive – use one or the other.
[ ]:
# Generate data with JK1 replicate weights
df_rep = generate_survey_did_data(
n_units=200, n_periods=8, cohort_periods=[3, 5],
treatment_effect=2.0, n_strata=5, psu_per_stratum=8,
include_replicate_weights=True, # Adds rep_0, rep_1, ..., rep_K columns
psu_period_factor=1.0,
seed=42,
)
# Find replicate weight columns
rep_cols = [c for c in df_rep.columns if c.startswith('rep_')]
print(f"Replicate weight columns: {len(rep_cols)} (JK1 delete-one-PSU)")
print(f"Example: {rep_cols[:5]}")
# Create a replicate-based survey design
sd_rep = SurveyDesign(
weights='weight',
replicate_weights=rep_cols,
replicate_method='JK1',
)
print(f"\n{sd_rep}")
[ ]:
# Fit CS with replicate weights
cs_rep = CallawaySantAnna(control_group='never_treated')
results_rep = cs_rep.fit(
df_rep, outcome='outcome', unit='unit', time='period',
first_treat='first_treat', survey_design=sd_rep,
)
print(f"Overall ATT: {results_rep.overall_att:.4f} (SE: {results_rep.overall_se:.4f})")
print(f"Survey d.f.: {results_rep.survey_metadata.df_survey}")
print(f"Replicate method: {results_rep.survey_metadata.replicate_method}")
print(f"Number of replicates: {results_rep.survey_metadata.n_replicates}")
print()
print("Note: JK1 replicates are unstratified (global delete-one-PSU).")
print("If your survey uses stratified sampling, stratified replicates (JKn)")
print("provide design-consistent variance estimation. Use whichever method")
print("your survey documentation specifies.")
6. Subpopulation Analysis#
Suppose we want to estimate the treatment effect only for respondents in metro areas (strata 0 and 1). A common mistake is to simply subset the data and ignore the survey design:
# WRONG: Subsetting and dropping survey design
metro_df = df[df['stratum'].isin([0, 1])]
model.fit(metro_df, ...) # No survey_design
This both drops design information from non-metro strata and ignores the survey structure entirely, biasing both point estimates and variance estimates. The correct approach: use subpopulation() to zero out weights for excluded observations while keeping the full design structure.
When restricting to a subpopulation, some (group, time) cells may lack enough observations to estimate, especially with staggered adoption. The warning tells you which cells were dropped – this is a real consideration when doing subpopulation analysis with smaller domains.
[ ]:
# Correct approach: use subpopulation()
sub_sd, sub_data = sd.subpopulation(
df, mask=lambda d: d['stratum'].isin([0, 1])
)
cs_sub = CallawaySantAnna(control_group='never_treated')
results_sub = cs_sub.fit(
sub_data, outcome='outcome', unit='unit', time='period',
first_treat='first_treat', survey_design=sub_sd,
)
# Naive approach: just filter rows and ignore survey design
metro_df = df[df['stratum'].isin([0, 1])].copy()
cs_naive_sub = CallawaySantAnna(control_group='never_treated')
results_naive_sub = cs_naive_sub.fit(
metro_df, outcome='outcome', unit='unit', time='period',
first_treat='first_treat',
# No survey_design -- common mistake
)
print(f"{'Method':25s} {'ATT':>10s} {'SE':>10s}")
print("-" * 47)
print(f"{'Subpopulation (correct)':25s} {results_sub.overall_att:>10.4f} {results_sub.overall_se:>10.4f}")
print(f"{'Naive subset (incorrect)':25s} {results_naive_sub.overall_att:>10.4f} {results_naive_sub.overall_se:>10.4f}")
print(f"\nThe subpopulation approach preserves the full design structure for")
print(f"correct variance estimation. Naive subsetting compounds two errors:")
print(f"losing design information and ignoring the survey structure entirely.")
7. DEFF Diagnostics#
The design effect (DEFF) quantifies how much the survey design inflates variance compared to a simple random sample (SRS) of the same size:
DEFF = 1: No design effect; the survey behaves like SRS for this coefficient.
DEFF > 1: The survey design costs precision. Clustering is the usual driver.
DEFF > 2: Substantial design effect – the effective sample size is less than half the nominal size.
DEFF < 1: The survey design improves precision for this coefficient. Stratification gains more than offset clustering. This can occur for difference-based estimators where time-invariant PSU effects cancel.
LinearRegression.compute_deff() gives per-coefficient DEFF diagnostics after fitting with a survey design.
[ ]:
# Fit a linear regression with survey design to get DEFF diagnostics
# Use the 2x2 subset with explicit treatment indicators
cohort3['treat_x_post'] = cohort3['treat'] * cohort3['post']
X_cols = ['treat', 'post', 'treat_x_post']
# Resolve the survey design (extracts numpy arrays from the DataFrame)
resolved_sd = sd.resolve(cohort3)
reg = LinearRegression(
include_intercept=True,
survey_design=resolved_sd,
)
reg.fit(
X=cohort3[X_cols].values,
y=cohort3['outcome'].values,
)
# Compute per-coefficient DEFF
deff = reg.compute_deff(coefficient_names=['intercept', 'treat', 'post', 'treat_x_post'])
print("Per-coefficient Design Effect (DEFF):")
print(f"{'Coefficient':20s} {'DEFF':>8s} {'Survey SE':>12s} {'SRS SE':>12s}")
print("-" * 55)
for name, d, se_surv, se_srs in zip(
deff.coefficient_names, deff.deff, deff.survey_se, deff.srs_se
):
flag = " **" if d > 2.0 else (" *" if d < 1.0 else "")
print(f"{name:20s} {d:>8.2f} {se_surv:>12.4f} {se_srs:>12.4f}{flag}")
print("\n** DEFF > 2: substantial design effect (clustering dominates)")
print(" * DEFF < 1: stratification improves precision")
[ ]:
if HAS_MATPLOTLIB:
fig, ax = plt.subplots(figsize=(8, 4))
names = deff.coefficient_names
y_pos = range(len(names))
ax.barh(y_pos, deff.deff, color='#2563eb', alpha=0.7)
ax.axvline(x=1.0, color='black', linewidth=1, linestyle='--', label='DEFF = 1 (no effect)')
ax.axvline(x=2.0, color='red', linewidth=1, linestyle=':', alpha=0.5, label='DEFF = 2 (substantial)')
ax.set_yticks(list(y_pos))
ax.set_yticklabels(names)
ax.set_xlabel('Design Effect (DEFF)')
ax.set_title('Per-Coefficient Design Effects')
ax.legend(loc='lower right')
plt.tight_layout()
plt.show()
8. Repeated Cross-Sections with Survey Design#
Many health surveys are repeated cross-sections – different respondents are sampled each period. Examples: BRFSS (monthly), CPS (monthly), ACS 1-year estimates.
For these designs, use CallawaySantAnna(panel=False). Survey weights, strata, PSU, and FPC all work the same way. Internally, the estimator uses the cross-sectional Doubly Robust DiD approach (Sant’Anna & Zhao 2020, Section 4) instead of within-unit differencing.
You’ll see a warning about stationary cross-sectional sampling below – this is a standard assumption reminder, not an error. It means the estimator assumes the population distribution of (Y, X, G) is stable across survey waves.
[ ]:
# Generate true repeated cross-sectional data:
# panel=False draws fresh respondent effects each period, with unique unit IDs
df_rc = generate_survey_did_data(
n_units=200, n_periods=8, cohort_periods=[3, 5],
treatment_effect=2.0, n_strata=5, psu_per_stratum=8,
panel=False, # Fresh respondents each period (repeated cross-section)
psu_period_factor=1.0,
seed=42,
)
print(f"Cross-sectional data: {df_rc['unit'].nunique()} unique respondents across {df_rc['period'].nunique()} periods")
print(f"Respondents per period: {df_rc.groupby('period')['unit'].nunique().iloc[0]}")
print(f"No unit appears in more than one period: {df_rc.groupby('unit')['period'].nunique().max() == 1}")
[ ]:
# Fit CS with panel=False for repeated cross-sections
cs_rc = CallawaySantAnna(
control_group='never_treated',
panel=False, # Repeated cross-section mode
)
results_rc = cs_rc.fit(
df_rc, outcome='outcome', unit='unit', time='period',
first_treat='first_treat', survey_design=sd,
)
print(f"Repeated Cross-Section with Survey Design:")
print(f" Overall ATT: {results_rc.overall_att:.4f} (SE: {results_rc.overall_se:.4f})")
print(f" Survey d.f.: {results_rc.survey_metadata.df_survey}")
print(f"\nPoint estimates may differ from the panel version because the")
print(f"estimator uses cross-sectional DRDID (Sant'Anna & Zhao 2020, Section 4).")
9. Which Estimators Support Survey Design?#
diff-diff supports survey design across all estimators, though the level of support varies:
Estimator |
Weights |
Strata/PSU/FPC (TSL) |
Replicate Weights |
Survey-Aware Bootstrap |
|---|---|---|---|---|
DifferenceInDifferences |
Full |
Full |
– |
– |
TwoWayFixedEffects |
Full |
Full |
– |
– |
MultiPeriodDiD |
Full |
Full |
– |
– |
CallawaySantAnna |
pweight only |
Full |
Full |
Multiplier at PSU |
TripleDifference |
pweight only |
Full |
Full (analytical) |
– |
StaggeredTripleDifference |
pweight only |
Full |
Full |
Multiplier at PSU |
SunAbraham |
Full |
Full |
– |
Rao-Wu rescaled |
StackedDiD |
pweight only |
Full (pweight only) |
– |
– |
ImputationDiD |
pweight only |
Partial (no FPC) |
– |
Multiplier at PSU |
TwoStageDiD |
pweight only |
Partial (no FPC) |
– |
Multiplier at PSU |
ContinuousDiD |
Full |
Full |
Full (analytical) |
Multiplier at PSU |
EfficientDiD |
Full |
Full |
Full (analytical) |
Multiplier at PSU |
SyntheticDiD |
pweight only |
Full (all three variance methods) |
– |
Hybrid pairs-bootstrap + Rao-Wu (bootstrap); stratified permutation (placebo); PSU-LOO (jackknife) |
TROP |
pweight only |
– |
– |
Rao-Wu rescaled |
BaconDecomposition |
Diagnostic |
Diagnostic |
– |
– |
Legend:
Full: All weight types (pweight/fweight/aweight) + strata/PSU/FPC + Taylor Series Linearization variance
Full (pweight only): Full TSL support with strata/PSU/FPC, but only accepts
pweightweight type (fweight/aweightrejected because Q-weight composition changes their semantics)Partial (no FPC): Weights + strata (for df) + PSU (for clustering); FPC raises
NotImplementedErrorpweight only (Weights column): Only
pweightaccepted;fweight/aweightraise an errorpweight only (TSL column): Sampling weights for point estimates; no strata/PSU/FPC design elements
Full (all three variance methods) (SyntheticDiD TSL column): Strata/PSU/FPC supported on all three
variance_methodchoices —bootstrapvia weighted Frank-Wolfe + Rao-Wu,placebovia stratified permutation + weighted FW,jackknifevia PSU-level LOO with stratum aggregation. Replicate-weight designs remain rejected (pre-existing limitation).Diagnostic: Weighted descriptive statistics only (no inference)
–: Not supported
Note on SyntheticDiD: all three variance methods now support full strata/PSU/FPC designs.
Bootstrap (PR #355) composes per-draw Rao-Wu rescaled weights with a weighted Frank-Wolfe variant of
_sc_weight_fw. Each draw solvesmin ||A·diag(rw)·ω - b||² + ζ²·Σ rw_i ω_i²and composesω_eff = rw·ω/Σ(rw·ω)for the SDID estimator. Pweight-only fits use the constant per-control survey weight asrw; full designs use Rao-Wu rescaling per draw.Placebo uses a stratified permutation allocator: pseudo-treated indices are drawn from controls within each stratum containing actual treated units; weighted FW re-estimates ω and λ per draw with per-control survey weights flowing into both loss and regularization. SE follows Arkhangelsky Algorithm 4. The allocator requires at least
n_treated_hcontrols per treated-containing stratum; fit-time guards raise targetedValueErroron infeasible configurations.Jackknife uses PSU-level leave-one-out with stratum aggregation:
SE² = Σ_h (1-f_h)·(n_h-1)/n_h·Σ_{j∈h}(τ̂_{(h,j)} - τ̄_h)²(Rust & Rao 1996). FPC folded via(1-f_h); strata with fewer than 2 PSUs are silently skipped. Known anti-conservatism with few PSUs per stratum — for tight SE calibration in that regime, prefervariance_method="bootstrap".
See docs/methodology/REGISTRY.md §SyntheticDiD Note (survey + bootstrap / placebo / jackknife composition) for the full objectives, allocator asymmetry rationale (placebo ignores PSU axis, jackknife respects it), and validation details.
Note: EfficientDiD supports covariates and survey_design simultaneously. The doubly-robust (DR) path threads survey weights through WLS outcome regression, weighted sieve propensity ratios, and survey-weighted kernel smoothing.
For full details, see docs/survey-roadmap.md.
10. Real-World Example: ACA Young Adult Coverage (NHANES)#
The examples above used synthetic data to illustrate survey design concepts. Now let’s apply the same methods to real CDC survey data from the National Health and Nutrition Examination Survey (NHANES).
Policy background. The Affordable Care Act’s dependent coverage provision, effective September 2010, allowed young adults to remain on their parents’ health insurance until age 26. This created a natural experiment: adults aged 19–25 gained coverage access (treatment group), while adults aged 27–34 — similar demographics but ineligible — serve as controls. This is one of the most widely studied DiD natural experiments in health economics.
Antwi, Y.A., Moriya, A.S. & Simon, K. (2013). “Effects of Federal Policy to Insure Young Adults: Evidence from the 2010 Affordable Care Act’s Dependent-Coverage Mandate.” American Economic Journal: Economic Policy 5(4): 1–28.
We use two NHANES cycles: 2007–2008 (pre-ACA) and 2015–2016 (post-ACA), with health insurance coverage as the outcome (binary, modeled as a linear probability model). NHANES uses a complex multi-stage probability sampling design with masked pseudo-strata (SDMVSTRA), masked pseudo-PSUs (SDMVPSU), and exam weights (WTMEC2YR). Because PSU IDs are reused across strata, we need nest=True.
[ ]:
# Load NHANES ACA dataset
# Try the pre-processed CSV first; fall back to downloading from CDC
from pathlib import Path
import os, urllib.request, tempfile
csv_path = Path("../../benchmarks/data/real/nhanes_aca_subset.csv")
if not csv_path.exists():
csv_path = Path("benchmarks/data/real/nhanes_aca_subset.csv")
if csv_path.exists():
nhanes = pd.read_csv(csv_path)
print(f"Loaded from repo: {csv_path}")
else:
# Download directly from CDC (4 XPT files, ~10 MB total)
print("Downloading NHANES data from CDC...")
CDC_FILES = {
"pre": {
"demo": "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2007/DataFiles/DEMO_E.XPT",
"hiq": "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2007/DataFiles/HIQ_E.XPT",
"period": 0,
},
"post": {
"demo": "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2015/DataFiles/DEMO_I.XPT",
"hiq": "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2015/DataFiles/HIQ_I.XPT",
"period": 1,
},
}
cycles = []
for key, cfg in CDC_FILES.items():
with tempfile.TemporaryDirectory() as tmp:
demo_path = os.path.join(tmp, "demo.xpt")
hiq_path = os.path.join(tmp, "hiq.xpt")
urllib.request.urlretrieve(cfg["demo"], demo_path)
urllib.request.urlretrieve(cfg["hiq"], hiq_path)
demo = pd.read_sas(demo_path, format="xport")
hiq = pd.read_sas(hiq_path, format="xport")
demo = demo[["SEQN", "SDMVSTRA", "SDMVPSU", "WTMEC2YR", "RIDAGEYR", "RIAGENDR", "INDFMPIR"]]
hiq = hiq[["SEQN", "HIQ011"]]
merged = demo.merge(hiq, on="SEQN", how="inner")
merged = merged[((merged.RIDAGEYR >= 19) & (merged.RIDAGEYR <= 25)) |
((merged.RIDAGEYR >= 27) & (merged.RIDAGEYR <= 34))]
merged["treated"] = (merged.RIDAGEYR <= 25).astype(int)
merged["outcome"] = (merged.HIQ011 == 1.0).astype(int)
merged["period"] = cfg["period"]
merged["post"] = cfg["period"]
merged = merged.dropna(subset=["SDMVSTRA", "SDMVPSU", "WTMEC2YR", "HIQ011"])
cycles.append(merged)
print(f" {key}: {len(merged)} rows")
nhanes = pd.concat(cycles, ignore_index=True)
nhanes["unit_id"] = range(1, len(nhanes) + 1)
for col in ["SDMVSTRA", "SDMVPSU", "treated", "outcome", "period", "post", "RIAGENDR"]:
nhanes[col] = nhanes[col].astype(int)
print(f"Downloaded and processed: {len(nhanes)} rows")
print(f"\nDataset: {len(nhanes)} rows")
print(f"Pre-ACA (2007-08): {(nhanes.period == 0).sum()} | Post-ACA (2015-16): {(nhanes.period == 1).sum()}")
print(f"Treatment (19-25): {(nhanes.treated == 1).sum()} | Control (27-34): {(nhanes.treated == 0).sum()}")
[ ]:
# Explore the survey design variables
print("Survey design:")
print(f" Strata (SDMVSTRA): {nhanes.SDMVSTRA.nunique()} unique values")
print(f" PSUs (SDMVPSU): {nhanes.SDMVPSU.nunique()} values per stratum (masked, reused across strata)")
print(f" Weights (WTMEC2YR): range [{nhanes.WTMEC2YR.min():.0f}, {nhanes.WTMEC2YR.max():.0f}]")
print()
# Insurance coverage rates by group and period
summary = nhanes.groupby(["period", "treated"]).agg(
n=("outcome", "size"),
insured_pct=("outcome", "mean"),
).reset_index()
summary["period_label"] = summary["period"].map({0: "Pre-ACA (07-08)", 1: "Post-ACA (15-16)"})
summary["group"] = summary["treated"].map({1: "Ages 19-25 (treated)", 0: "Ages 27-34 (control)"})
print("Insurance coverage rates:")
for _, row in summary.iterrows():
print(f" {row['period_label']:20s} {row['group']:25s} n={row['n']:5d} insured={row['insured_pct']:.1%}")
[ ]:
# Naive DiD (ignoring survey design)
did_naive = DifferenceInDifferences()
result_naive = did_naive.fit(nhanes, "outcome", "treated", "post")
print("Naive DiD (no survey design):")
print(f" ATT: {result_naive.att:.4f}")
print(f" SE: {result_naive.se:.4f}")
print(f" p-value: {result_naive.p_value:.4f}")
print(f" 95% CI: [{result_naive.conf_int[0]:.4f}, {result_naive.conf_int[1]:.4f}]")
[ ]:
# Survey-aware DiD (with real NHANES design)
sd_nhanes = SurveyDesign(
weights="WTMEC2YR",
strata="SDMVSTRA",
psu="SDMVPSU",
nest=True, # PSU IDs reused across strata in NHANES
)
result_survey = did_naive.fit(nhanes, "outcome", "treated", "post", survey_design=sd_nhanes)
print("Survey-aware DiD (strata + PSU + weights, nest=True):")
print(f" ATT: {result_survey.att:.4f}")
print(f" SE: {result_survey.se:.4f}")
print(f" p-value: {result_survey.p_value:.4f}")
print(f" 95% CI: [{result_survey.conf_int[0]:.4f}, {result_survey.conf_int[1]:.4f}]")
print(f" Survey df: {result_survey.survey_metadata.df_survey}")
[ ]:
# Side-by-side comparison
print(f"{'':25s} {'Naive':>12s} {'Survey':>12s}")
print("-" * 50)
print(f"{'ATT':25s} {result_naive.att:>12.4f} {result_survey.att:>12.4f}")
print(f"{'SE':25s} {result_naive.se:>12.4f} {result_survey.se:>12.4f}")
ci_n = result_naive.conf_int
ci_s = result_survey.conf_int
print(f"{'CI width':25s} {ci_n[1]-ci_n[0]:>12.4f} {ci_s[1]-ci_s[0]:>12.4f}")
print(f"{'p-value':25s} {result_naive.p_value:>12.4f} {result_survey.p_value:>12.4f}")
print(f"{'df':25s} {'(normal)':>12s} {result_survey.survey_metadata.df_survey:>12d}")
print()
print("Both the point estimate AND standard error change with survey weights.")
print("The weighted ATT is the population-representative estimate.")
With real survey data, both the point estimate and standard error change when we account for the survey design. The ATT shifts from 0.065 (unweighted) to 0.097 (weighted) because NHANES uses unequal selection probabilities — the weighted estimate is population-representative, while the unweighted one over- or under-represents certain demographic groups. The SE also increases because NHANES clusters individuals within PSUs, and people from the same geographic area have correlated insurance status.
The survey-corrected estimate of ~9.7 percentage points suggests that the ACA provisions meaningfully increased insurance coverage among young adults. This is consistent with the published literature: studies measuring only the 2010 dependent coverage mandate find 3–6 pp effects (Antwi et al., 2013; Sommers, 2012), while studies spanning the full ACA implementation — including the 2014 marketplace, individual mandate, and Medicaid expansion — find 8–13 pp (Kaestner et al., 2017; Courtemanche et al., 2017). Our 2007–08 vs. 2015–16 window captures all of these provisions, placing the 9.7 pp estimate squarely in the expected range.
The survey degrees of freedom (31 = n_PSU − n_strata) reflect the actual number of independent sampling units, not the number of individuals. This is why the confidence interval [0.006, 0.187] is wide despite nearly 3,000 observations.
Note: This is a simplified analysis for illustration. A full study would include additional covariates, robustness checks, and careful attention to the parallel trends assumption.
[ ]:
# Subpopulation analysis: female respondents
sd_sub, data_sub = sd_nhanes.subpopulation(nhanes, lambda df: df["RIAGENDR"] == 2)
result_female = DifferenceInDifferences().fit(
data_sub, "outcome", "treated", "post", survey_design=sd_sub,
)
print("Subpopulation: Female respondents only")
print(f" ATT: {result_female.att:.4f}")
print(f" SE: {result_female.se:.4f}")
print(f" p-value: {result_female.p_value:.4f}")
print()
print(f"Overall ATT: {result_survey.att:.4f} (SE={result_survey.se:.4f})")
print(f"Female ATT: {result_female.att:.4f} (SE={result_female.se:.4f})")
print()
print("Note: subpopulation() zeros out male weights while preserving the full")
print("design structure (all strata and PSUs), giving correct variance estimates.")
Summary#
Key takeaways:
Always specify the survey design when working with survey data. Ignoring it leads to incorrect standard errors – typically too small, leading to false positives.
``SurveyDesign`` encapsulates your survey’s sampling structure in one object. Pass column names for weights, strata, PSU, and FPC.
Pass ``survey_design`` to ``fit()`` – the same API works across all estimators. No changes to your estimation code beyond adding one parameter.
CallawaySantAnna has the most complete survey support: strata/PSU/FPC, replicate weights, analytical and bootstrap SEs, and both panel and cross-section modes.
Replicate weights (JK1, JKn, BRR, Fay) are an alternative to strata/PSU/FPC when your survey provides them (e.g., MEPS, ACS PUMS). They are mutually exclusive with strata/PSU/FPC.
Use ``subpopulation()`` instead of subsetting when estimating effects for a subgroup. Subsetting drops design information and biases variance estimates.
DEFF diagnostics help you understand how the survey design affects precision. DEFF > 1 means clustering costs exceed stratification gains; DEFF < 1 means the design improves precision for that coefficient. DEFF > 2 indicates substantial clustering.
Repeated cross-sections (
panel=False) work with survey design for non-panel surveys like BRFSS, CPS, and ACS 1-year.Validate with real data. The NHANES example above demonstrates that diff-diff’s survey variance matches R’s
survey::svyglm()to machine precision on real CDC data. Seedocs/benchmarks.rstfor the full validation results across three federal survey datasets.
Quick reference:
Parameter |
When to use |
|---|---|
|
Always – specify the sampling weight column |
|
When the survey uses stratified sampling |
|
When multi-stage (clustered) sampling is used |
|
When the sampling fraction is non-negligible |
|
When the survey provides replicate weights instead of strata/PSU/FPC |
References:
Lumley, T. (2004). Analysis of Complex Survey Samples. Journal of Statistical Software 9(8).
Solon, G., Haider, S. J., & Wooldridge, J. M. (2015). What Are We Weighting For? Journal of Human Resources 50(2).
Binder, D. A. (1983). On the Variances of Asymptotically Normal Estimators from Complex Surveys. International Statistical Review 51(3).
Rao, J. N. K., Wu, C. F. J., & Yue, K. (1992). Some Recent Work on Resampling Methods for Complex Surveys. Survey Methodology 18(2).
Callaway, B. & Sant’Anna, P. H. C. (2021). Difference-in-Differences with Multiple Time Periods. Journal of Econometrics 225(2).
Sant’Anna, P. H. C. & Zhao, J. (2020). Doubly Robust Difference-in-Differences Estimators. Journal of Econometrics 219(1).