Interactive notebook
This tutorial is a Jupyter notebook. You can view it on GitHub or download it to run locally.
Power Analysis for Difference-in-Differences#
This notebook demonstrates how to use the power analysis tools in diff-diff for study design. We’ll cover:
Computing minimum detectable effects (MDE)
Calculating required sample sizes
Estimating statistical power
Creating power curves for visualization
Panel data considerations (ICC, multiple periods)
Simulation-based power analysis for complex designs
Power analysis for any estimator (staggered, synthetic DiD, triple difference)
Finding MDE via simulation (bisection search)
Finding required sample size via simulation (bisection search)
Custom data generators
Convenience functions
Practical recommendations
[ ]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from diff_diff import (
PowerAnalysis,
DifferenceInDifferences,
CallawaySantAnna,
SyntheticDiD,
TripleDifference,
simulate_power,
simulate_mde,
simulate_sample_size,
compute_mde,
compute_power,
compute_sample_size,
plot_power_curve,
)
1. The Power Analysis Problem#
Before running a DiD study, researchers need to answer key design questions:
“How many units do I need?” (sample size calculation)
“What’s the smallest effect I can detect?” (minimum detectable effect)
“What’s my chance of finding a significant result?” (power calculation)
The PowerAnalysis class provides analytical formulas for these calculations in basic DiD designs.
[ ]:
# Create a PowerAnalysis object with standard settings
# alpha = 0.05 (5% significance level)
# power = 0.80 (80% power - conventional target)
pa = PowerAnalysis(alpha=0.05, power=0.80)
print(f"Significance level: {pa.alpha}")
print(f"Target power: {pa.target_power}")
print(f"Alternative hypothesis: {pa.alternative}")
2. Minimum Detectable Effect (MDE)#
The MDE is the smallest treatment effect you can detect with your specified power and sample size. It depends on:
Sample sizes (n_treated, n_control)
Residual variance (sigma)
Significance level (alpha)
Target power
[ ]:
# Calculate MDE for a basic 2x2 DiD design
# Assume: 50 treated units, 50 control units, outcome SD of 10
result = pa.mde(
n_treated=50,
n_control=50,
sigma=10.0 # Residual standard deviation
)
print(result.summary())
[ ]:
# Access individual results
print(f"Minimum Detectable Effect: {result.mde:.2f}")
print(f"This is {result.mde / 10.0:.2f} standard deviations")
print(f"")
print(f"With 50 treated and 50 control units, and sigma=10,")
print(f"you can detect an effect of {result.mde:.2f} or larger")
print(f"with {result.power:.0%} power at alpha={result.alpha:.2f}.")
How MDE Changes with Sample Size#
[ ]:
# Compare MDE across different sample sizes
sample_sizes = [25, 50, 100, 200, 500]
print(f"{'N per group':>12} {'Total N':>10} {'MDE':>10} {'MDE/SD':>10}")
print("-" * 45)
for n in sample_sizes:
result = pa.mde(n_treated=n, n_control=n, sigma=10.0)
print(f"{n:>12} {2*n:>10} {result.mde:>10.2f} {result.mde/10.0:>10.2f}")
3. Required Sample Size#
Given a target effect size to detect, calculate the required sample size.
[ ]:
# How many units do we need to detect an effect of 5 units?
result = pa.sample_size(
effect_size=5.0, # Effect we want to detect
sigma=10.0 # Outcome standard deviation
)
print(result.summary())
[ ]:
# Compare sample sizes needed for different effect sizes
effect_sizes = [2.0, 3.0, 5.0, 7.0, 10.0]
print(f"{'Effect Size':>12} {'Effect/SD':>10} {'Required N':>12} {'N per group':>12}")
print("-" * 50)
for effect in effect_sizes:
result = pa.sample_size(effect_size=effect, sigma=10.0)
print(f"{effect:>12.1f} {effect/10.0:>10.2f} {result.required_n:>12} {result.n_treated:>12}")
4. Power Calculation#
Given a specific effect size and sample size, calculate the statistical power.
[ ]:
# What's our power to detect an effect of 4 with 75 units per group?
pa_calc = PowerAnalysis(alpha=0.05) # Don't specify target power for calculation
result = pa_calc.power(
effect_size=4.0,
n_treated=75,
n_control=75,
sigma=10.0
)
print(f"Power to detect effect of 4.0: {result.power:.1%}")
print(f"This is {'adequate' if result.power >= 0.80 else 'below the conventional 80% threshold'}")
5. Power Curves#
Power curves show how power changes with effect size (or sample size). They’re essential for understanding the trade-offs in study design.
[ ]:
# Generate power curve data
curve_df = pa.power_curve(
n_treated=50,
n_control=50,
sigma=10.0
)
# Get MDE for reference
mde_result = pa.mde(n_treated=50, n_control=50, sigma=10.0)
print("First few rows of power curve:")
print(curve_df.head(10))
[ ]:
# Plot the power curve
plot_power_curve(
curve_df,
mde=mde_result.mde,
target_power=0.80,
title="Power Curve: 50 Treated, 50 Control, SD=10",
xlabel="Treatment Effect Size",
figsize=(10, 6)
)
plt.show()
Power vs Sample Size#
[ ]:
# How does power change with sample size for a fixed effect?
sample_curve = pa.sample_size_curve(
effect_size=5.0,
sigma=10.0
)
# Plot
plot_power_curve(
sample_curve,
target_power=0.80,
show_mde_line=False,
title="Power vs Sample Size (Effect=5, SD=10)",
figsize=(10, 6)
)
plt.show()
6. Panel Data Considerations#
For panel DiD with multiple time periods, power depends on:
Number of pre/post periods (
n_pre,n_post)Within-unit (serial) equicorrelation
rho— the correlation of a unit’s outcomes across time
More periods improve precision. Within-unit correlation makes the DiD easier to detect (it lowers the MDE): the difference-in-differences cancels the shared within-unit component, so a higher rho leaves less residual noise. This is the within-unit equicorrelated case of Burlig, Preonas & Woerman (2020), Eq. 2 — Var(ATT) = σ²(1/n_T + 1/n_C)(1/m + 1/r)(1 − rho) — and is the opposite of the classic Moulton “design effect” intuition (which inflates the variance of a cluster mean,
not a difference).
[ ]:
# Compare MDE with different numbers of periods
print("Effect of number of periods on MDE (50 treated, 50 control, sigma=10):")
print(f"{'Periods (pre+post)':>20} {'MDE':>10} {'Improvement':>15}")
print("-" * 50)
baseline_mde = None
for n_pre, n_post in [(1, 1), (2, 2), (3, 3), (5, 5)]:
result = pa.mde(
n_treated=50, n_control=50, sigma=10.0,
n_pre=n_pre, n_post=n_post
)
if baseline_mde is None:
baseline_mde = result.mde
improvement = "-"
else:
improvement = f"{(1 - result.mde/baseline_mde)*100:.1f}%"
print(f"{n_pre + n_post:>20} {result.mde:>10.2f} {improvement:>15}")
[ ]:
# Effect of within-unit (serial) equicorrelation `rho` on the panel-DiD MDE.
# Per Burlig et al. (2020) Eq. 2 (equicorrelated case), higher rho LOWERS the MDE:
# Var(ATT) carries a (1 - rho) factor because the DiD differences out the shared
# within-unit component.
print("Effect of within-unit correlation on MDE (50 treated, 50 control, 6 periods, sigma=10):")
print(f"{'rho':>10} {'MDE':>10} {'Var factor (1-rho)':>20}")
print("-" * 42)
for rho in [0.0, 0.1, 0.3, 0.5, 0.7]:
result = pa.mde(
n_treated=50, n_control=50, sigma=10.0,
n_pre=3, n_post=3, rho=rho
)
# Variance scales with (1 - rho) relative to the rho = 0 case.
var_factor = 1 - rho
print(f"{rho:>10.1f} {result.mde:>10.2f} {var_factor:>20.2f}")
7. Simulation-Based Power Analysis#
For complex designs (staggered adoption, synthetic DiD, etc.), analytical formulas may not exist. Use Monte Carlo simulation instead.
The simulate_power function:
Generates synthetic data with known treatment effect
Fits your estimator
Records whether the effect is statistically significant
Repeats many times
Reports the proportion of significant results (= power)
[ ]:
# Simulation-based power analysis
did = DifferenceInDifferences()
results = simulate_power(
estimator=did,
n_units=100,
n_periods=4,
treatment_effect=5.0,
treatment_fraction=0.5,
sigma=5.0, # Noise level
n_simulations=200, # More simulations = more precise power estimate
seed=42,
progress=False
)
print(results.summary())
[ ]:
# Key metrics from simulation
print("Simulation Results:")
print(f" Power (rejection rate): {results.power:.1%}")
print(f" 95% CI for power: [{results.power_ci[0]:.1%}, {results.power_ci[1]:.1%}]")
print(f"")
print("Estimator Performance:")
print(f" True effect: {results.true_effect:.2f}")
print(f" Mean estimate: {results.mean_estimate:.2f}")
print(f" Bias: {results.bias:.4f}")
print(f" RMSE: {results.rmse:.4f}")
print(f" Coverage: {results.coverage:.1%}")
Power Curve via Simulation#
[ ]:
# Simulate power for multiple effect sizes
results_multi = simulate_power(
estimator=did,
n_units=100,
n_periods=4,
effect_sizes=[1.0, 2.0, 3.0, 5.0, 7.0, 10.0], # Multiple effects
sigma=5.0,
n_simulations=100, # Fewer per effect size for speed
seed=42,
progress=False
)
# Get power curve data
power_curve_sim = results_multi.power_curve_df()
print(power_curve_sim)
[ ]:
# Plot simulation-based power curve
plot_power_curve(
power_curve_sim,
target_power=0.80,
title="Simulation-Based Power Curve (100 units, 4 periods, SD=5)",
figsize=(10, 6)
)
plt.show()
8. Power Analysis for Any Estimator#
The simulation-based approach works with all 12 supported estimators — not just basic DiD. An internal registry automatically selects the appropriate data-generating process (DGP) and fit signature for each registered estimator. Just swap in the estimator object and everything else is handled. See the support table below for the full list, and Section 11 for using custom DGPs with unsupported estimators.
Staggered Adoption Estimators#
[ ]:
# Power analysis with Callaway-Sant'Anna — the registry auto-selects
# generate_staggered_data as the DGP and the correct fit kwargs
cs = CallawaySantAnna()
cs_results = simulate_power(
estimator=cs,
n_units=100,
n_periods=6,
treatment_effect=5.0,
treatment_fraction=0.5,
treatment_period=3,
sigma=5.0,
n_simulations=100,
seed=42,
progress=False,
)
print(cs_results.summary())
Factor Model Estimators (Synthetic DiD)#
For SyntheticDiD with the default placebo variance method, the DGP must generate more control than treated units (treatment_fraction < 0.5). The registry uses generate_factor_data automatically.
[ ]:
import warnings
# Synthetic DiD — note treatment_fraction=0.3 (placebo variance requires
# more control units than treated units)
sdid = SyntheticDiD()
# Suppress pre-treatment fit warnings from individual simulation draws;
# the factor-model DGP can trigger this on some random seeds without
# affecting the overall power estimate.
with warnings.catch_warnings():
warnings.filterwarnings("ignore", message="Pre-treatment fit is poor")
sdid_results = simulate_power(
estimator=sdid,
n_units=60,
n_periods=6,
treatment_effect=5.0,
treatment_fraction=0.3,
treatment_period=3,
sigma=3.0,
n_simulations=100,
seed=42,
progress=False,
)
print(sdid_results.summary())
Triple Difference#
TripleDifference uses a fixed 2×2×2 factorial design (group × partition × time). Sample sizes are rounded via ``n_per_cell = max(2, n_units // 8)``, so the minimum effective N is 16 (2 units per cell × 8 cells). The effective_n_units field in results tracks any rounding. Note that simulate_sample_size() uses a higher search floor of 64 from the registry.
[ ]:
# Triple Difference — n_units snaps to multiples of 8
ddd = TripleDifference()
ddd_results = simulate_power(
estimator=ddd,
n_units=64,
treatment_effect=3.0,
sigma=2.0,
n_simulations=100,
seed=42,
progress=False,
)
print(ddd_results.summary())
if ddd_results.effective_n_units is not None:
print(f"\nEffective N (after grid rounding): {ddd_results.effective_n_units}")
Supported Estimators#
The following 12 estimators are supported by the simulation power analysis registry. Each is automatically paired with the correct data-generating process:
DGP Family |
Estimators |
Min N |
|---|---|---|
Basic DiD ( |
DifferenceInDifferences, TwoWayFixedEffects, MultiPeriodDiD |
20 |
Staggered ( |
CallawaySantAnna, SunAbraham, ImputationDiD, TwoStageDiD, StackedDiD, EfficientDiD |
40 |
Factor Model ( |
TROP, SyntheticDiD |
30 |
Triple Difference ( |
TripleDifference |
16* |
* DDD effective N rounds to max(2, n_units // 8) * 8 with minimum 16. simulate_sample_size() uses a higher search floor of 64.
Note:
ContinuousDiDis not in the registry because continuous/dose-response treatments require a different DGP structure.BaconDecompositionandHonestDiDare diagnostic/sensitivity tools rather than treatment effect estimators. For unsupported estimators, you can pass a customdata_generatorandresult_extractor(see Section 11).
Power Curve for a Staggered Estimator#
[ ]:
# Power curve across effect sizes for Callaway-Sant'Anna
cs_curve = simulate_power(
estimator=CallawaySantAnna(),
n_units=100,
n_periods=6,
effect_sizes=[1.0, 2.0, 3.0, 5.0, 7.0],
treatment_period=3,
sigma=5.0,
n_simulations=100,
seed=42,
progress=False,
)
plot_power_curve(
cs_curve.power_curve_df(),
target_power=0.80,
title="CS Power Curve (100 units, 6 periods, SD=5)",
figsize=(10, 6),
)
plt.show()
9. Finding MDE via Simulation#
The analytical PowerAnalysis.mde() works for basic DiD, but for complex estimators there is no closed-form formula. simulate_mde() uses bisection search to find the minimum detectable effect: it repeatedly calls simulate_power() at different effect sizes, narrowing the bracket until it finds the smallest effect that achieves the target power.
[ ]:
# Find MDE for basic DiD via simulation
mde_result = simulate_mde(
DifferenceInDifferences(),
n_units=100,
n_periods=4,
sigma=5.0,
n_simulations=100,
seed=42,
progress=False,
)
print(mde_result.summary())
Inspecting the Search Path#
The search_path attribute records the effect size and power at each bisection step, which is useful for diagnosing convergence:
[ ]:
# search_path is a List[Dict] — convert to DataFrame for display
search_df = pd.DataFrame(mde_result.search_path)
print(search_df.to_string(index=False))
MDE for a Staggered Estimator#
The same function works with any registered estimator:
[ ]:
# MDE for Callaway-Sant'Anna
cs_mde = simulate_mde(
CallawaySantAnna(),
n_units=100,
n_periods=6,
treatment_period=3,
sigma=5.0,
n_simulations=100,
seed=42,
progress=False,
)
print(cs_mde.summary())
Key parameters for ``simulate_mde()``:
effect_range=(lo, hi)— custom search bracket (auto-detected if omitted)tol— convergence tolerance on power (default 0.02)max_steps— maximum bisection steps (default 15)n_simulations— simulations per step (use 500+ for production analyses)
10. Finding Required Sample Size via Simulation#
simulate_sample_size() uses bisection search over n_units to find the smallest sample size that achieves the target power for a given effect size.
[ ]:
# Find required sample size for basic DiD
n_result = simulate_sample_size(
DifferenceInDifferences(),
treatment_effect=5.0,
sigma=5.0,
n_simulations=100,
seed=42,
progress=False,
)
print(n_result.summary())
Inspecting the Search Path#
[ ]:
# View the bisection steps
n_search_df = pd.DataFrame(n_result.search_path)
print(n_search_df.to_string(index=False))
Comparing Analytical and Simulation Results#
For basic DiD, we can compare the simulation result against the analytical formula. With only 100 simulations per bisection step there will be Monte Carlo noise, so we expect approximate — not exact — agreement:
[ ]:
# Analytical sample size
analytical = pa.sample_size(effect_size=5.0, sigma=5.0)
print(f"{'Method':<25} {'Required N':>12}")
print("-" * 40)
print(f"{'Analytical:':<25} {analytical.required_n:>12}")
print(f"{'Simulation:':<25} {n_result.required_n:>12}")
print(f"\nSimulation power at N: {n_result.power_at_n:.1%}")
Key parameters for ``simulate_sample_size()``:
n_range=(lo, hi)— custom search bracket for sample size (auto-detected if omitted)max_steps— maximum bisection steps (default 15)n_simulations— simulations per step (use 500+ for production analyses)
11. Custom Data Generators#
The default DGPs cover common designs, but you can customize them in two ways:
Tweak the default DGP with
data_generator_kwargs(e.g., add multiple treatment cohorts)Supply a fully custom DGP with
data_generator
Tweaking the Default DGP#
Pass additional keyword arguments to the registry’s DGP via data_generator_kwargs. For example, the default staggered DGP generates a single treatment cohort — here we create a multi-cohort design:
Note: Some keys are protected and cannot be overridden via
data_generator_kwargsbecause they are controlled by the simulation function itself:treatment_effect,noise_sd,n_units,n_periods,treatment_fraction,treatment_period,n_pre,n_post.
[ ]:
# Multi-cohort staggered design: treatment starts at periods 2 and 4
cs_multi = simulate_power(
estimator=CallawaySantAnna(),
n_units=120,
n_periods=6,
treatment_effect=5.0,
sigma=5.0,
n_simulations=100,
seed=42,
progress=False,
data_generator_kwargs={
"cohort_periods": [2, 4],
"never_treated_frac": 0.3,
},
)
print(cs_multi.summary())
Fully Custom Data Generator#
For designs not covered by the built-in DGPs, supply your own data_generator function. It receives the standard simulation parameters and must return a DataFrame. You may also need a custom result_extractor if your estimator returns non-standard results, and estimator_kwargs to pass the right column names to fit().
[ ]:
def my_dgp(n_units, n_periods, treatment_effect, treatment_fraction,
treatment_period, noise_sd, seed=None):
"""Custom DGP with heterogeneous unit effects."""
rng = np.random.default_rng(seed)
n_treat = int(n_units * treatment_fraction)
rows = []
for i in range(n_units):
unit_fe = rng.normal(0, 3) # heterogeneous unit effect
treated_unit = i < n_treat
for t in range(n_periods):
post = int(t >= treatment_period)
effect = treatment_effect * post if treated_unit else 0.0
y = unit_fe + 2.0 * t + effect + rng.normal(0, noise_sd)
rows.append({
"unit": i, "period": t, "outcome": y,
"ever_treated": int(treated_unit), "post": post,
})
return pd.DataFrame(rows)
# Use the custom DGP with simulate_power
custom_results = simulate_power(
estimator=DifferenceInDifferences(),
n_units=80,
n_periods=4,
treatment_effect=4.0,
sigma=3.0,
n_simulations=100,
seed=42,
progress=False,
data_generator=my_dgp,
estimator_kwargs={"outcome": "outcome", "treatment": "ever_treated", "time": "post"},
)
print(custom_results.summary())
12. Convenience Functions#
For quick calculations, use the convenience functions:
[ ]:
# Quick MDE calculation
mde = compute_mde(n_treated=100, n_control=100, sigma=10.0, power=0.80)
print(f"MDE: {mde:.2f}")
# Quick power calculation
power = compute_power(effect_size=5.0, n_treated=100, n_control=100, sigma=10.0)
print(f"Power: {power:.1%}")
# Quick sample size calculation
n = compute_sample_size(effect_size=5.0, sigma=10.0, power=0.80)
print(f"Required N: {n}")
13. Practical Recommendations#
Estimating Sigma (Residual SD)#
The residual standard deviation is crucial for power calculations. Options:
Pilot data: Fit a model on historical data and get residual SD
Literature: Find similar studies and use their reported SDs
Domain knowledge: Expert judgment about outcome variability
Sensitivity analysis: Calculate power for a range of sigma values
[ ]:
# Sensitivity to sigma
print("MDE sensitivity to residual SD (50 treated, 50 control):")
print(f"{'Sigma':>10} {'MDE':>10} {'Effect/SD':>12}")
print("-" * 35)
for sigma in [5.0, 7.5, 10.0, 12.5, 15.0]:
result = pa.mde(n_treated=50, n_control=50, sigma=sigma)
print(f"{sigma:>10.1f} {result.mde:>10.2f} {result.mde/sigma:>12.3f}")
Choosing a Target Power#
80% power is conventional but arbitrary
90% power is more conservative and recommended for important studies
Consider the cost of Type II errors (missing a real effect) vs Type I errors
[ ]:
# Compare required sample sizes for different power levels
print("Required sample size by target power (effect=5, sigma=10):")
print(f"{'Power':>10} {'Required N':>15}")
print("-" * 30)
for power in [0.70, 0.80, 0.90, 0.95]:
pa_custom = PowerAnalysis(power=power)
result = pa_custom.sample_size(effect_size=5.0, sigma=10.0)
print(f"{power:>10.0%} {result.required_n:>15}")
Analytical vs. Simulation: When to Use Each#
Approach |
Best for |
Advantages |
|---|---|---|
Analytical ( |
Basic 2×2 DiD, panel DiD |
Fast, exact, closed-form |
Simulation ( |
Staggered, SDID, TROP, DDD, custom designs |
Works with any estimator, reports bias/RMSE/coverage |
Rule of thumb: Start with analytical power analysis for basic designs. Move to simulation when using specialized estimators or non-standard DGPs.
Summary#
Key takeaways for DiD power analysis:
Always do a power analysis before running a study
MDE decreases with sample size, more periods, and lower variance
Within-unit correlation helps in panel DiD — higher serial correlation lowers the MDE, because the difference-in-differences cancels the shared within-unit component (Burlig et al. 2020, Eq. 2, equicorrelated case)
Use simulation for complex designs (staggered, synthetic DiD, triple difference)
12 estimators are supported out of the box via the auto-registry — just swap in the estimator
``simulate_mde()`` and ``simulate_sample_size()`` extend MDE and sample size calculations to any estimator via bisection search
Custom DGPs let you model non-standard designs with
data_generatoranddata_generator_kwargsBe realistic about sigma — err on the side of larger values
Consider your smallest meaningful effect — don’t just target statistical significance
For more on DiD estimation, see the other tutorials:
01_basic_did.ipynb— Basic DiD estimation02_staggered_did.ipynb— Staggered adoption designs03_synthetic_did.ipynb— Synthetic DiD04_parallel_trends.ipynb— Testing assumptions05_honest_did.ipynb— Sensitivity analysis07_pretrends_power.ipynb— Pre-trends power analysis (Roth 2022)