Interactive notebook

This tutorial is a Jupyter notebook. You can view it on GitHub or download it to run locally.

Power Analysis for Difference-in-Differences#

This notebook demonstrates how to use the power analysis tools in diff-diff for study design. We’ll cover:

  1. Computing minimum detectable effects (MDE)

  2. Calculating required sample sizes

  3. Estimating statistical power

  4. Creating power curves for visualization

  5. Panel data considerations (ICC, multiple periods)

  6. Simulation-based power analysis for complex designs

  7. Power analysis for any estimator (staggered, synthetic DiD, triple difference)

  8. Finding MDE via simulation (bisection search)

  9. Finding required sample size via simulation (bisection search)

  10. Custom data generators

  11. Convenience functions

  12. Practical recommendations

[ ]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from diff_diff import (
    PowerAnalysis,
    DifferenceInDifferences,
    CallawaySantAnna,
    SyntheticDiD,
    TripleDifference,
    simulate_power,
    simulate_mde,
    simulate_sample_size,
    compute_mde,
    compute_power,
    compute_sample_size,
    plot_power_curve,
)

1. The Power Analysis Problem#

Before running a DiD study, researchers need to answer key design questions:

  • “How many units do I need?” (sample size calculation)

  • “What’s the smallest effect I can detect?” (minimum detectable effect)

  • “What’s my chance of finding a significant result?” (power calculation)

The PowerAnalysis class provides analytical formulas for these calculations in basic DiD designs.

[ ]:
# Create a PowerAnalysis object with standard settings
# alpha = 0.05 (5% significance level)
# power = 0.80 (80% power - conventional target)
pa = PowerAnalysis(alpha=0.05, power=0.80)

print(f"Significance level: {pa.alpha}")
print(f"Target power: {pa.target_power}")
print(f"Alternative hypothesis: {pa.alternative}")

2. Minimum Detectable Effect (MDE)#

The MDE is the smallest treatment effect you can detect with your specified power and sample size. It depends on:

  • Sample sizes (n_treated, n_control)

  • Residual variance (sigma)

  • Significance level (alpha)

  • Target power

[ ]:
# Calculate MDE for a basic 2x2 DiD design
# Assume: 50 treated units, 50 control units, outcome SD of 10
result = pa.mde(
    n_treated=50,
    n_control=50,
    sigma=10.0  # Residual standard deviation
)

print(result.summary())
[ ]:
# Access individual results
print(f"Minimum Detectable Effect: {result.mde:.2f}")
print(f"This is {result.mde / 10.0:.2f} standard deviations")
print(f"")
print(f"With 50 treated and 50 control units, and sigma=10,")
print(f"you can detect an effect of {result.mde:.2f} or larger")
print(f"with {result.power:.0%} power at alpha={result.alpha:.2f}.")

How MDE Changes with Sample Size#

[ ]:
# Compare MDE across different sample sizes
sample_sizes = [25, 50, 100, 200, 500]

print(f"{'N per group':>12} {'Total N':>10} {'MDE':>10} {'MDE/SD':>10}")
print("-" * 45)

for n in sample_sizes:
    result = pa.mde(n_treated=n, n_control=n, sigma=10.0)
    print(f"{n:>12} {2*n:>10} {result.mde:>10.2f} {result.mde/10.0:>10.2f}")

3. Required Sample Size#

Given a target effect size to detect, calculate the required sample size.

[ ]:
# How many units do we need to detect an effect of 5 units?
result = pa.sample_size(
    effect_size=5.0,  # Effect we want to detect
    sigma=10.0  # Outcome standard deviation
)

print(result.summary())
[ ]:
# Compare sample sizes needed for different effect sizes
effect_sizes = [2.0, 3.0, 5.0, 7.0, 10.0]

print(f"{'Effect Size':>12} {'Effect/SD':>10} {'Required N':>12} {'N per group':>12}")
print("-" * 50)

for effect in effect_sizes:
    result = pa.sample_size(effect_size=effect, sigma=10.0)
    print(f"{effect:>12.1f} {effect/10.0:>10.2f} {result.required_n:>12} {result.n_treated:>12}")

4. Power Calculation#

Given a specific effect size and sample size, calculate the statistical power.

[ ]:
# What's our power to detect an effect of 4 with 75 units per group?
pa_calc = PowerAnalysis(alpha=0.05)  # Don't specify target power for calculation

result = pa_calc.power(
    effect_size=4.0,
    n_treated=75,
    n_control=75,
    sigma=10.0
)

print(f"Power to detect effect of 4.0: {result.power:.1%}")
print(f"This is {'adequate' if result.power >= 0.80 else 'below the conventional 80% threshold'}")

5. Power Curves#

Power curves show how power changes with effect size (or sample size). They’re essential for understanding the trade-offs in study design.

[ ]:
# Generate power curve data
curve_df = pa.power_curve(
    n_treated=50,
    n_control=50,
    sigma=10.0
)

# Get MDE for reference
mde_result = pa.mde(n_treated=50, n_control=50, sigma=10.0)

print("First few rows of power curve:")
print(curve_df.head(10))
[ ]:
# Plot the power curve
plot_power_curve(
    curve_df,
    mde=mde_result.mde,
    target_power=0.80,
    title="Power Curve: 50 Treated, 50 Control, SD=10",
    xlabel="Treatment Effect Size",
    figsize=(10, 6)
)
plt.show()

Power vs Sample Size#

[ ]:
# How does power change with sample size for a fixed effect?
sample_curve = pa.sample_size_curve(
    effect_size=5.0,
    sigma=10.0
)

# Plot
plot_power_curve(
    sample_curve,
    target_power=0.80,
    show_mde_line=False,
    title="Power vs Sample Size (Effect=5, SD=10)",
    figsize=(10, 6)
)
plt.show()

6. Panel Data Considerations#

For panel DiD with multiple time periods, power depends on:

  • Number of pre/post periods (n_pre, n_post)

  • Within-unit (serial) equicorrelation rho — the correlation of a unit’s outcomes across time

More periods improve precision. Within-unit correlation makes the DiD easier to detect (it lowers the MDE): the difference-in-differences cancels the shared within-unit component, so a higher rho leaves less residual noise. This is the within-unit equicorrelated case of Burlig, Preonas & Woerman (2020), Eq. 2 — Var(ATT) = σ²(1/n_T + 1/n_C)(1/m + 1/r)(1 rho) — and is the opposite of the classic Moulton “design effect” intuition (which inflates the variance of a cluster mean, not a difference).

[ ]:
# Compare MDE with different numbers of periods
print("Effect of number of periods on MDE (50 treated, 50 control, sigma=10):")
print(f"{'Periods (pre+post)':>20} {'MDE':>10} {'Improvement':>15}")
print("-" * 50)

baseline_mde = None
for n_pre, n_post in [(1, 1), (2, 2), (3, 3), (5, 5)]:
    result = pa.mde(
        n_treated=50, n_control=50, sigma=10.0,
        n_pre=n_pre, n_post=n_post
    )
    if baseline_mde is None:
        baseline_mde = result.mde
        improvement = "-"
    else:
        improvement = f"{(1 - result.mde/baseline_mde)*100:.1f}%"
    print(f"{n_pre + n_post:>20} {result.mde:>10.2f} {improvement:>15}")
[ ]:
# Effect of within-unit (serial) equicorrelation `rho` on the panel-DiD MDE.
# Per Burlig et al. (2020) Eq. 2 (equicorrelated case), higher rho LOWERS the MDE:
# Var(ATT) carries a (1 - rho) factor because the DiD differences out the shared
# within-unit component.
print("Effect of within-unit correlation on MDE (50 treated, 50 control, 6 periods, sigma=10):")
print(f"{'rho':>10} {'MDE':>10} {'Var factor (1-rho)':>20}")
print("-" * 42)

for rho in [0.0, 0.1, 0.3, 0.5, 0.7]:
    result = pa.mde(
        n_treated=50, n_control=50, sigma=10.0,
        n_pre=3, n_post=3, rho=rho
    )
    # Variance scales with (1 - rho) relative to the rho = 0 case.
    var_factor = 1 - rho
    print(f"{rho:>10.1f} {result.mde:>10.2f} {var_factor:>20.2f}")

7. Simulation-Based Power Analysis#

For complex designs (staggered adoption, synthetic DiD, etc.), analytical formulas may not exist. Use Monte Carlo simulation instead.

The simulate_power function:

  1. Generates synthetic data with known treatment effect

  2. Fits your estimator

  3. Records whether the effect is statistically significant

  4. Repeats many times

  5. Reports the proportion of significant results (= power)

[ ]:
# Simulation-based power analysis
did = DifferenceInDifferences()

results = simulate_power(
    estimator=did,
    n_units=100,
    n_periods=4,
    treatment_effect=5.0,
    treatment_fraction=0.5,
    sigma=5.0,  # Noise level
    n_simulations=200,  # More simulations = more precise power estimate
    seed=42,
    progress=False
)

print(results.summary())
[ ]:
# Key metrics from simulation
print("Simulation Results:")
print(f"  Power (rejection rate): {results.power:.1%}")
print(f"  95% CI for power: [{results.power_ci[0]:.1%}, {results.power_ci[1]:.1%}]")
print(f"")
print("Estimator Performance:")
print(f"  True effect: {results.true_effect:.2f}")
print(f"  Mean estimate: {results.mean_estimate:.2f}")
print(f"  Bias: {results.bias:.4f}")
print(f"  RMSE: {results.rmse:.4f}")
print(f"  Coverage: {results.coverage:.1%}")

Power Curve via Simulation#

[ ]:
# Simulate power for multiple effect sizes
results_multi = simulate_power(
    estimator=did,
    n_units=100,
    n_periods=4,
    effect_sizes=[1.0, 2.0, 3.0, 5.0, 7.0, 10.0],  # Multiple effects
    sigma=5.0,
    n_simulations=100,  # Fewer per effect size for speed
    seed=42,
    progress=False
)

# Get power curve data
power_curve_sim = results_multi.power_curve_df()
print(power_curve_sim)
[ ]:
# Plot simulation-based power curve
plot_power_curve(
    power_curve_sim,
    target_power=0.80,
    title="Simulation-Based Power Curve (100 units, 4 periods, SD=5)",
    figsize=(10, 6)
)
plt.show()

8. Power Analysis for Any Estimator#

The simulation-based approach works with all 12 supported estimators — not just basic DiD. An internal registry automatically selects the appropriate data-generating process (DGP) and fit signature for each registered estimator. Just swap in the estimator object and everything else is handled. See the support table below for the full list, and Section 11 for using custom DGPs with unsupported estimators.

Staggered Adoption Estimators#

[ ]:
# Power analysis with Callaway-Sant'Anna — the registry auto-selects
# generate_staggered_data as the DGP and the correct fit kwargs
cs = CallawaySantAnna()

cs_results = simulate_power(
    estimator=cs,
    n_units=100,
    n_periods=6,
    treatment_effect=5.0,
    treatment_fraction=0.5,
    treatment_period=3,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

print(cs_results.summary())

Factor Model Estimators (Synthetic DiD)#

For SyntheticDiD with the default placebo variance method, the DGP must generate more control than treated units (treatment_fraction < 0.5). The registry uses generate_factor_data automatically.

[ ]:
import warnings

# Synthetic DiD — note treatment_fraction=0.3 (placebo variance requires
# more control units than treated units)
sdid = SyntheticDiD()

# Suppress pre-treatment fit warnings from individual simulation draws;
# the factor-model DGP can trigger this on some random seeds without
# affecting the overall power estimate.
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message="Pre-treatment fit is poor")
    sdid_results = simulate_power(
        estimator=sdid,
        n_units=60,
        n_periods=6,
        treatment_effect=5.0,
        treatment_fraction=0.3,
        treatment_period=3,
        sigma=3.0,
        n_simulations=100,
        seed=42,
        progress=False,
    )

print(sdid_results.summary())

Triple Difference#

TripleDifference uses a fixed 2×2×2 factorial design (group × partition × time). Sample sizes are rounded via ``n_per_cell = max(2, n_units // 8)``, so the minimum effective N is 16 (2 units per cell × 8 cells). The effective_n_units field in results tracks any rounding. Note that simulate_sample_size() uses a higher search floor of 64 from the registry.

[ ]:
# Triple Difference — n_units snaps to multiples of 8
ddd = TripleDifference()

ddd_results = simulate_power(
    estimator=ddd,
    n_units=64,
    treatment_effect=3.0,
    sigma=2.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

print(ddd_results.summary())
if ddd_results.effective_n_units is not None:
    print(f"\nEffective N (after grid rounding): {ddd_results.effective_n_units}")

Supported Estimators#

The following 12 estimators are supported by the simulation power analysis registry. Each is automatically paired with the correct data-generating process:

DGP Family

Estimators

Min N

Basic DiD (generate_did_data)

DifferenceInDifferences, TwoWayFixedEffects, MultiPeriodDiD

20

Staggered (generate_staggered_data)

CallawaySantAnna, SunAbraham, ImputationDiD, TwoStageDiD, StackedDiD, EfficientDiD

40

Factor Model (generate_factor_data)

TROP, SyntheticDiD

30

Triple Difference (generate_ddd_data)

TripleDifference

16*

* DDD effective N rounds to max(2, n_units // 8) * 8 with minimum 16. simulate_sample_size() uses a higher search floor of 64.

Note: ContinuousDiD is not in the registry because continuous/dose-response treatments require a different DGP structure. BaconDecomposition and HonestDiD are diagnostic/sensitivity tools rather than treatment effect estimators. For unsupported estimators, you can pass a custom data_generator and result_extractor (see Section 11).

Power Curve for a Staggered Estimator#

[ ]:
# Power curve across effect sizes for Callaway-Sant'Anna
cs_curve = simulate_power(
    estimator=CallawaySantAnna(),
    n_units=100,
    n_periods=6,
    effect_sizes=[1.0, 2.0, 3.0, 5.0, 7.0],
    treatment_period=3,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

plot_power_curve(
    cs_curve.power_curve_df(),
    target_power=0.80,
    title="CS Power Curve (100 units, 6 periods, SD=5)",
    figsize=(10, 6),
)
plt.show()

9. Finding MDE via Simulation#

The analytical PowerAnalysis.mde() works for basic DiD, but for complex estimators there is no closed-form formula. simulate_mde() uses bisection search to find the minimum detectable effect: it repeatedly calls simulate_power() at different effect sizes, narrowing the bracket until it finds the smallest effect that achieves the target power.

[ ]:
# Find MDE for basic DiD via simulation
mde_result = simulate_mde(
    DifferenceInDifferences(),
    n_units=100,
    n_periods=4,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

print(mde_result.summary())

Inspecting the Search Path#

The search_path attribute records the effect size and power at each bisection step, which is useful for diagnosing convergence:

[ ]:
# search_path is a List[Dict] — convert to DataFrame for display
search_df = pd.DataFrame(mde_result.search_path)
print(search_df.to_string(index=False))

MDE for a Staggered Estimator#

The same function works with any registered estimator:

[ ]:
# MDE for Callaway-Sant'Anna
cs_mde = simulate_mde(
    CallawaySantAnna(),
    n_units=100,
    n_periods=6,
    treatment_period=3,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

print(cs_mde.summary())

Key parameters for ``simulate_mde()``:

  • effect_range=(lo, hi) — custom search bracket (auto-detected if omitted)

  • tol — convergence tolerance on power (default 0.02)

  • max_steps — maximum bisection steps (default 15)

  • n_simulations — simulations per step (use 500+ for production analyses)

10. Finding Required Sample Size via Simulation#

simulate_sample_size() uses bisection search over n_units to find the smallest sample size that achieves the target power for a given effect size.

[ ]:
# Find required sample size for basic DiD
n_result = simulate_sample_size(
    DifferenceInDifferences(),
    treatment_effect=5.0,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

print(n_result.summary())

Inspecting the Search Path#

[ ]:
# View the bisection steps
n_search_df = pd.DataFrame(n_result.search_path)
print(n_search_df.to_string(index=False))

Comparing Analytical and Simulation Results#

For basic DiD, we can compare the simulation result against the analytical formula. With only 100 simulations per bisection step there will be Monte Carlo noise, so we expect approximate — not exact — agreement:

[ ]:
# Analytical sample size
analytical = pa.sample_size(effect_size=5.0, sigma=5.0)

print(f"{'Method':<25} {'Required N':>12}")
print("-" * 40)
print(f"{'Analytical:':<25} {analytical.required_n:>12}")
print(f"{'Simulation:':<25} {n_result.required_n:>12}")
print(f"\nSimulation power at N: {n_result.power_at_n:.1%}")

Key parameters for ``simulate_sample_size()``:

  • n_range=(lo, hi) — custom search bracket for sample size (auto-detected if omitted)

  • max_steps — maximum bisection steps (default 15)

  • n_simulations — simulations per step (use 500+ for production analyses)

11. Custom Data Generators#

The default DGPs cover common designs, but you can customize them in two ways:

  1. Tweak the default DGP with data_generator_kwargs (e.g., add multiple treatment cohorts)

  2. Supply a fully custom DGP with data_generator

Tweaking the Default DGP#

Pass additional keyword arguments to the registry’s DGP via data_generator_kwargs. For example, the default staggered DGP generates a single treatment cohort — here we create a multi-cohort design:

Note: Some keys are protected and cannot be overridden via data_generator_kwargs because they are controlled by the simulation function itself: treatment_effect, noise_sd, n_units, n_periods, treatment_fraction, treatment_period, n_pre, n_post.

[ ]:
# Multi-cohort staggered design: treatment starts at periods 2 and 4
cs_multi = simulate_power(
    estimator=CallawaySantAnna(),
    n_units=120,
    n_periods=6,
    treatment_effect=5.0,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
    data_generator_kwargs={
        "cohort_periods": [2, 4],
        "never_treated_frac": 0.3,
    },
)

print(cs_multi.summary())

Fully Custom Data Generator#

For designs not covered by the built-in DGPs, supply your own data_generator function. It receives the standard simulation parameters and must return a DataFrame. You may also need a custom result_extractor if your estimator returns non-standard results, and estimator_kwargs to pass the right column names to fit().

[ ]:
def my_dgp(n_units, n_periods, treatment_effect, treatment_fraction,
           treatment_period, noise_sd, seed=None):
    """Custom DGP with heterogeneous unit effects."""
    rng = np.random.default_rng(seed)
    n_treat = int(n_units * treatment_fraction)

    rows = []
    for i in range(n_units):
        unit_fe = rng.normal(0, 3)  # heterogeneous unit effect
        treated_unit = i < n_treat
        for t in range(n_periods):
            post = int(t >= treatment_period)
            effect = treatment_effect * post if treated_unit else 0.0
            y = unit_fe + 2.0 * t + effect + rng.normal(0, noise_sd)
            rows.append({
                "unit": i, "period": t, "outcome": y,
                "ever_treated": int(treated_unit), "post": post,
            })
    return pd.DataFrame(rows)

# Use the custom DGP with simulate_power
custom_results = simulate_power(
    estimator=DifferenceInDifferences(),
    n_units=80,
    n_periods=4,
    treatment_effect=4.0,
    sigma=3.0,
    n_simulations=100,
    seed=42,
    progress=False,
    data_generator=my_dgp,
    estimator_kwargs={"outcome": "outcome", "treatment": "ever_treated", "time": "post"},
)

print(custom_results.summary())

12. Convenience Functions#

For quick calculations, use the convenience functions:

[ ]:
# Quick MDE calculation
mde = compute_mde(n_treated=100, n_control=100, sigma=10.0, power=0.80)
print(f"MDE: {mde:.2f}")

# Quick power calculation
power = compute_power(effect_size=5.0, n_treated=100, n_control=100, sigma=10.0)
print(f"Power: {power:.1%}")

# Quick sample size calculation
n = compute_sample_size(effect_size=5.0, sigma=10.0, power=0.80)
print(f"Required N: {n}")

13. Practical Recommendations#

Estimating Sigma (Residual SD)#

The residual standard deviation is crucial for power calculations. Options:

  1. Pilot data: Fit a model on historical data and get residual SD

  2. Literature: Find similar studies and use their reported SDs

  3. Domain knowledge: Expert judgment about outcome variability

  4. Sensitivity analysis: Calculate power for a range of sigma values

[ ]:
# Sensitivity to sigma
print("MDE sensitivity to residual SD (50 treated, 50 control):")
print(f"{'Sigma':>10} {'MDE':>10} {'Effect/SD':>12}")
print("-" * 35)

for sigma in [5.0, 7.5, 10.0, 12.5, 15.0]:
    result = pa.mde(n_treated=50, n_control=50, sigma=sigma)
    print(f"{sigma:>10.1f} {result.mde:>10.2f} {result.mde/sigma:>12.3f}")

Choosing a Target Power#

  • 80% power is conventional but arbitrary

  • 90% power is more conservative and recommended for important studies

  • Consider the cost of Type II errors (missing a real effect) vs Type I errors

[ ]:
# Compare required sample sizes for different power levels
print("Required sample size by target power (effect=5, sigma=10):")
print(f"{'Power':>10} {'Required N':>15}")
print("-" * 30)

for power in [0.70, 0.80, 0.90, 0.95]:
    pa_custom = PowerAnalysis(power=power)
    result = pa_custom.sample_size(effect_size=5.0, sigma=10.0)
    print(f"{power:>10.0%} {result.required_n:>15}")

Analytical vs. Simulation: When to Use Each#

Approach

Best for

Advantages

Analytical (PowerAnalysis)

Basic 2×2 DiD, panel DiD

Fast, exact, closed-form

Simulation (simulate_power/mde/sample_size)

Staggered, SDID, TROP, DDD, custom designs

Works with any estimator, reports bias/RMSE/coverage

Rule of thumb: Start with analytical power analysis for basic designs. Move to simulation when using specialized estimators or non-standard DGPs.

Summary#

Key takeaways for DiD power analysis:

  1. Always do a power analysis before running a study

  2. MDE decreases with sample size, more periods, and lower variance

  3. Within-unit correlation helps in panel DiD — higher serial correlation lowers the MDE, because the difference-in-differences cancels the shared within-unit component (Burlig et al. 2020, Eq. 2, equicorrelated case)

  4. Use simulation for complex designs (staggered, synthetic DiD, triple difference)

  5. 12 estimators are supported out of the box via the auto-registry — just swap in the estimator

  6. ``simulate_mde()`` and ``simulate_sample_size()`` extend MDE and sample size calculations to any estimator via bisection search

  7. Custom DGPs let you model non-standard designs with data_generator and data_generator_kwargs

  8. Be realistic about sigma — err on the side of larger values

  9. Consider your smallest meaningful effect — don’t just target statistical significance

For more on DiD estimation, see the other tutorials:

  • 01_basic_did.ipynb — Basic DiD estimation

  • 02_staggered_did.ipynb — Staggered adoption designs

  • 03_synthetic_did.ipynb — Synthetic DiD

  • 04_parallel_trends.ipynb — Testing assumptions

  • 05_honest_did.ipynb — Sensitivity analysis

  • 07_pretrends_power.ipynb — Pre-trends power analysis (Roth 2022)