Interactive notebook

This tutorial is a Jupyter notebook. You can view it on GitHub or download it to run locally.

Power Analysis for Difference-in-Differences#

This notebook demonstrates how to use the power analysis tools in diff-diff for study design. We’ll cover:

Computing minimum detectable effects (MDE)
Calculating required sample sizes
Estimating statistical power
Creating power curves for visualization
Panel data considerations (ICC, multiple periods)
Simulation-based power analysis for complex designs
Power analysis for any estimator (staggered, synthetic DiD, triple difference)
Finding MDE via simulation (bisection search)
Finding required sample size via simulation (bisection search)
Custom data generators
Convenience functions
Practical recommendations

[ ]:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from diff_diff import (
    PowerAnalysis,
    DifferenceInDifferences,
    CallawaySantAnna,
    SyntheticDiD,
    TripleDifference,
    simulate_power,
    simulate_mde,
    simulate_sample_size,
    compute_mde,
    compute_power,
    compute_sample_size,
    plot_power_curve,
)

1. The Power Analysis Problem#

Before running a DiD study, researchers need to answer key design questions:

“How many units do I need?” (sample size calculation)
“What’s the smallest effect I can detect?” (minimum detectable effect)
“What’s my chance of finding a significant result?” (power calculation)

The PowerAnalysis class provides analytical formulas for these calculations in basic DiD designs.

[ ]:

# Create a PowerAnalysis object with standard settings
# alpha = 0.05 (5% significance level)
# power = 0.80 (80% power - conventional target)
pa = PowerAnalysis(alpha=0.05, power=0.80)

print(f"Significance level: {pa.alpha}")
print(f"Target power: {pa.target_power}")
print(f"Alternative hypothesis: {pa.alternative}")

2. Minimum Detectable Effect (MDE)#

The MDE is the smallest treatment effect you can detect with your specified power and sample size. It depends on:

Sample sizes (n_treated, n_control)
Residual variance (sigma)
Significance level (alpha)
Target power

[ ]:

# Calculate MDE for a basic 2x2 DiD design
# Assume: 50 treated units, 50 control units, outcome SD of 10
result = pa.mde(
    n_treated=50,
    n_control=50,
    sigma=10.0  # Residual standard deviation
)

print(result.summary())

[ ]:

# Access individual results
print(f"Minimum Detectable Effect: {result.mde:.2f}")
print(f"This is {result.mde / 10.0:.2f} standard deviations")
print(f"")
print(f"With 50 treated and 50 control units, and sigma=10,")
print(f"you can detect an effect of {result.mde:.2f} or larger")
print(f"with {result.power:.0%} power at alpha={result.alpha:.2f}.")

How MDE Changes with Sample Size#

[ ]:

# Compare MDE across different sample sizes
sample_sizes = [25, 50, 100, 200, 500]

print(f"{'N per group':>12} {'Total N':>10} {'MDE':>10} {'MDE/SD':>10}")
print("-" * 45)

for n in sample_sizes:
    result = pa.mde(n_treated=n, n_control=n, sigma=10.0)
    print(f"{n:>12} {2*n:>10} {result.mde:>10.2f} {result.mde/10.0:>10.2f}")

3. Required Sample Size#

Given a target effect size to detect, calculate the required sample size.

[ ]:

# How many units do we need to detect an effect of 5 units?
result = pa.sample_size(
    effect_size=5.0,  # Effect we want to detect
    sigma=10.0  # Outcome standard deviation
)

print(result.summary())

[ ]:

# Compare sample sizes needed for different effect sizes
effect_sizes = [2.0, 3.0, 5.0, 7.0, 10.0]

print(f"{'Effect Size':>12} {'Effect/SD':>10} {'Required N':>12} {'N per group':>12}")
print("-" * 50)

for effect in effect_sizes:
    result = pa.sample_size(effect_size=effect, sigma=10.0)
    print(f"{effect:>12.1f} {effect/10.0:>10.2f} {result.required_n:>12} {result.n_treated:>12}")

4. Power Calculation#

Given a specific effect size and sample size, calculate the statistical power.

[ ]:

# What's our power to detect an effect of 4 with 75 units per group?
pa_calc = PowerAnalysis(alpha=0.05)  # Don't specify target power for calculation

result = pa_calc.power(
    effect_size=4.0,
    n_treated=75,
    n_control=75,
    sigma=10.0
)

print(f"Power to detect effect of 4.0: {result.power:.1%}")
print(f"This is {'adequate' if result.power >= 0.80 else 'below the conventional 80% threshold'}")

5. Power Curves#

Power curves show how power changes with effect size (or sample size). They’re essential for understanding the trade-offs in study design.

[ ]:

# Generate power curve data
curve_df = pa.power_curve(
    n_treated=50,
    n_control=50,
    sigma=10.0
)

# Get MDE for reference
mde_result = pa.mde(n_treated=50, n_control=50, sigma=10.0)

print("First few rows of power curve:")
print(curve_df.head(10))

[ ]:

# Plot the power curve
plot_power_curve(
    curve_df,
    mde=mde_result.mde,
    target_power=0.80,
    title="Power Curve: 50 Treated, 50 Control, SD=10",
    xlabel="Treatment Effect Size",
    figsize=(10, 6)
)
plt.show()

Power vs Sample Size#

[ ]:

# How does power change with sample size for a fixed effect?
sample_curve = pa.sample_size_curve(
    effect_size=5.0,
    sigma=10.0
)

# Plot
plot_power_curve(
    sample_curve,
    target_power=0.80,
    show_mde_line=False,
    title="Power vs Sample Size (Effect=5, SD=10)",
    figsize=(10, 6)
)
plt.show()

6. Panel Data Considerations#

For panel DiD with multiple time periods, power depends on:

Number of pre/post periods (n_pre, n_post)
Within-unit (serial) equicorrelation rho — the correlation of a unit’s outcomes across time

More periods improve precision. Within-unit correlation makes the DiD easier to detect (it lowers the MDE): the difference-in-differences cancels the shared within-unit component, so a higher rho leaves less residual noise. This is the within-unit equicorrelated case of Burlig, Preonas & Woerman (2020), Eq. 2 — Var(ATT) = σ²(1/n_T + 1/n_C)(1/m + 1/r)(1 − rho) — and is the opposite of the classic Moulton “design effect” intuition (which inflates the variance of a cluster mean, not a difference).

[ ]:

# Compare MDE with different numbers of periods
print("Effect of number of periods on MDE (50 treated, 50 control, sigma=10):")
print(f"{'Periods (pre+post)':>20} {'MDE':>10} {'Improvement':>15}")
print("-" * 50)

baseline_mde = None
for n_pre, n_post in [(1, 1), (2, 2), (3, 3), (5, 5)]:
    result = pa.mde(
        n_treated=50, n_control=50, sigma=10.0,
        n_pre=n_pre, n_post=n_post
    )
    if baseline_mde is None:
        baseline_mde = result.mde
        improvement = "-"
    else:
        improvement = f"{(1 - result.mde/baseline_mde)*100:.1f}%"
    print(f"{n_pre + n_post:>20} {result.mde:>10.2f} {improvement:>15}")

[ ]:

# Effect of within-unit (serial) equicorrelation `rho` on the panel-DiD MDE.
# Per Burlig et al. (2020) Eq. 2 (equicorrelated case), higher rho LOWERS the MDE:
# Var(ATT) carries a (1 - rho) factor because the DiD differences out the shared
# within-unit component.
print("Effect of within-unit correlation on MDE (50 treated, 50 control, 6 periods, sigma=10):")
print(f"{'rho':>10} {'MDE':>10} {'Var factor (1-rho)':>20}")
print("-" * 42)

for rho in [0.0, 0.1, 0.3, 0.5, 0.7]:
    result = pa.mde(
        n_treated=50, n_control=50, sigma=10.0,
        n_pre=3, n_post=3, rho=rho
    )
    # Variance scales with (1 - rho) relative to the rho = 0 case.
    var_factor = 1 - rho
    print(f"{rho:>10.1f} {result.mde:>10.2f} {var_factor:>20.2f}")

7. Simulation-Based Power Analysis#

For complex designs (staggered adoption, synthetic DiD, etc.), analytical formulas may not exist. Use Monte Carlo simulation instead.

The simulate_power function:

Generates synthetic data with known treatment effect
Fits your estimator
Records whether the effect is statistically significant
Repeats many times
Reports the proportion of significant results (= power)

[ ]:

# Simulation-based power analysis
did = DifferenceInDifferences()

results = simulate_power(
    estimator=did,
    n_units=100,
    n_periods=4,
    treatment_effect=5.0,
    treatment_fraction=0.5,
    sigma=5.0,  # Noise level
    n_simulations=200,  # More simulations = more precise power estimate
    seed=42,
    progress=False
)

print(results.summary())

[ ]:

# Key metrics from simulation
print("Simulation Results:")
print(f"  Power (rejection rate): {results.power:.1%}")
print(f"  95% CI for power: [{results.power_ci[0]:.1%}, {results.power_ci[1]:.1%}]")
print(f"")
print("Estimator Performance:")
print(f"  True effect: {results.true_effect:.2f}")
print(f"  Mean estimate: {results.mean_estimate:.2f}")
print(f"  Bias: {results.bias:.4f}")
print(f"  RMSE: {results.rmse:.4f}")
print(f"  Coverage: {results.coverage:.1%}")

Power Curve via Simulation#

[ ]:

# Simulate power for multiple effect sizes
results_multi = simulate_power(
    estimator=did,
    n_units=100,
    n_periods=4,
    effect_sizes=[1.0, 2.0, 3.0, 5.0, 7.0, 10.0],  # Multiple effects
    sigma=5.0,
    n_simulations=100,  # Fewer per effect size for speed
    seed=42,
    progress=False
)

# Get power curve data
power_curve_sim = results_multi.power_curve_df()
print(power_curve_sim)

[ ]:

# Plot simulation-based power curve
plot_power_curve(
    power_curve_sim,
    target_power=0.80,
    title="Simulation-Based Power Curve (100 units, 4 periods, SD=5)",
    figsize=(10, 6)
)
plt.show()

8. Power Analysis for Any Estimator#

The simulation-based approach works with all 12 supported estimators — not just basic DiD. An internal registry automatically selects the appropriate data-generating process (DGP) and fit signature for each registered estimator. Just swap in the estimator object and everything else is handled. See the support table below for the full list, and Section 11 for using custom DGPs with unsupported estimators.

Staggered Adoption Estimators#

[ ]:

# Power analysis with Callaway-Sant'Anna — the registry auto-selects
# generate_staggered_data as the DGP and the correct fit kwargs
cs = CallawaySantAnna()

cs_results = simulate_power(
    estimator=cs,
    n_units=100,
    n_periods=6,
    treatment_effect=5.0,
    treatment_fraction=0.5,
    treatment_period=3,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

print(cs_results.summary())

Factor Model Estimators (Synthetic DiD)#

For SyntheticDiD with the default placebo variance method, the DGP must generate more control than treated units (treatment_fraction < 0.5). The registry uses generate_factor_data automatically.

[ ]:

import warnings

# Synthetic DiD — note treatment_fraction=0.3 (placebo variance requires
# more control units than treated units)
sdid = SyntheticDiD()

# Suppress pre-treatment fit warnings from individual simulation draws;
# the factor-model DGP can trigger this on some random seeds without
# affecting the overall power estimate.
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message="Pre-treatment fit is poor")
    sdid_results = simulate_power(
        estimator=sdid,
        n_units=60,
        n_periods=6,
        treatment_effect=5.0,
        treatment_fraction=0.3,
        treatment_period=3,
        sigma=3.0,
        n_simulations=100,
        seed=42,
        progress=False,
    )

print(sdid_results.summary())

Triple Difference#

TripleDifference power routes by n_periods:

``n_periods ≤ 2`` → the cross-sectional 2×2×2 factorial DGP (generate_ddd_data, group × partition × time). Sample sizes are rounded via ``n_per_cell = max(2, n_units // 8)``, so the minimum effective N is 16 (2 units per cell × 8 cells), and the effective_n_units field tracks any rounding. simulate_sample_size() snaps required_n to multiples of 8.
``n_periods > 2`` → the panel DGP (generate_ddd_panel_data), which honors n_periods/treatment_period. n_units maps directly to panel units (no rounding; effective_n_units is None), and simulate_sample_size() searches a continuous grid. The panel DGP has within-unit serial correlation, so construct the estimator as ``TripleDifference(cluster=”unit”)`` — otherwise the unclustered SEs overstate power and a UserWarning is emitted. treatment_fraction is inert (balanced 2×2×2); pass group_frac/partition_frac via data_generator_kwargs to vary the split.

The example below uses the panel path (n_periods=6):

[ ]:

# Triple Difference — panel path (n_periods > 2 routes to generate_ddd_panel_data).
# cluster="unit" is required for valid inference on panel-generated data.
ddd = TripleDifference(cluster="unit")

ddd_results = simulate_power(
    estimator=ddd,
    n_units=64,
    n_periods=6,
    treatment_period=3,
    treatment_effect=3.0,
    sigma=2.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

print(ddd_results.summary())
# effective_n_units is None on the panel path (n_units maps directly).
print(f"\nEffective N: {ddd_results.effective_n_units}")

Supported Estimators#

The following 12 estimators are supported by the simulation power analysis registry. Each is automatically paired with the correct data-generating process:

DGP Family	Estimators	Min N
Basic DiD (`generate_did_data`)	DifferenceInDifferences, TwoWayFixedEffects, MultiPeriodDiD	20
Staggered (`generate_staggered_data`)	CallawaySantAnna, SunAbraham, ImputationDiD, TwoStageDiD, StackedDiD, EfficientDiD	40
Factor Model (`generate_factor_data`)	TROP, SyntheticDiD	30
Triple Difference (`generate_ddd_data` / `generate_ddd_panel_data`)	TripleDifference	16*

* DDD routes by n_periods: for n_periods ≤ 2 the cross-sectional generate_ddd_data rounds effective N to max(2, n_units // 8) * 8 (minimum 16) and simulate_sample_size() starts its bisection at the registry floor of 64 (an explicit n_range can still go down to the 16-unit realizability floor); for n_periods > 2 the panel generate_ddd_panel_data maps n_units directly (use TripleDifference(cluster="unit")).

Note: ContinuousDiD is not in the registry because continuous/dose-response treatments require a different DGP structure. BaconDecomposition and HonestDiD are diagnostic/sensitivity tools rather than treatment effect estimators. For unsupported estimators, you can pass a custom data_generator and result_extractor (see Section 11).

Power Curve for a Staggered Estimator#

[ ]:

# Power curve across effect sizes for Callaway-Sant'Anna
cs_curve = simulate_power(
    estimator=CallawaySantAnna(),
    n_units=100,
    n_periods=6,
    effect_sizes=[1.0, 2.0, 3.0, 5.0, 7.0],
    treatment_period=3,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

plot_power_curve(
    cs_curve.power_curve_df(),
    target_power=0.80,
    title="CS Power Curve (100 units, 6 periods, SD=5)",
    figsize=(10, 6),
)
plt.show()

9. Finding MDE via Simulation#

The analytical PowerAnalysis.mde() works for basic DiD, but for complex estimators there is no closed-form formula. simulate_mde() uses bisection search to find the minimum detectable effect: it repeatedly calls simulate_power() at different effect sizes, narrowing the bracket until it finds the smallest effect that achieves the target power.

[ ]:

# Find MDE for basic DiD via simulation
mde_result = simulate_mde(
    DifferenceInDifferences(),
    n_units=100,
    n_periods=4,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

print(mde_result.summary())

Inspecting the Search Path#

The search_path attribute records the effect size and power at each bisection step, which is useful for diagnosing convergence:

[ ]:

# search_path is a List[Dict] — convert to DataFrame for display
search_df = pd.DataFrame(mde_result.search_path)
print(search_df.to_string(index=False))

MDE for a Staggered Estimator#

The same function works with any registered estimator:

[ ]:

# MDE for Callaway-Sant'Anna
cs_mde = simulate_mde(
    CallawaySantAnna(),
    n_units=100,
    n_periods=6,
    treatment_period=3,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

print(cs_mde.summary())

Key parameters for ``simulate_mde()``:

effect_range=(lo, hi) — custom search bracket (auto-detected if omitted)
tol — convergence tolerance on power (default 0.02)
max_steps — maximum bisection steps (default 15)
n_simulations — simulations per step (use 500+ for production analyses)

10. Finding Required Sample Size via Simulation#

simulate_sample_size() uses bisection search over n_units to find the smallest sample size that achieves the target power for a given effect size.

[ ]:

# Find required sample size for basic DiD
n_result = simulate_sample_size(
    DifferenceInDifferences(),
    treatment_effect=5.0,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
)

print(n_result.summary())

Inspecting the Search Path#

[ ]:

# View the bisection steps
n_search_df = pd.DataFrame(n_result.search_path)
print(n_search_df.to_string(index=False))

Comparing Analytical and Simulation Results#

For basic DiD, we can compare the simulation result against the analytical formula. With only 100 simulations per bisection step there will be Monte Carlo noise, so we expect approximate — not exact — agreement:

[ ]:

# Analytical sample size
analytical = pa.sample_size(effect_size=5.0, sigma=5.0)

print(f"{'Method':<25} {'Required N':>12}")
print("-" * 40)
print(f"{'Analytical:':<25} {analytical.required_n:>12}")
print(f"{'Simulation:':<25} {n_result.required_n:>12}")
print(f"\nSimulation power at N: {n_result.power_at_n:.1%}")

Key parameters for ``simulate_sample_size()``:

n_range=(lo, hi) — custom search bracket for sample size (auto-detected if omitted)
max_steps — maximum bisection steps (default 15)
n_simulations — simulations per step (use 500+ for production analyses)

11. Custom Data Generators#

The default DGPs cover common designs, but you can customize them in two ways:

Tweak the default DGP with data_generator_kwargs (e.g., add multiple treatment cohorts)
Supply a fully custom DGP with data_generator

Tweaking the Default DGP#

Pass additional keyword arguments to the registry’s DGP via data_generator_kwargs. For example, the default staggered DGP generates a single treatment cohort — here we create a multi-cohort design:

Note: Some keys are protected and cannot be overridden via data_generator_kwargs because they are controlled by the simulation function itself: treatment_effect, noise_sd, n_units, n_periods, treatment_fraction, treatment_period, n_pre, n_post.

[ ]:

# Multi-cohort staggered design: treatment starts at periods 2 and 4
cs_multi = simulate_power(
    estimator=CallawaySantAnna(),
    n_units=120,
    n_periods=6,
    treatment_effect=5.0,
    sigma=5.0,
    n_simulations=100,
    seed=42,
    progress=False,
    data_generator_kwargs={
        "cohort_periods": [2, 4],
        "never_treated_frac": 0.3,
    },
)

print(cs_multi.summary())

Fully Custom Data Generator#

For designs not covered by the built-in DGPs, supply your own data_generator function. It receives the standard simulation parameters and must return a DataFrame. You may also need a custom result_extractor if your estimator returns non-standard results, and estimator_kwargs to pass the right column names to fit().

[ ]:

def my_dgp(n_units, n_periods, treatment_effect, treatment_fraction,
           treatment_period, noise_sd, seed=None):
    """Custom DGP with heterogeneous unit effects."""
    rng = np.random.default_rng(seed)
    n_treat = int(n_units * treatment_fraction)

    rows = []
    for i in range(n_units):
        unit_fe = rng.normal(0, 3)  # heterogeneous unit effect
        treated_unit = i < n_treat
        for t in range(n_periods):
            post = int(t >= treatment_period)
            effect = treatment_effect * post if treated_unit else 0.0
            y = unit_fe + 2.0 * t + effect + rng.normal(0, noise_sd)
            rows.append({
                "unit": i, "period": t, "outcome": y,
                "ever_treated": int(treated_unit), "post": post,
            })
    return pd.DataFrame(rows)

# Use the custom DGP with simulate_power
custom_results = simulate_power(
    estimator=DifferenceInDifferences(),
    n_units=80,
    n_periods=4,
    treatment_effect=4.0,
    sigma=3.0,
    n_simulations=100,
    seed=42,
    progress=False,
    data_generator=my_dgp,
    estimator_kwargs={"outcome": "outcome", "treatment": "ever_treated", "time": "post"},
)

print(custom_results.summary())

12. Convenience Functions#

For quick calculations, use the convenience functions:

[ ]:

# Quick MDE calculation
mde = compute_mde(n_treated=100, n_control=100, sigma=10.0, power=0.80)
print(f"MDE: {mde:.2f}")

# Quick power calculation
power = compute_power(effect_size=5.0, n_treated=100, n_control=100, sigma=10.0)
print(f"Power: {power:.1%}")

# Quick sample size calculation
n = compute_sample_size(effect_size=5.0, sigma=10.0, power=0.80)
print(f"Required N: {n}")

13. Practical Recommendations#

Estimating Sigma (Residual SD)#

The residual standard deviation is crucial for power calculations. Options:

Pilot data: Fit a model on historical data and get residual SD
Literature: Find similar studies and use their reported SDs
Domain knowledge: Expert judgment about outcome variability
Sensitivity analysis: Calculate power for a range of sigma values

[ ]:

# Sensitivity to sigma
print("MDE sensitivity to residual SD (50 treated, 50 control):")
print(f"{'Sigma':>10} {'MDE':>10} {'Effect/SD':>12}")
print("-" * 35)

for sigma in [5.0, 7.5, 10.0, 12.5, 15.0]:
    result = pa.mde(n_treated=50, n_control=50, sigma=sigma)
    print(f"{sigma:>10.1f} {result.mde:>10.2f} {result.mde/sigma:>12.3f}")

Choosing a Target Power#

80% power is conventional but arbitrary
90% power is more conservative and recommended for important studies
Consider the cost of Type II errors (missing a real effect) vs Type I errors

[ ]:

# Compare required sample sizes for different power levels
print("Required sample size by target power (effect=5, sigma=10):")
print(f"{'Power':>10} {'Required N':>15}")
print("-" * 30)

for power in [0.70, 0.80, 0.90, 0.95]:
    pa_custom = PowerAnalysis(power=power)
    result = pa_custom.sample_size(effect_size=5.0, sigma=10.0)
    print(f"{power:>10.0%} {result.required_n:>15}")

Analytical vs. Simulation: When to Use Each#

Approach	Best for	Advantages
Analytical (`PowerAnalysis`)	Basic 2×2 DiD, panel DiD	Fast, exact, closed-form
Simulation (`simulate_power/mde/sample_size`)	Staggered, SDID, TROP, DDD, custom designs	Works with any estimator, reports bias/RMSE/coverage

Rule of thumb: Start with analytical power analysis for basic designs. Move to simulation when using specialized estimators or non-standard DGPs.

Summary#

Key takeaways for DiD power analysis:

Always do a power analysis before running a study
MDE decreases with sample size, more periods, and lower variance
Within-unit correlation helps in panel DiD — higher serial correlation lowers the MDE, because the difference-in-differences cancels the shared within-unit component (Burlig et al. 2020, Eq. 2, equicorrelated case)
Use simulation for complex designs (staggered, synthetic DiD, triple difference)
12 estimators are supported out of the box via the auto-registry — just swap in the estimator
``simulate_mde()`` and ``simulate_sample_size()`` extend MDE and sample size calculations to any estimator via bisection search
Custom DGPs let you model non-standard designs with data_generator and data_generator_kwargs
Be realistic about sigma — err on the side of larger values
Consider your smallest meaningful effect — don’t just target statistical significance

For more on DiD estimation, see the other tutorials:

01_basic_did.ipynb — Basic DiD estimation
02_staggered_did.ipynb — Staggered adoption designs
03_synthetic_did.ipynb — Synthetic DiD
04_parallel_trends.ipynb — Testing assumptions
05_honest_did.ipynb — Sensitivity analysis
07_pretrends_power.ipynb — Pre-trends power analysis (Roth 2022)