Utilities

Statistical utilities for parallel trends testing, robust standard errors, and bootstrap inference.

Parallel Trends Testing

check_parallel_trends

Test for parallel trends using pre-treatment data.

diff_diff.check_parallel_trends(data, outcome, time, treatment_group, pre_periods=None)[source]

Perform a simple check for parallel trends assumption.

This computes the trend (slope) in the outcome variable for both treatment and control groups during pre-treatment periods.

Parameters:

data (pd.DataFrame) – Panel data.
outcome (str) – Name of outcome variable column.
time (str) – Name of time period column.
treatment_group (str) – Name of treatment group indicator column.
pre_periods (list, optional) – List of pre-treatment time periods. If None, infers from data.

Returns:

Dictionary with trend statistics and test results.

Return type:

dict

Example

from diff_diff import check_parallel_trends

result = check_parallel_trends(
    data,
    outcome='y',
    unit='unit_id',
    time='period',
    treated='treated',
    pre_periods=4
)

print(f"F-statistic: {result['f_stat']:.3f}")
print(f"p-value: {result['p_value']:.3f}")

if result['p_value'] > 0.05:
    print("Cannot reject parallel trends")
else:
    print("Evidence against parallel trends")

check_parallel_trends_robust

Robust parallel trends test using Wasserstein distance with permutation-based inference.

diff_diff.check_parallel_trends_robust(data, outcome, time, treatment_group, unit=None, pre_periods=None, n_permutations=1000, seed=None, wasserstein_threshold=0.2)[source]

Perform robust parallel trends testing using distributional comparisons.

Uses the Wasserstein (Earth Mover’s) distance to compare the full distribution of outcome changes between treated and control groups, with permutation-based inference.

Parameters:

data (pd.DataFrame) – Panel data with repeated observations over time.
outcome (str) – Name of outcome variable column.
time (str) – Name of time period column.
treatment_group (str) – Name of treatment group indicator column (0/1).
unit (str, optional) – Name of unit identifier column. If provided, computes unit-level changes. Otherwise uses observation-level data.
pre_periods (list, optional) – List of pre-treatment time periods. If None, uses first half of periods.
n_permutations (int, default=1000) – Number of permutations for computing p-value.
seed (int, optional) – Random seed for reproducibility.
wasserstein_threshold (float, default=0.2) – Threshold for normalized Wasserstein distance. Values below this threshold (combined with p > 0.05) suggest parallel trends are plausible.

Returns:

Dictionary containing: - wasserstein_distance: Wasserstein distance between group distributions - wasserstein_p_value: Permutation-based p-value - ks_statistic: Kolmogorov-Smirnov test statistic - ks_p_value: KS test p-value - mean_difference: Difference in mean changes - variance_ratio: Ratio of variances in changes - treated_changes: Array of outcome changes for treated - control_changes: Array of outcome changes for control - parallel_trends_plausible: Boolean assessment

Return type:

dict

Examples

>>> results = check_parallel_trends_robust(
...     data, outcome='sales', time='year',
...     treatment_group='treated', unit='firm_id'
... )
>>> print(f"Wasserstein distance: {results['wasserstein_distance']:.4f}")
>>> print(f"P-value: {results['wasserstein_p_value']:.4f}")

Notes

The Wasserstein distance (Earth Mover’s Distance) measures the minimum “cost” of transforming one distribution into another. Unlike simple mean comparisons, it captures differences in the entire distribution shape, making it more robust to non-normal data and heterogeneous effects.

A small Wasserstein distance and high p-value suggest the distributions of pre-treatment changes are similar, supporting the parallel trends assumption.

equivalence_test_trends

Equivalence test for parallel trends (TOST procedure).

diff_diff.equivalence_test_trends(data, outcome, time, treatment_group, unit=None, pre_periods=None, equivalence_margin=None)[source]

Perform equivalence testing (TOST) for parallel trends.

Tests whether the difference in trends is practically equivalent to zero using Two One-Sided Tests (TOST) procedure.

Parameters:

data (pd.DataFrame) – Panel data.
outcome (str) – Name of outcome variable column.
time (str) – Name of time period column.
treatment_group (str) – Name of treatment group indicator column.
unit (str, optional) – Name of unit identifier column.
pre_periods (list, optional) – List of pre-treatment time periods.
equivalence_margin (float, optional) – The margin for equivalence (delta). If None, uses 0.5 * pooled SD of outcome changes as a default.

Returns:

Dictionary containing: - mean_difference: Difference in mean changes - equivalence_margin: The margin used - lower_p_value: P-value for lower bound test - upper_p_value: P-value for upper bound test - tost_p_value: Maximum of the two p-values - equivalent: Boolean indicating equivalence at alpha=0.05

Return type:

dict

Example

from diff_diff import equivalence_test_trends

# Test if pre-trends are equivalent to zero within bounds
result = equivalence_test_trends(
    data,
    outcome='y',
    unit='unit_id',
    time='period',
    treated='treated',
    equivalence_bound=0.5  # Effect size bound
)

if result['equivalent']:
    print("Pre-trends are practically equivalent to zero")

Wild Cluster Bootstrap

wild_bootstrap_se

Compute wild cluster bootstrap standard errors.

diff_diff.wild_bootstrap_se(X, y, residuals, cluster_ids, coefficient_index, n_bootstrap=999, weight_type='rademacher', null_hypothesis=0.0, alpha=0.05, seed=None, return_distribution=False)[source]

Compute wild cluster bootstrap standard errors and p-values.

Implements the Wild Cluster Residual (WCR) bootstrap procedure from Cameron, Gelbach, and Miller (2008). Uses the restricted residuals approach (imposing H0: coefficient = null_hypothesis) for more accurate p-value computation.

Parameters:

X (np.ndarray) – Design matrix of shape (n, k).
y (np.ndarray) – Outcome vector of shape (n,).
residuals (np.ndarray) – OLS residuals from unrestricted regression, shape (n,).
cluster_ids (np.ndarray) – Cluster identifiers of shape (n,).
coefficient_index (int) – Index of the coefficient for which to compute bootstrap inference. For DiD, this is typically 3 (the treatment*post interaction term).
n_bootstrap (int, default=999) – Number of bootstrap replications. Odd numbers are recommended for exact p-value computation.
weight_type (str, default="rademacher") – Type of bootstrap weights: - “rademacher”: +1 or -1 with equal probability (standard choice) - “webb”: 6-point distribution (recommended for <10 clusters) - “mammen”: Two-point distribution with skewness correction
null_hypothesis (float, default=0.0) – Value of the null hypothesis for p-value computation.
alpha (float, default=0.05) – Significance level for confidence interval.
seed (int, optional) – Random seed for reproducibility. If None (default), results will vary between runs.
return_distribution (bool, default=False) – If True, include full bootstrap distribution in results.

Returns:

Dataclass containing bootstrap SE, p-value, confidence interval, and other inference results.

Return type:

WildBootstrapResults

Raises:

ValueError – If weight_type is not recognized or if there are fewer than 2 clusters.

Warns:

UserWarning – If the number of clusters is less than 5, as bootstrap inference may be unreliable.

Examples

>>> from diff_diff.utils import wild_bootstrap_se
>>> results = wild_bootstrap_se(
...     X, y, residuals, cluster_ids,
...     coefficient_index=3,  # ATT coefficient
...     n_bootstrap=999,
...     weight_type="rademacher",
...     seed=42
... )
>>> print(f"Bootstrap SE: {results.se:.4f}")
>>> print(f"Bootstrap p-value: {results.p_value:.4f}")

References

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-Based Improvements for Inference with Clustered Errors. The Review of Economics and Statistics, 90(3), 414-427.

MacKinnon, J. G., & Webb, M. D. (2018). The wild bootstrap for few (treated) clusters. The Econometrics Journal, 21(2), 114-135.

Example

from diff_diff import DifferenceInDifferences, wild_bootstrap_se

# Fit model
did = DifferenceInDifferences()
results = did.fit(data, outcome='y', treated='treated', post='post')

# Bootstrap standard errors
boot_results = wild_bootstrap_se(
    data,
    outcome='y',
    treated='treated',
    post='post',
    cluster='unit_id',
    n_bootstrap=999,
    weight_type='rademacher'
)

print(f"Bootstrap SE: {boot_results.se:.3f}")
print(f"Bootstrap 95% CI: [{boot_results.ci[0]:.3f}, {boot_results.ci[1]:.3f}]")

WildBootstrapResults

Container for wild bootstrap results.

class diff_diff.WildBootstrapResults[source]

Bases: object

Results from wild cluster bootstrap inference.

se

Bootstrap standard error of the coefficient.

Type:: float

p_value

Bootstrap p-value (two-sided).

Type:: float

t_stat_original

Original t-statistic from the data.

Type:: float

ci_lower

Lower bound of the confidence interval.

Type:: float

ci_upper

Upper bound of the confidence interval.

Type:: float

n_clusters

Number of clusters in the data.

Type:: int

n_bootstrap

Number of bootstrap replications.

Type:: int

weight_type

Type of bootstrap weights used (“rademacher”, “webb”, or “mammen”).

Type:: str

alpha

Significance level used for confidence interval.

Type:: float

bootstrap_distribution

Full bootstrap distribution of coefficients (if requested).

Type:: np.ndarray, optional

References

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-Based Improvements for Inference with Clustered Errors. The Review of Economics and Statistics, 90(3), 414-427.

se: float

p_value: float

t_stat_original: float

ci_lower: float

ci_upper: float

n_clusters: int

n_bootstrap: int

weight_type: str

alpha: float = 0.05

bootstrap_distribution: ndarray | None = None

summary()[source]

Generate formatted summary of bootstrap results.

Return type:: str

print_summary()[source]

Print formatted summary to stdout.

Return type:: None

__init__(se, p_value, t_stat_original, ci_lower, ci_upper, n_clusters, n_bootstrap, weight_type, alpha=0.05, bootstrap_distribution=None)

Parameters:

se (float)
p_value (float)
t_stat_original (float)
ci_lower (float)
ci_upper (float)
n_clusters (int)
n_bootstrap (int)
weight_type (str)
alpha (float)
bootstrap_distribution (ndarray | None)

Return type:

None

Weight Types

The wild bootstrap supports several weight distributions:

'rademacher': ±1 with equal probability (default, good general choice)
'mammen': Two-point distribution matching higher moments
'webb': Six-point distribution, better for few clusters

# Using different weight types
boot_rad = wild_bootstrap_se(data, ..., weight_type='rademacher')
boot_webb = wild_bootstrap_se(data, ..., weight_type='webb')
boot_mammen = wild_bootstrap_se(data, ..., weight_type='mammen')

Recommendation

Use 'rademacher' (default) for most cases
Use 'webb' when you have fewer than 10 clusters
The n_bootstrap should typically be at least 999 for reliable inference