Utilities#

Statistical utilities for parallel trends testing, robust standard errors, and bootstrap inference.

Parallel Trends Testing#

check_parallel_trends#

Test for parallel trends using pre-treatment data.

diff_diff.check_parallel_trends(data, outcome, time, treatment_group, pre_periods=None)[source]#

Perform a simple check for parallel trends assumption.

This computes the trend (slope) in the outcome variable for both treatment and control groups during pre-treatment periods.

Parameters:

data (pd.DataFrame) – Panel data.
outcome (str) – Name of outcome variable column.
time (str) – Name of time period column.
treatment_group (str) – Name of treatment group indicator column.
pre_periods (list, optional) – List of pre-treatment time periods. If None, infers from data.

Returns:

Dictionary with trend statistics and test results.

Return type:

dict

Example#

from diff_diff import check_parallel_trends

result = check_parallel_trends(
    data,
    outcome='y',
    time='period',
    treatment_group='treated',
    pre_periods=[0, 1, 2, 3]
)

print(f"t-statistic: {result['t_statistic']:.3f}")
print(f"p-value: {result['p_value']:.3f}")

if result['p_value'] > 0.05:
    print("Cannot reject parallel trends")
else:
    print("Evidence against parallel trends")

check_parallel_trends_robust#

Robust parallel trends test using Wasserstein distance with permutation-based inference.

diff_diff.check_parallel_trends_robust(data, outcome, time, treatment_group, unit=None, pre_periods=None, n_permutations=1000, seed=None, wasserstein_threshold=0.2)[source]#

Perform robust parallel trends testing using distributional comparisons.

Uses the Wasserstein (Earth Mover’s) distance to compare the full distribution of outcome changes between treated and control groups, with permutation-based inference.

Parameters:

data (pd.DataFrame) – Panel data with repeated observations over time.
outcome (str) – Name of outcome variable column.
time (str) – Name of time period column.
treatment_group (str) – Name of treatment group indicator column (0/1).
unit (str, optional) – Name of unit identifier column. If provided, computes unit-level changes. Otherwise uses observation-level data.
pre_periods (list, optional) – List of pre-treatment time periods. If None, uses first half of periods.
n_permutations (int, default=1000) – Number of permutations for computing p-value.
seed (int, optional) – Random seed for reproducibility.
wasserstein_threshold (float, default=0.2) – Threshold for normalized Wasserstein distance. Values below this threshold (combined with p > 0.05) suggest parallel trends are plausible.

Returns:

Dictionary containing: - wasserstein_distance: Wasserstein distance between group distributions - wasserstein_p_value: Permutation-based p-value - ks_statistic: Kolmogorov-Smirnov test statistic - ks_p_value: KS test p-value - mean_difference: Difference in mean changes - variance_ratio: Ratio of variances in changes - treated_changes: Array of outcome changes for treated - control_changes: Array of outcome changes for control - parallel_trends_plausible: Boolean assessment

Return type:

dict

Examples

>>> results = check_parallel_trends_robust(
...     data, outcome='sales', time='year',
...     treatment_group='treated', unit='firm_id'
... )
>>> print(f"Wasserstein distance: {results['wasserstein_distance']:.4f}")
>>> print(f"P-value: {results['wasserstein_p_value']:.4f}")

Notes

The Wasserstein distance (Earth Mover’s Distance) measures the minimum “cost” of transforming one distribution into another. Unlike simple mean comparisons, it captures differences in the entire distribution shape, making it more robust to non-normal data and heterogeneous effects.

A small Wasserstein distance and high p-value suggest the distributions of pre-treatment changes are similar, supporting the parallel trends assumption.

equivalence_test_trends#

Equivalence test for parallel trends (TOST procedure).

diff_diff.equivalence_test_trends(data, outcome, time, treatment_group, unit=None, pre_periods=None, equivalence_margin=None)[source]#

Perform equivalence testing (TOST) for parallel trends.

Tests whether the difference in trends is practically equivalent to zero using Two One-Sided Tests (TOST) procedure.

Parameters:

data (pd.DataFrame) – Panel data.
outcome (str) – Name of outcome variable column.
time (str) – Name of time period column.
treatment_group (str) – Name of treatment group indicator column.
unit (str, optional) – Name of unit identifier column.
pre_periods (list, optional) – List of pre-treatment time periods.
equivalence_margin (float, optional) – The margin for equivalence (delta). If None, uses 0.5 * pooled SD of outcome changes as a default.

Returns:

Dictionary containing: - mean_difference: Difference in mean changes - equivalence_margin: The margin used - lower_p_value: P-value for lower bound test - upper_p_value: P-value for upper bound test - tost_p_value: Maximum of the two p-values - equivalent: Boolean indicating equivalence at alpha=0.05

Return type:

dict

Example#

from diff_diff import equivalence_test_trends

# Test if pre-trends are equivalent to zero within bounds
result = equivalence_test_trends(
    data,
    outcome='y',
    time='period',
    treatment_group='treated',
    equivalence_margin=0.5  # Effect size bound
)

if result['equivalent']:
    print("Pre-trends are practically equivalent to zero")

Wild Cluster Bootstrap#

wild_bootstrap_se#

Compute wild cluster bootstrap standard errors.

diff_diff.wild_bootstrap_se(X, y, residuals, cluster_ids, coefficient_index, n_bootstrap=999, weight_type='rademacher', null_hypothesis=0.0, alpha=0.05, seed=None, return_distribution=False, p_val_type='two-tailed')[source]#

Compute wild cluster bootstrap standard errors and p-values.

Implements the Wild Cluster Restricted (WCR) bootstrap of Cameron, Gelbach, and Miller (2008), matching the defaults of R’s fwildclusterboot::boottest (Roodman, MacKinnon, Nielsen & Webb 2019): the null H0: coefficient = null_hypothesis is genuinely imposed by re-estimating the model with the coefficient’s column dropped, the bootstrap DGP resamples the restricted residuals, and the confidence interval is obtained by inverting the bootstrap test (the set of null values not rejected at level alpha) so that the p-value and CI are mutually consistent (0 in CI iff p >= alpha). For Rademacher weights with few clusters all 2**n_clusters sign-vectors are enumerated (deterministic) when 2**n_clusters <= n_bootstrap (the boottest full-enumeration trigger — it switches to enumeration once n_bootstrap reaches the number of possible draws) and n_clusters <= 20 (a memory guard); the reported n_bootstrap is then 2**n_clusters. Otherwise signs are sampled.

The reported se is the analytical cluster-robust (CR1) standard error of the original estimate — the studentized bootstrap drives the p-value and CI, not a re-scaled bootstrap dispersion.

Parameters:

X (np.ndarray) – Design matrix of shape (n, k).
y (np.ndarray) – Outcome vector of shape (n,).
residuals (np.ndarray) – Retained for backward compatibility and IGNORED by the WCR implementation, which recomputes the original fit and the restricted (null-imposed) residualization internally from X and y.
cluster_ids (np.ndarray) – Cluster identifiers of shape (n,).
coefficient_index (int) – Index of the coefficient for which to compute bootstrap inference. For DiD, this is typically 3 (the treatment*post interaction term).
n_bootstrap (int, default=999) – Number of bootstrap replications. Odd numbers are recommended for exact p-value computation.
weight_type (str, default="rademacher") – Type of bootstrap weights: - “rademacher”: +1 or -1 with equal probability (standard choice) - “webb”: 6-point distribution (recommended for <10 clusters) - “mammen”: Two-point distribution with skewness correction
null_hypothesis (float, default=0.0) – Value of the null hypothesis for p-value computation.
alpha (float, default=0.05) – Significance level for confidence interval.
seed (int, optional) – Random seed for reproducibility. If None (default), results will vary between runs.
return_distribution (bool, default=False) – If True, include the bootstrap distribution of the studentized statistic t* (evaluated at the null) in the results.
p_val_type (str, default="two-tailed") –
Shape of the test (mirrors boottest’s p_val_type):
- ”two-tailed”: test on |t*|; two-tailed CI by inversion (the interval need not be symmetric about the estimate).
- ”equal-tailed”: each tail tested at alpha/2; equal-tailed CI.

Returns:

Dataclass containing bootstrap SE, p-value, confidence interval, and other inference results.

Return type:

WildBootstrapResults

Raises:

ValueError – If weight_type is not recognized or if there are fewer than 2 clusters.

Warns:

UserWarning – If the number of clusters is less than 5, as bootstrap inference may be unreliable.

Examples

>>> from diff_diff.utils import wild_bootstrap_se
>>> results = wild_bootstrap_se(
...     X, y, residuals, cluster_ids,
...     coefficient_index=3,  # ATT coefficient
...     n_bootstrap=999,
...     weight_type="rademacher",
...     seed=42
... )
>>> print(f"Bootstrap SE: {results.se:.4f}")
>>> print(f"Bootstrap p-value: {results.p_value:.4f}")

References

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-Based Improvements for Inference with Clustered Errors. The Review of Economics and Statistics, 90(3), 414-427.

MacKinnon, J. G., & Webb, M. D. (2018). The wild bootstrap for few (treated) clusters. The Econometrics Journal, 21(2), 114-135.

Example#

from diff_diff import DifferenceInDifferences, generate_did_data

panel = generate_did_data(n_units=200, n_periods=10, treatment_effect=2.0)

# Use wild bootstrap via the estimator's inference parameter (recommended)
did = DifferenceInDifferences(inference='wild_bootstrap', n_bootstrap=999,
                               cluster='unit')
results = did.fit(panel, outcome='outcome', treatment='treated',
                  time='post')

print(f"Bootstrap SE: {results.se:.3f}")
print(f"Bootstrap 95% CI: [{results.conf_int[0]:.3f}, {results.conf_int[1]:.3f}]")

Note

wild_bootstrap_se() is a low-level function that operates on numpy arrays (X, y, residuals, cluster_ids). For most users, the estimator-level inference='wild_bootstrap' parameter shown above is more convenient.

WildBootstrapResults#

Container for wild bootstrap results.

class diff_diff.WildBootstrapResults[source]

Bases: object

Results from wild cluster bootstrap inference.

se

Analytical cluster-robust (CR1) standard error of the coefficient. The wild bootstrap studentizes the test with this SE; it is not a rescaled bootstrap dispersion.

Type:: float

p_value

Wild cluster bootstrap p-value (two-tailed or equal-tailed).

Type:: float

t_stat_original

Studentized statistic of the original estimate, (coef - null) / se.

Type:: float

ci_lower

Lower bound of the confidence interval (by test inversion).

Type:: float

ci_upper

Upper bound of the confidence interval (by test inversion).

Type:: float

n_clusters

Number of clusters in the data.

Type:: int

n_bootstrap

Number of bootstrap replications.

Type:: int

weight_type

Type of bootstrap weights used (“rademacher”, “webb”, or “mammen”).

Type:: str

alpha

Significance level used for confidence interval.

Type:: float

p_val_type

Test shape used (“two-tailed” or “equal-tailed”).

Type:: str

bootstrap_distribution

Bootstrap distribution of the studentized statistic t* evaluated at the null (if requested).

Type:: np.ndarray, optional

References

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-Based Improvements for Inference with Clustered Errors. The Review of Economics and Statistics, 90(3), 414-427.

se: float

p_value: float

t_stat_original: float

ci_lower: float

ci_upper: float

n_clusters: int

n_bootstrap: int

weight_type: str

alpha: float = 0.05

p_val_type: str = 'two-tailed'

bootstrap_distribution: ndarray | None = None

summary()[source]

Generate formatted summary of bootstrap results.

Return type:: str

print_summary()[source]

Print formatted summary to stdout.

Return type:: None

__init__(se, p_value, t_stat_original, ci_lower, ci_upper, n_clusters, n_bootstrap, weight_type, alpha=0.05, p_val_type='two-tailed', bootstrap_distribution=None)

Parameters:

se (float)
p_value (float)
t_stat_original (float)
ci_lower (float)
ci_upper (float)
n_clusters (int)
n_bootstrap (int)
weight_type (str)
alpha (float)
p_val_type (str)
bootstrap_distribution (ndarray | None)

Return type:

None

Weight Types#

The wild bootstrap supports several weight distributions:

'rademacher': ±1 with equal probability (default, good general choice)
'mammen': Two-point distribution matching higher moments
'webb': Six-point distribution, better for few clusters

# Using different weight types (low-level array API)
# wild_bootstrap_se(X, y, residuals, cluster_ids, coefficient_index, ...)
boot_rad = wild_bootstrap_se(X, y, resid, clusters, 0, weight_type='rademacher')
boot_webb = wild_bootstrap_se(X, y, resid, clusters, 0, weight_type='webb')
boot_mammen = wild_bootstrap_se(X, y, resid, clusters, 0, weight_type='mammen')

Recommendation#

Use 'rademacher' (default) for most cases
Use 'webb' when you have fewer than 10 clusters
The n_bootstrap should typically be at least 999 for reliable inference