Datasets#

Built-in real-world datasets from published studies for examples, tutorials, and testing.

All datasets are downloaded from public sources on first use and cached locally at ~/.cache/diff_diff/datasets/. Pass force_download=True to any loader to refresh the cache. If the download fails and a cached copy exists, the cached version is used automatically.
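The caching rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `load_with_cache`, `failing_fetch`, and the file name `demo.csv` are hypothetical, and a download failure is modelled as an `OSError`.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_with_cache(cache_file: Path, fetch: Callable[[], str],
                    force_download: bool = False) -> str:
    """Hypothetical helper: return dataset text, preferring the local cache."""
    if cache_file.exists() and not force_download:
        return cache_file.read_text()  # cache hit: no download attempted
    try:
        text = fetch()  # first use, or an explicit refresh
    except OSError:
        if cache_file.exists():
            return cache_file.read_text()  # download failed: fall back to cache
        raise  # nothing cached, so the failure is surfaced to the caller
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text)
    return text


def failing_fetch() -> str:
    """Simulates a download that cannot reach the public source."""
    raise OSError("simulated network failure")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "datasets" / "demo.csv"
    first = load_with_cache(cache, lambda: "a,b\n1,2\n")   # downloads and caches
    cached = load_with_cache(cache, lambda: "a,b\n9,9\n")  # served from the cache
    refreshed = load_with_cache(cache, lambda: "a,b\n3,4\n", force_download=True)
    fallback = load_with_cache(cache, failing_fetch, force_download=True)
```

In this sketch the second call never invokes its fetch function, and the last call returns the refreshed copy because the simulated download fails while a cached file exists.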

Dataset Loaders#

load_card_krueger#

Card & Krueger (1994) minimum wage study. Classic 2x2 DiD comparing fast-food employment in New Jersey (treated) and Pennsylvania (control) around NJ’s 1992 minimum wage increase.

diff_diff.load_card_krueger(force_download=False)[source]#

Load the Card & Krueger (1994) minimum wage dataset.

This classic dataset examines the effect of New Jersey’s 1992 minimum wage increase on employment in fast-food restaurants, using Pennsylvania as a control group.

The study is a canonical example of the Difference-in-Differences method.

Parameters:

force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

Dataset with columns:

- store_id (int) – Unique store identifier
- state (str) – ‘NJ’ (New Jersey, treated) or ‘PA’ (Pennsylvania, control)
- chain (str) – Fast food chain (‘bk’, ‘kfc’, ‘roys’, ‘wendys’)
- emp_pre (float) – Full-time equivalent employment before (Feb 1992)
- emp_post (float) – Full-time equivalent employment after (Nov 1992)
- wage_pre (float) – Starting wage before
- wage_post (float) – Starting wage after
- treated (int) – 1 if NJ, 0 if PA
- emp_change (float) – Change in employment (emp_post - emp_pre)

Return type:

pd.DataFrame

Notes

The minimum wage in New Jersey increased from $4.25 to $5.05 on April 1, 1992. Pennsylvania’s minimum wage remained at $4.25.

Original finding: No significant negative effect of the minimum wage increase on employment (ATT ≈ +2.8 FTE employees).
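For intuition, the 2x2 estimate is the mean employment change in the treated group minus the mean change in the control group. The sketch below works through that arithmetic on four hypothetical stores laid out in the loader's schema; the numbers are made up for illustration and are not taken from the dataset, and `mean_change` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical rows in the loader's schema; the values are NOT from the dataset.
stores = [
    {"store_id": 1, "treated": 1, "emp_pre": 20.0, "emp_post": 21.0},
    {"store_id": 2, "treated": 1, "emp_pre": 18.0, "emp_post": 18.5},
    {"store_id": 3, "treated": 0, "emp_pre": 24.0, "emp_post": 22.0},
    {"store_id": 4, "treated": 0, "emp_pre": 22.0, "emp_post": 20.5},
]


def mean_change(rows, treated):
    """Average of emp_post - emp_pre within one treatment group."""
    return mean(r["emp_post"] - r["emp_pre"] for r in rows if r["treated"] == treated)


nj_change = mean_change(stores, 1)    # mean of (1.0, 0.5)   = 0.75
pa_change = mean_change(stores, 0)    # mean of (-2.0, -1.5) = -1.75
did_estimate = nj_change - pa_change  # 0.75 - (-1.75)       = 2.5
```

With a balanced two-period panel like this one, the same number is what a regression of the outcome on treated, post, and their interaction would report as the interaction coefficient.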

References

Card, D., & Krueger, A. B. (1994). Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania. American Economic Review, 84(4), 772-793.

Examples

>>> from diff_diff.datasets import load_card_krueger
>>> from diff_diff import DifferenceInDifferences
>>>
>>> # Load and prepare data
>>> ck = load_card_krueger()
>>> ck_long = ck.melt(
...     id_vars=['store_id', 'state', 'treated'],
...     value_vars=['emp_pre', 'emp_post'],
...     var_name='period', value_name='employment'
... )
>>> ck_long['post'] = (ck_long['period'] == 'emp_post').astype(int)
>>>
>>> # Estimate DiD
>>> did = DifferenceInDifferences()
>>> results = did.fit(ck_long, outcome='employment', treatment='treated', time='post')

Example#

from diff_diff.datasets import load_card_krueger
from diff_diff import DifferenceInDifferences

ck = load_card_krueger()

# Reshape to long format for DiD estimation
ck_long = ck.melt(
    id_vars=['store_id', 'state', 'treated'],
    value_vars=['emp_pre', 'emp_post'],
    var_name='period', value_name='employment'
)
ck_long['post'] = (ck_long['period'] == 'emp_post').astype(int)

did = DifferenceInDifferences()
results = did.fit(ck_long, outcome='employment', treatment='treated', time='post')

load_castle_doctrine#

Castle doctrine (Stand Your Ground) gun law study. Staggered adoption of self-defense law expansions across U.S. states (2000–2010), suitable for Callaway–Sant’Anna or Sun–Abraham estimation.

diff_diff.load_castle_doctrine(force_download=False)[source]#

Load Castle Doctrine / Stand Your Ground laws dataset.

This dataset tracks the staggered adoption of Castle Doctrine (Stand Your Ground) laws across U.S. states, which expanded self-defense rights. It is commonly used to demonstrate estimators designed for staggered treatment timing, such as Callaway-Sant’Anna or Sun-Abraham.

Parameters:

force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

Panel dataset with columns:

- state (str) – State abbreviation
- year (int) – Year (2000-2010)
- first_treat (int) – Year of law adoption (0 = never adopted)
- homicide_rate (float) – Homicides per 100,000 population
- population (int) – State population
- income (float) – Per capita income
- treated (int) – 1 if law in effect, 0 otherwise
- cohort (int) – Alias for first_treat

Return type:

pd.DataFrame

Notes

Castle Doctrine laws remove the duty to retreat before using deadly force in self-defense. States adopted these laws at different times between 2005 and 2009, creating a staggered treatment design.
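The relationship between year, first_treat, and treated can be sketched as follows. This assumes that "law in effect" means the year is at or after the adoption year, which the column descriptions suggest but do not state outright; `law_in_effect` is a hypothetical helper and the state codes are placeholders.

```python
def law_in_effect(year: int, first_treat: int) -> int:
    """Assumed rule: treated once year >= first_treat; 0 marks never-adopters."""
    return int(first_treat > 0 and year >= first_treat)


# Placeholder rows; 'AA' adopts in 2006 and 'BB' never adopts.
rows = [
    {"state": "AA", "year": 2004, "first_treat": 2006},
    {"state": "AA", "year": 2006, "first_treat": 2006},
    {"state": "AA", "year": 2010, "first_treat": 2006},
    {"state": "BB", "year": 2010, "first_treat": 0},
]

treated = [law_in_effect(r["year"], r["first_treat"]) for r in rows]
```

The explicit `first_treat > 0` check matters: without it, the sentinel value 0 would make every never-adopting state look treated in all years.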

References

Cheng, C., & Hoekstra, M. (2013). Does Strengthening Self-Defense Law Deter Crime or Escalate Violence? Evidence from Expansions to Castle Doctrine. Journal of Human Resources, 48(3), 821-854.

Examples

>>> from diff_diff.datasets import load_castle_doctrine
>>> from diff_diff import CallawaySantAnna
>>>
>>> castle = load_castle_doctrine()
>>> cs = CallawaySantAnna(control_group="never_treated")
>>> results = cs.fit(
...     castle,
...     outcome="homicide_rate",
...     unit="state",
...     time="year",
...     first_treat="first_treat"
... )

Example#

from diff_diff.datasets import load_castle_doctrine
from diff_diff import CallawaySantAnna

castle = load_castle_doctrine()
cs = CallawaySantAnna(control_group="never_treated")
results = cs.fit(
    castle,
    outcome="homicide_rate",
    unit="state",
    time="year",
    first_treat="first_treat"
)

load_divorce_laws#

Unilateral (no-fault) divorce law reforms. Staggered adoption across U.S. states (1968–1988) from Stevenson & Wolfers (2006), with outcomes for divorce rate, female labor force participation, and female suicide rate.

diff_diff.load_divorce_laws(force_download=False)[source]#

Load unilateral divorce laws dataset.

This dataset tracks the staggered adoption of unilateral (no-fault) divorce laws across U.S. states. It’s a classic example for studying staggered DiD methods and was used in Stevenson & Wolfers (2006).

Parameters:

force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

Panel dataset with columns:

- state (str) – State abbreviation
- year (int) – Year
- first_treat (int) – Year unilateral divorce became available (0 = never)
- divorce_rate (float) – Divorces per 1,000 population
- female_lfp (float) – Female labor force participation rate
- suicide_rate (float) – Female suicide rate
- treated (int) – 1 if law in effect, 0 otherwise
- cohort (int) – Alias for first_treat

Return type:

pd.DataFrame

Notes

Unilateral divorce laws allow one spouse to obtain a divorce without the other’s consent. States adopted these laws at different times, primarily between 1969 and 1985.
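Because adoption years differ across states, it is often convenient to index observations by time relative to adoption. The sketch below computes that event time from year and first_treat. It is an illustration only: `event_time` is a hypothetical helper, not part of the package, and the example years are placeholders.

```python
from typing import Optional


def event_time(year: int, first_treat: int) -> Optional[int]:
    """Years since adoption; None for never-treated units (first_treat == 0)."""
    if first_treat == 0:
        return None  # no adoption year, so relative time is undefined
    return year - first_treat


before = event_time(1968, 1970)  # two years before adoption -> -2
after = event_time(1975, 1970)   # five years after adoption -> 5
never = event_time(1975, 0)      # never-treated state       -> None
```

Returning None for never-treated states keeps them from being mistaken for a very late event time, which is what subtracting the sentinel 0 would produce.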

References

Stevenson, B., & Wolfers, J. (2006). Bargaining in the Shadow of the Law: Divorce Laws and Family Distress. Quarterly Journal of Economics, 121(1), 267-288.

Wolfers, J. (2006). Did Unilateral Divorce Laws Raise Divorce Rates? A Reconciliation and New Results. American Economic Review, 96(5), 1802-1820.

Examples

>>> from diff_diff.datasets import load_divorce_laws
>>> from diff_diff import CallawaySantAnna
>>>
>>> divorce = load_divorce_laws()
>>> cs = CallawaySantAnna(control_group="never_treated")
>>> results = cs.fit(
...     divorce,
...     outcome="divorce_rate",
...     unit="state",
...     time="year",
...     first_treat="first_treat"
... )

Example#

from diff_diff.datasets import load_divorce_laws
from diff_diff import CallawaySantAnna

divorce = load_divorce_laws()
cs = CallawaySantAnna(control_group="never_treated")
results = cs.fit(
    divorce,
    outcome="divorce_rate",
    unit="state",
    time="year",
    first_treat="first_treat"
)

load_mpdta#

Minimum wage panel data for training (Callaway & Sant’Anna 2021). Simulated county-level employment data with staggered minimum wage increases (2003–2007), from the R did package.

diff_diff.load_mpdta(force_download=False)[source]#

Load the Minimum Wage Panel Dataset for DiD Analysis (mpdta).

This is a simulated dataset from the R did package that mimics county-level employment data under staggered minimum wage increases. It’s designed specifically for teaching the Callaway-Sant’Anna estimator.

Parameters:

force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

Panel dataset with columns:

- countyreal (int) – County identifier
- year (int) – Year (2003-2007)
- lpop (float) – Log population
- lemp (float) – Log employment (outcome)
- first_treat (int) – Year of minimum wage increase (0 = never)
- treat (int) – 1 if ever treated, 0 otherwise

Return type:

pd.DataFrame

Notes

This dataset is included in the R did package and is commonly used in tutorials demonstrating the Callaway-Sant’Anna estimator.
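Since the outcome lemp is log employment, effect estimates on it are in log points rather than percent. The conversion is a one-line calculation, sketched here with a hypothetical effect of -0.04 that is not an estimate from this dataset; `log_points_to_percent` is an illustrative helper.

```python
import math


def log_points_to_percent(effect: float) -> float:
    """Convert an effect on a log outcome into a percent change in levels."""
    return (math.exp(effect) - 1.0) * 100.0


# Hypothetical effect size, chosen only to illustrate the conversion.
pct = log_points_to_percent(-0.04)  # roughly -3.92 percent
```

For small effects the log-point value times 100 is a close approximation, and the gap widens as the effect grows.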

References

Callaway, B., & Sant’Anna, P. H. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.

Examples

>>> from diff_diff.datasets import load_mpdta
>>> from diff_diff import CallawaySantAnna
>>>
>>> mpdta = load_mpdta()
>>> cs = CallawaySantAnna()
>>> results = cs.fit(
...     mpdta,
...     outcome="lemp",
...     unit="countyreal",
...     time="year",
...     first_treat="first_treat"
... )

Example#

from diff_diff.datasets import load_mpdta
from diff_diff import CallawaySantAnna

mpdta = load_mpdta()
cs = CallawaySantAnna()
results = cs.fit(
    mpdta,
    outcome="lemp",
    unit="countyreal",
    time="year",
    first_treat="first_treat"
)

load_prop99#

California Proposition 99 tobacco program study (Lee–Wooldridge cohort format of the Abadie–Diamond–Hainmueller 2010 data). Log per capita cigarette sales for 39 states (1970–2000) with a single treated unit (California, treated from 1989) – the canonical small-sample DiD and synthetic control setting.

diff_diff.load_prop99(force_download=False)[source]#

Load the California Proposition 99 smoking dataset (Lee-Wooldridge format).

This dataset tracks per capita cigarette sales across 39 U.S. states (California plus 38 never-treated donor states) from 1970 to 2000. California passed Proposition 99, a large tobacco tax and control program, effective in 1989. With a single treated unit, it is the canonical setting for small-sample DiD inference and synthetic control comparisons.

Parameters:

force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

Panel dataset with columns:

- state (str) – State name
- year (int) – Year (1970-2000)
- first_year (int) – Treatment start year (1989 for California, 0 = never)
- lcigsale (float) – Log per capita cigarette sales (packs)
- treated (int) – 1 if treatment in effect, 0 otherwise
- cohort (int) – Alias for first_year

Return type:

pd.DataFrame

Notes

This is the cohort-format version of the Abadie, Diamond & Hainmueller (2010) California tobacco data distributed (MIT license) with the authors’ Stata lwdid package by Hur, Lee and Wooldridge. The donor pool excludes states with their own tobacco programs, leaving exactly one treated state and 38 controls.

Downloads are verified against a pinned SHA-256 and validated against the source invariants (39 states, 1970-2000, single 1989 cohort). If the real data cannot be obtained, a synthetic fallback with the same schema is returned with a UserWarning. Check df.attrs["source"] to tell the two apart: "lwdid_ssc_ancillary" means real data and "synthetic_fallback" means synthetic data. Never use the fallback for replication.
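A replication script may want to fail fast when it receives the fallback. The sketch below shows one possible guard on the attrs["source"] value. `require_real_data` is a hypothetical helper, not part of the package, and the two `SimpleNamespace` objects are stand-ins for returned DataFrames so that the example runs without downloading anything.

```python
from types import SimpleNamespace

REAL_SOURCE = "lwdid_ssc_ancillary"  # value documented for the real data


def require_real_data(df) -> None:
    """Hypothetical guard: raise unless df.attrs['source'] marks real data."""
    source = getattr(df, "attrs", {}).get("source")
    if source != REAL_SOURCE:
        raise RuntimeError(f"expected real data, got source={source!r}")


# Stand-ins for DataFrames; only the .attrs mapping matters to the guard.
real = SimpleNamespace(attrs={"source": "lwdid_ssc_ancillary"})
synthetic = SimpleNamespace(attrs={"source": "synthetic_fallback"})

require_real_data(real)  # passes silently

try:
    require_real_data(synthetic)
    rejected = False
except RuntimeError:
    rejected = True  # the synthetic fallback is refused
```

The guard also rejects objects with no source entry at all, which is the conservative choice when provenance is unknown.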

References

Lee, S. J., & Wooldridge, J. M. (2026). Simple Approaches to Inference with Difference-in-Differences Estimators with Small Cross-Sectional Sample Sizes. SSRN Working Paper No. 5325686.

Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program. Journal of the American Statistical Association, 105(490), 493-505.

Examples

>>> from diff_diff.datasets import load_prop99
>>> from diff_diff import DifferenceInDifferences
>>>
>>> prop99 = load_prop99()
>>> prop99["treated_state"] = (prop99["first_year"] > 0).astype(int)
>>> prop99["post"] = (prop99["year"] >= 1989).astype(int)
>>>
>>> did = DifferenceInDifferences()
>>> results = did.fit(
...     prop99, outcome="lcigsale", treatment="treated_state", time="post"
... )

Example#

from diff_diff.datasets import load_prop99
from diff_diff import DifferenceInDifferences

prop99 = load_prop99()
prop99["treated_state"] = (prop99["first_year"] > 0).astype(int)
prop99["post"] = (prop99["year"] >= 1989).astype(int)

did = DifferenceInDifferences()
results = did.fit(
    prop99, outcome="lcigsale", treatment="treated_state", time="post"
)

load_walmart#

Walmart entry county panel (Lee & Wooldridge 2025 sample, derived from County Business Patterns data as constructed by Brown & Butts). Log retail and wholesale employment for 1,277 counties (1977–1999) with staggered first store openings (1986–1999) and 391 never-treated counties.

diff_diff.load_walmart(force_download=False)[source]#

Load the Walmart entry county panel (Lee-Wooldridge sample).

This dataset tracks log retail and wholesale employment for 1,277 U.S. counties from 1977 to 1999, with staggered first Walmart store openings between 1986 and 1999 and 391 counties never receiving a store. It is used to study the local labor-market effects of Walmart entry under staggered treatment adoption.

Parameters:

force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

Panel dataset with columns:

- cid (int) – County identifier
- year (int) – Year (1977-1999)
- first_year (int) – Year of first Walmart opening (0 = never)
- log_retail_emp (float) – Log county retail employment (outcome)
- log_wholesale_emp (float) – Log county wholesale employment
- x1 (float) – County poverty rate
- x2 (float) – Share with high-school education
- x3 (float) – Manufacturing employment share
- treated (int) – 1 if a Walmart has opened, 0 otherwise
- cohort (int) – Alias for first_year

Return type:

pd.DataFrame

Notes

The panel derives from County Business Patterns data as constructed by Brown & Butts, and is distributed (MIT license) with the authors’ Stata lwdid package by Hur, Lee and Wooldridge. The covariate labels follow the Lee & Wooldridge application.

Downloads are verified against a pinned SHA-256 and validated against the source invariants (1,277 counties, 1977-1999, cohorts 1986-1999, 391 never-treated). If the real data cannot be obtained, a synthetic fallback with the same schema (200 counties) is returned with a UserWarning. Check df.attrs["source"] to tell the two apart: "lwdid_ssc_ancillary" means real data and "synthetic_fallback" means synthetic data. Never use the fallback for replication.
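Because the fallback is signalled with a UserWarning, another option is to promote that warning to an exception using the standard warnings module. The sketch below assumes only that the warning category is UserWarning, as stated above. `fake_loader` and `load_strict` are hypothetical; the stand-in loader mimics the fallback path so that the example runs offline.

```python
import warnings


def fake_loader():
    """Hypothetical stand-in for a loader that falls back to synthetic data."""
    warnings.warn("real data unavailable; returning synthetic fallback", UserWarning)
    return {"source": "synthetic_fallback"}


def load_strict(loader):
    """Call loader() with UserWarning raised as an exception instead of printed."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", UserWarning)
        return loader()


try:
    load_strict(fake_loader)
    outcome = "loaded"
except UserWarning as exc:
    outcome = f"refused: {exc}"
```

Note that this filter turns every UserWarning raised inside the call into an error, including unrelated ones, so it is best kept scoped to the loader call as shown.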

References

Lee, S. J., & Wooldridge, J. M. (2025). A Simple Transformation Approach to Difference-in-Differences Estimation for Panel Data. SSRN Working Paper No. 4516518.

Brown, N., & Butts, K. (2025). Dynamic Treatment Effect Estimation with Interactive Fixed Effects and Short Panels. Journal of Econometrics.

Examples

>>> from diff_diff.datasets import load_walmart
>>> from diff_diff import CallawaySantAnna
>>>
>>> walmart = load_walmart()
>>> cs = CallawaySantAnna(control_group="never_treated")
>>> results = cs.fit(
...     walmart,
...     outcome="log_retail_emp",
...     unit="cid",
...     time="year",
...     first_treat="first_year",
... )

Example#

from diff_diff.datasets import load_walmart
from diff_diff import CallawaySantAnna

walmart = load_walmart()
cs = CallawaySantAnna(control_group="never_treated")
results = cs.fit(
    walmart,
    outcome="log_retail_emp",
    unit="cid",
    time="year",
    first_treat="first_year"
)

Utility Functions#

load_dataset#

Generic loader that fetches a dataset by name.

diff_diff.load_dataset(name, force_download=False)[source]#

Load a dataset by name.

Parameters:

name (str) – Name of the dataset. Use list_datasets() to see available datasets.
force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

The requested dataset.

Return type:

pd.DataFrame

Raises:

ValueError – If the dataset name is not recognized.

Examples

>>> from diff_diff.datasets import load_dataset, list_datasets
>>> print(list_datasets())
>>> df = load_dataset("card_krueger")
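The documented behaviour, a lookup by name that raises ValueError for unknown names, follows a common registry pattern. The sketch below illustrates that pattern only; `REGISTRY`, `_demo_loader`, and `load_by_name` are hypothetical and do not describe how the package is implemented.

```python
def _demo_loader():
    """Hypothetical loader returning a tiny in-memory table."""
    return [{"x": 1}]


# Hypothetical name -> loader mapping.
REGISTRY = {"demo": _demo_loader}


def load_by_name(name: str):
    """Dispatch to a registered loader, or raise ValueError for unknown names."""
    try:
        loader = REGISTRY[name]
    except KeyError:
        available = ", ".join(sorted(REGISTRY))
        raise ValueError(f"Unknown dataset {name!r}. Available: {available}") from None
    return loader()


rows = load_by_name("demo")

try:
    load_by_name("not_a_dataset")
    message = ""
except ValueError as exc:
    message = str(exc)
```

Listing the valid names in the error message saves the caller a separate lookup when a name is mistyped.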

list_datasets#

List all available datasets with descriptions.

diff_diff.list_datasets()[source]#

List available real-world datasets.

Returns:

Dictionary mapping dataset names to descriptions.

Return type:

dict

Examples

>>> from diff_diff.datasets import list_datasets
>>> for name, desc in list_datasets().items():
...     print(f"{name}: {desc}")

clear_cache#

Remove all cached dataset files from ~/.cache/diff_diff/datasets/.

diff_diff.clear_cache()[source]#

Clear the local dataset cache.

Return type:

None

Listing and Loading Datasets#

from diff_diff.datasets import list_datasets, load_dataset

# See what's available
for name, description in list_datasets().items():
    print(f"{name}: {description}")

# Load by name
df = load_dataset("card_krueger")
print(df.shape)
print(df.columns.tolist())