Datasets#

Built-in real-world datasets from published studies for examples, tutorials, and testing.

All datasets are downloaded from public sources on first use and cached locally at ~/.cache/diff_diff/datasets/. Pass force_download=True to any loader to refresh the cache. If the download fails and a cached copy exists, the cached version is used automatically.
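
The cache-then-fallback behaviour described above can be pictured with a small, dependency-free sketch. The names `load_with_cache` and `fetch`, and the `.csv` file naming, are hypothetical and not part of diff_diff; the library's actual implementation may differ.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_with_cache(
    name: str,
    fetch: Callable[[], bytes],
    cache_dir: Path,
    force_download: bool = False,
) -> bytes:
    """Return dataset bytes, preferring the cache unless force_download is set.

    Hypothetical helper: it only illustrates the documented behaviour.
    """
    cache_file = cache_dir / f"{name}.csv"
    if cache_file.exists() and not force_download:
        return cache_file.read_bytes()  # cache hit, no network access
    try:
        data = fetch()  # stand-in for the real download
    except OSError:
        if cache_file.exists():
            return cache_file.read_bytes()  # download failed: reuse cached copy
        raise  # nothing cached, so the error propagates
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(data)
    return data


def failing_fetch() -> bytes:
    raise OSError("network unavailable")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "datasets"
    first = load_with_cache("demo", lambda: b"a,b\n1,2\n", cache)
    # A forced refresh that fails still returns the cached copy.
    fallback = load_with_cache("demo", failing_fetch, cache, force_download=True)
```

Passing the download step in as a callable keeps the sketch testable offline; a real loader would presumably wrap an HTTP request at that point.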

Dataset Loaders#

load_card_krueger#

Card & Krueger (1994) minimum wage study. Classic 2x2 DiD comparing fast-food employment in New Jersey (treated) and Pennsylvania (control) around NJ’s 1992 minimum wage increase.

diff_diff.load_card_krueger(force_download=False)

Load the Card & Krueger (1994) minimum wage dataset.

This classic dataset examines the effect of New Jersey’s 1992 minimum wage increase on employment in fast-food restaurants, using Pennsylvania as a control group.

The study is a canonical example of the Difference-in-Differences method.

Parameters:

force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

Dataset with columns:
  • store_id (int) – Unique store identifier
  • state (str) – ‘NJ’ (New Jersey, treated) or ‘PA’ (Pennsylvania, control)
  • chain (str) – Fast food chain (‘bk’, ‘kfc’, ‘roys’, ‘wendys’)
  • emp_pre (float) – Full-time equivalent employment before (Feb 1992)
  • emp_post (float) – Full-time equivalent employment after (Nov 1992)
  • wage_pre (float) – Starting wage before
  • wage_post (float) – Starting wage after
  • treated (int) – 1 if NJ, 0 if PA
  • emp_change (float) – Change in employment (emp_post - emp_pre)

Return type:

pd.DataFrame

Notes

The minimum wage in New Jersey increased from $4.25 to $5.05 on April 1, 1992. Pennsylvania’s minimum wage remained at $4.25.

Original finding: no significant negative effect of the minimum wage increase on employment (ATT ≈ +2.8 FTE employees).
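
The sign and size of that estimate follow from the 2x2 DiD arithmetic. The sketch below uses rounded group means of FTE employment as commonly quoted from Table 3 of the paper; treat them as illustrative inputs rather than values computed from this loader's output.

```python
# Rounded mean FTE employment per store (assumed inputs, see the note above).
means = {
    ("NJ", "pre"): 20.44,
    ("NJ", "post"): 21.03,
    ("PA", "pre"): 23.33,
    ("PA", "post"): 21.17,
}

change_treated = means[("NJ", "post")] - means[("NJ", "pre")]
change_control = means[("PA", "post")] - means[("PA", "pre")]

# DiD: change in the treated group minus change in the control group.
did_estimate = change_treated - change_control

print(f"NJ change: {change_treated:+.2f}")
print(f"PA change: {change_control:+.2f}")
print(f"DiD estimate: {did_estimate:+.2f}")
```

Employment rose slightly in New Jersey while it fell in Pennsylvania, so the difference of the two changes is positive even though the treated-group change alone is small.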

References

Card, D., & Krueger, A. B. (1994). Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania. American Economic Review, 84(4), 772-793.

Examples

>>> from diff_diff.datasets import load_card_krueger
>>> from diff_diff import DifferenceInDifferences
>>>
>>> # Load and prepare data
>>> ck = load_card_krueger()
>>> ck_long = ck.melt(
...     id_vars=['store_id', 'state', 'treated'],
...     value_vars=['emp_pre', 'emp_post'],
...     var_name='period', value_name='employment'
... )
>>> ck_long['post'] = (ck_long['period'] == 'emp_post').astype(int)
>>>
>>> # Estimate DiD
>>> did = DifferenceInDifferences()
>>> results = did.fit(ck_long, outcome='employment', treatment='treated', time='post')

Example#

from diff_diff.datasets import load_card_krueger
from diff_diff import DifferenceInDifferences

ck = load_card_krueger()

# Reshape to long format for DiD estimation
ck_long = ck.melt(
    id_vars=['store_id', 'state', 'treated'],
    value_vars=['emp_pre', 'emp_post'],
    var_name='period', value_name='employment'
)
ck_long['post'] = (ck_long['period'] == 'emp_post').astype(int)

did = DifferenceInDifferences()
results = did.fit(ck_long, outcome='employment', treatment='treated', time='post')

load_castle_doctrine#

Castle doctrine (Stand Your Ground) gun law study. Staggered adoption of self-defense law expansions across U.S. states (2000–2010), suitable for Callaway–Sant’Anna or Sun–Abraham estimation.

diff_diff.load_castle_doctrine(force_download=False)

Load Castle Doctrine / Stand Your Ground laws dataset.

This dataset tracks the staggered adoption of Castle Doctrine (Stand Your Ground) laws across U.S. states, which expanded self-defense rights. It’s commonly used to demonstrate estimators for staggered treatment timing, such as Callaway-Sant’Anna or Sun-Abraham.

Parameters:

force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

Panel dataset with columns:
  • state (str) – State abbreviation
  • year (int) – Year (2000-2010)
  • first_treat (int) – Year of law adoption (0 = never adopted)
  • homicide_rate (float) – Homicides per 100,000 population
  • population (int) – State population
  • income (float) – Per capita income
  • treated (int) – 1 if law in effect, 0 otherwise
  • cohort (int) – Alias for first_treat

Return type:

pd.DataFrame

Notes

Castle Doctrine laws remove the duty to retreat before using deadly force in self-defense. States adopted these laws at different times between 2005 and 2009, creating a staggered treatment design.
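
To make the column semantics concrete, the following sketch derives a `treated` flag and an event-time value from `year` and `first_treat` on a few made-up rows. It assumes that `treated` switches on in the adoption year itself, which the column description does not state explicitly; the state labels and years are illustrative only.

```python
from typing import Optional

# Made-up rows: (state, year, first_treat); first_treat == 0 means never adopted.
rows = [
    ("AA", 2004, 2005),
    ("AA", 2005, 2005),
    ("AA", 2006, 2005),
    ("BB", 2006, 0),
]


def is_treated(year: int, first_treat: int) -> int:
    """1 once the law is in effect (assumed: from the adoption year onward)."""
    return int(first_treat > 0 and year >= first_treat)


def event_time(year: int, first_treat: int) -> Optional[int]:
    """Years since adoption; None for never-adopting states."""
    return year - first_treat if first_treat > 0 else None


derived = [
    (state, year, is_treated(year, ft), event_time(year, ft))
    for state, year, ft in rows
]
for row in derived:
    print(row)
```

Checking `first_treat > 0` first matters: without it, the 0 sentinel would make every never-adopting state look treated in every year.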

References

Cheng, C., & Hoekstra, M. (2013). Does Strengthening Self-Defense Law Deter Crime or Escalate Violence? Evidence from Expansions to Castle Doctrine. Journal of Human Resources, 48(3), 821-854.

Examples

>>> from diff_diff.datasets import load_castle_doctrine
>>> from diff_diff import CallawaySantAnna
>>>
>>> castle = load_castle_doctrine()
>>> cs = CallawaySantAnna(control_group="never_treated")
>>> results = cs.fit(
...     castle,
...     outcome="homicide_rate",
...     unit="state",
...     time="year",
...     first_treat="first_treat"
... )

Example#

from diff_diff.datasets import load_castle_doctrine
from diff_diff import CallawaySantAnna

castle = load_castle_doctrine()
cs = CallawaySantAnna(control_group="never_treated")
results = cs.fit(
    castle,
    outcome="homicide_rate",
    unit="state",
    time="year",
    first_treat="first_treat"
)

load_divorce_laws#

Unilateral (no-fault) divorce law reforms. Staggered adoption across U.S. states (1968–1988) from Stevenson & Wolfers (2006), with outcomes for divorce rate, female labor force participation, and female suicide rate.

diff_diff.load_divorce_laws(force_download=False)

Load unilateral divorce laws dataset.

This dataset tracks the staggered adoption of unilateral (no-fault) divorce laws across U.S. states. It’s a classic example for studying staggered DiD methods and was used in Stevenson & Wolfers (2006).

Parameters:

force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

Panel dataset with columns:
  • state (str) – State abbreviation
  • year (int) – Year
  • first_treat (int) – Year unilateral divorce became available (0 = never)
  • divorce_rate (float) – Divorces per 1,000 population
  • female_lfp (float) – Female labor force participation rate
  • suicide_rate (float) – Female suicide rate
  • treated (int) – 1 if law in effect, 0 otherwise
  • cohort (int) – Alias for first_treat

Return type:

pd.DataFrame

Notes

Unilateral divorce laws allow one spouse to obtain a divorce without the other’s consent. States adopted these laws at different times, primarily between 1969 and 1985.
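
Because `first_treat` uses 0 for never-adopting states, a quick cohort tally is a useful sanity check before choosing `control_group="never_treated"`. The sketch below runs on made-up state labels and adoption years, not on the real data.

```python
from collections import Counter

# Made-up unit-level adoption years; 0 marks a never-treated state.
first_treat_by_state = {"AA": 1970, "BB": 1970, "CC": 1973, "DD": 0, "EE": 0}

cohort_sizes = Counter(first_treat_by_state.values())
never_treated = cohort_sizes.pop(0, 0)  # pull out the sentinel cohort

print("never-treated states:", never_treated)
for cohort, size in sorted(cohort_sizes.items()):
    print(f"cohort {cohort}: {size} states")

if never_treated == 0:
    raise ValueError("no never-treated units available as controls")
```

With the loaded DataFrame, a comparable mapping could be built with `divorce.groupby("state")["first_treat"].first().to_dict()`.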

References

Stevenson, B., & Wolfers, J. (2006). Bargaining in the Shadow of the Law: Divorce Laws and Family Distress. Quarterly Journal of Economics, 121(1), 267-288.

Wolfers, J. (2006). Did Unilateral Divorce Laws Raise Divorce Rates? A Reconciliation and New Results. American Economic Review, 96(5), 1802-1820.

Examples

>>> from diff_diff.datasets import load_divorce_laws
>>> from diff_diff import CallawaySantAnna
>>>
>>> divorce = load_divorce_laws()
>>> cs = CallawaySantAnna(control_group="never_treated")
>>> results = cs.fit(
...     divorce,
...     outcome="divorce_rate",
...     unit="state",
...     time="year",
...     first_treat="first_treat"
... )

Example#

from diff_diff.datasets import load_divorce_laws
from diff_diff import CallawaySantAnna

divorce = load_divorce_laws()
cs = CallawaySantAnna(control_group="never_treated")
results = cs.fit(
    divorce,
    outcome="divorce_rate",
    unit="state",
    time="year",
    first_treat="first_treat"
)

load_mpdta#

Minimum wage panel data for training (Callaway & Sant’Anna 2021). Simulated county-level employment data with staggered minimum wage increases (2003–2007), from the R did package.

diff_diff.load_mpdta(force_download=False)

Load the Minimum Wage Panel Dataset for DiD Analysis (mpdta).

This is a simulated dataset from the R did package that mimics county-level employment data under staggered minimum wage increases. It’s designed specifically for teaching the Callaway-Sant’Anna estimator.

Parameters:

force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

Panel dataset with columns:
  • countyreal (int) – County identifier
  • year (int) – Year (2003-2007)
  • lpop (float) – Log population
  • lemp (float) – Log employment (outcome)
  • first_treat (int) – Year of minimum wage increase (0 = never)
  • treat (int) – 1 if ever treated, 0 otherwise

Return type:

pd.DataFrame

Notes

This dataset is included in the R did package and is commonly used in tutorials demonstrating the Callaway-Sant’Anna estimator.
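
The Callaway-Sant’Anna estimator works on group-time cells: one effect for each adoption cohort g and calendar year t. As a rough sketch, the snippet below enumerates the post-treatment cells for a 2003-2007 panel. The cohort years are assumed for illustration and should be checked against `first_treat` in the loaded data.

```python
years = range(2003, 2008)     # 2003-2007, as documented for this dataset
cohorts = [2004, 2006, 2007]  # assumed adoption years, for illustration only

# A cell (g, t) is post-treatment when cohort g has already adopted by year t.
post_cells = [(g, t) for g in cohorts for t in years if t >= g]

for g, t in post_cells:
    print(f"cohort {g}, year {t}, event time {t - g}")
```

Earlier cohorts contribute more post-treatment cells, which is one reason aggregated effects at long event times rest on fewer cohorts.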

References

Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.

Examples

>>> from diff_diff.datasets import load_mpdta
>>> from diff_diff import CallawaySantAnna
>>>
>>> mpdta = load_mpdta()
>>> cs = CallawaySantAnna()
>>> results = cs.fit(
...     mpdta,
...     outcome="lemp",
...     unit="countyreal",
...     time="year",
...     first_treat="first_treat"
... )

Example#

from diff_diff.datasets import load_mpdta
from diff_diff import CallawaySantAnna

mpdta = load_mpdta()
cs = CallawaySantAnna()
results = cs.fit(
    mpdta,
    outcome="lemp",
    unit="countyreal",
    time="year",
    first_treat="first_treat"
)

Utility Functions#

load_dataset#

Generic loader that fetches a dataset by name.

diff_diff.load_dataset(name, force_download=False)

Load a dataset by name.

Parameters:
  • name (str) – Name of the dataset. Use list_datasets() to see available datasets.

  • force_download (bool, default=False) – If True, re-download the dataset even if cached.

Returns:

The requested dataset.

Return type:

pd.DataFrame

Raises:

ValueError – If the dataset name is not recognized.
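
One plausible way to implement name-based loading is a registry that maps names to loader callables and raises `ValueError` for unknown names. The sketch below is hypothetical: `_REGISTRY`, `load_by_name`, and the stub loaders are illustrative stand-ins, not diff_diff internals.

```python
from typing import Callable, Dict, List

# Stub loaders standing in for real ones; each returns a list of column names.
_REGISTRY: Dict[str, Callable[[], List[str]]] = {
    "card_krueger": lambda: ["store_id", "state", "treated"],
    "mpdta": lambda: ["countyreal", "year", "lemp"],
}


def load_by_name(name: str) -> List[str]:
    """Dispatch to the registered loader or fail with a helpful message."""
    try:
        loader = _REGISTRY[name]
    except KeyError:
        available = ", ".join(sorted(_REGISTRY))
        raise ValueError(
            f"Unknown dataset {name!r}. Available: {available}"
        ) from None
    return loader()


columns = load_by_name("card_krueger")

try:
    load_by_name("no_such_dataset")
except ValueError as exc:
    message = str(exc)
    print(message)
```

Listing the valid names in the error message saves the caller a separate lookup when a name is mistyped.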

Examples

>>> from diff_diff.datasets import load_dataset, list_datasets
>>> print(list_datasets())
>>> df = load_dataset("card_krueger")

list_datasets#

List all available datasets with descriptions.

diff_diff.list_datasets()

List available real-world datasets.

Returns:

Dictionary mapping dataset names to descriptions.

Return type:

dict

Examples

>>> from diff_diff.datasets import list_datasets
>>> for name, desc in list_datasets().items():
...     print(f"{name}: {desc}")

clear_cache#

Remove all cached dataset files from ~/.cache/diff_diff/datasets/.

diff_diff.clear_cache()

Clear the local dataset cache.

Return type:

None
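
For intuition, clearing a file cache amounts to deleting the files in the cache directory. The sketch below uses a hypothetical helper, `clear_directory`, and operates on a temporary directory so that it never touches the real cache; it is not the library's implementation.

```python
import tempfile
from pathlib import Path


def clear_directory(cache_dir: Path) -> int:
    """Delete every regular file directly inside cache_dir; return the count."""
    removed = 0
    if cache_dir.is_dir():
        for path in cache_dir.iterdir():
            if path.is_file():
                path.unlink()
                removed += 1
    return removed


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    (cache / "card_krueger.csv").write_text("store_id\n1\n")
    (cache / "mpdta.csv").write_text("countyreal\n1\n")
    removed = clear_directory(cache)
    remaining = list(cache.iterdir())
```

After a clear, the next loader call would have to download the data again, since no cached copy remains to fall back on.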

Listing and Loading Datasets#

from diff_diff.datasets import list_datasets, load_dataset

# See what's available
for name, description in list_datasets().items():
    print(f"{name}: {description}")

# Load by name
df = load_dataset("card_krueger")
print(df.shape)
print(df.columns.tolist())