Datasets#
Built-in real-world datasets from published studies for examples, tutorials, and testing.
All datasets are downloaded from public sources on first use and cached locally
at ~/.cache/diff_diff/datasets/. Pass force_download=True to any loader
to refresh the cache. If the download fails and a cached copy exists, the cached
version is used automatically.
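The caching behaviour described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the library's implementation: `load_with_cache`, the `fetch` callable, and the `.csv` file naming are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def load_with_cache(name, fetch, cache_dir, force_download=False):
    """Return dataset text, preferring the cache unless a refresh is forced.

    `fetch` is a hypothetical callable that downloads and returns the
    dataset as a string; it stands in for the real network request.
    """
    cache_file = Path(cache_dir) / f"{name}.csv"

    # Serve from the cache when a copy exists and no refresh was requested.
    if cache_file.exists() and not force_download:
        return cache_file.read_text()

    try:
        text = fetch(name)
    except OSError:
        # Download failed: fall back to the cached copy if there is one.
        if cache_file.exists():
            return cache_file.read_text()
        raise

    # Store the fresh copy for later calls.
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text)
    return text


def demo_cache():
    """Exercise the three paths: download, cache hit, failed refresh."""
    calls = []

    def fake_fetch(name):
        calls.append(name)
        return "store_id,state\n1,NJ\n"

    def failing_fetch(name):
        raise OSError("network unavailable")

    with tempfile.TemporaryDirectory() as tmp:
        first = load_with_cache("card_krueger", fake_fetch, tmp)   # downloads
        second = load_with_cache("card_krueger", fake_fetch, tmp)  # cache hit
        # The forced refresh fails, so the cached copy is returned instead.
        third = load_with_cache(
            "card_krueger", failing_fetch, tmp, force_download=True
        )
    return first, second, third, len(calls)
```

In this sketch a failed download with no cached copy re-raises the error, which is one reasonable choice; the page does not say what the real loaders do in that case.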
Dataset Loaders#
load_card_krueger#
Card & Krueger (1994) minimum wage study. Classic 2x2 DiD comparing fast-food employment in New Jersey (treated) and Pennsylvania (control) around NJ’s 1992 minimum wage increase.
- diff_diff.load_card_krueger(force_download=False)
Load the Card & Krueger (1994) minimum wage dataset.
This classic dataset examines the effect of New Jersey’s 1992 minimum wage increase on employment in fast-food restaurants, using Pennsylvania as a control group.
The study is a canonical example of the Difference-in-Differences method.
- Parameters:
force_download (bool, default=False) – If True, re-download the dataset even if cached.
- Returns:
Dataset with columns:
  - store_id : int - Unique store identifier
  - state : str - 'NJ' (New Jersey, treated) or 'PA' (Pennsylvania, control)
  - chain : str - Fast food chain ('bk', 'kfc', 'roys', 'wendys')
  - emp_pre : float - Full-time equivalent employment before (Feb 1992)
  - emp_post : float - Full-time equivalent employment after (Nov 1992)
  - wage_pre : float - Starting wage before
  - wage_post : float - Starting wage after
  - treated : int - 1 if NJ, 0 if PA
  - emp_change : float - Change in employment (emp_post - emp_pre)
- Return type:
pd.DataFrame
Notes
The minimum wage in New Jersey increased from $4.25 to $5.05 on April 1, 1992. Pennsylvania’s minimum wage remained at $4.25.
Original finding: No significant negative effect of minimum wage increase on employment (ATT ≈ +2.8 FTE employees).
References
Card, D., & Krueger, A. B. (1994). Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania. American Economic Review, 84(4), 772-793.
Examples
>>> from diff_diff.datasets import load_card_krueger
>>> from diff_diff import DifferenceInDifferences
>>>
>>> # Load and prepare data
>>> ck = load_card_krueger()
>>> ck_long = ck.melt(
...     id_vars=['store_id', 'state', 'treated'],
...     value_vars=['emp_pre', 'emp_post'],
...     var_name='period', value_name='employment'
... )
>>> ck_long['post'] = (ck_long['period'] == 'emp_post').astype(int)
>>>
>>> # Estimate DiD
>>> did = DifferenceInDifferences()
>>> results = did.fit(ck_long, outcome='employment', treatment='treated', time='post')
Example#
from diff_diff.datasets import load_card_krueger
from diff_diff import DifferenceInDifferences

ck = load_card_krueger()

# Reshape to long format for DiD estimation
ck_long = ck.melt(
    id_vars=['store_id', 'state', 'treated'],
    value_vars=['emp_pre', 'emp_post'],
    var_name='period', value_name='employment'
)
ck_long['post'] = (ck_long['period'] == 'emp_post').astype(int)

# Estimate the 2x2 DiD
did = DifferenceInDifferences()
results = did.fit(ck_long, outcome='employment', treatment='treated', time='post')
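For a balanced 2x2 design with no covariates, the DiD point estimate reduces to the mean outcome change among treated units minus the mean change among controls, which is what the `emp_change` column makes easy to compute by hand. The sketch below shows that arithmetic with the standard library only. The helper `did_2x2` and the toy rows are illustrative assumptions, not part of diff_diff and not values from the dataset.

```python
from statistics import mean


def did_2x2(rows):
    """Difference in mean outcome change: treated minus control."""
    treated = [r["emp_change"] for r in rows if r["treated"] == 1]
    control = [r["emp_change"] for r in rows if r["treated"] == 0]
    return mean(treated) - mean(control)


# Toy values for illustration only; these are not rows from the dataset.
toy_rows = [
    {"store_id": 1, "treated": 1, "emp_change": 1.0},
    {"store_id": 2, "treated": 1, "emp_change": 0.0},
    {"store_id": 3, "treated": 0, "emp_change": -2.0},
    {"store_id": 4, "treated": 0, "emp_change": -1.0},
]

estimate = did_2x2(toy_rows)
print(estimate)  # 0.5 - (-1.5) = 2.0
```

A hand calculation like this gives only the point estimate; standard errors and the handling of stores with missing employment in one period are left to the estimator.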
load_castle_doctrine#
Castle doctrine (Stand Your Ground) gun law study. Staggered adoption of self-defense law expansions across U.S. states (2000–2010), suitable for Callaway–Sant’Anna or Sun–Abraham estimation.
- diff_diff.load_castle_doctrine(force_download=False)
Load Castle Doctrine / Stand Your Ground laws dataset.
This dataset tracks the staggered adoption of Castle Doctrine (Stand Your Ground) laws across U.S. states, which expanded self-defense rights. It’s commonly used to demonstrate heterogeneous treatment timing methods like Callaway-Sant’Anna or Sun-Abraham.
- Parameters:
force_download (bool, default=False) – If True, re-download the dataset even if cached.
- Returns:
Panel dataset with columns:
  - state : str - State abbreviation
  - year : int - Year (2000-2010)
  - first_treat : int - Year of law adoption (0 = never adopted)
  - homicide_rate : float - Homicides per 100,000 population
  - population : int - State population
  - income : float - Per capita income
  - treated : int - 1 if law in effect, 0 otherwise
  - cohort : int - Alias for first_treat
- Return type:
pd.DataFrame
Notes
Castle Doctrine laws remove the duty to retreat before using deadly force in self-defense. States adopted these laws at different times between 2005 and 2009, creating a staggered treatment design.
References
Cheng, C., & Hoekstra, M. (2013). Does Strengthening Self-Defense Law Deter Crime or Escalate Violence? Evidence from Expansions to Castle Doctrine. Journal of Human Resources, 48(3), 821-854.
Examples
>>> from diff_diff.datasets import load_castle_doctrine
>>> from diff_diff import CallawaySantAnna
>>>
>>> castle = load_castle_doctrine()
>>> cs = CallawaySantAnna(control_group="never_treated")
>>> results = cs.fit(
...     castle,
...     outcome="homicide_rate",
...     unit="state",
...     time="year",
...     first_treat="first_treat"
... )
Example#
from diff_diff.datasets import load_castle_doctrine
from diff_diff import CallawaySantAnna

castle = load_castle_doctrine()

# Use states that never adopted the law as the comparison group
cs = CallawaySantAnna(control_group="never_treated")
results = cs.fit(
    castle,
    outcome="homicide_rate",
    unit="state",
    time="year",
    first_treat="first_treat"
)
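Because never-adopting states are coded `first_treat = 0`, a naive `year >= first_treat` comparison would mark them as treated in every year. The sketch below shows one way the time-varying `treated` indicator relates to `year` and `first_treat`. It assumes the adoption year itself counts as treated, which the column description does not state explicitly, and `law_in_effect` is a hypothetical helper rather than a diff_diff function.

```python
def law_in_effect(year, first_treat):
    """Return 1 if the law is in effect in `year`, else 0.

    Assumes first_treat == 0 marks never-adopting states and that the
    adoption year itself counts as treated.
    """
    return int(first_treat != 0 and year >= first_treat)


# Hypothetical state-years, not taken from the dataset.
print(law_in_effect(2004, 2006))  # 0: before adoption
print(law_in_effect(2006, 2006))  # 1: adoption year
print(law_in_effect(2010, 0))     # 0: never adopted
```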
load_divorce_laws#
Unilateral (no-fault) divorce law reforms. Staggered adoption across U.S. states (1968–1988) from Stevenson & Wolfers (2006), with outcomes for divorce rate, female labor force participation, and female suicide rate.
- diff_diff.load_divorce_laws(force_download=False)
Load unilateral divorce laws dataset.
This dataset tracks the staggered adoption of unilateral (no-fault) divorce laws across U.S. states. It’s a classic example for studying staggered DiD methods and was used in Stevenson & Wolfers (2006).
- Parameters:
force_download (bool, default=False) – If True, re-download the dataset even if cached.
- Returns:
Panel dataset with columns:
  - state : str - State abbreviation
  - year : int - Year
  - first_treat : int - Year unilateral divorce became available (0 = never)
  - divorce_rate : float - Divorces per 1,000 population
  - female_lfp : float - Female labor force participation rate
  - suicide_rate : float - Female suicide rate
  - treated : int - 1 if law in effect, 0 otherwise
  - cohort : int - Alias for first_treat
- Return type:
pd.DataFrame
Notes
Unilateral divorce laws allow one spouse to obtain a divorce without the other’s consent. States adopted these laws at different times, primarily between 1969 and 1985.
References
Stevenson, B., & Wolfers, J. (2006). Bargaining in the Shadow of the Law: Divorce Laws and Family Distress. Quarterly Journal of Economics, 121(1), 267-288.
Wolfers, J. (2006). Did Unilateral Divorce Laws Raise Divorce Rates? A Reconciliation and New Results. American Economic Review, 96(5), 1802-1820.
Examples
>>> from diff_diff.datasets import load_divorce_laws
>>> from diff_diff import CallawaySantAnna, SunAbraham
>>>
>>> divorce = load_divorce_laws()
>>> cs = CallawaySantAnna(control_group="never_treated")
>>> results = cs.fit(
...     divorce,
...     outcome="divorce_rate",
...     unit="state",
...     time="year",
...     first_treat="first_treat"
... )
Example#
from diff_diff.datasets import load_divorce_laws
from diff_diff import CallawaySantAnna

divorce = load_divorce_laws()

# Use states that never adopted unilateral divorce as the comparison group
cs = CallawaySantAnna(control_group="never_treated")
results = cs.fit(
    divorce,
    outcome="divorce_rate",
    unit="state",
    time="year",
    first_treat="first_treat"
)
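With `control_group="never_treated"`, the comparison relies on the cohort coded `first_treat = 0` being non-empty, so it can be worth counting units per cohort before fitting. The sketch below does this over plain dictionaries with the standard library. `cohort_sizes` is a hypothetical helper, and the rows use placeholder state labels and adoption years rather than values from the dataset.

```python
from collections import defaultdict


def cohort_sizes(rows, unit="state", cohort="first_treat"):
    """Count distinct units in each treatment cohort (0 = never treated)."""
    units = defaultdict(set)
    for r in rows:
        units[r[cohort]].add(r[unit])
    return {g: len(ids) for g, ids in sorted(units.items())}


# Toy panel rows with placeholder labels, not taken from the dataset.
toy_panel = [
    {"state": "A", "year": 1968, "first_treat": 0},
    {"state": "A", "year": 1969, "first_treat": 0},
    {"state": "B", "year": 1968, "first_treat": 1970},
    {"state": "C", "year": 1968, "first_treat": 1970},
    {"state": "D", "year": 1968, "first_treat": 1973},
]

print(cohort_sizes(toy_panel))  # {0: 1, 1970: 2, 1973: 1}
```

Collecting units in sets means that repeated state-years for the same state are counted once, as state "A" shows.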
load_mpdta#
Minimum wage panel data for training (Callaway & Sant’Anna 2021). Simulated
county-level employment data with staggered minimum wage increases (2003–2007),
from the R did package.
- diff_diff.load_mpdta(force_download=False)
Load the Minimum Wage Panel Dataset for DiD Analysis (mpdta).
This is a simulated dataset from the R did package that mimics county-level employment data under staggered minimum wage increases. It’s designed specifically for teaching the Callaway-Sant’Anna estimator.
- Parameters:
force_download (bool, default=False) – If True, re-download the dataset even if cached.
- Returns:
Panel dataset with columns:
  - countyreal : int - County identifier
  - year : int - Year (2003-2007)
  - lpop : float - Log population
  - lemp : float - Log employment (outcome)
  - first_treat : int - Year of minimum wage increase (0 = never)
  - treat : int - 1 if ever treated, 0 otherwise
- Return type:
pd.DataFrame
Notes
This dataset is included in the R did package and is commonly used in tutorials demonstrating the Callaway-Sant’Anna estimator.
References
Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.
Examples
>>> from diff_diff.datasets import load_mpdta
>>> from diff_diff import CallawaySantAnna
>>>
>>> mpdta = load_mpdta()
>>> cs = CallawaySantAnna()
>>> results = cs.fit(
...     mpdta,
...     outcome="lemp",
...     unit="countyreal",
...     time="year",
...     first_treat="first_treat"
... )
Example#
from diff_diff.datasets import load_mpdta
from diff_diff import CallawaySantAnna

mpdta = load_mpdta()

# Fit with the estimator's default settings
cs = CallawaySantAnna()
results = cs.fit(
    mpdta,
    outcome="lemp",
    unit="countyreal",
    time="year",
    first_treat="first_treat"
)
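Staggered-adoption analyses are often read in event time, meaning years relative to a unit's own `first_treat`. A minimal sketch of that conversion follows, assuming `first_treat = 0` means never treated as in the column description. `event_time` is a hypothetical helper, not part of diff_diff, and the adoption year used is illustrative.

```python
def event_time(year, first_treat):
    """Years since treatment began; None for never-treated units."""
    if first_treat == 0:
        return None
    return year - first_treat


# Illustrative adoption year within the 2003-2007 panel window.
print([event_time(y, 2006) for y in range(2003, 2008)])  # [-3, -2, -1, 0, 1]
print(event_time(2005, 0))  # None
```

Returning `None` for never-treated units keeps them out of any event-time grouping instead of assigning them a misleading value such as the calendar year.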
Utility Functions#
load_dataset#
Generic loader that fetches a dataset by name.
- diff_diff.load_dataset(name, force_download=False)
Load a dataset by name.
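- Parameters:
name (str) – Name of the dataset to load, e.g. "card_krueger".
force_download (bool, default=False) – If True, re-download the dataset even if cached.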
- Returns:
The requested dataset.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the dataset name is not recognized.
Examples
>>> from diff_diff.datasets import load_dataset, list_datasets
>>> print(list_datasets())
>>> df = load_dataset("card_krueger")
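Loading by name with a `ValueError` for unknown names is commonly built on a name-to-loader registry. The sketch below shows that pattern under stated assumptions: `make_loader`, the registry contents, and the error message wording are hypothetical, and the stand-in loaders return labels instead of DataFrames. It is not the diff_diff implementation.

```python
def make_loader(registry):
    """Build a load-by-name function from a {name: loader} mapping."""

    def load(name, force_download=False):
        try:
            loader = registry[name]
        except KeyError:
            known = ", ".join(sorted(registry))
            raise ValueError(
                f"Unknown dataset {name!r}. Available: {known}"
            ) from None
        return loader(force_download=force_download)

    return load


# Stand-in loaders that return labels instead of DataFrames.
load = make_loader({
    "card_krueger": lambda force_download=False: "card_krueger data",
    "mpdta": lambda force_download=False: "mpdta data",
})

print(load("card_krueger"))  # card_krueger data
try:
    load("not_a_dataset")
except ValueError as err:
    print(err)  # Unknown dataset 'not_a_dataset'. Available: card_krueger, mpdta
```

Listing the known names in the error message saves the caller a separate lookup when a name is mistyped.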
list_datasets#
List all available datasets with descriptions.
clear_cache#
Remove all cached dataset files from ~/.cache/diff_diff/datasets/.
Listing and Loading Datasets#
from diff_diff.datasets import list_datasets, load_dataset

# See what's available
for name, description in list_datasets().items():
    print(f"{name}: {description}")

# Load by name
df = load_dataset("card_krueger")
print(df.shape)
print(df.columns.tolist())