{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic Difference-in-Differences with diff-diff\n", "\n", "This notebook demonstrates how to use the `diff-diff` library for basic 2x2 Difference-in-Differences (DiD) analysis. We'll cover:\n", "\n", "1. Setting up a basic DiD estimation\n", "2. Using both column-name and formula interfaces\n", "3. Interpreting results\n", "4. Adding covariates\n", "5. Using fixed effects\n", "6. Cluster-robust and wild bootstrap inference" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from diff_diff import DifferenceInDifferences, TwoWayFixedEffects\n", "from diff_diff.prep import generate_did_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Generate Sample Data\n", "\n", "The `generate_did_data` function creates synthetic panel data with a known treatment effect, which is useful for learning and testing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate synthetic DiD data with known ATT of 5.0\n", "data = generate_did_data(\n", " n_units=100,\n", " n_periods=2,\n", " treatment_effect=5.0,\n", " treatment_fraction=0.5,\n", " treatment_period=1, # Period 1 is post-treatment (periods are 0 and 1)\n", " noise_sd=1.0,\n", " seed=42\n", ")\n", "\n", "print(f\"Dataset shape: {data.shape}\")\n", "data.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Examine the data structure\n", "print(\"Treatment and time distribution:\")\n", "print(data.groupby(['treated', 'post']).size().unstack(fill_value=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Basic DiD Estimation\n", "\n", "The `DifferenceInDifferences` estimator provides an sklearn-like interface with a `fit()` method." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the estimator\n", "did = DifferenceInDifferences()\n", "\n", "# Fit using column names\n", "results = did.fit(\n", " data,\n", " outcome=\"outcome\",\n", " treatment=\"treated\",\n", " time=\"post\"\n", ")\n", "\n", "# Print the summary\n", "print(results.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Understanding the Results\n", "\n", "The key results are:\n", "- **ATT (Average Treatment Effect on the Treated)**: The estimated causal effect of the treatment\n", "- **SE**: Standard error of the estimate\n", "- **t-stat**: T-statistic for testing H0: ATT = 0\n", "- **p-value**: Two-sided p-value\n", "- **95% CI**: Confidence interval for the ATT" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Access individual components\n", "print(f\"Estimated ATT: {results.att:.4f}\")\n", "print(f\"True ATT: 5.0\")\n", "print(f\"Standard Error: {results.se:.4f}\")\n", "print(f\"95% CI: [{results.conf_int[0]:.4f}, {results.conf_int[1]:.4f}]\")\n", "print(f\"P-value: {results.p_value:.4f}\")\n", "print(f\"Is significant at 5% level: {results.is_significant}\")\n", "print(f\"Significance stars: {results.significance_stars}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Using the Formula Interface\n", "\n", "For those familiar with R, `diff-diff` supports a formula interface similar to R's notation." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Using formula interface (R-style)\n", "did_formula = DifferenceInDifferences()\n", "results_formula = did_formula.fit(\n", " data,\n", " formula=\"outcome ~ treated * post\"\n", ")\n", "\n", "print(results_formula.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Verify both methods give the same result\n", "print(f\"Column-name ATT: {results.att:.6f}\")\n", "print(f\"Formula ATT: {results_formula.att:.6f}\")\n", "print(f\"Difference: {abs(results.att - results_formula.att):.2e}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Adding Covariates\n", "\n", "You can include additional control variables to improve precision and reduce bias from observed confounders." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add some covariates to our data\n", "np.random.seed(42)\n", "data['size'] = np.random.normal(100, 20, len(data))\n", "data['age'] = np.random.normal(10, 3, len(data))\n", "\n", "# Fit with covariates\n", "did_cov = DifferenceInDifferences()\n", "results_cov = did_cov.fit(\n", " data,\n", " outcome=\"outcome\",\n", " treatment=\"treated\",\n", " time=\"post\",\n", " covariates=[\"size\", \"age\"]\n", ")\n", "\n", "print(results_cov.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# All coefficient estimates are available\n", "print(\"All coefficients:\")\n", "for name, coef in results_cov.coefficients.items():\n", " print(f\" {name}: {coef:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Fixed Effects\n", "\n", "Fixed effects control for time-invariant unobserved heterogeneity. `diff-diff` supports two approaches:\n", "\n", "1. **Dummy variables** (`fixed_effects`): Creates indicator variables for each level\n", "2. 
**Within-transformation** (`absorb`): Demeans data by group (more efficient for high-dimensional FE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate data with more structure\n", "np.random.seed(42)\n", "n_units = 50\n", "n_periods = 4\n", "\n", "panel_data = []\n", "for unit in range(n_units):\n", "    is_treated = unit < n_units // 2\n", "    state = unit % 5  # 5 states\n", "    unit_effect = np.random.normal(0, 2)\n", "\n", "    for period in range(n_periods):\n", "        post = 1 if period >= 2 else 0\n", "        y = 10.0 + unit_effect + period * 0.5 + state * 1.5\n", "        if is_treated and post:\n", "            y += 4.0  # True ATT = 4.0\n", "        y += np.random.normal(0, 0.5)\n", "\n", "        panel_data.append({\n", "            'unit': unit,\n", "            'state': f'state_{state}',\n", "            'period': period,\n", "            'treated': int(is_treated),\n", "            'post': post,\n", "            'outcome': y\n", "        })\n", "\n", "panel_df = pd.DataFrame(panel_data)\n", "print(f\"Panel data: {panel_df.shape[0]} observations\")\n", "panel_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Using fixed effects with dummy variables\n", "did_fe = DifferenceInDifferences()\n", "results_fe = did_fe.fit(\n", "    panel_df,\n", "    outcome=\"outcome\",\n", "    treatment=\"treated\",\n", "    time=\"post\",\n", "    fixed_effects=[\"state\"]\n", ")\n", "\n", "print(results_fe.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Using absorbed fixed effects (within-transformation)\n", "# This is more efficient for high-dimensional fixed effects\n", "did_absorb = DifferenceInDifferences()\n", "results_absorb = did_absorb.fit(\n", "    panel_df,\n", "    outcome=\"outcome\",\n", "    treatment=\"treated\",\n", "    time=\"post\",\n", "    absorb=[\"unit\"]  # Absorb unit fixed effects\n", ")\n", "\n", "print(results_absorb.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. 
Two-Way Fixed Effects (TWFE)\n", "\n", "For panel data, the `TwoWayFixedEffects` estimator automatically includes both unit and time fixed effects using within-transformation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Two-Way Fixed Effects estimator\n", "twfe = TwoWayFixedEffects()\n", "results_twfe = twfe.fit(\n", "    panel_df,\n", "    outcome=\"outcome\",\n", "    treatment=\"treated\",\n", "    time=\"period\",  # Use actual time periods\n", "    unit=\"unit\"\n", ")\n", "\n", "print(results_twfe.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Robust Inference\n", "\n", "### Cluster-Robust Standard Errors\n", "\n", "When observations are correlated within clusters (e.g., units over time), use cluster-robust standard errors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create clustered data\n", "np.random.seed(42)\n", "n_clusters = 20\n", "obs_per_cluster = 10\n", "\n", "clustered_data = []\n", "for cluster in range(n_clusters):\n", "    is_treated = cluster < n_clusters // 2\n", "    cluster_effect = np.random.normal(0, 2)\n", "\n", "    for obs in range(obs_per_cluster):\n", "        for period in [0, 1]:\n", "            y = 10.0 + cluster_effect\n", "            if period == 1:\n", "                y += 3.0\n", "            if is_treated and period == 1:\n", "                y += 2.5  # True ATT = 2.5\n", "            y += np.random.normal(0, 0.5)\n", "\n", "            clustered_data.append({\n", "                'cluster': cluster,\n", "                'obs': obs,\n", "                'period': period,\n", "                'treated': int(is_treated),\n", "                'post': period,\n", "                'outcome': y\n", "            })\n", "\n", "clustered_df = pd.DataFrame(clustered_data)\n", "print(f\"Clustered data: {clustered_df.shape[0]} observations in {n_clusters} clusters\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compare standard errors: robust vs cluster-robust\n", "did_robust = DifferenceInDifferences(robust=True)\n", "did_cluster = 
DifferenceInDifferences(cluster=\"cluster\")\n", "\n", "results_robust = did_robust.fit(\n", "    clustered_df,\n", "    outcome=\"outcome\",\n", "    treatment=\"treated\",\n", "    time=\"post\"\n", ")\n", "\n", "results_cluster = did_cluster.fit(\n", "    clustered_df,\n", "    outcome=\"outcome\",\n", "    treatment=\"treated\",\n", "    time=\"post\"\n", ")\n", "\n", "print(f\"ATT (both methods): {results_robust.att:.4f}\")\n", "print(f\"Robust SE (HC1): {results_robust.se:.4f}\")\n", "print(f\"Cluster-robust SE: {results_cluster.se:.4f}\")\n", "print(f\"\\nRatio of cluster-robust SE to HC1 SE: {results_cluster.se / results_robust.se:.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wild Cluster Bootstrap\n", "\n", "With few clusters (fewer than about 50), analytical cluster-robust standard errors can be unreliable; the wild cluster bootstrap typically gives better inference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Wild cluster bootstrap inference\n", "did_bootstrap = DifferenceInDifferences(\n", "    cluster=\"cluster\",\n", "    inference=\"wild_bootstrap\",\n", "    n_bootstrap=999,\n", "    bootstrap_weights=\"rademacher\",\n", "    seed=42\n", ")\n", "\n", "results_bootstrap = did_bootstrap.fit(\n", "    clustered_df,\n", "    outcome=\"outcome\",\n", "    treatment=\"treated\",\n", "    time=\"post\"\n", ")\n", "\n", "print(results_bootstrap.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compare inference methods\n", "# The label column is 28 wide so the longest method name fits without shifting the row\n", "print(\"Comparison of inference methods:\")\n", "print(f\"{'Method':<28} {'SE':>10} {'p-value':>10} {'95% CI':>20}\")\n", "print(\"-\" * 71)\n", "print(f\"{'Cluster-robust (analytical)':<28} {results_cluster.se:>10.4f} {results_cluster.p_value:>10.4f} [{results_cluster.conf_int[0]:>8.4f}, {results_cluster.conf_int[1]:>8.4f}]\")\n", "print(f\"{'Wild cluster bootstrap':<28} {results_bootstrap.se:>10.4f} {results_bootstrap.p_value:>10.4f} [{results_bootstrap.conf_int[0]:>8.4f}, {results_bootstrap.conf_int[1]:>8.4f}]\")" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "## 8. Exporting Results\n", "\n", "Results can be exported to various formats for reporting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Export to dictionary\n", "result_dict = results.to_dict()\n", "print(\"As dictionary:\")\n", "for key, value in result_dict.items():\n", " if isinstance(value, float):\n", " print(f\" {key}: {value:.4f}\")\n", " else:\n", " print(f\" {key}: {value}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Export to DataFrame (useful for combining multiple estimates)\n", "result_df = results.to_dataframe()\n", "print(\"\\nAs DataFrame:\")\n", "result_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this notebook, we covered:\n", "\n", "- **Basic DiD estimation** with both column-name and formula interfaces\n", "- **Adding covariates** to control for observed confounders\n", "- **Fixed effects** using dummy variables or within-transformation\n", "- **Two-Way Fixed Effects** for panel data\n", "- **Cluster-robust standard errors** for correlated observations\n", "- **Wild cluster bootstrap** for robust inference with few clusters\n", "\n", "For more advanced topics, see the other example notebooks:\n", "- `02_staggered_did.ipynb` - Staggered adoption with Callaway-Sant'Anna\n", "- `03_synthetic_did.ipynb` - Synthetic Difference-in-Differences\n", "- `04_parallel_trends.ipynb` - Testing and diagnostics" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 4 }