{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic Difference-in-Differences with diff-diff\n",
    "\n",
    "This notebook demonstrates how to use the `diff-diff` library for basic 2x2 Difference-in-Differences (DiD) analysis. We'll cover:\n",
    "\n",
    "1. Setting up a basic DiD estimation\n",
    "2. Using both column-name and formula interfaces\n",
    "3. Interpreting results\n",
    "4. Adding covariates\n",
    "5. Using fixed effects\n",
    "6. Cluster-robust and wild bootstrap inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from diff_diff import DifferenceInDifferences, TwoWayFixedEffects\n",
    "from diff_diff.prep import generate_did_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Generate Sample Data\n",
    "\n",
    "The `generate_did_data` function creates synthetic panel data with a known treatment effect, which is useful for learning and testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate synthetic DiD data with known ATT of 5.0\n",
    "data = generate_did_data(\n",
    "    n_units=100,\n",
    "    n_periods=2,\n",
    "    treatment_effect=5.0,\n",
    "    treatment_fraction=0.5,\n",
    "    treatment_period=1,  # Period 1 is post-treatment (periods are 0 and 1)\n",
    "    noise_sd=1.0,\n",
    "    seed=42\n",
    ")\n",
    "\n",
    "print(f\"Dataset shape: {data.shape}\")\n",
    "data.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Examine the data structure\n",
    "print(\"Treatment and time distribution:\")\n",
    "print(data.groupby(['treated', 'post']).size().unstack(fill_value=0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Basic DiD Estimation\n",
    "\n",
    "The `DifferenceInDifferences` estimator provides an sklearn-like interface with a `fit()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the estimator\n",
    "did = DifferenceInDifferences()\n",
    "\n",
    "# Fit using column names\n",
    "results = did.fit(\n",
    "    data,\n",
    "    outcome=\"outcome\",\n",
    "    treatment=\"treated\",\n",
    "    time=\"post\"\n",
    ")\n",
    "\n",
    "# Print the summary\n",
    "print(results.summary())"
   ]
  },
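  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the 2x2 case without covariates, the DiD estimate is simple arithmetic: the pre-to-post change in the treated group's mean outcome minus the same change in the control group. The cell below is a minimal sketch of that arithmetic on a small made-up dataset (the `rows` values and the `cell_mean` helper are illustrative and not part of `diff-diff`). Applying the same four-means calculation to `data`, for example via `data.groupby(['treated', 'post'])['outcome'].mean()`, should match `results.att` up to floating-point error, assuming the estimator fits the standard interaction regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy 2x2 data (made-up values): (treated, post, outcome)\n",
    "rows = [\n",
    "    (0, 0, 10.0), (0, 0, 12.0),  # control, pre\n",
    "    (0, 1, 13.0), (0, 1, 15.0),  # control, post\n",
    "    (1, 0, 11.0), (1, 0, 13.0),  # treated, pre\n",
    "    (1, 1, 19.0), (1, 1, 21.0),  # treated, post\n",
    "]\n",
    "\n",
    "def cell_mean(rows, treated, post):\n",
    "    # Mean outcome in one treatment-by-period cell\n",
    "    values = [y for t, p, y in rows if t == treated and p == post]\n",
    "    return sum(values) / len(values)\n",
    "\n",
    "# Pre-to-post change in each group\n",
    "change_treated = cell_mean(rows, 1, 1) - cell_mean(rows, 1, 0)\n",
    "change_control = cell_mean(rows, 0, 1) - cell_mean(rows, 0, 0)\n",
    "\n",
    "# DiD: the treated group's change net of the control group's change\n",
    "att_by_hand = change_treated - change_control\n",
    "print(f'Change in treated group: {change_treated:.2f}')\n",
    "print(f'Change in control group: {change_control:.2f}')\n",
    "print(f'DiD estimate by hand: {att_by_hand:.2f}')"
   ]
  },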
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Understanding the Results\n",
    "\n",
    "The key results are:\n",
    "- **ATT (Average Treatment Effect on the Treated)**: The estimated causal effect of the treatment\n",
    "- **SE**: Standard error of the estimate\n",
    "- **t-stat**: T-statistic for testing H0: ATT = 0\n",
    "- **p-value**: Two-sided p-value\n",
    "- **95% CI**: Confidence interval for the ATT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Access individual components\n",
    "print(f\"Estimated ATT: {results.att:.4f}\")\n",
    "print(f\"True ATT: 5.0\")\n",
    "print(f\"Standard Error: {results.se:.4f}\")\n",
    "print(f\"95% CI: [{results.conf_int[0]:.4f}, {results.conf_int[1]:.4f}]\")\n",
    "print(f\"P-value: {results.p_value:.4f}\")\n",
    "print(f\"Is significant at 5% level: {results.is_significant}\")\n",
    "print(f\"Significance stars: {results.significance_stars}\")"
   ]
  },
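  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The quantities listed above are linked by a few formulas: the t-statistic is the estimate divided by its standard error, and the p-value and confidence interval follow from a reference distribution. The sketch below uses a standard normal reference distribution and made-up values for the estimate and standard error; both are assumptions for illustration. `diff-diff` may use a t-distribution with a degrees-of-freedom correction, so its reported numbers can differ slightly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statistics import NormalDist\n",
    "\n",
    "# Made-up estimate and standard error, for illustration only\n",
    "att_hat = 0.5\n",
    "se_hat = 0.25\n",
    "\n",
    "std_normal = NormalDist()  # assumption: normal reference distribution\n",
    "\n",
    "# Test statistic for H0: ATT = 0\n",
    "t_stat = att_hat / se_hat\n",
    "\n",
    "# Two-sided p-value: probability of a statistic at least this extreme under H0\n",
    "p_value = 2 * (1 - std_normal.cdf(abs(t_stat)))\n",
    "\n",
    "# 95% confidence interval: estimate plus or minus the critical value times the SE\n",
    "critical_value = std_normal.inv_cdf(0.975)\n",
    "ci_lower = att_hat - critical_value * se_hat\n",
    "ci_upper = att_hat + critical_value * se_hat\n",
    "\n",
    "print(f't-stat: {t_stat:.4f}')\n",
    "print(f'p-value: {p_value:.4f}')\n",
    "print(f'95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]')"
   ]
  },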
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Using the Formula Interface\n",
    "\n",
    "For those familiar with R, `diff-diff` supports a formula interface similar to R's notation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using formula interface (R-style)\n",
    "did_formula = DifferenceInDifferences()\n",
    "results_formula = did_formula.fit(\n",
    "    data,\n",
    "    formula=\"outcome ~ treated * post\"\n",
    ")\n",
    "\n",
    "print(results_formula.summary())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Verify both methods give the same result\n",
    "print(f\"Column-name ATT: {results.att:.6f}\")\n",
    "print(f\"Formula ATT: {results_formula.att:.6f}\")\n",
    "print(f\"Difference: {abs(results.att - results_formula.att):.2e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Adding Covariates\n",
    "\n",
    "You can include additional control variables to improve precision and reduce bias from observed confounders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add some covariates to our data\n",
    "np.random.seed(42)\n",
    "data['size'] = np.random.normal(100, 20, len(data))\n",
    "data['age'] = np.random.normal(10, 3, len(data))\n",
    "\n",
    "# Fit with covariates\n",
    "did_cov = DifferenceInDifferences()\n",
    "results_cov = did_cov.fit(\n",
    "    data,\n",
    "    outcome=\"outcome\",\n",
    "    treatment=\"treated\",\n",
    "    time=\"post\",\n",
    "    covariates=[\"size\", \"age\"]\n",
    ")\n",
    "\n",
    "print(results_cov.summary())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# All coefficient estimates are available\n",
    "print(\"All coefficients:\")\n",
    "for name, coef in results_cov.coefficients.items():\n",
    "    print(f\"  {name}: {coef:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Fixed Effects\n",
    "\n",
    "Fixed effects control for time-invariant unobserved heterogeneity. `diff-diff` supports two approaches:\n",
    "\n",
    "1. **Dummy variables** (`fixed_effects`): Creates indicator variables for each level\n",
    "2. **Within-transformation** (`absorb`): Demeans data by group (more efficient for high-dimensional FE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate data with more structure\n",
    "np.random.seed(42)\n",
    "n_units = 50\n",
    "n_periods = 4\n",
    "\n",
    "panel_data = []\n",
    "for unit in range(n_units):\n",
    "    is_treated = unit < n_units // 2\n",
    "    state = unit % 5  # 5 states\n",
    "    unit_effect = np.random.normal(0, 2)\n",
    "    \n",
    "    for period in range(n_periods):\n",
    "        post = 1 if period >= 2 else 0\n",
    "        y = 10.0 + unit_effect + period * 0.5 + state * 1.5\n",
    "        if is_treated and post:\n",
    "            y += 4.0  # True ATT = 4.0\n",
    "        y += np.random.normal(0, 0.5)\n",
    "        \n",
    "        panel_data.append({\n",
    "            'unit': unit,\n",
    "            'state': f'state_{state}',\n",
    "            'period': period,\n",
    "            'treated': int(is_treated),\n",
    "            'post': post,\n",
    "            'outcome': y\n",
    "        })\n",
    "\n",
    "panel_df = pd.DataFrame(panel_data)\n",
    "print(f\"Panel data: {panel_df.shape[0]} observations\")\n",
    "panel_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using fixed effects with dummy variables\n",
    "did_fe = DifferenceInDifferences()\n",
    "results_fe = did_fe.fit(\n",
    "    panel_df,\n",
    "    outcome=\"outcome\",\n",
    "    treatment=\"treated\",\n",
    "    time=\"post\",\n",
    "    fixed_effects=[\"state\"]\n",
    ")\n",
    "\n",
    "print(results_fe.summary())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using absorbed fixed effects (within-transformation)\n",
    "# This is more efficient for high-dimensional fixed effects\n",
    "did_absorb = DifferenceInDifferences()\n",
    "results_absorb = did_absorb.fit(\n",
    "    panel_df,\n",
    "    outcome=\"outcome\",\n",
    "    treatment=\"treated\",\n",
    "    time=\"post\",\n",
    "    absorb=[\"unit\"]  # Absorb unit fixed effects\n",
    ")\n",
    "\n",
    "print(results_absorb.summary())"
   ]
  },
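  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the within-transformation does, the sketch below demeans a tiny made-up panel by unit in plain Python and compares the pooled slope with the within slope. The data, the `slope` helper, and the `demean_by_group` helper are illustrative assumptions, not `diff-diff` internals. Here `y = unit_effect + 2 * x` and `x` is correlated with the unit effects, so pooled OLS is badly biased while the demeaned regression recovers the slope of 2. Including one dummy variable per unit would give the same slope (the Frisch-Waugh-Lovell result)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "def slope(xs, ys):\n",
    "    # OLS slope of ys on xs (with an intercept)\n",
    "    x_bar = sum(xs) / len(xs)\n",
    "    y_bar = sum(ys) / len(ys)\n",
    "    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))\n",
    "    s_xx = sum((x - x_bar) ** 2 for x in xs)\n",
    "    return s_xy / s_xx\n",
    "\n",
    "def demean_by_group(groups, values):\n",
    "    # Within-transformation: subtract each group's mean from its members\n",
    "    totals, counts = defaultdict(float), defaultdict(int)\n",
    "    for g, v in zip(groups, values):\n",
    "        totals[g] += v\n",
    "        counts[g] += 1\n",
    "    return [v - totals[g] / counts[g] for g, v in zip(groups, values)]\n",
    "\n",
    "# Made-up panel: units with a high fixed effect have low x values\n",
    "unit_effects = {'a': 10.0, 'b': -5.0, 'c': 3.0}\n",
    "x_by_unit = {'a': [0.0, 1.0, 2.0], 'b': [3.0, 4.0, 5.0], 'c': [1.0, 2.0, 3.0]}\n",
    "\n",
    "units, xs, ys = [], [], []\n",
    "for u, x_values in x_by_unit.items():\n",
    "    for x in x_values:\n",
    "        units.append(u)\n",
    "        xs.append(x)\n",
    "        ys.append(unit_effects[u] + 2.0 * x)  # true slope is 2.0\n",
    "\n",
    "pooled_slope = slope(xs, ys)\n",
    "within_slope = slope(demean_by_group(units, xs), demean_by_group(units, ys))\n",
    "\n",
    "print(f'Pooled OLS slope (ignores unit effects): {pooled_slope:.4f}')\n",
    "print(f'Within slope (after demeaning by unit): {within_slope:.4f}')"
   ]
  },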
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Two-Way Fixed Effects (TWFE)\n",
    "\n",
    "For panel data, the `TwoWayFixedEffects` estimator automatically includes both unit and time fixed effects using within-transformation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Two-Way Fixed Effects estimator\n",
    "twfe = TwoWayFixedEffects()\n",
    "results_twfe = twfe.fit(\n",
    "    panel_df,\n",
    "    outcome=\"outcome\",\n",
    "    treatment=\"treated\",\n",
    "    time=\"period\",  # Use actual time periods\n",
    "    unit=\"unit\"\n",
    ")\n",
    "\n",
    "print(results_twfe.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Robust Inference\n",
    "\n",
    "### Cluster-Robust Standard Errors\n",
    "\n",
    "When observations are correlated within clusters (e.g., units over time), use cluster-robust standard errors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create clustered data\n",
    "np.random.seed(42)\n",
    "n_clusters = 20\n",
    "obs_per_cluster = 10\n",
    "\n",
    "clustered_data = []\n",
    "for cluster in range(n_clusters):\n",
    "    is_treated = cluster < n_clusters // 2\n",
    "    cluster_effect = np.random.normal(0, 2)\n",
    "    \n",
    "    for obs in range(obs_per_cluster):\n",
    "        for period in [0, 1]:\n",
    "            y = 10.0 + cluster_effect\n",
    "            if period == 1:\n",
    "                y += 3.0\n",
    "            if is_treated and period == 1:\n",
    "                y += 2.5  # True ATT = 2.5\n",
    "            y += np.random.normal(0, 0.5)\n",
    "            \n",
    "            clustered_data.append({\n",
    "                'cluster': cluster,\n",
    "                'obs': obs,\n",
    "                'period': period,\n",
    "                'treated': int(is_treated),\n",
    "                'post': period,\n",
    "                'outcome': y\n",
    "            })\n",
    "\n",
    "clustered_df = pd.DataFrame(clustered_data)\n",
    "print(f\"Clustered data: {clustered_df.shape[0]} observations in {n_clusters} clusters\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare standard errors: robust vs cluster-robust\n",
    "did_robust = DifferenceInDifferences(robust=True)\n",
    "did_cluster = DifferenceInDifferences(cluster=\"cluster\")\n",
    "\n",
    "results_robust = did_robust.fit(\n",
    "    clustered_df,\n",
    "    outcome=\"outcome\",\n",
    "    treatment=\"treated\",\n",
    "    time=\"post\"\n",
    ")\n",
    "\n",
    "results_cluster = did_cluster.fit(\n",
    "    clustered_df,\n",
    "    outcome=\"outcome\",\n",
    "    treatment=\"treated\",\n",
    "    time=\"post\"\n",
    ")\n",
    "\n",
    "print(f\"ATT (both methods): {results_robust.att:.4f}\")\n",
    "print(f\"Robust SE (HC1): {results_robust.se:.4f}\")\n",
    "print(f\"Cluster-robust SE: {results_cluster.se:.4f}\")\n",
    "print(f\"\\nCluster-robust SE is {results_cluster.se / results_robust.se:.2f}x larger\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Wild Cluster Bootstrap\n",
    "\n",
    "For better inference with few clusters (<50), use the wild cluster bootstrap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Wild cluster bootstrap inference\n",
    "did_bootstrap = DifferenceInDifferences(\n",
    "    cluster=\"cluster\",\n",
    "    inference=\"wild_bootstrap\",\n",
    "    n_bootstrap=999,\n",
    "    bootstrap_weights=\"rademacher\",\n",
    "    seed=42\n",
    ")\n",
    "\n",
    "results_bootstrap = did_bootstrap.fit(\n",
    "    clustered_df,\n",
    "    outcome=\"outcome\",\n",
    "    treatment=\"treated\",\n",
    "    time=\"post\"\n",
    ")\n",
    "\n",
    "print(results_bootstrap.summary())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare inference methods\n",
    "print(\"Comparison of inference methods:\")\n",
    "print(f\"{'Method':<25} {'SE':>10} {'p-value':>10} {'95% CI':>25}\")\n",
    "print(\"-\" * 70)\n",
    "print(f\"{'Cluster-robust (analytical)':<25} {results_cluster.se:>10.4f} {results_cluster.p_value:>10.4f} [{results_cluster.conf_int[0]:>8.4f}, {results_cluster.conf_int[1]:>8.4f}]\")\n",
    "print(f\"{'Wild cluster bootstrap':<25} {results_bootstrap.se:>10.4f} {results_bootstrap.p_value:>10.4f} [{results_bootstrap.conf_int[0]:>8.4f}, {results_bootstrap.conf_int[1]:>8.4f}]\")"
   ]
  },
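  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea behind the wild cluster bootstrap can be sketched in a few lines. The cell below is a simplified, null-imposed version with Rademacher weights for a regression of an outcome on a cluster-level treatment dummy, not the full DiD regression and not necessarily how `diff-diff` implements it. The function names and simulated data are assumptions for illustration. Each replication flips the sign of the restricted residuals cluster by cluster and recomputes the cluster-robust t-statistic; the p-value is the share of bootstrap t-statistics at least as large in absolute value as the observed one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "def slope_and_cluster_t(x, y, cluster):\n",
    "    # OLS slope of y on x (with an intercept) and its cluster-robust t-statistic\n",
    "    n = len(y)\n",
    "    x_bar = sum(x) / n\n",
    "    y_bar = sum(y) / n\n",
    "    s_xx = sum((xi - x_bar) ** 2 for xi in x)\n",
    "    beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx\n",
    "    alpha = y_bar - beta * x_bar\n",
    "    # Sum the score (x_i - x_bar) * residual_i within each cluster\n",
    "    scores = {}\n",
    "    for xi, yi, g in zip(x, y, cluster):\n",
    "        resid = yi - alpha - beta * xi\n",
    "        scores[g] = scores.get(g, 0.0) + (xi - x_bar) * resid\n",
    "    # Sandwich variance of the slope, without a small-sample correction\n",
    "    var_beta = sum(s ** 2 for s in scores.values()) / s_xx ** 2\n",
    "    return beta, beta / var_beta ** 0.5\n",
    "\n",
    "def wild_cluster_bootstrap_p(x, y, cluster, n_boot=999, seed=0):\n",
    "    rng = random.Random(seed)\n",
    "    _, t_obs = slope_and_cluster_t(x, y, cluster)\n",
    "    # Impose H0 (slope = 0): the restricted fit is just the overall mean\n",
    "    y_bar = sum(y) / len(y)\n",
    "    resid_restricted = [yi - y_bar for yi in y]\n",
    "    groups = sorted(set(cluster))\n",
    "    n_exceed = 0\n",
    "    for _ in range(n_boot):\n",
    "        # One Rademacher draw (+1 or -1) per cluster, shared by all of its rows\n",
    "        w = {g: rng.choice((-1.0, 1.0)) for g in groups}\n",
    "        y_star = [y_bar + w[g] * r for g, r in zip(cluster, resid_restricted)]\n",
    "        _, t_star = slope_and_cluster_t(x, y_star, cluster)\n",
    "        if abs(t_star) >= abs(t_obs):\n",
    "            n_exceed += 1\n",
    "    return n_exceed / n_boot\n",
    "\n",
    "# Simulated data: 12 clusters of 10 observations, the first 6 clusters treated\n",
    "data_rng = random.Random(42)\n",
    "x, y, cluster = [], [], []\n",
    "for g in range(12):\n",
    "    treated = 1.0 if g < 6 else 0.0\n",
    "    cluster_effect = data_rng.gauss(0.0, 0.5)\n",
    "    for _ in range(10):\n",
    "        x.append(treated)\n",
    "        y.append(10.0 + 3.0 * treated + cluster_effect + data_rng.gauss(0.0, 0.5))\n",
    "        cluster.append(g)\n",
    "\n",
    "p_boot = wild_cluster_bootstrap_p(x, y, cluster)\n",
    "print(f'Wild cluster bootstrap p-value: {p_boot:.3f}')"
   ]
  },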
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Exporting Results\n",
    "\n",
    "Results can be exported to various formats for reporting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export to dictionary\n",
    "result_dict = results.to_dict()\n",
    "print(\"As dictionary:\")\n",
    "for key, value in result_dict.items():\n",
    "    if isinstance(value, float):\n",
    "        print(f\"  {key}: {value:.4f}\")\n",
    "    else:\n",
    "        print(f\"  {key}: {value}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export to DataFrame (useful for combining multiple estimates)\n",
    "result_df = results.to_dataframe()\n",
    "print(\"\\nAs DataFrame:\")\n",
    "result_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this notebook, we covered:\n",
    "\n",
    "- **Basic DiD estimation** with both column-name and formula interfaces\n",
    "- **Adding covariates** to control for observed confounders\n",
    "- **Fixed effects** using dummy variables or within-transformation\n",
    "- **Two-Way Fixed Effects** for panel data\n",
    "- **Cluster-robust standard errors** for correlated observations\n",
    "- **Wild cluster bootstrap** for robust inference with few clusters\n",
    "\n",
    "For more advanced topics, see the other example notebooks:\n",
    "- `02_staggered_did.ipynb` - Staggered adoption with Callaway-Sant'Anna\n",
    "- `03_synthetic_did.ipynb` - Synthetic Difference-in-Differences\n",
    "- `04_parallel_trends.ipynb` - Testing and diagnostics"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}