{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Imputation DiD (Borusyak, Jaravel & Spiess 2024)\n",
    "\n",
    "This tutorial demonstrates the `ImputationDiD` estimator, which implements the efficient imputation approach from Borusyak, Jaravel & Spiess (2024), \"Revisiting Event-Study Designs: Robust and Efficient Estimation\", *Review of Economic Studies*.\n",
    "\n",
    "**When to use ImputationDiD:**\n",
    "- Staggered adoption settings where treatment effects are plausibly **homogeneous** across cohorts and time; in that case it can produce roughly 50% shorter CIs than Callaway-Sant'Anna\n",
    "- When you want to use **all untreated observations** (never-treated + not-yet-treated) for maximum efficiency\n",
    "- As a complement to Callaway-Sant'Anna or Sun-Abraham: if all three agree, results are robust; if they disagree, investigate heterogeneity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "from diff_diff import (\n",
    "    ImputationDiD, CallawaySantAnna, SunAbraham,\n",
    "    generate_staggered_data, plot_event_study\n",
    ")\n",
    "\n",
    "# For nicer plots (optional)\n",
    "try:\n",
    "    import matplotlib.pyplot as plt\n",
    "    plt.style.use('seaborn-v0_8-whitegrid')\n",
    "    HAS_MATPLOTLIB = True\n",
    "except ImportError:\n",
    "    HAS_MATPLOTLIB = False\n",
    "    print(\"matplotlib not installed - visualization examples will be skipped\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage\n",
    "\n",
    "The imputation estimator follows a simple three-step process:\n",
    "1. Estimate unit and time fixed effects using only untreated observations\n",
    "2. Impute counterfactual Y(0) for treated observations\n",
    "3. Aggregate imputed treatment effects with researcher-chosen weights"
   ]
  },
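  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the three steps in notation (the symbols follow the paper and are not package identifiers; covariates are omitted for brevity), let $\\Omega_0$ denote the untreated observations and $\\Omega_1$ the treated ones, with $N_1 = |\\Omega_1|$:\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\text{Step 1:}\\quad & Y_{it} = \\alpha_i + \\beta_t + \\varepsilon_{it}, \\qquad (i,t) \\in \\Omega_0 \\\\\n",
    "\\text{Step 2:}\\quad & \\hat{\\tau}_{it} = Y_{it} - \\hat{\\alpha}_i - \\hat{\\beta}_t, \\qquad (i,t) \\in \\Omega_1 \\\\\n",
    "\\text{Step 3:}\\quad & \\hat{\\tau}_w = \\sum_{(i,t) \\in \\Omega_1} w_{it}\\, \\hat{\\tau}_{it}\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "The overall ATT corresponds to equal weights $w_{it} = 1/N_1$. Event-study and group aggregations restrict the sum to one horizon or one cohort and reweight accordingly."
   ]
  },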
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate staggered adoption data with known treatment effect\n",
    "data = generate_staggered_data(n_units=300, n_periods=10, treatment_effect=2.0, seed=42)\n",
    "\n",
    "# Fit the imputation estimator\n",
    "est = ImputationDiD()\n",
    "results = est.fit(data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')\n",
    "results.print_summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## Event Study with Pre-Trend Diagnostics\n\nEvent study aggregation estimates treatment effects at each relative time horizon. Setting `pretrends=True` adds **pre-period coefficients** (negative horizons) to the event study, enabling a diagnostic check of the parallel trends assumption.\n\nUnder parallel trends, pre-period coefficients should cluster around zero — indicating no differential trends before treatment. The reference period (h = -1) is normalized to zero by construction."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Fit with event study aggregation and pre-period coefficients\nest = ImputationDiD(pretrends=True)\nresults_es = est.fit(data, outcome='outcome', unit='unit', time='period',\n                     first_treat='first_treat', aggregate='event_study')\n\n# Plot event study — pre-period region is automatically shaded\nif HAS_MATPLOTLIB:\n    plot_event_study(results_es, title='Imputation DiD Event Study (with Pre-Trends)')\nelse:\n    print(\"Install matplotlib to see visualizations: pip install matplotlib\")"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# View event study effects as a table\n",
    "results_es.to_dataframe(level='event_study')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## Formal Pre-Trend Test\n\nThe event study plot above gives a **visual** diagnostic: do the pre-period coefficients look close to zero? For a **statistical** check, `pretrend_test()` runs a Wald F-test of whether all pre-treatment leads are jointly zero (Equation 9 in the paper). The two complement each other: the eye spots patterns, while the F-test quantifies the evidence against parallel trends.\n\nNote: `pretrend_test()` does not require `pretrends=True`. It runs its own lead regression on untreated observations only, separately from the treatment effect estimator (Proposition 9), which mitigates the pre-testing problem identified by Roth (2022)."
  },
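  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to write such a lead regression, as a sketch only: the notation is assumed for illustration (with $E_i$ the first treatment period of unit $i$ and $k$ the number of leads), and the exact lead definition used by the package may differ.\n",
    "\n",
    "$$\n",
    "Y_{it} = \\alpha_i + \\beta_t + \\sum_{h=1}^{k} \\gamma_h \\, \\mathbf{1}[t - E_i = -h] + \\varepsilon_{it}, \\qquad (i,t) \\in \\Omega_0\n",
    "$$\n",
    "\n",
    "The null hypothesis is $H_0: \\gamma_1 = \\dots = \\gamma_k = 0$. With $\\hat{V}$ the estimated covariance matrix of $\\hat{\\gamma}$, the Wald statistic in F form is\n",
    "\n",
    "$$\n",
    "F = \\frac{\\hat{\\gamma}' \\, \\hat{V}^{-1} \\, \\hat{\\gamma}}{k}\n",
    "$$\n",
    "\n",
    "Because the regression uses only untreated observations $\\Omega_0$, a large $F$ points to differential trends before treatment rather than to treatment effects."
   ]
  },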
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run pre-trend test\n",
    "pt = results.pretrend_test(n_leads=3)\n",
    "print(f\"F-statistic: {pt['f_stat']:.3f}\")\n",
    "print(f\"P-value:     {pt['p_value']:.4f}\")\n",
    "print(f\"Leads tested: {pt['n_leads']}\")\n",
    "print(f\"\\nConclusion: {'Fail to reject' if pt['p_value'] > 0.05 else 'Reject'} the null of no pre-trends at the 5% level\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparison with Other Estimators\n",
    "\n",
    "Under homogeneous treatment effects, ImputationDiD, Callaway-Sant'Anna, and Sun-Abraham should produce similar point estimates. The key difference is efficiency: ImputationDiD typically produces shorter confidence intervals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit all three estimators on the same data\n",
    "imp = ImputationDiD().fit(data, outcome='outcome', unit='unit',\n",
    "                          time='period', first_treat='first_treat')\n",
    "cs = CallawaySantAnna().fit(data, outcome='outcome', unit='unit',\n",
    "                            time='period', first_treat='first_treat')\n",
    "sa = SunAbraham().fit(data, outcome='outcome', unit='unit',\n",
    "                      time='period', first_treat='first_treat')\n",
    "\n",
    "print(\"Estimator Comparison (True effect = 2.0)\")\n",
    "print(\"=\" * 55)\n",
    "print(f\"{'Estimator':<25} {'ATT':>8} {'SE':>8} {'CI Width':>10}\")\n",
    "print(\"-\" * 55)\n",
    "\n",
    "for name, r in [(\"ImputationDiD\", imp), (\"CallawaySantAnna\", cs), (\"SunAbraham\", sa)]:\n",
    "    ci_width = r.overall_conf_int[1] - r.overall_conf_int[0]\n",
    "    print(f\"{name:<25} {r.overall_att:>8.3f} {r.overall_se:>8.3f} {ci_width:>10.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Group Aggregation\n",
    "\n",
    "Group aggregation estimates average treatment effects by treatment cohort (groups defined by first treatment period)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit with group aggregation\n",
    "results_grp = ImputationDiD().fit(data, outcome='outcome', unit='unit',\n",
    "                                   time='period', first_treat='first_treat',\n",
    "                                   aggregate='group')\n",
    "results_grp.to_dataframe(level='group')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Features\n",
    "\n",
    "### Anticipation\n",
    "\n",
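    "If treatment effects begin before the official treatment date, set the `anticipation` parameter to the number of pre-treatment periods in which units may already respond, so that those periods are not used as untreated observations."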
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Account for 1 period of anticipation\n",
    "est_antic = ImputationDiD(anticipation=1)\n",
    "results_antic = est_antic.fit(data, outcome='outcome', unit='unit',\n",
    "                               time='period', first_treat='first_treat')\n",
    "print(f\"ATT (no anticipation):       {results.overall_att:.3f}\")\n",
    "print(f\"ATT (1-period anticipation): {results_antic.overall_att:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Auxiliary Model Partition\n",
    "\n",
    "The `aux_partition` parameter sets how treated observations are grouped when fitting the auxiliary model of treatment effects used by the conservative variance estimator (Theorem 3). Finer partitions give tighter SEs but may overfit when there are few observations per group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare different partition choices\n",
    "for partition in ['cohort_horizon', 'cohort', 'horizon']:\n",
    "    r = ImputationDiD(aux_partition=partition).fit(\n",
    "        data, outcome='outcome', unit='unit',\n",
    "        time='period', first_treat='first_treat')\n",
    "    print(f\"aux_partition='{partition}': ATT={r.overall_att:.3f}, SE={r.overall_se:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Feature | ImputationDiD | CallawaySantAnna | SunAbraham |\n",
    "|---------|--------------|------------------|------------|\n",
    "| **Approach** | Impute Y(0) via FE model | Group-time ATT(g,t) | Saturated regression |\n",
    "| **Efficiency** | Most efficient under homogeneity | Less efficient | Least efficient |\n",
    "| **Robustness** | Requires homogeneity for efficiency | Fully robust to heterogeneity | Robust to heterogeneity |\n",
    "| **Control group** | All untreated (always) | Never-treated or not-yet-treated | Never-treated |\n",
    "| **Best for** | Homogeneous effects, maximum power | Heterogeneous effects, flexible | Robustness check |\n",
    "\n",
    "**Reference:** Borusyak, K., Jaravel, X., & Spiess, J. (2024). Revisiting Event-Study Designs: Robust and Efficient Estimation. *Review of Economic Studies*, 91(6), 3253-3285."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}