{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Stacked DiD (Wing, Freedman & Hollingsworth 2024)\n",
    "\n",
    "This tutorial demonstrates the `StackedDiD` estimator, which implements the stacked difference-in-differences method from Wing, Freedman & Hollingsworth (2024), \"Stacked Difference-in-Differences\", NBER Working Paper 32054.\n",
    "\n",
    "**When to use StackedDiD:**\n",
    "- Staggered adoption where you want a **regression-based event-study framework** — ideal for practitioners who think in OLS terms\n",
    "- When you want to **inspect the clean comparison dataset** directly — the stacked data is a first-class output\n",
    "- As a **robustness check** alongside Callaway-Sant'Anna or Imputation DiD\n",
    "\n",
    "**Topics covered:**\n",
    "1. Basic usage and overall ATT\n",
    "2. Event study estimation and visualization\n",
    "3. Inside the stacked dataset — sub-experiments, event times, and Q-weights\n",
    "4. Event window and trimming (IC1/IC2)\n",
    "5. Q-weight schemes (aggregate, population, sample share)\n",
    "6. Clean control definitions (not-yet-treated, strict, never-treated)\n",
    "7. Comparison with Callaway-Sant'Anna and Imputation DiD\n",
    "8. Advanced features: anticipation and clustering\n",
    "\n",
    "*See also: [Tutorial 02](02_staggered_did.ipynb) for Callaway-Sant'Anna and Sun-Abraham, [Tutorial 11](11_imputation_did.ipynb) for Imputation DiD, [Tutorial 12](12_two_stage_did.ipynb) for Two-Stage DiD.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "from diff_diff import (\n",
    "    StackedDiD, CallawaySantAnna, ImputationDiD,\n",
    "    generate_staggered_data, plot_event_study\n",
    ")\n",
    "\n",
    "# For nicer plots (optional)\n",
    "try:\n",
    "    import matplotlib.pyplot as plt\n",
    "    plt.style.use('seaborn-v0_8-whitegrid')\n",
    "    HAS_MATPLOTLIB = True\n",
    "except ImportError:\n",
    "    HAS_MATPLOTLIB = False\n",
    "    print(\"matplotlib not installed - visualization examples will be skipped\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage\n",
    "\n",
    "The stacked DiD estimator follows a four-step process:\n",
    "1. **Partition** the data into sub-experiments — one per adoption cohort, each with its own treated units and clean controls\n",
    "2. **Restrict** each sub-experiment to an event window defined by `kappa_pre` and `kappa_post` (which can differ)\n",
    "3. **Compute Q-weights** that correct for compositional imbalance across sub-experiments\n",
    "4. **Run a pooled WLS regression** on the weighted stacked dataset\n",
    "\n",
    "The main modeling choice is the event window size (`kappa_pre`, `kappa_post`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate staggered adoption data with known treatment effect\n",
    "data = generate_staggered_data(n_units=300, n_periods=10, treatment_effect=2.0, seed=42)\n",
    "\n",
    "# Fit the stacked DiD estimator\n",
    "est = StackedDiD(kappa_pre=2, kappa_post=2)\n",
    "results = est.fit(data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')\n",
    "results.print_summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Event Study\n",
    "\n",
    "Event study estimates effects at each relative time horizon. The reference period is `e = -1` (last pre-treatment period). Pre-treatment coefficients assess parallel trends; post-treatment coefficients capture dynamic effects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit with event study aggregation\n",
    "est = StackedDiD(kappa_pre=2, kappa_post=2)\n",
    "results_es = est.fit(data, outcome='outcome', unit='unit', time='period',\n",
    "                     first_treat='first_treat', aggregate='event_study')\n",
    "\n",
    "# Plot event study\n",
    "if HAS_MATPLOTLIB:\n",
    "    plot_event_study(results_es, title='Stacked DiD Event Study')\n",
    "else:\n",
    "    print(\"Install matplotlib to see visualizations: pip install matplotlib\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# View event study effects as a table\n",
    "results_es.to_dataframe(level='event_study')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inside the Stacked Dataset\n",
    "\n",
    "A key feature of `StackedDiD` is that the stacked dataset is a **first-class output**. You can inspect `results.stacked_data` to see exactly how the estimator constructs its comparisons.\n",
    "\n",
    "The stacked data contains four added columns:\n",
    "- `_sub_exp`: Which adoption cohort defines this sub-experiment\n",
    "- `_event_time`: Relative time to treatment (e.g., -2, -1, 0, 1, 2)\n",
    "- `_D_sa`: Treatment indicator (1 = treated unit, 0 = clean control)\n",
    "- `_Q_weight`: Corrective weight for compositional balance\n",
    "\n",
    "Each sub-experiment is a \"mini DiD\" with its own treated cohort and a set of clean controls. The same control unit can appear in multiple sub-experiments. Q-weights correct for the fact that naive stacking implicitly overweights cohorts with more controls."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect stacked data structure\n",
    "sd = results.stacked_data\n",
    "print(f\"Original data shape:  {data.shape}\")\n",
    "print(f\"Stacked data shape:   {sd.shape}\")\n",
    "print(f\"Row expansion factor: {len(sd) / len(data):.1f}x\")\n",
    "print(f\"Number of sub-experiments: {results.n_sub_experiments}\")\n",
    "print(f\"Added columns: {[c for c in sd.columns if c.startswith('_')]}\")\n",
    "print()\n",
    "\n",
    "# Rows per sub-experiment\n",
    "print(\"Rows per sub-experiment:\")\n",
    "print(sd.groupby('_sub_exp').size().to_string())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Q-weight summary by sub-experiment\n",
    "print(\"Q-Weight Summary by Sub-Experiment\")\n",
    "print(\"=\" * 56)\n",
    "print(f\"{'Sub-Exp':>8} {'Treated':>10} {'Controls':>10} {'Avg Q (ctrl)':>14}\")\n",
    "print(\"-\" * 56)\n",
    "\n",
    "for sub_exp in sorted(sd['_sub_exp'].unique()):\n",
    "    sub = sd[sd['_sub_exp'] == sub_exp]\n",
    "    treated = sub[sub['_D_sa'] == 1]\n",
    "    controls = sub[sub['_D_sa'] == 0]\n",
    "    n_treated = treated['unit'].nunique()\n",
    "    n_controls = controls['unit'].nunique()\n",
    "    avg_q = controls['_Q_weight'].mean() if len(controls) > 0 else 0.0\n",
    "    print(f\"{int(sub_exp):>8} {n_treated:>10} {n_controls:>10} {avg_q:>14.3f}\")\n",
    "\n",
    "print()\n",
    "print(\"Note: Treated units always have Q = 1. Controls get adjusted weights.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Event Window and Trimming\n",
    "\n",
    "The `kappa_pre` and `kappa_post` parameters define the event window (they can differ for asymmetric windows). Not all cohorts can be included at every window size:\n",
    "\n",
    "- **IC1 (Window fits in panel)**: The event window `[a - kappa_pre, a + kappa_post]` must fall within the panel's time range\n",
    "- **IC2 (Clean controls exist)**: At least one clean control unit must exist for the sub-experiment\n",
    "\n",
    "Cohorts that fail either condition are **trimmed**. When this happens, the estimator emits a `UserWarning` telling you which cohorts were dropped and why. You should expect to see these warnings in the next cell as we increase the window size — they are informative, not errors.\n",
    "\n",
    "**Tradeoff**: A wider window gives more pre/post periods for trend assessment and dynamic effects, but trims more cohorts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Trimming with different kappa values\n",
    "print(\"Effect of Event Window Size on Trimming\")\n",
    "print(\"=\" * 65)\n",
    "print(f\"{'Window':>10} {'Included':>12} {'Trimmed':>12} {'ATT':>10} {'SE':>10}\")\n",
    "print(\"-\" * 65)\n",
    "\n",
    "for kp, kq in [(1, 1), (2, 2), (3, 3), (4, 4)]:\n",
    "    try:\n",
    "        r = StackedDiD(kappa_pre=kp, kappa_post=kq).fit(\n",
    "            data, outcome='outcome', unit='unit', time='period', first_treat='first_treat'\n",
    "        )\n",
    "        window = f\"[{-kp}, {kq}]\"\n",
    "        incl = str(r.groups)\n",
    "        trim = str(r.trimmed_groups) if r.trimmed_groups else \"[]\"\n",
    "        print(f\"{window:>10} {incl:>12} {trim:>12} {r.overall_att:>10.3f} {r.overall_se:>10.3f}\")\n",
    "    except ValueError as e:\n",
    "        window = f\"[{-kp}, {kq}]\"\n",
    "        print(f\"{window:>10}   All cohorts trimmed - window too wide\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Reading the trimming warnings.** At `kappa=3`, you should see a warning like:\n",
    "\n",
    "> *Trimmed 1 adoption event(s) that don't satisfy inclusion criteria: [7.0]. IC1 requires event window [-3, 3] to fit within data range [0, 9]. IC2 requires clean controls to exist.*\n",
    "\n",
    "This tells you cohort 7 was dropped because its event window `[7-3, 7+3] = [4, 10]` extends past the last observed period (9). At `kappa=4`, cohort 3 is also trimmed — its window `[3-4, 3+4] = [-1, 7]` starts before the first observed period (0).\n",
    "\n",
    "**What to do when you see these warnings:**\n",
    "\n",
    "1. **Check which cohorts were lost.** Inspect `results.trimmed_groups` — if the trimmed cohorts are central to your research question, the wider window may not be appropriate.\n",
    "2. **Assess the bias-variance tradeoff.** Wider windows give you more pre-treatment periods to assess parallel trends and more post-treatment periods to capture dynamic effects — but at the cost of dropping cohorts at the panel edges. In the table above, notice how the point estimate and SE change as cohorts are trimmed.\n",
    "3. **Use an asymmetric window.** You don't need `kappa_pre == kappa_post`. If you need 3 pre-treatment periods for trend assessment but cohort 7 is trimmed because the *post*-treatment side overflows the panel, you can shorten just `kappa_post`.\n",
    "\n",
    "Let's walk through option 3. The symmetric window `[-3, 3]` trimmed cohort 7 because `7 + 3 = 10` exceeds the last period (9). If we keep 3 pre-treatment periods but reduce `kappa_post` to 2, the window becomes `[7-3, 7+2] = [4, 9]` — which fits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Asymmetric window recovers cohort 7 — no warning this time\nr_asym = StackedDiD(kappa_pre=3, kappa_post=2).fit(\n    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')\nprint(f\"Asymmetric [-3, 2]: groups={r_asym.groups}, trimmed={r_asym.trimmed_groups}\")\nprint(f\"\\nAll 3 cohorts included. ATT={r_asym.overall_att:.3f} (SE={r_asym.overall_se:.3f})\")"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Q-Weight Schemes\n",
    "\n",
    "Wing et al. (2024, Table 1) define three target estimands, each with a different Q-weight formula:\n",
    "\n",
    "- **`\"aggregate\"`** (default): Weight by treated cohort size (N_a^D / N_Ω^D) — the trimmed aggregate ATT. (For unbalanced panels, weights are computed at the observation level per (event_time, sub_exp), which reduces to cohort-size weighting when panels are balanced.)\n",
    "- **`\"population\"`**: Weight by population size of treated cohort (requires a `population` column)\n",
    "- **`\"sample_share\"`**: Weight by sample share of each sub-experiment\n",
    "\n",
    "The choice depends on whether you want cohorts weighted by their treated unit count (`aggregate`), by an external population measure (`population`), or by their share of the stacked sample (`sample_share`)."
   ]
  },
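  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of how the default `\"aggregate\"` scheme turns cohort shares into row weights, the corrective weight described in Wing et al. (2024) can be written as below. The control-side counts $N_a^C$ and $N_\\Omega^C$ are not named elsewhere in this tutorial and are introduced here only for illustration; the library's internal computation may differ in detail, especially for unbalanced panels.\n",
    "\n",
    "$$\n",
    "Q_{sa} = \\begin{cases} 1 & \\text{if } D_{sa} = 1 \\\\ \\dfrac{N_a^D / N_\\Omega^D}{N_a^C / N_\\Omega^C} & \\text{if } D_{sa} = 0 \\end{cases}\n",
    "$$\n",
    "\n",
    "Here $N_a^D$ and $N_a^C$ count the treated and clean control units in sub-experiment $a$, and $N_\\Omega^D$ and $N_\\Omega^C$ are the corresponding totals over all retained sub-experiments (a control that appears in several sub-experiments is counted once per appearance). For example, a sub-experiment that holds 20% of the stacked treated units but 40% of the stacked control appearances would give its controls $Q = 0.2 / 0.4 = 0.5$, shrinking an over-represented control group back to the treated share."
   ]
  },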
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add a unit-level population column representing cohort size\n",
    "# (constant per unit, since Q-weight computation groups by [unit, sub_exp])\n",
    "pop_map = {3: 1000, 5: 2000, 7: 500}  # cohort-level population sizes\n",
    "data['population'] = data['first_treat'].map(pop_map).fillna(0).astype(int)\n",
    "\n",
    "# Compare Q-weight schemes\n",
    "print(\"Q-Weight Scheme Comparison\")\n",
    "print(\"=\" * 60)\n",
    "print(f\"{'Scheme':<16} {'ATT':>10} {'SE':>10} {'CI Width':>12}\")\n",
    "print(\"-\" * 60)\n",
    "\n",
    "for scheme in ['aggregate', 'population', 'sample_share']:\n",
    "    kwargs = {'kappa_pre': 2, 'kappa_post': 2, 'weighting': scheme}\n",
    "    fit_kwargs = dict(outcome='outcome', unit='unit', time='period', first_treat='first_treat')\n",
    "    if scheme == 'population':\n",
    "        fit_kwargs['population'] = 'population'\n",
    "    r = StackedDiD(**kwargs).fit(data, **fit_kwargs)\n",
    "    ci_width = r.overall_conf_int[1] - r.overall_conf_int[0]\n",
    "    print(f\"{scheme:<16} {r.overall_att:>10.3f} {r.overall_se:>10.3f} {ci_width:>12.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clean Control Definitions\n",
    "\n",
    "The `clean_control` parameter determines which units serve as controls in each sub-experiment:\n",
    "\n",
    "- **`\"not_yet_treated\"`** (default): Units adopted after `a + kappa_post` — most inclusive, maximizes statistical power\n",
    "- **`\"strict\"`**: Units adopted after `a + kappa_post + kappa_pre` — more conservative, excludes units treated during the window\n",
    "- **`\"never_treated\"`**: Only units with `first_treat = inf` — most restrictive, strongest identification\n",
    "\n",
    "More restrictive definitions yield fewer controls and wider standard errors, but provide stronger causal identification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare clean control definitions\n",
    "print(\"Clean Control Definition Comparison\")\n",
    "print(\"=\" * 70)\n",
    "print(f\"{'Definition':<18} {'ATT':>8} {'SE':>8} {'Ctrl Units':>12} {'Cohorts':>10}\")\n",
    "print(\"-\" * 70)\n",
    "\n",
    "for cc in ['not_yet_treated', 'strict', 'never_treated']:\n",
    "    r = StackedDiD(kappa_pre=2, kappa_post=2, clean_control=cc).fit(\n",
    "        data, outcome='outcome', unit='unit', time='period', first_treat='first_treat'\n",
    "    )\n",
    "    print(f\"{cc:<18} {r.overall_att:>8.3f} {r.overall_se:>8.3f} {r.n_control_units:>12} {len(r.groups):>10}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparison with Other Estimators\n",
    "\n",
    "StackedDiD, CallawaySantAnna, and ImputationDiD all address TWFE bias in staggered settings, but via different approaches:\n",
    "- **StackedDiD**: Constructs sub-experiments, applies Q-weights, runs pooled WLS\n",
    "- **CallawaySantAnna**: Computes group-time ATT(g,t) effects, then aggregates\n",
    "- **ImputationDiD**: Imputes counterfactual Y(0) via fixed effect model\n",
    "\n",
    "Under homogeneous treatment effects, all three should produce similar point estimates. Disagreement flags potential treatment effect heterogeneity. Agreement across estimators strengthens causal claims."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit all three estimators on the same data\n",
    "sd_r = StackedDiD(kappa_pre=2, kappa_post=2).fit(\n",
    "    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')\n",
    "cs_r = CallawaySantAnna().fit(\n",
    "    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')\n",
    "imp_r = ImputationDiD().fit(\n",
    "    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')\n",
    "\n",
    "print(\"Estimator Comparison (True effect = 2.0)\")\n",
    "print(\"=\" * 55)\n",
    "print(f\"{'Estimator':<25} {'ATT':>8} {'SE':>8} {'CI Width':>10}\")\n",
    "print(\"-\" * 55)\n",
    "\n",
    "for name, r in [(\"StackedDiD\", sd_r), (\"CallawaySantAnna\", cs_r), (\"ImputationDiD\", imp_r)]:\n",
    "    ci_width = r.overall_conf_int[1] - r.overall_conf_int[0]\n",
    "    print(f\"{name:<25} {r.overall_att:>8.3f} {r.overall_se:>8.3f} {ci_width:>10.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Features\n",
    "\n",
    "### Anticipation\n",
    "\n",
    "If treatment effects begin before the official treatment date (e.g., firms change behavior in anticipation of a policy), use the `anticipation` parameter. Setting `anticipation=k` shifts the reference period from `e = -1` to `e = -1 - k`, classifying periods `e >= -k` as post-treatment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare ATT with and without anticipation\n",
    "est_no_antic = StackedDiD(kappa_pre=2, kappa_post=2)\n",
    "results_no_antic = est_no_antic.fit(\n",
    "    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')\n",
    "\n",
    "est_antic = StackedDiD(kappa_pre=2, kappa_post=2, anticipation=1)\n",
    "results_antic = est_antic.fit(\n",
    "    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')\n",
    "\n",
    "print(f\"ATT (no anticipation):       {results_no_antic.overall_att:.3f}\")\n",
    "print(f\"ATT (1-period anticipation): {results_antic.overall_att:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clustering\n",
    "\n",
    "Standard errors can be clustered at two levels:\n",
    "- **`cluster='unit'`** (default): Conservative — accounts for the fact that the same unit appears across multiple sub-experiments\n",
    "- **`cluster='unit_subexp'`**: Treats each sub-experiment appearance as independent — narrower SEs, but assumes independence across sub-experiments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Clustering comparison\n",
    "r_unit = StackedDiD(kappa_pre=2, kappa_post=2, cluster='unit').fit(\n",
    "    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')\n",
    "r_subexp = StackedDiD(kappa_pre=2, kappa_post=2, cluster='unit_subexp').fit(\n",
    "    data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')\n",
    "\n",
    "print(\"Clustering Comparison\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"{'Cluster Level':<20} {'ATT':>10} {'SE':>10}\")\n",
    "print(\"-\" * 50)\n",
    "print(f\"{'unit':<20} {r_unit.overall_att:>10.3f} {r_unit.overall_se:>10.3f}\")\n",
    "print(f\"{'unit_subexp':<20} {r_subexp.overall_att:>10.3f} {r_subexp.overall_se:>10.3f}\")\n",
    "print()\n",
    "print(\"Point estimates are identical; SEs differ due to clustering level.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Feature | StackedDiD | CallawaySantAnna | ImputationDiD |\n",
    "|---------|-----------|------------------|---------------|\n",
    "| **Approach** | Stack sub-experiments, pooled WLS | Group-time ATT(g,t) aggregation | Impute Y(0) via FE model |\n",
    "| **Framework** | Regression (event-study) | Nonparametric | Regression (imputation) |\n",
    "| **Event window** | Explicit (kappa_pre, kappa_post) | Implicit (all periods) | Implicit (all periods) |\n",
    "| **Group effects** | No (pooled regression) | Yes | Yes |\n",
    "| **Control group** | Configurable (3 options) | Never-treated or not-yet-treated | All untreated |\n",
    "| **Inspectable data** | Yes (stacked_data) | No | Yes (treatment_effects) |\n",
    "| **Best for** | Regression intuition, transparency | Heterogeneous effects | Maximum efficiency |\n",
    "\n",
    "**Reference:** Wing, C., Freedman, S. M., & Hollingsworth, A. (2024). Stacked Difference-in-Differences. NBER Working Paper 32054."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}