class: center, middle, inverse, title-slide .title[ # Module 8: Beyond the A/B Test ] .subtitle[ ## Modern DiD, Synthetic Control, and Causal Forests ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .small { font-size: 80%; } .tiny { font-size: 65%; } .highlight-box { background: #fff3e0; border-left: 4px solid #e65100; padding: 0.5em 1em; margin: 0.5em 0; } .blue-box { background: #e3f2fd; border-left: 4px solid #1565c0; padding: 0.5em 1em; margin: 0.5em 0; } .nav-btn { position: absolute; bottom: 12px; left: 40px; font-size: 11px; background: #e8eaf6; padding: 2px 8px; border-radius: 3px; z-index: 100; text-decoration: none; color: #1a237e; } .nav-btn:hover { background: #c5cae9; } .nav-btn-br { position: absolute; bottom: 12px; right: 70px; font-size: 11px; background: #e8eaf6; padding: 2px 8px; border-radius: 3px; z-index: 100; text-decoration: none; color: #1a237e; } .nav-btn-br:hover { background: #c5cae9; } .inline-btn { font-size: 11px; background: #e8eaf6; padding: 2px 8px; border-radius: 3px; text-decoration: none; color: #1a237e; margin-right: 6px; vertical-align: middle; } .inline-btn:hover { background: #c5cae9; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">The Experimental Ideal</a></td><td>✓ done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">SUTVA and When It Breaks</a></td><td>✓ done</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Designing Around Interference</a></td><td>✓ done</td></tr> <tr><td>4</td><td>Power and Sample Size</td><td>upcoming</td></tr> <tr><td>5</td><td><a href="../module-05/slides.html">Analyzing Experiments</a></td><td>✓ done</td></tr> <tr><td>6</td><td>Multiple Testing & Subgroups</td><td>upcoming</td></tr> <tr><td>7</td><td><a href="../module-07/slides.html">External Validity</a></td><td>✓ done</td></tr> 
<tr><td><b>8</b></td><td><b>Beyond the A/B Test</b> <i>(you are here)</i></td><td>current</td></tr> </table> --- # When Randomization Isn't Enough Three settings where the clean experiment from M1–M5 isn't available: .blue-box[ **1. Staggered rollouts** — Treatment turns on city-by-city over months. There's no clean "post" period that's the same for everyone. Naive two-way fixed effects (TWFE) is **biased** when effects are heterogeneous over time. **2. One treated unit** — A policy hits a single city. No control group to randomize against. *Synthetic control* builds a counterfactual from a weighted donor pool. **3. Heterogeneity that matters for policy** — Knowing the average effect isn't enough; you need to know *who responds*. *Causal forests* estimate treatment effects as a function of covariates. ] -- This module: the modern toolkit for each, with proofs and DGP code in backup slides. --- class: center, middle, inverse # Part 1 — Modern Difference-in-Differences --- name: classic-22-did # Classic 2×2 DiD: Two Groups, Two Periods One city gets the zone-notification feature at `\(t=2\)`; one doesn't. Driver weekly earnings before and after: <img src="slides_files/figure-html/did-22-data-1.png" style="display: block; margin: auto;" /> `$$\widehat{\text{ATT}} = (\bar y_{T,\text{post}} - \bar y_{T,\text{pre}}) - (\bar y_{C,\text{post}} - \bar y_{C,\text{pre}}) = 70 - 20 = 50$$` --- name: parallel-trends-main # Parallel Trends — The Identifying Assumption The DiD estimator equals the ATT *only if* the treated and control groups would have followed the same trend, on average, in the absence of treatment. .highlight-box[ **Parallel trends:** `\(\;\;E[Y_t(0) - Y_{t-1}(0) \mid \text{treated}] = E[Y_t(0) - Y_{t-1}(0) \mid \text{control}]\)` It's an assumption about *counterfactuals* — fundamentally untestable. We can only test pretrends as a proxy. ] -- **The pretrend test:** check whether the two groups moved in parallel *before* treatment. Failing this is bad news. 
**Passing it is weaker evidence than people think** — Roth (2022) shows pretrend tests have low power against the most-worrying violations.

<a href="#pretrend-power-detail" class="nav-btn">Roth pretrend critique</a>

---
name: staggered-main

# Staggered Adoption — What Changes

Real rollouts don't happen all at once. Cities adopt zone-notification at different times.

<img src="slides_files/figure-html/staggered-dgp-1.png" style="display: block; margin: auto;" />

Three cohorts (three cities each) adopt at `\(t=8, 12, 16\)`; three cities never adopt. The effect grows over time and is **larger for earlier-adopting cohorts**.

<a href="#staggered-dgp-detail" class="nav-btn-br">DGP code</a>

---
name: twfe-main

# The TWFE Estimator — Looks Innocent, Isn't

The "default" DiD with staggered timing:

`$$Y_{it} = \alpha_i + \lambda_t + \beta\, D_{it} + \varepsilon_{it}$$`

```r
library(fixest)  # provides feols()

# Two-way fixed effects: city and period fixed effects, one treatment dummy
fit_twfe <- feols(y ~ treated | city + t, data = panel)
coef(fit_twfe)["treatedTRUE"]
```

```
## treatedTRUE
##    0.822057
```

--

The true effects are positive and grow over time. The TWFE coefficient can be badly biased (sometimes with the *wrong sign*) when:

- effects are heterogeneous *across cohorts*, or
- effects grow/shrink *within a cohort over time*.

--

**Goodman-Bacon (2021):** `\(\hat\beta_{\text{TWFE}}\)` is a weighted average of all possible 2×2 DiDs in the data. The weights on the 2×2s are positive, but some 2×2s use earlier-treated units as "controls" for later-treated ones, which subtracts a treated trend. Written in terms of the underlying cohort-period effects, that is equivalent to some effects entering with **negative** weight (de Chaisemartin & D'Haultfœuille 2020).

<a href="#goodman-bacon-derivation" class="nav-btn">decomposition</a>

---
name: bacon-viz-main

# Visualizing the Bacon Decomposition

`\(\hat\beta_{\text{TWFE}}\)` is a **weighted average of every 2×2 DiD** in the data — each cohort vs every other cohort, plus each vs never-treated.
With cohorts `\(g \in \{8, 12, 16, \infty\}\)`, that's **9 distinct 2×2 estimates**:

<img src="slides_files/figure-html/bacon-viz-1.png" style="display: block; margin: auto;" />

The **red 2×2s** use an already-treated cohort as control — they difference out a treated trend, not a counterfactual. TWFE (dotted) averages them in anyway, landing far below truth (dashed). <a href="#goodman-bacon-derivation" class="inline-btn">why</a>

---
name: modern-toolkit

# The Modern Toolkit: One Slide

All four heterogeneity-robust estimators solve the same problem the same way: use only **clean** comparisons (treated vs not-yet-treated or never-treated):

.small[

| Estimator | Year | Core idea | R package |
|---|---|---|---|
| **Callaway & Sant'Anna** | 2021 | Group-time ATT for each cohort g and time t, then aggregate | `did` |
| **Sun & Abraham** | 2021 | Saturated event-study with cohort×event-time interactions | `fixest` (`sunab()`) |
| **de Chaisemartin & D'Haultfœuille** | 2020 | Switchers vs not-yet-switchers | `DIDmultiplegt` |
| **Borusyak, Jaravel, Spiess** | 2024 | Impute the untreated outcome from untreated (never- and not-yet-treated) observations, then average treated-minus-imputed differences | `didimputation` |

]

.highlight-box[
All four agree in simple cases. Pick one, report the event-study plot, and check robustness with another. **Callaway & Sant'Anna is the most-cited workhorse.**
]

---
name: cs-main

# Callaway-Sant'Anna in Practice

.small[
**Estimand: group-time ATT** `\(\text{ATT}(g, t)\)` — effect on units first treated at `\(g\)`, evaluated at `\(t\)`. Two indices because TWFE collapsed this 2D structure into one biased number.

**Construction.** For each cohort `\(g\)` and post-period `\(t \geq g\)`, a clean 2×2 DiD vs the never-treated, anchored at `\(g-1\)`:

`$$\text{ATT}(g, t) = \big[\bar Y_{g,\,t} - \bar Y_{g,\,g-1}\big] - \big[\bar Y_{\text{nev},\,t} - \bar Y_{\text{nev},\,g-1}\big]$$`

*Both baseline terms use `\(g-1\)` (cohort `\(g\)`'s last pre-treatment period), not `\(t-1\)`* — the **universal base period** version of CS.
Same baseline for both groups → the formula is a direct parallel-trends test: "did the gap between cohort-`\(g\)` and never-treated change from `\(g{-}1\)` to `\(t\)`?" A separate "varying base period" version with `\(t-1\)` exists but drifts the anchor as `\(t\)` moves.

Plot against *event time* `\(e = t - g\)` to align cohorts at `\(e=0\)`. No negative weights — never-treated is uncontaminated.
]

<img src="slides_files/figure-html/cs-handcoded-1.png" style="display: block; margin: auto;" />

.small[
Each cohort's ATT grows post-treatment — the dynamic heterogeneity TWFE smeared. Aggregate over `\(g\)` at each `\(e\)` for an event study, or over both for an overall ATT. <a href="#cs-code" class="inline-btn">code</a>
]

---
name: honest-did-main

# Honest DiD — Bounds Under Partial PT Violations

Pretrend tests have low power. What if parallel trends is "almost but not quite" right?

.blue-box[
**Rambachan & Roth (2023):** Don't pretend PT holds exactly. Posit a bound on how much the post-treatment differential trend can deviate from the pre-trend, then report a *robust confidence interval*.
]

--

Two common restrictions:

- **Smoothness, parameter `\(M\)`:** the slope of the differential trend can change by at most `\(M\)` between consecutive periods. `\(M = 0\)` extrapolates the pre-trend linearly.
- **Relative magnitudes, parameter `\(\bar M\)`:** the post-treatment violation can be at most `\(\bar M\)` times the largest pre-treatment violation. `\(\bar M = 1\)` says it cannot exceed the worst pre-period violation.

The output is a *sensitivity plot*: the CI grows as you allow more violation. If your conclusion survives plausible values of `\(M\)` (or `\(\bar M\)`), you're robust.

<a href="#honest-did-detail" class="nav-btn">visualization</a>

---
name: staggered-application

# Application: Staggered Zone-Notification Rollout

Three city cohorts adopt zone-notification at `\(t=8, 12, 16\)`. The true effect is positive and **larger for earlier adopters** (these cities had more underserved zones). What does each estimator say?
.small[

|Estimator                 | Estimate|
|:-------------------------|--------:|
|Truth                     |    1.425|
|TWFE (biased)             |    0.822|
|Sun-Abraham               |    1.405|
|Callaway-Sant'Anna (hand) |    1.405|

]

TWFE is downward-biased — early cohorts (with the largest effects) get re-used as controls for later cohorts, dragging the estimate down. CS and SA recover the truth up to sampling noise.

---
name: dd-interview-questions

# Sample Interview Questions: Staggered DiD

.small[
Realistic 45-min round questions on Goodman-Bacon + Callaway-Sant'Anna / Sun-Abraham. The right column gives what's being tested and a one-paragraph answer skeleton.

| Question | What it tests / answer skeleton |
|---|---|
| **1.** *"You ran TWFE on a staggered rollout of a new pricing algorithm in 30 cities (2022–24). Coefficient is +1.2%. A colleague says you can't trust it. Why?"* | Goodman-Bacon (2021) decomposition: TWFE is a weighted avg of all possible 2×2 DiDs, including ones that use *early-treated as the control* for late-treated. Under treatment-effect heterogeneity those comparisons subtract a treated trend, so some cohort-period effects effectively enter with *negative* weight (de Chaisemartin & D'Haultfœuille 2020) → bias of unknown sign. Ask to see the Bacon decomp + the weight on late-vs-early comparisons. |
| **2.** *"When IS TWFE a valid estimator for staggered designs?"* | Iff (a) treatment effects are homogeneous across cohorts AND across event time, AND (b) parallel trends holds. Marketplace cases routinely fail (a) — novelty effects, learning curves — so default to Callaway-Sant'Anna, Sun-Abraham, dCDH, or BJS. |
| **3.** *"Walk me through Callaway-Sant'Anna at a high level."* | Estimate group-time `\(\text{ATT}(g, t)\)` as a clean 2×2 DiD between cohort `\(g\)` and the never-treated, anchored at the pre-period `\(g-1\)`. Aggregate to taste: average over `\(g\)` at each event time `\(e = t - g\)` (event-study) or over both for an overall ATT. No negative weights because the control group is uncontaminated. |
| **4.** *"Pre-trends look noisy but not flat. What now?"* | Don't fail-to-reject and move on.
Run HonestDiD (Rambachan-Roth 2023): bound post-period bias under credible parallel-trends violations; report the smoothness parameter `\(M\)` at which the CI just covers zero. If `\(M\)` is implausibly large, the result is robust. | | **5.** *"Drivers from treated cities migrate to nearby untreated cities. What breaks?"* | Parallel trends becomes a *no-cross-unit-spillover* statement. Control trends bend down → DiD **overstates** TTE. See [marketplace-matrix](#marketplace-matrix). Mitigate with distance-based donor exclusion, HonestDiD with a spillover-driven `\(\bar M\)`, or switch to switchback if randomization is feasible. | ] --- class: center, middle, inverse # Part 2 — Synthetic Control --- name: sc-one-treated # When You Have *One* Treated Unit A single city introduces a policy (e.g., a minimum-wage floor for drivers). No randomization, no comparable control city. **Build one from a donor pool.** <img src="slides_files/figure-html/sc-data-1.png" style="display: block; margin: auto;" /> No single donor matches. A **convex combination** might. --- name: sc-est-main # Synthetic Control: The Estimator Pick weights `\(w_j \ge 0\)` with `\(\sum_j w_j = 1\)` to match the treated unit's *pre-treatment* outcomes — a **convex combination** of donors: `$$\min_{w} \sum_{t < T_0} \!\left( Y_{1t} - \sum_{j \ge 2} w_j Y_{jt} \right)^{\!2} \;\;\text{s.t.}\;\; w_j \ge 0,\;\; \sum_j w_j = 1$$` <img src="slides_files/figure-html/sc-counterfactual-plot-1.png" style="display: block; margin: auto;" /> The blue dashed line is the synthetic counterfactual: `\(\sum_j \hat w_j Y_{jt}\)` where the `\(\hat w\)` above are 0.33, 0.18, 0.28, 0.14, 0.03, 0.04. <a href="#sc-code" class="inline-btn">code</a> --- name: sc-placebos # Synthetic Control: Inference via Placebos Standard errors don't work — we have one treated unit. Instead: **placebo-in-space**. Re-run the procedure pretending each *donor* is the treated unit. 
If the true treated unit's gap is unusually large vs the placebo distribution, that's evidence of an effect.

<img src="slides_files/figure-html/sc-placebos-1.png" style="display: block; margin: auto;" />

The treated gap (red) drops below the placebo cloud after `\(t=21\)`, a pattern none of the placebo runs reproduces, which is evidence of a real effect.

---
name: sdid-main

# Synthetic DiD: The Bridge

Arkhangelsky et al. (2021) generalize both DiD and SC. **Synthetic DiD** uses *two* sets of weights:

.blue-box[
- **Unit weights** `\(\hat\omega_i\)` work like SC: match pre-treatment trajectories of donors to the treated unit.
- **Time weights** `\(\hat\lambda_t\)` are fit on the donors: they weight pre-treatment periods so that the donors' weighted pre-period outcomes match their own post-treatment level (up to a constant), down-weighting periods that don't look like "now".
]

.small[
`$$\widehat{\tau}_{\text{SDID}} = \arg\min_{\tau, \mu, \alpha, \beta} \sum_{i,t} (Y_{it} - \mu - \alpha_i - \beta_t - \tau D_{it})^2 \, \hat\omega_i \, \hat\lambda_t$$`
]

--

Why this matters:

- **DiD** uses uniform weights → biased when donors don't match.
- **SC** uses unit weights only → can over-fit pre-period noise.
- **SDID** uses both → more robust to both. In Arkhangelsky et al.'s empirical comparisons, it has lower MSE than either alone.

<a href="#sdid-detail" class="nav-btn">SDID estimator detail</a>

---
name: sc-interview-questions

# Sample Interview Questions: Synthetic Control

.small[
Realistic 45-min round questions on SC and SDiD. The right column gives what's tested and a one-paragraph answer skeleton.

| Question | What it tests / answer skeleton |
|---|---|
| **1.** *"Walk me through donor pool selection for an SC analysis of a single-city policy change."* | Match on pre-treatment outcomes (the load-bearing assumption) and on stable covariates (city size, market structure). Drop donors that experienced their own concurrent shocks; impose a geographic buffer to limit spillover into the donor pool. Report sensitivity to donor exclusion. |
| **2.** *"Pre-treatment fit is bad — RMSPE is bigger than the pre-period outcome SD.
What now?"* | Don't trust the SC. The synthetic counterfactual isn't tracking the treated unit's pre-trend, so post-period gaps could be noise rather than treatment. Options: expand the donor pool, lengthen the pre-period, switch to SDiD (which allows an additive intercept shift and doesn't need perfect pre-fit), or move to a different design entirely. <a href="#sc-pre-fit" class="inline-btn">RMSPE detail</a> | | **3.** *"How do you do inference for SC when there's one treated unit?"* | **Placebo-in-space** (Abadie-Diamond-Hainmueller): assign treatment to each donor in turn, re-estimate, and compute the post/pre RMSPE ratio. The empirical `\(p\)`-value is the rank of the actual treated unit's ratio in the placebo distribution. Sanity-check with placebo-in-time (fake treatment date in the pre-period). | | **4.** *"Why use SDiD over plain SC?"* | SDiD combines unit weights (SC-style) AND time weights (DiD-style), and allows an additive shift. Result: doesn't require near-perfect pre-fit, lower MSE in Arkhangelsky et al.'s empirical comparisons, and naturally extends to multiple treated units. Default to SDiD when you have a moderate-size donor pool and pre-fit is imperfect. | | **5.** *"You're estimating a fare-structure change in Austin via SC with Houston, Dallas, San Antonio in the donor pool. Drivers near Austin might cross-shop into those cities. What's the bias?"* | Donor pool contamination: treated city's expanded supply pulls drivers from nearby donors → donor trends adjust → synthetic Austin tracks the post-treatment Austin too closely → SC **understates** TTE. See [sc-interference-detail](#sc-interference-detail). Mitigations: geographic buffer, augmented SC (Ben-Michael-Feller-Rothstein 2021), explicit cross-city supply-elasticity model. | ] --- class: center, middle, inverse # Part 3 — Causal Forest --- name: why-hte # Why HTE — Beyond the ATE ATE answers: "should we ship?" HTE answers: "**to whom**?" 
.highlight-box[
The zone-notification feature has `\(\widehat{\text{ATE}} = +\$50\)`/wk. But the effect probably varies — by city density, driver tenure, time-of-week patterns. **Knowing the heterogeneity lets you target rollout, set personalized policies, and forecast aggregate impact under different deployment plans.**
]

--

Classic approach: pre-specify subgroups, run interactions. Two problems:

- **Multiple testing** — the subject of M6.
- **Misspecification** — the right subgroups aren't always the obvious ones.

**Causal forest** (Wager & Athey 2018, Athey-Tibshirani-Wager 2019): non-parametric estimation of `\(\tau(x) = E[Y(1) - Y(0) \mid X = x]\)` using an ensemble of *honest* trees.

---
name: cf-how-main

# Causal Forest — How It Works

.pull-left[
**Honest splitting** is the key trick. Each tree:

1. Splits its subsample into two halves.
2. Uses one half to *grow* the tree (decide where to split).
3. Uses the other half to *estimate* the treatment effect within each leaf.

This separation prevents the same observation from informing both the split rule and the estimate inside it, removing the over-fitting bias that plagues vanilla trees.
]

.pull-right[
**Forest-level prediction** for a new `\(x\)`:

- Each tree drops `\(x\)` down to a leaf.
- The leaf defines a *local neighborhood* — observations with similar covariates.
- `\(\hat\tau(x)\)` = a *weighted* treated-vs-control comparison across that neighborhood (in `grf`, a residual-on-residual regression), with weights coming from how often training points share a leaf with `\(x\)`.

The result has pointwise confidence intervals — Wager & Athey prove asymptotic normality.
]

<a href="#cf-splitting-detail" class="nav-btn">splitting algorithm</a>

---
name: cf-practice-main

# Causal Forest in Practice — `grf`

.small[
```r
library(grf)

# X: covariates, Y: outcome, W: 0/1 treatment indicator
cf <- causal_forest(X = as.matrix(X), Y = Y, W = W, num.trees = 1000)
average_treatment_effect(cf)   # ATE estimate + SE
```

```
## estimate  std.err
## 42.94805  2.67375
```
]

<img src="slides_files/figure-html/cf-plot-1.png" style="display: block; margin: auto;" />

Predicted `\(\hat\tau(x)\)` rises with density; the forest tracks the truth (dashed). <a href="#cf-dgp" class="inline-btn">DGP code</a>

---
name: policy-learning-teaser

# Policy Learning — A Quick Teaser

Once you have `\(\hat\tau(x)\)`, you can learn a **treatment rule** `\(\pi(x) \in \{0, 1\}\)` that maximizes welfare:

`$$\pi^*(x) = \mathbb{1}\{\hat\tau(x) > c(x)\}$$`

where `\(c(x)\)` is a per-unit cost (e.g., the cost of pushing a notification).

.blue-box[
- **`policytree::policy_tree`** (the companion package to `grf`) — fit a shallow decision tree over the covariates that maximizes estimated welfare. Output: an interpretable rule like *"deploy in dense cities to drivers with <2 years tenure."*
- **Athey & Wager (2021)** prove these are **near-optimal policies**: the learned rule's welfare gets within a vanishing-with-sample-size gap of the best rule *in the policy class*. So the rule isn't just interpretable; it's provably close to the best rule of that form you could have chosen if you knew the truth.
]

This is what gets used to *deploy* the model. ATE answers "ship?". HTE answers "to whom?". Policy learning answers "**what should we actually do?**"

---
class: center, middle, inverse

# Part 4 — One-Pagers

---
name: bandits-one-pager

# Bandits — Adaptive Experimentation in One Slide

.small[
**Setup:** K arms (e.g., 4 ad creatives). At each step, pick one, observe reward. Goal: maximize cumulative reward, not just identify the best arm.

**Thompson Sampling** is the workhorse: maintain a posterior over each arm's reward, sample from it, play the argmax. Asymptotically optimal regret; minimal tuning.
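
A minimal Beta-Bernoulli sketch of the Thompson Sampling loop just described, in base R. The reward rates in `p_true`, the horizon `n_steps`, and the uniform Beta(1, 1) priors are illustrative assumptions, not the settings behind the figure on this slide.

```r
# Thompson Sampling for K Bernoulli arms (hypothetical rates, not the deck's DGP)
set.seed(1)
p_true  <- c(0.10, 0.20, 0.30, 0.50)   # assumed reward rate of each arm
K       <- length(p_true)
n_steps <- 2000

wins   <- rep(0, K)   # successes observed per arm
losses <- rep(0, K)   # failures observed per arm
pulls  <- rep(0, K)   # number of times each arm was played

for (s in seq_len(n_steps)) {
  # One draw from each arm's Beta(1 + wins, 1 + losses) posterior
  theta <- rbeta(K, 1 + wins, 1 + losses)
  arm   <- which.max(theta)             # play the arm with the largest draw
  r     <- rbinom(1, 1, p_true[arm])    # observe a 0/1 reward
  wins[arm]   <- wins[arm] + r
  losses[arm] <- losses[arm] + (1 - r)
  pulls[arm]  <- pulls[arm] + 1
}

pulls / n_steps   # share of traffic each arm received
```

Traffic concentrates on the best arm as the posteriors sharpen, so arm-level sample sizes depend on past outcomes and the raw arm means are not comparable the way they are in a fixed-split A/B test.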
]

<img src="slides_files/figure-html/bandit-quick-1.png" style="display: block; margin: auto;" />

.small[
**When *not* to use:** when you need an unbiased ATE for a fixed-population decision (a launch / kill call). Bandits make assignment correlated with outcome history — naive analysis is biased.
]

---
name: seq-main

# Sequential Testing in One Slide

.small[
**Problem:** classical p-values are valid only at one fixed sample size. Peeking inflates Type-I error: with repeated looks, the running p-value dips below 0.05 at some point in roughly 27% of experiments *under the null* (simulation output below).

**Two fixes used in industry:**

- **Group-sequential / O'Brien-Fleming:** spend the alpha budget on a pre-specified schedule of looks. Conservative early, normal late.
- **Always-valid CIs / mSPRT:** mixture sequential probability ratio test. Confidence sequences that hold *uniformly over time*. Stop whenever you want; CIs stay valid.
]

```
## [1] 0.265
```

.small[
Under the null, naive peeking yields a ≈27% rejection rate, not 5%. mSPRT or OBF would hold this near 5%.
]

<a href="#mSPRT-detail" class="nav-btn">mSPRT formula</a>

---
class: center, middle, inverse

# Part 5 — Wrap-Up

---
name: decision-tree

# Decision Tree: Which Method When

.small[
**Got randomization?** → Use the M5 toolkit (regression adjustment, CUPED, ITT/LATE).

**Staggered rollout, multiple cities, multiple time periods?**

- Heterogeneous effects expected? → **Callaway-Sant'Anna** (or Sun-Abraham). Avoid TWFE.
- Worried about parallel trends? → **Honest DiD** sensitivity bounds.

**Single treated unit?** → **Synthetic control** (or **SDID** if you have a moderate-size donor pool).

**Want effect heterogeneity?** → **Causal forest** for treatment effects as a function of covariates; **policy tree** for an interpretable deployment rule.

**Many arms, online learning?** → **Thompson Sampling**, *unless* you need a clean ATE.

**Repeated looks at one experiment?** → **mSPRT / always-valid CIs**.
] --- name: interview-cheat-sheet # Interview Cheat Sheet .blue-box[ **"Walk me through how you'd analyze a staggered rollout."** Three steps. (1) Plot raw outcomes by cohort and event time — does parallel trends look plausible? (2) Don't run TWFE; run Callaway-Sant'Anna and report the event-study plot. (3) Sensitivity-check with Honest DiD and a robustness column from Sun-Abraham. ] .blue-box[ **"How do you estimate a causal effect for one city?"** Synthetic control. Build weights from a donor pool to match pre-treatment outcomes. Inference via placebo-in-space. Mention SDID as the modern improvement. ] .blue-box[ **"How do you get heterogeneous treatment effects?"** Causal forest from `grf`. Honest splitting prevents over-fit. Aggregate to ATE for sanity-check; predict per-unit treatment effects from the covariates; feed into a policy tree if a deployment rule is needed. ] .highlight-box[ **Red flag in interviews:** running TWFE on staggered data without a Goodman-Bacon diagnostic. ] --- name: marketplace-matrix # Marketplace Pros/Cons: AB / Switchback / DiD / SC .small[ Each method evaluated through the **driver-rider network** lens (M2). "Failure mode" gives the *sign* of the bias and the mechanism. 
| Method | Marketplace pro | Marketplace con | Concrete failure mode | |---|---|---|---| | **Individual A/B** | Cheap; clean identification *if* SUTVA held | Both arms share the same exposure profile → biased blend of DE and SE | Naive 50/50 recovers DE, *not* TTE (see [M2 worked example](../module-02/slides.html#estimands-formal-2)) — [more](#ab-interference-detail) | | **Switchback** | Treat-all vs treat-none periods → targets **TTE directly** | Carryover; demands homogeneous time effects | Driver positioning persists across switches; OFF period isn't truly `\(\mathbf{0}\)` → bias toward zero — [more](#switchback-interference-detail) | | **DiD (CS / staggered)** | Uses observational rollout data; matches how features actually launch | Parallel trends ≈ "no cross-unit spillover from treated to controls" — strong in marketplaces | Drivers reposition across geo borders; control trend bends down → DiD **overstates** TTE — [more](#did-interference-detail) | | **Synthetic control** | Targets one-treated-city rollout cleanly | Donor pool assumed unaffected by treatment | Treated city pulls supply from donor neighbors; synthetic counterfactual bends *toward* treated trend → SC **understates** TTE — [more](#sc-interference-detail) | **Pattern.** Methods anchored on *control units* (DiD, SC) bias when treatment leaks into the control set. Methods comparing *exposure profiles directly* (switchback, full-saturation cluster A/B) target TTE but pay other costs. ] --- # Going Deeper This module is a tour, not a treatment. Each of the three core methods has a course's worth of material behind it. .blue-box[ **Companion course (in development):** *Causal Inference Beyond A/B Tests* — a deep dive into modern DiD (formal CS / SA / BJS estimators, full Honest DiD), synthetic control variants (augmented SC, generalized SC, matrix completion), and the Athey-Wager causal-forest stack (policy learning, contextual bandits via causal trees). 
Repo will live alongside this one at `~/Desktop/sandbox/courses/causal-inference-beyond-ab/` and link from the same landing page.
]

For the read-now references behind today's slides:

.small[
- **Goodman-Bacon (2021)** — *Difference-in-differences with variation in treatment timing*, J. Econometrics.
- **Callaway & Sant'Anna (2021)** — *Difference-in-differences with multiple time periods*, J. Econometrics.
- **Arkhangelsky et al. (2021)** — *Synthetic difference in differences*, AER.
- **Wager & Athey (2018)** — *Estimation and inference of heterogeneous treatment effects using random forests*, JASA.
- **Rambachan & Roth (2023)** — *A more credible approach to parallel trends*, ReStud.
]

---
class: center, middle, inverse

# Backup Slides

---
name: pretrend-power-detail

# Backup: The Roth (2022) Pretrend Critique

**The standard pretrend test** has low power against linear (or near-linear) violations — exactly the kind that would bias the post-treatment estimate.

<img src="slides_files/figure-html/roth-power-1.png" style="display: block; margin: auto;" />

A meaningful linear violation is missed most of the time. Roth's recommendation: use Honest DiD bounds, not the pretrend test.

<a href="#parallel-trends-main" class="nav-btn">← back</a>

---
name: staggered-dgp-detail

# Backup: Staggered-Adoption DGP

.small[
```r
library(dplyr)   # left_join(), mutate(), if_else(), n()
library(tidyr)   # expand_grid()
library(tibble)  # tibble()

n_cities <- 12; n_t <- 24

# Three cities per cohort; g = Inf marks the never-treated
cohorts <- tibble(
  city = 1:n_cities,
  g    = c(rep(8, 3), rep(12, 3), rep(16, 3), rep(Inf, 3))
)

panel <- expand_grid(city = 1:n_cities, t = 1:n_t) |>
  left_join(cohorts, by = "city") |>
  mutate(
    treated = t >= g,
    # heterogeneous, time-varying effect: bigger for earlier cohorts
    eff = if_else(treated,
                  0.6 * (1 + 0.15 * (t - g)) *
                    (1 + 0.7 * (g == 8) - 0.7 * (g == 16)),
                  0),
    y = 5 + 0.05 * t + city * 0.1 + eff + rnorm(n(), 0, 0.3)
  )
```
]

Three cohorts (three cities each) adopt at `\(t = 8, 12, 16\)`; three cities never adopt. Effect grows over time and is **larger for earlier cohorts** — the classic case where TWFE breaks.
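
As a noise-free check on this DGP, the sketch below evaluates the `eff` formula on every treated cohort-period cell and averages it. It uses base R only; `true_eff` and `cells` are illustrative names, not objects from the deck. Averaging over treated cells is one possible definition of the overall ATT (it gives long-treated cohorts more weight), and it matches the 1.425 in the Truth row of the application table.

```r
# True effect implied by the DGP, as a function of cohort g and period t
true_eff <- function(g, t) {
  0.6 * (1 + 0.15 * (t - g)) * (1 + 0.7 * (g == 8) - 0.7 * (g == 16))
}

cells <- expand.grid(g = c(8, 12, 16), t = 1:24)
cells <- cells[cells$t >= cells$g, ]      # keep post-adoption cells only
cells$eff <- true_eff(cells$g, cells$t)

# Every cohort has 3 cities, so the cell mean equals the city-period mean
round(mean(cells$eff), 3)                 # 1.425
tapply(cells$eff, cells$g, mean)          # average effect by cohort
```

The cohort averages fall from the `\(g = 8\)` cohort to the `\(g = 16\)` cohort, which is the cross-cohort heterogeneity that a single TWFE coefficient cannot represent.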
<a href="#staggered-main" class="nav-btn">← back</a> --- name: goodman-bacon-derivation # Backup: The Goodman-Bacon Decomposition For TWFE with staggered timing, the OLS coefficient `\(\hat\beta_{\text{TWFE}}\)` decomposes as: `$$\hat\beta_{\text{TWFE}} = \sum_{k} s_k \, \widehat{\text{ATT}}_k$$` where each `\(\widehat{\text{ATT}}_k\)` is a 2×2 DiD between two cohorts (or a cohort and the never-treated), and the weights `\(s_k\)` depend on: - size of each cohort, - timing of treatment, - variance of treatment exposure within the panel. .highlight-box[ **The bug:** when an *earlier-treated* cohort is used as a comparison group for a *later-treated* one, the post-period for the comparison cohort is also under treatment. So that 2×2 has the form `\((\Delta Y_{\text{treated}}) - (\Delta Y_{\text{also treated, but longer ago}})\)`, and the second piece subtracts a *treatment-effect trend*, not a counterfactual. If treatment effects grow over time, this 2×2 gets a *negative* contribution. TWFE's "weighted average" puts positive weight on this, biasing `\(\hat\beta\)` downward. ] <a href="#twfe-main" class="nav-btn">← back</a> --- name: honest-did-detail # Backup: Honest DiD Visualization The robust CI grows as you allow more violation. The point estimate stays — only the uncertainty changes. <img src="slides_files/figure-html/honest-did-viz-1.png" style="display: block; margin: auto;" /> The smallest `\(M\)` at which the CI crosses zero is the **breakdown value** — how much PT violation you'd need to tolerate before your conclusion flips. 
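
To make the breakdown value concrete, here is a toy calculation in base R. It assumes a deliberately simplified robust interval, `est ± (z * se + M)`, in which the allowed violation `M` widens the interval one-for-one. The `HonestDiD` package builds its intervals differently, and `est`, `se`, and `M_grid` are hypothetical numbers chosen for illustration.

```r
# Toy breakdown-value calculation (simplified interval, hypothetical numbers)
est <- 1.40              # hypothetical point estimate
se  <- 0.20              # hypothetical standard error
z   <- qnorm(0.975)      # two-sided 95% critical value

M_grid <- seq(0, 2, by = 0.05)
lower  <- est - (z * se + M_grid)   # assumption: M widens the CI one-for-one
upper  <- est + (z * se + M_grid)

covers_zero <- lower <= 0 & upper >= 0
breakdown_M <- min(M_grid[covers_zero])   # smallest M whose CI includes zero
breakdown_M
```

Reading the result is a judgment call: if violations as large as `breakdown_M` are implausible in the application, the sign of the estimate is robust; if they are plausible, it is not.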
<a href="#honest-did-main" class="nav-btn">← back</a>

---
name: sc-code

# Backup: Synthetic Control via `solve.QP`

.small[
```r
library(quadprog)

# Assumes donors (periods x n_donor matrix), treated_y, t_treat and n_donor
# already exist from the SC data-generating chunk.
pre <- 1:(t_treat - 1)
A <- donors[pre, ]; b <- treated_y[pre]

# Quadratic program:
#   minimize || A w - b ||^2   subject to  w_j >= 0,  sum_j w_j = 1
Dmat <- 2 * t(A) %*% A
dvec <- 2 * t(A) %*% b
Amat <- cbind(rep(1, n_donor), diag(n_donor))  # constraints stacked as columns
bvec <- c(1, rep(0, n_donor))                  # 1 for sum = 1, 0 for w >= 0

w_hat   <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
synth_y <- as.numeric(donors %*% w_hat)
gap     <- treated_y - synth_y
```
]

`solve.QP` minimizes `\(\frac{1}{2} w^\top D w - d^\top w\)` subject to `\(A^\top w \ge b\)`, with `meq` flagging the first constraint as equality. Production code typically uses `Synth::synth()`, which adds a covariate-importance V-matrix and a nested optimization.

<a href="#sc-est-main" class="nav-btn">← back</a>

---
name: sdid-detail

# Backup: SDID Estimator Mechanics

The SDID estimator solves:

`$$(\hat\tau, \hat\mu, \hat\alpha, \hat\beta) = \arg\min \sum_{it} (Y_{it} - \mu - \alpha_i - \beta_t - \tau D_{it})^2 \, \hat\omega_i \, \hat\lambda_t$$`

where `\(\hat\omega_i\)` are unit weights from a regularized SC-style fit, and `\(\hat\lambda_t\)` are time weights from an analogous time-domain fit on the donors.

.blue-box[
**Why the regularization matters.** SC's unit weights can over-fit pre-period noise on a short pre-period — picking a donor combination that matches noise rather than the underlying trend. SDID adds an `\(L_2\)` penalty on the unit weights, smoothing the solution and reducing variance.

**Why the time weights matter.** They down-weight pre-periods that don't resemble the treatment period, focusing the comparison on "comparable" history. This is what makes SDID more robust to the donor-pool drift that breaks SC.
]

R implementation: `synthdid::synthdid_estimate()`. The package provides the point estimate, standard errors through `vcov()` (bootstrap, jackknife, or placebo methods), and plotting helpers for comparing SC, DiD, and SDID.
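
Because the weighted regression has unit and time fixed effects, for a single adoption date its solution can be written as a weighted double difference (Arkhangelsky et al. 2021). The base-R sketch below checks this on a noise-free toy panel. The panel, the helper `sdid_dd`, and the weights `omega` and `lambda` are made up for illustration; they are not the output of `synthdid`'s weight-fitting step.

```r
# Noise-free toy panel: Y_it = alpha_i + beta_t + tau * D_it, one adoption date
n_co <- 4; T_pre <- 6; T_post <- 3; tau <- 2
alpha <- c(1, 3, 5, 7, 4)                 # unit effects; the last unit is treated
beta  <- 0.5 * seq_len(T_pre + T_post)    # common time effects
Y <- outer(alpha, beta, "+")
post_cols <- (T_pre + 1):(T_pre + T_post)
Y[n_co + 1, post_cols] <- Y[n_co + 1, post_cols] + tau

# Weighted double difference: (post mean - lambda-weighted pre) for the treated
# unit, minus the omega-weighted average of the same contrast for the donors.
sdid_dd <- function(Y, omega, lambda, n_co, T_pre) {
  post  <- (T_pre + 1):ncol(Y)
  delta <- rowMeans(Y[, post, drop = FALSE]) -
    as.numeric(Y[, seq_len(T_pre), drop = FALSE] %*% lambda)
  mean(delta[-seq_len(n_co)]) - sum(omega * delta[seq_len(n_co)])
}

omega  <- c(0.4, 0.3, 0.2, 0.1)          # hypothetical unit weights, sum to 1
lambda <- c(0, 0, 0.1, 0.2, 0.3, 0.4)    # hypothetical time weights, sum to 1

tau_sdid <- sdid_dd(Y, omega, lambda, n_co, T_pre)
tau_did  <- sdid_dd(Y, rep(1 / n_co, n_co), rep(1 / T_pre, T_pre), n_co, T_pre)
c(sdid = tau_sdid, did = tau_did)        # both recover tau = 2 here
```

With an exactly additive panel every set of weights that sums to one recovers `tau`; the weights only start to matter once donors and periods differ in ways the fixed effects do not absorb. Uniform weights reproduce plain DiD.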
<a href="#sdid-main" class="nav-btn">← back</a> --- name: sc-pre-fit # Backup: When Is Pre-Fit "Bad"? .small[ **RMSPE** = root mean squared prediction error in the pre-period — *the objective the SC weights minimize:* `$$\text{RMSPE}_{\text{pre}} = \sqrt{\frac{1}{T_0}\sum_{t < T_0}\!\Bigl(Y_{1t} - \sum_j \hat w_j\, Y_{jt}\Bigr)^{\!2}}$$` By construction, this is the best the donor pool can do under the SC constraints ($w_j \geq 0$, `\(\sum w_j = 1\)`). **The diagnostic.** Compare `\(\text{RMSPE}_{\text{pre}}\)` to `\(\text{SD}(Y_{1t})_{\text{pre}}\)` — the SD of the treated unit's own pre-period outcomes (the naive "predict the mean" baseline): | Ratio `\(\text{RMSPE}_{\text{pre}} / \text{SD}(Y_{1t})\)` | Interpretation | |---|---| | `\(\ll 1\)` (e.g., 0.1–0.3) | Synthetic tracks treated well — pre-fit is good, SC trustworthy. | | `\(\approx 1\)` | Synthetic is no better than "predict the pre-period mean" — donor pool can't reconstruct the trajectory. | | `\(> 1\)` | Synthetic is *worse than a constant*. Very bad — post-period gap is uninterpretable. | **Why ratio `\(> 1\)` kills SC.** Identification rests on the pre-fit weights `\(\hat w\)` also generating a valid counterfactual *post*-treatment. If `\(\hat w\)` doesn't even reproduce the pre-period (worse than a constant!), there's no evidence the donor pool tracks the treated unit in the absence of treatment — any post-period gap could be fit error spilling forward, not the treatment effect. Q2's fixes (expand donors, lengthen pre-period, switch to SDiD) all target one of those failure modes. 
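
The ratio diagnostic takes a few lines of base R. The sketch assumes a pre-period outcome vector for the treated unit and a synthetic pre-period series (donors times fitted weights). Both are filled with made-up numbers here, and `prefit_ratio` is a hypothetical helper name.

```r
# Pre-fit diagnostic: RMSPE_pre relative to the SD of the treated unit's
# own pre-period outcomes (made-up numbers for illustration).
prefit_ratio <- function(y_treated_pre, y_synth_pre) {
  rmspe <- sqrt(mean((y_treated_pre - y_synth_pre)^2))
  rmspe / sd(y_treated_pre)
}

y_treated_pre <- c(10, 12, 11, 13, 15, 14, 16, 18)
y_synth_good  <- y_treated_pre + c(0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.3, -0.2)
y_synth_flat  <- rep(mean(y_treated_pre), length(y_treated_pre))

ratio_good <- prefit_ratio(y_treated_pre, y_synth_good)  # well below 1
ratio_flat <- prefit_ratio(y_treated_pre, y_synth_flat)  # sqrt(7/8), about 0.94
c(good = ratio_good, flat = ratio_flat)
```

The constant predictor does not land exactly on 1 because `sd()` divides by `\(n - 1\)` while the RMSPE divides by `\(n\)`; with short pre-periods that gap is worth remembering when reading the table above.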
]

<a href="#sc-interview-questions" class="nav-btn-br">← back</a>

---
name: cs-code

# Backup: Hand-Coded Callaway-Sant'Anna

.small[
```r
library(dplyr)

# Cohort-by-period means for ever-treated units (g = first treated period)
cohort_means <- panel |>
  filter(is.finite(g)) |>
  group_by(g, t) |>
  summarise(y = mean(y), .groups = "drop")

# Period means for never-treated units (coded g = Inf)
never_means <- panel |>
  filter(is.infinite(g)) |>
  group_by(t) |>
  summarise(y_nev = mean(y), .groups = "drop")

# ATT(g, t) = (Y_g,t - Y_g,g-1) - (Y_never,t - Y_never,g-1)
att_gt <- cohort_means |>
  left_join(never_means, by = "t") |>
  group_by(g) |>
  mutate(att = (y - y[t == g - 1]) - (y_nev - y_nev[t == g - 1])) |>
  ungroup() |>
  mutate(e = t - g)   # event time relative to first treatment
```
]

This is the simplest version: each cohort vs never-treated, anchored at `\(t = g - 1\)`. It assumes every cohort is observed in its base period `\(g - 1\)`. Production CS adds: not-yet-treated (vs only never-treated) controls, doubly-robust adjustment, and influence-function based standard errors. The `did` package implements all three.

<a href="#cs-main" class="nav-btn">← back</a>

---
name: cf-dgp

# Backup: Causal-Forest DGP

.small[
```r
library(grf)
library(tibble)

# DGP: zone-notification HTE depends on driver tenure + city density
n <- 4000
X <- tibble(tenure  = rexp(n, rate = 1/2),              # years (mean 2)
            density = runif(n, 0, 1),                   # city density 0-1
            age     = sample(20:65, n, replace = TRUE))
W <- rbinom(n, 1, 0.5)                                  # randomized treatment

# Heterogeneous true effect: rises with density, falls with tenure (capped)
true_tau <- 30 + 60 * X$density - 10 * pmin(X$tenure, 4)
Y <- 700 + 50 * X$age * 0.5 + true_tau * W + rnorm(n, 0, 80)
```
]

The HTE pattern: drivers in dense cities benefit most (+\$60/wk per density unit); long-tenured drivers benefit less (−\$10/wk per year of tenure, capped at 4 years), since they already route efficiently. The forest must recover this without us specifying the functional form.

<a href="#cf-practice-main" class="nav-btn">← back</a>

---
name: cf-splitting-detail

# Backup: Causal-Forest Splitting Algorithm

For each candidate split `\((j, c)\)` on covariate `\(j\)` at threshold `\(c\)`, define left/right child sets.

**Vanilla regression-tree criterion:** maximize variance of the *outcome mean* across the two children — i.e., split where `\(Y\)` differs most.

**Causal-forest criterion** (Athey, Tibshirani & Wager 2019): maximize the **heterogeneity of the treatment effect** across children:

`$$\Delta(j, c) = n_L \, n_R \, (\hat\tau_L - \hat\tau_R)^2 / n$$`

where `\(\hat\tau_L, \hat\tau_R\)` are local treatment-effect estimates inside each child, `\(n_L, n_R\)` are the child sample sizes, and `\(n = n_L + n_R\)`. The `grf` implementation uses a fast gradient-based approximation rather than evaluating `\(\hat\tau\)` exactly at every candidate split.

.highlight-box[
**Honesty:** within a tree, sample-splitting separates the data used to *grow* the tree from the data used to *estimate `\(\hat\tau\)` inside leaves*. Without this, the same noise that drove the split would inflate the estimated effect inside the leaf — exactly the over-fit that vanilla trees suffer.
]

<a href="#cf-how-main" class="nav-btn">← back</a>

---
name: mSPRT-detail

# Backup: mSPRT — Always-Valid Inference

The mSPRT (mixture sequential probability ratio test) constructs a *confidence sequence* `\(\{(L_t, U_t)\}\)` such that:

`$$P\!\left(\theta \in (L_t, U_t)\;\;\forall\, t\right) \ge 1 - \alpha$$`

The interval is *uniformly* valid — you can stop at any time.

The construction (Howard et al. 2021): for a Gaussian outcome with variance `\(\sigma^2\)`, the running CI for `\(\theta = E[X]\)` is

`$$\hat\theta_t \pm \sigma \sqrt{\frac{1 + t \rho^2}{t^2 \rho^2}\,\Bigl(2 \log(1/\alpha) + \log(1 + t \rho^2)\Bigr)}$$`

for a chosen mixing scale `\(\rho\)` (the standard deviation of the Gaussian mixing distribution, in units of `\(\sigma\)`). For large `\(t\)` the leading factor behaves like `\(1/t\)`, so the CI shrinks like `\(\sqrt{(\log t) / t}\)` — slightly slower than fixed-`\(n\)` `\(\sqrt{1/t}\)`, the price of always-valid coverage.

.blue-box[
**Used in industry by:** Optimizely (mSPRT), Microsoft (group-sequential + mSPRT), Netflix (sequential testing platform).
]

<a href="#seq-main" class="nav-btn">← back</a>

---
name: ab-interference-detail

# Backup: Individual A/B Under Driver-Rider Interference

.small[
**Setup.** 50% of drivers in a city get a zone-notification feature; the other 50% don't. Both arms operate in the same zones simultaneously.

**The hidden bias.** Treated drivers rush the zone → crowding pushes treated accept-rate *down*. Control drivers in the same zone benefit from less competition → control accept-rate goes *up*. The naive estimator `\(\hat\mu(1) - \hat\mu(0)\)` subtracts a *contaminated control* from a *deflated treated*.

**The killer.** In the canonical DGP with symmetric crowd-out, at saturation 0.5 the naive estimator equals the **direct effect**, not TTE. The policy-relevant quantity (TTE) can have the *opposite sign* from what the A/B reports — a +5 pp A/B can mask a −2 pp rollout. (Worked example: [M2 §29](../module-02/slides.html#estimands-formal-2).)

**What to do.**

- **Cluster A/B at the geographic level** (whole cities). If clusters are large enough to contain the spillovers, the treated-vs-control cluster comparison approaches TTE.
- **Switchback** — treat-all vs treat-none periods alternate; targets TTE directly.
- **Saturation experiment** — randomize the *fraction* treated across markets; trace the TTE/ADE curve and extrapolate to 100%.
- **Cluster-robust SE alone doesn't help.** It fixes variance, not bias. Don't conflate the two.
]

<a href="#marketplace-matrix" class="nav-btn">← back to matrix</a>

---
name: switchback-interference-detail

# Backup: Switchback Under Driver-Rider Interference

.small[
**Strength.** Whole-city ON/OFF periods → at any given moment every driver sees the *same* regime → the comparison is between a near `\(\mathbf{1}\)` and a near `\(\mathbf{0}\)` exposure regime → the estimator targets TTE directly. No treated-vs-control crowding within a period.

**Network-side weakness — physical carryover.** Driver positioning is *physical*.
A treated period concentrates drivers in zone X; when the period flips OFF, those drivers don't teleport home. The first 5–10 minutes of OFF are still partially treated.

**Sign of the bias.** Carryover makes OFF periods look more like ON periods → the ON−OFF gap shrinks → switchback **underestimates** TTE in absolute value (biases toward zero).

**What to do.**

- **Washout windows.** Drop the first τ minutes after each switch from analysis. Setting τ to roughly the median driver repositioning time is a reasonable starting heuristic.
- **Period-length tuning.** Too short → high carryover bias. Too long → too few independent comparisons (reduces effective `\(n\)`). Bojinov et al. (2023) derive the optimal period length — see [switchback-optimal in M3](../module-03/slides.html#switchback-optimal).
- **Asymmetric carryover.** If ON→OFF and OFF→ON have different lag profiles (e.g., features that trigger habit formation), use asymmetric washouts.

**Fundamental limitation.** Switchback gives you TTE as a single number. It can't decompose TTE into a direct effect and a spillover effect; for that you need a saturation design.
]

<a href="#marketplace-matrix" class="nav-btn">← back to matrix</a>

---
name: did-interference-detail

# Backup: DiD Under Driver-Rider Interference

.small[
**Setup.** Uber turns on a routing-algorithm change in San Francisco at `\(t^*\)`, leaves Oakland untouched. Standard 2×2 DiD:

`$$\text{DiD} = (\bar Y_{SF, post} - \bar Y_{SF, pre}) - (\bar Y_{Oak, post} - \bar Y_{Oak, pre})$$`

**The hidden assumption.** Parallel trends says: *absent treatment*, `\(Y_{SF}\)` and `\(Y_{Oak}\)` would have moved in parallel. But "absent treatment" really means *no treatment anywhere in the network*. If the routing change in SF reroutes drivers across the Bay Bridge, Oakland is **not** at the no-treatment counterfactual — it loses supply.

**Sign of the bias.** Spillover *from* SF *into* Oakland → Oakland trend bends DOWN → control rises *less* than it should → DiD = (treated rise) − (control rise) gets *bigger* → **overestimates** the SF-only treatment effect. The same logic applies to staggered DiD when treated cities pull from not-yet-treated controls.

**What to do.**

- **Distance-based donor exclusion.** Drop control geos within a buffer radius of treated; use only "isolated" controls.
- **Saturation-aware adjustment.** Estimate the spillover function from cross-border movement data and subtract the implied control-unit shift.
- **HonestDiD with a spillover-driven `\(\bar M\)`.** Set the bound on parallel-trends violation to the worst-case implied by your spillover model — see [honest-did-detail](#honest-did-detail).
- **Switch to switchback or cluster A/B if randomization is feasible.** They target TTE directly without the parallel-trends-equals-no-spillover leap.
]

<a href="#marketplace-matrix" class="nav-btn">← back to matrix</a>

---
name: sc-interference-detail

# Backup: Synthetic Control Under Driver-Rider Interference

.small[
**Setup.** Uber rolls out a new fare structure in **Austin** only; estimates a synthetic Austin from a weighted combination of Houston, Dallas, San Antonio, etc.

**The hidden assumption.** The donor pool is *not* affected by treatment. With a fare change in Austin: drivers near the metro might cross-shop, riders might delay travel, operational adjustments (driver supply hours, vehicle deployment) propagate through Uber's national supply pool. *Houston is no longer at counterfactual.*

**Sign of the bias.** Donor cities lose drivers to Austin → donor outcomes fall → synthetic Austin, a weighted average of the donors, falls with them → the counterfactual *underestimates* "Austin if untreated" → the gap (Austin minus synthetic) is too large → SC **overstates** the true treatment effect, the same direction as the DiD case. (Sign flips if spillover *helps* donors, e.g., donors absorb returning supply, or riders priced out of Austin substitute to Houston.)

**What to do.**

- **Geographic buffer.** Restrict donors to cities far enough that supply cross-shopping is negligible.
- **Pre-treatment covariate matching.** Choose donors with similar driver labor markets so spillover impact is balanced (won't eliminate bias but reduces it).
- **Augmented SC** (Ben-Michael, Feller, Rothstein 2021). The outcome model can absorb part of the spillover.
- **Explicit spillover model.** Estimate cross-city supply elasticity and adjust the synthetic counterfactual.
- **Triangulate.** If feasible, validate the SC estimate with a switchback or cluster A/B in a *different* market — divergence flags unmeasured spillover.
]

<a href="#marketplace-matrix" class="nav-btn">← back to matrix</a>