class: center, middle, inverse, title-slide .title[ # Module 8: Beyond the A/B Test ] .subtitle[ ## Modern DiD, Synthetic Control, and Causal Forests ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .small { font-size: 80%; } .tiny { font-size: 65%; } .highlight-box { background: #fff3e0; border-left: 4px solid #e65100; padding: 0.5em 1em; margin: 0.5em 0; } .blue-box { background: #e3f2fd; border-left: 4px solid #1565c0; padding: 0.5em 1em; margin: 0.5em 0; } .nav-btn { position: absolute; bottom: 12px; left: 40px; font-size: 11px; background: #e8eaf6; padding: 2px 8px; border-radius: 3px; z-index: 100; text-decoration: none; color: #1a237e; } .nav-btn:hover { background: #c5cae9; } .nav-btn-br { position: absolute; bottom: 12px; right: 70px; font-size: 11px; background: #e8eaf6; padding: 2px 8px; border-radius: 3px; z-index: 100; text-decoration: none; color: #1a237e; } .nav-btn-br:hover { background: #c5cae9; } .inline-btn { font-size: 11px; background: #e8eaf6; padding: 2px 8px; border-radius: 3px; text-decoration: none; color: #1a237e; margin-right: 6px; vertical-align: middle; } .inline-btn:hover { background: #c5cae9; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">The Experimental Ideal</a></td><td>✓ done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">SUTVA and When It Breaks</a></td><td>✓ done</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Designing Around Interference</a></td><td>✓ done</td></tr> <tr><td>4</td><td>Power and Sample Size</td><td>upcoming</td></tr> <tr><td>5</td><td><a href="../module-05/slides.html">Analyzing Experiments</a></td><td>✓ done</td></tr> <tr><td>6</td><td>Multiple Testing & Subgroups</td><td>upcoming</td></tr> <tr><td>7</td><td><a href="../module-07/slides.html">External Validity</a></td><td>✓ done</td></tr> <tr><td><b>8</b></td><td><b>Beyond the A/B Test</b> <i>(you are here)</i></td><td>current</td></tr> </table> --- # When Randomization Isn't Enough Three settings where the clean experiment from M1–M5 isn't available: .blue-box[ **1. Staggered rollouts** — Treatment turns on city-by-city over months. There's no clean "post" period that's the same for everyone. Naive two-way fixed effects (TWFE) is **biased** when effects are heterogeneous over time. **2. One treated unit** — A policy hits a single city. No control group to randomize against. *Synthetic control* builds a counterfactual from a weighted donor pool. **3. Heterogeneity that matters for policy** — Knowing the average effect isn't enough; you need to know *who responds*. *Causal forests* estimate treatment effects as a function of covariates. ] -- This module: the modern toolkit for each, with proofs and DGP code in backup slides. --- class: center, middle, inverse # Part 1 — Modern Difference-in-Differences --- # Classic 2×2 DiD: Two Groups, Two Periods One city gets the zone-notification feature at `\(t=2\)`; one doesn't. 
Driver weekly earnings before and after:

<img src="slides_files/figure-html/did-22-data-1.png" style="display: block; margin: auto;" />

`$$\widehat{\text{ATT}} = (\bar y_{T,\text{post}} - \bar y_{T,\text{pre}}) - (\bar y_{C,\text{post}} - \bar y_{C,\text{pre}}) = 70 - 20 = 50$$`

---
name: parallel-trends-main

# Parallel Trends — The Identifying Assumption

The DiD estimator equals the ATT *only if* the treated and control groups would have followed the same trend, on average, in the absence of treatment.

.highlight-box[
**Parallel trends:** `\(\;\;E[Y_t(0) - Y_{t-1}(0) \mid \text{treated}] = E[Y_t(0) - Y_{t-1}(0) \mid \text{control}]\)`

It's an assumption about *counterfactuals* — fundamentally untestable. We can only test pretrends as a proxy.
]

--

**The pretrend test:** check whether the two groups moved in parallel *before* treatment. Failing it is bad news. **Passing it is weaker evidence than people think** — Roth (2022) shows pretrend tests have low power against the most-worrying violations.

<a href="#pretrend-power-detail" class="nav-btn">Roth pretrend critique</a>

---
name: staggered-main

# Staggered Adoption — What Changes

Real rollouts don't happen all at once. Cities adopt zone-notification at different times.

<img src="slides_files/figure-html/staggered-dgp-1.png" style="display: block; margin: auto;" />

Three cohorts adopt at `\(t=8, 12, 16\)`; three never adopt. The effect grows over time and is **larger for earlier-adopting cohorts**.

<a href="#staggered-dgp-detail" class="nav-btn-br">DGP code</a>

---
name: twfe-main

# The TWFE Estimator — Looks Innocent, Isn't

The "default" DiD with staggered timing:

`$$Y_{it} = \alpha_i + \lambda_t + \beta\, D_{it} + \varepsilon_{it}$$`

```r
library(fixest)
fit_twfe <- feols(y ~ treated | city + t, data = panel)
coef(fit_twfe)["treatedTRUE"]
```

```
## treatedTRUE 
##    0.822057
```

--

The true effects are positive and grow over time. The TWFE coefficient is wrong (sometimes the *wrong sign*) when:

- effects are heterogeneous *across cohorts*, or
- effects grow/shrink *within a cohort over time*.

--

**Goodman-Bacon (2021):** `\(\hat\beta_{\text{TWFE}}\)` is a weighted average of all possible 2×2 DiDs in the data, with positive weights. The problem is *which* 2×2s get weight — in some of them, earlier-treated units act as "controls" for later-treated ones, which subtracts a treated trend rather than a counterfactual. <a href="#goodman-bacon-derivation" class="nav-btn">decomposition</a>

---
name: bacon-viz-main

# Visualizing the Bacon Decomposition

`\(\hat\beta_{\text{TWFE}}\)` is a **weighted average of every 2×2 DiD** in the data — each cohort vs every other cohort, plus each vs never-treated. With cohorts `\(g \in \{8, 12, 16, \infty\}\)`, that's **9 distinct 2×2 estimates**:

<img src="slides_files/figure-html/bacon-viz-1.png" style="display: block; margin: auto;" />

The **red 2×2s** use an already-treated cohort as control — they difference out a treated trend, not a counterfactual. TWFE (dotted) averages them in anyway, landing far below truth (dashed).
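To make one of the red comparisons concrete: a minimal sketch (not the code behind the figure) of the "late vs already-treated" 2×2 — cohort `\(g=16\)` against already-treated cohort `\(g=8\)` — using the `panel` object built in the DGP backup.

.small[
```r
# "Forbidden" 2x2: cohort g = 16 (late) vs cohort g = 8 (already treated).
# Pre-window: 8 <= t < 16 (late still untreated); post-window: t >= 16 (both treated).
library(dplyr)
forbidden <- panel |>
  filter(g %in% c(8, 16), t >= 8) |>
  mutate(grp = if_else(g == 16, "late", "early"), post = t >= 16) |>
  group_by(grp, post) |>
  summarise(y = mean(y), .groups = "drop")

with(forbidden,
     (y[grp == "late"  &  post] - y[grp == "late"  & !post]) -
     (y[grp == "early" &  post] - y[grp == "early" & !post]))
```
]

Because the early cohort's effect keeps growing over this window, its treated trend gets differenced out — this 2×2 typically comes out *negative* under the backup DGP even though every true effect is positive.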
<a href="#goodman-bacon-derivation" class="inline-btn">why</a> --- # The Modern Toolkit: One Slide All four heterogeneity-robust estimators solve the same problem — only use **clean** comparisons (treated vs not-yet-treated or never-treated): .small[ | Estimator | Year | Core idea | R package | |---|---|---|---| | **Callaway & Sant'Anna** | 2021 | Group-time ATT for each cohort g and time t, then aggregate | `did` | | **Sun & Abraham** | 2021 | Saturated event-study with cohort×event-time interactions | `fixest` (`sunab()`) | | **de Chaisemartin & D'Haultfœuille** | 2020 | Switchers vs not-yet-switchers | `DIDmultiplegt` | | **Borusyak, Jaravel, Spiess** | 2024 | Impute the untreated outcome from never-treated, then average residuals | `didimputation` | ] .highlight-box[ All four converge in simple cases. Pick one, report the event-study plot, and check robustness with another. **Callaway & Sant'Anna is the most-cited workhorse.** ] --- name: cs-main # Callaway-Sant'Anna in Practice The estimand is the **group-time ATT** `\(\text{ATT}(g, t)\)`: the effect on cohort `\(g\)` at calendar time `\(t\)`. We hand-compute it as a 2×2 DiD between each cohort and the never-treated, then plot by event time. <img src="slides_files/figure-html/cs-handcoded-1.png" style="display: block; margin: auto;" /> Each cohort's ATT path is positive and growing post-treatment — the heterogeneity TWFE smeared. <a href="#cs-code" class="inline-btn">code</a> --- name: honest-did-main # Honest DiD — Bounds Under Partial PT Violations Pretrend tests have low power. What if parallel trends is "almost but not quite" right? .blue-box[ **Rambachan & Roth (2023):** Don't pretend PT holds exactly. Posit a bound on how much the post-treatment differential trend can deviate from the pre-trend, then report a *robust confidence interval*. ] -- Two common restrictions: - **Smoothness, parameter M:** the post-treatment violation can be at most M times larger than the largest pre-treatment violation. - **Relative magnitudes:** the post-period violation cannot exceed the worst pre-period violation. The output is a *sensitivity plot*: the CI grows as you allow more violation. If your conclusion survives plausible M, you're robust. <a href="#honest-did-detail" class="nav-btn">visualization</a> --- # Application: Staggered Zone-Notification Rollout Three city cohorts adopt zone-notification at `\(t=8, 12, 16\)`. The true effect is positive and **larger for earlier adopters** (these cities had more underserved zones). What does each estimator say? .small[ |Estimator | Estimate| |:-------------------------|--------:| |Truth | 1.425| |TWFE (biased) | 0.822| |Sun-Abraham | 1.405| |Callaway-Sant'Anna (hand) | 1.405| ] TWFE is downward-biased — early cohorts (with the largest effects) get re-used as controls for later cohorts, dragging the estimate down. CS and SA recover the truth. --- class: center, middle, inverse # Part 2 — Synthetic Control --- # When You Have *One* Treated Unit A single city introduces a policy (e.g., a minimum-wage floor for drivers). No randomization, no comparable control city. **Build one from a donor pool.** <img src="slides_files/figure-html/sc-data-1.png" style="display: block; margin: auto;" /> No single donor matches. A **convex combination** might. 
---
name: sc-est-main

# Synthetic Control: The Estimator

Pick weights `\(w_j \ge 0\)` with `\(\sum_j w_j = 1\)` to match the treated unit's *pre-treatment* outcomes — a **convex combination** of donors:

`$$\min_{w} \sum_{t < T_0} \!\left( Y_{1t} - \sum_{j \ge 2} w_j Y_{jt} \right)^{\!2} \;\;\text{s.t.}\;\; w_j \ge 0,\;\; \sum_j w_j = 1$$`

<img src="slides_files/figure-html/sc-counterfactual-plot-1.png" style="display: block; margin: auto;" />

The blue dashed line is the synthetic counterfactual: `\(\sum_j \hat w_j Y_{jt}\)`, where the `\(\hat w\)` above are 0.33, 0.18, 0.28, 0.14, 0.03, 0.04. <a href="#sc-code" class="inline-btn">code</a>

---

# Synthetic Control: Inference via Placebos

Standard errors don't work — we have one treated unit. Instead: **placebo-in-space**.

Re-run the procedure pretending each *donor* is the treated unit. If the true treated unit's gap is unusually large vs the placebo distribution, that's evidence of an effect.

<img src="slides_files/figure-html/sc-placebos-1.png" style="display: block; margin: auto;" />

The treated gap (red) drops below the placebo cloud after `\(t=21\)` — the effect is real.

---
name: sdid-main

# Synthetic DiD: The Bridge

Arkhangelsky et al. (2021) generalize both DiD and SC. **Synthetic DiD** uses *two* sets of weights:

.blue-box[
- **Unit weights** `\(\hat\omega_i\)` work like SC: match the donors' pre-treatment trajectories to the treated unit.
- **Time weights** `\(\hat\lambda_t\)` balance the donors' pre-treatment periods against their post-treatment average — down-weighting pre-periods that don't look like "now".
]

.small[
`$$\widehat{\tau}_{\text{SDID}} = \arg\min_{\tau, \mu, \alpha, \beta} \sum_{i,t} (Y_{it} - \mu - \alpha_i - \beta_t - \tau D_{it})^2 \, \hat\omega_i \, \hat\lambda_t$$`
]

--

Why this matters:

- **DiD** uses uniform weights → biased when donors don't match.
- **SC** uses unit weights only → can over-fit pre-period noise.
- **SDID** uses both → robust to both failure modes. In Arkhangelsky et al.'s empirical comparisons, lower MSE than either alone.

<a href="#sdid-detail" class="nav-btn">SDID estimator detail</a>

---
class: center, middle, inverse

# Part 3 — Causal Forest

---

# Why HTE — Beyond the ATE

ATE answers: "should we ship?" HTE answers: "**to whom**?"

.highlight-box[
The zone-notification feature has `\(\widehat{\text{ATE}} = +\$50\)`/wk. But the effect probably varies — by city density, driver tenure, time-of-week patterns. **Knowing the heterogeneity lets you target rollout, set personalized policies, and forecast aggregate impact under different deployment plans.**
]

--

Classic approach: pre-specify subgroups, run interactions. Two problems:

- **Multiple testing** — interview question from M6.
- **Misspecification** — the right subgroups aren't always the obvious ones.

**Causal forest** (Wager & Athey 2018, Athey-Tibshirani-Wager 2019): non-parametric estimation of `\(\tau(x) = E[Y(1) - Y(0) \mid X = x]\)` using an ensemble of *honest* trees.

---
name: cf-how-main

# Causal Forest — How It Works

.pull-left[
**Honest splitting** is the key trick. Each tree:

1. Splits its subsample into two halves.
2. Uses one half to *grow* the tree (decide where to split).
3. Uses the other half to *estimate* the treatment effect within each leaf.

This separation prevents the same observation from informing both the split rule and the estimate inside it — eliminating the over-fitting bias that plagues vanilla trees.
]

.pull-right[
**Forest-level prediction** for a new `\(x\)`:

- Each tree drops `\(x\)` down to a leaf.
- The leaf defines a *local neighborhood* — observations with similar covariates.
- `\(\hat\tau(x)\)` is a forest-weighted comparison of treated vs control outcomes (or an IV estimate) in that neighborhood, with weights coming from how often training points share a leaf with `\(x\)`.

The result has pointwise confidence intervals — Wager & Athey prove asymptotic normality.
]

<a href="#cf-splitting-detail" class="nav-btn">splitting algorithm</a>

---
name: cf-practice-main

# Causal Forest in Practice — `grf`

.small[
```r
cf <- causal_forest(X = as.matrix(X), Y = Y, W = W, num.trees = 1000)
average_treatment_effect(cf)   # ATE estimate + SE
```

```
## estimate  std.err 
## 42.94805  2.67375
```
]

<img src="slides_files/figure-html/cf-plot-1.png" style="display: block; margin: auto;" />

Predicted `\(\hat\tau(x)\)` rises with density; the forest recovers the truth (dashed). <a href="#cf-dgp" class="inline-btn">DGP code</a>

---

# Policy Learning — A Quick Teaser

Once you have `\(\hat\tau(x)\)`, you can learn a **treatment rule** `\(\pi(x) \in \{0, 1\}\)` that maximizes welfare:

`$$\pi^*(x) = \mathbb{1}\{\hat\tau(x) > c(x)\}$$`

where `\(c(x)\)` is a per-unit cost (e.g., the cost of pushing a notification).

.blue-box[
- **`policytree::policy_tree`** (the companion package to `grf`) — fit a shallow decision tree over the covariates that maximizes estimated welfare. Output: an interpretable rule like *"deploy in dense cities to drivers with <2 years tenure."*
- **Athey & Wager (2021)** prove these are **near-optimal policies**: the learned rule's welfare comes within a gap that vanishes with sample size of the *optimal* policy's welfare. So the rule isn't just interpretable — it's provably close to the best rule you could have chosen if you knew the truth.
]

This is what gets used to *deploy* the model. ATE answers "ship?". HTE answers "to whom?". Policy learning answers "**what should we actually do?**"

---
class: center, middle, inverse

# Part 4 — One-Pagers

---

# Bandits — Adaptive Experimentation in One Slide

.small[
**Setup:** K arms (e.g., 4 ad creatives). At each step, pick one, observe reward. Goal: maximize cumulative reward, not just identify the best arm.

**Thompson Sampling** is the workhorse: maintain a posterior over each arm's reward, sample from it, play the argmax. Asymptotically optimal regret; minimal tuning.
]

<img src="slides_files/figure-html/bandit-quick-1.png" style="display: block; margin: auto;" />

.small[
**When *not* to use:** when you need an unbiased ATE for a fixed-population decision (a launch / kill call). Bandits make assignment correlated with outcome history — naive analysis is biased.
]

---
name: seq-main

# Sequential Testing in One Slide

.small[
**Problem:** classical p-values are valid only at one fixed sample size. Peeking inflates Type-I error — by the end of the experiment, the running p-value will have dipped below 0.05 roughly a quarter of the time *under the null*.

**Two fixes used in industry:**

- **Group-sequential / O'Brien-Fleming:** spend the alpha budget on a pre-specified schedule of looks. Conservative early, normal late.
- **Always-valid CIs / mSPRT:** mixture sequential probability ratio test. Confidence sequences that hold *uniformly over time*. Stop whenever you want; CIs stay valid.
]

```
## [1] 0.265
```

.small[
Under the null, naive peeking yields a ≈27% rejection rate, not 5%. mSPRT or OBF would hold this near 5%.
]

<a href="#mSPRT-detail" class="nav-btn">mSPRT formula</a>

---
class: center, middle, inverse

# Part 5 — Wrap-Up

---

# Decision Tree: Which Method When

.small[
**Got randomization?** → Use the M5 toolkit (regression adjustment, CUPED, ITT/LATE).
**Staggered rollout, multiple cities, multiple time periods?** - Heterogeneous effects expected? → **Callaway-Sant'Anna** (or Sun-Abraham). Avoid TWFE. - Worried about parallel trends? → **Honest DiD** sensitivity bounds. **Single treated unit?** → **Synthetic control** (or **SDID** if you have a moderate-size donor pool). **Want effect heterogeneity?** → **Causal forest** for treatment effects as a function of covariates; **policy tree** for an interpretable deployment rule. **Many arms, online learning?** → **Thompson Sampling**, *unless* you need a clean ATE. **Repeated looks at one experiment?** → **mSPRT / always-valid CIs**. ] --- # Interview Cheat Sheet .blue-box[ **"Walk me through how you'd analyze a staggered rollout."** Three steps. (1) Plot raw outcomes by cohort and event time — does parallel trends look plausible? (2) Don't run TWFE; run Callaway-Sant'Anna and report the event-study plot. (3) Sensitivity-check with Honest DiD and a robustness column from Sun-Abraham. ] .blue-box[ **"How do you estimate a causal effect for one city?"** Synthetic control. Build weights from a donor pool to match pre-treatment outcomes. Inference via placebo-in-space. Mention SDID as the modern improvement. ] .blue-box[ **"How do you get heterogeneous treatment effects?"** Causal forest from `grf`. Honest splitting prevents over-fit. Aggregate to ATE for sanity-check; predict per-unit treatment effects from the covariates; feed into a policy tree if a deployment rule is needed. ] .highlight-box[ **Red flag in interviews:** running TWFE on staggered data without a Goodman-Bacon diagnostic. ] --- # Going Deeper This module is a tour, not a treatment. Each of the three core methods has a course's worth of material behind it. .blue-box[ **Companion course (in development):** *Causal Inference Beyond A/B Tests* — a deep dive into modern DiD (formal CS / SA / BJS estimators, full Honest DiD), synthetic control variants (augmented SC, generalized SC, matrix completion), and the Athey-Wager causal-forest stack (policy learning, contextual bandits via causal trees). Repo will live alongside this one at `~/Desktop/sandbox/courses/causal-inference-beyond-ab/` and link from the same landing page. ] For the read-now references behind today's slides: .small[ - **Goodman-Bacon (2021)** — *Difference-in-differences with variation in treatment timing*, J. Econometrics. - **Callaway & Sant'Anna (2021)** — *Difference-in-differences with multiple time periods*, J. Econometrics. - **Arkhangelsky et al. (2021)** — *Synthetic difference in differences*, AER. - **Wager & Athey (2018)** — *Estimation and inference of heterogeneous treatment effects using random forests*, JASA. - **Rambachan & Roth (2023)** — *A more credible approach to parallel trends*, ReStud. ] --- class: center, middle, inverse # Backup Slides --- name: pretrend-power-detail # Backup: The Roth (2022) Pretrend Critique **The standard pretrend test** has low power against linear (or near-linear) violations — exactly the kind that would bias the post-treatment estimate. <img src="slides_files/figure-html/roth-power-1.png" style="display: block; margin: auto;" /> A meaningful linear violation is missed most of the time. Roth's recommendation: use Honest DiD bounds, not the pretrend test. 
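Acting on that recommendation in code — a sketch assuming the `HonestDiD` package's `createSensitivityResults_relativeMagnitudes()` interface, where `betahat` and `sigma` stand in for an event-study coefficient vector and its covariance matrix (neither is computed in this deck) and the period counts are illustrative:

.small[
```r
# Robust CIs under the relative-magnitudes restriction: how big can the post-period
# PT violation be, as a multiple Mbar of the worst pre-period violation, before the
# conclusion flips?
library(HonestDiD)
robust <- createSensitivityResults_relativeMagnitudes(
  betahat        = betahat,   # event-study coefficients (placeholder)
  sigma          = sigma,     # their variance-covariance matrix (placeholder)
  numPrePeriods  = 5,
  numPostPeriods = 4,
  Mbarvec        = seq(0, 2, by = 0.5)
)
robust   # one robust CI per value of Mbar — the input to a sensitivity plot
```
]

The smallest `Mbar` whose CI includes zero is the breakdown value from the Honest DiD backup slide.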
<a href="#parallel-trends-main" class="nav-btn">← back</a> --- name: staggered-dgp-detail # Backup: Staggered-Adoption DGP .small[ ```r n_cities <- 12; n_t <- 24 cohorts <- tibble( city = 1:n_cities, g = c(rep(8, 3), rep(12, 3), rep(16, 3), rep(Inf, 3)) ) panel <- expand_grid(city = 1:n_cities, t = 1:n_t) |> left_join(cohorts, by = "city") |> mutate( treated = t >= g, # heterogeneous, time-varying effect: bigger for earlier cohorts eff = if_else(treated, 0.6 * (1 + 0.15 * (t - g)) * (1 + 0.7 * (g == 8) - 0.7 * (g == 16)), 0), y = 5 + 0.05 * t + city * 0.1 + eff + rnorm(n(), 0, 0.3) ) ``` ] Three cohorts adopt at `\(t = 8, 12, 16\)`; three never adopt. Effect grows over time and is **larger for earlier cohorts** — the classic case where TWFE breaks. <a href="#staggered-main" class="nav-btn">← back</a> --- name: goodman-bacon-derivation # Backup: The Goodman-Bacon Decomposition For TWFE with staggered timing, the OLS coefficient `\(\hat\beta_{\text{TWFE}}\)` decomposes as: `$$\hat\beta_{\text{TWFE}} = \sum_{k} s_k \, \widehat{\text{ATT}}_k$$` where each `\(\widehat{\text{ATT}}_k\)` is a 2×2 DiD between two cohorts (or a cohort and the never-treated), and the weights `\(s_k\)` depend on: - size of each cohort, - timing of treatment, - variance of treatment exposure within the panel. .highlight-box[ **The bug:** when an *earlier-treated* cohort is used as a comparison group for a *later-treated* one, the post-period for the comparison cohort is also under treatment. So that 2×2 has the form `\((\Delta Y_{\text{treated}}) - (\Delta Y_{\text{also treated, but longer ago}})\)`, and the second piece subtracts a *treatment-effect trend*, not a counterfactual. If treatment effects grow over time, this 2×2 gets a *negative* contribution. TWFE's "weighted average" puts positive weight on this, biasing `\(\hat\beta\)` downward. ] <a href="#twfe-main" class="nav-btn">← back</a> --- name: honest-did-detail # Backup: Honest DiD Visualization The robust CI grows as you allow more violation. The point estimate stays — only the uncertainty changes. <img src="slides_files/figure-html/honest-did-viz-1.png" style="display: block; margin: auto;" /> The smallest `\(M\)` at which the CI crosses zero is the **breakdown value** — how much PT violation you'd need to tolerate before your conclusion flips. <a href="#honest-did-main" class="nav-btn">← back</a> --- name: sc-code # Backup: Synthetic Control via `solve.QP` .small[ ```r library(quadprog) pre <- 1:(t_treat - 1) A <- donors[pre, ]; b <- treated_y[pre] # Quadratic program: # minimize || A w - b ||^2 subject to w_j >= 0, sum_j w_j = 1 Dmat <- 2 * t(A) %*% A dvec <- 2 * t(A) %*% b Amat <- cbind(rep(1, n_donor), diag(n_donor)) # constraints stacked bvec <- c(1, rep(0, n_donor)) # 1 for sum=1, 0 for w >= 0 w_hat <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution synth_y <- as.numeric(donors %*% w_hat) gap <- treated_y - synth_y ``` ] `solve.QP` minimizes `\(\frac{1}{2} w^\top D w - d^\top w\)` subject to `\(A^\top w \ge b\)`, with `meq` flagging the first constraint as equality. Production code uses `Synth::synth()`, which adds a covariate-importance V-matrix and a nested optimization. 
<a href="#sc-est-main" class="nav-btn">← back</a> --- name: sdid-detail # Backup: SDID Estimator Mechanics The SDID estimator solves: `$$(\hat\tau, \hat\mu, \hat\alpha, \hat\beta) = \arg\min \sum_{it} (Y_{it} - \mu - \alpha_i - \beta_t - \tau D_{it})^2 \, \hat\omega_i \, \hat\lambda_t$$` where `\(\hat\omega_i\)` are unit weights from a regularized SC-style fit, and `\(\hat\lambda_t\)` are time weights from an analogous time-domain fit. .blue-box[ **Why the regularization matters.** SC's unit weights can over-fit pre-period noise on a short pre-period — picking a donor combination that matches noise rather than the underlying trend. SDID adds an `\(L_2\)` penalty on the weights, smoothing the solution and reducing variance. **Why the time weights matter.** They down-weight pre-periods that don't resemble the treatment period, focusing the comparison on "comparable" history. This is what makes SDID robust to the donor-pool drift that breaks SC. ] R implementation: `synthdid::synthdid_estimate()`. The package returns the estimate, jackknife SE, and a built-in SC vs DiD vs SDID comparison plot. <a href="#sdid-main" class="nav-btn">← back</a> --- name: cs-code # Backup: Hand-Coded Callaway-Sant'Anna .small[ ```r cohort_means <- panel |> filter(is.finite(g)) |> group_by(g, t) |> summarise(y = mean(y), .groups = "drop") never_means <- panel |> filter(is.infinite(g)) |> group_by(t) |> summarise(y_nev = mean(y), .groups = "drop") # ATT(g, t) = (Y_g,t - Y_g,g-1) - (Y_never,t - Y_never,g-1) att_gt <- cohort_means |> left_join(never_means, by = "t") |> group_by(g) |> mutate(att = (y - y[t == g - 1]) - (y_nev - y_nev[t == g - 1])) |> ungroup() |> mutate(e = t - g) ``` ] This is the simplest version: each cohort vs never-treated, anchored at `\(t = g - 1\)`. Production CS adds: not-yet-treated (vs only never-treated) controls, doubly-robust adjustment, and influence-function based standard errors. The `did` package implements all three. <a href="#cs-main" class="nav-btn">← back</a> --- name: cf-dgp # Backup: Causal-Forest DGP .small[ ```r library(grf) # DGP: zone-notification HTE depends on driver tenure + city density n <- 4000 X <- tibble(tenure = rexp(n, rate = 1/2), # years density = runif(n, 0, 1), # city density 0-1 age = sample(20:65, n, replace = TRUE)) W <- rbinom(n, 1, 0.5) # Heterogeneous true effect: rises with density, falls with tenure (capped) true_tau <- 30 + 60 * X$density - 10 * pmin(X$tenure, 4) Y <- 700 + 50 * X$age * 0.5 + true_tau * W + rnorm(n, 0, 80) ``` ] The HTE pattern: drivers in dense cities benefit most ($+\$60$/wk per density unit); long-tenured drivers benefit less (already routing efficiently). The forest must recover this without us specifying the functional form. <a href="#cf-practice-main" class="nav-btn">← back</a> --- name: cf-splitting-detail # Backup: Causal-Forest Splitting Algorithm For each candidate split `\((j, c)\)` on covariate `\(j\)` at threshold `\(c\)`, define left/right child sets. **Vanilla regression-tree criterion:** maximize variance of the *outcome mean* across the two children — i.e., split where `\(Y\)` differs most. **Causal-forest criterion** (Athey, Tibshirani & Wager 2019): maximize the **heterogeneity of the treatment effect** across children: `$$\Delta(j, c) = n_L \, n_R \, (\hat\tau_L - \hat\tau_R)^2 / n$$` where `\(\hat\tau_L, \hat\tau_R\)` are local treatment-effect estimates inside each child. The `grf` implementation uses a fast gradient-based approximation rather than evaluating `\(\hat\tau\)` exactly at every candidate split. 
.highlight-box[ **Honesty:** within a tree, sample-splitting separates the data used to *grow* the tree from the data used to *estimate `\(\hat\tau\)` inside leaves*. Without this, the same noise that drove the split would inflate the estimated effect inside the leaf — exactly the over-fit that vanilla trees suffer. ] <a href="#cf-how-main" class="nav-btn">← back</a> --- name: mSPRT-detail # Backup: mSPRT — Always-Valid Inference The mSPRT (mixture sequential probability ratio test) constructs a *confidence sequence* `\(\{(L_t, U_t)\}\)` such that: `$$P\!\left(\theta \in (L_t, U_t)\;\;\forall\, t\right) \ge 1 - \alpha$$` The interval is *uniformly* valid — you can stop at any time. The construction (Howard et al. 2021): for a Gaussian outcome with variance `\(\sigma^2\)`, the running CI for `\(\theta = E[X]\)` is `$$\hat\theta_t \pm \sigma \sqrt{\frac{2 \log(1/\alpha) + \log(1 + t \rho^2)}{t \rho^2}}$$` for a chosen mixing scale `\(\rho\)`. The CI shrinks like `\(\sqrt{(\log t) / t}\)` — slightly slower than fixed-$n$ `\(\sqrt{1/t}\)`, the price of always-valid coverage. .blue-box[ **Used in industry by:** Optimizely (mSPRT), Microsoft (group-sequential + mSPRT), Netflix (sequential testing platform). ] <a href="#seq-main" class="nav-btn">← back</a>
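---
name: peeking-sim

# Backup: Naive-Peeking Simulation

A sketch of how a number like the `0.265` on the sequential-testing slide can be generated (illustrative parameters, not necessarily the exact code behind that output):

.small[
```r
set.seed(1)
n_sims <- 1000; n_max <- 500
reject <- replicate(n_sims, {
  x     <- rnorm(n_max)                 # data generated under the null
  peeks <- seq(20, n_max, by = 10)      # look at the running test every 10 obs
  any(sapply(peeks, function(n) t.test(x[1:n])$p.value < 0.05))
})
mean(reject)   # share of null experiments that ever cross p < 0.05
```
]

Each individual look is a valid fixed-`\(n\)` test; it's the *stop at the first look that clears 0.05* policy that inflates the Type-I error.

<a href="#seq-main" class="nav-btn">← back</a>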