Module 8: Beyond the A/B Test

.title[
# Module 8: Beyond the A/B Test
]
.subtitle[
## Modern DiD, Synthetic Control, and Causal Forests
]

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">The Experimental Ideal</a></td><td>✓ done</td></tr>
<tr><td>2</td><td><a href="../module-02/slides.html">SUTVA and When It Breaks</a></td><td>✓ done</td></tr>
<tr><td>3</td><td><a href="../module-03/slides.html">Designing Around Interference</a></td><td>✓ done</td></tr>
<tr><td>4</td><td>Power and Sample Size</td><td>upcoming</td></tr>
<tr><td>5</td><td><a href="../module-05/slides.html">Analyzing Experiments</a></td><td>✓ done</td></tr>
<tr><td>6</td><td>Multiple Testing & Subgroups</td><td>upcoming</td></tr>
<tr><td>7</td><td><a href="../module-07/slides.html">External Validity</a></td><td>✓ done</td></tr>
<tr><td><b>8</b></td><td><b>Beyond the A/B Test</b> <i>(you are here)</i></td><td>current</td></tr>
</table>

---

# When Randomization Isn't Enough

Three settings where the clean experiment from M1–M5 isn't available:

.blue-box[
**1. Staggered rollouts:** Treatment turns on city-by-city over months. There's no clean "post" period that's the same for everyone. Naive two-way fixed effects (TWFE) can be **badly biased** when effects differ across cohorts or change with time since treatment.

**2. One treated unit:** A policy hits a single city. No control group to randomize against. *Synthetic control* builds a counterfactual from a weighted donor pool.

**3. Heterogeneity that matters for policy:** Knowing the average effect isn't enough; you need to know *who responds*. *Causal forests* estimate treatment effects as a function of covariates.
]

This module: the modern toolkit for each, with proofs and DGP code in backup slides.

---

# Part 1 — Modern Difference-in-Differences

---

# Classic 2×2 DiD: Two Groups, Two Periods

One city gets the zone-notification feature at `$t=2$`; one doesn't. Driver weekly earnings before and after:

`$$\widehat{\text{ATT}} = (\bar y_{T,\text{post}} - \bar y_{T,\text{pre}}) - (\bar y_{C,\text{post}} - \bar y_{C,\text{pre}}) = 70 - 20 = 50$$`
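As a quick check on the arithmetic, here is a minimal base-R sketch. The four cell means are hypothetical values chosen only so that the treated change is 70 and the control change is 20; they are not the deck's data.

```r
# Hypothetical cell means (USD/week), picked to match the slide's 70 - 20 = 50.
means <- data.frame(
  group  = c("treated", "treated", "control", "control"),
  period = c("pre", "post", "pre", "post"),
  ybar   = c(800, 870, 760, 780)
)

cell <- function(g, p) means$ybar[means$group == g & means$period == p]

change_treated <- cell("treated", "post") - cell("treated", "pre")   # 70
change_control <- cell("control", "post") - cell("control", "pre")   # 20
att_hat <- change_treated - change_control                           # 50

# Same number from the interaction coefficient of a saturated regression.
means$treat <- as.numeric(means$group == "treated")
means$post  <- as.numeric(means$period == "post")
beta_did <- unname(coef(lm(ybar ~ treat * post, data = means))["treat:post"])
```

The regression form is the one that generalizes: with unit-level data, the `treat:post` coefficient is the same 2×2 DiD, and it comes with a standard error.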

---
name: parallel-trends-main

# Parallel Trends — The Identifying Assumption

The DiD estimator equals the ATT *only if* the treated and control groups would have followed the same trend, on average, in the absence of treatment.

.highlight-box[
**Parallel trends:** `$\;\;E[Y_t(0) - Y_{t-1}(0) \mid \text{treated}] = E[Y_t(0) - Y_{t-1}(0) \mid \text{control}]$`

It's an assumption about *counterfactuals* — fundamentally untestable. We can only test pretrends as a proxy.
]
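A short derivation shows where the assumption enters. The shorthand `$\Delta Y = Y_t - Y_{t-1}$` is introduced here for brevity, and the step also assumes no anticipation, i.e. treated units' pre-period outcomes equal `$Y_{t-1}(0)$`:

`$$\underbrace{E[\Delta Y \mid \text{treated}] - E[\Delta Y \mid \text{control}]}_{\text{DiD}} = \underbrace{E[Y_t(1) - Y_t(0) \mid \text{treated}]}_{\text{ATT}} + \underbrace{E[\Delta Y(0) \mid \text{treated}] - E[\Delta Y(0) \mid \text{control}]}_{=\,0 \text{ under parallel trends}}$$`

The last term is the bias: any difference in untreated trends between the groups loads one-for-one onto the DiD estimate.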

**The pretrend test:** check whether the two groups moved in parallel *before* treatment. Failing this is bad news. **Passing it is weaker evidence than people think** — Roth (2022) shows pretrend tests have low power against the most-worrying violations.

<a href="#pretrend-power-detail" class="nav-btn">Roth pretrend critique</a>

---

# Staggered Adoption — What Changes

Real rollouts don't happen all at once. Cities adopt zone-notification at different times.

Three cohorts adopt at `$t=8, 12, 16$`; three never adopt. The effect grows over time and is **larger for earlier-adopting cohorts**.

---

# The TWFE Estimator — Looks Innocent, Isn't

The "default" DiD with staggered timing:

`$$Y_{it} = \alpha_i + \lambda_t + \beta\, D_{it} + \varepsilon_{it}$$`

```r
library(fixest)   # provides feols()
fit_twfe <- feols(y ~ treated | city + t, data = panel)
coef(fit_twfe)["treatedTRUE"]
```

```
## treatedTRUE 
##    0.822057
```

The true effects are positive and grow over time. The TWFE coefficient can be badly off (sometimes even the *wrong sign*) when:

- effects are heterogeneous *across cohorts*, or
- effects grow/shrink *within a cohort over time*.

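**Goodman-Bacon (2021):** `$\hat\beta_{\text{TWFE}}$` is a weighted average of all possible 2×2 DiDs in the data. The weights are positive and sum to one, but some of the 2×2s are **forbidden comparisons**: earlier-treated units act as "controls" for later-treated ones, which subtracts a treated trend. Rewritten as weights on the underlying treatment effects, some of those can be **negative** (de Chaisemartin & D'Haultfœuille 2020).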

<a href="#goodman-bacon-derivation" class="nav-btn">decomposition</a>

---

# Visualizing the Bacon Decomposition

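`$\hat\beta_{\text{TWFE}}$` is a **weighted average of every 2×2 DiD** in the data: each cohort vs every other cohort, plus each vs never-treated. With cohorts `$g \in \{8, 12, 16, \infty\}$`, that's **9 distinct 2×2 estimates**: 3 treated-vs-never-treated, 3 earlier-vs-later (control not yet treated), and 3 later-vs-earlier (control already treated).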

The **red 2×2s** use an already-treated cohort as control — they difference out a treated trend, not a counterfactual. TWFE (dotted) averages them in anyway, landing far below truth (dashed). <a href="#goodman-bacon-derivation" class="inline-btn">why</a>
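The count of nine can be checked by enumerating the comparisons directly. This is a small base-R sketch; the `type` labels are wording chosen here, not output from any decomposition package.

```r
# Enumerate the 2x2 comparisons behind the Bacon decomposition (base R only).
cohorts <- c(8, 12, 16, Inf)          # Inf = never-treated
treated <- cohorts[is.finite(cohorts)]

pairs <- expand.grid(treated = treated, control = cohorts)
pairs <- pairs[pairs$treated != pairs$control, ]

pairs$type <- ifelse(
  is.infinite(pairs$control), "treated vs never-treated",
  ifelse(pairs$treated < pairs$control,
         "earlier vs later (clean: control not yet treated)",
         "later vs earlier (forbidden: control already treated)")
)

table(pairs$type)
n_2x2 <- nrow(pairs)
```

Three of the nine rows fall in the "forbidden" category, which is where the treated trend leaks into the comparison group.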

---

# The Modern Toolkit: One Slide

All four heterogeneity-robust estimators solve the same problem — only use **clean** comparisons (treated vs not-yet-treated or never-treated):

.small[
| Estimator | Year | Core idea | R package |
|---|---|---|---|
| **Callaway & Sant'Anna** | 2021 | Group-time ATT for each cohort g and time t, then aggregate | `did` |
| **Sun & Abraham** | 2021 | Saturated event-study with cohort×event-time interactions | `fixest` (`sunab()`) |
| **de Chaisemartin & D'Haultfœuille** | 2020 | Switchers vs not-yet-switchers | `DIDmultiplegt` |
| **Borusyak, Jaravel, Spiess** | 2024 | Fit unit and time effects on untreated observations (never- and not-yet-treated), impute the untreated outcome for treated cells, then average actual minus imputed | `didimputation` |
]

.highlight-box[
All four converge in simple cases. Pick one, report the event-study plot, and check robustness with another. **Callaway & Sant'Anna is the most-cited workhorse.**
]

---

# Callaway-Sant'Anna in Practice

The estimand is the **group-time ATT** `$\text{ATT}(g, t)$`: the effect on cohort `$g$` at calendar time `$t$`. We hand-compute it as a 2×2 DiD between each cohort and the never-treated, then plot by event time.

Each cohort's ATT path is positive and growing post-treatment — the heterogeneity TWFE smeared. <a href="#cs-code" class="inline-btn">code</a>

---
name: honest-did-main

# Honest DiD — Bounds Under Partial PT Violations

Pretrend tests have low power. What if parallel trends is "almost but not quite" right?

.blue-box[
**Rambachan & Roth (2023):** Don't pretend PT holds exactly. Posit a bound on how much the post-treatment differential trend can deviate from the pre-trend, then report a *robust confidence interval*.
]

Two common restrictions:
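- **Smoothness, parameter `$M$`:** the slope of the differential trend can change by at most `$M$` from one period to the next (`$M = 0$` means the pre-trend simply continues linearly).
- **Relative magnitudes, parameter `$\bar M$`:** the post-treatment violation can be at most `$\bar M$` times the largest pre-treatment violation (`$\bar M = 1$`: no worse than the worst pre-period violation).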

The output is a *sensitivity plot*: the CI grows as you allow more violation. If your conclusion survives plausible M, you're robust.
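Written out, with `$\delta_t$` denoting the difference in untreated trends at event time `$t$` (notation introduced here, following Rambachan & Roth), the two restriction sets are roughly:

`$$\Delta^{SD}(M) = \left\{\delta : \left|(\delta_{t+1} - \delta_t) - (\delta_t - \delta_{t-1})\right| \le M \;\; \forall\, t\right\}$$`

`$$\Delta^{RM}(\bar M) = \left\{\delta : \left|\delta_{t+1} - \delta_t\right| \le \bar M \cdot \max_{s < 0} \left|\delta_{s+1} - \delta_s\right| \;\; \forall\, t \ge 0\right\}$$`

The first bounds second differences (how fast the trend can bend); the second bounds post-period first differences by a multiple of the worst pre-period one.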

<a href="#honest-did-detail" class="nav-btn">visualization</a>

---

# Application: Staggered Zone-Notification Rollout

Three city cohorts adopt zone-notification at `$t=8, 12, 16$`. The true effect is positive and **larger for earlier adopters** (these cities had more underserved zones). What does each estimator say?

|Estimator                 | Estimate|
|:-------------------------|--------:|
|Truth                     |    1.425|
|TWFE (biased)             |    0.822|
|Sun-Abraham               |    1.405|
|Callaway-Sant'Anna (hand) |    1.405|

TWFE is downward-biased: early cohorts (with the largest effects) get re-used as controls for later cohorts, dragging the estimate down. CS and SA land within about 0.02 of the truth.

---

# Part 2 — Synthetic Control

---

# When You Have *One* Treated Unit

A single city introduces a policy (e.g., a minimum-wage floor for drivers). No randomization, no comparable control city. **Build one from a donor pool.**

No single donor matches. A **convex combination** might.

---

# Synthetic Control: The Estimator

Pick weights `$w_j \ge 0$` with `$\sum_j w_j = 1$` to match the treated unit's *pre-treatment* outcomes — a **convex combination** of donors:

`$$\min_{w} \sum_{t < T_0} \!\left( Y_{1t} - \sum_{j \ge 2} w_j Y_{jt} \right)^{\!2} \;\;\text{s.t.}\;\; w_j \ge 0,\;\; \sum_j w_j = 1$$`

The blue dashed line is the synthetic counterfactual: `$\sum_j \hat w_j Y_{jt}$` where the `$\hat w$` above are 0.33, 0.18, 0.28, 0.14, 0.03, 0.04. <a href="#sc-code" class="inline-btn">code</a>

---

# Synthetic Control: Inference via Placebos

Conventional standard errors don't apply: we have one treated unit. Instead: **placebo-in-space**. Re-run the procedure pretending each *donor* is the treated unit. If the true treated unit's gap is unusually large vs the placebo distribution, that's evidence of an effect.

The treated gap (red) drops below the placebo cloud after `$t=21$`, which is consistent with a real effect. With the six donors weighted above, though, the smallest attainable permutation p-value is `$1/7 \approx 0.14$`.
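One common way to turn the placebo picture into a number is the post/pre RMSPE ratio. The sketch below uses simulated gaps as stand-ins (including a built-in effect of 8 units); in practice each row would come from re-fitting synthetic control with that unit treated as the "treated" one. All sizes and names are hypothetical.

```r
# Placebo-in-space p-value from a matrix of gaps (actual minus synthetic).
set.seed(1)
n_units <- 7; n_t <- 30; t_treat <- 21          # hypothetical sizes
gaps <- matrix(rnorm(n_units * n_t, sd = 1), nrow = n_units,
               dimnames = list(c("treated", paste0("donor", 1:6)), NULL))
gaps["treated", t_treat:n_t] <- gaps["treated", t_treat:n_t] - 8  # built-in effect

rmspe <- function(x) sqrt(mean(x^2))
pre  <- 1:(t_treat - 1)
post <- t_treat:n_t

# Post/pre RMSPE ratio: large when the post-period gap dwarfs pre-period fit error.
ratio <- apply(gaps, 1, function(g) rmspe(g[post]) / rmspe(g[pre]))

# Share of units (treated included) with a ratio at least as extreme.
p_value <- mean(ratio >= ratio["treated"])
```

Dividing by the pre-period RMSPE keeps donors with a poor pre-treatment fit from dominating the placebo distribution. Even when the treated unit ranks first, the p-value cannot go below one over the number of units.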

---

# Synthetic DiD: The Bridge

Arkhangelsky et al. (2021) generalize both DiD and SC. **Synthetic DiD** uses *two* sets of weights:

.blue-box[
- **Unit weights** `$\hat\omega_i$` work like SC: match pre-treatment trajectories of donors to the treated unit.
- **Time weights** `$\hat\lambda_t$` are fit on the *control* units so that a weighted average of pre-treatment periods tracks the post-treatment average, down-weighting periods that don't look like "now".
]

.small[
`$$\widehat{\tau}_{\text{SDID}} = \arg\min_{\tau, \mu, \alpha, \beta} \sum_{i,t} (Y_{it} - \mu - \alpha_i - \beta_t - \tau D_{it})^2 \, \hat\omega_i \, \hat\lambda_t$$`
]

Why this matters:
- **DiD** uses uniform weights → biased when donors don't match.
- **SC** uses unit weights only → can over-fit pre-period noise.
- **SDID** uses both → robust to both. In the empirical comparisons of Arkhangelsky et al., lower MSE than either alone.

<a href="#sdid-detail" class="nav-btn">SDID estimator detail</a>

---

# Part 3 — Causal Forest

---

# Why HTE — Beyond the ATE

ATE answers: "should we ship?" HTE answers: "**to whom**?"

.highlight-box[
The zone-notification feature has $\widehat{\text{ATE}} = +\$50$/wk. But the effect probably varies — by city density, driver tenure, time-of-week patterns. **Knowing the heterogeneity lets you target rollout, set personalized policies, and forecast aggregate impact under different deployment plans.**
]

Classic approach: pre-specify subgroups, run interactions. Two problems:
- **Multiple testing** — interview question from M6.
- **Misspecification** — the right subgroups aren't always the obvious ones.

**Causal forest** (Wager & Athey 2018, Athey-Tibshirani-Wager 2019): non-parametric estimation of `$\tau(x) = E[Y(1) - Y(0) \mid X = x]$` using an ensemble of *honest* trees.

---

# Causal Forest — How It Works

.pull-left[
**Honesty.** For each tree, the forest:

1. Splits the tree's subsample into two halves.
2. Uses one half to *grow* the tree (decide where to split).
3. Uses the other half to *estimate* the treatment effect within each leaf.

This separation prevents the same observation from informing both the split rule and the estimate inside it, which removes the over-fitting bias that plagues vanilla trees.
]

.pull-right[
**Forest-level prediction** for a new `$x$`:
- Each tree drops `$x$` down to a leaf.
- The leaf defines a *local neighborhood* — observations with similar covariates.
- `$\hat\tau(x)$` = a *weighted* treated-vs-control contrast across that neighborhood (in `grf`, a residual-on-residual regression), with weights coming from how often training points share a leaf with `$x$`.

The result has pointwise confidence intervals — Wager & Athey prove asymptotic normality.
]

<a href="#cf-splitting-detail" class="nav-btn">splitting algorithm</a>

---

# Causal Forest in Practice — `grf`

```r
library(grf)
cf <- causal_forest(X = as.matrix(X), Y = Y, W = W, num.trees = 1000)
average_treatment_effect(cf)                    # ATE estimate + SE
```

```
## estimate  std.err 
## 42.94805  2.67375
```

Predicted `$\hat\tau(x)$` rises with density; the forest recovers the truth (dashed). <a href="#cf-dgp" class="inline-btn">DGP code</a>

---

# Policy Learning — A Quick Teaser

Once you have `$\hat\tau(x)$`, you can learn a **treatment rule** `$\pi(x) \in \{0, 1\}$` that maximizes welfare:

`$$\pi^*(x) = \mathbb{1}\{\hat\tau(x) > c(x)\}$$`

where `$c(x)$` is a per-unit cost (e.g., the cost of pushing a notification).
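A minimal base-R sketch of the plug-in rule. Here `tau_hat` is a stand-in formula (the true CATE from this deck's causal-forest DGP in the backup slides), not a fitted forest, and the flat cost of 40 is a made-up number.

```r
# Plug-in rule pi(x) = 1{tau_hat(x) > c(x)} on simulated covariates.
set.seed(42)
n       <- 4000
tenure  <- rexp(n, rate = 1/2)
density <- runif(n)
tau_hat <- 30 + 60 * density - 10 * pmin(tenure, 4)   # stand-in for forest predictions
cost    <- rep(40, n)                                 # hypothetical per-unit cost

treat <- tau_hat > cost

# Average net gain per driver under three deployment plans.
gain_rule <- mean((tau_hat - cost) * treat)   # targeted rollout
gain_all  <- mean(tau_hat - cost)             # ship to everyone
gain_none <- 0                                # ship to no one
```

By construction the targeted rule can never do worse than either blanket plan on the predicted effects. With a fitted forest the comparison is noisier, because `tau_hat` itself is estimated.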

.blue-box[
- **`policytree::policy_tree`** (a companion package to `grf`): fit a shallow decision tree over the covariates that maximizes estimated welfare. Output: an interpretable rule like *"deploy in dense cities to drivers with <2 years tenure."*
- **Athey & Wager (2021)** prove these are **near-optimal policies**: the learned rule's welfare gets within a vanishing-with-sample-size gap of the best policy *in the chosen class* (here, shallow trees). So the rule isn't just interpretable; it's provably close to the best rule of that form you could have chosen if you knew the truth.
]

This is what gets used to *deploy* the model. ATE answers "ship?". HTE answers "to whom?". Policy learning answers "**what should we actually do?**"

---

# Part 4 — One-Pagers

---

# Bandits — Adaptive Experimentation in One Slide

.small[
**Setup:** K arms (e.g., 4 ad creatives). At each step, pick one, observe reward. Goal: maximize cumulative reward, not just identify the best arm.

**Thompson Sampling** is the workhorse: maintain a posterior over each arm's reward, sample from it, play the argmax. Asymptotically optimal regret; minimal tuning.
]

.small[
**When *not* to use:** when you need an unbiased ATE for a fixed-population decision (a launch / kill call). Bandits make assignment correlated with outcome history — naive analysis is biased.
]
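A minimal Beta-Bernoulli Thompson Sampling loop in base R, as a sketch of the mechanics described above. The four click-through rates are made up for illustration.

```r
# Beta-Bernoulli Thompson Sampling with K = 4 arms (base R).
set.seed(7)
true_rate <- c(0.04, 0.05, 0.06, 0.10)   # hypothetical arm reward rates
K <- length(true_rate); n_steps <- 10000

alpha <- rep(1, K)   # Beta(1, 1) prior: 1 + successes
beta  <- rep(1, K)   #                   1 + failures
pulls <- integer(K)

for (step in seq_len(n_steps)) {
  draw   <- rbeta(K, alpha, beta)      # one posterior sample per arm
  arm    <- which.max(draw)            # play the arm whose sample is largest
  reward <- rbinom(1, 1, true_rate[arm])
  alpha[arm] <- alpha[arm] + reward
  beta[arm]  <- beta[arm] + (1 - reward)
  pulls[arm] <- pulls[arm] + 1
}

pulls                                          # traffic per arm
naive_means <- (alpha - 1) / pmax(pulls, 1)    # per-arm sample means
```

Traffic concentrates on the best arm, which is the point. It is also why the weaker arms end up with few observations, and why their naive sample means are a poor basis for an unbiased effect estimate.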

---

# Sequential Testing in One Slide

.small[
**Problem:** classical p-values are valid only at one fixed sample size. Peeking inflates Type-I error: with repeated looks, the running p-value dips below 0.05 at some point far more than 5% of the time *under the null* (26.5% in the simulation below).

**Two fixes used in industry:**
- **Group-sequential / O'Brien-Fleming:** spend the alpha budget on a pre-specified schedule of looks. Conservative early, normal late.
- **Always-valid CIs / mSPRT:** mixture sequential probability ratio test. Confidence sequences that hold *uniformly over time*. Stop whenever you want; CIs stay valid.
]

```
## [1] 0.265
```

.small[
Under the null, naive peeking yields a 26.5% rejection rate here, not 5%. mSPRT or OBF would hold this at or below 5%.
]
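A simulation along these lines shows the inflation. It is a sketch, not necessarily the code that produced the number above; batch size, number of looks, and number of simulations are arbitrary choices.

```r
# Type-I error under naive peeking: z-test after every batch, reject at first p < 0.05.
set.seed(123)
n_sims <- 2000; n_looks <- 20; batch <- 100

ever_rejected <- replicate(n_sims, {
  x <- rnorm(n_looks * batch)                     # null is true: mean 0, sd 1
  n <- seq(batch, n_looks * batch, by = batch)    # sample size at each look
  z <- cumsum(x)[n] / sqrt(n)                     # running z-statistic
  any(2 * pnorm(-abs(z)) < 0.05)
})
peeking_rate <- mean(ever_rejected)

# Benchmark: a single look at the final sample size.
final_rate <- mean(replicate(n_sims, {
  z <- sum(rnorm(n_looks * batch)) / sqrt(n_looks * batch)
  2 * pnorm(-abs(z)) < 0.05
}))
```

The more looks you take, the higher `peeking_rate` climbs, while `final_rate` stays near the nominal 5%.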

<a href="#mSPRT-detail" class="nav-btn">mSPRT formula</a>

---

# Part 5 — Wrap-Up

---

# Decision Tree: Which Method When

**Staggered rollout, multiple cities, multiple time periods?**
- Heterogeneous effects expected? → **Callaway-Sant'Anna** (or Sun-Abraham). Avoid TWFE.
- Worried about parallel trends? → **Honest DiD** sensitivity bounds.

**Single treated unit?** → **Synthetic control** (or **SDID** if you have a moderate-size donor pool).

**Want effect heterogeneity?** → **Causal forest** for treatment effects as a function of covariates; **policy tree** for an interpretable deployment rule.

**Many arms, online learning?** → **Thompson Sampling**, *unless* you need a clean ATE.

**Repeated looks at one experiment?** → **mSPRT / always-valid CIs**.

---

# Interview Cheat Sheet

.blue-box[
**"Walk me through how you'd analyze a staggered rollout."** Three steps. (1) Plot raw outcomes by cohort and event time — does parallel trends look plausible? (2) Don't run TWFE; run Callaway-Sant'Anna and report the event-study plot. (3) Sensitivity-check with Honest DiD and a robustness column from Sun-Abraham.
]

.blue-box[
**"How do you estimate a causal effect for one city?"** Synthetic control. Build weights from a donor pool to match pre-treatment outcomes. Inference via placebo-in-space. Mention SDID as the modern improvement.
]

.blue-box[
**"How do you get heterogeneous treatment effects?"** Causal forest from `grf`. Honest splitting prevents over-fit. Aggregate to ATE for sanity-check; predict per-unit treatment effects from the covariates; feed into a policy tree if a deployment rule is needed.
]

.highlight-box[
**Red flag in interviews:** running TWFE on staggered data without a Goodman-Bacon diagnostic.
]

---

# Going Deeper

This module is a tour, not a treatment. Each of the three core methods has a course's worth of material behind it.

.blue-box[
**Companion course (in development):** *Causal Inference Beyond A/B Tests* — a deep dive into modern DiD (formal CS / SA / BJS estimators, full Honest DiD), synthetic control variants (augmented SC, generalized SC, matrix completion), and the Athey-Wager causal-forest stack (policy learning, contextual bandits via causal trees).

Repo will live alongside this one at `~/Desktop/sandbox/courses/causal-inference-beyond-ab/` and link from the same landing page.
]

For the read-now references behind today's slides:

.small[
- **Goodman-Bacon (2021)** — *Difference-in-differences with variation in treatment timing*, J. Econometrics.
- **Callaway & Sant'Anna (2021)** — *Difference-in-differences with multiple time periods*, J. Econometrics.
- **Arkhangelsky et al. (2021)** — *Synthetic difference in differences*, AER.
- **Wager & Athey (2018)** — *Estimation and inference of heterogeneous treatment effects using random forests*, JASA.
- **Rambachan & Roth (2023)** — *A more credible approach to parallel trends*, ReStud.
]

---

# Backup Slides

---
name: pretrend-power-detail

# Backup: The Roth (2022) Pretrend Critique

**The standard pretrend test** has low power against linear (or near-linear) violations — exactly the kind that would bias the post-treatment estimate.

A meaningful linear violation is missed most of the time. Roth's recommendation: use Honest DiD bounds, not the pretrend test.
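A stylized calculation illustrates the point. Every number here is hypothetical: three pre-period event-study coefficients at `k = -4, -3, -2` (with `k = -1` as reference), treated as independent with standard error 1, and a differential trend of 0.4 standard errors per period. Real event-study coefficients are correlated, so this is only a rough guide.

```r
# Analytic power of a per-coefficient pretrend check against a linear violation.
slope <- 0.4
k_pre <- c(-4, -3, -2)
mu    <- slope * (k_pre + 1)          # bias of each pre-period coefficient
crit  <- qnorm(0.975)

# P(|z_k| <= crit) for each coefficient, then P(at least one rejects).
p_accept <- pnorm(crit - mu) - pnorm(-crit - mu)
power    <- 1 - prod(p_accept)

# Bias the same trend would put on a post-period coefficient at k = +3.
bias_post <- slope * (3 + 1)
```

Under these assumptions the check flags the trend well under half the time, even though the same trend shifts the `k = +3` estimate by 1.6 standard errors.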

---
name: staggered-dgp-detail

# Backup: Staggered-Adoption DGP

```r
library(tidyverse)   # tibble(), expand_grid(), left_join(), mutate()
n_cities <- 12; n_t <- 24
cohorts <- tibble(
  city = 1:n_cities,
  g = c(rep(8, 3), rep(12, 3), rep(16, 3), rep(Inf, 3))
)
panel <- expand_grid(city = 1:n_cities, t = 1:n_t) |>
  left_join(cohorts, by = "city") |>
  mutate(
    treated = t >= g,
    # heterogeneous, time-varying effect: bigger for earlier cohorts
    eff = if_else(treated, 0.6 * (1 + 0.15 * (t - g)) *
                            (1 + 0.7 * (g == 8) - 0.7 * (g == 16)), 0),
    y = 5 + 0.05 * t + city * 0.1 + eff + rnorm(n(), 0, 0.3)
  )
```

Three cohorts adopt at `$t = 8, 12, 16$`; three never adopt. Effect grows over time and is **larger for earlier cohorts** — the classic case where TWFE breaks.
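The noise-free average effect implied by this DGP can be computed directly. This is a base-R sketch; it assumes the "Truth" figure reported in the application table is the average effect across treated city-periods, which this calculation appears to reproduce (about 1.425).

```r
# Noise-free effect grid implied by the DGP above (no simulation needed).
n_t   <- 24
g_vec <- c(8, 12, 16)                       # adoption dates of the treated cohorts
grid  <- expand.grid(t = 1:n_t, g = g_vec)
grid  <- grid[grid$t >= grid$g, ]           # keep treated cohort-periods only

grid$eff <- 0.6 * (1 + 0.15 * (grid$t - grid$g)) *
  (1 + 0.7 * (grid$g == 8) - 0.7 * (grid$g == 16))

# Cohorts are equal-sized (3 cities each), so a plain mean over cohort-periods
# equals the mean over treated city-periods.
true_att      <- mean(grid$eff)
att_by_cohort <- tapply(grid$eff, grid$g, mean)
```

`att_by_cohort` makes the heterogeneity explicit: the earliest cohort's average effect is several times that of the latest.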

---
name: goodman-bacon-derivation

# Backup: The Goodman-Bacon Decomposition

For TWFE with staggered timing, the OLS coefficient `$\hat\beta_{\text{TWFE}}$` decomposes as:

`$$\hat\beta_{\text{TWFE}} = \sum_{k} s_k \, \widehat{\text{ATT}}_k$$`

where each `$\widehat{\text{ATT}}_k$` is a 2×2 DiD between two cohorts (or a cohort and the never-treated), and the weights `$s_k$` depend on:

- size of each cohort,
- timing of treatment,
- variance of treatment exposure within the panel.

.highlight-box[
**The bug:** when an *earlier-treated* cohort is used as a comparison group for a *later-treated* one, the post-period for the comparison cohort is also under treatment. So that 2×2 has the form `$(\Delta Y_{\text{treated}}) - (\Delta Y_{\text{also treated, but longer ago}})$`, and the second piece subtracts a *treatment-effect trend*, not a counterfactual.

If treatment effects grow over time, this 2×2 is biased *downward* and can even turn negative. The weights are all positive, so TWFE averages the contaminated 2×2 in anyway and `$\hat\beta$` is pulled below the true average effect.
]

---
name: honest-did-detail

# Backup: Honest DiD Visualization

The robust CI grows as you allow more violation. The point estimate stays — only the uncertainty changes.

The smallest `$M$` at which the CI crosses zero is the **breakdown value** — how much PT violation you'd need to tolerate before your conclusion flips.

---
name: sc-code

# Backup: Synthetic Control via `solve.QP`

```r
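library(quadprog)
# Inputs assumed to exist (not shown here): donors is a T x J matrix of donor
# outcomes, treated_y a length-T vector, t_treat the first treated period.
n_donor <- ncol(donors)
pre <- 1:(t_treat - 1)
A <- donors[pre, ]; b <- treated_y[pre]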

# Quadratic program:
#   minimize  || A w - b ||^2     subject to  w_j >= 0,  sum_j w_j = 1
Dmat <- 2 * t(A) %*% A
dvec <- 2 * t(A) %*% b
Amat <- cbind(rep(1, n_donor), diag(n_donor))   # constraints stacked
bvec <- c(1, rep(0, n_donor))                   # 1 for sum=1, 0 for w >= 0
w_hat   <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
synth_y <- as.numeric(donors %*% w_hat)
gap     <- treated_y - synth_y
```

`solve.QP` minimizes `$\frac{1}{2} w^\top D w - d^\top w$` subject to `$A^\top w \ge b$`, with `meq` flagging the first constraint as equality. Production code uses `Synth::synth()`, which adds a covariate-importance V-matrix and a nested optimization.

---
name: sdid-detail

# Backup: SDID Estimator Mechanics

The SDID estimator solves:

`$$(\hat\tau, \hat\mu, \hat\alpha, \hat\beta) = \arg\min \sum_{it} (Y_{it} - \mu - \alpha_i - \beta_t - \tau D_{it})^2 \, \hat\omega_i \, \hat\lambda_t$$`

where `$\hat\omega_i$` are unit weights from a regularized SC-style fit, and `$\hat\lambda_t$` are time weights from an analogous time-domain fit.

.blue-box[
**Why the regularization matters.** SC's unit weights can over-fit pre-period noise on a short pre-period — picking a donor combination that matches noise rather than the underlying trend. SDID adds an `$L_2$` penalty on the weights, smoothing the solution and reducing variance.

**Why the time weights matter.** They down-weight pre-periods that don't resemble the treatment period, focusing the comparison on "comparable" history. This is what makes SDID robust to the donor-pool drift that breaks SC.
]

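R implementation: `synthdid::synthdid_estimate()`. The package also provides standard errors via `vcov()` (jackknife, bootstrap, or placebo; only the placebo method is available with a single treated unit) and `sc_estimate()` / `did_estimate()` for an SC vs DiD vs SDID comparison plot.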

---
name: cs-code

# Backup: Hand-Coded Callaway-Sant'Anna

This is the simplest version: each cohort vs never-treated, anchored at `$t = g - 1$`. Production CS adds: not-yet-treated (vs only never-treated) controls, doubly-robust adjustment, and influence-function based standard errors. The `did` package implements all three.
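A base-R sketch of that simplest version. The panel re-creates the staggered-adoption DGP from the earlier backup slide with an arbitrary seed, so the numbers will not match the table in the main slides exactly.

```r
# Hand-coded ATT(g, t): each cohort vs never-treated, base period g - 1.
set.seed(2024)
panel <- expand.grid(city = 1:12, t = 1:24)
panel$g <- rep(c(8, 12, 16, Inf), each = 3)[panel$city]
k   <- panel$t - panel$g                                   # event time
eff <- ifelse(k >= 0, 0.6 * (1 + 0.15 * k) *
                (1 + 0.7 * (panel$g == 8) - 0.7 * (panel$g == 16)), 0)
panel$y <- 5 + 0.05 * panel$t + 0.1 * panel$city + eff + rnorm(nrow(panel), 0, 0.3)

cell_mean <- function(cohort, time) {
  mean(panel$y[panel$g == cohort & panel$t == time])
}

att_gt <- do.call(rbind, lapply(c(8, 12, 16), function(g) {
  do.call(rbind, lapply(g:24, function(t) {
    d_treated <- cell_mean(g, t)   - cell_mean(g, g - 1)     # cohort's change
    d_never   <- cell_mean(Inf, t) - cell_mean(Inf, g - 1)   # never-treated change
    data.frame(g = g, t = t, event_time = t - g, att = d_treated - d_never)
  }))
}))

# Aggregate: simple average over all post-treatment (g, t) cells
# (equal cohort sizes, so every treated city-period gets equal weight).
att_overall <- mean(att_gt$att)
att_by_g    <- tapply(att_gt$att, att_gt$g, mean)
```

Plotting `att` against `event_time` by cohort gives the event-study picture. Because every comparison uses never-treated cities only, no already-treated cohort ever serves as a control.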

---
name: cf-dgp

# Backup: Causal-Forest DGP

```r
library(grf)
library(tibble)   # tibble()
# DGP: zone-notification HTE depends on driver tenure + city density
n <- 4000
X <- tibble(tenure = rexp(n, rate = 1/2),    # years
            density = runif(n, 0, 1),        # city density 0-1
            age = sample(20:65, n, replace = TRUE))
W <- rbinom(n, 1, 0.5)
# Heterogeneous true effect: rises with density, falls with tenure (capped)
true_tau <- 30 + 60 * X$density - 10 * pmin(X$tenure, 4)
Y <- 700 + 50 * X$age * 0.5 + true_tau * W + rnorm(n, 0, 80)
```

The HTE pattern: drivers in dense cities benefit most ($+\$60$/wk per density unit); long-tenured drivers benefit less (already routing efficiently). The forest must recover this without us specifying the functional form.

---
name: cf-splitting-detail

# Backup: Causal-Forest Splitting Algorithm

For each candidate split `$(j, c)$` on covariate `$j$` at threshold `$c$`, define left/right child sets.

**Vanilla regression-tree criterion:** maximize variance of the *outcome mean* across the two children — i.e., split where `$Y$` differs most.

**Causal-forest criterion** (Athey, Tibshirani & Wager 2019): maximize the **heterogeneity of the treatment effect** across children:

`$$\Delta(j, c) = n_L \, n_R \, (\hat\tau_L - \hat\tau_R)^2 / n$$`

where `$\hat\tau_L, \hat\tau_R$` are local treatment-effect estimates inside each child. The `grf` implementation uses a fast gradient-based approximation rather than evaluating `$\hat\tau$` exactly at every candidate split.

.highlight-box[
**Honesty:** within a tree, sample-splitting separates the data used to *grow* the tree from the data used to *estimate `$\hat\tau$` inside leaves*. Without this, the same noise that drove the split would inflate the estimated effect inside the leaf — exactly the over-fit that vanilla trees suffer.
]
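A toy version of one honest split on a single covariate, in base R. It evaluates the criterion `$n_L n_R (\hat\tau_L - \hat\tau_R)^2 / n$` exactly on a grid of thresholds, whereas a real forest repeats this over many subsamples, covariates, and depths. The step-function DGP and all names are assumptions made for this illustration.

```r
# One honest split: choose the threshold on one half, estimate effects on the other.
set.seed(99)
n <- 4000
density <- runif(n)
W <- rbinom(n, 1, 0.5)                                  # randomized treatment
Y <- 700 + (30 + 60 * (density > 0.5)) * W + rnorm(n, 0, 80)

half   <- sample(rep(c("grow", "estimate"), length.out = n))
tau_in <- function(idx) mean(Y[idx & W == 1]) - mean(Y[idx & W == 0])

# Step 1 (grow half): pick the threshold that maximizes effect heterogeneity.
grow  <- half == "grow"
cands <- seq(0.1, 0.9, by = 0.05)
crit  <- sapply(cands, function(c0) {
  L <- grow & density <= c0
  R <- grow & density >  c0
  sum(L) * sum(R) * (tau_in(L) - tau_in(R))^2 / sum(grow)
})
split_at <- cands[which.max(crit)]

# Step 2 (estimate half): leaf effects use only observations not used for splitting.
est       <- half == "estimate"
tau_left  <- tau_in(est & density <= split_at)
tau_right <- tau_in(est & density >  split_at)
```

Because the estimation half played no part in choosing `split_at`, the leaf estimates are not inflated by the noise that made this split look attractive.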

---
name: mSPRT-detail

# Backup: mSPRT — Always-Valid Inference

The mSPRT (mixture sequential probability ratio test) constructs a *confidence sequence* `$\{(L_t, U_t)\}$` such that:

`$$P\!\left(\theta \in (L_t, U_t)\;\;\forall\, t\right) \ge 1 - \alpha$$`

The interval is *uniformly* valid — you can stop at any time.

The construction (Howard et al. 2021): for a Gaussian outcome with variance `$\sigma^2$` and a `$N(0, \rho^2\sigma^2)$` mixing distribution over the effect, the running CI for `$\theta = E[X]$` is

`$$\hat\theta_t \pm \sigma \sqrt{\frac{(1 + t\rho^2)\,\bigl(2 \log(1/\alpha) + \log(1 + t \rho^2)\bigr)}{t^2 \rho^2}}$$`

where `$\rho$` is the chosen mixing scale, in units of `$\sigma$`. The CI shrinks like `$\sqrt{(\log t) / t}$`, slightly slower than the fixed-`$n$` rate `$\sqrt{1/t}$`: the price of always-valid coverage.
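A quick base-R check of uniform coverage, as a sketch. It uses the normal-mixture radius `$\sigma\sqrt{(1+t\rho^2)(2\log(1/\alpha)+\log(1+t\rho^2))/(t^2\rho^2)}$`, treats `$\sigma$` as known, and picks the true mean, horizon, and `$\rho = 1$` arbitrarily.

```r
# Normal-mixture confidence-sequence radius and a uniform-coverage simulation.
cs_radius <- function(t, sigma = 1, rho = 1, alpha = 0.05) {
  v <- 1 + t * rho^2
  sigma * sqrt(v * (2 * log(1 / alpha) + log(v)) / (t^2 * rho^2))
}

set.seed(11)
n_sims <- 500; horizon <- 1000; theta <- 0.3     # hypothetical true mean
t_seq <- seq_len(horizon)

always_covered <- replicate(n_sims, {
  x <- rnorm(horizon, mean = theta, sd = 1)
  running_mean <- cumsum(x) / t_seq
  all(abs(running_mean - theta) <= cs_radius(t_seq))   # covered at every t
})

uniform_coverage <- mean(always_covered)
fixed_n_radius   <- qnorm(0.975) / sqrt(horizon)       # fixed-n CI half-width at t = horizon
```

`uniform_coverage` counts a path as covered only if the interval contains `theta` at *every* `t`, which is the guarantee a fixed-`n` interval does not give. The cost shows up in `cs_radius(horizon)` being wider than `fixed_n_radius`.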

.blue-box[
**Used in industry by:** Optimizely (mSPRT), Microsoft (group-sequential + mSPRT), Netflix (sequential testing platform).
]