class: center, middle, inverse, title-slide .title[ # Module 5: Analyzing Experiments ] .subtitle[ ## From Difference in Means to CUPED, ITT, and LATE ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .small { font-size: 80%; } .tiny { font-size: 65%; } .highlight-box { background: #fff3e0; border-left: 4px solid #e65100; padding: 0.5em 1em; margin: 0.5em 0; } .blue-box { background: #e3f2fd; border-left: 4px solid #1565c0; padding: 0.5em 1em; margin: 0.5em 0; } .inline-btn { font-size: 11px; background: #e8eaf6; padding: 2px 8px; border-radius: 3px; text-decoration: none; color: #1a237e; margin-right: 6px; vertical-align: middle; } .inline-btn:hover { background: #c5cae9; } .nav-btn { position: absolute; bottom: 12px; left: 40px; font-size: 11px; background: #e8eaf6; padding: 2px 8px; border-radius: 3px; z-index: 100; text-decoration: none; color: #1a237e; } .nav-btn:hover { background: #c5cae9; } .nav-btn-br { position: absolute; bottom: 12px; right: 70px; font-size: 11px; background: #e8eaf6; padding: 2px 8px; border-radius: 3px; z-index: 100; text-decoration: none; color: #1a237e; } .nav-btn-br:hover { background: #c5cae9; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">The Experimental Ideal</a></td><td>✓ done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">SUTVA and When It Breaks</a></td><td>✓ done</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Designing Around Interference</a></td><td>✓ done</td></tr> <tr><td>4</td><td>Power and Sample Size</td><td>done</td></tr> <tr><td><b>5</b></td><td><b>Analyzing Experiments</b> <i>(you are here)</i></td><td>current</td></tr> <tr><td>6</td><td>Multiple Testing & Subgroups</td><td>upcoming</td></tr> <tr><td>7</td><td><a href="../module-07/slides.html">External Validity</a></td><td>✓ done</td></tr> <tr><td>8</td><td><a 
href="../module-08/slides.html">Beyond the A/B Test</a></td><td>✓ done</td></tr> </table> --- # Start Simple: Difference in Means .small[ Same zone-notification experiment as Module 1, now with **weekly earnings** as the primary outcome (continuous — easier to demonstrate variance reduction). ```r n <- 2000 drivers <- tibble( # ~$500/wk baseline; +$50/wk effect id = 1:n, pre_earnings = rlnorm(n, meanlog = 6.2, sdlog = 0.4), notification = rep(c(0, 1), each = n / 2), post_earnings = pre_earnings + 50 * notification + rnorm(n, sd = 80) ) ``` ```r drivers |> group_by(notification) |> summarise(mean_earn = mean(post_earnings)) |> summarise(ate = mean_earn[notification == 1] - mean_earn[notification == 0]) ``` ``` ## # A tibble: 1 × 1 ## ate ## <dbl> ## 1 51.6 ``` Unbiased for the ATE. But the standard error is large because `post_earnings` is noisy. ] --- name: ols-equals-dim-main # Regression: Same Thing, More Flexible ```r fit_simple <- lm(post_earnings ~ notification, data = drivers) summary(fit_simple)$coefficients |> round(3) ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 528.287 7.480 70.624 0 ## notification 51.567 10.579 4.875 0 ``` -- The coefficient on `notification` is identical to the difference in means. -- .blue-box[ **Key fact:** In a randomized experiment with no covariates, OLS regression of `\(Y\)` on `\(D\)` gives exactly the difference in means. The regression framework just makes it easier to add covariates and compute standard errors. ] <a href="#ols-equals-dim" class="nav-btn">proof</a> --- # Adding Covariates: Why and How Pre-treatment earnings (`pre_earnings`) predict post-treatment earnings. Adding it as a covariate reduces residual variance: ```r fit_covar <- lm(post_earnings ~ notification + pre_earnings, data = drivers) summary(fit_covar)$coefficients |> round(3) ``` ``` ## Estimate Std. 
Error t value Pr(>|t|)
## (Intercept)    -3.265      5.020  -0.650    0.516
## notification   48.496      3.613  13.421    0.000
## pre_earnings    1.006      0.008 122.999    0.000
```

--

.pull-left[
**Without covariate:**
```
## SE(notification): 10.58
```
]

.pull-right[
**With covariate:**
```
## SE(notification): 3.61
```
]

--

.highlight-box[
The standard error dropped substantially. Same data, same experiment, but a **tighter** estimate — just by using information we already had.
]

---

# How Much Does the Covariate Help?

Vary the SD of pre-treatment earnings (holding the post-treatment noise fixed at `\(\sigma_\varepsilon = 80\)`) and re-fit both regressions:

<img src="slides_files/figure-html/se-vs-pre-1.png" style="display: block; margin: auto;" />

.small[
The unadjusted regression treats *all* the variance of `post_earnings` as noise. The adjusted regression strips out the part explained by `pre_earnings`, so only `\(\sigma_\varepsilon\)` remains. The wider the gap, the more variance reduction you buy.
]

---

# Lin (2013): Covariate Adjustment Done Right

Simply adding covariates to a regression can be inefficient (and slightly biased in small samples) when treatment effects are heterogeneous. Lin (2013) showed the right way:

--

.blue-box[
**Lin's estimator:** Interact the (demeaned) covariate with treatment:

`$$Y_i = \alpha + \tau D_i + \beta (X_i - \bar{X}) + \gamma D_i (X_i - \bar{X}) + \varepsilon_i$$`

Asymptotically, this is **never less precise** than the simple difference in means, regardless of whether the covariate matters.
]

--

.small[
```r
drivers <- drivers |>
  mutate(pre_earnings_dm = pre_earnings - mean(pre_earnings))

fit_lin <- lm(post_earnings ~ notification * pre_earnings_dm, data = drivers)
summary(fit_lin)$coefficients |> round(3)
```

```
## Estimate Std.
Error t value Pr(>|t|)
## (Intercept)                   529.798      2.553  207.498    0.000
## notification                   48.496      3.611   13.430    0.000
## pre_earnings_dm                 0.990      0.011   86.293    0.000
## notification:pre_earnings_dm    0.032      0.016    1.940    0.053
```
]

The treatment effect estimate is the coefficient on `notification`.

---

# CUPED: Variance Reduction Using Pre-Experiment Data

.small[
**CUPED** (Controlled-experiment Using Pre-Experiment Data): Use the pre-treatment outcome to reduce variance of the treatment effect estimate. Widely used at Microsoft, Netflix, and other tech companies.

The idea: construct a variance-reduced outcome

`$$\tilde{Y}_i = Y_i - \theta (X_i - \bar{X})$$`

where `\(X_i\)` is the pre-treatment outcome and `\(\theta\)` is the OLS slope of `\(Y\)` on `\(X\)`, i.e. `\(\operatorname{Cov}(Y, X) / \operatorname{Var}(X)\)`.

```r
theta <- cov(drivers$post_earnings, drivers$pre_earnings) /
  var(drivers$pre_earnings)

drivers <- drivers |>
  mutate(y_cuped = post_earnings - theta * (pre_earnings - mean(pre_earnings)))

cat("Var(post_earnings):", round(var(drivers$post_earnings), 0), "\n")
```

```
## Var(post_earnings): 56591
```

```r
cat("Var(y_cuped):      ", round(var(drivers$y_cuped), 0), "\n")
```

```
## Var(y_cuped): 7110
```

```r
cat("Variance reduction:", round(1 - var(drivers$y_cuped)/var(drivers$post_earnings), 3))
```

```
## Variance reduction: 0.874
```
]

---

name: cuped-regression-main

# CUPED = Regression (It's the Same Thing)

```r
# CUPED estimate
cuped_est <- mean(drivers$y_cuped[drivers$notification == 1]) -
  mean(drivers$y_cuped[drivers$notification == 0])

# Regression with covariate
reg_est <- coef(fit_covar)["notification"]

cat("CUPED estimate: ", round(cuped_est, 2), "\n")
```

```
## CUPED estimate: 48.49
```

```r
cat("Regression estimate:", round(reg_est, 2), "\n")
```

```
## Regression estimate: 48.5
```

--

.highlight-box[
**CUPED is just regression adjustment repackaged.** The optimal `\(\theta\)` is the OLS coefficient of `\(Y\)` on `\(X\)`. The CUPED-adjusted difference in means equals the regression coefficient on treatment.
Why use CUPED framing? It's easier to implement in production pipelines (you transform the metric once, then compute simple means).
]

<a href="#cuped-fwl" class="nav-btn">FWL connection</a>

---

# Simulation: CUPED Variance Reduction

<img src="slides_files/figure-html/cuped-sim-1.png" style="display: block; margin: auto;" />

Both centered on the truth, but CUPED has a much narrower spread.

---

# Stratification and Post-Stratification

**Stratified randomization:** Randomize *within* blocks defined by important covariates. Guarantees exact balance.

--

```r
# Stratify by pre_earnings tercile
drivers_strat <- drivers |>
  mutate(stratum = ntile(pre_earnings, 3),
         # Within each stratum, assign half to treatment (1) and half to control (0).
         # ave() applies the function group-wise; we feed it row indices and ignore
         # them — the function just returns a shuffled 0/1 vector of the right length.
         notif_strat = ave(1:n(), stratum,
                           FUN = function(x) sample(rep(c(0, 1), length.out = length(x)))))

# Post-stratification: add stratum fixed effects
fit_strat <- lm(post_earnings ~ notif_strat + factor(stratum), data = drivers_strat)
summary(fit_strat)$coefficients[1:2, ] |> round(3)
```

```
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  345.451      6.521  52.972    0.000
## notif_strat    4.006      6.525   0.614    0.539
```

.small[
(The estimate is a null by construction: `notif_strat` is a *fresh* re-randomization, drawn after `post_earnings` was already generated under the original `notification` assignment, so it has no effect to find. The snippet illustrates the mechanics of stratified assignment and stratum fixed effects.)
]

--

.highlight-box[
**Post-stratification** adjusts for strata after the fact (even if randomization wasn't stratified). It's a free variance reduction — always include randomization strata as fixed effects.
]

---

# Robust Standard Errors

OLS standard errors assume homoskedasticity. In experiments, we should use **heteroskedasticity-robust** (HC) standard errors.
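In practice this is usually one function call rather than hand-built matrices. A minimal sketch, assuming the `sandwich` and `lmtest` packages are installed (neither ships with base R), on hypothetical toy data:

```r
library(sandwich)  # vcovHC(): HC0-HC5 heteroskedasticity-robust covariance estimators
library(lmtest)    # coeftest(): coefficient table computed with a supplied vcov

# Hypothetical toy experiment: binary treatment d, covariate x
set.seed(42)
toy <- data.frame(d = rep(0:1, each = 50), x = rnorm(100))
toy$y <- 500 + 50 * toy$d + toy$x + rnorm(100, sd = 80)

fit <- lm(y ~ d + x, data = toy)
coeftest(fit, vcov = vcovHC(fit, type = "HC2"))  # HC2-robust inference
```

Packages like `estimatr` (`lm_robust`, which defaults to HC2) and `fixest` wrap the same estimators behind formula-level syntax.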
--

.small[
```r
# HC2 standard errors (recommended for experiments)
# Using the sandwich formula manually:
X <- model.matrix(fit_covar)
e <- residuals(fit_covar)
hat_vals <- hatvalues(fit_covar)

meat <- t(X) %*% diag(e^2 / (1 - hat_vals)) %*% X
bread <- solve(t(X) %*% X)
vcov_hc2 <- bread %*% meat %*% bread
se_hc2 <- sqrt(diag(vcov_hc2))

cat("OLS SE: ", round(summary(fit_covar)$coefficients[2, 2], 3), "\n")
```

```
## OLS SE: 3.613
```

```r
cat("HC2 SE: ", round(se_hc2[2], 3), "\n")
```

```
## HC2 SE: 3.614
```

(Here the two agree to three decimals because the simulated errors are homoskedastic; with real outcome data the gap can be material, and the robust version costs nothing.)
]

--

For **cluster-randomized** designs, cluster the standard errors at the level of randomization:

.small[
```r
# If randomized at city level (Module 3):
# lm(y ~ notification, data = df) with cluster-robust SEs at city level
# Or use fixest::feols(y ~ notification, cluster = ~city, data = df)
```
]

---

# ITT vs LATE: Non-Compliance

Not every driver assigned to the notification feature actually receives notifications — some have notifications disabled at the OS level, some opt out in-app.

.small[
```r
n <- 2000
experiment <- tibble(
  id = 1:n,
  assigned = rep(c(0, 1), each = n / 2),
  # 70% compliance: only 70% of those assigned actually receive notifications
  receives_notif = if_else(assigned == 1, rbinom(n, 1, prob = 0.70), 0L),
  # Effect only for compliers: +$60/wk
  post_earnings = rlnorm(n, meanlog = 6.2, sdlog = 0.4) +
    60 * receives_notif + rnorm(n, sd = 60)
)
```
]

--

.pull-left[
**ITT** (Intent-to-Treat):
.small[
```r
itt <- experiment |>
  group_by(assigned) |>
  summarise(m = mean(post_earnings)) |>
  pull(m) |> diff()
cat("ITT: $", round(itt, 1))
```

```
## ITT: $ 42
```
]
]

.pull-right[
**LATE** (Local Average Treatment Effect):
.small[
```r
compliance_rate <- experiment |>
  filter(assigned == 1) |>
  pull(receives_notif) |> mean()
late <- itt / compliance_rate
cat("LATE: $", round(late, 1), "| Compliance:", round(compliance_rate, 2))
```

```
## LATE: $ 59.8 | Compliance: 0.7
```
]
]

---

# ITT vs LATE: What to Report?
.blue-box[ **ITT** = effect of *assigning* treatment (rolling out the notification feature). Always identified by randomization. This is the policy-relevant effect: "what happens if we ship to everyone?" **LATE** = effect of *actually receiving* notifications, on compliers. Requires additional assumptions (monotonicity, exclusion restriction). This is the "efficacy" effect. ] -- |Estimand | Estimate|Issue | |:-----------------------------|--------:|:-----------------------------| |ITT (OLS on assignment) | 42.0|Diluted by non-compliance | |LATE (IV / Wald) | 59.8|Accounts for non-compliance | |Naive (OLS on actual receipt) | 52.5|BIASED: receipt is endogenous | -- .highlight-box[ **Never** regress on actual treatment take-up — it's endogenous. Use assignment as an instrument (2SLS) to get LATE, or report ITT. ] --- # Application: Zone-Notification with CUPED <img src="slides_files/figure-html/rideshare-cuped-1.png" style="display: block; margin: auto;" /> --- # Application: DiD for Pre/Post Intervention The author-nudge experiment from Module 2: 200 studies, 4 treatment arms (T0–T3), measure hypothesis-reporting rates before and after the intervention. 
.small[ ```r n_studies <- 200 studies <- tibble( study_id = 1:n_studies, has_pap = rbinom(n_studies, 1, 0.4), arm = sample(rep(c("T0", "T1", "T2", "T3"), each = n_studies / 4)), # Pre-period: ~42% of hypotheses reported on average y_pre = rbeta(n_studies, shape1 = 3, shape2 = 4), # Post-period: T0 no change, T1-T3 increase reporting effect = case_when( arm == "T0" ~ 0, arm == "T1" ~ 0.03, arm == "T2" ~ 0.06, arm == "T3" ~ 0.10 ), y_post = pmin(1, y_pre + effect + rnorm(n_studies, sd = 0.08)) ) ``` ] --- # DiD: Estimating Treatment Effects .small[ ```r # Reshape to panel (DiD needs long format) panel <- studies |> pivot_longer(c(y_pre, y_post), names_to = "period", values_to = "y_report", names_prefix = "y_") |> mutate(post = if_else(period == "post", 1, 0)) # Cross-section (post-only) vs DiD (panel + study FE, T0 baseline) fit_cs <- lm(y_post ~ arm, data = studies) fit_did <- lm(y_report ~ factor(study_id) + post * arm, data = panel) ``` ] -- .pull-left[ .small[ |arm | truth| cs| cs_se| did| did_se| |:--------|-----:|-----:|-----:|-----:|------:| |T1 vs T0 | 0.03| 0.060| 0.039| 0.027| 0.016| |T2 vs T0 | 0.06| 0.055| 0.039| 0.051| 0.016| |T3 vs T0 | 0.10| 0.159| 0.039| 0.109| 0.016| ] ] .pull-right[ .highlight-box[ .small[ Both unbiased (arm is randomized). **DiD shrinks SE ~2×** by absorbing study fixed effects — same trick as CUPED. Identifying assumption: parallel trends. ] ] ] --- # Pre-Analysis Plans .blue-box[ **Pre-analysis plan (PAP):** A document specifying the analysis *before* seeing the data. Includes: primary outcomes, estimating equations, subgroup analyses, and how to handle complications. ] -- A good PAP specifies: 1. **Primary outcome(s)** and how they're constructed 2. **Estimating equation** (e.g., the DiD specification above) 3. **Standard errors** (robust? clustered? at what level?) 4. **Multiple testing** corrections (if applicable — Module 6) 5. **Subgroup analyses** (which ones are confirmatory vs. exploratory?) 6. 
**Rules for deviations** (what do you do if compliance is lower than expected?) -- .highlight-box[ **Why PAPs matter:** Without pre-commitment, researchers (or product teams) can run dozens of specifications and report only the significant ones. A PAP distinguishes confirmatory from exploratory analysis. ] --- # Summary: The Analysis Toolkit | Method | When to Use | Key Benefit | |--------|------------|-------------| | Difference in means | Simplest case | Unbiased, transparent | | Regression + covariates | Pre-treatment predictors available | Variance reduction | | Lin (2013) | Heterogeneous effects possible | Robust to misspecification | | CUPED | Pre-treatment outcome available | Large variance reduction | | Stratification FEs | Stratified randomization | Uses design information | | HC2 standard errors | Always (in experiments) | Robust to heteroskedasticity | | Cluster-robust SEs | Cluster randomization | Correct inference | | ITT | Non-compliance present | Always identified | | LATE / IV | Want "efficacy" effect | Handles non-compliance | | DiD | Panel data (pre/post) | Controls for time-invariant confounders | --- # Key Takeaways 1. **Difference in means** is the foundation. Regression gives the same answer but scales to covariates. 2. **Covariate adjustment** (Lin 2013, CUPED) reduces variance without introducing bias. Always use pre-treatment data when available. 3. **CUPED is regression.** The `\(\theta\)` is just the OLS coefficient. CUPED is a convenient production implementation. 4. **Robust SEs** (HC2) should be standard. Cluster at the level of randomization. 5. **ITT vs LATE:** Report ITT as the primary result. Use IV for LATE when compliance matters. Never naively regress on take-up. 6. **Pre-analysis plans** separate confirmatory from exploratory analysis. They protect you from yourself. --- # Exercise Preview In the exercise you will: 1. Implement CUPED from scratch and verify it matches regression 2. 
Compare precision with and without covariate adjustment across 1,000 simulations 3. Implement Lin's (2013) estimator and compare to naive covariate adjustment 4. Estimate ITT and LATE with simulated non-compliance data 5. Run a DiD specification on simulated panel data See `exercise.R` for the starter code. --- class: center, middle, inverse # Backup Slides --- name: ols-equals-dim # Backup: Why OLS = Difference in Means For randomized `\(D \in \{0, 1\}\)`, run OLS of `\(Y\)` on `\(D\)` with intercept: `$$\widehat{Y}_i = \hat\alpha + \hat\beta D_i.$$` OLS minimizes `\(\sum_i (Y_i - \hat\alpha - \hat\beta D_i)^2\)`. **Split the sum by treatment status:** `$$\sum_{D_i = 0} (Y_i - \hat\alpha)^2 \;+\; \sum_{D_i = 1} (Y_i - \hat\alpha - \hat\beta)^2.$$` -- Each piece is a "fit a single number to a sample" problem. The minimizer is the corresponding group mean: `$$\hat\alpha = \bar{Y}_0 \quad\text{and}\quad \hat\alpha + \hat\beta = \bar{Y}_1.$$` -- Therefore `$$\boxed{\;\hat\beta_{\text{OLS}} \;=\; \bar{Y}_1 - \bar{Y}_0 \;=\; \widehat{\text{ATE}}.\;}$$` -- This is **exact in finite samples**, not just asymptotic. The same algebra goes through in a regression with covariates, but only when the covariates are *centered and interacted with* `\(D\)` — that's the Lin (2013) estimator. <a href="#lin-proof" class="inline-btn">Lin proof</a> <a href="#ols-equals-dim-main" class="nav-btn-br">← back</a> --- name: cuped-derivation # Backup: CUPED Derivation Estimate `\(\tau = E[Y(1)] - E[Y(0)]\)` with minimum variance. **Step 1:** Define `\(\tilde{Y}_i = Y_i - \theta(X_i - E[X])\)` where `\(X_i\)` is a pre-treatment covariate. 
**Step 2:** The variance of the difference in means of `\(\tilde{Y}\)` is: `$$\text{Var}(\hat{\tau}_{\text{CUPED}}) = \text{Var}(\hat{\tau}_{\text{raw}}) + \theta^2 \text{Var}(\bar{X}_T - \bar{X}_C) - 2\theta \text{Cov}(\hat{\tau}_{\text{raw}}, \bar{X}_T - \bar{X}_C)$$` **Step 3:** Minimize over `\(\theta\)`: `$$\theta^* = \frac{\text{Cov}(Y, X)}{\text{Var}(X)}$$` (The OLS coefficient from regressing `\(Y\)` on `\(X\)` — CUPED is regression.) **Step 4:** Variance reduction: `$$\frac{\text{Var}(\hat{\tau}_{\text{CUPED}})}{\text{Var}(\hat{\tau}_{\text{raw}})} = 1 - \text{Corr}(Y, X)^2$$` <a href="#cuped-regression-main" class="nav-btn">← back</a> --- name: cuped-fwl # Backup: CUPED ↔ FWL / Residualization .small[ **Frisch–Waugh–Lovell.** To recover `\(\hat\beta\)` from `\(Y = \alpha + \beta D + \gamma X + \varepsilon\)`: 1. Residualize `\(Y\)` on `\(X\)`: `\(\tilde Y = Y - \hat\gamma_Y X\)` (with `\(\hat\gamma_Y = \text{Cov}(Y,X)/\text{Var}(X)\)`) 2. Residualize `\(D\)` on `\(X\)`: `\(\tilde D = D - \hat\gamma_D X\)` 3. Regress `\(\tilde Y\)` on `\(\tilde D\)`. The coefficient equals `\(\hat\beta\)` from the full regression. **In a randomized experiment, `\(D \perp X\)` by design.** So `\(\hat\gamma_D \to 0\)` and step 2 collapses: `\(\tilde D = D - \bar D\)` — no residualization needed. What remains is exactly CUPED: `$$\tilde Y_i = Y_i - \theta(X_i - \bar X), \quad \theta = \frac{\text{Cov}(Y, X)}{\text{Var}(X)}$$` then compare means by `\(D\)`. The `\(\theta\)` in CUPED *is* the OLS slope `\(\hat\gamma_Y\)` from FWL's first step. CUPED is the experiment-only shortcut: in observational settings, where `\(D\)` is not independent of `\(X\)`, you have to residualize `\(D\)` too — i.e., go back to full FWL. | Setting | Residualize Y? | Residualize D? | |---|---|---| | Observational (D not indep. X) | yes | yes — full FWL | | Experiment (D indep. 
X) | yes | **no — that's CUPED** |
]

<a href="#cuped-regression-main" class="nav-btn">← back</a>

---

name: lin-proof

# Backup: Why Lin (2013) Works

The standard covariate-adjusted estimator regresses `\(Y\)` on `\(D\)` and `\(X\)`:

`$$Y_i = \alpha + \tau D_i + \beta X_i + \varepsilon_i$$`

**Problem:** This constrains the coefficient on `\(X\)` to be the same in treatment and control groups. If the relationship between `\(X\)` and `\(Y\)` differs by treatment status, `\(\hat{\tau}\)` remains consistent for the ATE, but it can be *less precise* than the unadjusted difference in means (Freedman's critique) and carries a small finite-sample bias.

--

**Lin's fix:** Interact the demeaned covariate with treatment:

`$$Y_i = \alpha + \tau D_i + \beta_0 \tilde{X}_i + \beta_1 D_i \tilde{X}_i + \varepsilon_i$$`

where `\(\tilde{X}_i = X_i - \bar{X}\)`.

--

**Properties:**

1. `\(\hat{\tau}\)` is consistent for the ATE regardless of the true data-generating process.
2. `\(\hat{\tau}\)` is at least as precise as the unadjusted estimator (asymptotically).
3. With HC2 standard errors, inference is valid even under heteroskedasticity.

**Intuition:** Demeaning makes `\(\hat{\tau}\)` the effect evaluated at the covariate mean, i.e., the ATE, rather than the effect at `\(X = 0\)`. The interaction allows different slopes in each group, avoiding functional form bias.

---

name: wald-estimator

# Backup: The Wald Estimator (LATE)

With one-sided non-compliance (only treatment group can take treatment):

`$$\text{LATE} = \frac{\text{ITT}}{\text{compliance rate}} = \frac{E[Y | Z=1] - E[Y | Z=0]}{E[D | Z=1] - E[D | Z=0]}$$`

--

where `\(Z\)` = assignment, `\(D\)` = actual treatment take-up. This is equivalent to 2SLS with `\(Z\)` as an instrument for `\(D\)`.

**Required assumptions:**

1. **Relevance:** `\(Z\)` affects `\(D\)` (first stage is strong)
2. **Independence:** `\(Z\)` is randomly assigned
3. **Exclusion restriction:** `\(Z\)` affects `\(Y\)` only through `\(D\)`
4. **Monotonicity:** No "defiers" (no one does the opposite of their assignment)

Under these assumptions, LATE identifies the effect on **compliers** — those who take treatment when assigned to it and don't when assigned to control.
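---

name: wald-check

# Backup: Checking Wald = 2SLS in Code

The Wald/2SLS equivalence is easy to verify numerically. A base-R sketch on simulated data (the setup loosely mirrors the ITT/LATE slide but is hypothetical, with a $60 effect and 70% one-sided compliance):

```r
set.seed(1)
n <- 10000
z <- rep(0:1, each = n / 2)                  # random assignment
d <- ifelse(z == 1, rbinom(n, 1, 0.7), 0L)   # one-sided 70% compliance
y <- 500 + 60 * d + rnorm(n, sd = 60)        # effect runs only through receipt

# Wald: ITT divided by the first-stage (compliance) difference
itt  <- mean(y[z == 1]) - mean(y[z == 0])
fs   <- mean(d[z == 1]) - mean(d[z == 0])
wald <- itt / fs

# 2SLS by hand: regress y on the fitted values from the first stage d ~ z
d_hat <- fitted(lm(d ~ z))
tsls  <- unname(coef(lm(y ~ d_hat))["d_hat"])

all.equal(tsls, wald)  # identical up to floating point
```

With a single binary instrument and no covariates the two are algebraically identical. With covariates, use a proper 2SLS routine instead of this two-step shortcut: the manual second stage gives the right point estimate but the wrong standard errors.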