class: center, middle, inverse, title-slide

.title[
# Module 7: External Validity & Generalizability
]
.subtitle[
## Your Zone-Notification Worked in 8 Cities. Will It Work in the Other 32?
]

---

<style type="text/css">
.remark-code, .remark-inline-code { font-size: 80%; }
.remark-slide-content { padding: 1em 2em; }
.small { font-size: 80%; }
.tiny { font-size: 65%; }
.highlight-box { background: #fff3e0; border-left: 4px solid #e65100; padding: 0.5em 1em; margin: 0.5em 0; }
.blue-box { background: #e3f2fd; border-left: 4px solid #1565c0; padding: 0.5em 1em; margin: 0.5em 0; }
</style>

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">The Experimental Ideal</a></td><td>✓ done</td></tr>
<tr><td>2</td><td><a href="../module-02/slides.html">SUTVA and When It Breaks</a></td><td>✓ done</td></tr>
<tr><td>3</td><td><a href="../module-03/slides.html">Designing Around Interference</a></td><td>✓ done</td></tr>
<tr><td>4</td><td>Power and Sample Size</td><td>✓ done</td></tr>
<tr><td>5</td><td><a href="../module-05/slides.html">Analyzing Experiments</a></td><td>✓ done</td></tr>
<tr><td>6</td><td>Multiple Testing & Subgroups</td><td>✓ done</td></tr>
<tr><td><b>7</b></td><td><b>External Validity & Generalizability</b> <i>(you are here)</i></td><td>current</td></tr>
<tr><td>8</td><td><a href="../module-08/slides.html">Beyond the A/B Test</a></td><td>✓ done</td></tr>
</table>

---

# The Setup: Zone-Notification in 8 of 40 Cities

The platform from Modules 1–3 ran the **zone-notification experiment** in just **8 hand-picked metros** (the ones with the most engineering bandwidth — and the densest markets).

Pooled estimate of the direct effect on accept rate: **≈ +0.11** — about double the +0.05 the Module 3 model implied for an average city.

<img src="slides_files/figure-html/zone-experiment-1.png" style="display: block; margin: auto;" />

--

The CEO asks: "Great, let's roll this out to all 40 cities." Should you?

---

# Internal vs External Validity

.pull-left[
.blue-box[
**Internal validity:** Is the causal estimate correct *within the study*?

- Randomization was done properly
- No attrition, spillovers, or compliance issues
- The estimated ATE is unbiased for the study population
]
]

.pull-right[
.blue-box[
**External validity:** Does the effect generalize *beyond the study*?

- To other populations (cities, countries)
- To other time periods
- To other implementations of the treatment
- To other outcome measures
]
]

--

.highlight-box[
**The tradeoff:** Lab experiments maximize internal validity (tight control) but may have low external validity (artificial setting). A field experiment in a single site has high internal validity but limited external validity. Multi-site experiments trade some precision for generalizability.
]

---

# Why 8-City Results Might Not Generalize

<img src="slides_files/figure-html/city-heterogeneity-1.png" style="display: block; margin: auto;" />

--

The selected metros are systematically denser than the typical city. Cities with thin demand or sparse driver pools (negative `city_effect`) see much smaller effects — sometimes near zero.

---

# Site Selection Bias

.blue-box[
.small[
**Site selection bias:** experiment sites are not random samples of the target population — they're chosen for convenience, feasibility, or where the treatment is expected to work.
]
]

.pull-left[
.small[
**Common patterns:**

- Tech companies test in the densest US metros first
- Development RCTs run in places with cooperative governments
- Drug trials recruit from academic medical centers, not community clinics

In our setup: the 8 selected metros sit in the top quintile of `city_effect`. Their average local effect overstates the all-cities target (see the simulation sketch on the next slide).
]
]

.pull-right[
<img src="slides_files/figure-html/site-selection-1.png" style="display: block; margin: auto;" />
]
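---

# Site Selection Bias: A Simulation Sketch

.small[
A minimal sketch of the selection story above, assuming the local ATE rises with `city_effect`. The baseline, slope, and noise levels are illustrative stand-ins, not the module's actual simulation:
]

```r
set.seed(7)

# 40 cities; city_effect is a stand-in for market density (illustrative scale)
cities <- data.frame(city_effect = rnorm(40, mean = 0, sd = 0.05))

# Illustrative assumption: local ATE = baseline + slope * density + noise
cities$local_effect <- 0.05 + 1.2 * cities$city_effect + rnorm(40, 0, 0.01)

# "Convenient" selection: run the experiment in the 8 densest metros
selected <- cities[order(-cities$city_effect)[1:8], ]

mean(selected$local_effect)  # pooled 8-city estimate (inflated)
mean(cities$local_effect)    # the all-40-cities target ATE
```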
---

# Heterogeneous Treatment Effects: The Key

<img src="slides_files/figure-html/hte-simulation-1.png" style="display: block; margin: auto;" />

--

.small[
**If treatment effects are homogeneous** (the same everywhere), external validity is free. The problem is entirely driven by **heterogeneous treatment effects** that correlate with site characteristics.
]

---

# Transportability: The Formal Framework

.blue-box[
**Transportability** (Pearl & Bareinboim, 2014): When can we use a causal effect estimated in population `\(A\)` to predict the effect in population `\(B\)`?
]

--

**Intuition:** You can transport if:

1. You know **which variables** moderate the treatment effect (effect modifiers)
2. You can **measure** those variables in both populations
3. You can **adjust** for differences in those variables

--

`$$\text{ATE}_B = \sum_x \text{CATE}(x) \cdot P_B(X = x)$$`

Re-weight the site-specific conditional effects by the target population's covariate distribution (see the sketch on the next slide).

--

.highlight-box[
**In practice:** This requires knowing the effect modifiers. If you miss an important one, transportability fails. Domain knowledge is essential.
]
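---

# Transportability: The Reweighting Step in Code

.small[
A minimal sketch of the two-step recipe, under the assumption that `city_effect` is the only effect modifier. The data frames and effect sizes here are hypothetical, not the module's simulation:
]

```r
set.seed(1)

# Hypothetical source (8 dense metros) and target (the other 32 cities)
source_df <- data.frame(city_effect = rnorm(8, 0.08, 0.02))
source_df$local_effect <- 0.05 + 1.2 * source_df$city_effect + rnorm(8, 0, 0.01)
target_df <- data.frame(city_effect = rnorm(32, -0.02, 0.04))

# Step 1: estimate CATE(x) on the source data, with x = city_effect
cate_fit <- lm(local_effect ~ city_effect, data = source_df)

# Step 2: average the fitted CATE over the *target* covariate distribution
mean(predict(cate_fit, newdata = target_df))  # transported ATE_B
mean(source_df$local_effect)                  # naive transport (too high)
```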
---

# Transportability: Simulation

<img src="slides_files/figure-html/transport-sim-1.png" style="display: block; margin: auto;" />

--

Naive transport from the 8-metro source overestimates the 32-city target by ≈ 0.083. Reweighting the source-fitted CATE by the target's `city_effect` distribution recovers the truth.

---

# Dose-Response and Mechanism

.pull-left[
**Why it works** matters more than **whether it works** for generalizability.

If you understand the **mechanism**, you can predict *where* the treatment will work:

- A price cut works because it shifts demand → only works if supply can absorb it
- Zone notification works because it routes drivers to demand → works wherever demand is concentrated
- A nudge works because it reduces friction → works wherever friction is the bottleneck
]

.pull-right[
<img src="slides_files/figure-html/dose-response-1.png" style="display: block; margin: auto;" />
]

--

.highlight-box[
**Interview tip:** "If you ran the experiment at one dose level, can you predict the effect at another?" Only if you have a model of the dose-response relationship. A binary experiment gives you one point on the curve.
]

---

# Temporal Validity: Effects That Decay

<img src="slides_files/figure-html/novelty-effect-1.png" style="display: block; margin: auto;" />

--

.small[
**Novelty effects in tech:** drivers who didn't know about the high-demand zone now do — but once everyone knows (and the platform's other surfaces start showing the same info), the *marginal* informational value of the notification shrinks.

The 4-week experiment's average (≈ 0.07) is roughly **3.5× the long-run effect** (≈ 0.02). Fix: run longer, or model the decay curve.
]

---

# Hawthorne and Demand Effects

.pull-left[
.blue-box[
**Hawthorne effect:** People change behavior because they know they are being observed, not because of the treatment itself.
]

- Lab experiments are most susceptible
- A/B tests where users don't know they're in an experiment are less susceptible
- But surveys, feedback forms, and opt-in studies are vulnerable
]

.pull-right[
.blue-box[
**Demand effect:** Participants guess the hypothesis and behave accordingly (or contrarily).
]

- "I think they want me to click more ads, so I will"
- Or: "I think they want me to click more, so I won't" (reactance)
- Mitigated by: blinding, behavioral (not self-reported) outcomes, deception
]

--

.highlight-box[
**For tech interviews:** These are less of a concern in standard A/B tests (users don't know their assignment). But they matter for employee experiments, opt-in trials, user research studies, and any experiment with self-reported outcomes.
]

---

# What Would Help Generalizability?

.small[
**Strategy 1: Multi-site experiments.** Run in 5–10 cities **stratified by `city_effect`**, not just the densest 8. Estimate site-specific effects and model the heterogeneity (see the sketch on the next slide).

**Strategy 2: Identify effect modifiers.** Use the 8 estimated ATEs as data points; fit ATE on `city_effect`; plug a *target* city's `city_effect` into the line to predict its ATE — instead of just averaging the 8.

```r
library(dplyr)

sample_8 <- filter(cities_df, selected)                 # the 8 experimental metros
fit <- lm(local_effect ~ city_effect, data = sample_8)  # ATE as a function of the moderator
predict(fit, data.frame(city_effect = -0.05))           # moderator-based prediction → ≈ 0.03
```

```
##    1 
## 0.03
```

```r
mean(sample_8$local_effect)  # naive 8-city average → ≈ 0.11
```

```
## [1] 0.114612
```

Same 8 cities of data → two different answers. The moderator model says a thin-market city's ATE is ≈ 4× smaller than the naive guess.

**Strategy 3: Understand the mechanism.** Zone-notification routes drivers to concentrated demand. **Strongest** where demand is concentrated (high `city_effect`), **weakest** in thin markets — exactly the heterogeneity we see in the data.
]
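---

# Multi-Site Analysis: A Sketch

.small[
A minimal sketch of Strategy 1's analysis step. The unit-level data frame `d` and its columns (`city`, `treated`, `accepted`) are hypothetical names, and the effect sizes are illustrative:
]

```r
set.seed(3)

# Hypothetical unit-level data from a 10-site experiment
d <- do.call(rbind, lapply(1:10, function(s) {
  tau_s   <- rnorm(1, mean = 0.05, sd = 0.03)  # site-specific true effect
  treated <- rep(0:1, each = 500)
  data.frame(city = s, treated = treated,
             accepted = rbinom(1000, 1, 0.5 + tau_s * treated))
}))

# Site-specific ATEs: treated-minus-control accept rate within each city
site_ate <- sapply(split(d, d$city), function(x)
  mean(x$accepted[x$treated == 1]) - mean(x$accepted[x$treated == 0]))

mean(site_ate)  # average site effect (equal weights)
sd(site_ate)    # cross-site heterogeneity -- report this, not just the mean
```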
---

# When Extrapolation Fails: A Simulation

<img src="slides_files/figure-html/extrapolation-fail-1.png" style="display: block; margin: auto;" />

---

# Strategies for Improving External Validity

| Strategy | How it helps | Cost |
|----------|-------------|------|
| **Multi-site experiments** | Directly measures heterogeneity | Expensive, complex logistics |
| **Stratified site selection** | Ensures diversity on key dimensions | Need to know the key dimensions |
| **Effect modifier analysis** | Models why effects vary | Needs theory + data on moderators |
| **Mechanism tests** | Predicts where treatment will work | Requires understanding of the mechanism |
| **Replication** | Confirms or refutes in new settings | Time, resources |
| **Longer experiments** | Capture steady state, not novelty | Opportunity cost |

--

.highlight-box[
**The meta-point:** External validity cannot be tested — only argued for. Internal validity can be designed in (randomization). External validity requires theory, multiple studies, and honest uncertainty about generalizability.
]

---

# Key Takeaways

1. **Internal validity** (is the estimate right?) and **external validity** (does it generalize?) are distinct. High internal validity does not imply high external validity.

2. External validity fails when **treatment effects are heterogeneous** and the experimental population differs from the target population on the effect modifiers.

3. **Site selection bias** is pervasive: experiments run in convenient or favorable locations overstate average effects.

4. **Transportability** requires knowing the effect modifiers and adjusting for distributional differences. Without this knowledge, extrapolation is guesswork.

5. **Novelty effects** and **Hawthorne effects** threaten temporal external validity. Short experiments may overstate long-run impacts.

6. **Understanding the mechanism** is the most powerful tool for generalizability: *why* it works tells you *where* it will work.

---

# Exercise Preview

In the exercise (`exercise.R`) you will:

1. Simulate treatment effects that vary across 30 sites based on site characteristics
2. Run an experiment in a biased subset of sites and show that the estimate doesn't generalize
3. Apply a transportability reweighting approach and show that it recovers the target ATE
4. Simulate novelty effects and show how experiment duration affects the estimate
5. Compare multi-site vs single-site experimental designs

See `exercise.R` for the starter code.

---

class: center, middle, inverse

# Backup Slides

---

name: transport-formula

# Backup: Transportability Formula

**Setup:** Experiment in population `\(A\)`, target population `\(B\)`. Let `\(X\)` be the observed effect modifiers.

The ATE in `\(B\)` is:

`$$\tau_B = E_B[Y(1) - Y(0)] = \sum_x E_A[Y(1) - Y(0) \mid X = x] \cdot P_B(X = x)$$`

--

**Assumptions required:**

1. **Conditional exchangeability across populations:** `\(\tau(x) = E_A[Y(1) - Y(0) \mid X = x] = E_B[Y(1) - Y(0) \mid X = x]\)`
2. **Positivity:** `\(P_A(X = x) > 0\)` for all `\(x\)` with `\(P_B(X = x) > 0\)`
3. **Correct specification** of the effect modifier set `\(X\)`

--

**In practice:** Estimate `\(\hat{\tau}(x)\)` from the experimental data, then reweight:

`$$\hat{\tau}_B = \frac{1}{n_B} \sum_{i \in B} \hat{\tau}(x_i)$$`

This is analogous to survey reweighting / inverse probability weighting, but for treatment effects rather than means.

---

name: multisite-design

# Backup: Multi-Site Experimental Design

**Option 1: Same treatment, many sites**

- Estimate site-specific effects `\(\hat{\tau}_s\)` for `\(s = 1, \ldots, S\)`
- The overall ATE is a weighted average: `\(\hat{\tau} = \sum_s w_s \hat{\tau}_s\)`
- Weights can be equal (average site effect) or proportional to site size (population ATE)
- Key: report the **variance** of the site-specific effects, not just their average

--

**Option 2: Vary the treatment across sites (factorial design)**

- Different doses, implementations, or contexts
- Allows estimation of dose-response curves
- Requires more sites and/or larger samples

--

**Power consideration:**

- With `\(S\)` sites, you have `\(S\)` "observations" for estimating heterogeneity
- Individual-level `\(N\)` does not help with site-level inference
- Rule of thumb: `\(S \ge 20\)` for a meaningful heterogeneity analysis

---

name: novelty-modeling

# Backup: Modeling Novelty / Decay Effects

**Exponential decay model:**

`$$\tau(t) = \tau_{\infty} + (\tau_0 - \tau_{\infty}) \cdot e^{-\lambda t}$$`

where:

- `\(\tau_0\)` = initial treatment effect (including novelty)
- `\(\tau_{\infty}\)` = long-run (steady-state) treatment effect
- `\(\lambda\)` = decay rate
- `\(t\)` = time since treatment onset

--

**Estimation:** Run a long experiment (or multiple cohorts with staggered starts). Estimate `\(\tau(t)\)` for each time period and fit the decay curve (see the sketch on the next slide).

**Practical note:** If `\(\tau_{\infty} = 0\)`, the treatment has no lasting effect. This is surprisingly common for UI changes, notifications, and other "attention-grabbing" interventions. A 4-week experiment cannot distinguish a slowly decaying effect from a permanent one.
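---

name: novelty-fitting

# Backup: Fitting the Decay Curve (Sketch)

.small[
A minimal sketch of the estimation step using `nls()`. The weekly effect estimates, true parameter values, and noise level are illustrative assumptions:
]

```r
set.seed(4)

# Hypothetical weekly ATE estimates from a 12-week experiment
weeks   <- 1:12
tau_hat <- 0.02 + (0.12 - 0.02) * exp(-0.4 * weeks) + rnorm(12, 0, 0.005)

# Fit the exponential decay model by nonlinear least squares
fit <- nls(tau_hat ~ tau_inf + (tau_0 - tau_inf) * exp(-lambda * weeks),
           start = list(tau_0 = 0.10, tau_inf = 0.01, lambda = 0.5))

coef(fit)["tau_inf"]  # estimated long-run (steady-state) effect
```

.small[
With `tau_inf` estimated, you can compare the short-experiment average against the implied steady state directly.
]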