class: center, middle, inverse, title-slide

.title[
# Module 7: External Validity & Generalizability
]
.subtitle[
## Your Zone-Notification Worked in 8 Cities. Will It Work in the Other 32?
]

---

<style type="text/css">
.remark-code, .remark-inline-code { font-size: 80%; }
.remark-slide-content { padding: 1em 2em; }
.small { font-size: 80%; }
.tiny { font-size: 65%; }
.highlight-box { background: #fff3e0; border-left: 4px solid #e65100; padding: 0.5em 1em; margin: 0.5em 0; }
.blue-box { background: #e3f2fd; border-left: 4px solid #1565c0; padding: 0.5em 1em; margin: 0.5em 0; }
</style>

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">The Experimental Ideal</a></td><td>✓ done</td></tr>
<tr><td>2</td><td><a href="../module-02/slides.html">SUTVA and When It Breaks</a></td><td>✓ done</td></tr>
<tr><td>3</td><td><a href="../module-03/slides.html">Designing Around Interference</a></td><td>✓ done</td></tr>
<tr><td>4</td><td>Power and Sample Size</td><td>✓ done</td></tr>
<tr><td>5</td><td><a href="../module-05/slides.html">Analyzing Experiments</a></td><td>✓ done</td></tr>
<tr><td>6</td><td>Multiple Testing & Subgroups</td><td>✓ done</td></tr>
<tr><td><b>7</b></td><td><b>External Validity & Generalizability</b> <i>(you are here)</i></td><td>current</td></tr>
<tr><td>8</td><td><a href="../module-08/slides.html">Beyond the A/B Test</a></td><td>✓ done</td></tr>
</table>

---

# The Setup: Zone-Notification in 8 of 40 Cities

The platform from Modules 1–3 ran the **zone-notification experiment** in just **8 hand-picked metros** (the ones with the most engineering bandwidth — and the densest markets).

Pooled estimate of the direct effect on accept rate: **≈ +0.11** — about double the +0.05 the Module 3 model implied for an average city.

<img src="slides_files/figure-html/zone-experiment-1.png" style="display: block; margin: auto;" />

--

The CEO asks: "Great, let's roll this out to all 40 cities." Should you?

---

# Internal vs External Validity

.pull-left[
.blue-box[
**Internal validity:** Is the causal estimate correct *within the study*?

- Randomization was done properly
- No attrition, spillovers, or compliance issues
- The estimated ATE is unbiased for the study population
]
]

.pull-right[
.blue-box[
**External validity:** Does the effect generalize *beyond the study*?

- To other populations (cities, countries)
- To other time periods
- To other implementations of the treatment
- To other outcome measures
]
]

--

.highlight-box[
**The tradeoff:** Lab experiments maximize internal validity (tight control) but may have low external validity (artificial setting). A field experiment in a single site has high internal validity but limited external validity. Multi-site experiments trade some precision for generalizability.
]

---

# Why 8-City Results Might Not Generalize

<img src="slides_files/figure-html/city-heterogeneity-1.png" style="display: block; margin: auto;" />

--

The selected metros are systematically denser than the typical city. Cities with thin demand or sparse driver pools (negative `city_effect`) see much smaller effects — sometimes near zero.

---

# Site Selection Bias

.blue-box[
.small[
**Site selection bias:** experiment sites are not random samples of the target population — they're chosen for convenience, feasibility, or where the treatment is expected to work.
]
]

.pull-left[
.small[
**Common patterns:**

- Tech companies test in the densest US metros first
- Development RCTs run in places with cooperative governments
- Drug trials recruit from academic medical centers, not community clinics

In our setup: the 8 selected metros sit in the top quintile of `city_effect`. Their average local effect overstates the all-cities target (see the simulation sketch on the next slide).
]
]

.pull-right[
<img src="slides_files/figure-html/site-selection-1.png" style="display: block; margin: auto;" />
]
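---

# Site Selection Bias: A Simulation Sketch

.small[
A minimal sketch of the selection story above, assuming the local ATE rises with `city_effect`. The baseline, slope, and noise levels are illustrative stand-ins, not the module's actual simulation:
]

```r
set.seed(7)

# 40 cities; city_effect is a stand-in for market density (illustrative scale)
cities <- data.frame(city_effect = rnorm(40, mean = 0, sd = 0.05))

# Illustrative assumption: local ATE = baseline + slope * density + noise
cities$local_effect <- 0.05 + 1.2 * cities$city_effect + rnorm(40, 0, 0.01)

# "Convenient" selection: run the experiment in the 8 densest metros
selected <- cities[order(-cities$city_effect)[1:8], ]

mean(selected$local_effect)  # pooled 8-city estimate (inflated)
mean(cities$local_effect)    # the all-40-cities target ATE
```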
---

# Heterogeneous Treatment Effects: The Key

<img src="slides_files/figure-html/hte-simulation-1.png" style="display: block; margin: auto;" />

--

.small[
**If treatment effects are homogeneous** (the same everywhere), external validity is free. The problem is entirely driven by **heterogeneous treatment effects** that correlate with site characteristics.
]

---

# Transportability: The Formal Framework

.blue-box[
**Transportability** (Pearl & Bareinboim, 2014): When can we use a causal effect estimated in population `\(A\)` to predict the effect in population `\(B\)`?
]

--

**Intuition:** You can transport if:

1. You know **which variables** moderate the treatment effect (effect modifiers)
2. You can **measure** those variables in both populations
3. You can **adjust** for differences in those variables

--

`$$\text{ATE}_B = \sum_x \text{CATE}(x) \cdot P_B(X = x)$$`

Re-weight the site-specific conditional effects by the target population's covariate distribution (see the sketch on the next slide).

--

.highlight-box[
**In practice:** This requires knowing the effect modifiers. If you miss an important one, transportability fails. Domain knowledge is essential.
]
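---

# Transportability: The Reweighting Step in Code

.small[
A minimal sketch of the two-step recipe, under the assumption that `city_effect` is the only effect modifier. The data frames and effect sizes here are hypothetical, not the module's simulation:
]

```r
set.seed(1)

# Hypothetical source (8 dense metros) and target (the other 32 cities)
source_df <- data.frame(city_effect = rnorm(8, 0.08, 0.02))
source_df$local_effect <- 0.05 + 1.2 * source_df$city_effect + rnorm(8, 0, 0.01)
target_df <- data.frame(city_effect = rnorm(32, -0.02, 0.04))

# Step 1: estimate CATE(x) on the source data, with x = city_effect
cate_fit <- lm(local_effect ~ city_effect, data = source_df)

# Step 2: average the fitted CATE over the *target* covariate distribution
mean(predict(cate_fit, newdata = target_df))  # transported ATE_B
mean(source_df$local_effect)                  # naive transport (too high)
```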
---

# Transportability: Simulation

<img src="slides_files/figure-html/transport-sim-1.png" style="display: block; margin: auto;" />

--

Naive transport from the 8-metro source overestimates the 32-city target by ≈ 0.083. Reweighting the source-fitted CATE by the target's `city_effect` distribution recovers the truth.

---

# Dose-Response and Mechanism

.pull-left[
**Why it works** matters more than **whether it works** for generalizability.

If you understand the **mechanism**, you can predict *where* the treatment will work:

- A price cut works because it shifts demand → only works if supply can absorb it
- Zone notification works because it routes drivers to demand → works wherever demand is concentrated
- A nudge works because it reduces friction → works wherever friction is the bottleneck
]

.pull-right[
<img src="slides_files/figure-html/dose-response-1.png" style="display: block; margin: auto;" />
]

--

.highlight-box[
**Interview tip:** "If you ran the experiment at one dose level, can you predict the effect at another?" Only if you have a model of the dose-response relationship. A binary experiment gives you one point on the curve.
]

---

# Temporal Validity: Effects That Decay

<img src="slides_files/figure-html/novelty-effect-1.png" style="display: block; margin: auto;" />

--

.small[
**Novelty effects in tech:** drivers who didn't know about the high-demand zone now do — but once everyone knows (and the platform's other surfaces start showing the same info), the *marginal* informational value of the notification shrinks.

The 4-week experiment's average (≈ 0.07) is roughly **3.5× the long-run effect** (≈ 0.02). Fix: run longer, or model the decay curve.
]

---

# Hawthorne and Demand Effects

.pull-left[
.blue-box[
**Hawthorne effect:** People change behavior because they know they are being observed, not because of the treatment itself.
]

- Lab experiments are most susceptible
- A/B tests where users don't know they're in an experiment are less susceptible
- But surveys, feedback forms, and opt-in studies are vulnerable
]

.pull-right[
.blue-box[
**Demand effect:** Participants guess the hypothesis and behave accordingly (or contrarily).
]

- "I think they want me to click more ads, so I will"
- Or: "I think they want me to click more, so I won't" (reactance)
- Mitigated by: blinding, behavioral (not self-reported) outcomes, deception
]

--

.highlight-box[
**For tech interviews:** These are less of a concern in standard A/B tests (users don't know their assignment). But they matter for employee experiments, opt-in trials, user research studies, and any experiment with self-reported outcomes.
]

---

# What Would Help Generalizability?

.small[
**Strategy 1: Multi-site experiments.** Run in 5–10 cities **stratified by `city_effect`**, not just the densest 8. Estimate site-specific effects and model the heterogeneity (see the sketch on the next slide).

**Strategy 2: Identify effect modifiers.** Use the 8 estimated ATEs as data points; fit ATE on `city_effect`; plug a *target* city's `city_effect` into the line to predict its ATE — instead of just averaging the 8.

```r
library(dplyr)

sample_8 <- filter(cities_df, selected)                 # the 8 experimental metros
fit <- lm(local_effect ~ city_effect, data = sample_8)  # ATE as a function of the moderator
predict(fit, data.frame(city_effect = -0.05))           # moderator-based prediction → ≈ 0.03
```

```
##    1 
## 0.03
```

```r
mean(sample_8$local_effect)  # naive 8-city average → ≈ 0.11
```

```
## [1] 0.114612
```

Same 8 cities of data → two different answers. The moderator model says a thin-market city's ATE is ≈ 4× smaller than the naive guess.

**Strategy 3: Understand the mechanism.** Zone-notification routes drivers to concentrated demand. **Strongest** where demand is concentrated (high `city_effect`), **weakest** in thin markets — exactly the heterogeneity we see in the data.
]
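---

# Multi-Site Analysis: A Sketch

.small[
A minimal sketch of Strategy 1's analysis step. The unit-level data frame `d` and its columns (`city`, `treated`, `accepted`) are hypothetical names, and the effect sizes are illustrative:
]

```r
set.seed(3)

# Hypothetical unit-level data from a 10-site experiment
d <- do.call(rbind, lapply(1:10, function(s) {
  tau_s   <- rnorm(1, mean = 0.05, sd = 0.03)  # site-specific true effect
  treated <- rep(0:1, each = 500)
  data.frame(city = s, treated = treated,
             accepted = rbinom(1000, 1, 0.5 + tau_s * treated))
}))

# Site-specific ATEs: treated-minus-control accept rate within each city
site_ate <- sapply(split(d, d$city), function(x)
  mean(x$accepted[x$treated == 1]) - mean(x$accepted[x$treated == 0]))

mean(site_ate)  # average site effect (equal weights)
sd(site_ate)    # cross-site heterogeneity -- report this, not just the mean
```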
---

# When Extrapolation Fails: A Simulation

<img src="slides_files/figure-html/extrapolation-fail-1.png" style="display: block; margin: auto;" />

---

# Strategies for Improving External Validity

| Strategy | How it helps | Cost |
|----------|-------------|------|
| **Multi-site experiments** | Directly measures heterogeneity | Expensive, complex logistics |
| **Stratified site selection** | Ensures diversity on key dimensions | Need to know the key dimensions |
| **Effect modifier analysis** | Models why effects vary | Needs theory + data on moderators |
| **Mechanism tests** | Predicts where treatment will work | Requires understanding of the mechanism |
| **Replication** | Confirms or refutes in new settings | Time, resources |
| **Longer experiments** | Capture steady state, not novelty | Opportunity cost |

--

.highlight-box[
**The meta-point:** External validity cannot be tested — only argued for. Internal validity can be designed in (randomization). External validity requires theory, multiple studies, and honest uncertainty about generalizability.
]

---

# Key Takeaways

1. **Internal validity** (is the estimate right?) and **external validity** (does it generalize?) are distinct. High internal validity does not imply high external validity.

2. External validity fails when **treatment effects are heterogeneous** and the experimental population differs from the target population on the effect modifiers.

3. **Site selection bias** is pervasive: experiments run in convenient or favorable locations overstate average effects.

4. **Transportability** requires knowing the effect modifiers and adjusting for distributional differences. Without this knowledge, extrapolation is guesswork.

5. **Novelty effects** and **Hawthorne effects** threaten temporal external validity. Short experiments may overstate long-run impacts.

6. **Understanding the mechanism** is the most powerful tool for generalizability: *why* it works tells you *where* it will work.

---

# Exercise Preview

In the exercise (`exercise.R`) you will:

1. Simulate treatment effects that vary across 30 sites based on site characteristics
2. Run an experiment in a biased subset of sites and show that the estimate doesn't generalize
3. Apply a transportability reweighting approach and show that it recovers the target ATE
4. Simulate novelty effects and show how experiment duration affects the estimate
5. Compare multi-site vs single-site experimental designs

See `exercise.R` for the starter code.

---

class: center, middle, inverse

# Backup Slides

---

name: transport-formula

# Backup: Transportability Formula

**Setup:** Experiment in population `\(A\)`, target population `\(B\)`. Let `\(X\)` be the observed effect modifiers.

The ATE in `\(B\)` is:

`$$\tau_B = E_B[Y(1) - Y(0)] = \sum_x E_A[Y(1) - Y(0) \mid X = x] \cdot P_B(X = x)$$`

--

**Assumptions required:**

1. **Conditional exchangeability across populations:** `\(\tau(x) = E_A[Y(1) - Y(0) \mid X = x] = E_B[Y(1) - Y(0) \mid X = x]\)`
2. **Positivity:** `\(P_A(X = x) > 0\)` for all `\(x\)` with `\(P_B(X = x) > 0\)`
3. **Correct specification** of the effect modifier set `\(X\)`

--

**In practice:** Estimate `\(\hat{\tau}(x)\)` from the experimental data, then reweight:

`$$\hat{\tau}_B = \frac{1}{n_B} \sum_{i \in B} \hat{\tau}(x_i)$$`

This is analogous to survey reweighting / inverse probability weighting, but for treatment effects rather than means.

---

name: multisite-design

# Backup: Multi-Site Experimental Design

**Option 1: Same treatment, many sites**

- Estimate site-specific effects `\(\hat{\tau}_s\)` for `\(s = 1, \ldots, S\)`
- The overall ATE is a weighted average: `\(\hat{\tau} = \sum_s w_s \hat{\tau}_s\)`
- Weights can be equal (average site effect) or proportional to site size (population ATE)
- Key: report the **variance** of the site-specific effects, not just their average

--

**Option 2: Vary the treatment across sites (factorial design)**

- Different doses, implementations, or contexts
- Allows estimation of dose-response curves
- Requires more sites and/or larger samples

--

**Power consideration:**

- With `\(S\)` sites, you have `\(S\)` "observations" for estimating heterogeneity
- Individual-level `\(N\)` does not help with site-level inference
- Rule of thumb: `\(S \ge 20\)` for a meaningful heterogeneity analysis

---

name: novelty-modeling

# Backup: Modeling Novelty / Decay Effects

**Exponential decay model:**

`$$\tau(t) = \tau_{\infty} + (\tau_0 - \tau_{\infty}) \cdot e^{-\lambda t}$$`

where:

- `\(\tau_0\)` = initial treatment effect (including novelty)
- `\(\tau_{\infty}\)` = long-run (steady-state) treatment effect
- `\(\lambda\)` = decay rate
- `\(t\)` = time since treatment onset

--

**Estimation:** Run a long experiment (or multiple cohorts with staggered starts). Estimate `\(\tau(t)\)` for each time period and fit the decay curve (see the sketch on the next slide).

**Practical note:** If `\(\tau_{\infty} = 0\)`, the treatment has no lasting effect. This is surprisingly common for UI changes, notifications, and other "attention-grabbing" interventions. A 4-week experiment cannot distinguish a slowly decaying effect from a permanent one.
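---

name: novelty-fitting

# Backup: Fitting the Decay Curve (Sketch)

.small[
A minimal sketch of the estimation step using `nls()`. The weekly effect estimates, true parameter values, and noise level are illustrative assumptions:
]

```r
set.seed(4)

# Hypothetical weekly ATE estimates from a 12-week experiment
weeks   <- 1:12
tau_hat <- 0.02 + (0.12 - 0.02) * exp(-0.4 * weeks) + rnorm(12, 0, 0.005)

# Fit the exponential decay model by nonlinear least squares
fit <- nls(tau_hat ~ tau_inf + (tau_0 - tau_inf) * exp(-lambda * weeks),
           start = list(tau_0 = 0.10, tau_inf = 0.01, lambda = 0.5))

coef(fit)["tau_inf"]  # estimated long-run (steady-state) effect
```

.small[
With `tau_inf` estimated, you can compare the short-experiment average against the implied steady state directly.
]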