class: center, middle, inverse, title-slide

.title[
# Module 1: The Experimental Ideal
]
.subtitle[
## Potential Outcomes, Selection Bias, and Why Randomization Works
]

---

<style type="text/css">
.remark-code, .remark-inline-code { font-size: 16px; }
.remark-slide-content { padding: 1em 2em; font-size: 21px; }
.small { font-size: 80%; }
.tiny { font-size: 65%; }
.highlight-box { background: #fff3e0; border-left: 4px solid #e65100; padding: 0.5em 1em; margin: 0.5em 0; }
.blue-box { background: #e3f2fd; border-left: 4px solid #1565c0; padding: 0.5em 1em; margin: 0.5em 0; }
.btn-link { display: inline-block; padding: 0.15em 0.7em; background: #1565c0; color: white !important; text-decoration: none; border-radius: 4px; font-size: 0.75em; }
.btn-link:hover { background: #0d47a1; }
.tight-slide { font-size: 18px !important; }
</style>

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td><b>1</b></td><td><b>The Experimental Ideal</b> <i>(you are here)</i></td><td>current</td></tr>
<tr><td>2</td><td><a href="../module-02/slides.html">SUTVA and When It Breaks</a></td><td>✓ done</td></tr>
<tr><td>3</td><td>Designing Around Interference</td><td>upcoming</td></tr>
<tr><td>4</td><td>Power and Sample Size</td><td>upcoming</td></tr>
<tr><td>5</td><td><a href="../module-05/slides.html">Analyzing Experiments</a></td><td>✓ done</td></tr>
<tr><td>6</td><td>Multiple Testing & Subgroups</td><td>upcoming</td></tr>
<tr><td>7</td><td><a href="../module-07/slides.html">External Validity</a></td><td>✓ done</td></tr>
<tr><td>8</td><td><a href="../module-08/slides.html">Beyond the A/B Test</a></td><td>✓ done</td></tr>
</table>

---

# The Setup: A Zone-Notification Experiment

When a driver is heading into a zone with higher-than-average subsequent demand, a ride-sharing platform wants to know: **does sending the driver a push notification affect their decision to accept the next ride offer — and their earnings?**

--

The product team looks at the data:

|Group        |   N| Accept rate|
|:------------|---:|-----------:|
|Not notified | 489|       0.342|
|Notified     | 511|       0.481|

--

"The notification increases the accept rate by ~14 percentage points! Ship it!"

...or does it?

---

# The Fundamental Problem

We want to know the **causal effect** of the notification on each driver `\(i\)`'s accept decision:

`$$\tau_i = Y_i(1) - Y_i(0)$$`

--

| Driver | `\(Y_i(1)\)` (notified) | `\(Y_i(0)\)` (not notified) | Effect |
|--------|-------------------------|-----------------------------|--------|
| Alice  | 1 (accepted)            | ?                           | ?      |
| Bob    | ?                       | 0 (declined)                | ?      |
| Carol  | 1 (accepted)            | ?                           | ?      |
| Dave   | ?                       | 1 (accepted)                | ?      |

--

We only observe **one** potential outcome per driver. The other is the *counterfactual* — it doesn't exist in the data. We can never compute `\(\tau_i\)` directly.

---

# What We *Can* Estimate: Average Effects

Even though individual effects are unobservable, we can estimate **averages**:

--

- **ATE** = `\(E[Y_i(1) - Y_i(0)]\)` — average effect across *everyone*

--

- **ATT** = `\(E[Y_i(1) - Y_i(0) \mid D_i = 1]\)` — average effect on those who *got* treatment

--

- **ATU** = `\(E[Y_i(1) - Y_i(0) \mid D_i = 0]\)` — average effect on the *untreated*

--

.highlight-box[
**When do these differ?** When (1) treatment effects are heterogeneous AND (2) who gets treated is correlated with the effect. Example: if experienced drivers both opt into notifications AND benefit more from them, then ATT > ATE > ATU. The sketch on the next slide makes this concrete.
]
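---

# Sketch: Producing ATT > ATE > ATU

A minimal sketch of the scenario in the box. This is a **hypothetical DGP** chosen only to produce the ordering; it is not the module's running example, which appears later.

```r
library(tidyverse)
set.seed(1)

n_demo <- 100000
demo <- tibble(
  experience = rnorm(n_demo),
  # heterogeneous effect: experienced drivers benefit MORE
  tau = 0.05 + 0.03 * experience,
  # selection: experienced drivers opt in MORE
  D = rbinom(n_demo, 1, plogis(experience))
)

demo |>
  summarise(
    ATE = mean(tau),
    ATT = mean(tau[D == 1]),
    ATU = mean(tau[D == 0])
  )
# ATT > ATE > ATU: treatment uptake is positively correlated with the effect
```

Flip the sign of either the heterogeneity or the selection and the ordering reverses; the running example later in this module does exactly that (novices benefit more, so ATT < ATE < ATU).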
---
name: selection-decomp

# Selection Bias: The Decomposition

The naive comparison — difference in observed means — is:

`$$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]$$`

--

This equals:

`$$= \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{ATT}} + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}$$`

--

.highlight-box[
**Selection bias** = the difference in *baseline* outcomes between the groups. Drivers who opted into notifications are more experienced → they would have accepted more rides *even without the notification*.
]

<a href="#selection-proof" class="btn-link">see proof →</a>

---

# Selection Bias: Visualized

<img src="slides_files/figure-html/selection-bias-viz-1.png" style="display: block; margin: auto;" />

The notified group has higher experience on average → higher accept rate *regardless of the notification*. The naive estimate conflates the notification effect with pre-existing differences.

---

# Randomization: The Fix

Randomly assign 500 drivers to receive the notification, 500 to control:

```r
library(tidyverse)  # for tibble() and the verbs used below

set.seed(132)
n <- 1000
rct <- tibble(
  driver_id = 1:n,
  experience = rnorm(n),
  # Random assignment — independent of experience
  notification = sample(rep(c(0, 1), each = n/2)),
  # LPM outcome: P(accept) = 0.4 + 0.2*experience + 0.05*notification (clipped)
  # True effect of notification = 0.05 (5pp)
  accepted = rbinom(n, 1, prob = pmin(1, pmax(0, 0.4 + 0.2 * experience + 0.05 * notification)))
)
```

--

|Group    |   N| Mean experience| Accept rate|
|:--------|---:|---------------:|-----------:|
|Control  | 500|           0.011|       0.402|
|Notified | 500|           0.015|       0.452|

Assignment is independent of experience by construction, so experience is balanced (up to sampling noise). The difference in accept rate is an unbiased estimate of the ATE.

---

# Why Does Randomization Work?

Random assignment makes treatment independent of potential outcomes:

`$$(Y_i(1), Y_i(0)) \perp D_i$$`

--

This kills selection bias:

`$$E[Y_i(0) \mid D_i = 1] = E[Y_i(0) \mid D_i = 0]$$`

--

So the simple difference in means equals the ATE:

`$$E[Y \mid D=1] - E[Y \mid D=0] = E[Y(1)] - E[Y(0)] = \text{ATE}$$`

--

.blue-box[
**But:** randomization guarantees this *in expectation*, not in every sample. Any given experiment can have imbalanced covariates by chance. That's what inference (confidence intervals, p-values) is for.
]

---

# Simulation: Randomization Works in Expectation

<img src="slides_files/figure-html/sim-many-rcts-1.png" style="display: block; margin: auto;" />

.small[
**What sets the spread?** The standard error of the difference in means, `\(\text{SE} = \sqrt{\sigma_1^2/n_1 + \sigma_0^2/n_0}\)`. Larger `\(n\)` → narrower; less outcome variance ( `\(\sigma^2\)` ) → narrower. Pre-treatment covariates or stratification can tighten it further (Module 5).
]

---

# The Naive Estimate Is Biased

**Naive** = *observational*: no randomization. Drivers self-select into notifications — more experienced drivers opt in with higher probability, `\(P(D=1) = 0.5 + 0.15 \cdot \text{experience}\)` (clipped to `\([0,1]\)`). We then compute the simple difference in accept rates, as if it were an RCT.

<img src="slides_files/figure-html/sim-naive-1.png" style="display: block; margin: auto;" />

The naive distribution is shifted right — it **overestimates** the effect because experienced drivers both opt into notifications and accept more. The sketch on the next slide reproduces both sampling distributions.
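---

# Sketch: Reproducing the Two Sampling Distributions

A compact version of the simulation behind the last two figures, following the DGP described above. The helper name `one_world()` and the replication count are my choices, not taken from the original code:

```r
# One simulated "world": same drivers, two assignment mechanisms
one_world <- function(n = 1000) {
  experience <- rnorm(n)
  d_rct <- sample(rep(c(0, 1), each = n / 2))                       # randomized
  d_obs <- rbinom(n, 1, pmin(1, pmax(0, 0.5 + 0.15 * experience)))  # self-selected
  p_accept <- function(d) pmin(1, pmax(0, 0.4 + 0.2 * experience + 0.05 * d))
  y_rct <- rbinom(n, 1, p_accept(d_rct))
  y_obs <- rbinom(n, 1, p_accept(d_obs))
  c(rct   = mean(y_rct[d_rct == 1]) - mean(y_rct[d_rct == 0]),
    naive = mean(y_obs[d_obs == 1]) - mean(y_obs[d_obs == 0]))
}

set.seed(1)
draws <- t(replicate(2000, one_world()))
colMeans(draws)
# rct centers on the true 0.05; naive centers well above it
```

Histogram the two columns and you should recover the shapes of the figures above; the gap between the naive mean and 0.05 is the selection bias.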
---

# Numerical Decomposition

.pull-left[

|Component        | Value|
|:----------------|-----:|
|Naive estimate   | 0.163|
|= ATT            | 0.050|
|+ Selection bias | 0.113|

]

.pull-right[
- **ATT** = the real causal effect on treated drivers
- **Selection bias** = treated drivers had higher `\(Y(0)\)` (they would have accepted more even without the notification)
- **Naive** = ATT + bias → overstates the effect
]

---

# ATE vs ATT vs ATU: When It Matters

Recap — **ATE**: all drivers · **ATT**: opt-in drivers · **ATU**: non-opt-in drivers.

**New scenario:** effects are now *heterogeneous* — novices benefit more than experienced drivers. The average is still ~5pp, but ATT and ATU diverge.

.pull-left[

```r
# Heterogeneous DGP
rct_het <- tibble(
  experience = rnorm(n),
  tau = pmax(0, 0.05 - 0.05 * experience),
  y0 = pmin(1, pmax(0, 0.4 + 0.2 * experience)),
  y1 = pmin(1, pmax(0, y0 + tau)),
  # self-selected (observational)
  D_obs = rbinom(n, 1, prob = pmin(1, pmax(0, 0.5 + 0.15 * experience))),
  # randomized
  D_rct = sample(rep(c(0, 1), each = n/2))
)
```

]

.pull-right[

```r
rct_het |>
  summarise(
    ATE = mean(y1 - y0),
    ATT_obs = mean((y1 - y0)[D_obs == 1]),
    ATU_obs = mean((y1 - y0)[D_obs == 0]),
    ATT_rct = mean((y1 - y0)[D_rct == 1]),
    ATU_rct = mean((y1 - y0)[D_rct == 0])
  ) |>
  round(3)
```

```
## # A tibble: 1 × 5
##     ATE ATT_obs ATU_obs ATT_rct ATU_rct
##   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 0.054   0.039   0.068   0.054   0.053
```

]

---

# Estimand vs Estimator: Simulation ≠ Reality

**Simulation**: we constructed both `y0` and `y1` for every unit, so `mean(y1 - y0)` gives the *true* ATE.

**Reality**: for each unit we observe only one outcome, `\(Y_i = D_i \cdot y1_i + (1 - D_i) \cdot y0_i\)`. The other is counterfactual.

.pull-left[

```r
rct_het |>
  mutate(
    # observed = realized potential
    Y_obs = if_else(D_obs == 1, y1, y0),
    Y_rct = if_else(D_rct == 1, y1, y0)
  ) |>
  summarise(
    true_ATE = mean(y1 - y0),
    naive_obs = mean(Y_obs[D_obs == 1]) - mean(Y_obs[D_obs == 0]),
    naive_rct = mean(Y_rct[D_rct == 1]) - mean(Y_rct[D_rct == 0])
  ) |>
  round(3)
```

]

.pull-right[

```
## # A tibble: 1 × 3
##   true_ATE naive_obs naive_rct
##      <dbl>     <dbl>     <dbl>
## 1    0.054     0.166     0.048
```

- `true_ATE` requires both potential outcomes → available only in simulation.
- `naive_rct` ≈ `true_ATE` → **the RCT identifies the ATE** (and ATT, ATU — all equal under randomization).
- `naive_obs` is biased upward — it mixes ATT with selection bias.
- RCTs don't recover `ATT_obs` (the effect on *actual adopters*) — that's external validity (Module 7).
]

---

# Where Does the Bias Come From?

.pull-left[

Decompose the naive estimator:

`naive_obs` = `ATT_obs` + selection bias

where selection bias `\(= E[y_0 \mid D{=}1] - E[y_0 \mid D{=}0]\)`.

```r
rct_het |>
  summarise(
    naive_obs = mean(y1[D_obs == 1]) - mean(y0[D_obs == 0]),
    ATT_obs = mean((y1 - y0)[D_obs == 1]),
    sel_bias = mean(y0[D_obs == 1]) - mean(y0[D_obs == 0])
  ) |>
  round(3)
```

```
## # A tibble: 1 × 3
##   naive_obs ATT_obs sel_bias
##       <dbl>   <dbl>    <dbl>
## 1     0.166   0.039    0.126
```

]

.pull-right[

**Two distinct gaps:**

- **ATT_obs vs ATE** (≈ 0.015): driven by *effect heterogeneity* — how much τ varies across who opts in. Modest here.
- **naive_obs vs ATE** (≈ 0.112): driven by *baseline y0 differences* — experienced drivers have higher y0 **and** opt in more.

.highlight-box[
**Key intuition:** selection bias knows nothing about the treatment effect. Even if the notification did *nothing*, the naive comparison would still show a ~0.126 gap — just from who selects in. The placebo check on the next slide demonstrates this.
]
]

**ATT / ATU coming up:** M5 (**LATE** = the average effect among compliers), M6 (subgroup heterogeneity), M7 (`ATT_obs` via external validity).
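---

# Sketch: The Placebo Check

To see that selection bias needs no treatment effect at all, here is a minimal placebo sketch (the variable `y1_null` is mine): give everyone a treatment that does nothing, keep the same self-selection, and recompute the naive comparison.

```r
rct_het |>
  mutate(y1_null = y0) |>  # placebo: "treated" outcome identical to control outcome
  summarise(
    naive_placebo = mean(y1_null[D_obs == 1]) - mean(y0[D_obs == 0])
  )
# equals sel_bias from the previous slide (~0.126): pure selection, zero effect
```

Algebraically this must reproduce `sel_bias` exactly: with `y1_null = y0`, the "naive estimate" is just the baseline gap between the self-selected groups.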
---

# SUTVA: A Preview

Everything so far assumed:

1. **No interference:** Alice's notification doesn't affect Bob's accept decision
2. **No hidden versions:** the notification is the same for everyone

--

.highlight-box[
In marketplace experiments, **both assumptions are routinely violated:**

- Notifying many drivers sends them toward the same zone → more supply there, fewer rides per driver; and non-notified drivers face less competition elsewhere
- The "same" notification has different effects depending on time of day, traffic, and driver earnings downstream
]

--

This is Module 2's topic — and it's what tech interviewers care about most. The experimental ideal from this module is the *starting point*, not the ending point.

---

# Key Takeaways

1. **Potential outcomes** define the causal effect: `\(\tau_i = Y_i(1) - Y_i(0)\)`. We can never observe both.
2. The naive comparison = **ATT + selection bias**. Selection bias = pre-existing differences between groups.
3. **Randomization** makes treatment independent of potential outcomes → kills selection bias → the simple difference in means is unbiased for the ATE.
4. Randomization works **in expectation**, not in every sample. Inference handles the sampling variation.
5. **ATE ≠ ATT ≠ ATU** when effects are heterogeneous and treatment is correlated with the effect. Always ask "for whom?"
6. **SUTVA** (no interference, no hidden versions) is assumed throughout — and almost always violated in tech. That's next.

---

# Exercise Preview

In the exercise you will:

1. Simulate potential outcomes for 1,000 drivers with heterogeneous effects
2. Show that the naive comparison is biased (reproduce the decomposition)
3. Show that randomization eliminates bias in expectation (500 simulated RCTs)
4. Compute ATE, ATT, ATU and show when they diverge
5. Break SUTVA in a simple simulation and see what happens to the estimate

See `exercise.R` for the starter code.

---

# Where We Are

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td><b>1</b></td><td><b>The Experimental Ideal</b></td><td>✓ done</td></tr>
<tr><td>2</td><td><a href="../module-02/slides.html">SUTVA and When It Breaks</a> <i>(next)</i></td><td>up next</td></tr>
<tr><td>3</td><td>Designing Around Interference</td><td>upcoming</td></tr>
<tr><td>4</td><td>Power and Sample Size</td><td>upcoming</td></tr>
<tr><td>5</td><td><a href="../module-05/slides.html">Analyzing Experiments</a></td><td>✓ done</td></tr>
<tr><td>6</td><td>Multiple Testing & Subgroups</td><td>upcoming</td></tr>
<tr><td>7</td><td><a href="../module-07/slides.html">External Validity</a></td><td>✓ done</td></tr>
<tr><td>8</td><td><a href="../module-08/slides.html">Beyond the A/B Test</a></td><td>✓ done</td></tr>
</table>

---

class: center, middle, inverse

# Backup Slides

---

name: selection-proof

# Backup: Deriving the Selection Bias Decomposition

Start with the naive estimand (difference in observed conditional means):

`$$E[Y \mid D=1] - E[Y \mid D=0]$$`

--

Since `\(Y = D \cdot Y(1) + (1-D) \cdot Y(0)\)`, the treated group shows `\(Y(1)\)` and the control group shows `\(Y(0)\)`:

`$$= E[Y(1) \mid D=1] - E[Y(0) \mid D=0]$$`

--

Add and subtract `\(E[Y(0) \mid D=1]\)`:

`$$= \underbrace{E[Y(1) \mid D=1] - E[Y(0) \mid D=1]}_{\text{ATT}} + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}}$$`

--

Under randomization, `\(E[Y(0) \mid D=1] = E[Y(0) \mid D=0]\)`, so selection bias = 0 and ATT = ATE.
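--

Because the simulation constructs both potential outcomes, we can check this equality directly, which real data never allows. A sketch using the `rct_het` data from earlier:

```r
rct_het |>
  summarise(
    gap_rct = mean(y0[D_rct == 1]) - mean(y0[D_rct == 0]),  # ~0, up to sampling noise
    gap_obs = mean(y0[D_obs == 1]) - mean(y0[D_obs == 0])   # ~0.126: the selection bias
  )
```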
<a href="#selection-decomp" class="btn-link">← back</a>
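---

# Backup: Chance Imbalance and the Balance Check

"Randomization works in expectation" means any single experiment can still be imbalanced by luck. A sketch of the standard check, assuming the `rct` data from the randomization slide (the helper `diff_in_means()` is mine; it implements the SE formula from the simulation slide):

```r
# Difference in means with its standard error and a 95% CI
diff_in_means <- function(y, d) {
  est <- mean(y[d == 1]) - mean(y[d == 0])
  se  <- sqrt(var(y[d == 1]) / sum(d == 1) + var(y[d == 0]) / sum(d == 0))
  c(estimate = est, se = se, ci_lo = est - 1.96 * se, ci_hi = est + 1.96 * se)
}

# Balance check on the pre-treatment covariate: the CI should cover 0
diff_in_means(rct$experience, rct$notification)

# Treatment effect: the CI should cover the true 0.05 in ~95% of replications
diff_in_means(rct$accepted, rct$notification)
```

Base R's `t.test(experience ~ notification, data = rct)` gives the same check with a Welch correction.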