class: center, middle, inverse, title-slide .title[ # Module 1: The Learning Problem ] .subtitle[ ## Disparate Wait Times in Ride-Sharing ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .small .remark-code, .small table { font-size: 70%; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td><b>1</b></td><td><b>The Learning Problem</b> <i>(you are here)</i></td><td>← current</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">Linear Models</a></td><td>✓ done</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Model Evaluation & Selection</a></td><td>✓ done</td></tr> <tr><td>4</td><td>Tree-Based Methods</td><td>upcoming</td></tr> <tr><td>5</td><td>Unsupervised Learning</td><td>upcoming</td></tr> <tr><td>6</td><td>Neural Networks Foundations</td><td>upcoming</td></tr> <tr><td>7</td><td><a href="../module-07/slides.html">Fairness Frameworks & Metrics</a></td><td>✓ done</td></tr> <tr><td>8</td><td>Auditing & Interpretability</td><td>upcoming</td></tr> </table> --- # The Setup We observe data `\((X, Y)\)` and want to learn a function `\(f\)` such that `$$Y = f(X) + \varepsilon, \quad \mathbb{E}[\varepsilon] = 0, \quad \text{Var}(\varepsilon) = \sigma^2$$` -- - Choose `\(\hat{f}\)` from some model class by minimizing a **loss function** on training data -- - Evaluate on **held-out** data (validation / test) -- - `\(\varepsilon\)` is irreducible — no model can eliminate it --- # Loss Functions | Task | Loss | Formula | |------|------|---------| | Regression | MSE | `\(\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2\)` | | Classification | 0-1 loss | `\(\frac{1}{n} \sum_i \mathbb{I}(y_i \neq \hat{y}_i)\)` | | Classification | Cross-entropy | `\(-\frac{1}{n} \sum_i \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]\)` | -- - MSE penalizes large errors more heavily -- - Cross-entropy rewards **well-calibrated probabilities** — preferred for probabilistic classifiers --- # Train / Validation / Test |
Set | Purpose | When used | |-----|---------|-----------| | **Training** | Fit the model parameters | During training | | **Validation** | Choose hyperparameters, compare models | During model selection | | **Test** | Final unbiased performance estimate | Once, at the end | -- Cross-validation approximates the validation step by rotating through folds. --- # Example: Validation Picks the Hyperparameter Train three models on the **same** training set, evaluate on **validation**: ```r set.seed(99) demo <- tibble(x = runif(200, 0, 10), y = sin(x) + rnorm(200, 0, 0.3)) demo_train <- demo[1:140, ] demo_val <- demo[141:200, ] # Model A: degree-5 polynomial (less flexible) fit_A <- lm(y ~ poly(x, 5), data = demo_train) # Model B: degree-8 polynomial (moderate flexibility) fit_B <- lm(y ~ poly(x, 8), data = demo_train) # Model C: degree-20 polynomial (very flexible) fit_C <- lm(y ~ poly(x, 20), data = demo_train) ``` --- # Validation Results |Model | # Params| Train MSE| Val MSE| |:-------------|--------:|---------:|-------:| |A (degree 5) | 6| 0.1243| 0.1764| |B (degree 8) | 9| 0.0946| 0.1243| |C (degree 20) | 21| 0.0882| 0.1414| -- - Model C wins on **training** data (lowest Train MSE) — it memorizes best -- - Model B wins on **validation** data (lowest Val MSE) — it generalizes best -- - Too simple (A) → underfits. Too complex (C) → overfits. 
**Validation finds the sweet spot (B).** --- # Example: What Validation Reveals <img src="slides_files/figure-html/validation-plot-1.png" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Bias-Variance Tradeoff --- # The Decomposition For any estimate `\(\hat{f}\)`, the expected prediction error at a point `\(x_0\)` is: `$$\mathbb{E}\left[(Y - \hat{f}(x_0))^2\right] = \underbrace{\left[\text{Bias}(\hat{f}(x_0))\right]^2}_{\text{systematic error}} + \underbrace{\text{Var}(\hat{f}(x_0))}_{\text{sensitivity to training set}} + \underbrace{\sigma^2}_{\text{irreducible}}$$` -- where: - `\(\text{Bias}(\hat{f}(x_0)) = f(x_0) - \mathbb{E}[\hat{f}(x_0)]\)` - `\(\text{Var}(\hat{f}(x_0)) = \mathbb{E}\left[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2\right]\)` --- # Proof — Step 1: Separate Noise from Estimation Start by substituting `\(Y = f(x_0) + \varepsilon\)`: `$$\mathbb{E}[(Y - \hat{f})^2] = \mathbb{E}[((f - \hat{f}) + \varepsilon)^2]$$` -- Expand the square: `$$= \mathbb{E}[(f - \hat{f})^2] + 2\,\mathbb{E}[(f - \hat{f})\,\varepsilon] + \mathbb{E}[\varepsilon^2]$$` -- The cross term vanishes: `\(\varepsilon\)` is independent of `\(\hat{f}\)` and `\(\mathbb{E}[\varepsilon] = 0\)` `$$= \mathbb{E}[(f(x_0) - \hat{f}(x_0))^2] + \sigma^2$$` --- # Proof — Step 2: Add and Subtract the Mean The trick: insert `\(\mathbb{E}[\hat{f}]\)` into the estimation error: `$$f - \hat{f} = \underbrace{(f - \mathbb{E}[\hat{f}])}_{\text{Bias}} + \underbrace{(\mathbb{E}[\hat{f}] - \hat{f})}_{\text{deviation from mean}}$$` -- Square and take expectation: `$$\mathbb{E}[(f - \hat{f})^2] = \text{Bias}^2 + 2 \cdot \text{Bias} \cdot \underbrace{\mathbb{E}[\mathbb{E}[\hat{f}] - \hat{f}]}_{= 0} + \underbrace{\mathbb{E}[(\hat{f} - \mathbb{E}[\hat{f}])^2]}_{\text{Var}(\hat{f})}$$` -- The cross term vanishes because `\(\mathbb{E}[\hat{f} - \mathbb{E}[\hat{f}]] = 0\)`. 
--- # Proof — Result `$$\boxed{\mathbb{E}\left[(Y - \hat{f}(x_0))^2\right] = \left[\text{Bias}(\hat{f}(x_0))\right]^2 + \text{Var}(\hat{f}(x_0)) + \sigma^2}$$` -- | Component | Simple model (e.g., linear) | Complex model (e.g., degree-20 poly) | |-----------|---------------------------|--------------------------------------| | Bias² | High | Low | | Variance | Low | High | -- The **sweet spot** minimizes their sum → this is model selection. --- # Back to Our Models — Where Are We Zooming In? <img src="slides_files/figure-html/zoom-context-plain-1.png" style="display: block; margin: auto;" /> --- count: false # Back to Our Models — Where Are We Zooming In? <img src="slides_files/figure-html/zoom-context-box-1.png" style="display: block; margin: auto;" /> --- # Seeing It: 100 Simulated Training Sets <img src="slides_files/figure-html/validation-zoom-1.png" style="display: block; margin: auto;" /> -- **Variance** = spread of the orange dots around their mean. **Bias** = gap between the mean (X) and the true f(x). --- # Model B (degree 8): Less Bias, More Variance <img src="slides_files/figure-html/validation-zoom-B-1.png" style="display: block; margin: auto;" /> -- Bias is smaller, but variance is larger — the **tradeoff** in action. --- # Model C (degree 20): Even Less Bias, Even More Variance <img src="slides_files/figure-html/validation-zoom-C-1.png" style="display: block; margin: auto;" /> -- Most flexible model: bias vanishes but variance explodes — **overfitting**.
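---

# Sketch: Estimating Bias² and Variance by Simulation

The zoom plots above come from refitting on many simulated training sets; that loop can be sketched in a few lines. A minimal sketch, assuming the same setup as the demo data (true `f(x) = sin(x)`, noise sd 0.3, training sets of 140 points); `f`, `x0`, and `n_sims` are illustrative names, not course code:

```r
set.seed(1)
f  <- function(x) sin(x)   # true regression function
x0 <- 5                    # the point we zoom in on
n_sims <- 500

# Refit a degree-8 polynomial on 500 fresh training sets and
# record each fit's prediction at x0
preds <- replicate(n_sims, {
  x <- runif(140, 0, 10)
  y <- f(x) + rnorm(140, 0, 0.3)
  fit <- lm(y ~ poly(x, 8))
  predict(fit, newdata = data.frame(x = x0))
})

bias_sq  <- (f(x0) - mean(preds))^2   # squared gap: mean fit vs truth
variance <- var(preds)                # spread of the fits around their mean
c(bias_sq = bias_sq, variance = variance,
  total = bias_sq + variance + 0.3^2) # add sigma^2 = 0.09
```

Swapping in degree 5 or degree 20 for the polynomial degree reproduces the pattern in the decomposition table that follows: bias² falls and variance rises with flexibility.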
--- # Connecting the Pieces: MSE = Bias² + Variance + σ² |Model | Bias²| Variance| σ²| Bias²+Var+σ²| Avg MSE (500 sims)| Val MSE (1 sample)| Train MSE| |:-------------|------:|--------:|----:|------------:|------------------:|------------------:|---------:| |A (degree 5) | 0.0234| 0.0035| 0.09| 0.1169| 0.1169| 0.1911| 0.1243| |B (degree 8) | 0.0000| 0.0040| 0.09| 0.0940| 0.0940| 0.1282| 0.0946| |C (degree 20) | 0.0000| 0.0110| 0.09| 0.1010| 0.1010| 0.1360| 0.0882| -- - **Bias² + Var + σ²** matches **Avg MSE** (averaged over 500 simulations) — the decomposition works! -- - **Val MSE (1 sample)** is what we observe in practice — a single draw, so it won't match exactly -- - In practice, you **cannot** compute Bias² or Variance from one training set. The decomposition is a conceptual tool that explains *why* Val MSE has a U-shape -- - σ² = 0.09 is the **floor** — no model can beat it (irreducible noise) --- class: inverse, center, middle # Discrimination Angle ### Statistical Bias ≠ Social Bias --- # Two Kinds of Bias **Statistical bias:** model's average prediction is off — it systematically over- or under-estimates. `$$\text{Bias} = f(x_0) - \mathbb{E}[\hat{f}(x_0)]$$` Fix: use a more flexible model. -- **Social bias (disparate impact):** model treats groups differently in ways that cause harm — even if statistically unbiased overall. Fix: ...it's complicated. That's the rest of this course. --- # A Concrete Example Imagine a wait-time prediction model: -- - Predicts average wait = 5 min across **all** neighborhoods ✅ (low bias!) -- - But: **3 min** in wealthy/white neighborhoods, **8 min** in low-income/Black neighborhoods -- - The model isn't *wrong* — it accurately reflects the *system* -- - The system itself encodes discrimination, and the model faithfully learns it -- > **A statistically good model can be a socially harmful model.** --- # How This Plays Out at Uber/Lyft 1. 
**Dispatch optimization** minimizes total wait → sends more drivers where demand is dense → underserves sparse/low-income areas -- 2. **Surge pricing** responds to supply-demand imbalance → areas with fewer drivers get higher prices → low-income riders pay more -- 3. **ETA prediction** trained on historical data → if service was worse in certain neighborhoods, the model predicts longer waits → drivers avoid those areas → **self-fulfilling prophecy** --- # What to Always Ask When evaluating any model: -- 1. What is the loss **overall**? -- 2. What is the loss **per group**? -- 3. Are error rates **equal** across groups? -- This is the foundation for fairness metrics in Modules 7–8. --- class: inverse, center, middle # Exercise ### Let's See It in the Data --- # Part 1: Synthetic Ride-Sharing Data ```r set.seed(42) n <- 2000 neighborhoods <- tibble( neighborhood = c("Downtown", "Midtown", "Eastside", "Southside"), base_wait = c(3, 4, 7, 9), driver_supply = c(50, 35, 15, 10), pct_minority = c(0.20, 0.35, 0.70, 0.85), median_income = c(95000, 72000, 38000, 29000) ) rides <- tibble( ride_id = 1:n, neighborhood = sample( neighborhoods$neighborhood, n, replace = TRUE, prob = c(0.35, 0.30, 0.20, 0.15) ) ) |> left_join(neighborhoods, by = "neighborhood") |> mutate( hour = sample(0:23, n, replace = TRUE), is_peak = as.integer(hour %in% c(7, 8, 9, 17, 18, 19)), peak_penalty = is_peak * (10 - driver_supply / 10), wait_time = pmax(base_wait + peak_penalty + rnorm(n, 0, 1.5), 0.5) ) ``` --- # Wait Time Distributions <img src="slides_files/figure-html/density-plot-1.png" style="display: block; margin: auto;" /> Downtown and Midtown cluster around 3–4 min. Eastside and Southside: 7–9 min.
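---

# Sketch: A Per-Group Error Audit

The three questions from "What to Always Ask" can be wrapped in one small helper so they get asked every time. A minimal base-R sketch; `audit_by_group` is a hypothetical name, not part of the course code:

```r
# Hypothetical helper: overall loss, per-group loss, and the
# worst-to-best MSE ratio across groups
audit_by_group <- function(actual, predicted, group) {
  sq_err <- (actual - predicted)^2
  per_group <- tapply(sq_err, group, mean)       # Q2: loss per group
  list(
    overall   = mean(sq_err),                    # Q1: loss overall
    per_group = per_group,
    disparity = max(per_group) / min(per_group)  # Q3: are errors equal?
  )
}
```

Once the test set and fitted model exist (Part 3 below), this would be called as `audit_by_group(test$wait_time, predict(best_fit, test), test$neighborhood)`.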
--- # Summary by Neighborhood ```r rides |> group_by(neighborhood) |> summarise( n_rides = n(), avg_wait = round(mean(wait_time), 1), pct_minority = first(pct_minority), median_income = scales::dollar(first(median_income)), .groups = "drop" ) |> arrange(avg_wait) |> knitr::kable() ``` |neighborhood | n_rides| avg_wait| pct_minority|median_income | |:------------|-------:|--------:|------------:|:-------------| |Downtown | 723| 4.3| 0.20|$95,000 | |Midtown | 578| 5.6| 0.35|$72,000 | |Eastside | 403| 8.9| 0.70|$38,000 | |Southside | 296| 11.3| 0.85|$29,000 | --- # Part 2: Train / Val / Test Split ```r set.seed(123) train_idx <- sample(1:n, 0.6 * n) val_idx <- sample(setdiff(1:n, train_idx), 0.2 * n) test_idx <- setdiff(1:n, c(train_idx, val_idx)) train <- rides[train_idx, ] val <- rides[val_idx, ] test <- rides[test_idx, ] ``` --- # Part 2: Fit Polynomials Fit polynomials of degree 1 through 15 on `hour`: ```r degrees <- 1:15 results <- map_dfr(degrees, function(d) { fit <- lm(wait_time ~ poly(hour, d), data = train) tibble( degree = d, train_mse = mean((train$wait_time - predict(fit, train))^2), val_mse = mean((val$wait_time - predict(fit, val))^2) ) }) ``` --- # The Classic Tradeoff Curve <img src="slides_files/figure-html/bv-plot-1.png" style="display: block; margin: auto;" /> --- # Part 3: Disaggregate by Group Fit the best model and check errors **per neighborhood**: ```r best_degree <- results$degree[which.min(results$val_mse)] best_fit <- lm(wait_time ~ poly(hour, best_degree), data = train) test |> mutate(predicted = predict(best_fit, test)) |> group_by(neighborhood) |> summarise( n = n(), avg_actual = round(mean(wait_time), 1), avg_predicted = round(mean(predicted), 1), mse = round(mean((wait_time - predicted)^2), 1), pct_minority = first(pct_minority), .groups = "drop" ) |> arrange(pct_minority) |> knitr::kable() ``` |neighborhood | n| avg_actual| avg_predicted| mse| pct_minority| |:------------|---:|----------:|-------------:|----:|------------:| |Downtown 
| 131| 4.2| 6.6| 9.7| 0.20| |Midtown | 115| 5.6| 6.7| 4.3| 0.35| |Eastside | 81| 9.9| 7.1| 12.7| 0.70| |Southside | 73| 11.6| 6.8| 26.8| 0.85| --- # Error Concentrates in Minority Areas <img src="slides_files/figure-html/mse-bar-1.png" style="display: block; margin: auto;" /> --- # Part 4: Add Neighborhood as a Feature .small[ ```r better_fit <- lm(wait_time ~ poly(hour, best_degree) + neighborhood, data = train) test |> mutate(pred_v1 = predict(best_fit, test), pred_v2 = predict(better_fit, test)) |> group_by(neighborhood) |> summarise( avg_actual = round(mean(wait_time), 1), avg_pred_v1 = round(mean(pred_v1), 1), avg_pred_v2 = round(mean(pred_v2), 1), mse_v1 = round(mean((wait_time - pred_v1)^2), 1), mse_v2 = round(mean((wait_time - pred_v2)^2), 1), pct_minority = first(pct_minority), .groups = "drop" ) |> arrange(pct_minority) |> knitr::kable() ``` |neighborhood | avg_actual| avg_pred_v1| avg_pred_v2| mse_v1| mse_v2| pct_minority| |:------------|----------:|-----------:|-----------:|------:|------:|------------:| |Downtown | 4.2| 6.6| 4.2| 9.7| 4.0| 0.20| |Midtown | 5.6| 6.7| 5.7| 4.3| 2.9| 0.35| |Eastside | 9.9| 7.1| 9.4| 12.7| 4.6| 0.70| |Southside | 11.6| 6.8| 11.7| 26.8| 3.5| 0.85| ] Model v2 is **more accurate** — but it has **learned the disparity**. --- class: inverse # The Key Questions <br> ### 1. Should a model predict what **IS** or what **SHOULD BE**? -- <br> ### 2. If we use these predictions for driver dispatch, do we perpetuate the very inequality we measured? -- <br> ### 3. Is a model that ignores neighborhood "fairer" even though it's less accurate? 
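---

# Putting Numbers on Question 3

Question 3's accuracy cost can be read straight off the Part 4 table. A quick sketch that plugs in the per-neighborhood test MSEs from that table and the test counts from Part 3 (131 + 115 + 81 + 73 rides):

```r
n      <- c(Downtown = 131, Midtown = 115, Eastside = 81, Southside = 73)
mse_v1 <- c(9.7, 4.3, 12.7, 26.8)  # hour-only model
mse_v2 <- c(4.0, 2.9, 4.6, 3.5)    # hour + neighborhood model

# Overall test MSE (weighted by group size) and worst-group MSE
round(c(overall_v1 = weighted.mean(mse_v1, n),
        overall_v2 = weighted.mean(mse_v2, n),
        worst_v1   = max(mse_v1),
        worst_v2   = max(mse_v2)), 1)
```

On these numbers v2 improves both the overall and the worst-group error; the real tension is not in the aggregate accuracy but in what dispatching on a model that has learned the disparity does downstream.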
--- # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td>The Learning Problem <i>(just finished)</i></td><td>✓ done</td></tr> <tr><td><b>2</b></td><td><b><a href="../module-02/slides.html">Linear Models →</a></b></td><td>next</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Model Evaluation & Selection</a></td><td>✓ done</td></tr> <tr><td>4</td><td>Tree-Based Methods</td><td>upcoming</td></tr> <tr><td>5</td><td>Unsupervised Learning</td><td>upcoming</td></tr> <tr><td>6</td><td>Neural Networks Foundations</td><td>upcoming</td></tr> <tr><td>7</td><td><a href="../module-07/slides.html">Fairness Frameworks & Metrics</a></td><td>✓ done</td></tr> <tr><td>8</td><td>Auditing & Interpretability</td><td>upcoming</td></tr> </table>