class: center, middle, inverse, title-slide .title[ # Module 1: The Learning Problem ] .subtitle[ ## Disparate Wait Times in Ride-Sharing ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .small .remark-code, .small table { font-size: 70%; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td><b>1</b></td><td><b>The Learning Problem</b> <i>(you are here)</i></td><td>← current</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">Linear Models</a></td><td>✓ done</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Model Evaluation & Selection</a></td><td>✓ done</td></tr> <tr><td>4</td><td>Tree-Based Methods</td><td>upcoming</td></tr> <tr><td>5</td><td>Unsupervised Learning</td><td>upcoming</td></tr> <tr><td>6</td><td>Neural Networks Foundations</td><td>upcoming</td></tr> <tr><td>7</td><td><a href="../module-07/slides.html">Fairness Frameworks & Metrics</a></td><td>✓ done</td></tr> <tr><td>8</td><td>Auditing & Interpretability</td><td>upcoming</td></tr> </table> --- # The Setup We observe data `\((X, Y)\)` and want to learn a function `\(f\)` such that `$$Y = f(X) + \varepsilon, \quad \mathbb{E}[\varepsilon] = 0, \quad \text{Var}(\varepsilon) = \sigma^2$$` -- - Choose `\(\hat{f}\)` from some model class by minimizing a **loss function** on training data -- - Evaluate on **held-out** data (validation / test) -- - `\(\varepsilon\)` is irreducible — no model can eliminate it --- # Loss Functions | Task | Loss | Formula | |------|------|---------| | Regression | MSE | `\(\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2\)` | | Classification | 0-1 loss | `\(\frac{1}{n} \sum_i \mathbb{I}(y_i \neq \hat{y}_i)\)` | | Classification | Cross-entropy | `\(-\frac{1}{n} \sum_i \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]\)` | -- - MSE penalizes large errors more heavily -- - Cross-entropy rewards **well-calibrated probabilities** — preferred for probabilistic classifiers --- # Train / Validation / Test |
Set | Purpose | When used | |-----|---------|-----------| | **Training** | Fit the model parameters | During training | | **Validation** | Choose hyperparameters, compare models | During model selection | | **Test** | Final unbiased performance estimate | Once, at the end | -- Cross-validation approximates the validation step by rotating through folds. --- # Example: Validation Picks the Hyperparameter Train three models on the **same** training set, evaluate on **validation**: ```r set.seed(99) demo <- tibble(x = runif(200, 0, 10), y = sin(x) + rnorm(200, 0, 0.3)) demo_train <- demo[1:140, ] demo_val <- demo[141:200, ] # Model A: degree-5 polynomial (less flexible) fit_A <- lm(y ~ poly(x, 5), data = demo_train) # Model B: degree-8 polynomial (moderate flexibility) fit_B <- lm(y ~ poly(x, 8), data = demo_train) # Model C: degree-20 polynomial (very flexible) fit_C <- lm(y ~ poly(x, 20), data = demo_train) ``` --- # Validation Results |Model | # Params| Train MSE| Val MSE| |:-------------|--------:|---------:|-------:| |A (degree 5) | 6| 0.1243| 0.1764| |B (degree 8) | 9| 0.0946| 0.1243| |C (degree 20) | 21| 0.0882| 0.1414| -- - Model C wins on **training** data (lowest Train MSE) — it memorizes best -- - Model B wins on **validation** data (lowest Val MSE) — it generalizes best -- - Too simple (A) → underfits. Too complex (C) → overfits. 
**Validation finds the sweet spot (B).** --- # Example: What Validation Reveals <img src="slides_files/figure-html/validation-plot-1.png" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Bias-Variance Tradeoff --- # The Decomposition For any estimate `\(\hat{f}\)`, the expected prediction error at a point `\(x_0\)` is: `$$\mathbb{E}\left[(Y - \hat{f}(x_0))^2\right] = \underbrace{\left[\text{Bias}(\hat{f}(x_0))\right]^2}_{\text{systematic error}} + \underbrace{\text{Var}(\hat{f}(x_0))}_{\text{sensitivity to training set}} + \underbrace{\sigma^2}_{\text{irreducible}}$$` -- where: - `\(\text{Bias}(\hat{f}(x_0)) = f(x_0) - \mathbb{E}[\hat{f}(x_0)]\)` - `\(\text{Var}(\hat{f}(x_0)) = \mathbb{E}\left[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2\right]\)` --- # Proof — Step 1: Separate Noise from Estimation Start by substituting `\(Y = f(x_0) + \varepsilon\)`: `$$\mathbb{E}[(Y - \hat{f})^2] = \mathbb{E}[((f - \hat{f}) + \varepsilon)^2]$$` -- Expand the square: `$$= \mathbb{E}[(f - \hat{f})^2] + 2\,\mathbb{E}[(f - \hat{f})\,\varepsilon] + \mathbb{E}[\varepsilon^2]$$` -- The cross term vanishes: `\(\varepsilon\)` is independent of `\(\hat{f}\)` and `\(\mathbb{E}[\varepsilon] = 0\)` `$$= \mathbb{E}[(f(x_0) - \hat{f}(x_0))^2] + \sigma^2$$` --- # Proof — Step 2: Add and Subtract the Mean The trick: insert `\(\mathbb{E}[\hat{f}]\)` into the estimation error: `$$f - \hat{f} = \underbrace{(f - \mathbb{E}[\hat{f}])}_{\text{Bias}} + \underbrace{(\mathbb{E}[\hat{f}] - \hat{f})}_{\text{deviation from mean}}$$` -- Square and take expectation: `$$\mathbb{E}[(f - \hat{f})^2] = \text{Bias}^2 + 2 \cdot \text{Bias} \cdot \underbrace{\mathbb{E}[\mathbb{E}[\hat{f}] - \hat{f}]}_{= 0} + \underbrace{\mathbb{E}[(\hat{f} - \mathbb{E}[\hat{f}])^2]}_{\text{Var}(\hat{f})}$$` -- The cross term vanishes because `\(\mathbb{E}[\hat{f} - \mathbb{E}[\hat{f}]] = 0\)`. 
--- # Proof — Result `$$\boxed{\mathbb{E}\left[(Y - \hat{f}(x_0))^2\right] = \left[\text{Bias}(\hat{f}(x_0))\right]^2 + \text{Var}(\hat{f}(x_0)) + \sigma^2}$$` -- | Component | Simple model (e.g., linear) | Complex model (e.g., degree-20 poly) | |-----------|---------------------------|--------------------------------------| | Bias² | High | Low | | Variance | Low | High | -- The **sweet spot** minimizes their sum → this is model selection. --- # Back to Our Models — Where Are We Zooming In? <img src="slides_files/figure-html/zoom-context-plain-1.png" style="display: block; margin: auto;" /> --- count: false # Back to Our Models — Where Are We Zooming In? <img src="slides_files/figure-html/zoom-context-box-1.png" style="display: block; margin: auto;" /> --- # Seeing It: 100 Simulated Training Sets <img src="slides_files/figure-html/validation-zoom-1.png" style="display: block; margin: auto;" /> -- **Variance** = spread of the orange dots around their mean. **Bias** = gap between the mean (X) and the true f(x). --- # Model B (degree 8): Less Bias, More Variance <img src="slides_files/figure-html/validation-zoom-B-1.png" style="display: block; margin: auto;" /> -- Bias is smaller, but variance is larger — the **tradeoff** in action. --- # Model C (degree 20): Even Less Bias, Even More Variance <img src="slides_files/figure-html/validation-zoom-C-1.png" style="display: block; margin: auto;" /> -- Most flexible model: bias vanishes but variance explodes — **overfitting**.
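---

# Sketch: Estimating Bias² and Variance by Simulation

The zoom plots above come from refitting on many simulated training sets; that loop can be sketched in a few lines. A minimal sketch, assuming the same setup as the demo data (true `f(x) = sin(x)`, noise sd 0.3, training sets of 140 points); `f`, `x0`, and `n_sims` are illustrative names, not course code:

```r
set.seed(1)
f  <- function(x) sin(x)   # true regression function
x0 <- 5                    # the point we zoom in on
n_sims <- 500

# Refit a degree-8 polynomial on 500 fresh training sets and
# record each fit's prediction at x0
preds <- replicate(n_sims, {
  x <- runif(140, 0, 10)
  y <- f(x) + rnorm(140, 0, 0.3)
  fit <- lm(y ~ poly(x, 8))
  predict(fit, newdata = data.frame(x = x0))
})

bias_sq  <- (f(x0) - mean(preds))^2   # squared gap: mean fit vs truth
variance <- var(preds)                # spread of the fits around their mean
c(bias_sq = bias_sq, variance = variance,
  total = bias_sq + variance + 0.3^2) # add sigma^2 = 0.09
```

Swapping in degree 5 or degree 20 for the polynomial degree reproduces the pattern in the decomposition table that follows: bias² falls and variance rises with flexibility.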
--- # Connecting the Pieces: MSE = Bias² + Variance + σ² |Model | Bias²| Variance| σ²| Bias²+Var+σ²| Avg MSE (500 sims)| Val MSE (1 sample)| Train MSE| |:-------------|------:|--------:|----:|------------:|------------------:|------------------:|---------:| |A (degree 5) | 0.0234| 0.0035| 0.09| 0.1169| 0.1169| 0.1911| 0.1243| |B (degree 8) | 0.0000| 0.0040| 0.09| 0.0940| 0.0940| 0.1282| 0.0946| |C (degree 20) | 0.0000| 0.0110| 0.09| 0.1010| 0.1010| 0.1360| 0.0882| -- - **Bias² + Var + σ²** matches **Avg MSE** (averaged over 500 simulations) — the decomposition works! -- - **Val MSE (1 sample)** is what we observe in practice — a single draw, so it won't match exactly -- - In practice, you **cannot** compute Bias² or Variance from one training set. The decomposition is a conceptual tool that explains *why* Val MSE has a U-shape -- - σ² = 0.09 is the **floor** — no model can beat it (irreducible noise) --- class: inverse, center, middle # Discrimination Angle ### Statistical Bias ≠ Social Bias --- # Two Kinds of Bias **Statistical bias:** model's average prediction is off — it systematically over- or under-estimates. `$$\text{Bias} = f(x_0) - \mathbb{E}[\hat{f}(x_0)]$$` Fix: use a more flexible model. -- **Social bias (disparate impact):** model treats groups differently in ways that cause harm — even if statistically unbiased overall. Fix: ...it's complicated. That's the rest of this course. --- # A Concrete Example Imagine a wait-time prediction model: -- - Predicts average wait = 5 min across **all** neighborhoods ✅ (low bias!) -- - But: **3 min** in wealthy/white neighborhoods, **8 min** in low-income/Black neighborhoods -- - The model isn't *wrong* — it accurately reflects the *system* -- - The system itself encodes discrimination, and the model faithfully learns it -- > **A statistically good model can be a socially harmful model.** --- # How This Plays Out at Uber/Lyft 1. 
**Dispatch optimization** minimizes total wait → sends more drivers where demand is dense → underserves sparse/low-income areas -- 2. **Surge pricing** responds to supply-demand imbalance → areas with fewer drivers get higher prices → low-income riders pay more -- 3. **ETA prediction** trained on historical data → if service was worse in certain neighborhoods, the model predicts longer waits → drivers avoid those areas → **self-fulfilling prophecy** --- # What to Always Ask When evaluating any model: -- 1. What is the loss **overall**? -- 2. What is the loss **per group**? -- 3. Are error rates **equal** across groups? -- This is the foundation for fairness metrics in Modules 7–8. --- class: inverse, center, middle # Exercise ### Let's See It in the Data --- # Part 1: Synthetic Ride-Sharing Data ```r set.seed(42) n <- 2000 neighborhoods <- tibble( neighborhood = c("Downtown", "Midtown", "Eastside", "Southside"), base_wait = c(3, 4, 7, 9), driver_supply = c(50, 35, 15, 10), pct_minority = c(0.20, 0.35, 0.70, 0.85), median_income = c(95000, 72000, 38000, 29000) ) rides <- tibble( ride_id = 1:n, neighborhood = sample( neighborhoods$neighborhood, n, replace = TRUE, prob = c(0.35, 0.30, 0.20, 0.15) ) ) |> left_join(neighborhoods, by = "neighborhood") |> mutate( hour = sample(0:23, n, replace = TRUE), is_peak = as.integer(hour %in% c(7, 8, 9, 17, 18, 19)), peak_penalty = is_peak * (10 - driver_supply / 10), wait_time = pmax(base_wait + peak_penalty + rnorm(n, 0, 1.5), 0.5) ) ``` --- # Wait Time Distributions <img src="slides_files/figure-html/density-plot-1.png" style="display: block; margin: auto;" /> Downtown and Midtown cluster around 3–4 min. Eastside and Southside: 7–9 min.
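---

# Sketch: A Per-Group Error Audit

The three questions from "What to Always Ask" can be wrapped in one small helper so they get asked every time. A minimal base-R sketch; `audit_by_group` is a hypothetical name, not part of the course code:

```r
# Hypothetical helper: overall loss, per-group loss, and the
# worst-to-best MSE ratio across groups
audit_by_group <- function(actual, predicted, group) {
  sq_err <- (actual - predicted)^2
  per_group <- tapply(sq_err, group, mean)       # Q2: loss per group
  list(
    overall   = mean(sq_err),                    # Q1: loss overall
    per_group = per_group,
    disparity = max(per_group) / min(per_group)  # Q3: are errors equal?
  )
}
```

Once the test set and fitted model exist (Part 3 below), this would be called as `audit_by_group(test$wait_time, predict(best_fit, test), test$neighborhood)`.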
--- # Summary by Neighborhood ```r rides |> group_by(neighborhood) |> summarise( n_rides = n(), avg_wait = round(mean(wait_time), 1), pct_minority = first(pct_minority), median_income = scales::dollar(first(median_income)), .groups = "drop" ) |> arrange(avg_wait) |> knitr::kable() ``` |neighborhood | n_rides| avg_wait| pct_minority|median_income | |:------------|-------:|--------:|------------:|:-------------| |Downtown | 723| 4.3| 0.20|$95,000 | |Midtown | 578| 5.6| 0.35|$72,000 | |Eastside | 403| 8.9| 0.70|$38,000 | |Southside | 296| 11.3| 0.85|$29,000 | --- # Part 2: Train / Val / Test Split ```r set.seed(123) train_idx <- sample(1:n, 0.6 * n) val_idx <- sample(setdiff(1:n, train_idx), 0.2 * n) test_idx <- setdiff(1:n, c(train_idx, val_idx)) train <- rides[train_idx, ] val <- rides[val_idx, ] test <- rides[test_idx, ] ``` --- # Part 2: Fit Polynomials Fit polynomials of degree 1 through 15 on `hour`: ```r degrees <- 1:15 results <- map_dfr(degrees, function(d) { fit <- lm(wait_time ~ poly(hour, d), data = train) tibble( degree = d, train_mse = mean((train$wait_time - predict(fit, train))^2), val_mse = mean((val$wait_time - predict(fit, val))^2) ) }) ``` --- # The Classic Tradeoff Curve <img src="slides_files/figure-html/bv-plot-1.png" style="display: block; margin: auto;" /> --- # Part 3: Disaggregate by Group Fit the best model and check errors **per neighborhood**: ```r best_degree <- results$degree[which.min(results$val_mse)] best_fit <- lm(wait_time ~ poly(hour, best_degree), data = train) test |> mutate(predicted = predict(best_fit, test)) |> group_by(neighborhood) |> summarise( n = n(), avg_actual = round(mean(wait_time), 1), avg_predicted = round(mean(predicted), 1), mse = round(mean((wait_time - predicted)^2), 1), pct_minority = first(pct_minority), .groups = "drop" ) |> arrange(pct_minority) |> knitr::kable() ``` |neighborhood | n| avg_actual| avg_predicted| mse| pct_minority| |:------------|---:|----------:|-------------:|----:|------------:| |Downtown 
| 131| 4.2| 6.6| 9.7| 0.20| |Midtown | 115| 5.6| 6.7| 4.3| 0.35| |Eastside | 81| 9.9| 7.1| 12.7| 0.70| |Southside | 73| 11.6| 6.8| 26.8| 0.85| --- # Error Concentrates in Minority Areas <img src="slides_files/figure-html/mse-bar-1.png" style="display: block; margin: auto;" /> --- # Part 4: Add Neighborhood as a Feature .small[ ```r better_fit <- lm(wait_time ~ poly(hour, best_degree) + neighborhood, data = train) test |> mutate(pred_v1 = predict(best_fit, test), pred_v2 = predict(better_fit, test)) |> group_by(neighborhood) |> summarise( avg_actual = round(mean(wait_time), 1), avg_pred_v1 = round(mean(pred_v1), 1), avg_pred_v2 = round(mean(pred_v2), 1), mse_v1 = round(mean((wait_time - pred_v1)^2), 1), mse_v2 = round(mean((wait_time - pred_v2)^2), 1), pct_minority = first(pct_minority), .groups = "drop" ) |> arrange(pct_minority) |> knitr::kable() ``` |neighborhood | avg_actual| avg_pred_v1| avg_pred_v2| mse_v1| mse_v2| pct_minority| |:------------|----------:|-----------:|-----------:|------:|------:|------------:| |Downtown | 4.2| 6.6| 4.2| 9.7| 4.0| 0.20| |Midtown | 5.6| 6.7| 5.7| 4.3| 2.9| 0.35| |Eastside | 9.9| 7.1| 9.4| 12.7| 4.6| 0.70| |Southside | 11.6| 6.8| 11.7| 26.8| 3.5| 0.85| ] Model v2 is **more accurate** — but it has **learned the disparity**. --- class: inverse # The Key Questions <br> ### 1. Should a model predict what **IS** or what **SHOULD BE**? -- <br> ### 2. If we use these predictions for driver dispatch, do we perpetuate the very inequality we measured? -- <br> ### 3. Is a model that ignores neighborhood "fairer" even though it's less accurate? 
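---

# Putting Numbers on Question 3

Question 3's accuracy cost can be read straight off the Part 4 table. A quick sketch that plugs in the per-neighborhood test MSEs from that table and the test counts from Part 3 (131 + 115 + 81 + 73 rides):

```r
n      <- c(Downtown = 131, Midtown = 115, Eastside = 81, Southside = 73)
mse_v1 <- c(9.7, 4.3, 12.7, 26.8)  # hour-only model
mse_v2 <- c(4.0, 2.9, 4.6, 3.5)    # hour + neighborhood model

# Overall test MSE (weighted by group size) and worst-group MSE
round(c(overall_v1 = weighted.mean(mse_v1, n),
        overall_v2 = weighted.mean(mse_v2, n),
        worst_v1   = max(mse_v1),
        worst_v2   = max(mse_v2)), 1)
```

On these numbers v2 improves both the overall and the worst-group error; the real tension is not in the aggregate accuracy but in what dispatching on a model that has learned the disparity does downstream.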
--- # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td>The Learning Problem <i>(just finished)</i></td><td>✓ done</td></tr> <tr><td><b>2</b></td><td><b><a href="../module-02/slides.html">Linear Models →</a></b></td><td>next</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Model Evaluation & Selection</a></td><td>✓ done</td></tr> <tr><td>4</td><td>Tree-Based Methods</td><td>upcoming</td></tr> <tr><td>5</td><td>Unsupervised Learning</td><td>upcoming</td></tr> <tr><td>6</td><td>Neural Networks Foundations</td><td>upcoming</td></tr> <tr><td>7</td><td><a href="../module-07/slides.html">Fairness Frameworks & Metrics</a></td><td>✓ done</td></tr> <tr><td>8</td><td>Auditing & Interpretability</td><td>upcoming</td></tr> </table>