class: center, middle, inverse, title-slide

.title[
# Module 2: Linear Models
]
.subtitle[
## Driver Acceptance & Proxy Discrimination
]

---

<style type="text/css">
.remark-code, .remark-inline-code { font-size: 80%; }
.remark-slide-content { padding: 1em 2em; }
.small { font-size: 80%; }
</style>

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">The Learning Problem</a></td><td>✓ done</td></tr>
<tr><td><b>2</b></td><td><b>Linear Models</b> <i>(you are here)</i></td><td>← current</td></tr>
<tr><td>3</td><td><a href="../module-03/slides.html">Model Evaluation & Selection</a></td><td>✓ done</td></tr>
<tr><td>4</td><td>Tree-Based Methods</td><td>upcoming</td></tr>
<tr><td>5</td><td>Unsupervised Learning</td><td>upcoming</td></tr>
<tr><td>6</td><td>Neural Networks Foundations</td><td>upcoming</td></tr>
<tr><td>7</td><td><a href="../module-07/slides.html">Fairness Frameworks & Metrics</a></td><td>✓ done</td></tr>
<tr><td>8</td><td>Auditing & Interpretability</td><td>upcoming</td></tr>
</table>

---

# Linear Regression

**Example:** predict the **trip duration** for an Uber ride at request time (before the ride starts).

`$$\underbrace{\text{duration}}_{Y} = \beta_0 + \beta_1 \cdot \text{distance} + \beta_2 \cdot \text{hour} + \beta_3 \cdot \text{is\_rush\_hour} + \varepsilon$$`

--

Each `\(\beta_j\)` tells you how much duration changes per unit of that feature, holding the others fixed. E.g., `\(\beta_1 = 2.5\)` means **2.5 extra minutes per mile**.

--

Fit by minimizing the **residual sum of squares** (the sum of squared gaps between actual and predicted durations):

`$$\hat{\beta} = \arg\min_\beta \sum_i \left( y_i - \hat{y}_i \right)^2$$`

--

Closed form: `\(\hat{\beta} = (X^\top X)^{-1} X^\top Y\)`. No iteration needed.

--

Evaluate with **MSE** (or RMSE / MAE) — *not* accuracy. "Accuracy" only makes sense when there are discrete classes to be right or wrong about; for continuous `\(Y\)` you measure how *far off* you are.

---

# Logistic Regression

For binary `Y ∈ {0, 1}`, model the **log-odds** as linear:

`$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$`

--

Equivalently: `\(p = \sigma(\beta_0 + \sum_j \beta_j X_j)\)` where `\(\sigma\)` is the sigmoid.

--

Fit by **maximum likelihood** (no closed form — iterative).

--

Coefficients are **log-odds ratios**: a one-unit increase in `\(X_j\)` multiplies the odds of `\(Y=1\)` by `\(\exp(\beta_j)\)`.

--

To turn `\(p\)` into a 0/1 prediction, threshold at 0.5 (or wherever you choose). The simplest scalar to evaluate the result is **accuracy**:

`$$\text{accuracy} = \frac{\#\text{correct predictions}}{\#\text{total predictions}}$$`

The same formula works for any **classification** task — binary or multi-class — since "correct" is a yes/no question. (Module 3 will explain why accuracy alone is rarely enough.)

---

# Quick Refresher: Maximum Likelihood

**Idea:** pick the parameters that make the **observed data most probable**.

--

For each ride request `\(i\)`, the model says `\(P(Y_i = 1) = p_i(\beta)\)`. The probability of seeing exactly the labels we observed is:

`$$L(\beta) = \prod_i p_i^{y_i} (1 - p_i)^{1 - y_i}$$`

--

We maximize this — or equivalently, the **log-likelihood** (sums are easier than products):

`$$\ell(\beta) = \sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$`

--

This is the **negative cross-entropy loss** from Module 1 — the same thing. Minimizing the loss = maximizing the likelihood.
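--

A quick sanity check in R — a minimal sketch, assuming a fitted binomial `glm` called `fit` and a 0/1 response vector `y` (neither object is defined elsewhere in this deck):

```r
# Hand-compute the log-likelihood from the fitted probabilities
p  <- fitted(fit)                               # p_i under the fitted beta-hat
ll <- sum(y * log(p) + (1 - y) * log(1 - p))    # the formula above
all.equal(ll, as.numeric(logLik(fit)))          # agrees with glm's own logLik
```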
--

No closed form for `\(\hat{\beta}\)` → use iterative optimization (Newton-Raphson, gradient descent). `glm()` does it for you.

---

# Why Regularize?

**Regularization** = adding a penalty on the size of the coefficients to the loss function, to prevent the model from fitting noise.

--

Plain regression (no penalty) has problems when:

--

- `\(p\)` (number of features) is large relative to `\(n\)` (number of observations) → overfitting

--

- Features are correlated → unstable coefficients (multicollinearity)

--

- You want **automatic feature selection**

--

**Solution:** instead of just minimizing the loss, minimize **loss + penalty on `\(\beta\)`**. The penalty discourages large coefficients, shrinking the model toward a simpler one.

---

# Ridge (L2)

Recall the **residual sum of squares**: `\(\text{RSS} = \sum_i (y_i - \hat{y}_i)^2\)`

`$$\hat{\beta}_{\text{ridge}} = \arg\min_\beta \left[ \text{RSS} + \lambda \sum_j \beta_j^2 \right]$$`

--

- Shrinks coefficients toward zero, but **never exactly zero**
- Handles multicollinearity well
- All features stay in the model

---

# Ridge: How `\(\hat{\beta}\)` Changes with `\(\lambda\)`

Same data, four values of `\(\lambda\)`. Watch the coefficients shrink:

|                     |  λ = 0 |  λ = 1 | λ = 10 | λ = 100 |
|:--------------------|-------:|-------:|-------:|--------:|
|β₁ (x1)              |  2.040 |  2.008 |  1.801 |   1.064 |
|β₂ (x2, correlated)  |  1.386 |  1.396 |  1.406 |   0.954 |
|β₃ (x3)              | -0.694 | -0.687 | -0.630 |  -0.349 |

--

- `\(\lambda = 0\)`: ordinary least squares — large, possibly unstable coefficients
- `\(\lambda \to \infty\)`: all coefficients shrink toward 0 (but never exactly 0)
- The penalty trades a bit of bias for a lot less variance

---

# Lasso (L1)

`$$\hat{\beta}_{\text{lasso}} = \arg\min_\beta \left[ \text{RSS} + \lambda \sum_j |\beta_j| \right]$$`

--

- Shrinks **and** sets some coefficients exactly to zero
- Performs **feature selection** automatically
- Picks one of a group of correlated features arbitrarily

---

# Lasso: How `\(\hat{\beta}\)` Changes with `\(\lambda\)`

Same data as the Ridge slide. Watch coefficients hit **exactly zero**:

|                     | λ = 0.01 | λ = 0.1 | λ = 0.3 | λ = 1 |
|:--------------------|---------:|--------:|--------:|------:|
|β₁ (x1)              |    2.036 |   1.979 |   1.853 | 1.442 |
|β₂ (x2, correlated)  |    1.377 |   1.306 |   1.148 | 0.568 |
|β₃ (x3)              |   -0.685 |  -0.599 |  -0.409 | 0.000 |

--

- Compared to Ridge, Lasso pushes some coefficients **all the way to 0**
- At `\(\lambda = 1\)`, x3 has been dropped entirely — automatic feature selection
- Of the correlated pair (x1, x2), one keeps most of the weight while the other is pushed toward 0 — Lasso's choice between them is somewhat arbitrary

---

# Elastic Net

Mixes L1 and L2:

`$$\hat{\beta}_{\text{enet}} = \arg\min_\beta \left[ \text{RSS} + \lambda \left( \alpha \sum_j |\beta_j| + (1-\alpha) \sum_j \beta_j^2 \right) \right]$$`

--

- `\(\alpha = 1 \Rightarrow\)` Lasso, `\(\alpha = 0 \Rightarrow\)` Ridge
- Best of both: groups correlated features (Ridge) **and** still selects (Lasso)

--

Choose `\(\lambda\)` via cross-validation (`glmnet::cv.glmnet()`).

---

# Elastic Net: Same Data, `\(\alpha = 0.5\)`

|                     | λ = 0.01 | λ = 0.1 | λ = 0.3 |  λ = 1 |
|:--------------------|---------:|--------:|--------:|-------:|
|β₁ (x1)              |    2.033 |   1.959 |   1.813 |  1.406 |
|β₂ (x2, correlated)  |    1.383 |   1.361 |   1.300 |  1.046 |
|β₃ (x3)              |   -0.688 |  -0.636 |  -0.525 | -0.185 |

--

- Behavior sits **between Ridge and Lasso**: some shrinkage, some sparsity
- The correlated features (x1, x2) tend to **stay together** instead of one being arbitrarily dropped
- `\(\alpha\)` controls how Lasso-like vs Ridge-like the result is
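---

# The Same Penalties in `glmnet`

A minimal sketch of how the three penalties map onto `glmnet::cv.glmnet()` — illustrative only, assuming a numeric predictor matrix `x` and response `y` (not the objects behind the tables above):

```r
library(glmnet)

# alpha selects the penalty: 0 = Ridge, 1 = Lasso, anything between = Elastic Net
cv_ridge <- cv.glmnet(x, y, alpha = 0)
cv_lasso <- cv.glmnet(x, y, alpha = 1)
cv_enet  <- cv.glmnet(x, y, alpha = 0.5)

# lambda is chosen by cross-validation; inspect coefficients at the chosen value
coef(cv_lasso, s = "lambda.min")
```

`cv.glmnet()` picks `\(\lambda\)` for you; `\(\alpha\)` you still choose (or tune) yourself.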
---

# How Are the `\(\hat{\beta}\)` Computed?

| Method | Algorithm |
|--------|-----------|
| **OLS** | Closed form: `\(\hat{\beta} = (X^\top X)^{-1} X^\top y\)` |
| **Ridge** | Closed form: `\(\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y\)` |
| **Lasso / Elastic Net** | **Iterative optimization** (no closed form — the L1 penalty is not differentiable at 0) |

--

For Lasso, start at `\(\beta = 0\)` and update each step using the **gradient of the loss**:

`$$\beta_{\text{new}} = \beta_{\text{old}} - \eta \cdot \underbrace{\frac{1}{n} X^\top (X\beta_{\text{old}} - y)}_{\text{gradient of } \frac{1}{2n}\,\text{RSS}}$$`

then **soft-threshold**: shrink every coefficient's magnitude by `\(\eta\lambda\)`, setting any coefficient whose magnitude is below `\(\eta\lambda\)` exactly to 0 (this is what the L1 penalty does).

--

`\(\eta\)` is a small **step size**. Repeat until coefficients stop changing. (A code sketch of this loop follows the next slide.)

---

# Lasso Iterates: Same Data as Before

|                     | iter 1 | iter 5 | iter 20 | iter 200 |
|:--------------------|-------:|-------:|--------:|---------:|
|β₁ (x1)              |  0.112 |  0.486 |   1.228 |    1.767 |
|β₂ (x2, correlated)  |  0.103 |  0.444 |   1.097 |    1.238 |
|β₃ (x3)              | -0.020 | -0.092 |  -0.261 |   -0.407 |

--

- **iter 1**: barely moved from zero
- **iter 5–20**: rapidly approaching the solution
- **iter 200**: converged — these are the final `\(\hat{\beta}\)`
- Each step is cheap; after enough steps, you get the same answer as `glmnet`
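---

# The Update Loop in Code

A minimal sketch of the update rule two slides back (gradient step + soft-threshold, i.e. proximal gradient descent), assuming a predictor matrix `X` and response `y`. `glmnet` itself uses coordinate descent, so treat this as illustration, not its actual implementation:

```r
lasso_gradient_steps <- function(X, y, lambda, eta = 0.01, n_iter = 200) {
  # eta must be small enough for the steps to converge
  n    <- nrow(X)
  beta <- rep(0, ncol(X))                              # start at zero
  for (t in seq_len(n_iter)) {
    grad <- drop(crossprod(X, X %*% beta - y)) / n     # (1/n) X'(X beta - y)
    z    <- beta - eta * grad                          # plain gradient step
    beta <- sign(z) * pmax(abs(z) - eta * lambda, 0)   # soft-threshold toward 0
  }
  beta
}
```

Every iteration is one matrix multiply plus a thresholding pass; the table on the previous slide shows how such iterates settle down.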
---

class: inverse, center, middle

# Discrimination Angle

### Removing Race ≠ Race-Blind

---

# The Proxy Problem

Suppose you build a driver-acceptance model and decide **not** to use race as a feature.

--

Problem solved? **No.**

--

The model will happily learn from **proxy variables** that correlate with race:

--

- **Pickup neighborhood** — strongly correlated with racial demographics
- **Zip code** — same problem
- **Phone area code** — geographic, hence demographic
- **Trip distance + pickup time** — encodes "lives in poor area, works late"

--

A linear model assigns coefficients to these proxies that effectively **recreate racial discrimination**, even though "race" never appears in the data.

---

# Why Regularization Doesn't Save You

**Lasso** selects features. If neighborhood is the strongest predictor, it gets selected.

--

The model is sparse and interpretable — but it explicitly bases decisions on a racially correlated feature.

--

**Ridge** keeps all features but shrinks them. The discrimination gets spread across many proxies, which makes it **harder to detect** by inspection.

--

**Regularization is about overfitting — NOT about fairness.**

---

class: inverse, center, middle

# Exercise

### Driver Acceptance in 4 Neighborhoods

---

# Part 1: Simulate the Data

```r
library(tidyverse)

set.seed(2024); n <- 5000

neighborhoods <- tibble(
  neighborhood = c("Downtown", "Midtown", "Eastside", "Southside"),
  base_accept  = c(0.85, 0.78, 0.55, 0.40),
  pct_minority = c(0.20, 0.35, 0.70, 0.85)
)

requests <- tibble(
  request_id   = 1:n,
  neighborhood = sample(neighborhoods$neighborhood, n, replace = TRUE,
                        prob = c(0.35, 0.30, 0.20, 0.15))
) |>
  left_join(neighborhoods, by = "neighborhood") |>
  mutate(
    hour             = sample(0:23, n, replace = TRUE),
    is_night         = as.integer(hour >= 22 | hour <= 5),
    trip_distance_mi = pmax(rlnorm(n, 1.2, 0.6), 0.3),
    rider_rating     = pmin(pmax(rnorm(n, 4.7, 0.4), 1), 5)
  )
```

---

# Part 1: Generate Acceptance + Demographics

```r
requests <- requests |>
  mutate(
    logit_p = qlogis(base_accept) - 0.6 * is_night -
      0.15 * abs(trip_distance_mi - 4) + 0.5 * (rider_rating - 4.7),
    accepted = rbinom(n, 1, plogis(logit_p)),
    # Race is a HIDDEN variable used only for auditing
    is_minority = rbinom(n, 1, pct_minority)
  )
```

--

| is_minority|    n| accept_rate|
|-----------:|----:|-----------:|
|           0| 2738|       0.683|
|           1| 2262|       0.485|

The data already shows disparate acceptance rates across groups.

---

# Part 2: Race-Blind Logistic Regression

```r
set.seed(123)
train_idx <- sample(1:n, 0.7 * n)
train <- requests[train_idx, ]; test <- requests[-train_idx, ]

# Note: is_minority is NOT in the formula
fit_logit <- glm(
  accepted ~ neighborhood + is_night + trip_distance_mi + rider_rating,
  data = train, family = binomial
)
```

---

# Coefficients

|term                  | estimate| std.error| statistic| p.value|
|:---------------------|--------:|---------:|---------:|-------:|
|(Intercept)           |   -0.043|     0.533|    -0.081|   0.936|
|neighborhoodEastside  |   -1.555|     0.104|   -14.925|   0.000|
|neighborhoodMidtown   |   -0.639|     0.095|    -6.714|   0.000|
|neighborhoodSouthside |   -2.198|     0.122|   -18.019|   0.000|
|is_night              |   -0.616|     0.079|    -7.850|   0.000|
|trip_distance_mi      |   -0.047|     0.014|    -3.451|   0.001|
|rider_rating          |    0.367|     0.113|     3.238|   0.001|

--

Look at the **neighborhood** coefficients: large negative log-odds for Eastside and Southside. The model has learned "this area = lower acceptance" — a proxy for race.

---

# Part 3: Audit the Race-Blind Model

```r
test$pred_prob <- predict(fit_logit, test, type = "response")
```

| is_minority|   n| actual_accept| predicted_accept|
|-----------:|---:|-------------:|----------------:|
|           0| 816|         0.663|            0.678|
|           1| 684|         0.503|            0.494|

--

Race never entered the model, yet the **predicted acceptance rate for minority riders is much lower**. The neighborhood feature did the discriminating.

---

# Part 4: Try Lasso

```r
library(glmnet)

x_train <- model.matrix(
  accepted ~ neighborhood + is_night + trip_distance_mi + rider_rating,
  data = train)[, -1]
y_train <- train$accepted

set.seed(456)
cv_lasso <- cv.glmnet(x_train, y_train, family = "binomial",
                      alpha = 1, nfolds = 10)
```

|                      | Coefficient|
|:---------------------|-----------:|
|(Intercept)           |       0.396|
|neighborhoodEastside  |      -1.188|
|neighborhoodMidtown   |      -0.309|
|neighborhoodSouthside |      -1.788|
|is_night              |      -0.465|
|trip_distance_mi      |      -0.024|
|rider_rating          |       0.187|
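---

# Part 4: Inspecting the Lasso Fit

One way to poke at the fitted object from here — a sketch, not code from the original exercise:

```r
coef(cv_lasso, s = "lambda.min")   # lambda with the lowest CV deviance
coef(cv_lasso, s = "lambda.1se")   # sparser fit: largest lambda within 1 SE

# coefficient paths across the whole lambda sequence
plot(cv_lasso$glmnet.fit, xvar = "lambda", label = TRUE)
```

The next slide shows these paths as a figure.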
---

# Lasso Coefficient Paths

<img src="slides_files/figure-html/lasso-path-1.png" style="display: block; margin: auto;" />

.small[
- **Why log(λ)?** `cv.glmnet` sweeps λ over many orders of magnitude (e.g. 0.0001 → 1). On a linear axis nearly everything happens crammed near zero — the log scale spreads it out evenly.
- **Dashed vertical line:** `lambda.1se` — the largest λ whose CV error is within 1 standard error of the minimum. It gives a slightly sparser model than `lambda.min` and is usually preferred.
]

Even at strong regularization, neighborhood survives — it's the strongest predictor.

---

# Part 5: Compare All Models

| is_minority| plain_logit| lasso| ridge| elastic_net|
|-----------:|-----------:|-----:|-----:|-----------:|
|           0|       0.678| 0.666| 0.660|       0.657|
|           1|       0.494| 0.507| 0.516|       0.518|

--

All four models predict roughly the same rates per group. **Regularization doesn't fix discrimination** — the proxies are still doing the work.

---

# Part 6: What If We Drop Neighborhood?

```r
fit_no_nbhd <- glm(
  accepted ~ is_night + trip_distance_mi + rider_rating,
  data = train, family = binomial
)
test$pred_no_nbhd <- predict(fit_no_nbhd, test, type = "response")
```

| is_minority| actual_accept| predicted_accept|
|-----------:|-------------:|----------------:|
|           0|         0.663|            0.598|
|           1|         0.503|            0.597|

--

The disparate impact shrinks dramatically. But so does accuracy.

---

# Visualizing the Tradeoff

<img src="slides_files/figure-html/tradeoff-plot-1.png" style="display: block; margin: auto;" />

.small[**Dollar cost.** Dropping `neighborhood` costs ~8.3 pp of accuracy. At 25M daily requests × $5 platform revenue per ride, that's ~2,066,667 wrong decisions per day, or **~$3.8B/year** in lost revenue. This is why platforms resist "fairness through unawareness" — the proxies are *worth real money*.]

---

class: inverse

# The Key Questions

<br>

### 1. Is removing the protected attribute enough? (No.)

--

<br>

### 2. Does sparsity make a model fair? (No — it just makes the discrimination visible.)

--

<br>

### 3. What does it mean for a feature to be a "proxy"? Where do you draw the line?

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">The Learning Problem</a></td><td>✓ done</td></tr>
<tr><td>2</td><td>Linear Models <i>(just finished)</i></td><td>✓ done</td></tr>
<tr><td><b>3</b></td><td><b><a href="../module-03/slides.html">Model Evaluation & Selection →</a></b></td><td>next</td></tr>
<tr><td>4</td><td>Tree-Based Methods</td><td>upcoming</td></tr>
<tr><td>5</td><td>Unsupervised Learning</td><td>upcoming</td></tr>
<tr><td>6</td><td>Neural Networks Foundations</td><td>upcoming</td></tr>
<tr><td>7</td><td><a href="../module-07/slides.html">Fairness Frameworks & Metrics</a></td><td>✓ done</td></tr>
<tr><td>8</td><td>Auditing & Interpretability</td><td>upcoming</td></tr>
</table>

Say **"start module 3"** when ready.