class: center, middle, inverse, title-slide .title[ # Module 3: Model Evaluation & Selection ] .subtitle[ ## Auditing Driver Acceptance for Unequal Error Rates ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .small { font-size: 80%; } .two-col { display: flex; align-items: flex-start; gap: 1em; } .col-narrow { flex: 1; } .col-wide { flex: 2; } .big-table table { font-size: 150%; width: 65%; margin: 0.8em auto; } .big-table table th, .big-table table td { padding: 0.5em 1em; text-align: center; } .cm-corner { position: absolute; top: 90px; right: 40px; border-collapse: collapse; font-size: 90%; } .cm-corner th, .cm-corner td { border: 1px solid #888; padding: 8px 14px; text-align: center; font-weight: bold; } .cm-tp { background: #c8e6c9; color: #1b5e20; } .cm-fn { background: #ffe0b2; color: #e65100; } .cm-fp { background: #ffcdd2; color: #b71c1c; } .cm-tn { background: #bbdefb; color: #0d47a1; } .tp { color: #1b5e20; font-weight: bold; } .fn { color: #e65100; font-weight: bold; } .fp { color: #b71c1c; font-weight: bold; } .tn { color: #0d47a1; font-weight: bold; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">The Learning Problem</a></td><td>✓ done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">Linear Models</a></td><td>✓ done</td></tr> <tr><td><b>3</b></td><td><b>Model Evaluation & Selection</b> <i>(you are here)</i></td><td>← current</td></tr> <tr><td>4</td><td><a href="../module-04/slides.html">Tree-Based Methods</a></td><td>✓ done</td></tr> <tr><td>5</td><td>Unsupervised Learning</td><td>upcoming</td></tr> <tr><td>6</td><td>Neural Networks Foundations</td><td>upcoming</td></tr> <tr><td>7</td><td><a href="../module-07/slides.html">Fairness Frameworks & Metrics</a></td><td>✓ done</td></tr> <tr><td>8</td><td>Auditing & Interpretability</td><td>upcoming</td></tr> </table> --- # One Number Is Not Enough In Module 1 we used **MSE**. In Module 2 we used **accuracy**. Both are scalars — they hide *where* and *for whom* the model is wrong. -- **Quick example**: if only 1% of events are positive, "always predict negative" gets **99% accuracy** — and tells you nothing. We saw this hint in Module 2: every model had ~75% accuracy yet predicted very different acceptance rates per group. We need finer metrics to see *why*. -- For classification, look at the **confusion matrix**... --- background-image: url(figs/confusion_matrix.png) background-size: contain --- # The Confusion Matrix .big-table[ | | Predicted = 1 | Predicted = 0 | |--------------|---------------|---------------| | **Actual = 1** | TP | FN | | **Actual = 0** | FP | TN | ] --- # Metrics Derived from the Confusion Matrix <table class="cm-corner"> <tr><th></th><th>Pred = 1</th><th>Pred = 0</th></tr> <tr><th>Actual = 1</th><td class="cm-tp">TP</td><td class="cm-fn">FN</td></tr> <tr><th>Actual = 0</th><td class="cm-fp">FP</td><td class="cm-tn">TN</td></tr> </table> - **Accuracy** = `\(\dfrac{\color{#1b5e20}{TP} + \color{#0d47a1}{TN}}{\color{#1b5e20}{TP} + \color{#e65100}{FN} + \color{#b71c1c}{FP} + \color{#0d47a1}{TN}}\)` — overall fraction correct -- - **Precision** = `\(\dfrac{\color{#1b5e20}{TP}}{\color{#1b5e20}{TP} + \color{#b71c1c}{FP}}\)` — of what I predicted positive, how many were real? 
-- - **True positive rate (recall)** = `\(\dfrac{\color{#1b5e20}{TP}}{\color{#1b5e20}{TP} + \color{#e65100}{FN}}\)` — of the real positives, how many did I predict positive? -- - **False positive rate** = `\(\dfrac{\color{#b71c1c}{FP}}{\color{#b71c1c}{FP} + \color{#0d47a1}{TN}}\)` — of the real negatives, how many did I predict positive? -- - **F1** = `\(2 \cdot \dfrac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)` — harmonic mean (penalizes lopsided models) -- **Worked example.** Driver-acceptance model on 1000 ride requests (positive = "accepted"): TP = 480, FN = 220, FP = 130, TN = 170. `$$\text{accuracy} = \tfrac{480 + 170}{1000} = 0.65 \quad \text{precision} = \tfrac{480}{480 + 130} \approx 0.79 \quad \text{recall} = \tfrac{480}{480 + 220} \approx 0.69 \quad \text{FPR} = \tfrac{130}{300} \approx 0.43$$` --- # Which Metric When? *"Classes" = the possible values of `\(Y\)` (e.g., accepted vs not accepted).* - **Imbalanced classes** → recall + precision - At 4 AM only 5% of requests get accepted. "Always predict reject" → accuracy = 0.95, recall = 0, precision undefined (0/0). Useless. - **The probability value itself matters** → log loss - Dispatch sends a driver only if `\(P(\text{accept}) > 0.6\)`. If the model says 0.7 but the true rate is 0.4, the rule misfires — even if the model gets the *ordering* of requests right. - **Cost-sensitive** → weighted loss - Predicting "accept" when the driver actually rejects → wasted dispatch (~$1). Predicting "reject" when they would have accepted → lost ride (~$5). False negatives cost **5× more**. --- # ROC Curves and AUC **ROC** = *Receiver Operating Characteristic* (from WWII radar). **AUC** = *Area Under the Curve*. A classifier outputs a probability. Pick a **threshold** to turn it into 0/1; sweeping the threshold traces the ROC curve. AUC = area under it. - **AUC = 0.5** → random; **AUC = 1.0** → perfect ranking - Equivalent reading: probability that a random positive scores higher than a random negative <img src="slides_files/figure-html/roc-demo-1.png" style="display: block; margin: auto;" /> --- # Same Story, with Threshold on the X-Axis <img src="slides_files/figure-html/threshold-view-1.png" style="display: block; margin: auto;" /> For each classifier, **two curves** of TPR (green) and FPR (red) vs threshold. The ROC curve from the previous slide is built by plotting these two curves *against each other*, eliminating the threshold from the picture. --- # k-Fold Cross-Validation CV doesn't estimate model **parameters** ($\beta$ etc. — those are fit inside each fold). It estimates the model's **generalization error** by averaging over many train/validation splits. -- 1. Split training data into `\(k\)` folds 2. For each fold: train on the other `\(k-1\)`, predict on the held-out one 3. 
Average the metric across folds

--

Then use the CV error to pick a **hyperparameter** (like Ridge's `\(\lambda\)`, polynomial degree, tree depth):

```
for each candidate λ:
  for each of the k folds:
    fit model with this λ on the other k-1 folds
    record error on held-out fold
  average the k errors
pick the λ with the lowest average CV error
```

- `\(k = 5\)` or `\(k = 10\)` are standard
- Module 1 used a *single* validation split for polynomial degree; CV is the more robust version

---

# A Perfect AUC Can Still Be Useless

Consider a model that outputs these scores on 5 ride requests (actual label in parens):

| score | label    |
|-------|----------|
| 0.21  | accepted |
| 0.20  | accepted |
| 0.19  | accepted |
| 0.18  | rejected |
| 0.17  | rejected |

--

Every accepted request scored higher than every rejected one → **AUC = 1.0** (perfect ranking).

--

But the predicted probabilities sit around **0.2**, even though the actual acceptance rate is 60%. If your dispatch rule is "send a driver when `\(P(\text{accept}) > 0.5\)`", **you'd dispatch zero drivers** — despite the model perfectly knowing which requests would be accepted.

--

AUC only cares about **order**. It doesn't care that the *numbers are wrong*. We need a different concept for that → **calibration**.

---

# Calibration

A classifier is **well-calibrated** if, **for every** `\(x \in [0, 1]\)`, among ride requests where the model predicted `\(P(\text{accept}) \approx x\)`, the fraction actually accepted is also `\(\approx x\)`.

`$$P(Y = 1 \mid \hat{p}(X) = x) = x \quad \text{for all } x$$`

--

**Worked example at `\(x = 0.8\)`.** Take all ride requests where the model predicted `\(\hat{p} \in [0.75, 0.85]\)` — say there are 200 such requests. Count the actual acceptances:

- ≈ 160 accepted → ratio ≈ 0.80 → **well-calibrated** ✓
- only ≈ 100 accepted → ratio ≈ 0.50 → **overconfident** (model says 0.8, reality is 0.5)
- ≈ 190 accepted → ratio ≈ 0.95 → **underconfident**

For the model to be calibrated *overall*, the same property must hold at every `\(x\)` — that's what the reliability diagram below checks.

--

- Check it with a **reliability diagram**: bin predicted probabilities, plot bin mean vs observed rate. Diagonal = perfect.
- Fix it with **Platt scaling** (logistic regression on the scores) or **isotonic regression** after the fact.

---

# Reliability Diagrams: Three Examples

<div class="two-col">
<div class="col-narrow">
<ul>
<li><b style="color:darkgreen">Green</b> lies on the diagonal → predicted = observed at every <i>x</i> → <b>calibrated</b></li>
<li><b style="color:firebrick">Red</b> is <i>below</i> the diagonal → predicts higher than reality → <b>overconfident</b></li>
<li><b style="color:steelblue">Blue</b> is <i>above</i> the diagonal → predicts lower than reality → <b>underconfident</b></li>
</ul>
</div>
<div class="col-wide">
<img src="slides_files/figure-html/reliability-diagrams-1.png" style="display: block; margin: auto;" />
</div>
</div>

---

# Fixing Calibration After the Fact

Both methods are **post-fit recipes**: don't retrain the original model — train a tiny second model that maps `old score → calibrated probability`, using a held-out calibration set.

--

**Platt scaling** — assume the distortion is a smooth sigmoid:

1. Take held-out scores `\(s\)` and labels `\(y\)`
2. Fit a 1-D logistic regression `\(y \sim s\)` → get `\(a, b\)`
3. Replace each score with `\(\hat p_{\text{new}} = \sigma(a \cdot s + b)\)`

--

**Isotonic regression** — let the data choose the shape, only require it to be monotonic:
1. Same held-out `\((s, y)\)`
2. Fit a monotonic step function `\(g(s)\)` to `\(y\)`
3. Replace each score with `\(\hat p_{\text{new}} = g(s)\)`

--

| | Platt | Isotonic |
|---|---|---|
| Shape | Sigmoid (parametric) | Any monotonic (non-parametric) |
| Data needed | A few hundred points | A few thousand points |
| Risk | Underfit if distortion isn't sigmoid | Overfit when data is small |

R: `probably::cal_estimate_logistic()` / `probably::cal_estimate_isotonic()`

---

class: inverse, center, middle

# Discrimination Angle

### Equal AUC ≠ Equal Treatment

---

# Unequal Error Rates

A classifier can have **great overall accuracy** and still be terrible for a specific group. Three distinct ways:

--

1. **Different precision** — the same flag means different things: a "suspicious" label has 90% precision for one group and 50% for another

--

2. **Different recall** (or FPR) — the model catches positives for one group but misses them for another, or flags innocent people more in one group

--

3. **Different calibration** — a "score of 0.8" actually means 60% risk for group A and 90% for group B

---

# The COMPAS Story (2016)

ProPublica audited COMPAS, a recidivism risk-scoring tool used in U.S. courts.

--

- Overall AUC was **similar** for Black and white defendants
- Among defendants who did **not** re-offend, Black defendants were classified high-risk **twice as often** (higher FPR)
- Among defendants who **did** re-offend, white defendants were classified low-risk more often (lower TPR for whites)

--

The COMPAS makers responded: the model was **calibrated equally** across groups (a score of 7 meant the same recidivism rate for both).

--

**Both sides were right.** This is the *impossibility result* we'll see in Module 7: you can't satisfy all fairness criteria simultaneously.

---

# What "Auditing" Actually Means

A real audit is just: **compute every metric you care about, broken down by demographic group.** No new statistics — just discipline.

--

In R, the entire audit is one pipe. The test set has columns `is_minority`, `truth` (actual), `pred_class` (0/1 prediction), and `pred_prob` (predicted probability):

```r
test |>
  group_by(is_minority) |>
  summarise(
    accuracy  = accuracy_vec(truth, pred_class),
    precision = precision_vec(truth, pred_class, event_level = "second"),
    recall    = recall_vec(truth, pred_class, event_level = "second"),
    auc       = roc_auc_vec(truth, pred_prob, event_level = "second")
  )
```

--

| is_minority | accuracy | precision | recall | auc |
|-------------|----------|-----------|--------|------|
| 0 | 0.78 | 0.86 | 0.85 | 0.83 |
| 1 | 0.69 | 0.61 | 0.62 | 0.81 |

--

Same model, very different per-group experience: **AUC is essentially equal**, but accuracy drops 9 points, precision falls from 0.86 → 0.61, and recall from 0.85 → 0.62.
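
---

# Extending the Audit: Error-Rate Gaps

The same pipe extends to any per-group quantity. A minimal sketch, assuming `dplyr` and `yardstick` are loaded and the same `test` columns as above (`fpr`, `flag_rate`, and `base_rate` are just illustrative column names):

```r
test |>
  group_by(is_minority) |>
  summarise(
    # FPR = FP / (FP + TN), i.e. 1 - specificity
    fpr       = 1 - spec_vec(truth, pred_class, event_level = "second"),
    # share of requests the model predicts as "accept", regardless of truth
    flag_rate = mean(pred_class == "1"),
    # actual acceptance rate in the group, for reference
    base_rate = mean(truth == "1")
  )
```

Comparing `fpr` and `flag_rate` across the two rows is the same disaggregation that drove the COMPAS story.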
--- class: inverse, center, middle # Exercise ### Auditing the Driver-Acceptance Model --- # Part 1: Reuse the Module 2 Setup ```r set.seed(2024); n <- 5000 neighborhoods <- tibble( neighborhood = c("Downtown","Midtown","Eastside","Southside"), base_accept = c(0.85, 0.78, 0.55, 0.40), pct_minority = c(0.20, 0.35, 0.70, 0.85) ) requests <- tibble( request_id = 1:n, neighborhood = sample(neighborhoods$neighborhood, n, TRUE, prob = c(0.35, 0.30, 0.20, 0.15)) ) |> left_join(neighborhoods, by = "neighborhood") |> mutate( hour = sample(0:23, n, replace = TRUE), is_night = as.integer(hour >= 22 | hour <= 5), trip_distance_mi = pmax(rlnorm(n, 1.2, 0.6), 0.3), rider_rating = pmin(pmax(rnorm(n, 4.7, 0.4), 1), 5) ) ``` --- # Part 1: Generate Acceptance + Demographics ```r requests <- requests |> mutate( logit_p = qlogis(base_accept) - 0.6 * is_night - 0.15 * abs(trip_distance_mi - 4) + 0.5 * (rider_rating - 4.7), accepted = rbinom(n, 1, plogis(logit_p)), is_minority = rbinom(n, 1, pct_minority) ) ``` | is_minority| n| accept_rate| |-----------:|----:|-----------:| | 0| 2738| 0.683| | 1| 2262| 0.485| --- # Part 2: Fit the Race-Blind Classifier ```r set.seed(123) train_idx <- sample(1:n, 0.7 * n) train <- requests[train_idx, ]; test <- requests[-train_idx, ] # is_minority is NOT in the formula (same as Module 2) fit <- glm( accepted ~ neighborhood + is_night + trip_distance_mi + rider_rating, data = train, family = binomial ) test$pred_prob <- predict(fit, test, type = "response") test$pred_class <- factor(as.integer(test$pred_prob >= 0.5), levels = c(0, 1)) test$truth <- factor(test$accepted, levels = c(0, 1)) ``` --- # Part 3: Overall Metrics .small[ |.metric |.estimator | .estimate| |:---------|:----------|---------:| |accuracy |binary | 0.692| |precision |binary | 0.707| |recall |binary | 0.817| |f_meas |binary | 0.758| |roc_auc |binary | 0.730| ] -- Looks fine. Now disaggregate. --- # Part 4: Per-Group Metrics .small[ | is_minority| n| accuracy| precision| recall| fpr| auc| |-----------:|---:|--------:|---------:|------:|-----:|-----:| | 0| 816| 0.713| 0.715| 0.943| 0.738| 0.710| | 1| 684| 0.667| 0.687| 0.619| 0.285| 0.715| ] -- AUC is similar across groups, but the **false positive rate** and **flag rate** are very different. --- # Per-Group ROC Curves <img src="slides_files/figure-html/roc-plot-1.png" style="display: block; margin: auto;" /> --- # Per-Group Calibration <img src="slides_files/figure-html/calib-plot-1.png" style="display: block; margin: auto;" /> --- # Disparate Impact at Every Threshold <img src="slides_files/figure-html/threshold-sweep-1.png" style="display: block; margin: auto;" /> --- class: inverse # The Key Questions <br> ### 1. If two models have the same AUC, are they equally fair? -- <br> ### 2. Should we pick the threshold that maximizes accuracy, or the one that minimizes the FPR gap? -- <br> ### 3. What does "calibration" buy us if it can coexist with unequal FPRs? 
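
---

# A Starting Point for Question 2

One way to make question 2 concrete is to recompute both quantities at every threshold, as in the "Disparate Impact at Every Threshold" plot. A minimal sketch, assuming `dplyr` and `purrr` are loaded and the `test` tibble from Part 2 is in scope (`sweep`, `thr`, and `fpr_gap` are illustrative names):

```r
sweep <- purrr::map_dfr(seq(0.2, 0.8, by = 0.05), function(thr) {
  test |>
    mutate(pred = as.integer(pred_prob >= thr)) |>
    summarise(
      threshold = thr,
      accuracy  = mean(pred == accepted),
      # per-group FPR: among actual rejections, share predicted "accept"
      fpr_gap   = abs(
        mean(pred[is_minority == 0 & accepted == 0]) -
        mean(pred[is_minority == 1 & accepted == 0])
      )
    )
})

# Compare the accuracy-maximizing threshold with the gap-minimizing one
sweep |> arrange(desc(accuracy)) |> head(1)
sweep |> arrange(fpr_gap) |> head(1)
```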
--- # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">The Learning Problem</a></td><td>✓ done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">Linear Models</a></td><td>✓ done</td></tr> <tr><td>3</td><td>Model Evaluation & Selection <i>(just finished)</i></td><td>✓ done</td></tr> <tr><td><b>4</b></td><td><a href="../module-04/slides.html"><b>Tree-Based Methods</b></a></td><td>next</td></tr> <tr><td>5</td><td>Unsupervised Learning</td><td>upcoming</td></tr> <tr><td>6</td><td>Neural Networks Foundations</td><td>upcoming</td></tr> <tr><td>7</td><td><a href="../module-07/slides.html">Fairness Frameworks & Metrics</a></td><td>✓ done</td></tr> <tr><td>8</td><td>Auditing & Interpretability</td><td>upcoming</td></tr> </table> Say **"start module 4"** when ready.