class: center, middle, inverse, title-slide .title[ # Module 7: Fairness Frameworks & Metrics ] .subtitle[ ## When You Can’t Be Fair in More Than One Way ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .small { font-size: 80%; } .two-col { display: flex; align-items: flex-start; gap: 1em; } .col-narrow { flex: 1; } .col-wide { flex: 2; } .metrics-ref { position: absolute; bottom: 12px; right: 90px; background: #fff8dc; border: 1px solid #d4b400; border-radius: 4px; padding: 1px 6px; font-size: 11px; text-decoration: none; color: #6b5a00; font-weight: bold; z-index: 100; } .metrics-ref:hover { background: #ffeeaa; text-decoration: none; } .cm-corner { position: absolute; top: 90px; right: 40px; border-collapse: collapse; font-size: 90%; } .cm-corner th, .cm-corner td { border: 1px solid #888; padding: 8px 14px; text-align: center; font-weight: bold; } .cm-tp { background: #c8e6c9; color: #1b5e20; } .cm-fn { background: #ffe0b2; color: #e65100; } .cm-fp { background: #ffcdd2; color: #b71c1c; } .cm-tn { background: #bbdefb; color: #0d47a1; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">The Learning Problem</a></td><td>✓ done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">Linear Models</a></td><td>✓ done</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Model Evaluation & Selection</a></td><td>✓ done</td></tr> <tr><td>4</td><td>Tree-Based Methods</td><td>upcoming</td></tr> <tr><td>5</td><td>Unsupervised Learning</td><td>upcoming</td></tr> <tr><td>6</td><td>Neural Networks Foundations</td><td>upcoming</td></tr> <tr><td><b>7</b></td><td><b>Fairness Frameworks & Metrics</b> <i>(you are here)</i></td><td>← current</td></tr> <tr><td>8</td><td>Auditing & Interpretability</td><td>upcoming</td></tr> </table> --- # From Module 3: We Saw the Gaps <a class="metrics-ref" href="#metrics-ref">CM</a> Module 3's audit showed that the same model could have: - Equal AUC across groups - Different precision, recall, FPR per group -- That tells us **the model isn't behaving the same for everyone**. But it doesn't tell us: -- - Which gap matters? - What does "fair" actually mean? - Can we close all the gaps at once? -- This module gives you the formal answers, and one **mathematical impossibility result** that says some of those answers exclude each other. --- # The Setup For a binary classifier: - `\(Y \in \{0, 1\}\)` — actual outcome (e.g. driver accepted) - `\(\hat{Y} \in \{0, 1\}\)` — model's prediction - `\(A\)` — protected attribute (e.g. minority status), used **only** for auditing -- A "fairness criterion" is some equality between groups defined on `\(Y\)`, `\(\hat{Y}\)`, and `\(A\)`. -- There are several. The famous ones turn out to be **mutually incompatible**. --- # Three Fairness Criteria | Criterion | Equality required | |---|---| | **Demographic parity** | `\(P(\hat Y = 1 \mid A = 0) = P(\hat Y = 1 \mid A = 1)\)` | | **Equalized odds** | `\(P(\hat Y = 1 \mid Y = y, A = 0) = P(\hat Y = 1 \mid Y = y, A = 1)\)` for `\(y \in \{0, 1\}\)` | | **Predictive parity** | `\(P(Y = 1 \mid \hat Y = 1, A = 0) = P(Y = 1 \mid \hat Y = 1, A = 1)\)` | -- Each one says something different about what "fair" means. Let's read each. --- # 1. 
Demographic Parity `$$P(\hat Y = 1 \mid A = 0) = P(\hat Y = 1 \mid A = 1)$$` **The model says "accept" at the same rate for both groups.** -- - Cares only about the model's outputs, **ignores the actual labels** - Makes sense when the positive decision is itself the resource (a job offer, a loan, an Uber dispatch) -- **Failure mode:** if base rates genuinely differ, hitting equal positive rates forces the model to *miss real positives* in one group or *flag real negatives* in the other. --- # 2. Equalized Odds <a class="metrics-ref" href="#metrics-ref">CM</a> `$$P(\hat Y = 1 \mid Y = 1, A = 0) = P(\hat Y = 1 \mid Y = 1, A = 1) \quad \text{(equal TPR)}$$` `$$P(\hat Y = 1 \mid Y = 0, A = 0) = P(\hat Y = 1 \mid Y = 0, A = 1) \quad \text{(equal FPR)}$$` **Among the actually-positive, catch the same fraction in each group. Among the actually-negative, falsely flag at the same rate.** -- - Allows different positive rates *only if the truth says so* - A weaker version, **equal opportunity**, requires only equal TPR -- **Makes sense when:** there's a real outcome you care about and missing it has direct human cost (medical, fraud, dispatch). -- **Failure mode:** assumes the labels `\(Y\)` are an unbiased measure of the truth. If `\(Y\)` itself is a biased proxy (e.g., past *arrests* used as a stand-in for past *crimes*), enforcing equal TPR/FPR just reproduces the labelling bias — and the usual remedy, post-processing with group-specific thresholds, uses the protected attribute at decision time. --- # 3. Predictive Parity <a class="metrics-ref" href="#metrics-ref">CM</a> `$$P(Y = 1 \mid \hat Y = 1, A = 0) = P(Y = 1 \mid \hat Y = 1, A = 1)$$` **A "positive" prediction means the same thing in each group** — equal precision. -- **Worked example.** A driver-acceptance model flags 200 ride requests as "will accept" in each group. After observing actual outcomes: | | Predicted accept | Actually accepted | Precision | |---|---|---|---| | Non-minority | 200 | 170 | 170/200 = **0.85** | | Minority | 200 | 168 | 168/200 = **0.84** | These precisions are essentially equal → **predictive parity holds**. When the model says "accept", it's right ~85% of the time *regardless of group*. -- - **Stronger version:** *calibration within groups*. At every probability score `\(s\)`, the actual rate equals `\(s\)` in **both** groups (not just at the threshold). - This is exactly the per-group calibration check from Module 3 — re-read as a fairness criterion. -- **Failure mode:** says nothing about the *innocent* (the negatives). A model can have equal precision in both groups yet flag innocent minority riders at twice the non-minority rate — predictive parity is satisfied while equalized odds is not. --- # The Impossibility Theorem > **Chouldechova (2017) / Kleinberg, Mullainathan & Raghavan (2017):** if base rates differ between groups, `\(P(Y = 1 \mid A = 0) \neq P(Y = 1 \mid A = 1)\)`, and the classifier is not perfect, then **at most one of {demographic parity, equalized odds, predictive parity}** can hold. -- This is **not a quirk of any model**. It is a mathematical fact about the joint distribution of `\(Y\)`, `\(\hat Y\)`, `\(A\)`. -- Enforce any one of them, and the other two **must** break. -- **This is why fairness is a choice, not a calculation.** --- # The COMPAS Argument, Revisited <a class="metrics-ref" href="#metrics-ref">CM</a> Module 3 showed how COMPAS triggered a famous fairness debate: -- - **ProPublica** said: COMPAS is unfair — it has a **higher FPR for Black defendants** (innocent people flagged at twice the rate).
Equalized odds is broken. -- - **Northpointe** said: COMPAS is fair — a score of 7 means the same recidivism risk for Black and white defendants. **Predictive parity holds**. -- - **Both were right.** Base rates differ; the impossibility theorem says you cannot satisfy both at once. Choosing which to break is a value judgment, not a math error. --- # The Same Argument, with Driver Acceptance <a class="metrics-ref" href="#metrics-ref">CM</a> 1,000 ride requests per group (positive = "actually accepted"). **Different base rates** — that's what makes the theorem bite. <div class="two-col"> <div class="col-narrow"> <p><b>Non-minority</b> (60% accept)</p> <table> <tr><th></th><th>Pred = 1</th><th>Pred = 0</th></tr> <tr><th>Actual = 1</th><td>540</td><td>60</td></tr> <tr><th>Actual = 0</th><td>95</td><td>305</td></tr> </table> </div> <div class="col-narrow"> <p><b>Minority</b> (40% accept)</p> <table> <tr><th></th><th>Pred = 1</th><th>Pred = 0</th></tr> <tr><th>Actual = 1</th><td>240</td><td>160</td></tr> <tr><th>Actual = 0</th><td>40</td><td>560</td></tr> </table> </div> </div> **Precision** = TP/(TP+FP): non-min `\(\tfrac{540}{635} \approx 0.85\)`, min `\(\tfrac{240}{280} \approx 0.86\)` → **predictive parity ✓** (Northpointe) **TPR** = TP/(TP+FN): non-min `\(\tfrac{540}{600} = 0.90\)`, min `\(\tfrac{240}{400} = 0.60\)` → **equalized odds ✗** (ProPublica) **TPR gap of 0.30 on the same model — both sides right.** --- # When Predictive Parity Wins: A Cancer Screen <a class="metrics-ref" href="#metrics-ref">CM</a> Cancer screening test, two populations: A (5% prevalence), B (10%). Same model. -- **Equalized odds.** Equal TPR and FPR across groups. Then a "positive" test result means **different actual probabilities** in each group — perhaps `\(P(\text{cancer} \mid +) = 0.30\)` in A, `\(0.50\)` in B. -- The doctor now has to **mentally re-weight every score by the patient's group** — exactly the explicit demographic-aware decision fairness was supposed to prevent. -- **Predictive parity.** Force `\(P(\text{cancer} \mid +)\)` equal in both groups. The number on the page means what it says, regardless of who the patient is. Cost: more healthy patients in B get false alarms (more sick patients in B → any threshold catches more of both). --- # When Predictive Parity Wins: The Rule of Thumb **Which criterion to prioritize depends on how the score is consumed.** -- - **Fixed cutoff with concentrated harm on false positives → equalized odds** - COMPAS, hiring screens, fraud flags, no-fly lists - "Flagged" is a near-binary action with a sharp downside -- - **Score read as a calibrated probability → predictive parity** - Medical risk, credit risk, weather forecasts, insurance pricing - The number itself drives the decision; if it doesn't mean the same thing across groups, the decision-maker must use the protected attribute explicitly -- **The COMPAS reframe.** ProPublica's intuition fit COMPAS because judges *used* the score as a near-binary "high vs low" cutoff — the FPR gap translated directly into more Black defendants being detained. -- Had COMPAS been used as a calibrated probability fed into a more nuanced decision rule, **Northpointe's defense would have been the stronger one**. ProPublica was right *for COMPAS specifically*, not as a general principle. 
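---

# Checking the Criteria in Code

The gaps on the driver-acceptance slide can be reproduced directly from the confusion-matrix counts. A minimal sketch in base R (the `criteria()` helper is our own name, not from any fairness package):

```r
# Compute the group-level rates behind the three criteria from raw
# confusion-matrix counts (numbers from the driver-acceptance slide).
criteria <- function(tp, fn, fp, tn) {
  c(
    positive_rate = (tp + fp) / (tp + fn + fp + tn),  # demographic parity compares this
    tpr           = tp / (tp + fn),                   # equalized odds: this ...
    fpr           = fp / (fp + tn),                   # ... and this
    precision     = tp / (tp + fp)                    # predictive parity compares this
  )
}

round(rbind(
  non_minority = criteria(tp = 540, fn = 60,  fp = 95, tn = 305),
  minority     = criteria(tp = 240, fn = 160, fp = 40, tn = 560)
), 2)
#>              positive_rate  tpr  fpr precision
#> non_minority          0.64 0.90 0.24      0.85
#> minority              0.28 0.60 0.07      0.86
```

Precision matches across groups while TPR and FPR do not: the same gap pattern the next slide explores across thresholds.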
--- # Empirical Demonstration: Same Data, Sweep the Threshold <a class="metrics-ref" href="#metrics-ref">CM</a> <img src="slides_files/figure-html/impossibility-plot-1.png" style="display: block; margin: auto;" /> --- # How Do You Enforce a Criterion? Three injection points: -- **Pre-processing** — massage the training data (reweight, resample, transform features) before training. - Pros: model-agnostic. Cons: throws away signal; the model can relearn proxies. -- **In-processing** — modify the training objective (fairness penalty, adversarial debiasing). - Pros: principled. Cons: needs custom training code. -- **Post-processing** — adjust the **decision rule** per group on the trained model's scores. - Pros: simplest, model-agnostic. Cons: explicitly uses the protected attribute at decision time, which is legally restricted in many domains. --- # Post-Processing: Group-Specific Thresholds <a class="metrics-ref" href="#metrics-ref">CM</a> The simplest fix: pick a different threshold for each group. -- |Metric | Before| After| |:----------------------|------:|------:| |Demographic parity gap | -0.421| 0.000| |TPR gap (eq. opp.) | -0.324| 0.030| |Precision gap | -0.028| -0.162| -- We **closed the demographic parity gap** (and most of the TPR gap) by using thresholds 0.66 (majority) and 0.35 (minority). But the **precision gap got worse** — exactly what the impossibility theorem predicts. --- # Accuracy vs Fairness Frontier <img src="slides_files/figure-html/frontier-plot-1.png" style="display: block; margin: auto;" /> Every fairness intervention sits **somewhere on this curve**. Math draws the curve; humans pick the point. --- # The Frontier in Dollars <img src="slides_files/figure-html/frontier-dollars-1.png" style="display: block; margin: auto;" /> .small[ **Assumptions** (deliberately rough): Uber processes ~25M daily ride requests; the platform earns ~$5 of revenue per accepted ride; each percentage point of lost accuracy maps 1:1 to lost matches. Total matching revenue ≈ 25M × $5 × 365 ≈ **$45.6B/year**, so 1 pp of accuracy ≈ **$456M/year**. Closing the entire DP gap costs roughly **$3.5B/year** in this back-of-envelope calculation. Real platforms negotiate this number against legal exposure, brand risk, and the political cost of sustained disparities. ] --- class: inverse # The Key Questions <br> ### 1. Which fairness criterion are you protecting? Why that one? -- <br> ### 2. What does enforcing it cost — in accuracy, *and* in the other criteria? -- <br> ### 3. Who gets to make that choice? (Spoiler: it's not the data scientist alone.) --- # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">The Learning Problem</a></td><td>✓ done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">Linear Models</a></td><td>✓ done</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Model Evaluation & Selection</a></td><td>✓ done</td></tr> <tr><td>4</td><td>Tree-Based Methods</td><td>upcoming</td></tr> <tr><td>5</td><td>Unsupervised Learning</td><td>upcoming</td></tr> <tr><td>6</td><td>Neural Networks Foundations</td><td>upcoming</td></tr> <tr><td>7</td><td>Fairness Frameworks & Metrics <i>(just finished)</i></td><td>✓ done</td></tr> <tr><td><b>8</b></td><td><b>Auditing & Interpretability</b></td><td>next</td></tr> </table> Say **"start module 8"** when ready.
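---

# Reference: Group-Specific Thresholds in Code

Not the code behind the earlier Before/After table, just an illustrative base R sketch on synthetic scores (all data and helper names below are made up) showing how per-group thresholds move the gaps:

```r
# Synthetic audit data with different base rates per group.
set.seed(1)
n     <- 4000
group <- sample(c("majority", "minority"), n, replace = TRUE)
y     <- rbinom(n, 1, ifelse(group == "majority", 0.60, 0.40))
score <- plogis(2 * (y - 0.5) + rnorm(n))   # noisy but informative score

# Minority-minus-majority gap in positive rate, TPR, and precision,
# for a given named vector of per-group thresholds.
gaps <- function(thresholds) {
  yhat <- as.integer(score >= thresholds[group])
  gap  <- function(f) f("minority") - f("majority")
  c(dp_gap  = gap(function(g) mean(yhat[group == g])),
    tpr_gap = gap(function(g) mean(yhat[group == g & y == 1])),
    ppv_gap = gap(function(g) mean(y[group == g & yhat == 1])))
}

round(gaps(c(majority = 0.5,  minority = 0.5)),  3)  # one shared threshold
round(gaps(c(majority = 0.66, minority = 0.35)), 3)  # per-group thresholds
```

On data like this, lowering the minority threshold closes the demographic parity gap while pushing the precision gap the other way, which is the trade-off the earlier table showed on real scores.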
--- name: metrics-ref # Reference: Metrics from the Confusion Matrix <table class="cm-corner"> <tr><th></th><th>Pred = 1</th><th>Pred = 0</th></tr> <tr><th>Actual = 1</th><td class="cm-tp">TP</td><td class="cm-fn">FN</td></tr> <tr><th>Actual = 0</th><td class="cm-fp">FP</td><td class="cm-tn">TN</td></tr> </table> - **Accuracy** = `\(\dfrac{\color{#1b5e20}{TP} + \color{#0d47a1}{TN}}{\color{#1b5e20}{TP} + \color{#e65100}{FN} + \color{#b71c1c}{FP} + \color{#0d47a1}{TN}}\)` — overall fraction correct - **Precision** = `\(\dfrac{\color{#1b5e20}{TP}}{\color{#1b5e20}{TP} + \color{#b71c1c}{FP}}\)` — of what I predicted positive, how many were real? - **True positive rate (recall)** = `\(\dfrac{\color{#1b5e20}{TP}}{\color{#1b5e20}{TP} + \color{#e65100}{FN}}\)` — of the real positives, how many did I predict positive? - **False positive rate** = `\(\dfrac{\color{#b71c1c}{FP}}{\color{#b71c1c}{FP} + \color{#0d47a1}{TN}}\)` — of the real negatives, how many did I predict positive? - **F1** = `\(2 \cdot \dfrac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)` — harmonic mean (penalizes lopsided models) **Worked example.** 1000 ride requests, positive = "accepted": TP = 480, FN = 220, FP = 130, TN = 170. `$$\text{accuracy} = \tfrac{480 + 170}{1000} = 0.65 \quad \text{precision} = \tfrac{480}{480 + 130} \approx 0.79 \quad \text{recall} = \tfrac{480}{480 + 220} \approx 0.69 \quad \text{FPR} = \tfrac{130}{300} \approx 0.43$$`
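---

# Reference: The Same Metrics in Code

A base R sanity check of the worked example above (not part of any earlier audit code):

```r
# Confusion-matrix counts from the worked example: 1000 ride requests.
tp <- 480; fn <- 220; fp <- 130; tn <- 170

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)   # true positive rate

round(c(
  accuracy  = (tp + tn) / (tp + fn + fp + tn),
  precision = precision,
  recall    = recall,
  fpr       = fp / (fp + tn),
  f1        = 2 * precision * recall / (precision + recall)
), 3)
#>  accuracy precision    recall       fpr        f1
#>     0.650     0.787     0.686     0.433     0.733
```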