class: center, middle, inverse, title-slide

.title[
# Module 2: Audit & Correspondence Studies
]
.subtitle[
## Ge et al. (2016) and the Workhorse of the Modern Literature
]

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">Theory Primer</a></td><td>✓ done</td></tr>
<tr><td><b>2</b></td><td><b>Audit & Correspondence Studies</b> <i>(you are here)</i></td><td>← current</td></tr>
<tr><td>3</td><td>Decomposition Methods</td><td>upcoming</td></tr>
<tr><td>4</td><td>Algorithmic Audits</td><td>upcoming</td></tr>
<tr><td>5</td><td>Modern Methods</td><td>upcoming</td></tr>
</table>

---

# Why Audit Studies Exist

Module 1 ended with a frustration: **regression with controls cannot distinguish**

> "no discrimination" from "discrimination perfectly mediated by a proxy"

If race influences neighborhood and neighborhood influences acceptance, then *controlling for neighborhood* removes the very channel through which race operates.

The fix: **manufacture variation in the protected attribute** that is uncorrelated with everything else.

That's exactly what audit studies do. They are the only design that gives a clean causal estimate without strong structural assumptions.
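
A minimal simulation of the mediation problem, under an assumed data-generating process: `group` shifts the chance of living in neighborhood 1, and acceptance depends only on `neighborhood`. All probabilities here are illustrative assumptions, not estimates from any study. By construction the total gap is about -0.12, while the coefficient on `group` after controlling for `neighborhood` should sit near zero.

```r
set.seed(1)
n <- 20000

# Hypothetical DGP: group affects neighborhood, neighborhood affects acceptance
group        <- rbinom(n, 1, 0.5)
neighborhood <- rbinom(n, 1, ifelse(group == 1, 0.8, 0.2))
accepted     <- rbinom(n, 1, ifelse(neighborhood == 1, 0.70, 0.90))

# Linear probability models without and with the mediator as a control
total_gap      <- unname(coef(lm(accepted ~ group))["group"])
controlled_gap <- unname(coef(lm(accepted ~ group + neighborhood))["group"])

round(c(total = total_gap, controlled = controlled_gap), 3)
```

The controlled regression reports "no gap" even though group membership lowers acceptance through where people live.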

---

# The Matched-Pair Audit Design

1. Construct two profiles, `$A$` and `$B$`, **identical on everything observable** except the protected attribute (or its proxy)

2. Submit both to the same gatekeeper (employer, driver, landlord)

3. Record the decision (interview / acceptance / approval)

4. Compare rates

**Identification:** the only difference between profiles is the protected attribute, so any difference in outcomes is causally attributable to it. **No controls needed** — the design handles confounding by construction.
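
The four steps can be sketched in a small simulation. The gatekeeper count, the range of baseline strictness, and the 5 pp `penalty` are all assumed values chosen for illustration. Because each gatekeeper sees both profiles, unobserved strictness is held fixed within a pair and drops out of the comparison.

```r
set.seed(2)
n_gatekeepers <- 5000

# Hypothetical gatekeepers differ in baseline strictness, which we never observe
baseline <- runif(n_gatekeepers, 0.02, 0.20)
penalty  <- 0.05  # assumed true effect of the protected attribute

# Steps 1-3: both profiles go to the same gatekeeper, and we record each decision
reject_A <- rbinom(n_gatekeepers, 1, baseline)
reject_B <- rbinom(n_gatekeepers, 1, baseline + penalty)

# Step 4: compare rates using within-pair differences
pair_diff <- reject_B - reject_A
estimate  <- mean(pair_diff)
se        <- sd(pair_diff) / sqrt(n_gatekeepers)

round(c(estimate = estimate, se = se), 3)
```

The estimate should land close to the assumed 0.05 without any control variables.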

---

# The Canonical Example: Bertrand-Mullainathan (2004)

**Setting:** ~5,000 résumés sent in response to job postings in Boston and Chicago.

**Randomization:**
- **Name** (white-sounding: Emily, Greg vs African-American–sounding: Lakisha, Jamal)
- **Quality** (high vs low credentials)

**Result:** white-named résumés got **~50% more callbacks** than identical Black-named résumés.

**Bonus finding:** the "credential premium" was much larger for white-named résumés. Black candidates couldn't compensate via résumé quality.

---

# Ge, Knittel, MacKenzie, Zoepf (2016)

The canonical ride-sharing audit study. The **centerpiece of this module**.

**Setting:** Boston and Seattle, on UberX and Lyft.

**Design:** RAs created rider profiles with stereotypically white and stereotypically African-American–sounding names. They followed pre-specified routes and submitted ride requests in matched pairs.

**Sample:** ~1,500 ride requests across the two cities.

**Outcomes measured:** cancellation rate, wait time, trip time, cost.

---

# The Headline Results

| Outcome | Setting | White | African-American | Gap / note |
|---|---|---|---|---|
| Cancellation rate | Boston, UberX, male | 4.9% | 10.1% | **~2×** |
| Wait time | Seattle, all | baseline | +30% | longer waits |
| Travel time | Boston, female | baseline | longer | indirect routes |

**The discrimination is large, statistically significant, and consistent across cities and profiles.**

The differences look small in absolute terms (5pp on cancellation), but for the affected riders this is a meaningfully degraded service experience.
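
The relative and absolute gaps for the cancellation row follow directly from the two rates in the table. The variable names below are just labels for this illustration.

```r
# Cancellation rates from the table above (Boston, UberX, male profiles)
white_rate <- 0.049
aa_rate    <- 0.101

ratio      <- aa_rate / white_rate   # relative gap
difference <- aa_rate - white_rate   # absolute gap, in proportion units

round(c(ratio = ratio, difference_pp = 100 * difference), 2)
```

This gives a ratio of about 2.06 and a difference of 5.2 percentage points, or roughly five extra cancellations per 100 requests.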

---

# The Clever Bit: Seattle

In Seattle, Uber **doesn't show rider names or photos** to drivers before the driver accepts the request.

The discrimination still happens. **How?**

The likely answer: drivers are using **pickup neighborhood** as a proxy for the rider's race, and the destination address (which they see after accepting) for further inference.

This is **exactly the Phelps story** from Module 1, embedded in real data:

- Drivers don't see race directly
- They use a correlated feature (location) to infer it
- The result is racially disparate outcomes via a "race-blind" matching mechanism

This finding is what made the paper a foundational reference for the algorithmic-fairness literature that came after.
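
A stylized version of the proxy mechanism, with invented numbers: the driver's cancellation rule below depends only on `pickup_area`, yet cancellation rates differ by `group` because the two are correlated. The shares and probabilities are assumptions for illustration, not values from Ge et al.

```r
set.seed(3)
n <- 10000

# Hypothetical riders: group membership is never shown to the driver
group <- rbinom(n, 1, 0.5)
# Pickup location is correlated with group (assumed shares)
pickup_area <- rbinom(n, 1, ifelse(group == 1, 0.7, 0.3))

# A "race-blind" rule: cancellation probability depends only on pickup area
cancel_prob <- ifelse(pickup_area == 1, 0.12, 0.04)
cancelled   <- rbinom(n, 1, cancel_prob)

# Rates by group differ even though the rule never uses group
rates <- tapply(cancelled, group, mean)
round(rates, 3)
```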

---

# What Audits Identify (and Don't)

| Question | Audit answers? |
|---|---|
| Is there a *causal* effect of perceived group membership? | **Yes** |
| Is the discrimination taste-based or statistical? | **No** |
| Does the discrimination persist in equilibrium? | **No** |
| Would policy X eliminate it? | **No** |
| Is the gap "fair" by some normative criterion? | **No** |
| What's the aggregate welfare loss? | **No** |

The right way to read an audit study: **clean documentation that the discrimination exists and is large**, *without* strong claims about mechanism or policy.

---

# Standard Threats to Validity

**1.** The manipulation must work (gatekeeper sees what we intended)

**2.** Profiles must be matched on everything else (watch for subtle confounders, such as socioeconomic status inferred from names)

**3.** The sample of gatekeepers must be representative (one city ≠ the world)

**4.** The decision being measured must be the one that matters (callback ≠ hire ≠ wage)

**5.** Ethical / legal: audit studies essentially defraud the gatekeeper. IRB approval is non-trivial; some jurisdictions push back.

---
class: inverse, center, middle

# Exercise
### A Stylized Ge et al. Replication

---

# Setup

```r
library(dplyr)   # bind_rows(), group_by(), summarise()
library(tibble)  # tibble()

set.seed(2026)
n_rides_per_group <- 500
baseline_cancel_A <- 0.05
discrimination    <- 0.05  # B sees +5pp cancellation rate

audit <- bind_rows(
  tibble(profile = "A",
         cancelled = rbinom(n_rides_per_group, 1, baseline_cancel_A)),
  tibble(profile = "B",
         cancelled = rbinom(n_rides_per_group, 1,
                            baseline_cancel_A + discrimination))
)

audit |>
  group_by(profile) |>
  summarise(n = n(),
            cancellation = round(mean(cancelled), 3),
            .groups = "drop") |>
  knitr::kable()
```

|profile |   n| cancellation|
|:-------|---:|------------:|
|A       | 500|        0.040|
|B       | 500|        0.112|

---

# Estimating the Audit Effect

```r
counts <- audit |>
  group_by(profile) |>
  summarise(x = sum(cancelled), n = n(), .groups = "drop")

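# Two-sample test of equal proportions; rows of `counts` are ordered A, B
prop_test <- prop.test(x = counts$x, n = counts$n, correct = FALSE)
# estimate is c(A, B), so diff() returns B - A
gap_estimate <- unname(diff(prop_test$estimate))
# conf.int covers A - B; negate and swap the endpoints to get B - A
ci <- c(-prop_test$conf.int[2], -prop_test$conf.int[1])
```

|Statistic             |Value            |
|:---------------------|:----------------|
|Estimated gap (B - A) |0.072            |
|95% CI                |[0.0395, 0.1045] |
|p-value               |1.74e-05         |

The point estimate (0.072) overshoots the true effect of 0.05 because of sampling noise, but the 95% CI contains 0.05 and excludes zero.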
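
To see what the interval is made of, the gap and its 95% CI can be recomputed by hand. The counts below (20 and 56 cancellations out of 500) are the ones implied by the summary table in the Setup slide. The unpooled Wald standard error is the one `prop.test()` uses for its interval when `correct = FALSE`.

```r
# Counts implied by the Setup table: 0.040 * 500 and 0.112 * 500
x_A <- 20; n_A <- 500
x_B <- 56; n_B <- 500

p_A <- x_A / n_A
p_B <- x_B / n_B
gap <- p_B - p_A  # B - A

# Unpooled (Wald) standard error of the difference in proportions
se_gap <- sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)
ci     <- gap + c(-1, 1) * qnorm(0.975) * se_gap

round(c(gap = gap, lower = ci[1], upper = ci[2]), 4)
```

The hand calculation should agree with the interval in the table above.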

---

# How Many Rides Do You Need? (Power Calculation)

```r
power_test <- power.prop.test(
  p1 = 0.05, p2 = 0.10,
  sig.level = 0.05, power = 0.8
)
ceiling(power_test$n)
```

```
## [1] 435
```

To detect a **5 pp gap** at 80% power, you need ~435 rides **per profile**, or ~870 total.

For a more realistic **2 pp gap**:

```
## 2213 per group → 4426 total
```
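
The 2 pp figure can be reproduced with a call along these lines. It assumes the same baseline rate of 0.05, so the second rate becomes 0.07. The names `power_2pp` and `n_per_group` are introduced here for illustration.

```r
# Same settings as the 5 pp calculation, with p2 lowered to 0.07 (a 2 pp gap)
power_2pp <- power.prop.test(
  p1 = 0.05, p2 = 0.07,
  sig.level = 0.05, power = 0.8
)
n_per_group <- ceiling(power_2pp$n)
cat(n_per_group, "per group,", 2 * n_per_group, "total\n")
```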

That's why most field audit studies are **expensive** — and why algorithmic / scraped audits (Module 4) became dominant.

---

# Power vs Sample Size (Monte Carlo)
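
One way to check the analytic power calculation is by simulation: draw many audits at a given sample size and count how often the test rejects. The sketch below assumes the exercise's rates (0.05 vs 0.10). The helper `simulate_power()`, the grid of sample sizes, and the number of replications are choices made here for illustration.

```r
set.seed(2026)

# Simulated power: share of replications in which the two-sample test rejects
simulate_power <- function(n_per_group, p_A = 0.05, p_B = 0.10,
                           n_sims = 2000, alpha = 0.05) {
  rejections <- replicate(n_sims, {
    # One simulated audit: cancellation counts for profiles A and B
    x <- rbinom(2, n_per_group, c(p_A, p_B))
    if (sum(x) == 0) {
      FALSE  # no cancellations at all: test undefined, count as no rejection
    } else {
      p_value <- suppressWarnings(
        prop.test(x, c(n_per_group, n_per_group), correct = FALSE)$p.value
      )
      p_value < alpha
    }
  })
  mean(rejections)
}

sample_sizes <- c(100, 200, 435, 800)
power_curve  <- sapply(sample_sizes, simulate_power)
data.frame(n_per_group = sample_sizes, power = round(power_curve, 2))
```

Simulated power should rise with sample size and land near 0.8 at 435 rides per group, in line with `power.prop.test()`.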

---
class: inverse

# The Key Takeaways

<br>

### 1. Audits are the only design that gives clean causal identification of discrimination without strong structural assumptions.

<br>

### 2. Ge et al. (2016) is the canonical ride-sharing example. Cancellation rates were roughly 2× higher for riders with African-American–sounding names (Boston, UberX, male profiles).

<br>

### 3. The Seattle finding showed that even "anonymized" matching produces racial disparities via correlated features — the empirical anchor for the algorithmic-fairness literature.

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">Theory Primer</a></td><td>✓ done</td></tr>
<tr><td>2</td><td>Audit & Correspondence Studies <i>(just finished)</i></td><td>✓ done</td></tr>
<tr><td><b>3</b></td><td><b>Decomposition Methods (Cook et al. 2021)</b></td><td>next</td></tr>
<tr><td>4</td><td>Algorithmic Audits</td><td>upcoming</td></tr>
<tr><td>5</td><td>Modern Methods</td><td>upcoming</td></tr>
</table>

Say **"start module 3"** when ready.