Module 2: JOINs

class: center, middle, inverse, title-slide

.title[
# Module 2: JOINs
]
.subtitle[
## The single most important SQL skill after SELECT
]

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr>
<tr><td>2</td><td>JOINs (you are here)</td><td>← current</td><td><a href="#sample">M2</a></td></tr>
<tr><td>3</td><td>GROUP BY, HAVING, Conversion Funnels</td><td>upcoming</td><td></td></tr>
<tr><td>4</td><td>Subqueries and CTEs</td><td>upcoming</td><td></td></tr>
<tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr>
</table>

---

# The Four Flavors

| JOIN type | Returns |
|---|---|
| `INNER JOIN` (or `JOIN`) | Only rows matched in both tables |
| `LEFT JOIN` | All left rows; right rows where matched (NULL otherwise) |
| `RIGHT JOIN` | All right rows; left where matched (rarely used — flip and use LEFT) |
| `FULL OUTER JOIN` | All rows from both; NULLs where no match |

**The fifth pattern:** anti-join — "rows in A that have NO match in B". There's no `ANTI JOIN` keyword. You do it with `LEFT JOIN ... WHERE B.id IS NULL` or `NOT EXISTS`.

---

# Syntax

```sql
SELECT  r.ride_id, d.gender
FROM    rides AS r
JOIN    drivers AS d
  ON    r.driver_id = d.driver_id
WHERE   ...;
```

The `ON` clause specifies which column links the two tables — the same column you'd put in `by =` in dplyr's `left_join(rides, drivers, by = "driver_id")`.

Three things:

1. **Always use table aliases** (`r`, `d`, etc.). Shorter, and disambiguates columns that exist in both tables (like `driver_id`).
2. **`ON` is the join condition** — anything that evaluates TRUE/FALSE works, including non-equality.
3. **`USING (col)`** is shorthand for `ON r.col = d.col` when the column has the same name in both tables.

---

# Inner JOIN: Table 1

**rides** — we want the ride info. Notice the `driver_id` column:

```sql
SELECT ride_id, driver_id, fare_usd FROM rides LIMIT 5;
```

Table: 5 records

|ride_id | driver_id| fare_usd|
|:-------|---------:|--------:|
|1       |       429|    16.52|
|2       |        83|    19.78|
|3       |       535|    26.97|
|4       |       101|    13.68|
|5       |        85|    16.65|

We want to attach each driver's `gender` to their rides. But `gender` lives in a different table...

---
count: false

# Inner JOIN: Table 2

**drivers** — has the `gender` column, linked by the same `driver_id`:

```sql
SELECT driver_id, gender FROM drivers LIMIT 5;
```

Table: 5 records

|driver_id |gender |
|:---------|:------|
|1         |M      |
|2         |M      |
|3         |M      |
|4         |M      |
|5         |M      |

The `driver_id` column appears in **both** tables — that's the link. The JOIN will match rows where they agree.

---
count: false

# Inner JOIN: The Result

```sql
SELECT r.ride_id, r.driver_id, d.gender, r.fare_usd
FROM   rides r
JOIN   drivers d ON r.driver_id = d.driver_id  -- swap to "LEFT JOIN" and rides with no matching driver stay (NULL gender)
LIMIT 5;
```

Table: 5 records

|ride_id | driver_id|gender | fare_usd|
|:-------|---------:|:------|--------:|
|1       |       429|F      |    16.52|
|2       |        83|M      |    19.78|
|3       |       535|M      |    26.97|
|4       |       101|M      |    13.68|
|5       |        85|M      |    16.65|

Each ride now has the driver's `gender` attached. The `ON r.driver_id = d.driver_id` told SQL which rows to match — same as `inner_join(rides, drivers, by = "driver_id")` in dplyr.

Only rows where `driver_id` exists in **both** tables are kept.

---

# LEFT JOIN: Table 1

**requests** — all ride requests, including ones that were rejected:

```sql
SELECT request_id, rider_id, accepted FROM requests LIMIT 5;
```

Table: 5 records

|request_id | rider_id| accepted|
|:----------|--------:|--------:|
|1          |     3632|        1|
|2          |      912|        1|
|3          |     1119|        1|
|4          |     2529|        1|
|5          |        7|        1|

Some requests have `accepted = 0` — no ride was completed. What happens to those when we join?

---
count: false

# LEFT JOIN: Table 2

**rides** — only exists for requests that were actually accepted:

```sql
SELECT request_id, ride_id, fare_usd FROM rides LIMIT 5;
```

Table: 5 records

|request_id | ride_id| fare_usd|
|:----------|-------:|--------:|
|1          |       1|    16.52|
|2          |       2|    19.78|
|3          |       3|    26.97|
|4          |       4|    13.68|
|5          |       5|    16.65|

Rejected requests have **no matching row** in `rides`. A LEFT JOIN keeps them anyway (with NULLs for the missing ride columns).

---
count: false

# LEFT JOIN: The Result

```sql
SELECT req.request_id, req.accepted, r.fare_usd
FROM   requests req
LEFT JOIN rides r ON req.request_id = r.request_id  -- swap to "JOIN" (no LEFT) and rejected requests disappear
LIMIT 5;
```

Table: 5 records

|request_id | accepted| fare_usd|
|:----------|--------:|--------:|
|1          |        1|    16.52|
|2          |        1|    19.78|
|3          |        1|    26.97|
|4          |        1|    13.68|
|5          |        1|    16.65|

**All** requests are kept (LEFT side). If a request was not accepted, there's no matching ride → `fare_usd` is **NULL**.

This is the key difference from INNER JOIN: unmatched left-side rows stay in the result (with NULLs for the right side) instead of being dropped.

---

# Anti-Join: "Who has NO match?"

**The question:** which drivers have never completed a ride?

In dplyr: `anti_join(drivers, rides, by = "driver_id")` — one line, done.

SQL has **no `ANTI JOIN` keyword**. You fake it in two steps:

1. **LEFT JOIN** — keep all drivers; NULLs where no ride exists
2. **Filter for NULLs** — keep only the unmatched ones

```sql
SELECT d.driver_id, d.gender
FROM   drivers d
LEFT JOIN rides r ON d.driver_id = r.driver_id
WHERE  r.ride_id IS NULL
LIMIT 5;
```

Table: 0 records

| driver_id|gender |
|---------:|:------|

The `WHERE r.ride_id IS NULL` keeps only drivers whose LEFT JOIN produced no match — i.e., drivers with zero rides. That's the anti-join.

---
name: self-join

# Self-Join: "Compare to another row of the same table"

**Q:** For each ride, what was the driver's *previous* ride?
**A:** Join `rides` against itself, with a time inequality, then keep the most-recent earlier match.

<pre style="background:#f5f5f5;padding:0.6em 0.9em;border-radius:4px;font-family:'Source Code Pro',Menlo,monospace;font-size:85%;line-height:1.4;margin:0;">
③ SELECT r1.ride_id, r1.driver_id, r1.started_at,
 MAX(r2.started_at) AS previous_ride
① FROM rides r1
 LEFT JOIN rides r2 ON r1.driver_id = r2.driver_id
 AND r2.started_at &lt; r1.started_at
② GROUP BY r1.ride_id
④ LIMIT 5;
</pre>

.small[

Table: 5 records

|ride_id | driver_id|started_at                 |previous_ride              |
|:-------|---------:|:--------------------------|:--------------------------|
|1       |       429|2025-08-19 00:07:40.051507 |2025-07-11 00:05:42.188794 |
|2       |        83|2025-08-19 00:09:02.160463 |2025-08-17 00:07:54.489205 |
|3       |       535|2025-01-04 00:05:37.120311 |2025-01-01 00:05:31.354864 |
|4       |       101|2025-08-05 00:03:57.718588 |2025-07-25 00:07:44.473374 |
|5       |        85|2025-06-13 00:03:00.10871  |2025-06-08 00:09:19.094344 |
]

---
count: false

# Self-Join: reading the aliases

`r1` and `r2` are **two copies of the same table**, joined together. Treat them as two separate tables that just happen to share rows.

| Alias | Role |
|---|---|
| `r1` | the **anchor** ride — the row we're computing "previous" for |
| `r2` | a **candidate** previous ride — any earlier ride by the same driver |

The two `ON` conditions narrow `r2` to *eligible* matches:

1. `r1.driver_id = r2.driver_id` — same driver
2. `r2.started_at < r1.started_at` — earlier in time

Then `MAX(r2.started_at)` after `GROUP BY r1.ride_id` picks the **most recent** of those earlier rides — the immediate predecessor. A driver's *first* ride has no eligible `r2`, so `previous_ride` is `NULL` (LEFT JOIN preserves it).

**Window-function alternative** (Module 5): `LAG(started_at) OVER (PARTITION BY driver_id ORDER BY started_at)` does the same thing in one expression — cheaper and clearer. Self-join is the *pre*-window-function pattern; it still shows up in interviews because not every dialect/version has window functions.

---
name: multi-chain

# Pattern: Multi-table chain

**Q:** I need a column that lives several joins away. How do I get it?
**A:** Chain joins one hop at a time. Numerals = hop order.

<pre style="background:#f5f5f5;padding:0.6em 0.9em;border-radius:4px;font-family:'Source Code Pro',Menlo,monospace;font-size:90%;line-height:1.4;margin:0;">
③ SELECT c.name AS city,
 COUNT(*) AS n_rides,
 ROUND(AVG(r.fare_usd), 2) AS avg_fare
① FROM rides r
 JOIN requests req ON r.request_id = req.request_id
 JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id
 JOIN cities c ON n.city_id = c.city_id
② GROUP BY c.name
④ ORDER BY n_rides DESC;
</pre>

.small[

Table: 5 records

|city          | n_rides| avg_fare|
|:-------------|-------:|--------:|
|Chicago       |    7946|    12.03|
|Boston        |    7906|    12.04|
|Austin        |    7780|    12.03|
|Seattle       |    7700|    11.94|
|San Francisco |    7668|    12.02|
]

---
count: false

# Pattern: Multi-table chain — reading the aliases

The single-letter aliases are shorthand for table names:

| Alias | Table | What it contributes |
|---|---|---|
| `r` | `rides` | fare, distance, timing |
| `req` | `requests` | links rides to pickup neighborhood |
| `n` | `neighborhoods` | links neighborhood to city |
| `c` | `cities` | the city name we want |

So `c.name` = "the `name` column from the `cities` table" — like `cities$name` in R.

<details><summary>Click here to see the same query in R / dplyr</summary>
<pre>
rides |>
 left_join(requests, by = "request_id") |>
 left_join(neighborhoods, by = c("pickup_nbhd_id" = "nbhd_id")) |>
 left_join(cities, by = "city_id") |>
 group_by(city = name) |>
 summarise(n_rides = n(), avg_fare = round(mean(fare_usd), 2)) |>
 arrange(desc(n_rides))
</pre>
</details>

---

# Common Mistakes

**1. Cartesian explosion.** Forget the `ON` clause and you get every row in A paired with every row in B. With 39k rides × 800 drivers = 31M rows. **Always check the join before running.**

**2. Inner join silently dropping rows.** If 5% of rides have a `driver_id` that doesn't exist in `drivers`, an inner join drops them silently. The honest pattern is LEFT JOIN + check for NULLs.

**3. Aggregating after a join.** Joining can multiply rows. Always check: one-to-one, one-to-many, or many-to-many?

**4. Filtering on the right side of a LEFT JOIN.** Putting `WHERE r.fare_usd > 20` after a `LEFT JOIN rides r` silently turns it into an INNER JOIN — because it filters out the NULL rows. Put the filter in `ON` instead.

---
class: inverse, center, middle

# Interview Questions
### <a href="../module-01/slides.html#schema">Schema reference</a>

---

# Q1. For each rider, how many unique drivers have they ridden with?

<a class="write-ref" href="../module-03/slides.html#write-order">WO</a>
*Hint: JOIN `rides` to `requests` (to get `rider_id`), then `COUNT(DISTINCT driver_id)`.*

```sql
SELECT req.rider_id,
       COUNT(DISTINCT r.driver_id) AS n_unique_drivers,
       COUNT(*)                    AS n_rides
FROM   rides r
JOIN   requests req ON r.request_id = req.request_id
GROUP BY req.rider_id
ORDER BY n_unique_drivers DESC
LIMIT 5;
```

Table: 5 records

| rider_id| n_unique_drivers| n_rides|
|--------:|----------------:|-------:|
|     2340|               23|      23|
|     1163|               21|      21|
|      854|               21|      21|
|      458|               21|      21|
|     3085|               20|      20|

---

# Q2. For each ride, how many days had the driver been on the platform?

<a class="write-ref" href="../module-03/slides.html#write-order">WO</a>
*Hint: JOIN `rides` to `drivers`, then `julianday(started_at) - julianday(signup_date)`.*

.small[

```sql
SELECT r.ride_id, r.driver_id,
       ROUND(julianday(r.started_at) - julianday(d.signup_date)) AS tenure_days
FROM rides r JOIN drivers d ON r.driver_id = d.driver_id
LIMIT 5;
```

Table: 5 records

|ride_id | driver_id| tenure_days|
|:-------|---------:|-----------:|
|1       |       429|        1034|
|2       |        83|        1220|
|3       |       535|        -138|
|4       |       101|         972|
|5       |        85|         698|
]

---

# Q3. Which rider-driver pairs have had 3 or more 5-star rides together?

<a class="write-ref" href="../module-03/slides.html#write-order">WO</a>
*Hint: JOIN `rides` to `requests`, filter for `rider_rating = 5`, GROUP BY the pair, HAVING >= 3.*

```sql
SELECT req.rider_id, r.driver_id, COUNT(*) AS n_5star
FROM   rides r
JOIN   requests req ON r.request_id = req.request_id
WHERE  r.rider_rating = 5
GROUP BY req.rider_id, r.driver_id
HAVING COUNT(*) >= 3
ORDER BY n_5star DESC
LIMIT 5;
```

Table: 0 records

| rider_id| driver_id|n_5star |
|--------:|---------:|:-------|

.small[
**Why `HAVING COUNT(*) >= 3` instead of `HAVING n_5star >= 3`?** Because `n_5star` is a column alias defined in SELECT, and HAVING runs *before* SELECT. You have to repeat the expression. (SQLite/MySQL are lenient about this; PostgreSQL is not — use the safe version.)
]

---
class: inverse

# The Key Takeaways

### 1. INNER, LEFT, anti-join — you'll use these three constantly. RIGHT and FULL OUTER, almost never.

### 2. The most common bug: filtering the right side of a LEFT JOIN in WHERE. Always put it in ON.

### 3. Build multi-table joins one step at a time. Don't try to write 5-table joins from scratch.

---
class: sample-code

name: sample
# Sample Code: Rides & Riders Per City/Gender

```sql
-- INNER JOIN + LEFT JOIN + anti-join + multi-table chain
SELECT c.name AS city, d.gender,
 COUNT(*) AS n_rides,
 COUNT(DISTINCT req.rider_id) AS n_riders
FROM rides r
JOIN drivers d ON r.driver_id = d.driver_id -- inner
JOIN requests req ON r.request_id = req.request_id -- inner
JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id -- chain
JOIN cities c ON n.city_id = c.city_id -- chain
WHERE d.driver_id NOT IN ( -- anti-join
 SELECT driver_id FROM rides
 WHERE fare_usd < 5)
GROUP BY c.name, d.gender
ORDER BY c.name, n_rides DESC;
```

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr>
<tr><td>2</td><td>JOINs (just finished)</td><td>✓ done</td></tr>
<tr><td>3</td><td>GROUP BY, HAVING, Conversion Funnels</td><td>next</td><td></td></tr>
<tr><td>4</td><td>Subqueries and CTEs</td><td>upcoming</td><td></td></tr>
<tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr>
</table>