class: center, middle, inverse, title-slide .title[ # Module 2: JOINs ] .subtitle[ ## The single most important SQL skill after SELECT ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .write-ref { position: absolute; bottom: 12px; right: 90px; background: #f0f8ff; border: 1px solid #aac; border-radius: 4px; padding: 1px 6px; font-size: 11px; text-decoration: none; color: #336; font-weight: bold; z-index: 100; } .write-ref:hover { background: #ddeeff; text-decoration: none; } .sample-code h1 { font-size: 1.2em; margin-bottom: 0.2em; } .sample-code .remark-code { font-size: 110%; line-height: 1.35; } .small { font-size: 80%; } .two-col { display: flex; align-items: flex-start; gap: 2em; } .col-narrow { flex: 1; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr> <tr><td><b>2</b></td><td><b>JOINs</b> <i>(you are here)</i></td><td>← current</td><td><a href="#sample">M2</a></td></tr> <tr><td>3</td><td>GROUP BY, HAVING, Conversion Funnels</td><td>upcoming</td><td></td></tr> <tr><td>4</td><td>Subqueries and CTEs</td><td>upcoming</td><td></td></tr> <tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr> </table> --- # The Four Flavors | JOIN type | Returns | |---|---| | `INNER JOIN` (or `JOIN`) | Only rows matched in both tables | | `LEFT JOIN` | All left rows; right rows where matched (NULL otherwise) | | `RIGHT JOIN` | All right rows; left where matched (rarely used — flip and use LEFT) | | `FULL OUTER JOIN` | All rows from both; NULLs where no match | -- **The fifth pattern:** anti-join — "rows in A that have NO match in B". There's no `ANTI JOIN` keyword. You do it with `LEFT JOIN ... WHERE B.id IS NULL` or `NOT EXISTS`. --- # Syntax ```sql SELECT r.ride_id, d.gender FROM rides AS r JOIN drivers AS d ON r.driver_id = d.driver_id WHERE ...; ``` The `ON` clause specifies which column links the two tables — the same column you'd put in `by =` in dplyr's `left_join(rides, drivers, by = "driver_id")`. -- Three things: 1. **Always use table aliases** (`r`, `d`, etc.). Shorter, and disambiguates columns that exist in both tables (like `driver_id`). 2. **`ON` is the join condition** — anything that evaluates TRUE/FALSE works, including non-equality. 3. **`USING (col)`** is shorthand for `ON r.col = d.col` when the column has the same name in both tables. --- # Inner JOIN: Table 1 **rides** — we want the ride info. Notice the `driver_id` column: ```sql SELECT ride_id, driver_id, fare_usd FROM rides LIMIT 5; ``` Table: 5 records |ride_id | driver_id| fare_usd| |:-------|---------:|--------:| |1 | 429| 16.52| |2 | 83| 19.78| |3 | 535| 26.97| |4 | 101| 13.68| |5 | 85| 16.65| We want to attach each driver's `gender` to their rides. But `gender` lives in a different table... --- count: false # Inner JOIN: Table 2 **drivers** — has the `gender` column, linked by the same `driver_id`: ```sql SELECT driver_id, gender FROM drivers LIMIT 5; ``` Table: 5 records |driver_id |gender | |:---------|:------| |1 |M | |2 |M | |3 |M | |4 |M | |5 |M | The `driver_id` column appears in **both** tables — that's the link. The JOIN will match rows where they agree. --- count: false # Inner JOIN: The Result ```sql SELECT r.ride_id, r.driver_id, d.gender, r.fare_usd FROM rides r JOIN drivers d ON r.driver_id = d.driver_id -- swap to "LEFT JOIN" and rides with no matching driver stay (NULL gender) LIMIT 5; ``` Table: 5 records |ride_id | driver_id|gender | fare_usd| |:-------|---------:|:------|--------:| |1 | 429|F | 16.52| |2 | 83|M | 19.78| |3 | 535|M | 26.97| |4 | 101|M | 13.68| |5 | 85|M | 16.65| -- Each ride now has the driver's `gender` attached. The `ON r.driver_id = d.driver_id` told SQL which rows to match — same as `inner_join(rides, drivers, by = "driver_id")` in dplyr. Only rows where `driver_id` exists in **both** tables are kept. --- # LEFT JOIN: Table 1 **requests** — all ride requests, including ones that were rejected: ```sql SELECT request_id, rider_id, accepted FROM requests LIMIT 5; ``` Table: 5 records |request_id | rider_id| accepted| |:----------|--------:|--------:| |1 | 3632| 1| |2 | 912| 1| |3 | 1119| 1| |4 | 2529| 1| |5 | 7| 1| Some requests have `accepted = 0` — no ride was completed. What happens to those when we join? --- count: false # LEFT JOIN: Table 2 **rides** — only exists for requests that were actually accepted: ```sql SELECT request_id, ride_id, fare_usd FROM rides LIMIT 5; ``` Table: 5 records |request_id | ride_id| fare_usd| |:----------|-------:|--------:| |1 | 1| 16.52| |2 | 2| 19.78| |3 | 3| 26.97| |4 | 4| 13.68| |5 | 5| 16.65| Rejected requests have **no matching row** in `rides`. A LEFT JOIN keeps them anyway (with NULLs for the missing ride columns). --- count: false # LEFT JOIN: The Result ```sql SELECT req.request_id, req.accepted, r.fare_usd FROM requests req LEFT JOIN rides r ON req.request_id = r.request_id -- swap to "JOIN" (no LEFT) and rejected requests disappear LIMIT 5; ``` Table: 5 records |request_id | accepted| fare_usd| |:----------|--------:|--------:| |1 | 1| 16.52| |2 | 1| 19.78| |3 | 1| 26.97| |4 | 1| 13.68| |5 | 1| 16.65| -- **All** requests are kept (LEFT side). If a request was not accepted, there's no matching ride → `fare_usd` is **NULL**. This is the key difference from INNER JOIN: unmatched left-side rows stay in the result (with NULLs for the right side) instead of being dropped. --- # Anti-Join: "Who has NO match?" **The question:** which drivers have never completed a ride? In dplyr: `anti_join(drivers, rides, by = "driver_id")` — one line, done. -- SQL has **no `ANTI JOIN` keyword**. You fake it in two steps: 1. **LEFT JOIN** — keep all drivers; NULLs where no ride exists 2. **Filter for NULLs** — keep only the unmatched ones ```sql SELECT d.driver_id, d.gender FROM drivers d LEFT JOIN rides r ON d.driver_id = r.driver_id WHERE r.ride_id IS NULL LIMIT 5; ``` Table: 0 records | driver_id|gender | |---------:|:------| -- The `WHERE r.ride_id IS NULL` keeps only drivers whose LEFT JOIN produced no match — i.e., drivers with zero rides. That's the anti-join. --- name: self-join # Self-Join: "Compare to another row of the same table" **Q:** For each ride, what was the driver's *previous* ride? **A:** Join `rides` against itself, with a time inequality, then keep the most-recent earlier match. <pre style="background:#f5f5f5;padding:0.6em 0.9em;border-radius:4px;font-family:'Source Code Pro',Menlo,monospace;font-size:85%;line-height:1.4;margin:0;"> <span style="color:#ff7f00;"><b>③ SELECT</b> r1.ride_id, r1.driver_id, r1.started_at, <b>MAX</b>(r2.started_at) <b>AS</b> previous_ride</span> <span style="color:#1f78b4;"><b>① FROM</b> rides r1 <b>LEFT JOIN</b> rides r2 <b>ON</b> r1.driver_id = r2.driver_id <b>AND</b> r2.started_at < r1.started_at</span> <span style="color:#33a02c;"><b>② GROUP BY</b> r1.ride_id</span> <span style="color:#e31a1c;"><b>④ LIMIT</b> 5;</span> </pre> .small[ Table: 5 records |ride_id | driver_id|started_at |previous_ride | |:-------|---------:|:--------------------------|:--------------------------| |1 | 429|2025-08-19 00:07:40.051507 |2025-07-11 00:05:42.188794 | |2 | 83|2025-08-19 00:09:02.160463 |2025-08-17 00:07:54.489205 | |3 | 535|2025-01-04 00:05:37.120311 |2025-01-01 00:05:31.354864 | |4 | 101|2025-08-05 00:03:57.718588 |2025-07-25 00:07:44.473374 | |5 | 85|2025-06-13 00:03:00.10871 |2025-06-08 00:09:19.094344 | ] --- count: false # Self-Join: reading the aliases `r1` and `r2` are **two copies of the same table**, joined together. Treat them as two separate tables that just happen to share rows. | Alias | Role | |---|---| | `r1` | the **anchor** ride — the row we're computing "previous" for | | `r2` | a **candidate** previous ride — any earlier ride by the same driver | The two `ON` conditions narrow `r2` to *eligible* matches: 1. `r1.driver_id = r2.driver_id` — same driver 2. `r2.started_at < r1.started_at` — earlier in time Then `MAX(r2.started_at)` after `GROUP BY r1.ride_id` picks the **most recent** of those earlier rides — the immediate predecessor. A driver's *first* ride has no eligible `r2`, so `previous_ride` is `NULL` (LEFT JOIN preserves it). **Window-function alternative** (Module 5): `LAG(started_at) OVER (PARTITION BY driver_id ORDER BY started_at)` does the same thing in one expression — cheaper and clearer. Self-join is the *pre*-window-function pattern; it still shows up in interviews because not every dialect/version has window functions. --- name: multi-chain # Pattern: Multi-table chain **Q:** I need a column that lives several joins away. How do I get it? **A:** Chain joins one hop at a time. Numerals = hop order. <pre style="background:#f5f5f5;padding:0.6em 0.9em;border-radius:4px;font-family:'Source Code Pro',Menlo,monospace;font-size:90%;line-height:1.4;margin:0;"> <span style="color:#ff7f00;"><b>③ SELECT</b> c.name <b>AS</b> city, <b>COUNT</b>(*) <b>AS</b> n_rides, <b>ROUND</b>(<b>AVG</b>(r.fare_usd), 2) <b>AS</b> avg_fare</span> <span style="color:#1f78b4;"><b>① FROM</b> rides r <b>JOIN</b> requests req <b>ON</b> r.request_id = req.request_id <b>JOIN</b> neighborhoods n <b>ON</b> req.pickup_nbhd_id = n.nbhd_id <b>JOIN</b> cities c <b>ON</b> n.city_id = c.city_id</span> <span style="color:#33a02c;"><b>② GROUP BY</b> c.name</span> <span style="color:#e31a1c;"><b>④ ORDER BY</b> n_rides <b>DESC</b>;</span> </pre> .small[ Table: 5 records |city | n_rides| avg_fare| |:-------------|-------:|--------:| |Chicago | 7946| 12.03| |Boston | 7906| 12.04| |Austin | 7780| 12.03| |Seattle | 7700| 11.94| |San Francisco | 7668| 12.02| ] --- count: false # Pattern: Multi-table chain — reading the aliases The single-letter aliases are shorthand for table names: | Alias | Table | What it contributes | |---|---|---| | `r` | `rides` | fare, distance, timing | | `req` | `requests` | links rides to pickup neighborhood | | `n` | `neighborhoods` | links neighborhood to city | | `c` | `cities` | the city name we want | So `c.name` = "the `name` column from the `cities` table" — like `cities$name` in R. <details><summary>Click here to see the same query in R / dplyr</summary> <pre> rides |> left_join(requests, by = "request_id") |> left_join(neighborhoods, by = c("pickup_nbhd_id" = "nbhd_id")) |> left_join(cities, by = "city_id") |> group_by(city = name) |> summarise(n_rides = n(), avg_fare = round(mean(fare_usd), 2)) |> arrange(desc(n_rides)) </pre> </details> --- # Common Mistakes **1. Cartesian explosion.** Forget the `ON` clause and you get every row in A paired with every row in B. With 39k rides × 800 drivers = 31M rows. **Always check the join before running.** -- **2. Inner join silently dropping rows.** If 5% of rides have a `driver_id` that doesn't exist in `drivers`, an inner join drops them silently. The honest pattern is LEFT JOIN + check for NULLs. -- **3. Aggregating after a join.** Joining can multiply rows. Always check: one-to-one, one-to-many, or many-to-many? -- **4. Filtering on the right side of a LEFT JOIN.** Putting `WHERE r.fare_usd > 20` after a `LEFT JOIN rides r` silently turns it into an INNER JOIN — because it filters out the NULL rows. Put the filter in `ON` instead. --- class: inverse, center, middle # Interview Questions ### <a href="../module-01/slides.html#schema">Schema reference</a> --- # Q1. For each rider, how many unique drivers have they ridden with? <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: JOIN `rides` to `requests` (to get `rider_id`), then `COUNT(DISTINCT driver_id)`.* -- ```sql SELECT req.rider_id, COUNT(DISTINCT r.driver_id) AS n_unique_drivers, COUNT(*) AS n_rides FROM rides r JOIN requests req ON r.request_id = req.request_id GROUP BY req.rider_id ORDER BY n_unique_drivers DESC LIMIT 5; ``` Table: 5 records | rider_id| n_unique_drivers| n_rides| |--------:|----------------:|-------:| | 2340| 23| 23| | 1163| 21| 21| | 854| 21| 21| | 458| 21| 21| | 3085| 20| 20| --- # Q2. For each ride, how many days had the driver been on the platform? <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: JOIN `rides` to `drivers`, then `julianday(started_at) - julianday(signup_date)`.* -- .small[ ```sql SELECT r.ride_id, r.driver_id, ROUND(julianday(r.started_at) - julianday(d.signup_date)) AS tenure_days FROM rides r JOIN drivers d ON r.driver_id = d.driver_id LIMIT 5; ``` Table: 5 records |ride_id | driver_id| tenure_days| |:-------|---------:|-----------:| |1 | 429| 1034| |2 | 83| 1220| |3 | 535| -138| |4 | 101| 972| |5 | 85| 698| ] --- # Q3. Which rider-driver pairs have had 3 or more 5-star rides together? <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: JOIN `rides` to `requests`, filter for `rider_rating = 5`, GROUP BY the pair, HAVING >= 3.* -- ```sql SELECT req.rider_id, r.driver_id, COUNT(*) AS n_5star FROM rides r JOIN requests req ON r.request_id = req.request_id WHERE r.rider_rating = 5 GROUP BY req.rider_id, r.driver_id HAVING COUNT(*) >= 3 ORDER BY n_5star DESC LIMIT 5; ``` Table: 0 records | rider_id| driver_id|n_5star | |--------:|---------:|:-------| .small[ **Why `HAVING COUNT(*) >= 3` instead of `HAVING n_5star >= 3`?** Because `n_5star` is a column alias defined in SELECT, and HAVING runs *before* SELECT. You have to repeat the expression. (SQLite/MySQL are lenient about this; PostgreSQL is not — use the safe version.) ] --- class: inverse # The Key Takeaways <br> ### 1. INNER, LEFT, anti-join — you'll use these three constantly. RIGHT and FULL OUTER, almost never. -- <br> ### 2. The most common bug: filtering the right side of a LEFT JOIN in WHERE. Always put it in ON. -- <br> ### 3. Build multi-table joins one step at a time. Don't try to write 5-table joins from scratch. --- class: sample-code name: sample # Sample Code: Rides & Riders Per City/Gender ```sql -- INNER JOIN + LEFT JOIN + anti-join + multi-table chain SELECT c.name AS city, d.gender, COUNT(*) AS n_rides, COUNT(DISTINCT req.rider_id) AS n_riders FROM rides r JOIN drivers d ON r.driver_id = d.driver_id -- inner JOIN requests req ON r.request_id = req.request_id -- inner JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id -- chain JOIN cities c ON n.city_id = c.city_id -- chain WHERE d.driver_id NOT IN ( -- anti-join SELECT driver_id FROM rides WHERE fare_usd < 5) GROUP BY c.name, d.gender ORDER BY c.name, n_rides DESC; ``` --- # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr> <tr><td>2</td><td>JOINs <i>(just finished)</i></td><td>✓ done</td></tr> <tr><td><b>3</b></td><td><b>GROUP BY, HAVING, Conversion Funnels</b></td><td>next</td><td></td></tr> <tr><td>4</td><td>Subqueries and CTEs</td><td>upcoming</td><td></td></tr> <tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr> </table>