class: center, middle, inverse, title-slide .title[ # Module 4: Subqueries and CTEs ] .subtitle[ ## Organizing multi-step queries ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .write-ref { position: absolute; bottom: 12px; right: 90px; background: #f0f8ff; border: 1px solid #aac; border-radius: 4px; padding: 1px 6px; font-size: 11px; text-decoration: none; color: #336; font-weight: bold; z-index: 100; } .write-ref:hover { background: #ddeeff; text-decoration: none; } .sample-code h1 { font-size: 1.2em; margin-bottom: 0.2em; } .sample-code .remark-code { font-size: 110%; line-height: 1.35; } .small { font-size: 80%; } </style> # Course Map <a href="../module-01/slides.html#schema" style="font-size:13px;background:#f0f8ff;border:1px solid #aac;border-radius:4px;padding:2px 8px;text-decoration:none;color:#336;">Schema ref</a> <table> <tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">JOINs</a></td><td>✓ done</td><td><a href="../module-02/slides.html#sample">M2</a></td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">GROUP BY, HAVING, Conversion Funnels</a></td><td>✓ done</td><td><a href="../module-03/slides.html#sample">M3</a></td></tr> <tr><td><b>4</b></td><td><b>Subqueries and CTEs</b> <i>(you are here)</i></td><td>← current</td><td><a href="#sample">M4</a></td></tr> <tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr> </table> --- # Why You Need Them Once a query gets past ~5 lines, you need a way to organize intermediate results. SQL gives you two: -- - **Subqueries** — a query inside another query - **Common Table Expressions (CTEs)** — named subqueries, top-to-bottom -- The database engine converts both into the same internal strategy (called an "execution plan") — so **performance is the same**. Choose between them for readability, not speed. **Default to CTEs.** --- # Three Subquery Flavors **Scalar:** returns one value, used like a constant. ```sql SELECT * FROM rides WHERE fare_usd > (SELECT AVG(fare_usd) FROM rides); ``` The inner `(SELECT AVG(...) FROM rides)` runs first, returns one number (e.g. 12.03), then the outer query uses it: `WHERE fare_usd > 12.03`. In R: `filter(fare_usd > mean(fare_usd))` — SQL can't use `mean()` inline, so you wrap it in a subquery. --- # Subquery Flavor 2: Table Subquery The inner query produces a temporary table; the outer query uses it as its data source. ```sql SELECT * FROM ( SELECT driver_id, COUNT(*) AS n FROM rides GROUP BY driver_id ) t WHERE t.n > 100; ``` -- Reading inside-out: 1. **Inner** (in parentheses): `SELECT driver_id, COUNT(*) ... GROUP BY driver_id` → produces a table of driver_id + ride count 2. **`t`**: names that temporary table so the outer query can reference it 3. **Outer**: `SELECT * FROM t WHERE t.n > 100` → filters for 100+ rides -- In R: `rides |> count(driver_id) |> filter(n > 100)`. SQL can't chain like that, so you nest instead. --- # Subquery Flavor 3: Correlated Subquery The inner query **references the outer query** — it runs once per outer row. Convenient but slow. ```sql SELECT d.driver_id, d.gender, (SELECT COUNT(*) FROM rides r WHERE r.driver_id = d.driver_id) AS n_rides FROM drivers d; ``` -- What happens under the hood — the database loops through drivers one by one: ``` driver 1 → SELECT COUNT(*) FROM rides WHERE driver_id = 1 → 47 driver 2 → SELECT COUNT(*) FROM rides WHERE driver_id = 2 → 112 driver 3 → SELECT COUNT(*) FROM rides WHERE driver_id = 3 → 0 ... (800 separate queries, one per driver) ``` -- The **faster way** — JOIN + GROUP BY does it in one pass: ```sql SELECT d.driver_id, d.gender, COUNT(r.ride_id) AS n_rides FROM drivers d LEFT JOIN rides r ON d.driver_id = r.driver_id GROUP BY d.driver_id, d.gender; ``` Same result, but the database scans `rides` once instead of 800 times. --- # CTEs ```sql WITH driver_rides AS ( SELECT driver_id, COUNT(*) AS n_rides FROM rides GROUP BY driver_id ) SELECT * FROM driver_rides WHERE n_rides > 100; ``` -- Multiple CTEs: ```sql WITH cte1 AS (...), cte2 AS (...), cte3 AS (...) SELECT * FROM cte3; ``` -- Each CTE = one logical step. Read top-to-bottom like a script. --- # Q: For each rider, find the date of their first ride and the city it was in. <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: find each rider's earliest ride time with `MIN()`, then re-join to get the city.* -- **Step 1.** CTE: find the earliest `started_at` per rider: ```sql WITH first_ride AS ( SELECT req.rider_id, MIN(r.started_at) AS first_started_at FROM rides r JOIN requests req ON r.request_id = req.request_id GROUP BY req.rider_id ) ``` -- **Step 2.** Re-join back through `rides → requests` to get the city — matching on **both** `rider_id` and `started_at` so you don't accidentally grab another rider's ride at the same timestamp: ```sql SELECT fr.rider_id, fr.first_started_at, c.name AS city FROM first_ride fr JOIN requests req ON req.rider_id = fr.rider_id JOIN rides r ON r.request_id = req.request_id AND r.started_at = fr.first_started_at JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id JOIN cities c ON n.city_id = c.city_id; ``` --- count: false # Q: First ride per rider — full query + result <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> .small[ ```sql WITH first_ride AS ( SELECT req.rider_id, MIN(r.started_at) AS first_started_at FROM rides r JOIN requests req ON r.request_id = req.request_id GROUP BY req.rider_id ) SELECT fr.rider_id, fr.first_started_at, c.name AS city FROM first_ride fr JOIN requests req ON req.rider_id = fr.rider_id JOIN rides r ON r.request_id = req.request_id AND r.started_at = fr.first_started_at JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id JOIN cities c ON n.city_id = c.city_id LIMIT 3; ``` Table: 3 records |rider_id |first_started_at |city | |:--------|:--------------------------|:-------| |1 |2025-01-02 00:02:41.797299 |Seattle | |2 |2025-02-24 00:06:54.242272 |Chicago | |3 |2025-01-01 00:06:35.287335 |Seattle | ] The CTE finds the extremum; the outer query re-joins to attach the rest of the row. In Module 5 you'll see `ROW_NUMBER` — a one-liner that replaces this whole pattern. --- # Q: Build a December funnel (request → ride → 5-star) using a chain of CTEs. <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> A **funnel** counts how many users survive each step: here, how many December requests became completed rides, and how many of those got a 5-star rating. -- .small[ ```sql WITH dec_requests AS ( SELECT request_id, rider_id FROM requests WHERE date(requested_at) BETWEEN '2025-12-01' AND '2025-12-31' ), dec_rides AS ( SELECT dr.*, r.ride_id, r.rider_rating -- dr.* = all columns from dr FROM dec_requests dr LEFT JOIN rides r ON dr.request_id = r.request_id ), funnel AS ( SELECT COUNT(*) AS n_requests, SUM(CASE WHEN ride_id IS NOT NULL THEN 1 ELSE 0 END) AS n_completed, SUM(CASE WHEN rider_rating = 5 THEN 1 ELSE 0 END) AS n_5star FROM dec_rides ) SELECT * FROM funnel; ``` ] --- count: false # Q: December funnel — result <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> ```sql WITH dec_requests AS ( SELECT request_id, rider_id FROM requests WHERE date(requested_at) BETWEEN '2025-12-01' AND '2025-12-31' ), dec_rides AS ( SELECT dr.*, r.ride_id, r.rider_rating FROM dec_requests dr LEFT JOIN rides r ON dr.request_id = r.request_id ), funnel AS ( SELECT COUNT(*) AS n_requests, SUM(CASE WHEN ride_id IS NOT NULL THEN 1 ELSE 0 END) AS n_completed, SUM(CASE WHEN rider_rating = 5 THEN 1 ELSE 0 END) AS n_5star FROM dec_rides ) SELECT * FROM funnel; ``` Table: 1 records | n_requests| n_completed| n_5star| |----------:|-----------:|-------:| | 4259| 3293| 1986| Why `ride_id IS NOT NULL`? After a LEFT JOIN, unmatched rows get NULLs in the right-side columns. So `ride_id IS NULL` = the request was never completed. Counting non-NULLs = completed rides. --- # Q: Find riders active in each of the last 4 weeks of December. <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: collapse to rider × week, then `HAVING COUNT(DISTINCT week) >= 4`.* -- .small[ ```sql WITH active AS ( SELECT DISTINCT rider_id, strftime('%Y-%W', requested_at) AS week FROM requests WHERE date(requested_at) BETWEEN '2025-12-01' AND '2025-12-31' ) SELECT rider_id, COUNT(DISTINCT week) AS weeks_active FROM active GROUP BY rider_id HAVING COUNT(DISTINCT week) >= 4 LIMIT 3; ``` Table: 3 records | rider_id| weeks_active| |--------:|------------:| | 11| 4| | 16| 5| | 58| 4| ] --- # CTE vs Subquery | Situation | Use | |---|---| | Reference the same intermediate twice | **CTE** | | ≥ 3 logical steps | **CTE** | | Need a single value in WHERE | **Scalar subquery** | | One-line intermediate | **Subquery** | | Need recursion | **Recursive CTE** | -- In modern SQL, **default to CTEs**. --- # Common Mistakes (1/3): Forgetting to JOIN back The CTE has `rider_id` and `first_started_at`, but you also wanted the city — you need a second join. .small[ **Wrong** — city isn't in the CTE: ```sql WITH first_ride AS ( SELECT rider_id, MIN(started_at) AS first_started_at FROM rides r JOIN requests req ON r.request_id = req.request_id GROUP BY rider_id ) SELECT rider_id, first_started_at, city FROM first_ride; -- ERROR: no column "city" ``` ] -- .small[ **Right** — re-join to get the city: ```sql SELECT fr.rider_id, fr.first_started_at, c.name AS city FROM first_ride fr JOIN requests req ON req.rider_id = fr.rider_id JOIN rides r ON r.request_id = req.request_id AND r.started_at = fr.first_started_at JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id JOIN cities c ON n.city_id = c.city_id; ``` ] --- # Common Mistakes (2/3): Naming collisions `WITH rides AS (...)` shadows the real `rides` table — the outer query can no longer see it. .small[ **Wrong** — CTE name collides with the table: ```sql WITH rides AS ( SELECT driver_id, COUNT(*) AS n FROM rides GROUP BY driver_id ) SELECT * FROM rides r JOIN rides orig ON ...; -- "rides" is the CTE, not the table! ``` ] -- .small[ **Right** — use a distinct name: ```sql WITH driver_rides AS ( SELECT driver_id, COUNT(*) AS n FROM rides GROUP BY driver_id ) SELECT * FROM driver_rides dr JOIN rides r ON ...; -- no ambiguity ``` ] --- class: inverse # The Key Takeaways <br> ### 1. CTEs let you write 30-line queries as a chain of named steps. Default to them. -- <br> ### 2. The first/last pattern: a CTE finds the extremum, then a join attaches the rest of the row. -- <br> ### 3. Module 5 (window functions) replaces many of these patterns with one-liners — but the CTE approach works in any dialect. --- class: sample-code name: sample # Sample Code: Top 3 Rides for High-Earning Drivers ```sql -- Scalar subquery + table subquery + CTE chain + first-per-group WITH high_earners AS ( -- CTE 1 SELECT driver_id, SUM(fare_usd) AS total FROM rides GROUP BY driver_id HAVING SUM(fare_usd) > (SELECT AVG(fare_usd) -- scalar subquery * 100 FROM rides) ), ranked AS ( -- CTE 2 SELECT h.driver_id, r.ride_id, r.fare_usd, r.started_at, ROW_NUMBER() OVER (PARTITION BY h.driver_id ORDER BY r.fare_usd DESC) AS rk FROM high_earners h JOIN rides r ON h.driver_id = r.driver_id ) SELECT driver_id, ride_id, fare_usd, started_at FROM ranked WHERE rk <= 3; -- top 3 per driver ``` --- # Course Map <a href="../module-01/slides.html#schema" style="font-size:13px;background:#f0f8ff;border:1px solid #aac;border-radius:4px;padding:2px 8px;text-decoration:none;color:#336;">Schema ref</a> <table> <tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">JOINs</a></td><td>✓ done</td><td><a href="../module-02/slides.html#sample">M2</a></td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">GROUP BY, HAVING, Conversion Funnels</a></td><td>✓ done</td><td><a href="../module-03/slides.html#sample">M3</a></td></tr> <tr><td>4</td><td>Subqueries and CTEs <i>(just finished)</i></td><td>✓ done</td></tr> <tr><td><b>5</b></td><td><b>Window Functions</b></td><td>next</td><td></td></tr> </table>