class: center, middle, inverse, title-slide .title[ # Module 3: GROUP BY, HAVING, Conversion Funnels ] .subtitle[ ## The patterns you’ll write over and over ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .write-ref { position: absolute; bottom: 12px; right: 90px; background: #f0f8ff; border: 1px solid #aac; border-radius: 4px; padding: 1px 6px; font-size: 11px; text-decoration: none; color: #336; font-weight: bold; z-index: 100; } .write-ref:hover { background: #ddeeff; text-decoration: none; } .sample-code h1 { font-size: 1.2em; margin-bottom: 0.2em; } .sample-code .remark-code { font-size: 110%; line-height: 1.35; } .small { font-size: 80%; } </style> # Course Map <a href="../module-01/slides.html#schema" style="font-size:13px;background:#f0f8ff;border:1px solid #aac;border-radius:4px;padding:2px 8px;text-decoration:none;color:#336;">Schema ref</a> <table> <tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">JOINs</a></td><td>✓ done</td><td><a href="../module-02/slides.html#sample">M2</a></td></tr> <tr><td><b>3</b></td><td><b>GROUP BY, HAVING, Conversion Funnels</b> <i>(you are here)</i></td><td>← current</td><td><a href="#sample">M3</a></td></tr> <tr><td>4</td><td>Subqueries and CTEs</td><td>upcoming</td><td></td></tr> <tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr> </table> --- # Six Patterns to Memorize 1. **City-level dashboard** (one row per city, many metrics) 2. **Compare-to-global** (HAVING with a subquery) 3. **Conversion funnel** (CASE WHEN counts at each step) 4. **Per-group ranking** (correlated subquery; window function later) 5. **Cohort retention** (signup-month × activity-month) 6. **Bottom-N within group** If you can write all six from memory, you're past the bar for any data job interview. --- name: write-order # How to Write SQL (for R Users) In dplyr you write **top-to-bottom in the order things happen**: data → join → group → summarise → sort. SQL's written order is **scrambled** — you write the output (`SELECT`) first, even though it runs last. This is confusing. -- **Trick: write SQL from the middle out.** | Step | What to write | Where it goes | |---|---|---| | 1. | `FROM rides r` | Middle | | 2. | `JOIN requests req ON ...` | Middle | | 3. | `GROUP BY c.name` | Below the joins | | 4. | Now go to the top: `SELECT c.name, COUNT(*)` | Top | | 5. | `ORDER BY ... LIMIT ...` | Bottom | -- Start with the data source (FROM), add joins, add GROUP BY. **Only then** go back up and write SELECT — by that point you know what columns are available. Feels weird for a week, then it's automatic. --- # P1: Build a city-level dashboard with rides, drivers, avg fare, and revenue. <a class="write-ref" href="#write-order">WO</a> **Step 1.** Start from the `rides` table — that's where fares live. ```sql FROM rides r ``` -- **Step 2.** We need city names, but `rides` doesn't have them. Chain through 3 joins: ```sql FROM rides r JOIN requests req ON r.request_id = req.request_id -- get pickup neighborhood JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id -- get city_id JOIN cities c ON n.city_id = c.city_id -- get city name ``` -- **Step 3.** Now we have all the columns. Group by city, compute the aggregates: ```sql SELECT c.name AS city, COUNT(DISTINCT r.driver_id) AS drivers, -- unique drivers COUNT(*) AS n_rides, -- total rides ROUND(AVG(r.fare_usd), 2) AS avg_fare, -- average fare ROUND(SUM(r.fare_usd), 0) AS revenue -- total revenue ... GROUP BY c.name ORDER BY revenue DESC; ``` --- count: false # P1: City-level dashboard — full query + result <a class="write-ref" href="#write-order">WO</a> .small[ ```sql SELECT c.name AS city, COUNT(DISTINCT r.driver_id) AS drivers, COUNT(*) AS n_rides, ROUND(AVG(r.fare_usd), 2) AS avg_fare, ROUND(SUM(r.fare_usd), 0) AS revenue FROM rides r JOIN requests req ON r.request_id = req.request_id JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id JOIN cities c ON n.city_id = c.city_id GROUP BY c.name ORDER BY revenue DESC; ``` Table: 5 records |city | drivers| n_rides| avg_fare| revenue| |:-------------|-------:|-------:|--------:|-------:| |Chicago | 800| 7946| 12.03| 95614| |Boston | 800| 7906| 12.04| 95211| |Austin | 800| 7780| 12.03| 93611| |San Francisco | 800| 7668| 12.02| 92140| |Seattle | 800| 7700| 11.94| 91929| ] --- # P2: Which cities have an average fare above the global average? <a class="write-ref" href="#write-order">WO</a> Let's build this step by step. -- **Step 1.** What's the global average fare? A standalone query: ```sql SELECT ROUND(AVG(fare_usd), 2) FROM rides; -- returns a single number, e.g., 12.02 ``` -- **Step 2.** Compute the average fare **per city** (same JOIN chain as Pattern 1): ```sql SELECT c.name AS city, ROUND(AVG(r.fare_usd), 2) AS avg_fare FROM rides r JOIN requests req ON r.request_id = req.request_id JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id JOIN cities c ON n.city_id = c.city_id GROUP BY c.name; ``` --- count: false # P2: Which cities have an average fare above the global average? <a class="write-ref" href="#write-order">WO</a> **Step 3.** Keep only cities whose avg fare beats the global avg. Paste Step 1's query inside `HAVING`: ```sql HAVING AVG(r.fare_usd) > (SELECT AVG(fare_usd) FROM rides) ``` The `(SELECT AVG(fare_usd) FROM rides)` is a **scalar subquery** — a mini-query that returns one number, used here like a constant. Note it says `fare_usd`, not `r.fare_usd`. The subquery is **self-contained** — it has its own `FROM rides` and doesn't know about the outer query's `r` alias. It scans the whole table independently and returns one number. --- count: false # P2: Compare to global — full query + result <a class="write-ref" href="#write-order">WO</a> .pull-left[ .small[ ```sql SELECT c.name AS city, ROUND(AVG(r.fare_usd), 2) AS avg_fare, (SELECT ROUND(AVG(fare_usd), 2) FROM rides) AS global_avg FROM rides r JOIN requests req ON r.request_id = req.request_id JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id JOIN cities c ON n.city_id = c.city_id GROUP BY c.name HAVING AVG(r.fare_usd) > (SELECT AVG(fare_usd) FROM rides) ORDER BY avg_fare DESC; ``` ] ] .pull-right[ Table: 4 records |city | avg_fare| global_avg| |:-------------|--------:|----------:| |Boston | 12.04| 12.01| |Chicago | 12.03| 12.01| |Austin | 12.03| 12.01| |San Francisco | 12.02| 12.01| Only cities with `avg_fare > global_avg` survive the `HAVING` filter. ] --- # P3: Build a funnel — how many requests become rides, and how many get 5 stars? <a class="write-ref" href="#write-order">WO</a> **Step 1.** Start with `requests` and LEFT JOIN `rides` — LEFT because not every request becomes a ride: ```sql FROM requests req LEFT JOIN rides r ON req.request_id = r.request_id ``` -- **Step 2.** Count everything (= total requests): ```sql COUNT(*) AS step1_requests ``` -- **Step 3.** Count only the rows where a ride actually happened. The trick: `CASE WHEN condition THEN 1 ELSE 0 END` inside `SUM()` — this counts rows that meet the condition (like `sum(x == TRUE)` in R): ```sql SUM(CASE WHEN req.accepted = 1 THEN 1 ELSE 0 END) AS step2_accepted SUM(CASE WHEN r.ride_id IS NOT NULL THEN 1 ELSE 0 END) AS step3_completed SUM(CASE WHEN r.rider_rating = 5 THEN 1 ELSE 0 END) AS step4_5star ``` --- count: false # P3: Funnel — full query + result <a class="write-ref" href="#write-order">WO</a> .small[ ```sql SELECT COUNT(*) AS step1_requests, SUM(CASE WHEN req.accepted = 1 THEN 1 ELSE 0 END) AS step2_accepted, SUM(CASE WHEN r.ride_id IS NOT NULL THEN 1 ELSE 0 END) AS step3_completed, SUM(CASE WHEN r.rider_rating = 5 THEN 1 ELSE 0 END) AS step4_5star, ROUND(1.0 * SUM(CASE WHEN r.rider_rating = 5 THEN 1 ELSE 0 END) / COUNT(*), 3) AS overall_conv FROM requests req LEFT JOIN rides r ON req.request_id = r.request_id; ``` Table: 1 records | step1_requests| step2_accepted| step3_completed| step4_5star| overall_conv| |--------------:|--------------:|---------------:|-----------:|------------:| | 50000| 39000| 39000| 23536| 0.471| ] Read each column as: "of all the rows, how many passed this filter?" The numbers should decrease at each step — that's the funnel. **Why `1.0 *`?** Without it, SQLite does integer division: `3 / 10 = 0` (truncated). Multiplying by `1.0` forces float math: `1.0 * 3 / 10 = 0.3`. In R this never comes up because R uses doubles by default. **Why not `step4_5star / step1_requests`?** You can't reference column aliases within the same SELECT — they don't exist yet (SELECT runs as one step). You have to repeat the full expressions, or use a CTE to compute the columns first and divide in the outer query. --- # P4: For each signup-month cohort, how many riders were active in each subsequent month? <a class="write-ref" href="#write-order">WO</a> A **CTE** lets you name a temporary result. Read `WITH cohort AS (...)` as **"let `cohort` = this query"** — like `cohort <- riders |> ...` in R. (Covered fully in Module 4.) **Step 1.** CTE for each rider's cohort (signup month): ```sql WITH cohort AS ( SELECT rider_id, strftime('%Y-%m', signup_date) AS cohort_month FROM riders ) ``` -- **Step 2.** CTE for each rider's activity months (comma-separated from Step 1): ```sql activity AS ( SELECT DISTINCT rider_id, strftime('%Y-%m', requested_at) AS activity_month FROM requests ) ``` --- count: false # P4: Cohort retention — Step 3 <a class="write-ref" href="#write-order">WO</a> **Step 3.** Join the two CTEs and count distinct riders per (cohort, month): ```sql SELECT c.cohort_month, a.activity_month, COUNT(DISTINCT c.rider_id) AS active_riders FROM cohort c JOIN activity a ON c.rider_id = a.rider_id GROUP BY c.cohort_month, a.activity_month ``` --- count: false # P4: Cohort retention — full query + result <a class="write-ref" href="#write-order">WO</a> .small[ ```sql WITH cohort AS ( SELECT rider_id, strftime('%Y-%m', signup_date) AS cohort_month FROM riders ), activity AS ( SELECT DISTINCT rider_id, strftime('%Y-%m', requested_at) AS activity_month FROM requests ) SELECT c.cohort_month, a.activity_month, COUNT(DISTINCT c.rider_id) AS active_riders FROM cohort c JOIN activity a ON c.rider_id = a.rider_id GROUP BY c.cohort_month, a.activity_month ORDER BY c.cohort_month, a.activity_month LIMIT 6; ``` Table: 6 records |cohort_month |activity_month | active_riders| |:------------|:--------------|-------------:| |2022-01 |2025-01 | 63| |2022-01 |2025-02 | 59| |2022-01 |2025-03 | 57| |2022-01 |2025-04 | 55| |2022-01 |2025-05 | 65| |2022-01 |2025-06 | 56| ] Each row = one (cohort, month) pair. To see retention *rates*, divide `active_riders` by the cohort's total size. --- # Common Mistakes **1. Selecting non-aggregated columns without grouping.** ```sql -- WRONG: which driver_id does it pick? Arbitrary! SELECT city, driver_id, COUNT(*) AS n_rides FROM rides GROUP BY city; -- RIGHT: every non-aggregate column must be in GROUP BY SELECT city, driver_id, COUNT(*) AS n_rides FROM rides GROUP BY city, driver_id; ``` Strict SQL (Postgres) errors out. SQLite silently picks an arbitrary `driver_id` per city — which is worse because you get a result that *looks* fine but is wrong. -- **2. `COUNT` vs `COUNT(DISTINCT)`.** `COUNT(*)` counts rows. `COUNT(DISTINCT col)` counts unique values. Always be deliberate. -- **3. Counting "users" after a JOIN.** Joins multiply rows. `COUNT(rider_id)` after a join counts *rides*, not riders. Use `COUNT(DISTINCT rider_id)`. -- **4. Filtering aggregates in `WHERE`.** Doesn't work. Use `HAVING`. ```sql -- WRONG: aggregates don't exist in WHERE (runs before GROUP BY) SELECT city, COUNT(*) AS n FROM rides WHERE COUNT(*) > 1000 GROUP BY city; -- RIGHT: HAVING runs after GROUP BY, so aggregates exist SELECT city, COUNT(*) AS n FROM rides GROUP BY city HAVING COUNT(*) > 1000; ``` --- class: inverse # The Key Takeaways <br> ### 1. The six patterns above cover almost every analytical query you'll write. -- <br> ### 2. WHERE filters rows; HAVING filters groups. Aggregates can only live in HAVING. -- <br> ### 3. CASE WHEN inside an aggregate is the SQL idiom for `mean(condition)`. Memorize it. --- class: sample-code name: sample # Sample Code: Cohort Retention with Accept Rates .small[Dashboard + compare-to-global + funnel + cohort (all 4 patterns)] ```sql WITH cohort AS ( SELECT rider_id, strftime('%Y-%m', signup_date) AS cohort_month FROM riders ), monthly_activity AS ( SELECT req.rider_id, strftime('%Y-%m', req.requested_at) AS month, SUM(CASE WHEN req.accepted = 1 THEN 1 ELSE 0 END) AS n_accepted, COUNT(*) AS n_requests FROM requests req GROUP BY req.rider_id, month ) SELECT c.cohort_month, a.month, COUNT(DISTINCT c.rider_id) AS active_riders, ROUND(1.0 * SUM(a.n_accepted) / SUM(a.n_requests), 3) AS avg_accept_rate FROM cohort c JOIN monthly_activity a ON c.rider_id = a.rider_id GROUP BY c.cohort_month, a.month; ``` --- # Course Map <a href="../module-01/slides.html#schema" style="font-size:13px;background:#f0f8ff;border:1px solid #aac;border-radius:4px;padding:2px 8px;text-decoration:none;color:#336;">Schema ref</a> <table> <tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">JOINs</a></td><td>✓ done</td><td><a href="../module-02/slides.html#sample">M2</a></td></tr> <tr><td>3</td><td>GROUP BY, HAVING, Conversion Funnels <i>(just finished)</i></td><td>✓ done</td></tr> <tr><td><b>4</b></td><td><b>Subqueries and CTEs</b></td><td>next</td><td></td></tr> <tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr> </table>