Module 3: GROUP BY, HAVING, Conversion Funnels

class: center, middle, inverse, title-slide

.title[
# Module 3: GROUP BY, HAVING, Conversion Funnels
]
.subtitle[
## The patterns you’ll write over and over
]

---

# Course Map

<a href="../module-01/slides.html#schema" style="font-size:13px;background:#f0f8ff;border:1px solid #aac;border-radius:4px;padding:2px 8px;text-decoration:none;color:#336;">Schema ref</a>

<table>
<tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr>
<tr><td>2</td><td><a href="../module-02/slides.html">JOINs</a></td><td>✓ done</td><td><a href="../module-02/slides.html#sample">M2</a></td></tr>
<tr><td><b>3</b></td><td><b>GROUP BY, HAVING, Conversion Funnels</b> <i>(you are here)</i></td><td>← current</td><td><a href="#sample">M3</a></td></tr>
<tr><td>4</td><td>Subqueries and CTEs</td><td>upcoming</td><td></td></tr>
<tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr>
</table>

---

# Six Patterns to Memorize

1. **City-level dashboard** (one row per city, many metrics)
2. **Compare-to-global** (HAVING with a subquery)
3. **Conversion funnel** (CASE WHEN counts at each step)
4. **Per-group ranking** (correlated subquery; window function later)
5. **Cohort retention** (signup-month × activity-month)
6. **Bottom-N within group**

If you can write all six from memory, you're past the bar for any data job interview.

---
name: write-order

# How to Write SQL (for R Users)

In dplyr you write **top-to-bottom in the order things happen**: data → join → group → summarise → sort.

SQL's written order is **scrambled** — you write the output (`SELECT`) first, even though it runs last. This is confusing.

**Trick: write SQL from the middle out.**

| Step | What to write | Where it goes |
|---|---|---|
| 1. | `FROM rides r` | Middle |
| 2. | `JOIN requests req ON ...` | Middle |
| 3. | `GROUP BY c.name` | Below the joins |
| 4. | Now go to the top: `SELECT c.name, COUNT(*)` | Top |
| 5. | `ORDER BY ... LIMIT ...` | Bottom |

Start with the data source (FROM), add joins, add GROUP BY. **Only then** go back up and write SELECT — by that point you know what columns are available. Feels weird for a week, then it's automatic.

---

# P1: Build a city-level dashboard with rides, drivers, avg fare, and revenue.

<a class="write-ref" href="#write-order">WO</a>
**Step 1.** Start from the `rides` table — that's where fares live.

```sql
FROM rides r
```

**Step 2.** We need city names, but `rides` doesn't have them. Chain through 3 joins:

```sql
FROM rides r
JOIN requests req ON r.request_id = req.request_id     -- get pickup neighborhood
JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id -- get city_id
JOIN cities c ON n.city_id = c.city_id                  -- get city name
```

**Step 3.** Now we have all the columns. Group by city, compute the aggregates:

```sql
SELECT c.name AS city,
       COUNT(DISTINCT r.driver_id) AS drivers,  -- unique drivers
       COUNT(*) AS n_rides,                      -- total rides
       ROUND(AVG(r.fare_usd), 2) AS avg_fare,   -- average fare
       ROUND(SUM(r.fare_usd), 0) AS revenue      -- total revenue
...
GROUP BY c.name ORDER BY revenue DESC;
```

---
count: false

# P1: City-level dashboard — full query + result

<a class="write-ref" href="#write-order">WO</a>
.small[

```sql
SELECT c.name AS city, COUNT(DISTINCT r.driver_id) AS drivers,
       COUNT(*) AS n_rides, ROUND(AVG(r.fare_usd), 2) AS avg_fare,
       ROUND(SUM(r.fare_usd), 0) AS revenue
FROM rides r
JOIN requests req ON r.request_id = req.request_id
JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id
JOIN cities c ON n.city_id = c.city_id
GROUP BY c.name ORDER BY revenue DESC;
```

Table: 5 records

|city          | drivers| n_rides| avg_fare| revenue|
|:-------------|-------:|-------:|--------:|-------:|
|Chicago       |     800|    7946|    12.03|   95614|
|Boston        |     800|    7906|    12.04|   95211|
|Austin        |     800|    7780|    12.03|   93611|
|San Francisco |     800|    7668|    12.02|   92140|
|Seattle       |     800|    7700|    11.94|   91929|
]

---

# P2: Which cities have an average fare above the global average?

<a class="write-ref" href="#write-order">WO</a>
Let's build this step by step.

**Step 1.** What's the global average fare? A standalone query:

```sql
SELECT ROUND(AVG(fare_usd), 2) FROM rides;
-- returns a single number, e.g., 12.02
```

**Step 2.** Compute the average fare **per city** (same JOIN chain as Pattern 1):

```sql
SELECT c.name AS city, ROUND(AVG(r.fare_usd), 2) AS avg_fare
FROM rides r
JOIN requests req ON r.request_id = req.request_id
JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id
JOIN cities c ON n.city_id = c.city_id
GROUP BY c.name;
```

---
count: false

# P2: Which cities have an average fare above the global average?

<a class="write-ref" href="#write-order">WO</a>
**Step 3.** Keep only cities whose avg fare beats the global avg. Paste Step 1's query inside `HAVING`:

```sql
HAVING AVG(r.fare_usd) > (SELECT AVG(fare_usd) FROM rides)
```

The `(SELECT AVG(fare_usd) FROM rides)` is a **scalar subquery** — a mini-query that returns one number, used here like a constant.

Note it says `fare_usd`, not `r.fare_usd`. The subquery is **self-contained** — it has its own `FROM rides` and doesn't know about the outer query's `r` alias. It scans the whole table independently and returns one number.

---
count: false

# P2: Compare to global — full query + result

<a class="write-ref" href="#write-order">WO</a>
.pull-left[
.small[
```sql
SELECT c.name AS city,
  ROUND(AVG(r.fare_usd), 2) AS avg_fare,
  (SELECT ROUND(AVG(fare_usd), 2)
   FROM rides) AS global_avg
FROM rides r
JOIN requests req
  ON r.request_id = req.request_id
JOIN neighborhoods n
  ON req.pickup_nbhd_id = n.nbhd_id
JOIN cities c
  ON n.city_id = c.city_id
GROUP BY c.name
HAVING AVG(r.fare_usd) >
  (SELECT AVG(fare_usd) FROM rides)
ORDER BY avg_fare DESC;
```
]
]

.pull-right[

Table: 4 records

|city          | avg_fare| global_avg|
|:-------------|--------:|----------:|
|Boston        |    12.04|      12.01|
|Chicago       |    12.03|      12.01|
|Austin        |    12.03|      12.01|
|San Francisco |    12.02|      12.01|

Only cities with `avg_fare > global_avg` survive the `HAVING` filter.
]

---

# P3: Build a funnel — how many requests become rides, and how many get 5 stars?

**Step 1.** Start with `requests` and LEFT JOIN `rides` — LEFT because not every request becomes a ride:

```sql
FROM requests req
LEFT JOIN rides r ON req.request_id = r.request_id
```

**Step 2.** Count everything (= total requests):

```sql
COUNT(*) AS step1_requests
```

**Step 3.** Count only the rows where a ride actually happened. The trick: `CASE WHEN condition THEN 1 ELSE 0 END` inside `SUM()` — this counts rows that meet the condition (like `sum(x == TRUE)` in R):

```sql
SUM(CASE WHEN req.accepted = 1 THEN 1 ELSE 0 END) AS step2_accepted
SUM(CASE WHEN r.ride_id IS NOT NULL THEN 1 ELSE 0 END) AS step3_completed
SUM(CASE WHEN r.rider_rating = 5 THEN 1 ELSE 0 END) AS step4_5star
```

---
count: false

# P3: Funnel — full query + result

.small[

```sql
SELECT
  COUNT(*)                                                AS step1_requests,
  SUM(CASE WHEN req.accepted = 1 THEN 1 ELSE 0 END)       AS step2_accepted,
  SUM(CASE WHEN r.ride_id IS NOT NULL THEN 1 ELSE 0 END)  AS step3_completed,
  SUM(CASE WHEN r.rider_rating = 5 THEN 1 ELSE 0 END)     AS step4_5star,
  ROUND(1.0 * SUM(CASE WHEN r.rider_rating = 5 THEN 1 ELSE 0 END) /
        COUNT(*), 3) AS overall_conv
FROM requests req
LEFT JOIN rides r ON req.request_id = r.request_id;
```

Table: 1 records

| step1_requests| step2_accepted| step3_completed| step4_5star| overall_conv|
|--------------:|--------------:|---------------:|-----------:|------------:|
|          50000|          39000|           39000|       23536|        0.471|
]

Read each column as: "of all the rows, how many passed this filter?" The numbers should decrease at each step — that's the funnel.

**Why `1.0 *`?** Without it, SQLite does integer division: `3 / 10 = 0` (truncated). Multiplying by `1.0` forces float math: `1.0 * 3 / 10 = 0.3`. In R this never comes up because R uses doubles by default.

**Why not `step4_5star / step1_requests`?** You can't reference column aliases within the same SELECT — they don't exist yet (SELECT runs as one step). You have to repeat the full expressions, or use a CTE to compute the columns first and divide in the outer query.

---

# P4: For each signup-month cohort, how many riders were active in each subsequent month?

A **CTE** lets you name a temporary result. Read `WITH cohort AS (...)` as **"let `cohort` = this query"** — like `cohort <- riders |> ...` in R. (Covered fully in Module 4.)

**Step 1.** CTE for each rider's cohort (signup month):

```sql
WITH cohort AS (
  SELECT rider_id, strftime('%Y-%m', signup_date) AS cohort_month
  FROM riders
)
```

**Step 2.** CTE for each rider's activity months (comma-separated from Step 1):

```sql
activity AS (
  SELECT DISTINCT rider_id, strftime('%Y-%m', requested_at) AS activity_month
  FROM requests
)
```

---
count: false

# P4: Cohort retention — Step 3

**Step 3.** Join the two CTEs and count distinct riders per (cohort, month):

```sql
SELECT c.cohort_month, a.activity_month,
       COUNT(DISTINCT c.rider_id) AS active_riders
FROM cohort c JOIN activity a ON c.rider_id = a.rider_id
GROUP BY c.cohort_month, a.activity_month
```

---
count: false

# P4: Cohort retention — full query + result

.small[

```sql
WITH cohort AS (
  SELECT rider_id, strftime('%Y-%m', signup_date) AS cohort_month
  FROM riders
),
activity AS (
  SELECT DISTINCT rider_id, strftime('%Y-%m', requested_at) AS activity_month
  FROM requests
)
SELECT c.cohort_month, a.activity_month,
       COUNT(DISTINCT c.rider_id) AS active_riders
FROM cohort c JOIN activity a ON c.rider_id = a.rider_id
GROUP BY c.cohort_month, a.activity_month
ORDER BY c.cohort_month, a.activity_month
LIMIT 6;
```

Table: 6 records

|cohort_month |activity_month | active_riders|
|:------------|:--------------|-------------:|
|2022-01      |2025-01        |            63|
|2022-01      |2025-02        |            59|
|2022-01      |2025-03        |            57|
|2022-01      |2025-04        |            55|
|2022-01      |2025-05        |            65|
|2022-01      |2025-06        |            56|
]

Each row = one (cohort, month) pair. To see retention *rates*, divide `active_riders` by the cohort's total size.

---

# Common Mistakes

**1. Selecting non-aggregated columns without grouping.**

```sql
-- WRONG: which driver_id does it pick? Arbitrary!
SELECT city, driver_id, COUNT(*) AS n_rides
FROM rides GROUP BY city;

-- RIGHT: every non-aggregate column must be in GROUP BY
SELECT city, driver_id, COUNT(*) AS n_rides
FROM rides GROUP BY city, driver_id;
```

Strict SQL (Postgres) errors out. SQLite silently picks an arbitrary `driver_id` per city — which is worse because you get a result that *looks* fine but is wrong.

**2. `COUNT` vs `COUNT(DISTINCT)`.** `COUNT(*)` counts rows. `COUNT(DISTINCT col)` counts unique values. Always be deliberate.

**3. Counting "users" after a JOIN.** Joins multiply rows. `COUNT(rider_id)` after a join counts *rides*, not riders. Use `COUNT(DISTINCT rider_id)`.

**4. Filtering aggregates in `WHERE`.** Doesn't work. Use `HAVING`.

```sql
-- WRONG: aggregates don't exist in WHERE (runs before GROUP BY)
SELECT city, COUNT(*) AS n FROM rides
WHERE COUNT(*) > 1000 GROUP BY city;

-- RIGHT: HAVING runs after GROUP BY, so aggregates exist
SELECT city, COUNT(*) AS n FROM rides
GROUP BY city HAVING COUNT(*) > 1000;
```

---
class: inverse

# The Key Takeaways

<br>

### 1. The six patterns above cover almost every analytical query you'll write.

<br>

### 2. WHERE filters rows; HAVING filters groups. Aggregates can only live in HAVING.

<br>

### 3. CASE WHEN inside an aggregate is the SQL idiom for `mean(condition)`. Memorize it.

---
class: sample-code

name: sample
# Sample Code: Cohort Retention with Accept Rates

.small[Dashboard + compare-to-global + funnel + cohort (all 4 patterns)]

```sql
WITH cohort AS (
  SELECT rider_id, strftime('%Y-%m', signup_date) AS cohort_month FROM riders
),
monthly_activity AS (
  SELECT req.rider_id, strftime('%Y-%m', req.requested_at) AS month,
    SUM(CASE WHEN req.accepted = 1 THEN 1 ELSE 0 END) AS n_accepted,
    COUNT(*) AS n_requests
  FROM requests req GROUP BY req.rider_id, month
)
SELECT c.cohort_month, a.month,
  COUNT(DISTINCT c.rider_id) AS active_riders,
  ROUND(1.0 * SUM(a.n_accepted) / SUM(a.n_requests), 3) AS avg_accept_rate
FROM cohort c JOIN monthly_activity a ON c.rider_id = a.rider_id
GROUP BY c.cohort_month, a.month;
```

---

# Course Map

<a href="../module-01/slides.html#schema" style="font-size:13px;background:#f0f8ff;border:1px solid #aac;border-radius:4px;padding:2px 8px;text-decoration:none;color:#336;">Schema ref</a>

<table>
<tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr>
<tr><td>2</td><td><a href="../module-02/slides.html">JOINs</a></td><td>✓ done</td><td><a href="../module-02/slides.html#sample">M2</a></td></tr>
<tr><td>3</td><td>GROUP BY, HAVING, Conversion Funnels <i>(just finished)</i></td><td>✓ done</td></tr>
<tr><td><b>4</b></td><td><b>Subqueries and CTEs</b></td><td>next</td><td></td></tr>
<tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr>
</table>