class: center, middle, inverse, title-slide .title[ # Module 5: Window Functions ] .subtitle[ ## The single most powerful SQL feature for analytical work ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .write-ref { position: absolute; bottom: 12px; right: 90px; background: #f0f8ff; border: 1px solid #aac; border-radius: 4px; padding: 1px 6px; font-size: 11px; text-decoration: none; color: #336; font-weight: bold; z-index: 100; } .write-ref:hover { background: #ddeeff; text-decoration: none; } .sample-code h1 { font-size: 1.2em; margin-bottom: 0.2em; } .sample-code .remark-code { font-size: 110%; line-height: 1.35; } .small { font-size: 80%; } </style> # Course Map <a href="../module-01/slides.html#schema" style="font-size:13px;background:#f0f8ff;border:1px solid #aac;border-radius:4px;padding:2px 8px;text-decoration:none;color:#336;">Schema ref</a> <table> <tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">JOINs</a></td><td>✓ done</td><td><a href="../module-02/slides.html#sample">M2</a></td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">GROUP BY, HAVING, Conversion Funnels</a></td><td>✓ done</td><td><a href="../module-03/slides.html#sample">M3</a></td></tr> <tr><td>4</td><td><a href="../module-04/slides.html">Subqueries and CTEs</a></td><td>✓ done</td><td><a href="../module-04/slides.html#sample">M4</a></td></tr> <tr><td><b>5</b></td><td><b>Window Functions</b> <i>(you are here)</i></td><td>← current</td><td><a href="#sample">M5</a></td></tr> </table> --- # The Mental Model A window function: - Computes an aggregate (or rank, or lag) **across a set of rows related to the current row** (the "window") - Returns **one value per input row** (unlike GROUP BY) - Lives in the SELECT clause -- ```sql SELECT driver_id, fare_usd, AVG(fare_usd) OVER (PARTITION BY driver_id) AS driver_avg FROM rides; ``` Returns *every* ride along with its driver's average fare. **No GROUP BY, no subquery.** --- # The OVER Clause ```sql function() OVER ( PARTITION BY col1, col2 -- groups (like GROUP BY) ORDER BY col3 DESC -- order within partition ROWS BETWEEN ... AND ... -- frame ) ``` -- Three parts: - **PARTITION BY** — splits rows into groups - **ORDER BY** — orders within each partition (required for ranking; optional for plain aggregates) - **ROWS BETWEEN** — defines the window frame --- # Four Function Families **1. Aggregates** — `SUM`, `AVG`, `MIN`, `MAX`, `COUNT` -- **2. Ranking** | Function | Behavior on ties | |---|---| | `ROW_NUMBER()` | Always 1, 2, 3, ... | | `RANK()` | Same rank, gaps after (1, 2, 2, 4) | | `DENSE_RANK()` | Same rank, no gaps (1, 2, 2, 3) | | `NTILE(n)` | Splits into n equal buckets | -- **3. Offset** — `LAG(col, n, default)`, `LEAD(col, n, default)` -- **4. First / last** — `FIRST_VALUE`, `LAST_VALUE`, `NTH_VALUE` --- # Q: For each driver, find their highest-fare ride. <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: `ROW_NUMBER() OVER (PARTITION BY driver_id ORDER BY fare_usd DESC)`, then filter for rank = 1.* -- ```sql WITH ranked AS ( SELECT driver_id, ride_id, fare_usd, ROW_NUMBER() OVER (PARTITION BY driver_id ORDER BY fare_usd DESC) AS rk FROM rides ) SELECT driver_id, ride_id, fare_usd FROM ranked WHERE rk = 1 ORDER BY fare_usd DESC LIMIT 5; ``` Table: 5 records | driver_id| ride_id| fare_usd| |---------:|-------:|--------:| | 744| 28699| 90.87| | 157| 11316| 73.29| | 708| 20070| 65.67| | 532| 313| 65.01| | 664| 21337| 60.96| -- **Use `ROW_NUMBER`** if you want exactly one row per group (ties broken arbitrarily). **Use `RANK` or `DENSE_RANK`** if you want all tied rows. --- # Q: For each ride, how many hours since the driver's previous ride? <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: `LAG(started_at, 1) OVER (PARTITION BY driver_id ORDER BY started_at)`.* -- ```sql SELECT ride_id, driver_id, started_at, LAG(started_at, 1) OVER (PARTITION BY driver_id ORDER BY started_at) AS prev, ROUND( (julianday(started_at) - julianday(LAG(started_at, 1) OVER (PARTITION BY driver_id ORDER BY started_at)) ) * 24, 2 ) AS hours_since_last FROM rides LIMIT 5; ``` Table: 5 records | ride_id| driver_id|started_at |prev | hours_since_last| |-------:|---------:|:--------------------------|:--------------------------|----------------:| | 6262| 1|2025-01-04 00:05:03.253056 |NA | NA| | 2840| 1|2025-01-06 00:03:46.348345 |2025-01-04 00:05:03.253056 | 47.98| | 7704| 1|2025-01-12 00:04:46.916019 |2025-01-06 00:03:46.348345 | 144.02| | 25133| 1|2025-01-14 00:02:28.86687 |2025-01-12 00:04:46.916019 | 47.96| | 1303| 1|2025-02-03 00:04:51.056838 |2025-01-14 00:02:28.86687 | 480.04| -- For the first ride per driver, `LAG` returns NULL. --- # Q: Compute a 7-day rolling average of daily ride counts. <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: pre-aggregate to one row per day, then `AVG(...) OVER (ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)`.* -- ```sql WITH daily AS ( SELECT date(started_at) AS day, COUNT(*) AS rides_today FROM rides GROUP BY day ) SELECT day, rides_today, ROUND(AVG(rides_today) OVER ( ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ), 1) AS rolling_7d FROM daily ORDER BY day LIMIT 7; ``` Table: 7 records |day | rides_today| rolling_7d| |:----------|-----------:|----------:| |2025-01-01 | 104| 104.0| |2025-01-02 | 99| 101.5| |2025-01-03 | 113| 105.3| |2025-01-04 | 114| 107.5| |2025-01-05 | 124| 110.8| |2025-01-06 | 113| 111.2| |2025-01-07 | 104| 110.1| -- `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` = 7-row window. Pre-aggregate to one row per day so the row count = day count. --- name: median-main # Q: What's the median fare per driver? <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: median = middle row when sorted. Standard SQL has `PERCENTILE_CONT(0.5)`; SQLite doesn't, so we'll need a workaround (next slide).* -- **Standard SQL** (Postgres / MySQL 8+ / Snowflake — *not* SQLite): ```sql SELECT driver_id, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY fare_usd) AS median_fare FROM rides GROUP BY driver_id; ``` `WITHIN GROUP (ORDER BY ...)` is **ordered-set aggregate** syntax — the ordering is local to each `GROUP BY` partition. No `OVER (PARTITION BY ...)` needed. For other percentiles: `PERCENTILE_CONT(0.9)` for 90th, `PERCENTILE_CONT(0.25)` for Q1, etc. --- name: median-sqlite # Q: Median in SQLite — the `ROW_NUMBER` workaround <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: order each driver's rides, pick the middle row(s) by `ROW_NUMBER()`.* -- ```sql WITH ranked AS ( SELECT driver_id, fare_usd, ROW_NUMBER() OVER (PARTITION BY driver_id ORDER BY fare_usd) AS rn, COUNT(*) OVER (PARTITION BY driver_id) AS n FROM rides ) SELECT driver_id, ROUND(AVG(fare_usd), 2) AS median_fare FROM ranked WHERE rn IN ((n+1)/2, (n+2)/2) -- one row if n odd, two rows if even GROUP BY driver_id ORDER BY driver_id LIMIT 5; ``` Table: 5 records |driver_id | median_fare| |:---------|-----------:| |1 | 11.86| |2 | 11.31| |3 | 10.13| |4 | 11.37| |5 | 11.17| -- **Why the trick works** (integer division): - **Odd** `n=5`: `(5+1)/2 = 3`, `(5+2)/2 = 3` → one row matches (`rn=3`), the middle. - **Even** `n=6`: `(6+1)/2 = 3`, `(6+2)/2 = 4` → two rows; `AVG` returns the conventional median. **Other percentiles:** change the rank cutoff. 90th: `rn = CEIL(n * 0.9)`. Shorter approximations: `NTILE(2)` → keep ntile = 1 (floor median). --- # Q: Find drivers in the bottom 10% of their city by acceptance rate. <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: `NTILE(10) OVER (PARTITION BY city_id ORDER BY accept_rate)`, then filter for decile = 1.* -- ```sql WITH driver_city AS ( SELECT r.driver_id, n.city_id, AVG(CASE WHEN req.accepted = 1 THEN 1.0 ELSE 0 END) AS accept_rate FROM requests req LEFT JOIN rides r ON req.request_id = r.request_id JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id WHERE r.driver_id IS NOT NULL GROUP BY r.driver_id, n.city_id ), deciled AS ( SELECT *, NTILE(10) OVER (PARTITION BY city_id ORDER BY accept_rate) AS decile FROM driver_city ) SELECT * FROM deciled WHERE decile = 1 LIMIT 5; ``` Table: 5 records |driver_id | city_id| accept_rate| decile| |:---------|-------:|-----------:|------:| |1 | 1| 1| 1| |2 | 1| 1| 1| |3 | 1| 1| 1| |4 | 1| 1| 1| |5 | 1| 1| 1| -- `NTILE(10)` is a one-liner replacement for the gnarly correlated subquery from Module 3. --- # Q: What is each driver's longest streak of consecutive days with at least one ride? <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: the "gaps and islands" trick — subtract `ROW_NUMBER()` from the date to get a constant for each consecutive run.* -- ```sql WITH driver_days AS ( SELECT DISTINCT driver_id, date(started_at) AS d FROM rides ), labeled AS ( SELECT driver_id, d, julianday(d) - ROW_NUMBER() OVER (PARTITION BY driver_id ORDER BY d) AS grp FROM driver_days ), streaks AS ( SELECT driver_id, grp, COUNT(*) AS streak_len FROM labeled GROUP BY driver_id, grp ) SELECT driver_id, MAX(streak_len) FROM streaks GROUP BY driver_id; ``` -- **The trick:** subtracting `ROW_NUMBER()` from the date gives a constant for any consecutive run. Group by that constant → each streak. Mind-bending the first time you see it. --- # Common Mistakes **1. Missing `ORDER BY` for ranking.** `ROW_NUMBER() OVER (PARTITION BY x)` without ORDER BY is non-deterministic. -- **2. Wrong default frame for `LAST_VALUE`.** Defaults to ending at `CURRENT ROW`, so it returns the current row's value. Always extend to `UNBOUNDED FOLLOWING`. -- **3. Window functions in `WHERE`.** Doesn't work — windows are computed after WHERE. Wrap in a CTE and filter in the outer query. -- **4. Rolling window with gaps.** `ROWS BETWEEN 6 PRECEDING` counts *rows*, not days. Pre-aggregate first. --- class: inverse # The Key Takeaways <br> ### 1. Window functions = aggregate + per-row output. The mental shift is: every row keeps its identity, but you compute things across related rows. -- <br> ### 2. The four families: aggregate-with-window, ranking, offset (LAG/LEAD), first/last. -- <br> ### 3. ROW_NUMBER + filter inside a CTE is the canonical "top-N per group" idiom. Memorize it. --- class: sample-code name: sample # Sample Code: Daily Revenue Rankings & Trends ```sql -- RANK + LAG + rolling AVG + NTILE (all window function families) WITH daily AS ( SELECT driver_id, date(started_at) AS day, COUNT(*) AS n_rides, SUM(fare_usd) AS revenue FROM rides GROUP BY driver_id, day ) SELECT driver_id, day, n_rides, revenue, RANK() OVER (PARTITION BY driver_id ORDER BY revenue DESC) AS revenue_rank, LAG(revenue) OVER (PARTITION BY driver_id ORDER BY day) AS prev_day_rev, ROUND(AVG(revenue) OVER (PARTITION BY driver_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), 2) AS rolling_7d_avg, NTILE(4) OVER (ORDER BY revenue DESC) AS quartile FROM daily LIMIT 10; ``` --- class: sample-code name: sample-full # Sample Code: City Dashboard (All Modules Combined) .small[M1: SELECT/WHERE/aggregates — M2: JOINs — M3: GROUP BY/CASE WHEN — M4: CTEs — M5: window] ```sql WITH city_stats AS ( -- M4: CTE wraps the prep SELECT c.name AS city, COUNT(*) AS n_rides, -- M1: project + COUNT ROUND(AVG(r.fare_usd), 2) AS avg_fare, -- M1: AVG aggregate SUM(CASE WHEN r.surge_mult > 1 -- M3: conditional count THEN 1 ELSE 0 END) AS n_surged -- via CASE WHEN FROM rides r -- M2: alias rides JOIN requests req ON r.request_id = req.request_id -- M2: chain hop 1 JOIN neighborhoods n ON req.pickup_nbhd_id = n.nbhd_id -- M2: chain hop 2 JOIN cities c ON n.city_id = c.city_id -- M2: chain hop 3 WHERE date(r.started_at) >= '2025-06-01' -- M1: pre-agg filter GROUP BY c.name -- M3: 1 row per city HAVING COUNT(*) > 500 -- M3: post-agg filter ) SELECT city, n_rides, avg_fare, n_surged, -- pick CTE columns RANK() OVER (ORDER BY avg_fare DESC) AS fare_rank, -- M5: window rank ROUND(1.0 * n_surged / n_rides, 3) AS surge_pct -- M1: rate (1.0* = float div) FROM city_stats -- read the CTE ORDER BY fare_rank; -- final sort ``` --- # Course Map <a href="../module-01/slides.html#schema" style="font-size:13px;background:#f0f8ff;border:1px solid #aac;border-radius:4px;padding:2px 8px;text-decoration:none;color:#336;">Schema ref</a> <table> <tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">SELECT, WHERE, Aggregates</a></td><td>✓ done</td><td><a href="../module-01/slides.html#sample">M1</a></td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">JOINs</a></td><td>✓ done</td><td><a href="../module-02/slides.html#sample">M2</a></td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">GROUP BY, HAVING, Conversion Funnels</a></td><td>✓ done</td><td><a href="../module-03/slides.html#sample">M3</a></td></tr> <tr><td>4</td><td><a href="../module-04/slides.html">Subqueries and CTEs</a></td><td>✓ done</td><td><a href="../module-04/slides.html#sample">M4</a></td></tr> <tr><td>5</td><td>Window Functions <i>(just finished)</i></td><td>✓ done</td><td><a href="#sample">M5</a></td></tr> </table> **You're done with the SQL course.** Time to drill the questions until you can write each one in two minutes flat.