class: center, middle, inverse, title-slide .title[ # Module 1: SELECT, WHERE, Aggregates ] .subtitle[ ## The 80% of SQL you’ll use every day ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .write-ref { position: absolute; bottom: 12px; right: 90px; background: #f0f8ff; border: 1px solid #aac; border-radius: 4px; padding: 1px 6px; font-size: 11px; text-decoration: none; color: #336; font-weight: bold; z-index: 100; } .write-ref:hover { background: #ddeeff; text-decoration: none; } .sample-code h1 { font-size: 1.2em; margin-bottom: 0.2em; } .sample-code .remark-code { font-size: 110%; line-height: 1.35; } .small { font-size: 80%; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr> <tr><td><b>1</b></td><td><b>SELECT, WHERE, Aggregates</b> <i>(you are here)</i></td><td>← current</td><td><a href="#sample">M1</a></td></tr> <tr><td>2</td><td>JOINs</td><td>upcoming</td><td></td></tr> <tr><td>3</td><td>GROUP BY, HAVING, Conversion Funnels</td><td>upcoming</td><td></td></tr> <tr><td>4</td><td>Subqueries and CTEs</td><td>upcoming</td><td></td></tr> <tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr> </table> --- # Before We Start: Setup **Step 1.** Build the database (once, in your terminal): ```bash cd ~/Desktop/sandbox/sql-uber-prep Rscript data/setup.R ``` This creates `data/uber.sqlite` with 6 tables (~39K rides, 800 drivers, 4K riders). -- **Step 2.** Install **DB Browser for SQLite** — a free GUI app: Download from [sqlitebrowser.org/dl](https://sqlitebrowser.org/dl/), drag to Applications, then: ```bash open -a "DB Browser for SQLite" data/uber.sqlite ``` -- **Step 3.** Go to the **Execute SQL** tab. Type a query, hit `Cmd+Return` to run it. **Important:** only highlight and run one `SELECT ... ;` at a time. Don't paste whole files — the `.headers on` lines at the top of exercise files are terminal-only commands and will cause errors in DB Browser. --- # Why SQL? (You can do all of this in R) Yes — but there are three reasons you'll use SQL anyway: -- **1. Scale.** R loads everything into RAM. A 10GB dataset crashes your laptop. A SQL database handles it without blinking — the data stays on disk, and only the result comes back. -- **2. The data already lives there.** At most tech companies, you don't get a CSV. You get access to a database with billions of rows. SQL is how you talk to it. -- **3. It's the lingua franca.** Every data tool speaks SQL. Every data person knows SQL. When a PM asks "how many rides last week?", the answer starts with `SELECT COUNT(*)`. --- # Why SQL? It's not about speed — it's about scale For data that fits in your laptop's RAM, R/dplyr is actually **faster** than SQLite. That's fine. The point is what happens when it **doesn't** fit: | Dataset size | R (dplyr) | SQL database | |---|---|---| | 100K rows (~10 MB) | Fast | Fast | | 10M rows (~1 GB) | Slow but works | Fast | | 100M rows (~10 GB) | **Crashes** (out of RAM) | Fast | | 1B+ rows (~100 GB) | Impossible | Still fast (Postgres, Spark) | -- At a ride-sharing company, the `rides` table has **billions** of rows. You can't `read_csv` that. You write a SQL query, the database runs it on the server, and only the **result** (a few rows) comes back to your laptop. -- That's why every data job requires SQL — not because it's faster for small data, but because it's the **only way to touch big data**. --- # What This Course Is A 5-module SQL refresher targeted at **applied economist / data science** interviews in tech. -- **SQL** is the shared language for querying databases. There are several **engines** that run it: | Engine | What it is | When you'd use it | |---|---|---| | **SQLite** | A single file on your laptop, zero setup | Tutorials, small projects, mobile apps | | **PostgreSQL** | A full server, most feature-rich | Production analytics, data warehouses | | **MySQL** | A full server, widely used in web apps | Web backends, some analytics | ~95% of the queries are identical across all three. We use **SQLite** because it's a single file with no server to manage. The patterns transfer directly — if your interviewer uses a different dialect, just ask about syntax differences (mostly date functions). -- - **Database:** SQLite (`data/uber.sqlite`), built by `Rscript data/setup.R` - **Schema:** rides, requests, drivers, riders, neighborhoods, cities - **Format:** every module is concept → code → 3-5 interview-style questions - **Goal:** be able to write the canonical analytical queries from a cold start --- name: schema # The Schema A **schema** is the map of a database: what tables exist, what columns each has, and how they link. If you're used to R or Stata, think of it as having **multiple data frames stored together** with columns that let you `left_join` between them. ``` cities (city_id, name, country, lead_pm) neighborhoods (nbhd_id, city_id, name, pct_minority, median_income) drivers (driver_id, signup_date, gender, home_nbhd_id) riders (rider_id, signup_date, home_nbhd_id) requests (request_id, rider_id, pickup_nbhd_id, dropoff_nbhd_id, requested_at, accepted, accepted_by_driver_id) rides (ride_id, request_id, driver_id, started_at, ended_at, distance_mi, fare_usd, surge_mult, rider_rating) ``` -- The linking columns (e.g. `driver_id` appears in both `rides` and `drivers`) are what you'd use as the `by =` argument in dplyr. In SQL, you use `JOIN ... ON` instead. --- # The Core Syntax <pre style="background:#f5f5f5;padding:0.9em 1em;border-radius:4px;font-family:'Source Code Pro',Menlo,monospace;line-height:1.55;font-size:90%;margin:0;"> <span style="color:#6a3d9a;"><b>⑤ SELECT</b> col1, col2, agg_func(col3) <b>AS</b> result_name</span> <span style="color:#1f78b4;"><b>① FROM</b> table_name</span> <span style="color:#33a02c;"><b>② WHERE</b> some_condition</span> <span style="color:#ff7f00;"><b>③ GROUP BY</b> col1, col2</span> <span style="color:#e31a1c;"><b>④ HAVING</b> aggregate_condition</span> <span style="color:#b15928;"><b>⑥ ORDER BY</b> col1 DESC</span> <span style="color:#555;"><b>⑦ LIMIT</b> 10;</span> </pre> -- **Logical execution order** (different from written order — numbers ①–⑦ above match): <span style="color:#1f78b4;"><b>① FROM</b></span> → <span style="color:#33a02c;"><b>② WHERE</b></span> → <span style="color:#ff7f00;"><b>③ GROUP BY</b></span> → <span style="color:#e31a1c;"><b>④ HAVING</b></span> → <span style="color:#6a3d9a;"><b>⑤ SELECT</b></span> → <span style="color:#b15928;"><b>⑥ ORDER BY</b></span> → <span style="color:#555;"><b>⑦ LIMIT</b></span> -- A **column alias** is the name you give a computed result with `AS` — e.g., `fare_usd * 1.5 AS surge_fare`. It's like naming a new column in `mutate(surge_fare = fare_usd * 1.5)`. This execution order matters: `WHERE` happens **before** SELECT, so you can't use an alias in `WHERE` (it doesn't exist yet). `ORDER BY` happens **after** SELECT, so you can. --- # Five Things to Remember **1. `COUNT(*)` vs `COUNT(col)`** — `COUNT(*)` counts rows; `COUNT(col)` skips NULLs in `col`. -- **2. `COUNT(DISTINCT col)`** — the most common non-trivial aggregate. -- **3. `AVG` and `SUM` ignore NULLs.** Usually what you want. If not, wrap in `COALESCE(col, 0)`. -- **4. `WHERE` filters rows; `HAVING` filters groups.** No aggregates in `WHERE`. -- **5. `LIMIT` is *not* a sample.** It's the first N rows after `ORDER BY`. Without sorting, the order is undefined. --- # Date Handling in SQLite ```sql date(col) -- '2025-01-15' strftime('%Y-%m', col) -- '2025-01' strftime('%w', col) -- day of week (0=Sun) strftime('%H', col) -- hour of day date(col, 'start of month') -- first of the month date(col, '-7 days') -- a week ago julianday(b) - julianday(a) -- day difference ``` -- In Postgres or MySQL the syntax is different (`DATE_TRUNC`, `EXTRACT`), but the patterns are the same. If your interviewer uses a different dialect, just ask. --- # Pattern: Top-N ```sql SELECT ride_id, driver_id, distance_mi, fare_usd FROM rides ORDER BY distance_mi DESC LIMIT 5; ``` Table: 5 records | ride_id| driver_id| distance_mi| fare_usd| |-------:|---------:|-----------:|--------:| | 28699| 744| 36.82| 90.87| | 11316| 157| 30.26| 73.29| | 35411| 800| 29.30| 47.41| | 5731| 468| 29.11| 45.44| | 313| 532| 28.87| 65.01| -- The two essentials: an `ORDER BY` and a `LIMIT`. Forget the `ORDER BY` and you'll get five arbitrary rows. --- # Pattern: Filtered Aggregate ```sql SELECT COUNT(*) AS rides_dec, AVG(fare_usd) AS avg_fare FROM rides WHERE date(started_at) BETWEEN '2025-12-01' AND '2025-12-31'; ``` Table: 1 records | rides_dec| avg_fare| |---------:|--------:| | 3293| 11.84622| -- `WHERE` happens before aggregation, so this counts December rides and averages their fares. --- # Pattern: Group By + Having **Drivers with at least 100 rides in 2025**: ```sql SELECT driver_id, COUNT(*) AS n_rides FROM rides WHERE date(started_at) BETWEEN '2025-01-01' AND '2025-12-31' GROUP BY driver_id HAVING COUNT(*) >= 100 ORDER BY n_rides DESC LIMIT 5; ``` Table: 0 records | driver_id|n_rides | |---------:|:-------| -- **Don't put `COUNT(*) >= 100` in `WHERE`** — aggregates don't exist there yet. `HAVING` is the post-aggregation filter. --- # Pattern: Conditional Counts ```sql SELECT COUNT(*) AS total, SUM(CASE WHEN accepted = 1 THEN 1 ELSE 0 END) AS accepted_count, ROUND(AVG(CASE WHEN accepted = 1 THEN 1.0 ELSE 0 END), 3) AS accept_rate FROM requests; ``` Table: 1 records | total| accepted_count| accept_rate| |-----:|--------------:|-----------:| | 50000| 39000| 0.78| -- The `CASE WHEN ... THEN 1 ELSE 0 END` trick is the SQL version of `mean(x == "foo")` in R. **Memorize it** — you'll use it constantly. --- # Common Mistakes - **Integer division:** `SUM(x) / COUNT(*)` → 0 if both are integers and the result < 1. Always cast: `1.0 * SUM(x) / COUNT(*)` -- - **NULLs in `AVG`:** ignored, not zeroed. `AVG(fare_usd)` on 100 rows with 50 NULLs averages over the 50 non-NULL rows. -- - **Column aliases in `WHERE`:** doesn't work. Repeat the expression or wrap in a subquery. -- - **`BETWEEN` is inclusive on both ends.** `WHERE x BETWEEN 1 AND 10` is `x >= 1 AND x <= 10`. -- - **Case-sensitive string comparison.** Use `LOWER()` or `COLLATE NOCASE`. --- class: inverse, center, middle # Interview Questions ### How fast can you write each one? ### [Schema reference](#schema) --- # Q1. What is the average fare by hour of day? <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: you need `strftime('%H', ...)` to extract the hour, `AVG()`, and `GROUP BY`.* -- ```sql SELECT strftime('%H', started_at) AS hour, ROUND(AVG(fare_usd), 2) AS avg_fare, COUNT(*) AS n_rides FROM rides GROUP BY hour ORDER BY hour LIMIT 6; ``` Table: 1 records |hour | avg_fare| n_rides| |:----|--------:|-------:| |00 | 12.01| 39000| --- # Q2. What is the monthly conversion rate (requests → accepted)? <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: you need `SUM(accepted) / COUNT(*)` and watch out for integer division.* -- ```sql SELECT strftime('%Y-%m', requested_at) AS month, COUNT(*) AS n_requests, SUM(accepted) AS n_accepted, ROUND(1.0 * SUM(accepted) / COUNT(*), 3) AS accept_rate FROM requests GROUP BY month ORDER BY month LIMIT 6; ``` Table: 6 records |month | n_requests| n_accepted| accept_rate| |:-------|----------:|----------:|-----------:| |2025-01 | 4251| 3307| 0.778| |2025-02 | 3818| 2970| 0.778| |2025-03 | 4227| 3321| 0.786| |2025-04 | 4109| 3195| 0.778| |2025-05 | 4228| 3249| 0.768| |2025-06 | 4199| 3311| 0.789| --- # Q3. How often does surge pricing kick in, and what's the average multiplier? <a class="write-ref" href="../module-03/slides.html#write-order">WO</a> *Hint: use the `CASE WHEN ... THEN 1 ELSE 0 END` trick inside `SUM()` and `AVG()`.* -- ```sql SELECT COUNT(*) AS total_rides, SUM(CASE WHEN surge_mult > 1.0 THEN 1 ELSE 0 END) AS surged_rides, ROUND(1.0 * SUM(CASE WHEN surge_mult > 1.0 THEN 1 ELSE 0 END) / COUNT(*), 3) AS pct_surged, ROUND(AVG(surge_mult), 3) AS avg_surge_mult FROM rides; ``` Table: 1 records | total_rides| surged_rides| pct_surged| avg_surge_mult| |-----------:|------------:|----------:|--------------:| | 39000| 28949| 0.742| 1.246| --- class: inverse # The Key Takeaways <br> ### 1. SELECT / WHERE / GROUP BY / HAVING / ORDER BY / LIMIT — the six clauses cover ~80% of analytical work -- <br> ### 2. The CASE WHEN aggregate trick is the SQL equivalent of dplyr's `mean(x == "foo")`. Memorize it. -- <br> ### 3. Watch out for integer division, NULL handling, and column aliases in WHERE. --- class: sample-code name: sample # Sample Code: Monthly Rides, Drivers, Avg Fare & Surge Rate ```sql -- SELECT + WHERE + aggregates + GROUP BY + HAVING + ORDER BY + LIMIT SELECT strftime('%Y-%m', started_at) AS month, COUNT(*) AS n_rides, COUNT(DISTINCT driver_id) AS n_drivers, ROUND(AVG(fare_usd), 2) AS avg_fare, ROUND(1.0 * SUM(CASE WHEN surge_mult > 1 THEN 1 ELSE 0 END) / COUNT(*), 3) AS pct_surged FROM rides WHERE date(started_at) BETWEEN '2025-01-01' AND '2025-12-31' GROUP BY month HAVING COUNT(*) > 100 ORDER BY month LIMIT 12; ``` --- # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr> <tr><td>1</td><td>SELECT, WHERE, Aggregates <i>(just finished)</i></td><td>✓ done</td></tr> <tr><td><b>2</b></td><td><b>JOINs</b></td><td>next</td><td></td></tr> <tr><td>3</td><td>GROUP BY, HAVING, Conversion Funnels</td><td>upcoming</td><td></td></tr> <tr><td>4</td><td>Subqueries and CTEs</td><td>upcoming</td><td></td></tr> <tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr> </table>