Module 1: SELECT, WHERE, Aggregates

class: center, middle, inverse, title-slide

.title[
# Module 1: SELECT, WHERE, Aggregates
]
.subtitle[
## The 80% of SQL you’ll use every day
]

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr>
<tr><td>1</td><td>SELECT, WHERE, Aggregates (you are here)</td><td>← current</td><td><a href="#sample">M1</a></td></tr>
<tr><td>2</td><td>JOINs</td><td>upcoming</td><td></td></tr>
<tr><td>3</td><td>GROUP BY, HAVING, Conversion Funnels</td><td>upcoming</td><td></td></tr>
<tr><td>4</td><td>Subqueries and CTEs</td><td>upcoming</td><td></td></tr>
<tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr>
</table>

---

# Before We Start: Setup

**Step 1.** Build the database (once, in your terminal):

```bash
cd ~/Desktop/sandbox/sql-uber-prep
Rscript data/setup.R
```

This creates `data/uber.sqlite` with 6 tables (~39K rides, 800 drivers, 4K riders).

**Step 2.** Install **DB Browser for SQLite** — a free GUI app:

Download from [sqlitebrowser.org/dl](https://sqlitebrowser.org/dl/), drag to Applications, then:

```bash
open -a "DB Browser for SQLite" data/uber.sqlite
```

**Step 3.** Go to the **Execute SQL** tab. Type a query, hit `Cmd+Return` to run it.

**Important:** only highlight and run one `SELECT ... ;` at a time. Don't paste whole files — the `.headers on` lines at the top of exercise files are terminal-only commands and will cause errors in DB Browser.

---

# Why SQL? (You can do all of this in R)

Yes — but there are three reasons you'll use SQL anyway:

**1. Scale.** R loads everything into RAM. A 10GB dataset crashes your laptop. A SQL database handles it without blinking — the data stays on disk, and only the result comes back.

**2. The data already lives there.** At most tech companies, you don't get a CSV. You get access to a database with billions of rows. SQL is how you talk to it.

**3. It's the lingua franca.** Every data tool speaks SQL. Every data person knows SQL. When a PM asks "how many rides last week?", the answer starts with `SELECT COUNT(*)`.

---

# Why SQL? It's not about speed — it's about scale

For data that fits in your laptop's RAM, R/dplyr is actually **faster** than SQLite. That's fine.

The point is what happens when it **doesn't** fit:

| Dataset size | R (dplyr) | SQL database |
|---|---|---|
| 100K rows (~10 MB) | Fast | Fast |
| 10M rows (~1 GB) | Slow but works | Fast |
| 100M rows (~10 GB) | **Crashes** (out of RAM) | Fast |
| 1B+ rows (~100 GB) | Impossible | Still fast (Postgres, Spark) |

At a ride-sharing company, the `rides` table has **billions** of rows. You can't `read_csv` that. You write a SQL query, the database runs it on the server, and only the **result** (a few rows) comes back to your laptop.

That's why every data job requires SQL — not because it's faster for small data, but because it's the **only way to touch big data**.

---

# What This Course Is

A 5-module SQL refresher targeted at **applied economist / data science** interviews in tech.

**SQL** is the shared language for querying databases. There are several **engines** that run it:

| Engine | What it is | When you'd use it |
|---|---|---|
| **SQLite** | A single file on your laptop, zero setup | Tutorials, small projects, mobile apps |
| **PostgreSQL** | A full server, most feature-rich | Production analytics, data warehouses |
| **MySQL** | A full server, widely used in web apps | Web backends, some analytics |

~95% of the queries are identical across all three. We use **SQLite** because it's a single file with no server to manage. The patterns transfer directly — if your interviewer uses a different dialect, just ask about syntax differences (mostly date functions).

- **Database:** SQLite (`data/uber.sqlite`), built by `Rscript data/setup.R`
- **Schema:** rides, requests, drivers, riders, neighborhoods, cities
- **Format:** every module is concept → code → 3-5 interview-style questions
- **Goal:** be able to write the canonical analytical queries from a cold start

---
name: schema

# The Schema

A **schema** is the map of a database: what tables exist, what columns each has, and how they link. If you're used to R or Stata, think of it as having **multiple data frames stored together** with columns that let you `left_join` between them.

```
cities         (city_id, name, country, lead_pm)
neighborhoods  (nbhd_id, city_id, name, pct_minority, median_income)
drivers        (driver_id, signup_date, gender, home_nbhd_id)
riders         (rider_id, signup_date, home_nbhd_id)
requests       (request_id, rider_id, pickup_nbhd_id, dropoff_nbhd_id,
                requested_at, accepted, accepted_by_driver_id)
rides          (ride_id, request_id, driver_id, started_at, ended_at,
                distance_mi, fare_usd, surge_mult, rider_rating)
```

The linking columns (e.g. `driver_id` appears in both `rides` and `drivers`) are what you'd use as the `by =` argument in dplyr. In SQL, you use `JOIN ... ON` instead.

---

# The Core Syntax

<pre style="background:#f5f5f5;padding:0.9em 1em;border-radius:4px;font-family:'Source Code Pro',Menlo,monospace;line-height:1.55;font-size:90%;margin:0;">
⑤ SELECT col1, col2, agg_func(col3) AS result_name
① FROM table_name
② WHERE some_condition
③ GROUP BY col1, col2
④ HAVING aggregate_condition
⑥ ORDER BY col1 DESC
⑦ LIMIT 10;
</pre>

**Logical execution order** (different from written order — numbers ①–⑦ above match):

① FROM → ② WHERE → ③ GROUP BY → ④ HAVING → ⑤ SELECT → ⑥ ORDER BY → ⑦ LIMIT

A **column alias** is the name you give a computed result with `AS` — e.g., `fare_usd * 1.5 AS surge_fare`. It's like naming a new column in `mutate(surge_fare = fare_usd * 1.5)`.

This execution order matters: `WHERE` happens **before** SELECT, so you can't use an alias in `WHERE` (it doesn't exist yet). `ORDER BY` happens **after** SELECT, so you can.

---

# Five Things to Remember

**1. `COUNT(*)` vs `COUNT(col)`** — `COUNT(*)` counts rows; `COUNT(col)` skips NULLs in `col`.

**2. `COUNT(DISTINCT col)`** — the most common non-trivial aggregate.

**3. `AVG` and `SUM` ignore NULLs.** Usually what you want. If not, wrap in `COALESCE(col, 0)`.

**4. `WHERE` filters rows; `HAVING` filters groups.** No aggregates in `WHERE`.

**5. `LIMIT` is *not* a sample.** It's the first N rows after `ORDER BY`. Without sorting, the order is undefined.

---

# Date Handling in SQLite

```sql
date(col)                          -- '2025-01-15'
strftime('%Y-%m', col)             -- '2025-01'
strftime('%w', col)                -- day of week (0=Sun)
strftime('%H', col)                -- hour of day
date(col, 'start of month')        -- first of the month
date(col, '-7 days')               -- a week ago
julianday(b) - julianday(a)        -- day difference
```

In Postgres or MySQL the syntax is different (`DATE_TRUNC`, `EXTRACT`), but the patterns are the same. If your interviewer uses a different dialect, just ask.

---

# Pattern: Top-N

```sql
SELECT ride_id, driver_id, distance_mi, fare_usd
FROM   rides
ORDER BY distance_mi DESC
LIMIT 5;
```

Table: 5 records

| ride_id| driver_id| distance_mi| fare_usd|
|-------:|---------:|-----------:|--------:|
|   28699|       744|       36.82|    90.87|
|   11316|       157|       30.26|    73.29|
|   35411|       800|       29.30|    47.41|
|    5731|       468|       29.11|    45.44|
|     313|       532|       28.87|    65.01|

The two essentials: an `ORDER BY` and a `LIMIT`. Forget the `ORDER BY` and you'll get five arbitrary rows.

---

# Pattern: Filtered Aggregate

```sql
SELECT COUNT(*) AS rides_dec, AVG(fare_usd) AS avg_fare
FROM   rides
WHERE  date(started_at) BETWEEN '2025-12-01' AND '2025-12-31';
```

Table: 1 records

| rides_dec| avg_fare|
|---------:|--------:|
|      3293| 11.84622|

`WHERE` happens before aggregation, so this counts December rides and averages their fares.

---

# Pattern: Group By + Having

**Drivers with at least 100 rides in 2025**:

```sql
SELECT driver_id, COUNT(*) AS n_rides
FROM   rides
WHERE  date(started_at) BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY driver_id
HAVING COUNT(*) >= 100
ORDER BY n_rides DESC
LIMIT 5;
```

Table: 0 records

| driver_id|n_rides |
|---------:|:-------|

**Don't put `COUNT(*) >= 100` in `WHERE`** — aggregates don't exist there yet. `HAVING` is the post-aggregation filter.

---

# Pattern: Conditional Counts

```sql
SELECT
  COUNT(*) AS total,
  SUM(CASE WHEN accepted = 1 THEN 1 ELSE 0 END) AS accepted_count,
  ROUND(AVG(CASE WHEN accepted = 1 THEN 1.0 ELSE 0 END), 3) AS accept_rate
FROM requests;
```

Table: 1 records

| total| accepted_count| accept_rate|
|-----:|--------------:|-----------:|
| 50000|          39000|        0.78|

The `CASE WHEN ... THEN 1 ELSE 0 END` trick is the SQL version of `mean(x == "foo")` in R. **Memorize it** — you'll use it constantly.

---

# Common Mistakes

- **Integer division:** `SUM(x) / COUNT(*)` → 0 if both are integers and the result < 1. Always cast: `1.0 * SUM(x) / COUNT(*)`

- **NULLs in `AVG`:** ignored, not zeroed. `AVG(fare_usd)` on 100 rows with 50 NULLs averages over the 50 non-NULL rows.

- **Column aliases in `WHERE`:** doesn't work. Repeat the expression or wrap in a subquery.

- **`BETWEEN` is inclusive on both ends.** `WHERE x BETWEEN 1 AND 10` is `x >= 1 AND x <= 10`.

- **Case-sensitive string comparison.** Use `LOWER()` or `COLLATE NOCASE`.

---
class: inverse, center, middle

# Interview Questions
### How fast can you write each one?
### [Schema reference](#schema)

---

# Q1. What is the average fare by hour of day?

<a class="write-ref" href="../module-03/slides.html#write-order">WO</a>
*Hint: you need `strftime('%H', ...)` to extract the hour, `AVG()`, and `GROUP BY`.*

```sql
SELECT strftime('%H', started_at) AS hour,
       ROUND(AVG(fare_usd), 2)    AS avg_fare,
       COUNT(*)                   AS n_rides
FROM   rides
GROUP BY hour
ORDER BY hour
LIMIT 6;
```

Table: 1 records

|hour | avg_fare| n_rides|
|:----|--------:|-------:|
|00   |    12.01|   39000|

---

# Q2. What is the monthly conversion rate (requests → accepted)?

<a class="write-ref" href="../module-03/slides.html#write-order">WO</a>
*Hint: you need `SUM(accepted) / COUNT(*)` and watch out for integer division.*

```sql
SELECT strftime('%Y-%m', requested_at) AS month,
       COUNT(*)                         AS n_requests,
       SUM(accepted)                    AS n_accepted,
       ROUND(1.0 * SUM(accepted) / COUNT(*), 3) AS accept_rate
FROM   requests
GROUP BY month
ORDER BY month
LIMIT 6;
```

Table: 6 records

|month   | n_requests| n_accepted| accept_rate|
|:-------|----------:|----------:|-----------:|
|2025-01 |       4251|       3307|       0.778|
|2025-02 |       3818|       2970|       0.778|
|2025-03 |       4227|       3321|       0.786|
|2025-04 |       4109|       3195|       0.778|
|2025-05 |       4228|       3249|       0.768|
|2025-06 |       4199|       3311|       0.789|

---

# Q3. How often does surge pricing kick in, and what's the average multiplier?

<a class="write-ref" href="../module-03/slides.html#write-order">WO</a>
*Hint: use the `CASE WHEN ... THEN 1 ELSE 0 END` trick inside `SUM()` and `AVG()`.*

```sql
SELECT
  COUNT(*) AS total_rides,
  SUM(CASE WHEN surge_mult > 1.0 THEN 1 ELSE 0 END) AS surged_rides,
  ROUND(1.0 * SUM(CASE WHEN surge_mult > 1.0 THEN 1 ELSE 0 END) / COUNT(*), 3)
    AS pct_surged,
  ROUND(AVG(surge_mult), 3) AS avg_surge_mult
FROM rides;
```

Table: 1 records

| total_rides| surged_rides| pct_surged| avg_surge_mult|
|-----------:|------------:|----------:|--------------:|
|       39000|        28949|      0.742|          1.246|

---
class: inverse

# The Key Takeaways

### 1. SELECT / WHERE / GROUP BY / HAVING / ORDER BY / LIMIT — the six clauses cover ~80% of analytical work

### 2. The CASE WHEN aggregate trick is the SQL equivalent of dplyr's `mean(x == "foo")`. Memorize it.

### 3. Watch out for integer division, NULL handling, and column aliases in WHERE.

---
class: sample-code

name: sample
# Sample Code: Monthly Rides, Drivers, Avg Fare & Surge Rate

```sql
-- SELECT + WHERE + aggregates + GROUP BY + HAVING + ORDER BY + LIMIT
SELECT
  strftime('%Y-%m', started_at)                        AS month,
  COUNT(*)                                             AS n_rides,
  COUNT(DISTINCT driver_id)                            AS n_drivers,
  ROUND(AVG(fare_usd), 2)                             AS avg_fare,
  ROUND(1.0 * SUM(CASE WHEN surge_mult > 1 THEN 1
                        ELSE 0 END) / COUNT(*), 3)    AS pct_surged
FROM   rides
WHERE  date(started_at) BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY month
HAVING COUNT(*) > 100
ORDER BY month
LIMIT 12;
```

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th><th>Sample Code</th></tr>
<tr><td>1</td><td>SELECT, WHERE, Aggregates (just finished)</td><td>✓ done</td></tr>
<tr><td>2</td><td>JOINs</td><td>next</td><td></td></tr>
<tr><td>3</td><td>GROUP BY, HAVING, Conversion Funnels</td><td>upcoming</td><td></td></tr>
<tr><td>4</td><td>Subqueries and CTEs</td><td>upcoming</td><td></td></tr>
<tr><td>5</td><td>Window Functions</td><td>upcoming</td><td></td></tr>
</table>