Module 2: pandas Basics

class: center, middle, inverse, title-slide

.title[
# Module 2: pandas Basics
]
.subtitle[
## The dplyr → pandas translation
]

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">Python for R Users</a></td><td>✓ done</td></tr>
<tr><td><b>2</b></td><td><b>pandas basics</b> <i>(you are here)</i></td><td>← current</td></tr>
<tr><td>3</td><td>Joins, merges, group-by recipes</td><td>upcoming</td></tr>
<tr><td>4</td><td>Regression and A/B tests with statsmodels</td><td>upcoming</td></tr>
<tr><td>5</td><td>End-to-end interview scenario</td><td>upcoming</td></tr>
</table>

---

# Loading Data

```python
import pandas as pd
import numpy as np

rides = pd.read_csv("data/rides.csv", parse_dates=["pickup_at"])

rides.head()
rides.shape          # (30000, 9) — like dim() in R
rides.columns        # Index(['ride_id', 'rider_id', ...])
rides.dtypes         # column types
```

`parse_dates=` is essential. Without it, date columns are strings and `.dt.year` errors out.

---

# The dplyr → pandas Cheat Sheet

.small[
| What | dplyr | pandas |
|---|---|---|
| Filter | `filter(df, fare > 10)` | `df[df["fare"] > 10]` |
| Select | `select(df, city, fare)` | `df[["city", "fare"]]` |
| Mutate | `mutate(df, rate = fare/dist)` | `df["rate"] = df["fare"]/df["dist"]` |
| Sort | `arrange(df, desc(fare))` | `df.sort_values("fare", ascending=False)` |
| Summarise | `summarise(df, avg = mean(fare))` | `df["fare"].mean()` |
| Group + summarise | `group_by(df, city) %>% summarise(...)` | `df.groupby("city")["fare"].mean()` |
| Distinct | `distinct(df, id)` | `df.drop_duplicates("id")` |
]

In dplyr, verbs are free functions in a pipe. In pandas, **everything is a method on the DataFrame** — you chain with dots instead of `|>`.

---

# Filter

```python
# Single condition
rides[rides["fare_usd"] > 20]

# Multiple — use & and |, NOT and/or
rides[(rides["fare_usd"] > 20) & (rides["distance_mi"] < 5)]

# isin (matches dplyr's %in%)
rides[rides["city"].isin(["SF", "Chicago"])]

# Negation
rides[~rides["city"].isin(["SF"])]

# Query method (string-based, often cleaner)
rides.query("fare_usd > 20 and distance_mi < 5")
```

**The parentheses around each `&` condition are required.** Forgetting them is the most common pandas mistake.

---

# Mutate (assign)

```python
# Direct (mutates in place)
rides["fare_per_mile"] = rides["fare_usd"] / rides["distance_mi"]

# .assign() — closer to dplyr style, returns a new DataFrame
rides = rides.assign(
    fare_per_mile = rides["fare_usd"] / rides["distance_mi"],
    is_long_trip  = rides["distance_mi"] > 10,
    log_fare      = np.log(rides["fare_usd"])
)

# Conditional with np.where
rides["tier"] = np.where(rides["fare_usd"] > 30, "premium", "standard")
```

`assign()` is the dplyr-est way. Direct assignment is shorter but mutates in place.

---

# Group By + Summarise

```python
# One group, one stat — returns a Series
rides.groupby("city")["fare_usd"].mean()

# Multiple stats with named output (the dplyr-est syntax)
rides.groupby("city").agg(
    avg_fare    = ("fare_usd", "mean"),
    n_rides     = ("ride_id", "count"),
    p90_fare    = ("fare_usd", lambda x: x.quantile(0.9))
)

# To get a DataFrame back instead of grouped index
rides.groupby("city", as_index=False)["fare_usd"].mean()
```

**The `name = (column, function)` syntax** in `.agg()` is the closest pandas gets to `summarise(avg_fare = mean(fare_usd), ...)`.

---

# Date / Time Operations

Once parsed, every datetime column has a `.dt` accessor:

```python
rides["pickup_at"].dt.year
rides["pickup_at"].dt.month
rides["pickup_at"].dt.day_name()
rides["pickup_at"].dt.hour
rides["pickup_at"].dt.dayofweek      # 0 = Monday
```

Filter on dates:

```python
rides[rides["pickup_at"] >= "2025-12-01"]
```

Resample to a frequency:

```python
rides.set_index("pickup_at").resample("D")["fare_usd"].sum()    # daily
rides.set_index("pickup_at").resample("W")["fare_usd"].mean()   # weekly
```

---

# Pivot

```python
# Long → wide
hourly = (
    rides
    .assign(hour = rides["pickup_at"].dt.hour)
    .groupby(["city", "hour"])
    .size()
    .reset_index(name="n_rides")
)

wide = hourly.pivot(index="city", columns="hour", values="n_rides")
```

```python
# Wide → long
long = wide.reset_index().melt(
    id_vars="city", var_name="hour", value_name="n_rides"
)
```

These are `tidyr::pivot_wider` and `pivot_longer`.

---

# Common Traps

**1. Operator precedence in boolean filters.** Always parenthesize each condition before `&` / `|`.

**2. `SettingWithCopyWarning`.** Use `.loc[row_selector, col_selector]` to modify a subset, never chained `[]` indexing.

**3. NaN handling.** `df["col"] == np.nan` is always False. Use `df["col"].isna()`.

**4. The default index sticks around.** If you filter or group, the row index can become non-contiguous. Use `df.reset_index(drop=True)` for a fresh integer index.

---
class: inverse, center, middle

# Interview Questions

---

# Q1. Keep only weekday rides between 7-9am. Compute mean fare by city.

*Hint: `.dt.dayofweek`, `.dt.hour` for filtering, then `.groupby().mean()`.*

```python
morning_weekday = rides[
    (rides["pickup_at"].dt.dayofweek < 5) &
    (rides["pickup_at"].dt.hour >= 7) &
    (rides["pickup_at"].dt.hour < 9)
]
morning_weekday.groupby("city")["fare_usd"].mean().round(2)
```

Notice the parentheses around each `&` condition — required.

---

# Q2. Find the top 5 most expensive rides (by fare per mile) in each city.

*Hint: `.assign()` to create fare_per_mile, `.sort_values()`, `.groupby().head(5)`.*

```python
(rides
 .assign(fare_per_mile = rides["fare_usd"] / rides["distance_mi"])
 .sort_values(["city", "fare_per_mile"], ascending=[True, False])
 .groupby("city")
 .head(5)
)
```

`groupby().head(N)` = "top-N per group". Like `slice_max(n=5, by=city)` in dplyr.

---

# Q3. What share of rides per city are during peak hours (7-9am or 5-7pm)?

*Hint: create an `is_peak` column with `.between()`, then `.agg()` with count and sum.*

.small[
```python
rides["is_peak"] = (
    (rides["pickup_at"].dt.hour.between(7, 8)) |
    (rides["pickup_at"].dt.hour.between(17, 18))
).astype(int)

(rides.groupby("city")
 .agg(n_rides=("ride_id","count"), n_peak=("is_peak","sum"))
 .assign(peak_share = lambda d: (d["n_peak"] / d["n_rides"]).round(3)))
```
]

---

# Q4. Which are the top 3 cities by ride volume in the last 30 days of 2025?

*Hint: filter on date, `.groupby().size()`, `.sort_values().head(3)`.*

```python
recent = rides[rides["pickup_at"] >= "2025-12-02"]
recent.groupby("city").size().sort_values(ascending=False).head(3)
```

`.size()` = pandas equivalent of `dplyr::n()` inside a `summarise`.

---

# Q5. Pivot the hourly ride counts (city × hour → wide table with hours as columns).

*Hint: `.groupby().size()` to get the long table, then `.pivot()`.*

```python
long = (
    rides
    .assign(hour = rides["pickup_at"].dt.hour)
    .groupby(["city", "hour"])
    .size()
    .reset_index(name="n_rides")
)
wide = long.pivot(index="city", columns="hour", values="n_rides")
```

`pivot()` = long → wide. `melt()` = wide → long.

---
class: inverse

# The Key Takeaways

<br>

### 1. The dplyr 5 verbs all have direct pandas equivalents. The hardest part is remembering the method-call style instead of the pipe.

<br>

### 2. `.agg(name = (col, func))` is the cleanest summarise. Memorize it.

<br>

### 3. The two pandas-specific traps: parenthesize boolean conditions, and use `.loc` for any "modify a subset" operation.

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">Python for R Users</a></td><td>✓ done</td></tr>
<tr><td>2</td><td>pandas basics <i>(just finished)</i></td><td>✓ done</td></tr>
<tr><td><b>3</b></td><td><b>Joins, merges, group-by recipes</b></td><td>next</td></tr>
<tr><td>4</td><td>Regression and A/B tests with statsmodels</td><td>upcoming</td></tr>
<tr><td>5</td><td>End-to-end interview scenario</td><td>upcoming</td></tr>
</table>