class: center, middle, inverse, title-slide .title[ # Module 3: Joins, Merges, Group-By Recipes ] .subtitle[ ## The bridge from one-liners to real analysis ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .small { font-size: 80%; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">Python for R Users</a></td><td>✓ done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">pandas basics</a></td><td>✓ done</td></tr> <tr><td><b>3</b></td><td><b>Joins, merges, group-by recipes</b> <i>(you are here)</i></td><td>← current</td></tr> <tr><td>4</td><td>Regression and A/B tests with statsmodels</td><td>upcoming</td></tr> <tr><td>5</td><td>End-to-end interview scenario</td><td>upcoming</td></tr> </table> --- # `merge`: pandas' join | dplyr | pandas | |---|---| | `inner_join(a, b, by = "id")` | `a.merge(b, on="id", how="inner")` | | `left_join(a, b, by = "id")` | `a.merge(b, on="id", how="left")` | | `full_join(a, b, by = "id")` | `a.merge(b, on="id", how="outer")` | | `semi_join(a, b, by = "id")` | `a[a["id"].isin(b["id"])]` | | `anti_join(a, b, by = "id")` | `a[~a["id"].isin(b["id"])]` | -- `merge` defaults to `how="inner"`. **Always specify `how=` explicitly** so the join intent is obvious to whoever reads your code. --- # A First Merge ```python merged = rides.merge( drivers, on="driver_id", how="left", suffixes=("_ride", "_driver") ) ``` -- The `suffixes=` argument disambiguates columns that exist in both tables (here, `city`). dplyr uses `.x` / `.y` by default; pandas uses `_x` / `_y` unless you override. --- # Different Names on Each Side ```python rides.merge( drivers, left_on = "driver_id", right_on = "id", how = "left" ) ``` dplyr equivalent: `left_join(rides, drivers, by = c("driver_id" = "id"))`. -- For very large tables, joining on the index is faster: ```python drivers_idx = drivers.set_index("driver_id") rides.merge(drivers_idx, left_on="driver_id", right_index=True, how="left") ``` --- # Anti-join ```python # Drivers who never had a ride no_ride_drivers = drivers[~drivers["driver_id"].isin(rides["driver_id"])] ``` -- The `~` is "NOT". `isin` is the pandas equivalent of `%in%`. --- # Per-Group Top: Most-Frequent Driver per Rider ```python top_driver_per_rider = ( rides .groupby(["rider_id", "driver_id"]) .size() .reset_index(name="n_rides") .sort_values(["rider_id", "n_rides"], ascending=[True, False]) .drop_duplicates("rider_id") ) ``` -- The pattern: count per (group, sub-group), sort, then `drop_duplicates` on the group keeps only the first (= largest count) row per group. --- # `transform`: Group Stat Broadcast Back to Rows The cleanest way to compute "deviation from group mean": ```python rides["city_avg_fare"] = ( rides.groupby("city")["fare_usd"].transform("mean") ) rides["fare_above_city_avg"] = rides["fare_usd"] - rides["city_avg_fare"] ``` -- `.transform("mean")` returns a Series the same length as the original — the group mean broadcast to every row in that group. **Memorize this idiom.** --- # Top-N per Group ```python # Method 1: rank-then-filter rides["fare_rank"] = ( rides .groupby("city")["fare_usd"] .rank(method="dense", ascending=False) ) top5 = rides[rides["fare_rank"] <= 5] ``` -- ```python # Method 2: sort + groupby + head (cleaner) top5 = ( rides .sort_values(["city", "fare_usd"], ascending=[True, False]) .groupby("city") .head(5) ) ``` -- Method 2 is what idiomatic pandas looks like. --- # `concat`: bind_rows / bind_cols ```python # Stacking by row (rbind / dplyr::bind_rows) combined = pd.concat([df1, df2, df3], ignore_index=True) # By column (cbind / dplyr::bind_cols) combined = pd.concat([df1, df2], axis=1) ``` -- `ignore_index=True` is important when stacking — otherwise pandas tries to align on the original index, which often produces NaNs. --- # Common Traps **1. Many-to-many merges.** If both sides have duplicates on the key, you get a Cartesian product within each key group. Use `validate="many_to_one"` to assert your assumption. -- **2. Default `how="inner"` silently drops rows.** Always specify `how=`. -- **3. Index alignment in `concat`.** Use `ignore_index=True` to avoid surprises. -- **4. `transform` requires same-length output.** For variable-length output use `apply()`. --- class: inverse, center, middle # Interview Questions --- # Q1. Merge rides with drivers on driver_id, keeping all rides. *Hint: `.merge(..., how="left")` + `suffixes=` for duplicate columns.* -- ```python merged = rides.merge(drivers, on="driver_id", how="left", suffixes=("_ride", "_driver")) ``` --- # Q2. For each rider, find the driver they've ridden with most often. *Hint: count per (rider, driver) pair, sort, keep first per rider.* -- ```python (rides .groupby(["rider_id", "driver_id"]).size() .reset_index(name="n_rides") .sort_values(["rider_id", "n_rides"], ascending=[True, False]) .drop_duplicates("rider_id")) ``` Count → sort → `drop_duplicates` keeps the first (= largest) per group. --- # Q3. Compute per-city average fare and add it as a column on each ride. *Hint: `.groupby().transform("mean")`.* -- ```python rides["city_avg_fare"] = rides.groupby("city")["fare_usd"].transform("mean") rides["fare_above_city_avg"] = rides["fare_usd"] - rides["city_avg_fare"] ``` dplyr equivalent: `mutate(city_avg_fare = mean(fare_usd), .by = city)`. --- # Q4. Find drivers in `drivers.csv` who never appear in `rides.csv`. *Hint: `~` + `.isin()` = anti-join.* -- ```python no_ride_drivers = drivers[~drivers["driver_id"].isin(rides["driver_id"])] ``` --- # Q5. For each driver, keep only their top 3 highest-fare rides. *Hint: `.sort_values()` + `.groupby().head(3)`.* -- ```python top3 = (rides .sort_values(["driver_id", "fare_usd"], ascending=[True, False]) .groupby("driver_id").head(3)) ``` `groupby().head(N)` = top-N per group. Like `slice_max(n=3, by=driver_id)` in dplyr. --- class: inverse # The Key Takeaways <br> ### 1. `merge(..., how="left")` is your bread and butter. Always specify `how=`. -- <br> ### 2. `transform("mean")` broadcasts a group statistic back to row level. Memorize it. -- <br> ### 3. `sort_values + groupby().head(N)` is the canonical "top-N per group" idiom. --- # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">Python for R Users</a></td><td>✓ done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">pandas basics</a></td><td>✓ done</td></tr> <tr><td>3</td><td>Joins, merges, group-by recipes <i>(just finished)</i></td><td>✓ done</td></tr> <tr><td><b>4</b></td><td><b>Regression and A/B tests with statsmodels</b></td><td>next</td></tr> <tr><td>5</td><td>End-to-end interview scenario</td><td>upcoming</td></tr> </table>