class: center, middle, inverse, title-slide .title[ # Module 4: Tree-Based Methods ] .subtitle[ ## Surge Pricing, Neighborhood Effects, and Algorithmic Redlining ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .small { font-size: 80%; } .tiny { font-size: 65%; } .two-col { display: flex; align-items: flex-start; gap: 1em; } .col-narrow { flex: 1; } .col-wide { flex: 2; } .highlight-box { background: #fff3e0; border-left: 4px solid #e65100; padding: 0.5em 1em; margin: 0.5em 0; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">The Learning Problem</a></td><td>done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">Linear Models</a></td><td>done</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Model Evaluation & Selection</a></td><td>done</td></tr> <tr><td><b>4</b></td><td><b>Tree-Based Methods</b> <i>(you are here)</i></td><td>current</td></tr> <tr><td>5</td><td>Unsupervised Learning</td><td>upcoming</td></tr> <tr><td>6</td><td>Neural Networks Foundations</td><td>upcoming</td></tr> <tr><td>7</td><td><a href="../module-07/slides.html">Fairness Frameworks & Metrics</a></td><td>done</td></tr> <tr><td>8</td><td>Auditing & Interpretability</td><td>upcoming</td></tr> </table> --- # The Setup: Surge Pricing Ride-sharing platforms use **surge multipliers** to balance supply and demand. When demand is high in an area, prices go up. -- The model takes features like: - Time of day, day of week - Pickup neighborhood / zip code - Recent demand in the area - Number of available drivers nearby -- .highlight-box[ **The fairness question:** if the model learns that certain neighborhoods consistently have high demand and low supply, it charges more there. Those neighborhoods often correlate with race and income. Is the algorithm **redlining**? 
]

---

# Simulated Surge Data (1/2)

.small[

```r
n <- 2000
surge <- tibble(
  hour = sample(0:23, n, replace = TRUE),
  day_of_week = sample(1:7, n, replace = TRUE),
  neighborhood = sample(c("Downtown", "Midtown", "Uptown", "Southside", "Westend"),
                        n, replace = TRUE),
  demand_ratio = rnorm(n, mean = case_when(
    neighborhood == "Downtown" ~ 1.8,
    neighborhood == "Midtown" ~ 1.4,
    neighborhood == "Southside" ~ 1.6,
    neighborhood == "Westend" ~ 0.9,
    TRUE ~ 1.1), sd = 0.3),
  drivers_nearby = rpois(n, lambda = case_when(
    neighborhood == "Downtown" ~ 12,
    neighborhood == "Southside" ~ 4,
    neighborhood == "Westend" ~ 5,
    TRUE ~ 8)),
  is_weekend = day_of_week %in% c(6, 7),
  is_rush = hour %in% c(7:9, 17:19)
)
```
]

---
count: false

# Simulated Surge Data (2/2)

.small[

```r
surge <- surge |>
  mutate(
    surge_mult = 1 + 0.4 * demand_ratio - 0.05 * drivers_nearby +
      0.3 * is_rush + 0.15 * is_weekend + rnorm(n, sd = 0.15),
    surge_mult = pmax(surge_mult, 1.0)  # floor at 1x
  )

glimpse(surge)
```

```
## Rows: 2,000
## Columns: 8
## $ hour           <int> 16, 4, 0, 9, 3, 17, 16, 14, 23, 6, 3, 4, 13, 19, 17, 14…
## $ day_of_week    <int> 3, 3, 4, 4, 5, 5, 5, 4, 6, 6, 6, 7, 2, 6, 4, 1, 6, 5, 3…
## $ neighborhood   <chr> "Southside", "Westend", "Downtown", "Midtown", "Uptown"…
## $ demand_ratio   <dbl> 1.8926136, 0.5477677, 1.7313107, 1.5421331, 1.4351858, …
## $ drivers_nearby <int> 2, 5, 8, 7, 15, 9, 12, 9, 5, 4, 7, 2, 15, 4, 11, 13, 8,…
## $ is_weekend     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ is_rush        <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, F…
## $ surge_mult     <dbl> 1.685165, 1.000000, 1.273864, 1.715376, 1.000000, 1.000…
```
]

---

# Part 1: Decision Trees

---

# A Single Decision Tree

*A decision tree asks a sequence of yes/no questions to split the data.*

.pull-left[
.small[

```r
tree_fit <- rpart(
  surge_mult ~ hour + demand_ratio + drivers_nearby +
    neighborhood + is_weekend + is_rush,
  data = surge,
  control = rpart.control(maxdepth = 2))
```

- At the root, the tree tried all 6 features — `neighborhood != Southside` gave the lowest RSS
- Within each child, it tried all 6 again (features **can** be reused). `is_rush` won at depth 2
]
]

.pull-right[
<img src="slides_files/figure-html/single-tree-plot-1.png" style="display: block; margin: auto;" />
]

---
name: split-selection

# How Does a Split Get Chosen?

<a href="#split-sketch" style="position:absolute; bottom:12px; left:40px; font-size:11px; background:#e8eaf6; padding:2px 8px; border-radius:3px; z-index:100;">Code sketch</a>

For regression trees, at each node the algorithm tries every feature and every threshold, and picks the one that minimizes the **residual sum of squares** (RSS) in the two child nodes.

--

For a candidate split into left ( `\(L\)` ) and right ( `\(R\)` ):

`$$\small \text{RSS} = \sum_{i \in L} (y_i - \bar{y}_L)^2 + \sum_{i \in R} (y_i - \bar{y}_R)^2$$`

--

**Example:** 6 rides with surge: **1.1** (Downtown), **1.2** (Midtown), **1.4** (Uptown), **1.5** (Westend), **2.0** (Southside), **2.1** (Southside)

| | `!= Southside` | `!= Southside, Westend` |
|---|---|---|
| **Left** | 1.1, 1.2, 1.4, 1.5 | 1.1, 1.2, 1.4 |
| **Right** | 2.0, 2.1 | 1.5, 2.0, 2.1 |
| `\(\bar{y}_L\)` , `\(\bar{y}_R\)` | 1.30, 2.05 | 1.23, 1.87 |
| **RSS** | 0.100 + 0.005 = **0.105** | 0.047 + 0.207 = 0.254 |

The algorithm picks `!= Southside` — more homogeneous groups, lower RSS. Note: the tree is **greedy** — it picks the locally best split without looking ahead.

---

# For Classification: Gini Impurity

Our surge example predicts a number (regression → RSS). But when predicting a **class** (e.g., "high surge" vs "low surge"), the usual split criterion is **Gini impurity**.
`\(p_k\)` = fraction of class `\(k\)` in the node: `$$G = 1 - \sum_k p_k^2$$` -- Gini measures how **mixed** a node is (lower = purer = better): | Node: 10 rides | Composition | Gini | |---|---|---| | All high surge | 10 high, 0 low → `\(p_{\text{high}}=1\)` | `\(1 - 1^2 = 0\)` (pure) | | Even mix | 5 high, 5 low → `\(p_{\text{high}}=0.5\)` | `\(1 - 0.5^2 - 0.5^2 = 0.5\)` (worst) | | Mostly one class | 9 high, 1 low → `\(p_{\text{high}}=0.9\)` | `\(1 - 0.9^2 - 0.1^2 = 0.18\)` | -- The tree picks the split that produces the largest **decrease in weighted Gini** — same logic as RSS, but for categories instead of numbers. --- count: false # For Classification: Gini — Worked Example 10 rides labeled "high" or "low" surge. Two candidate splits on `neighborhood`: | Ride | Neighborhood | Class | |---|---|---| | 1–3 | Downtown | high | | 4–5 | Midtown | low | | 6–7 | Southside | high | | 8–10 | Southside | low | -- .small[ | | `!= Southside` (L: 5, R: 5) | `!= Downtown` (L: 7, R: 3) | |---|---|---| | **Left** | 3 high, 2 low → `\(G = 1 - 0.6^2 - 0.4^2 = 0.48\)` | 2 high, 5 low → `\(G = 1 - 0.29^2 - 0.71^2 = 0.41\)` | | **Right** | 2 high, 3 low → `\(G = 1 - 0.4^2 - 0.6^2 = 0.48\)` | 3 high, 0 low → `\(G = 0\)` (pure!) | | **Weighted** | `\(\frac{5}{10}(0.48) + \frac{5}{10}(0.48) = 0.48\)` | `\(\frac{7}{10}(0.41) + \frac{3}{10}(0) = \textbf{0.29}\)` | The tree picks `!= Downtown` — it isolates a pure node (all high surge). ] --- # Overfitting: The Core Problem An unconstrained tree grows until every leaf is pure (or has one observation). It **memorizes** the training data. ```r deep_tree <- rpart(surge_mult ~ hour + demand_ratio + drivers_nearby + neighborhood + is_weekend + is_rush, data = surge, control = rpart.control(cp = 0, minsplit = 2)) cat("Leaves in unconstrained tree:", sum(deep_tree$frame$var == "<leaf>")) ``` ``` ## Leaves in unconstrained tree: 1771 ``` -- **Controls for complexity:** - `maxdepth` — how deep the tree can grow - `min_n` (= `minsplit`) — minimum observations to attempt a split - `cp` (complexity parameter) — prune splits that don't improve fit by at least this much --- # The Bias-Variance View .pull-left[ **Shallow tree (depth = 2)** - High bias: misses real patterns - Low variance: stable across samples <img src="slides_files/figure-html/shallow-1.png" style="display: block; margin: auto;" /> ] .pull-right[ **Deep tree (depth = 8)** - Low bias: captures everything - High variance: different data `\(\to\)` very different tree <img src="slides_files/figure-html/deep-1.png" style="display: block; margin: auto;" /> ] --- # Part 2: Random Forests --- # From One Tree to Many A single tree is **unstable** — remove a few data points and you get a completely different tree. Random forests fix this by averaging many trees. Two key ideas: **bagging** and **feature randomization**. -- .pull-left[ **Bagging** (Bootstrap AGGregating) - Draw B bootstrap samples (with replacement, same size as original) - Fit one tree to each - Average the predictions ] .pull-right[ .small[ **Example:** predict surge for a Southside rush-hour ride | | Bootstrap sample | Best split | Pred. | |---|---|---|---| | Tree 1 | emphasizes Downtown | `demand > 1.5` | 1.50 | | Tree 2 | more Southside rides | `nbhd != South.` | 1.72 | | Tree 3 | more rush-hour rides | `is_rush = 1` | 1.61 | | **Bagged** | | | **1.61** | Pred. = predicted surge for one specific ride (the leaf mean it falls into). Each tree sees different data → different split → different prediction. Averaging smooths out the noise. 
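A minimal hand-rolled sketch of the same idea (three bootstrapped `rpart` trees, averaged). Here `one_ride` is just the first Southside rush-hour row in the data; this is an illustration, not how `ranger` works internally:

```r
one_ride <- surge |> filter(neighborhood == "Southside", is_rush) |> slice(1)
preds <- map_dbl(1:3, \(b) {
  boot <- slice_sample(surge, prop = 1, replace = TRUE)   # bootstrap sample
  tree <- rpart(surge_mult ~ ., data = boot,
                control = rpart.control(maxdepth = 2))
  predict(tree, newdata = one_ride)                       # this tree's vote
})
mean(preds)   # the bagged prediction
```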
] ] --- # Feature Randomization Bootstrap diversifies the **data** each tree sees. Feature randomization diversifies the **splits** — at *every* node, a fresh random subset of `mtry` features is drawn. ``` Tree 1 (bootstrap sample #1): Root: random 3 of 6 → {demand, hour, is_weekend} → split: demand < 1.5 Left: random 3 of 6 → {neighborhood, drivers, rush} → split: is_rush = 1 Right: random 3 of 6 → {demand, neighborhood, rush} → split: nbhd != Southside Tree 2 (bootstrap sample #2): Root: random 3 of 6 → {drivers, neighborhood, rush} → split: nbhd != Southside Left: random 3 of 6 → {demand, hour, drivers} → split: drivers > 6 Right: random 3 of 6 → {hour, is_weekend, rush} → split: is_rush = 1 Tree 3 (bootstrap sample #3): Root: random 3 of 6 → {demand, neighborhood, hour} → split: demand < 1.3 Left: random 3 of 6 → {drivers, rush, is_weekend} → split: is_weekend = 1 Right: random 3 of 6 → {demand, drivers, rush} → split: drivers > 8 ``` -- Without feature randomization, every tree would split on `demand_ratio` first (the strongest predictor) and all trees would look nearly identical. Forcing a random subset at each node creates **decorrelated** trees. --- name: decorrelation # Why Decorrelation Matters <a href="#variance-proof" style="position:absolute; bottom:12px; left:40px; font-size:11px; background:#e8eaf6; padding:2px 8px; border-radius:3px; z-index:100;">Proof</a> .pull-left[ **Without feature randomization** (bagging only): Every tree splits on `demand_ratio` first. The trees are highly correlated. Averaging `\(B\)` correlated predictions: `$$\text{Var}(\bar{X}) = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2$$` High `\(\rho\)` `\(\to\)` averaging barely helps. ] -- .pull-right[ **With feature randomization** (`mtry` < `\(p\)`): Some trees are forced to split on `drivers_nearby` or `hour` first. The trees are less correlated. Low `\(\rho\)` `\(\to\)` averaging **dramatically** reduces variance. This is the key insight of Breiman (2001). ] --- # Random Forest in R .pull-left[ .small[ ```r rf_spec <- rand_forest( mtry = 2, min_n = 5, # min obs per leaf trees = 500) |> set_engine("ranger", importance = "impurity") |> set_mode("regression") rf_wf <- workflow() |> add_recipe(recipe( surge_mult ~ hour + demand_ratio + drivers_nearby + neighborhood + is_weekend + is_rush, data = surge)) |> add_model(rf_spec) rf_fit <- fit(rf_wf, data = surge) ``` Unlike a single tree, RF wants **deep trees** — high variance per tree is fine because averaging cancels it out. ] ] .pull-right[ .small[ ```r extract_fit_parsnip(rf_fit)$fit ``` ``` ## Ranger result ## ## Call: ## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2, x), num.trees = ~500, min.node.size = min_rows(~5, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) ## ## Type: Regression ## Number of trees: 500 ## Sample size: 2000 ## Number of independent variables: 6 ## Mtry: 2 ## Target node size: 5 ## Variable importance mode: impurity ## Splitrule: variance ## OOB prediction error (MSE): 0.02027697 ## R squared (OOB): 0.7237605 ``` ] ] --- # RF Hyperparameters | Parameter | What it controls | Typical default | Effect of increasing | |-----------|-----------------|----------------|---------------------| | `num.trees` (B) | Number of trees | 500 | More stable; no overfitting risk | | `mtry` | Features per split | `\(\sqrt{p}\)` (classif.) or `\(p/3\)` (regr.) | Stronger trees but more correlated | | `min.node.size` | Minimum n per leaf | 1 (classif.) or 5 (regr.) 
| Simpler trees, less overfitting |

--

.highlight-box[
Random forests **don't overfit** as you add more trees — more trees just means more stable averaging. 500 is almost always enough. The main thing to tune is `mtry`.
]

*Why no overfitting? Each individual tree does overfit (deep, memorizes its bootstrap sample). But each tree overfits to **different noise** (different bootstrap sample + different feature subsets). When you average them, the noise cancels out — only the real signal survives. Adding tree 501 is just one more (nearly independent) vote; averaging in another tree lowers the variance rather than raising it.*

---

# Variable Importance (RF)

<img src="slides_files/figure-html/rf-importance-1.png" style="display: block; margin: auto;" />

.small[
**How to read this:** the model's R² tells you how much of the total variance it explains. The bars show how that predictive work is distributed across features. `drivers_nearby` and `demand_ratio` do most of the heavy lifting.

**Caution:** this plot only shows **relative** importance (which features matter more than others). Two models can have identical importance rankings but very different R² — one explains 90% of variance, the other 5%. Always check R² first to know if the total prediction is meaningful.
]

---

# Part 3: Gradient Boosting

---
name: boosting

# Boosting: Sequential Correction

<a href="#gradient-proof" style="position:absolute; bottom:12px; left:40px; font-size:11px; background:#e8eaf6; padding:2px 8px; border-radius:3px; z-index:100;">Gradient proof</a>
<a href="#boosting-sketch" style="position:absolute; bottom:12px; left:150px; font-size:11px; background:#e8eaf6; padding:2px 8px; border-radius:3px; z-index:100;">Code sketch</a>

Random forests build trees **independently**. Boosting builds them **sequentially** — each tree fixes the errors of the ensemble so far.

--

**Algorithm:**

1. Start with a constant prediction: `\(\hat{f}_0 = \bar{y}\)`
2. For `\(m = 1, \ldots, M\)`:
   - Compute residuals: `\(r_i = y_i - \hat{f}_{m-1}(x_i)\)`
   - Fit a shallow tree `\(h_m\)` to the residuals (typically depth 1–6; depth 1 = a single split, called a "stump")
   - Update: `\(\hat{f}_m = \hat{f}_{m-1} + \eta \cdot h_m\)`

--

`\(\eta\)` is the **learning rate** — how much each new tree contributes. Smaller `\(\eta\)` = more conservative, needs more trees, but generalizes better.

--

**Why "gradient"?** The residual `\(r_i = y_i - \hat{f}(x_i)\)` is the negative gradient of squared-error loss: `\(-\partial L / \partial \hat{f} = y_i - \hat{f}\)`. So "fit to residuals" = "fit to the negative gradient." For other losses (e.g., log-loss for classification), the gradient differs from a simple residual, but the idea is the same: gradient descent in function space.

---
count: false

# Boosting: Step-by-Step Example

`\(h_m\)` is a tree — it takes features in, outputs a number (the leaf mean). `\(\eta \cdot h_m\)` just scales that number down.

.small[
Tracking **one ride** (Southside, rush hour, actual = **1.80**). `\(\bar{y} = 1.35\)`, `\(\eta = 0.3\)`.

| Step | Residual | Leaf this ride falls into | `\(h_m\)` (leaf mean) | Update |
|---|---|---|---|---|
| Start | 0.45 | | | `\(\hat{f}_0 = 1.35\)` |
| Tree 1 | 0.45 | 3 rush rides: residuals 0.45, 0.56, 0.37 | 0.46 | `\(1.35 + 0.3 \times 0.46 = \textbf{1.49}\)` |
| Tree 2 | 0.31 | 4 Southside rides: residuals 0.31, 0.28, 0.40, 0.37 | 0.34 | `\(1.49 + 0.3 \times 0.34 = \textbf{1.59}\)` |
| Tree 3 | 0.21 | 3 high-demand rides: residuals 0.21, 0.15, 0.27 | 0.21 | `\(1.59 + 0.3 \times 0.21 = \textbf{1.65}\)` |

`\(h_m\)` is not this ride's residual — it's the **average residual in the leaf** (which contains other rides too). Only `\(\eta = 30\%\)` of each correction is applied.
The ensemble slowly converges: 1.35 → 1.49 → 1.59 → 1.65 → ... → 1.80. ] --- # Boosting Toy Example (Visual) <img src="slides_files/figure-html/boost-toy-1.png" style="display: block; margin: auto;" /> --- # GBM Hyperparameters | Parameter | What it controls | Typical range | Interaction | |-----------|-----------------|--------------|------------| | `trees` (M) | Boosting rounds | 100–1000 | More trees + low `\(\eta\)` = better but slower | | `tree_depth` | Complexity per tree | 1–6 | Depth 1 = main effects only; 6 = complex interactions | | `learn_rate` ( `\(\eta\)` ) | Shrinkage | 0.01–0.3 | Lower = needs more trees but generalizes better | -- .highlight-box[ **Unlike RF, boosting CAN overfit** with too many trees. That's why we tune `trees` via cross-validation — the optimal number depends on `tree_depth` and `learn_rate`. ] --- # XGBoost in R .pull-left[ .small[ ```r gbm_spec <- boost_tree( trees = 200, tree_depth = 3, learn_rate = 0.1) |> set_engine("xgboost", verbosity = 0) |> set_mode("regression") gbm_wf <- workflow() |> add_recipe(recipe( surge_mult ~ hour + demand_ratio + drivers_nearby + neighborhood + is_weekend + is_rush, data = surge) |> step_mutate(across(where(is.logical), as.numeric)) |> step_dummy(all_nominal_predictors())) |> add_model(gbm_spec) gbm_fit <- fit(gbm_wf, data = surge) ``` ] ] .pull-right[ .small[ - `step_dummy` converts `neighborhood` to 0/1 columns (XGBoost needs numeric input) - Same `workflow()` pattern as RF — recipe handles preprocessing - In practice, replace fixed values with `tune()` and use `tune_grid()` to find the best combination via CV ] ] --- # XGBoost Variable Importance <img src="slides_files/figure-html/xgb-importance-1.png" style="display: block; margin: auto;" /> --- # Part 4: Comparing the Three --- # Predictions Across Neighborhoods *Points on the dashed line = perfect prediction. Tighter cloud = better model. Color spread = neighborhood matters.* <img src="slides_files/figure-html/compare-preds-1.png" style="display: block; margin: auto;" /> Single tree: blocky predictions (few distinct values). RF and XGBoost: smoother, tighter around the diagonal. But all three show **neighborhood-colored clusters** — the model treats neighborhoods differently. --- # The Fairness Lens <img src="slides_files/figure-html/fairness-lens-1.png" style="display: block; margin: auto;" /> -- The model learned that **Southside** has high demand and few drivers `\(\to\)` high surge. Is this efficient pricing or algorithmic redlining? That depends on *why* there are few drivers — which is Module 7's question. --- # Part 5: SHAP Values --- # Beyond Importance: Why Did *This* Ride Get a High Price? Variable importance tells you which features matter **globally**. SHAP values tell you why the model made **this specific prediction**. 
-- For each prediction: `$$\hat{f}(x_i) = \phi_0 + \phi_{\text{demand}}(x_i) + \phi_{\text{drivers}}(x_i) + \phi_{\text{neighborhood}}(x_i) + \ldots$$` -- - `\(\phi_0\)` = average prediction across all rides - `\(\phi_j(x_i)\)` = how much feature `\(j\)` pushed *this* ride's prediction above or below average - All SHAP values sum exactly to the prediction (the **efficiency** property) --- # SHAP for Our Surge Model ```r # Extract raw xgboost object and prepare data matrix xgb_raw <- extract_fit_parsnip(gbm_fit)$fit surge_baked <- bake(extract_recipe(gbm_fit, estimated = TRUE), new_data = surge) |> select(-surge_mult) |> mutate(across(where(is.logical), as.numeric)) |> as.matrix() dtrain_shap <- xgb.DMatrix(surge_baked) shap_contrib <- predict(xgb_raw, newdata = dtrain_shap, predcontrib = TRUE) # Last column is BIAS (intercept) — drop it to keep only feature contributions shap_mat <- shap_contrib[, -ncol(shap_contrib), drop = FALSE] shap_df <- as_tibble(shap_mat) |> set_names(colnames(surge_baked)) # Mean absolute SHAP per feature shap_importance <- shap_df |> summarise(across(everything(), ~ mean(abs(.)))) |> pivot_longer(everything(), names_to = "feature", values_to = "mean_abs_shap") |> mutate(feature = fct_reorder(feature, mean_abs_shap)) shap_importance |> slice_max(mean_abs_shap, n = 10) |> ggplot(aes(mean_abs_shap, feature)) + geom_col(fill = "steelblue") + labs(title = "Mean |SHAP| — Which Features Drive Predictions Most?", x = "Mean |SHAP value|", y = NULL) ``` <img src="slides_files/figure-html/shap-compute-1.png" style="display: block; margin: auto;" /> --- # SHAP for Individual Rides .small[ ```r # Pick a high-surge ride and a low-surge ride high_idx <- which.max(surge$surge_mult) low_idx <- which.min(surge$surge_mult) bind_rows( tibble(feature = colnames(surge_baked), shap = as.numeric(shap_mat[high_idx, ]), ride = paste("High surge:", round(surge$surge_mult[high_idx], 2))), tibble(feature = colnames(surge_baked), shap = as.numeric(shap_mat[low_idx, ]), ride = paste("Low surge:", round(surge$surge_mult[low_idx], 2))) ) |> group_by(ride) |> slice_max(abs(shap), n = 6) |> mutate(feature = fct_reorder(feature, abs(shap))) |> ggplot(aes(shap, feature, fill = shap > 0)) + geom_col(show.legend = FALSE) + facet_wrap(~ride, scales = "free_y") + scale_fill_manual(values = c("TRUE" = "firebrick", "FALSE" = "steelblue")) + labs(title = "SHAP Decomposition: Why Did This Ride Get This Price?", x = "SHAP value (contribution to prediction)", y = NULL) ``` <img src="slides_files/figure-html/shap-individual-1.png" style="display: block; margin: auto;" /> ] --- # The Audit Question SHAP lets you answer: **"Is neighborhood driving the prediction, after controlling for demand and supply?"** -- If `neighborhoodSouthside` has a large positive SHAP value even for rides where `demand_ratio` and `drivers_nearby` are average, that's evidence the model is using neighborhood as more than a proxy for supply/demand. -- .highlight-box[ This is the bridge to Module 7 (fairness) and Module 8 (auditing): tree-based models are powerful predictors but can encode geographic discrimination. SHAP is how you detect it. ] --- # Key Takeaways 1. **Decision trees** are interpretable but overfit. Control with depth and pruning. 2. **Random forests** average many decorrelated trees `\(\to\)` low variance without much bias increase. Main hyperparameter: `mtry`. 3. **Gradient boosting** (XGBoost) builds trees sequentially, correcting errors. 
Often the strongest off-the-shelf model for tabular data, but **can overfit** — tune `trees`, `tree_depth`, `learn_rate` via CV.

4. **Variable importance** tells you what matters globally. **SHAP** tells you why each individual prediction was made.

5. All tree-based methods naturally handle **interactions** and **non-linearities** — minimal feature engineering needed (XGBoost just needs dummy-coded categoricals).

---

# Exercise Preview

In the exercise you will:

1. Simulate a richer surge pricing dataset with explicit demographic correlations
2. Fit decision tree, random forest, and XGBoost models
3. Tune RF and XGBoost hyperparameters with cross-validation using `tidymodels`
4. Compare performance (RMSE, MAE) across models and neighborhoods
5. Compute SHAP values and check whether neighborhood effects persist after controlling for supply/demand
6. Write a 3-sentence "audit summary" of whether the model is redlining

See `exercise.R` for the starter code.

---
class: center, middle, inverse

# Backup Slides

---
name: variance-proof

# Backup: Variance of Averaged Correlated Predictions

<a href="#decorrelation" style="position:absolute; bottom:12px; left:40px; font-size:11px; background:#e8eaf6; padding:2px 8px; border-radius:3px; z-index:100;">Back</a>

We have `\(B\)` trees, each producing prediction `\(X_i\)` with `\(\text{Var}(X_i) = \sigma^2\)` and pairwise correlation `\(\text{Cor}(X_i, X_j) = \rho\)` for all `\(i \neq j\)`. The ensemble prediction is `\(\bar{X} = \frac{1}{B}\sum_{i=1}^B X_i\)`. Its variance:

`$$\text{Var}(\bar{X}) = \text{Var}\!\left(\frac{1}{B}\sum_i X_i\right) = \frac{1}{B^2} \sum_i \sum_j \text{Cov}(X_i, X_j)$$`

--

Split the double sum into diagonal ( `\(i = j\)` ) and off-diagonal ( `\(i \neq j\)` ) terms:

`$$= \frac{1}{B^2}\left[\sum_i \sigma^2 + \sum_{i \neq j} \rho\sigma^2\right] = \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right]$$`

--

`$$= \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2 = \underbrace{\rho\sigma^2}\_{\text{can't reduce}} + \underbrace{\frac{(1-\rho)}{B}\sigma^2}\_{\to 0 \text{ as } B \to \infty}$$`

As `\(B \to \infty\)`, only `\(\rho\sigma^2\)` remains. **Lower `\(\rho\)` (more decorrelated trees) = lower irreducible variance.**

---
name: gradient-proof

# Backup: Why "Gradient" Boosting?

<a href="#boosting" style="position:absolute; bottom:12px; left:40px; font-size:11px; background:#e8eaf6; padding:2px 8px; border-radius:3px; z-index:100;">Back</a>

.pull-left[
**Proof:** for squared-error loss `\(L = \frac{1}{2}(y_i - \hat{f}(x_i))^2\)`:

`$$\frac{\partial L}{\partial \hat{f}(x_i)} = -(y_i - \hat{f}(x_i)) = -r_i$$`

The **negative gradient** = the residual. So "fit a tree to the residuals" is gradient descent in function space: each tree steps in the direction that most reduces the loss.

For **log-loss** (classification): the gradient is `\(p_i - y_i\)` (predicted prob minus label), which is *not* a simple residual — but the algorithm is the same: compute gradient → fit tree → step.
]

.pull-right[
<img src="slides_files/figure-html/gradient-viz-1.png" style="display: block; margin: auto;" />
]
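---
name: split-sketch

# Backup: Split Search by Hand (Sketch)

<a href="#split-selection" style="position:absolute; bottom:12px; left:40px; font-size:11px; background:#e8eaf6; padding:2px 8px; border-radius:3px; z-index:100;">Back</a>

A minimal sketch of the greedy split search for one numeric feature on the simulated `surge` data. The helper names (`rss_for_split`, `vals`, `cuts`) are ours for illustration; `rpart` performs this search internally (and also handles categorical splits, pruning, and more).

.small[

```r
# RSS if we split y at threshold `cut` on feature x
rss_for_split <- function(y, x, cut) {
  left  <- y[x <  cut]
  right <- y[x >= cut]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Candidate thresholds: midpoints between consecutive observed values
vals <- sort(unique(surge$demand_ratio))
cuts <- (head(vals, -1) + tail(vals, -1)) / 2

rss_by_cut <- map_dbl(cuts, \(ct) rss_for_split(surge$surge_mult, surge$demand_ratio, ct))
cuts[which.min(rss_by_cut)]   # best threshold for demand_ratio
```

Repeat for every feature, keep the overall winner, then recurse into each child node; that is the whole greedy algorithm.
]

---
name: boosting-sketch

# Backup: The Boosting Loop From Scratch (Sketch)

<a href="#boosting" style="position:absolute; bottom:12px; left:40px; font-size:11px; background:#e8eaf6; padding:2px 8px; border-radius:3px; z-index:100;">Back</a>

A minimal sketch of the algorithm on the boosting slide, for squared-error loss only. The names (`M`, `eta`, `f_hat`, `boost_df`) are ours; xgboost adds regularization, second-order information, and many engineering tricks on top of this skeleton.

.small[

```r
M     <- 50                                            # boosting rounds
eta   <- 0.3                                           # learning rate
f_hat <- rep(mean(surge$surge_mult), nrow(surge))      # step 1: constant prediction

boost_df <- surge
for (m in 1:M) {
  boost_df$resid <- surge$surge_mult - f_hat           # residuals = negative gradient
  h <- rpart(resid ~ hour + demand_ratio + drivers_nearby +
               neighborhood + is_weekend + is_rush,
             data = boost_df,
             control = rpart.control(maxdepth = 2))    # shallow tree on the residuals
  f_hat <- f_hat + eta * predict(h, newdata = boost_df)  # f_m = f_{m-1} + eta * h_m
}

sqrt(mean((surge$surge_mult - f_hat)^2))               # training RMSE after M rounds
```

Training error keeps falling as `M` grows; held-out error eventually turns back up, which is the overfitting that tuning `trees` by CV guards against.
]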