class: center, middle, inverse, title-slide .title[ # Module 4: Regressions and A/B Tests with statsmodels ] .subtitle[ ## The module where Python finally feels comfortable for an econometrician ] --- <style type="text/css"> .remark-code, .remark-inline-code { font-size: 80%; } .remark-slide-content { padding: 1em 2em; } .small { font-size: 80%; } </style> # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">Python for R Users</a></td><td>✓ done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">pandas basics</a></td><td>✓ done</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Joins, merges, group-by recipes</a></td><td>✓ done</td></tr> <tr><td><b>4</b></td><td><b>Regression and A/B tests with statsmodels</b> <i>(you are here)</i></td><td>← current</td></tr> <tr><td>5</td><td>End-to-end interview scenario</td><td>upcoming</td></tr> </table> --- # Two Interfaces ```python # 1. Formula interface (R-style) — usually what you want import statsmodels.formula.api as smf fit = smf.ols("fare_usd ~ distance_mi + C(city)", data=rides).fit() # 2. Matrix interface import statsmodels.api as sm X = sm.add_constant(rides[["distance_mi"]]) y = rides["fare_usd"] fit = sm.OLS(y, X).fit() ``` -- **Use the formula interface unless you have a specific reason not to.** It's shorter, handles factors automatically, and looks like R. -- The convention I'll use throughout: `import statsmodels.formula.api as smf`. --- # OLS with Formulas ```python fit = smf.ols("fare_usd ~ distance_mi + duration_min + C(city)", data=rides).fit() print(fit.summary()) ``` -- The formula syntax is **identical to R**: - `+` adds terms - `:` is interaction only - `*` is interaction + main effects - `-` removes terms - `I(x^2)` for arithmetic - `C(x)` wraps a variable as a factor (Python needs this; R has factors built in) --- # Reading the Summary ```python fit.params # Series of coefficients fit.bse # standard errors fit.tvalues # t-statistics fit.pvalues # p-values fit.conf_int() # confidence intervals fit.rsquared fit.nobs fit.resid # residuals fit.fittedvalues # ŷ ``` -- For a clean DataFrame summary: ```python pd.DataFrame({ "coef": fit.params, "se": fit.bse, "t": fit.tvalues, "p": fit.pvalues }).round(3) ``` --- # Robust Standard Errors ```python # Heteroskedasticity-robust (HC3 is the modern default) fit = smf.ols("y ~ x", data=df).fit(cov_type="HC3") # Cluster-robust fit = smf.ols("y ~ x", data=df).fit( cov_type="cluster", cov_kwds={"groups": df["driver_id"]} ) ``` -- R / dplyr equivalent: `lm_robust(..., clusters = driver_id)` from `estimatr`, or `coeftest(fit, vcov = vcovHC(fit, type = "HC3"))`. -- **Important:** the cluster `groups` is a Series, not a column name string. --- # Fixed Effects **Small N:** use `C(...)` directly. ```python fit = smf.ols("fare_usd ~ distance_mi + C(driver_id) + C(city)", data=rides).fit() ``` -- **Many fixed effects (thousands):** use `linearmodels.PanelOLS` instead. ```python from linearmodels.panel import PanelOLS panel = rides.set_index(["driver_id", "pickup_at"]) fit = PanelOLS.from_formula( "fare_usd ~ distance_mi + EntityEffects + TimeEffects", data=panel ).fit(cov_type="clustered", cluster_entity=True) ``` -- `linearmodels` is the third-party panel-data package econometricians use. Not installed by default, but standard at any econ-research-adjacent shop. --- # Logit / Probit / GLM ```python # Logistic regression logit = smf.logit("accepted ~ distance_mi + C(city)", data=requests).fit() # Probit probit = smf.probit("accepted ~ distance_mi", data=requests).fit() # Custom GLM glm = smf.glm("accepted ~ distance_mi", data=requests, family=sm.families.Binomial()).fit() ``` -- Output is the same shape as `ols`. Coefficients are on the log-odds scale; `np.exp(fit.params)` gives odds ratios. --- # A/B Test Inference: Three Ways **Way 1: t-test on the means** ```python from scipy import stats t, p = stats.ttest_ind( ab.loc[ab["treatment"] == 1, "spend_usd"], ab.loc[ab["treatment"] == 0, "spend_usd"], equal_var=False # Welch's t-test ) ``` -- **Way 2: OLS regression (the econometrician's t-test)** ```python fit = smf.ols("spend_usd ~ treatment", data=ab).fit(cov_type="HC3") print(fit.summary()) ``` -- The coefficient on `treatment` is the ATE; its SE and p-value answer the same question as the t-test. **Identical with HC2/HC3 SEs.** This is the version you should write in the interview — it generalizes to fixed effects, clustering, and interactions. --- # Confidence Intervals ```python ate = fit.params["treatment"] ate_se = fit.bse["treatment"] ate_ci = fit.conf_int().loc["treatment"] print(f"ATE: {ate:.3f} (SE = {ate_se:.3f})") print(f"95% CI: [{ate_ci[0]:.3f}, {ate_ci[1]:.3f}]") ``` -- **Variance reduction trick:** add fixed effects. ```python fit = smf.ols("spend_usd ~ treatment + C(city)", data=ab).fit(cov_type="HC3") ``` The treatment estimate is still unbiased (because of randomization), but more precise. --- # Difference-in-Differences ```python df["post"] = (df["t"] >= 10).astype(int) df["treated"] = (df["city"] == "A").astype(int) df["did"] = df["post"] * df["treated"] fit = smf.ols( "outcome ~ post + treated + did + C(city) + C(t)", data=df ).fit(cov_type="cluster", cov_kwds={"groups": df["city"]}) ``` -- The coefficient on `did` is the DiD estimate. Two-way fixed effects (city + time) absorb the level effects; `did` cleanly identifies the treatment effect under parallel-trends. --- # Common Traps **1. NaNs are dropped silently.** Compare `fit.nobs` to `len(data)` to see how many rows you lost. Source of "my coefficients changed when I added a control" mysteries. -- **2. Categorical reference level.** `C(city)` uses the first city alphabetically. Override with `C(city, Treatment(reference="SF"))`. -- **3. Cluster argument.** Pass a Series, not a string. -- **4. Formula API copies the data.** Fine for most sizes, slow for very large data. Use the matrix interface for big data. --- class: inverse, center, middle # Interview Questions --- # Q1. Regress fare on distance + duration + city with robust SEs. Report coefficients. *Hint: `smf.ols("y ~ x + C(g)", data=df).fit(cov_type="HC3")`.* -- ```python fit = smf.ols("fare_usd ~ distance_mi + duration_min + C(city)", data=rides).fit(cov_type="HC3") pd.DataFrame({"coef": fit.params, "se": fit.bse, "t": fit.tvalues, "p": fit.pvalues}).round(4) ``` --- # Q2. Compute the A/B test treatment effect on spend, with a 95% CI. *Hint: OLS of `spend ~ treatment` with HC3 SEs. `.params`, `.conf_int()`.* -- ```python fit_ab = smf.ols("spend_usd ~ treatment", data=ab).fit(cov_type="HC3") ate = fit_ab.params["treatment"] ate_ci = fit_ab.conf_int().loc["treatment"] print(f"ATE: {ate:.3f}, 95% CI: [{ate_ci[0]:.3f}, {ate_ci[1]:.3f}]") ``` Three lines. This is the answer to "is the A/B test significant." --- # Q3. Estimate a difference-in-differences with city and month fixed effects. *Hint: create `did = treated * post`, include it in the formula, cluster SEs by city.* -- ```python panel["did"] = panel["treated"] * panel["post"] did_fit = smf.ols( "outcome ~ treated + post + did + C(city) + C(month)", data=panel ).fit(cov_type="cluster", cov_kwds={"groups": panel["city"]}) did_fit.params["did"] ``` The `did` coefficient is the DiD estimate. Cluster at the treatment-assignment level. --- class: inverse # The Key Takeaways <br> ### 1. `import statsmodels.formula.api as smf` — then write R-style formulas. 90% of what you need. -- <br> ### 2. The OLS-as-A/B-test pattern is the right way to do A/B inference — generalizes to fixed effects and clustering for free. -- <br> ### 3. For panels with many fixed effects, use `linearmodels.PanelOLS`. It's not in the default install but it's standard. --- # Course Map <table> <tr><th>#</th><th>Module</th><th>Status</th></tr> <tr><td>1</td><td><a href="../module-01/slides.html">Python for R Users</a></td><td>✓ done</td></tr> <tr><td>2</td><td><a href="../module-02/slides.html">pandas basics</a></td><td>✓ done</td></tr> <tr><td>3</td><td><a href="../module-03/slides.html">Joins, merges, group-by recipes</a></td><td>✓ done</td></tr> <tr><td>4</td><td>Regression and A/B tests with statsmodels <i>(just finished)</i></td><td>✓ done</td></tr> <tr><td><b>5</b></td><td><b>End-to-end interview scenario</b></td><td>next</td></tr> </table>