Module 4: Regressions and A/B Tests with statsmodels

class: center, middle, inverse, title-slide

.title[
# Module 4: Regressions and A/B Tests with statsmodels
]
.subtitle[
## The module where Python finally feels comfortable for an econometrician
]

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">Python for R Users</a></td><td>✓ done</td></tr>
<tr><td>2</td><td><a href="../module-02/slides.html">pandas basics</a></td><td>✓ done</td></tr>
<tr><td>3</td><td><a href="../module-03/slides.html">Joins, merges, group-by recipes</a></td><td>✓ done</td></tr>
<tr><td><b>4</b></td><td><b>Regression and A/B tests with statsmodels</b> <i>(you are here)</i></td><td>← current</td></tr>
<tr><td>5</td><td>End-to-end interview scenario</td><td>upcoming</td></tr>
</table>

---

# Two Interfaces

```python
# 1. Formula interface (R-style) — usually what you want
import statsmodels.formula.api as smf
fit = smf.ols("fare_usd ~ distance_mi + C(city)", data=rides).fit()

# 2. Matrix interface
import statsmodels.api as sm
X = sm.add_constant(rides[["distance_mi"]])
y = rides["fare_usd"]
fit = sm.OLS(y, X).fit()
```

**Use the formula interface unless you have a specific reason not to.** It's shorter, handles factors automatically, and looks like R.

The convention I'll use throughout: `import statsmodels.formula.api as smf`.

---

# OLS with Formulas

```python
fit = smf.ols("fare_usd ~ distance_mi + duration_min + C(city)", data=rides).fit()
print(fit.summary())
```

The formula syntax is **identical to R**:

- `+` adds terms
- `:` is interaction only
- `*` is interaction + main effects
- `-` removes terms
- `I(x^2)` for arithmetic
- `C(x)` wraps a variable as a factor (Python needs this; R has factors built in)

---

# Reading the Summary

```python
fit.params              # Series of coefficients
fit.bse                 # standard errors
fit.tvalues             # t-statistics
fit.pvalues             # p-values
fit.conf_int()          # confidence intervals
fit.rsquared
fit.nobs
fit.resid               # residuals
fit.fittedvalues        # ŷ
```

For a clean DataFrame summary:

```python
pd.DataFrame({
    "coef": fit.params, "se": fit.bse,
    "t": fit.tvalues,   "p": fit.pvalues
}).round(3)
```

---

# Robust Standard Errors

```python
# Heteroskedasticity-robust (HC3 is the modern default)
fit = smf.ols("y ~ x", data=df).fit(cov_type="HC3")

# Cluster-robust
fit = smf.ols("y ~ x", data=df).fit(
    cov_type="cluster",
    cov_kwds={"groups": df["driver_id"]}
)
```

R / dplyr equivalent: `lm_robust(..., clusters = driver_id)` from `estimatr`, or `coeftest(fit, vcov = vcovHC(fit, type = "HC3"))`.

**Important:** the cluster `groups` is a Series, not a column name string.

---

# Fixed Effects

**Small N:** use `C(...)` directly.

```python
fit = smf.ols("fare_usd ~ distance_mi + C(driver_id) + C(city)", data=rides).fit()
```

**Many fixed effects (thousands):** use `linearmodels.PanelOLS` instead.

```python
from linearmodels.panel import PanelOLS
panel = rides.set_index(["driver_id", "pickup_at"])
fit = PanelOLS.from_formula(
    "fare_usd ~ distance_mi + EntityEffects + TimeEffects",
    data=panel
).fit(cov_type="clustered", cluster_entity=True)
```

`linearmodels` is the third-party panel-data package econometricians use. Not installed by default, but standard at any econ-research-adjacent shop.

---

# Logit / Probit / GLM

```python
# Logistic regression
logit = smf.logit("accepted ~ distance_mi + C(city)", data=requests).fit()

# Probit
probit = smf.probit("accepted ~ distance_mi", data=requests).fit()

# Custom GLM
glm = smf.glm("accepted ~ distance_mi", data=requests,
              family=sm.families.Binomial()).fit()
```

Output is the same shape as `ols`. Coefficients are on the log-odds scale; `np.exp(fit.params)` gives odds ratios.

---

# A/B Test Inference: Three Ways

**Way 1: t-test on the means**

```python
from scipy import stats
t, p = stats.ttest_ind(
    ab.loc[ab["treatment"] == 1, "spend_usd"],
    ab.loc[ab["treatment"] == 0, "spend_usd"],
    equal_var=False  # Welch's t-test
)
```

**Way 2: OLS regression (the econometrician's t-test)**

```python
fit = smf.ols("spend_usd ~ treatment", data=ab).fit(cov_type="HC3")
print(fit.summary())
```

The coefficient on `treatment` is the ATE; its SE and p-value answer the same question as the t-test. **Identical with HC2/HC3 SEs.** This is the version you should write in the interview — it generalizes to fixed effects, clustering, and interactions.

---

# Confidence Intervals

```python
ate    = fit.params["treatment"]
ate_se = fit.bse["treatment"]
ate_ci = fit.conf_int().loc["treatment"]

print(f"ATE: {ate:.3f} (SE = {ate_se:.3f})")
print(f"95% CI: [{ate_ci[0]:.3f}, {ate_ci[1]:.3f}]")
```

**Variance reduction trick:** add fixed effects.

```python
fit = smf.ols("spend_usd ~ treatment + C(city)", data=ab).fit(cov_type="HC3")
```

The treatment estimate is still unbiased (because of randomization), but more precise.

---

# Difference-in-Differences

```python
df["post"]    = (df["t"] >= 10).astype(int)
df["treated"] = (df["city"] == "A").astype(int)
df["did"]     = df["post"] * df["treated"]

fit = smf.ols(
    "outcome ~ post + treated + did + C(city) + C(t)",
    data=df
).fit(cov_type="cluster", cov_kwds={"groups": df["city"]})
```

The coefficient on `did` is the DiD estimate. Two-way fixed effects (city + time) absorb the level effects; `did` cleanly identifies the treatment effect under parallel-trends.

---

# Common Traps

**1. NaNs are dropped silently.** Compare `fit.nobs` to `len(data)` to see how many rows you lost. Source of "my coefficients changed when I added a control" mysteries.

**2. Categorical reference level.** `C(city)` uses the first city alphabetically. Override with `C(city, Treatment(reference="SF"))`.

**3. Cluster argument.** Pass a Series, not a string.

**4. Formula API copies the data.** Fine for most sizes, slow for very large data. Use the matrix interface for big data.

---
class: inverse, center, middle

# Interview Questions

---

# Q1. Regress fare on distance + duration + city with robust SEs. Report coefficients.

*Hint: `smf.ols("y ~ x + C(g)", data=df).fit(cov_type="HC3")`.*

```python
fit = smf.ols("fare_usd ~ distance_mi + duration_min + C(city)",
              data=rides).fit(cov_type="HC3")
pd.DataFrame({"coef": fit.params, "se": fit.bse,
              "t": fit.tvalues, "p": fit.pvalues}).round(4)
```

---

# Q2. Compute the A/B test treatment effect on spend, with a 95% CI.

*Hint: OLS of `spend ~ treatment` with HC3 SEs. `.params`, `.conf_int()`.*

```python
fit_ab = smf.ols("spend_usd ~ treatment", data=ab).fit(cov_type="HC3")
ate    = fit_ab.params["treatment"]
ate_ci = fit_ab.conf_int().loc["treatment"]
print(f"ATE: {ate:.3f}, 95% CI: [{ate_ci[0]:.3f}, {ate_ci[1]:.3f}]")
```

Three lines. This is the answer to "is the A/B test significant."

---

# Q3. Estimate a difference-in-differences with city and month fixed effects.

*Hint: create `did = treated * post`, include it in the formula, cluster SEs by city.*

```python
panel["did"] = panel["treated"] * panel["post"]
did_fit = smf.ols(
    "outcome ~ treated + post + did + C(city) + C(month)",
    data=panel
).fit(cov_type="cluster", cov_kwds={"groups": panel["city"]})
did_fit.params["did"]
```

The `did` coefficient is the DiD estimate. Cluster at the treatment-assignment level.

---
class: inverse

# The Key Takeaways

<br>

### 1. `import statsmodels.formula.api as smf` — then write R-style formulas. 90% of what you need.

<br>

### 2. The OLS-as-A/B-test pattern is the right way to do A/B inference — generalizes to fixed effects and clustering for free.

<br>

### 3. For panels with many fixed effects, use `linearmodels.PanelOLS`. It's not in the default install but it's standard.

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td><a href="../module-01/slides.html">Python for R Users</a></td><td>✓ done</td></tr>
<tr><td>2</td><td><a href="../module-02/slides.html">pandas basics</a></td><td>✓ done</td></tr>
<tr><td>3</td><td><a href="../module-03/slides.html">Joins, merges, group-by recipes</a></td><td>✓ done</td></tr>
<tr><td>4</td><td>Regression and A/B tests with statsmodels <i>(just finished)</i></td><td>✓ done</td></tr>
<tr><td><b>5</b></td><td><b>End-to-end interview scenario</b></td><td>next</td></tr>
</table>