class: center, middle, inverse, title-slide

.title[
# Module 1: Python for R Users
]
.subtitle[
## The cheat sheet
]

---

<style type="text/css">
.remark-code, .remark-inline-code { font-size: 80%; }
.remark-slide-content { padding: 1em 2em; }
.small { font-size: 80%; }
</style>

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td><b>1</b></td><td><b>Python for R Users</b> <i>(you are here)</i></td><td>← current</td></tr>
<tr><td>2</td><td>pandas basics: filter / mutate / arrange / summarise</td><td>upcoming</td></tr>
<tr><td>3</td><td>Joins, merges, group-by recipes</td><td>upcoming</td></tr>
<tr><td>4</td><td>Regression and A/B tests with statsmodels</td><td>upcoming</td></tr>
<tr><td>5</td><td>End-to-end interview scenario</td><td>upcoming</td></tr>
</table>

---

# Before We Start: Setup

**Step 1.** Build the CSV data files (once, in your terminal):

```bash
cd ~/Desktop/sandbox/python-for-r-users
Rscript data/build_csvs.R
```

--

**Step 2.** Install **Positron** — a free IDE from the RStudio team that handles R *and* Python natively. Same keyboard shortcuts you already know.

Download from [positron.posit.co](https://positron.posit.co/), drag to Applications.

--

**Step 3.** Open the course folder: `File → Open Folder → python-for-r-users`

Open `module-01/exercise.py`. Highlight code → `Cmd+Return` to send to the Python console — just like RStudio.

--

**Test it:** highlight and run this:

```python
print(1 + 1)
```

If you see `2` in the console, you're ready.

---

# The Big Differences in 90 Seconds

| | R | Python |
|---|---|---|
| Indexing starts at | 1 | **0** |
| Inclusive ranges? | Yes (`1:5` = 1..5) | **No** (`range(1,5)` = 1..4, excludes the right end) |
| Assignment | `<-` (or `=`) | `=` |
| Indentation | cosmetic | **load-bearing** |
| Vectorized? | Most operations are | Lists no, NumPy/pandas yes |
| Missing values | `NA` (typed) | `None` / `np.nan` (no per-type `NA`; `np.nan` is a float) |
| Booleans | `TRUE`/`FALSE` | `True`/`False` |

--

The biggest mental shift: **indentation is part of the syntax**. Blocks have no curly braces. The indent (4 spaces by convention) IS the block.

---

# Two Things From That Table Worth Explaining

**Why does `range(1,5)` include 1 but not 5?**

Python uses "half-open" intervals: include the start, exclude the end. That way `range(n)` gives exactly `n` items, and consecutive ranges like `range(0,5)` and `range(5,10)` neither overlap nor leave a gap. Same rule for list slicing: `x[1:3]` = items at index 1 and 2, not 3.

--

**What does "vectorized" mean?**

In R, `c(1,2,3) + 10` gives `c(11,12,13)` — the operation applies to every element automatically. In Python, `[1,2,3] + 10` is an **error** — plain lists don't do element-wise math. You need NumPy or pandas for that (Module 2).

---

# Variables and Types

```python
x = 42                          # int
y = 3.14                        # float
name = "Allison"                # str
greeting = f"Hello, {name}!"    # → "Hello, Allison!"
is_ready = True                 # bool (capitalized!)
maybe = None                    # Python's NULL
```

--

**f-strings:** the `f` before the quote says "replace `{...}` with the variable's value." Like R's `glue::glue("Hello, {name}!")` or `paste0("Hello, ", name, "!")`. Without the `f`, the braces are literal text.

--

R-isms that don't work: `x <- 5` does not assign. Python reads it as the comparison `x < -5` ("is x less than minus 5?"), which here evaluates to `False` and leaves `x` unchanged. Use `=`.

---

# The Four Core Data Structures

```python
nums  = [1, 2, 3, 4, 5]             # list  — ordered, mutable
point = (3.0, 4.0)                  # tuple — ordered, immutable
ride  = {"id": 42, "fare": 15.5}    # dict  — key-value (named list)
seen  = {1, 2, 3}                   # set   — unordered, unique
```

--

Indexing on lists:

```python
nums[0]      # → 1       (zero-indexed!)
nums[-1]     # → 5       (negative = from end)
nums[1:3]    # → [2, 3]  (right-exclusive)
```

--

**Lists are NOT vectorized.** `[1, 2, 3] + 1` is a `TypeError`. For vectorized math you need NumPy or pandas (Module 2).

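
A minimal sketch of what "not vectorized" looks like with plain lists only (no NumPy). The variable names are illustrative, and the workaround uses a list comprehension, which gets its own slide later in this module:

```python
nums = [1, 2, 3]

# + between two lists concatenates; it does not add element-wise
joined = nums + [1]                  # → [1, 2, 3, 1]

# list + number raises TypeError; catch it so the script keeps running
try:
    nums + 1
except TypeError as err:
    message = str(err)

# Plain-Python workaround: build a new list one element at a time
plus_one = [n + 1 for n in nums]     # → [2, 3, 4]

print(joined, plus_one)
```

The original `nums` is left untouched here; both `joined` and `plus_one` are new lists.
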
--

**Mutable vs immutable:** "mutable" means you can change it after creating it. `nums[0] = 99` works on a list (mutable) but raises a `TypeError` on a tuple (immutable).

**Watch out:** in R, modifying a vector makes a copy. In Python, modifying a list changes the **original** — if you pass it to a function, the function can alter your data.

---

# Comprehensions: The Python Idiom

```python
# List comprehension
squares = [x ** 2 for x in range(10)]
# → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# With a filter
evens = [x for x in range(20) if x % 2 == 0]
# → [0, 2, 4, ..., 18]

# Dict comprehension
square_map = {x: x ** 2 for x in range(5)}
# → {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
```

--

**Reading that last line:** `{x: x ** 2 for x in range(5)}` means "for each `x` from 0 to 4, create a key-value pair where the key is `x` and the value is `x` squared." The `{}` braces make it a dict (not a list). Like `setNames((0:4)^2, 0:4)` in R (values first, names second), but more readable.

--

R equivalent: `sapply()` / `purrr::map()`. Pythonic style prefers comprehensions to loops for transformations + filters.

---

# Control Flow: if / for

```python
x = 3
if x > 0:                  # the : starts the block
    print("positive")      # indentation = the block body
elif x == 0:
    print("zero")
else:
    print("negative")
# → positive
```

```python
nums = [10, 20, 30]
for item in nums:          # like R's: for (item in nums) { ... }
    print(item)
# → 10, 20, 30
```

The `:` after the condition is required. Indentation defines the block (no `{}`). Parentheses around the condition are not needed.

---

# Control Flow: while

```python
x = 3
while x > 0:
    print(x)
    x -= 1                 # shorthand for x = x - 1
# → 3, 2, 1
```

--

**R → Python cheat sheet:**

| R | Python |
|---|---|
| `if (x > 0) { ... }` | `if x > 0:` + indent |
| `for (i in 1:5) { ... }` | `for i in range(1, 6):` + indent |
| `while (x > 0) { ... }` | `while x > 0:` + indent |

---

# Functions

```python
def mph(distance, minutes):
    """Miles per hour."""            # ← docstring (triple quotes)
    return distance / (minutes / 60)

print(mph(5, 15))                    # → 20.0
```

The `"""..."""` is a **docstring** — Python's convention for documenting what a function does, what it takes, and what it returns. Like writing `#' @description` / `#' @param` in R's roxygen. Optional, but good practice. Once written, `help(mph)` prints it.

--

```python
# Default arguments — same idea as R's f <- function(x, y = 2)
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"

print(greet("Maya"))                 # → Hello, Maya!
print(greet("Maya", "Welcome"))      # → Welcome, Maya!
```

--

```python
# Lambda = anonymous function, like R's \(x) x^2
square = lambda x: x ** 2
print(square(4))                     # → 16
```

---

# Imports

```python
import pandas as pd                          # whole module, alias as pd
import numpy as np
from statsmodels.formula.api import ols      # specific name from a module
```

--

In R, `library(dplyr)` dumps all functions into your namespace: you just call `filter()` without saying where it came from (and get conflicts when two packages share a name).

In Python, `import pandas as pd` keeps everything behind a **prefix**: `pd.read_csv()`, `np.mean()`. Names come in unprefixed only when you ask for them by name, as with `ols` above. Verbose, but you can always trace a function back to its module.

---

# 10 Things That Trip Up R Users

.pull-left[
1. **Zero-indexing.** `x[0]` is the first element.
2. **Indentation is syntax.** A misplaced space breaks your script.
3. **No vectorized math on lists.** Use NumPy/pandas.
4. **`==` for comparison, `=` for assignment.** No `<-`.
5. **`True`/`False` are capitalized.**
]

.pull-right[
<ol start="6">
<li><b><code>None</code> instead of <code>NULL</code>/<code>NA</code>.</b> NumPy uses <code>np.nan</code> for missing floats.</li>
<li><b>Methods vs functions.</b> <code>nums.append(6)</code> changes <code>nums</code> itself (nothing returned). <code>sorted(nums)</code> returns a new sorted list and leaves <code>nums</code> alone.
In R, both would return a new object. In Python, <code>.method()</code> often mutates; <code>function()</code> often copies.</li>
<li><b>Mutability matters.</b> Lists/dicts mutable; tuples/strings immutable.</li>
<li><b><code>for</code> loops are fine here.</b> Python culture is OK with them.</li>
<li><b>No <code>%>%</code> pipe.</b> Python uses method chaining instead: <code>df.query(...).groupby(...).mean()</code> reads left to right, like a pipe. (pandas' <code>.filter()</code> selects by label, not by row condition, so it is not dplyr's <code>filter()</code>.)</li>
</ol>
]

---
class: inverse, center, middle

# Interview Questions

---

# Q1. Write a function that computes miles per hour given distance (mi) and time (min).

*Hint: `def`, division, `return`.*

--

```python
def mph(distance, minutes):
    return distance / (minutes / 60)

print(mph(5, 15))    # → 20.0
```

---

# Q2. Given a list of ride dicts, compute the average fare for SF rides only.

```python
rides = [{"city": "SF", "fare": 12},
         {"city": "NY", "fare": 18},
         {"city": "SF", "fare": 9}]
```

*Hint: list comprehension with a filter, then `sum() / len()`.*

--

```python
sf_fares = [r["fare"] for r in rides if r["city"] == "SF"]
print(sum(sf_fares) / len(sf_fares))    # → 10.5

# Note: mean() is not a builtin. Use sum()/len(),
# statistics.mean() from the standard library,
# numpy.mean(), or once in pandas: series.mean()
```

---

# Q3. Read `data/rides.csv` with pandas and show the first 5 rows.

*Hint: `pd.read_csv()` and `.head()`.*

--

```python
import pandas as pd

rides = pd.read_csv("data/rides.csv")
rides.head()
```

This gives you a `DataFrame` — the Python equivalent of R's `data.frame` / `tibble`. Module 2 covers everything you can do with one.

---

# Q4. Invert a dict (swap keys and values).

```python
original = {"a": 1, "b": 2, "c": 3}
# → should become {1: "a", 2: "b", 3: "c"}
```

*Hint: dict comprehension + `.items()`.*

--

```python
inverted = {v: k for k, v in original.items()}
print(inverted)    # → {1: 'a', 2: 'b', 3: 'c'}
```

**Reading it:** `{v: k ...}` = "in the new dict, `v` is the key, `k` is the value" (the `:` separates key from value, same as `{"a": 1}`).
`.items()` gives `(key, value)` pairs from the original; the comprehension flips them.

---

# Q5. Given a list of numbers, return the unique even numbers in sorted order.

```python
nums = [4, 7, 2, 8, 4, 1, 6, 3, 8, 2]
```

*Hint: set comprehension for uniqueness + filter, then `sorted()`.*

--

```python
print(sorted({n for n in nums if n % 2 == 0}))
# → [2, 4, 6, 8]
```

`{...}` with no `key: value` is a **set comprehension** — gives unique values. `sorted()` returns a list.

---
class: inverse

# The Key Takeaways

<br>

### 1. Python is "R syntax with zero-indexing, explicit imports, and load-bearing whitespace."

Once you internalize those three, you can read any Python data script.

--

<br>

### 2. List/dict/set comprehensions replace most uses of `for` loops and `sapply`/`map`.

Memorize the pattern.

--

<br>

### 3. Lists are not vectorized — you need NumPy or pandas for that.

Module 2 starts there.

---

# Course Map

<table>
<tr><th>#</th><th>Module</th><th>Status</th></tr>
<tr><td>1</td><td>Python for R Users <i>(just finished)</i></td><td>✓ done</td></tr>
<tr><td><b>2</b></td><td><b>pandas basics</b></td><td>next</td></tr>
<tr><td>3</td><td>Joins, merges, group-by recipes</td><td>upcoming</td></tr>
<tr><td>4</td><td>Regression and A/B tests with statsmodels</td><td>upcoming</td></tr>
<tr><td>5</td><td>End-to-end interview scenario</td><td>upcoming</td></tr>
</table>