---
title: "Introduction to multiple regression (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')
```


::: {.callout-note title = "Organize your files"}

This **qmd file** is where you'll type notes, code, etc.
Directions:

- Save this file in the `inclass_activities` sub-folder of the `stat155` folder you created before today's class. Use a file name related to the activity number and/or today's date (eg: "activity 8" or "8 multiple linear regression").
:::


## Warm-up {-}

### Example 1

Let's explore some data on penguins.
You can find a **codebook** for these data by typing `?penguins` in your *console* (not qmd).


```{r}
# Load packages
library(tidyverse)
data(penguins)
penguins <- penguins %>% 
  filter(species != "Adelie", bill_len < 57)

# Check it out
head(penguins)
```

Our goal is to build a model that we can use to get good predictions of penguins' flipper ("arm") lengths.
Consider 2 simple linear regression models of `flipper_len` by penguin `bill_len` and `species`: 

```{r}
# Model 1: R-squared = 0.0288
summary(lm(flipper_len ~ bill_len, penguins))

# Model 2: R-squared = 0.7014
summary(lm(flipper_len ~ species, penguins))
```

How might we improve our predictions of `flipper_len` using only these 2 predictors?
If we improve the predictions, what do you think is a reasonable range of possible values for the new R-squared?


### Example 2

Consider a simple linear regression model of `flipper_len` by `bill_len`:

```{r}
penguins %>% 
  ggplot(aes(y = flipper_len, x = bill_len)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```

Thoughts?
What's going on here?
How does this highlight the limitations of a simple linear regression model?


### Example 3

The `cps` dataset contains employment information collected by the U.S. Current Population Survey (CPS) in 2018.
We can use these data to explore wages among 18-34 year olds.
The original codebook is [here](https://cps.ipums.org/cps/resources/codebooks/cpsmar18.pdf).

```{r}
# Import data
cps <- read_csv("https://mac-stat.github.io/data/cps_2018.csv") %>% 
  select(-education, -hours) %>% 
  filter(age >= 18, age <= 34) %>% 
  filter(wage < 250000)
```

```{r}
# Check it out
head(cps)
```

We can use a simple linear regression model to summarize the relationship of `wage` with `marital` status:

```{r}
# Build the model
wage_mod <- lm(wage ~ marital, data = cps)

# Summarize the model
coef(summary(wage_mod))
```

What do you / don't you conclude from this model?
How does it highlight the limitations of a simple linear regression model?


\
\
\
\


**Reflection: Why are multiple regression models so useful?**

We can put more than 1 predictor into a regression model!
Adding predictors to models...

- Predictive viewpoint: Helps us better predict the response
- Descriptive viewpoint: Helps us better understand the isolated (causal) effect of a variable by holding constant confounders, other variables that may distort the relationship of interest...


## Exercises {-}

In the coming weeks, we'll explore how to visualize, interpret, build, and evaluate multiple linear regression models.
First, we'll explore some foundations using a model of penguin `flipper_len` by just 2 predictors: `bill_len` (quantitative) and `species` (categorical).


### Exercise 1: Visualizing the relationship

We've learned how to visualize the relationship of `flipper_len` by `bill_len` alone:

```{r}
penguins %>% 
  ggplot(aes(y = flipper_len, x = bill_len)) + 
  geom_point()
```

a. THINK: How might we change the scatterplot points to *also* indicate information about penguin `species`? (There's more than 1 approach!)

b. Try out your idea by modifying the code below. If you get stuck, talk with the tables around you!

```{r}
penguins %>%
  ggplot(aes(y = flipper_len, x = bill_len, ___ = ___)) +
  geom_point()
```


### Exercise 2: Visualizing the model

We've also learned that a simple linear regression model of `flipper_len` by `bill_len` alone can be represented by a line:

```{r}
penguins %>% 
  ggplot(aes(y = flipper_len, x = bill_len)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```

a. THINK: Reflecting on your plot of `flipper_len` by `bill_len` and `species` in Exercise 1, how do you think a *multiple* regression model of `flipper_len` using *both* of these predictors would be represented?

b. Check your intuition below by modifying the code below to include `species` in this plot, as you did in Exercise 1.

```{r}
penguins %>%
  ggplot(aes(y = flipper_len, x = bill_len, ___ = ___)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```


### Exercise 3: Intuition

Your plot in Exercise 2 demonstrated that the **multiple linear regression model** of `flipper_len` by `bill_len` and `species` is represented by 2 lines. 
Let's interpret the punchlines!

For each question, provide an answer along with evidence from the model lines that supports your answer.

a. What's the relationship between `flipper_len` and `bill_len`, no matter a penguin's `species`? What's your evidence?

b. Does the rate of increase in `flipper_len` with `bill_len` appear to differ between the two `species`? What's your evidence?

c. What's the relationship between `flipper_len` and `species`, no matter a penguin's `bill_len`? What's your evidence?

d. The formula for this linear regression model is represented by the below formula:

E[`flipper_len` | `species`, `bill_len`] = $\beta_0$ + $\beta_1$ `bill_len` + $\beta_2$ `speciesGentoo`

Based on parts a-c: 

- Will $\beta_1$ be positive or negative? Why?

- Will $\beta_2$ be positive or negative? Why?


### Exercise 4: Model formula

Of course, there's a formula behind the multiple regression model.
We can obtain this using the usual `lm()` function.

```{r}
# Build the model
penguin_mod <- lm(flipper_len ~ bill_len + species, data = penguins)

# Summarize the model
coef(summary(penguin_mod))
```


a. In the `lm()` function, how did we communicate that we wanted to model `flipper_len` by *both* `bill_len` and `species`?


b. Complete the following model formula. (Was your answer in Exercise 3 part d correct?!)    
    
    E[flipper_len | bill_len, speciesGentoo] = ___ + ___ * bill_len + ___ * speciesGentoo


### Exercise 5: Sub-model formulas

Ok. We now have a single *formula* for the model.

And we observed earlier that this formula is represented by two lines: one describing the relationship between `flipper_len` and `bill_len` for `Chinstrap` penguins and the other for `Gentoo` penguins.

Let's bring these ideas together.

Utilize the model formula to obtain the equations of these two lines, i.e. to obtain the *sub-model formulas* for the 2 species. Hint: Plug speciesGentoo = 0 and speciesGentoo = 1.        
    
Chinstrap: E[`flipper_len` | `bill_len`] = ___ + ___ `bill_len`

Gentoo:    E[`flipper_len` | `bill_len`] = ___ + ___ `bill_len`


### Exercise 6: coefficients -- physical interpretation

Reflecting on Exercise 5, let's interpret what the model coefficients tell us about the *physical* properties of the two 2 sub-model lines.
Choose the correct option given in parentheses:        

a. The intercept coefficient, 127.75, is the intercept of the line for (Chinstrap / Gentoo) penguins.

b. The `bill_len` coefficient, 1.40, is the (intercept / slope) of both lines.

c. The `speciesGentoo` coefficient, 22.85, indicates that the (intercept / slope) of the line for Gentoo is 22.85mm higher than the (intercept / slope) of the line for Chinstrap. Similarly, since the lines are parallel, the line for Gentoo is 22.85mm higher than the line for Chinstrap at any `bill_len`.
    
    
### Exercise 7: coefficients -- contextual interpretation

Next, interpret each coefficient in a *contextually* meaningful way.
What do they tell us about penguin flipper lengths?!

a. Interpret 127.75 (intercept of the Chinstrap line).

b. Interpret 1.40 (slope of both lines). For both Chinstrap and Gentoo penguins, the expected flipper length...

c. Interpret 22.85. At any `bill_len`, the expected flipper length...
    
    
### Exercise 8: Prediction

Now that we better understand the model, let's use it to predict flipper lengths!
Recall the model and visualization:

E[`flipper_len` | `bill_len`, `speciesGentoo`] = 127.75 + 1.40 * `bill_len` + 22.85 * `speciesGentoo`


```{r}
penguins %>% 
  ggplot(aes(y = flipper_len, x = bill_len, color = species)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```

a. Predict the flipper length of a Chinstrap penguin with a 50mm long bill. Make sure your calculation is consistent with the plot.       

```{r}
127.75 + 1.40*___ + 22.85*___
```    

To check your work, mark your prediction on the plot using a red dot:
Make sure it's where it should be!

```{r}
penguins %>% 
  ggplot(aes(y = flipper_len, x = bill_len, color = species)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_point(aes(x = 50, y = ____), color = "red", size = 4)
```


b. Predict the flipper length of a Gentoo penguin with a 50mm long bill. Make sure your calculation is consistent with the plot.       

```{r}
127.75 + 1.40*___ + 22.85*___
```    

```{r}
penguins %>% 
  ggplot(aes(y = flipper_len, x = bill_len, color = species)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_point(aes(x = 50, y = ____), color = "red", size = 4)
```


c. Use the `predict()` function to confirm your predictions in parts a and b.

```{r}
# Confirm the calculation in part a
predict(penguin_mod,
        newdata = data.frame(bill_len = ___, species = "___"))

# Confirm the calculation in part b
predict(penguin_mod,
        newdata = data.frame(bill_len = ___, species = "___"))
```


### Exercise 9: R-squared

Finally, recall that improving our predictions was one motivation for multiple linear regression (using 2 predictors instead of 1).
To this end, consider the R-squared values of the simple linear regression models that use just *one* predictor at a time:

```{r}
mod_bill <- lm(flipper_len ~ bill_len, data = penguins)
summary(mod_bill)

mod_species <- lm(flipper_len ~ species, data = penguins)
summary(mod_species)
```


a. If you had to use only *1* of our 2 predictors, which would give the better predictions of `flipper_len`?


b. What do you *guess* is the R-squared of our multiple regression model that uses *both* of these predictors? Why?


c. Check your intuition. How does the R-squared of our multiple regression model compare to that of the 2 separate simple linear regression models? 

```{r}
summary(penguin_mod)
```


### Reflection

You've now explored your first multiple regression model!
Thus you likely have a lot of questions about what's to come.
What are they?


## Extra exercises {-}


### Exercise 10: Challenge Part 1

In this activity, we explored the model of `flipper_len` by a quantitative predictor (`bill_len`) *and* a categorical predictor (`species`).
Challenge yourself to anticipate what visualizations / models might be like with 2 *quantitative* predictors: `bill_len` and `bill_dep`.


#### Part a

First, let's visualize the relationship of `flipper_len` with `bill_len` and `bill_dep`.
To do so, start with the scatterplot of `flipper_len` and `bill_len`, and *modify* it.
THINK: How might you change the points to reflect their corresponding `bill_dep`?

```{r}
# Do NOT include a geom_smooth (which wouldn't make sense here)
penguins %>% 
  ggplot(aes(y = flipper_len, x = bill_len)) + 
  geom_point()
```

#### Part b

In Part a, you represented a 3D point cloud in 2D.
BONUS (just for fun & for better understanding that this is a 3D relationship): Visualize this same data in something more like 3D:

```{r}
# Don't worry about this code!
# It's specialized to this exercise and we'll rarely use it
library(plotly)
penguins %>% 
  plot_ly(x = ~bill_len, y = ~bill_dep, z = ~flipper_len,
          type = "scatter3d")
```


#### Part c

We can think of your plot above as a 3D point cloud.
And the estimated formula for the linear regression model is

E[`flipper_len` | `bill_len`, `bill_dep`] = 179.2 + 2.38 `bill_len` - 5.15 `bill_dep`

```{r}
quant_model <- lm(flipper_len ~ bill_len + bill_dep, penguins)
coef(summary(quant_model))
```

**QUESTION:** 

The model of `flipper_len` by `bill_len` and `species` is represented by 2 parallel lines.
How would you draw / represent the model here?!


#### Part d

Challenge: Interpret the `(Intercept)` and `bill_len` coefficients.
What *physical* meaning do they have, i.e. where do they come into the "drawing" of the model?
What *contextual* meaning do they have?


\
\
\
\


### Exercise 11: Challenge Part 2

Next, challenge yourself to anticipate what visualizations / models might be like with 2 *categorical* predictors: `species` and `sex`.


#### Part a

First, let's visualize the relationship of `flipper_len` with `species` and `sex`.
To do so, start with the side-by-side boxplots of `flipper_len` by `species`, and *modify* them.

```{r}
penguins %>% 
  ggplot(aes(y = flipper_len, x = species)) + 
  geom_boxplot()
```

#### Part b

The estimated formula for the linear regression model is

E[`flipper_len` | `species`, `sex`] = 191.8 + 21.07 `speciesGentoo` + 8.39 `sexmale`

```{r}
cat_model <- lm(flipper_len ~ species + sex, penguins)
summary(cat_model)
```

**QUESTION:** 

The model of `flipper_len` by `species` alone would only produce 2 possible predictions, one for Chinstraps and one for Gentoos.
What about the model here?!
How many possible predictions would it make?
HINT: Checking out the plot again might help.


#### Part c

Challenge: Interpret the `(Intercept)` and `speciesGentoo` coefficients.
What *physical* meaning do they have, i.e. where do they come into the "drawing" of the model?
What *contextual* meaning do they have?


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!