---
title: "Multiple linear regression: model building (part 1) (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')

# Use a color blind friendly color palette throughout doc
library(tidyverse)
cb_palette <- c("#0072B2", "#D55E00", "black", "#E69F00", "#56B4E9", "#009E73", "#F0E442",  "#CC79A7")
scale_colour_discrete <- function(...) scale_colour_manual(values = cb_palette, ...)
scale_fill_discrete   <- function(...) scale_fill_manual(values = cb_palette, ...)
theme_set(theme_bw())
```


## Warm-up {-}


### Example 1: Research questions

:::{.incremental}
- *Predictive research questions*   
    - Goal: Build a model using our sample data that we can use to predict outcomes outside our sample.
    - Approach: Use any available predictors that help us reach this goal!
    - Interpretability: We don't necessarily care about whether our model turns out to be complicated (within limits!).
    - Context: "Machine learning" is often associated with predictive questions.

- *Descriptive & inferential research questions*    
    - Goal: Better understand and make inferences about specific relationships of interest (within and beyond our sample).
    - Approach: Predictors are defined by the relationships of interest.
    - Interpretability: We need to be able to interpret the model!
    
- *Causal research questions*   
    - Goal: A type of inferential research question that specifically seeks to establish causality.
    - Approach: Include the predictor(s) of interest *AND* all potential confounders.
    - Interpretability: We need to be able to interpret the model!

:::    

\
\
\

For each research question / goal below (from Prof Johnson's stint at the CDC's Division of Vector-Borne Diseases), identify what type it is.

Each of these questions were answered using statistical models!

a. Does a new mosquito repellent effectively prevent mosquito bites?
b. Use blood samples and other info sent into the lab to determine if somebody contracted the West Nile Virus.
c. How much West Nile Virus do mosquitos carry in their saliva and does this differ by species and age?
    

## Exercises {-}

In model building, we might feel tempted to throw more and more things into our model.
In these exercises, we'll use the `penguins` data to explore some nuances and limitations of this approach:

```{r}
# Load packages & data
library(tidyverse)
data(penguins)
```


To get started, the `flipper_len` variable currently measures flipper length in mm. Let's create and save a *new* variable named `flipper_len_cm` which measures flipper length in cm. NOTE: There are 10mm in a cm.

```{r}
penguins <- penguins %>% 
    mutate(flipper_len_cm = flipper_len / 10)
```

Run the code chunk below to build a bunch of models that you'll be exploring in the exercises: 

```{r}
penguin_model_1 <- lm(bill_len ~ flipper_len, penguins)
penguin_model_2 <- lm(bill_len ~ flipper_len_cm, penguins)
penguin_model_3 <- lm(bill_len ~ flipper_len + flipper_len_cm, penguins)
penguin_model_4 <- lm(bill_len ~ body_mass, penguins)
penguin_model_5 <- lm(bill_len ~ flipper_len + body_mass, penguins)
```


### Exercise 1: Modeling bill length by flipper length

What can a penguin's flipper (arm) length tell us about their bill length? To answer this question, we'll consider 3 of our models:

model              predictors
------------------ ---------------------------
`penguin_model_1`  `flipper_len`
`penguin_model_2`  `flipper_len_cm`
`penguin_model_3`  `flipper_len + flipper_len_cm`


Plots of the first two models are below:

```{r}
ggplot(penguins, aes(y = bill_len, x = flipper_len)) + 
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

ggplot(penguins, aes(y = bill_len, x = flipper_len_cm)) + 
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)
```

a. *Before* examining the model summaries, check your intuition. Do you think the `penguin_model_2` R-squared will be less than, equal to, or more than that of `penguin_model_1`? Similarly, how do you think the `penguin_model_3` R-squared will compare to that of `penguin_model_1`?

b. Check your intuition: Examine the R-squared values for the three penguin models and summarize how these compare.

```{r}
summary(penguin_model_1)$r.squared
summary(penguin_model_2)$r.squared
summary(penguin_model_3)$r.squared
```

c. Explain why your observation in part b makes sense. Support your reasoning with a plot of just the 2 *predictors*: `flipper_len` vs `flipper_len_cm`.

```{r}

```

d. In `summary(penguin_model_3)`, the `flipper_len_cm` coefficient is `NA`. Explain why this makes sense. THINK: The `flipper_len_cm` coefficient would tell us the change in the expected bill length per centimeter change in flipper length, while holding flipper length in millimeters constant
    
    BONUS: For those of you that have taken MATH 236 (Linear Alebra), the NA is the result of matrices that are not of full rank!


### Exercise 2: Incorporating `body_mass`   

Let's now consider models of `bill_len` using `flipper_len` and/or `body_mass`:

model              predictors                 R-squared
------------------ -------------------------- ------------------
`penguin_model_1`  `flipper_len`              0.431
`penguin_model_4`  `body_mass`                0.354
`penguin_model_5`  `flipper_len + body_mass`  TBD (do NOT calculate)
 
 
a. Which is the better predictor of `bill_len`: `flipper_len` or `body_mass`? Provide some numerical evidence.

b. `penguin_model_5` incorporates both `flipper_len` and `body_mass` as predictors. *Before* examining a model summary, ask your gut: Will the `penguin_model_5` R-squared be close to 0.354, close to 0.431, or greater than 0.6?

c. Check your intuition. Report the `penguin_model_5` R-squared and summarize how this compares to that of `penguin_model_1` and `penguin_model_4`.

d. Explain why your observation in part c makes sense. Support your reasoning with a plot of the 2 *predictors*: `flipper_len` vs `body_mass`.

```{r}

```


### Exercise 3: Redundancy and Multicollinearity

The exercises above illustrate special phenomena in multivariate modeling:

- two predictors are **redundant** if they contain the same exact information
- two predictors are **multicollinear** if they are strongly associated (they contain very similar information) but are not completely redundant.

Recall that we examined 5 models:

model              predictors
------------------ ---------------------------
`penguin_model_1`  `flipper_len`
`penguin_model_2`  `flipper_len_cm`
`penguin_model_3`  `flipper_len + flipper_len_cm`
`penguin_model_4`  `body_mass`
`penguin_model_5`  `flipper_len + body_mass`


a. Which model had *redundant* predictors and which predictors were these?
b. Which model had *multicollinear* predictors and which predictors were these?
c. In general, what happens to the R-squared value if we add a *redundant* predictor to a model: will it decrease, stay the same, increase by a small amount, or increase by a significant amount?
d. Similarly, what happens to the R-squared value if we add a *multicollinear* predictor to a model: will it decrease, stay the same, increase by a small amount, or increase by a significant amount?


### Exercise 4: Considerations for strong models

Let's dive deeper into important considerations when building a *strong* model. We'll use a subset of the penguins data for exploring these ideas.

```{r}
# For illustration purposes only, take a sample of 10 penguins.
# We'll discuss this code later in the course!
set.seed(155)
penguins_small <- sample_n(penguins, size = 10) %>%
  mutate(flipper_len = jitter(flipper_len))
```

Check out the relationship of `bill_len` with `flipper_len` for this sample:

```{r}
penguins_small %>% 
  ggplot(aes(y = bill_len, x = flipper_len)) + 
  geom_point()
```


Consider 3 models of bill length.
Letting Y be `bill_len` and X be `flipper_len`:

- 1 predictor:  $E[Y | X] = \beta_0 + \beta_1 X$
- 2 predictors: $E[Y | X] = \beta_0 + \beta_1 X + \beta_2 X^2$
- 9 predictors: $E[Y | X] = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \beta_5 X^5 + \beta_6 X^6 + \beta_7 X^7 + \beta_8 X^8 + \beta_9 X^9$

These models are estimated below:

```{r}
# A model with 1 predictor (flipper_len)
poly_mod_1 <- lm(bill_len ~ flipper_len, penguins_small)

# A model with 2 predictors (flipper_len and flipper_len^2)
poly_mod_2 <- lm(bill_len ~ poly(flipper_len, 2), penguins_small)

# A model with 9 predictors (flipper_len, flipper_len^2, ... on up to flipper_len^9)
poly_mod_9 <- lm(bill_len ~ poly(flipper_len, 9), penguins_small)
```

a. Before doing any analysis, which of the three models do you think will be best?
b. Calculate the R-squared values of these 3 models. Which model do you think is best?

```{r}
summary(poly_mod_1)$r.squared
summary(poly_mod_2)$r.squared
summary(poly_mod_9)$r.squared
```

c. Check out plots depicting the relationship estimated by these 3 models. Which model do you think is best?

```{r}
# A plot of model 1
ggplot(penguins_small, aes(y = bill_len, x = flipper_len)) + 
    geom_point() + 
    geom_smooth(method = "lm", se = FALSE)
```

```{r}
# A plot of model 2
ggplot(penguins_small, aes(y = bill_len, x = flipper_len)) + 
    geom_point() + 
    geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)
```

```{r}
# A plot of model 9
ggplot(penguins_small, aes(y = bill_len, x = flipper_len)) + 
    geom_point() + 
    geom_smooth(method = "lm", formula = y ~ poly(x, 9), se = FALSE)
```


### Exercise 5: Reflecting on these investigations

a. List 3 of your favorite foods. Now imagine making a dish that combines all of these foods. Do you think it would taste good?
b. Too many good things doesn't make necessarily make a better thing. Model 9 demonstrates that it's always *possible* to get a perfect R-squared of 1, but there are drawbacks to putting more and more predictors into our model. Answer the following about model 9:
    - How easy would it be to interpret this model?
    - Would you say that this model captures the general trend of the relationship between `bill_len` and `flipper_len`?
    - How well do you think this model would generalize to penguins that were not included in the `penguins_small` sample? For example, would you expect these new penguins to fall on the wiggly model 9 curve?

 
### Exercise 6: Overfitting

a. We say that `poly_mod_9` is **overfit** to the 10 penguins in the `penguins_small` dataset. To understand this term, apply `poly_mod_9` to the 344 penguins in the original dataset. In your own words, describe what "overfit" means.

```{r}
penguins %>% 
  ggplot(aes(y = bill_len, x = flipper_len)) + 
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 9), se = FALSE, data = penguins_small) +
  geom_point(data = penguins_small, color = "red", size = 3)
```

b. Check out the following [xkcd comic](https://xkcd.com/2048/). Which plot pokes fun at overfitting?

![](https://imgs.xkcd.com/comics/curve_fitting_2x.png)


c. Check out some other goodies:

![](https://pbs.twimg.com/media/EbI7FNxX0AU_gIR.jpg)

![](https://miro.medium.com/max/800/1*cT-z5lx-c0phjaw-iVUxvA.jpeg)


### Exercise 7: Questioning R-squared
 
Zooming out, explain some limitations of relying on R-squared to measure the strength / usefulness of a model.


### Exercise 8: Adjusted R-squared

We've observed that, unless a predictor is redundant with another, R-squared will increase. Even if that predictor is strongly multicollinear with another. Even if that predictor isn't a good predictor! Thus if we only look at R-squared we might get overly greedy. We can check our greedy impulses a few ways. We take a more in depth approach in STAT 253, but one quick alternative is reported right in our model `summary()` tables. **Adjusted R-squared** includes a *penalty* for incorporating more and more predictors. Mathematically (where $n$ is the sample size and $p$ is the number of non-intercept coefficients):

$$
\text{Adjusted } R^2 = 1 - (1 - R^2) \left( \frac{n-1}{n-p-1} \right)
$$

As we add more predictors into a model, R-squared increases, BUT the number of non-intercept coefficients $p$ *also* increases.
Thus unlike R-squared, Adjusted R-squared can *decrease* if the information that a predictor contributes to a model (reflected by R-squared) isn't enough to offset the complexity it adds to that model (reflected by $p$).
Consider two models:

```{r}
example_mod_1 <- lm(bill_len ~ species, penguins)
example_mod_2 <- lm(bill_len ~ species + island, penguins)
```


a. Check out the summaries for the 2 example models and summarize your observations in this table:

Model                         R-squared   Adj R-squared
---------------------------- ----------- ---------------
bill_len ~ species              ???           ???
bill_len ~ species + island     ???           ???    


b. In general, how does a model's Adjusted R-squared compare to the R-squared? Is it greater, less than, or equal to the R-squared?
c. How did the R-squared change from example model 1 to model 2? How did the Adjusted R-squared change?
d. Explain what it is about `island` that resulted in a decreased Adjusted R-squared. Note: it's not necessarily the case that `island` is a bad predictor on its own!
e. Pulling this all together, which is the "better" of these 2 models?


### Reflection

Today we looked at some cautions surrounding indiscriminately adding variables to a model. Summarize key takeaways.

> **Response:** Put your response here.


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!