---
title: "Multiple regression principles (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')
```


## Warm-up {-}


Let's revisit the bikeshare data:

```{r}
# Load packages & import data
library(tidyverse)
bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv") %>% 
  rename(rides = riders_registered)
```

Our goal is to explain / predict how registered ridership varies from day to day.

To this end, we'll build various **multiple linear regression** models of `rides` by different combinations of the possible predictors.

```{r}
# Check out the data
head(bikes)
```


### Example 1: Review visualization

Let's build a model of `rides` by `windspeed` (quantitative) and `weekend` status (categorical).

```{r}
bikes %>% 
  select(rides, windspeed, weekend) %>% 
  head()
```

a. Write a population model statement for this regression model.

b. Plot & describe, in words, the relationship between these 3 variables.


```{r}
# Plot of rides vs windspeed & weekend
# Include a representation of the linear regression model
# HINT: Start with a plot of rides vs windspeed, then add an aesthetic for weekend!

```


### Example 2: Review model

Let's build the model:

```{r}
bike_model_1 <- lm(rides ~ windspeed + weekend, data = bikes)
coef(summary(bike_model_1))
```

The estimated sample model equation is therefore:

E[rides | windspeed, weekendTRUE] = 4738.38 - 63.97 * windspeed - 925.16 * weekendTRUE

This model equation is represented by 2 lines, one corresponding to weekends and the other to weekdays. Simplify the model equation above for weekdays and weekends:       
    
weekdays: rides = ___ - ___ windspeed   

weekends: rides = ___ - ___ windspeed


### Example 3: Review coefficient interpretation


a. The intercept coefficient, 4738.38, represents the *intercept* of the sub-model for weekdays, the reference category. What's its *contextual* interpretation?
    

b. The `windspeed` coefficient, -63.97, represents the *shared slope* of the weekend and weekday sub-models. What's its *contextual* interpretation?
    

c. The `weekendTRUE` coefficient, -925.16, represents the *change in intercept* for the weekend vs weekday sub-model. What's its *contextual* interpretation?


::: {.callout-note title = "Interpreting MLR coefficients (2 predictors)"}

Consider a multiple linear regression model with 2 predictors:

$$E[Y | X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$

Then

- $\beta_0$ ("beta 0") = expected value of $Y$ when $X_1$ and $X_2$ are both 0, ie. when all quantitative predictors are set to 0 and the categorical predictors are set to their reference levels.    

- $\beta_1$ ("beta 1") = change in the expected value of $Y$ for two cases whose values of $X_1$ differ by 1 but who are identical in terms of $X_2$.

- $\beta_2$ ("beta 2") = change in the expected value of $Y$ for two cases whose values of $X_2$ differ by 1 but who are identical in terms of $X_1$.

\
\

Note: If $X_j$ is an indicator variable, then "for two cases whose values of $X_j$ differ by 1" means "comparing the group $X_j$ to the reference level"
:::

\


::: {.callout-note title="Theory Box: Fitting multiple linear regression models"}

Consider a multiple linear regression model with $p$ predictors:

$$E[Y|X_1,X_2,...,X_p] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p$$

We use the *least squares criterion* to estimate $(\beta_0, \beta_1, ..., \beta_p)$, i.e. we pick the coefficients *minimize the sum of squared residuals*.
Just as with simple linear regression, we can write out a formula for these least square estimates.
But for multiple linear regression, this requires linear algebra.

Suppose we have $n$ data points.
Using matrix notation, we can collect our observed response values into a vector $Y$, our predictor values into a matrix $X$, and our regression coefficients into a vector $\beta$.
The column of 1s in $X$ reflects our model's inclusion of an intercept:

$$Y = \left(\begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{array} \right) \;\;\;\; \text{ and } \;\;\;\;
X = \left(\begin{array}{ccccc} 
1 & X_{11} & X_{12} & \cdots & X_{1p} \\
1 & X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & \vdots & \cdots & \vdots \\
1 & X_{n1} & X_{n2} & \cdots & X_{np} \\
\end{array} \right) \;\;\;\; \text{ and } \;\;\;\; \beta = \left(\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{array} \right)$$


Then we can express the model $E[Y_i|X_{i1}, X_{i2},...,X_{ip}] = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}$ for $i \in \{1,...,n\}$ as:

$$E[Y|X] = X\beta$$
Further, let $\hat{\beta}$ denote the vector of sample estimated $\beta$ and $\hat{Y}$ denote the vector of predictions / model values:

$$\hat{Y} = X \hat{\beta}$$

Thus the sum of squared residuals is

$$\sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = (Y - \hat{Y})^T(Y - \hat{Y}) = (Y - X \hat{\beta})^T(Y - X \hat{\beta})$$

and the following formula for sample coefficients $\hat{\beta}$ are the **least squares estimates** of $\beta$, i.e. they *minimize* the sum of squared residuals:

$$\hat{\beta} = (X^TX)^{-1}X^TY$$


:::


## Exercises {-}


Thus far, we've explored a couple examples of multiple regression models that have 2 predictors, 1 quantitative and 1 categorical.
Let's explore what happens when *both* predictors are categorical or *both* are quantitative!


### Exercise 1: 2 categorical predictors -- visualization

Let's explore the relationship of `rides` with 2 *categorical* predictors, `weekend` status and `season`.

The below code plots `rides` vs `season`.

Modify this code to *also* include information about `weekend`.

HINT: Remember the visualization *principle* that additional categorical predictors require some sort of grouping mechanism / mechanism that distinguishes between the 2 groups.

```{r}
# rides vs season
bikes %>% 
  ggplot(aes(y = rides, x = season)) + 
  geom_boxplot()

# rides vs season AND weekend
bikes %>%
  ggplot(aes(y = rides, x = season, ___ = ___)) +
  geom_boxplot()
```


### Exercise 2: follow-up

a. Describe (in words) the relationship of ridership with season & weekend status.
In doing so, address the following:

- No matter the weekend status, how is ridership associated with season?
- No matter the season, how is ridership associated with weekend status?
- Does the overall relationship of ridership with season & weekend status appear to be *strong*? 


b. A model of `rides` by `season` alone would be represented by only 4 expected outcomes, 1 for each season. Considering this and the plot above, how do you *anticipate* a model of `rides` by `season` and `weekend` status will be represented?        
    - 2 lines, 1 for each weekend status
    - 8 lines, 1 for each possible combination of season & weekend
    - 2 expected outcomes, 1 for each weekend status
    - 8 expected outcomes, 1 for each possible combination of season & weekend
    

### Exercise 3: 2 categorical predictors -- build the model

Let's build the multiple regression model of `rides` vs `season` and `weekend`:
    
```{r}
bike_model_2 <- lm(rides ~ weekend + season, bikes)
coef(summary(bike_model_2))
```

Thus estimated model formula is given by:

E[rides | weekend, season] = 4260.45 - 912.33 weekendTRUE - 116.38 seasonspring + 438.44 seasonsummer - 1719.06 seasonwinter

a. Use this model to predict the ridership on the following days:    
    
```{r}
# a fall weekday
4260.45 - 912.33*___ - 116.38*___  + 438.44*___ - 1719.06*___

# a winter weekday    
4260.45 - 912.33*___ - 116.38*___  + 438.44*___ - 1719.06*___

# a fall weekend day        
4260.45 - 912.33*___ - 116.38*___  + 438.44*___ - 1719.06*___

# a winter weekend day
4260.45 - 912.33*___ - 116.38*___  + 438.44*___ - 1719.06*___
```

Using the calculations in the chunk above, calculate the *differences* between various predictions:

```{r}
# diff in predicted ridership on a winter weekday vs a fall weekday
# (that is, the diff btwn 2 days that are at the same time of week but DIFFERENT seasons)
# WHERE HAVE YOU OBSERVED THIS NUMBER BEFORE?!
___ - ___
```

```{r}
# diff in predicted ridership on a winter weekend vs a winter weekday
# (that is, the diff btwn 2 days that are in the same season but DIFFERENT times of the week)
# WHERE HAVE YOU OBSERVED THIS NUMBER BEFORE?!
___ - ___
```


b. We only made 4 predictions here. How many *possible* predictions does this model produce? Is this consistent with your intuition in the previous exercise?


### Exercise 4: 2 categorical predictors -- interpret the model

Use your above predictions and visualization to fill in the below interpretations of the model coefficients.

Hint: What is the consequence of plugging in 0 or 1 for the different `weekend` and `season` categories?    


a. Interpreting 4260: On average, we expect there to be 4260 riders on (weekdays/weekends) during the (fall/spring/summer/winter).    

b. Interpreting -912: On average, *in any season*, we expect there to be 912 (more/fewer) riders on weekends than on ___.

c. Interpreting -1719: On average, *on both weekdays and weekends*, we expect there to be 1719 (more/fewer) riders in winter than in ___.


### Exercise 5: 2 quantitative predictors -- visualization   

Next, consider the relationship between `rides` and 2 *quantitative* predictors: `windspeed` and `temp_feel`.
Check out the plot of this relationship below which reflects the visualization *principle* that quantitative variables require some sort of numerical scaling mechanism -- `rides` and `windspeed` get numerical axes, and `temp_feel` gets a color scale:

![](https://mac-stat.github.io/images/155/bikes_multivar_3.png)

Modify the code below to recreate this plot.

```{r}
bikes %>%
  ggplot(aes(y = rides, x = windspeed, ___ = ___)) +
  geom_point()
```

OPTIONAL: Check out this *interactive* plot which allows us to explore this point cloud in 3D.

```{r}
library(plotly)
bikes %>% 
  plot_ly(x = ~windspeed, y = ~temp_feel, z = ~rides, 
          type = "scatter3d", 
          mode = "markers",
          marker = list(size = 5, color = ~rides, colorscale = "Viridis"))
```


### Exercise 6: follow-up

Describe (in words) the relationship of ridership with windspeed & temperature.
Make sure to comment on all 3 variables, strength of the relationship, and any outliers (if relevant).


### Exercise 7: 2 quantitative predictors -- modeling

Let's build the multiple regression model of `rides` vs `windspeed` and `temp_feel`:
    
```{r}
bike_model_3 <- lm(rides ~ windspeed + temp_feel, data = bikes)
coef(summary(bike_model_3))
```

Thus estimated model formula is

E[rides | windspeed, temp_feel] = -24.06 - 36.54 windspeed + 55.52 temp_feel


a. Interpret the intercept coefficient, -24.06, in context.


b. Interpret the `windspeed` coefficient, -36.54, in context. 


c. Interpret the `temp_feel` coefficient, 55.52, in context. 


### Exercise 8: Which is "best"?

We've now observed 3 different models of ridership, each having 2 predictors.
The R-squared values of these models, along with those of the simple linear regression models with each predictor alone, are summarized below.
    
model                     predictors                R-squared
------------------------- ------------------------- ------------
`bike_model_1`            `windspeed` & `weekend`   0.119
`bike_model_2`            `weekend` & `season`      0.349
`bike_model_3`            `windspeed` & `temp_feel` 0.310
`bike_model_4`            `windspeed`               0.047
`bike_model_5`            `temp_feel`               0.296
`bike_model_6`            `weekend`                 0.074
`bike_model_7`            `season`                  0.279


a. Which model does the best job of explaining the variability in ridership from day to day?


b. If you could only pick one predictor, which would it be?


c. What happens to R-squared when we add a second predictor to our model, and why does this make sense? For example, how does the R-squared for model 1 (with both `windspeed` and `weekend`) compare to those of model 4 (only `windspeed`) and model 6 (only `weekend`)?


d. Are 2 predictors always better than 1? Provide evidence and explain why this makes sense.


### Exercise 9: Principles of interpretation

These exercises have revealed some **principles** behind interpreting model coefficients, summarized below.

Review and confirm that these make sense.

---

**Principles of interpretation**

Consider a multiple linear regression model:

$$E[Y | X_1, X_2, ..., X_p] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p$$

We can interpret the coefficients as follows:    

- $\beta_0$ ("beta 0") is the y-intercept. It describes the average value of $Y$ when $X_1, X_2,..., X_p$ are all 0, ie. when all quantitative predictors are set to 0 and the categorical predictors are set to their reference levels.    

- $\beta_i$ ("beta i") is the coefficient of $X_i$.    

    - If $X_i$ is quantitative, $\beta_i$ describes the average change in $Y$ associated with a 1-unit increase in $X_i$ while at a fixed set of the other $X$.  
    
    - If $X_i$ represents a category of a categorical variable, $\beta_i$ describes the average difference in $Y$ between this category and the reference category, while at a fixed set of the other $X$.    

---


\
\
\
\


## Extra practice {-}


### Exercise 10: Practice 1

Consider the relationship of `rides` vs `weekend` and `weather_cat`.    

a. Construct a visualization of this relationship.   
b. Construct a model of this relationship.    
c. Interpret the first 3 model coefficients. 


### Exercise 11: Practice 2

Consider the relationship of `rides` vs `temp_feel` and `humidity`.    

a. Construct a visualization of this relationship.    
b. Construct a model of this relationship.    
c. Interpret the first 3 model coefficients.    
    

### Exercise 12: Practice 3

Consider the relationship of `rides` vs `temp_feel` and `weather_cat`.    

a. Construct a visualization of this relationship.    
b. Construct a model of this relationship.    
c. Interpret the first 3 model coefficients.    
    

### Exercise 13: CHALLENGE

We've explored models with 2 predictors.
What about 3 predictors?!
Consider the relationship of `rides` vs `temp_feel`, `humidity`, AND `weekend`.

a. Construct a visualization of this relationship.    
b. Construct a model of this relationship.    
c. Interpret each model coefficient. 


### Exercise 14: Explore more examples

Outside of class, check out the below videos that include extra examples:

- [Interpreting multivariate models (Part I)](https://youtu.be/DnXwu1OWgMM) and [Part II](https://youtu.be/vQ_QZcZCEMU) ([slides](https://drive.google.com/file/d/1fiJL1IbReg6RJmjxRideYzYTyQtWR3Or/view?usp=sharing))


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!