---
title: "Multiple linear regression: interaction terms practice (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup, eval=TRUE}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')

# Use a color blind friendly color palette throughout doc
library(tidyverse)
cb_palette <- c("black", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
scale_colour_discrete <- function(...) scale_colour_manual(values = cb_palette, ...)
scale_fill_discrete   <- function(...) scale_fill_manual(values = cb_palette, ...)
theme_set(theme_bw())
```


::: {.callout-note title = "Organize your files"}

This **qmd file** is where you'll type notes, code, etc.
Directions:

- Save this file in the `inclass_activities` sub-folder of the `stat155` folder you created before today's class. Use a file name related to the activity number and/or today's date (eg: "activity 12" or "12 interactions").
:::


## Warm-up {-}


### Example 1: Identify the confounder / effect modifier

It's our job to identify important confounders and effect modifiers.
Context is important!


a. Your friend wants to explore the relationship of bikeshare ridership among casual riders with temperature.    
    - Is weekend status a potential *confounder* or *effect modifier* in this relationship?
    - Represent your ideas using a causal diagram. Draw it in your notebook.

```{r echo = FALSE, eval = TRUE, fig.width = 4, fig.height = 3}
library(tidyverse)
bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv") %>% 
  filter(temp_actual < 80)

bikes %>%
  ggplot(aes(y = riders_casual, x = temp_actual, color = weekend)) + 
  geom_point()
```

b. Another friend claims that ice cream causes sun burns, noting that sun burn rates increase as ice cream sales increase. They are mistaken!   
    - What important variable did they *not* consider in their analysis?
    - Is this a potential *confounder* or *effect modifier*?
    - Represent your ideas using a causal diagram. Draw it in your notebook.


\
\
\
\


### Example 2: Interaction or no interaction?

In addition to context, we can use plots to help identify potential interactions among our predictors, i.e. potential effect modifiers.

a. In predicting penguin bill length, is there a "notable" interaction between bill depth and sex? Provide evidence and explain your answer in context.

```{r eval = TRUE}
data(penguins)
penguins %>% 
  na.omit() %>% 
  ggplot(aes(y = bill_len, x = bill_dep, color = sex)) + 
  geom_smooth(method = "lm", se = FALSE)
```

b. In predicting wages, is there a "notable" interaction between education and job industry? Provide evidence and explain what your answer means in context.

```{r eval = TRUE}
read_csv("https://mac-stat.github.io/data/cps_2018.csv") %>% 
  filter(industry %in% c("ag", "construction", "service")) %>% 
  ggplot(aes(y = wage, x = education, color = industry)) + 
  geom_smooth(method = "lm", se = FALSE)
```

c. In predicting bikeshare ridership, is there a "notable" interaction between weekend and user status (whether a user is a *casual* or *registered* rider)? Provide evidence and explain what your answer means in context.

```{r eval = TRUE}
# Don't worry about this wrangling code!
new_bikes <- bikes %>%
  select(riders_casual, riders_registered, weekend, temp_actual) %>%
  pivot_longer(cols = riders_casual:riders_registered, 
               names_to = "user", names_prefix = "riders_",
               values_to = "rides") %>%
  mutate(weekend = factor(weekend))

new_bikes %>% 
  ggplot(aes(y = rides, x = user, fill = weekend)) + 
  geom_boxplot(alpha = .4)
```


d. In predicting bikeshare ridership among registered riders, is there a "notable" interaction between weekend status and season? Provide evidence and explain your answer in context.


```{r eval = TRUE}
bikes %>% 
  ggplot(aes(y = riders_registered, x = season, fill = weekend)) + 
  geom_boxplot(alpha = .4)
```


\
\
\
\


### Example 3: Model formulas

For the toy example below: 

- write out a model formula using $\beta$ notation
- write out sub-model formulas for groups A and B
- sketch / note where each $\beta$ coefficient is represented on the plot
- and indicate whether each $\beta$ coefficient is negative, 0, or positive

```{r eval = TRUE}
# Ignore this code!
read_csv("https://mac-stat.github.io/data/cps_2018.csv") %>% 
  filter(industry %in% c("construction", "service")) %>% 
  rename(y = wage, x = education, z = industry) %>% 
  mutate(z = ifelse(z == "construction", "B", "A")) %>% 
  ggplot(aes(y = y, x = x, color = z)) + 
  geom_smooth(method = "lm", se = FALSE)
```


## Exercises {-}

[Benoit and Marsh (2008)](http://www.kenbenoit.net/pdfs/ajps_348.pdf) explore the 2002 Irish Dail elections, Ireland's version of the U.S. House of Representatives.

Specifically, they explore how the number of `votes` a candidate received is associated with their campaign `spending` (in 1,000s of Euros) and their `incumbent` status: `Yes` = candidate is an incumbent (they are running for RE-election), or `No`. 

Import their data:

```{r eval = TRUE}
campaigns <- read_csv("https://mac-stat.github.io/data/campaign_spending.csv") %>% 
  select(wholename, district, votes, incumbent, spending) %>% 
  mutate(spending = spending / 1000) %>% 
  filter(!is.na(spending))
```

For the first several exercises, we'll consider the following research questions:

1. What role does campaign spending play in elections? 

2. .    
    a. Do candidates that spend more money tend to get more votes? 
    b. How might this depend upon whether a candidate is an incumbent (they are running for RE-election) or a challenger (they are challenging the incumbent)? 


### Exercise 1: Translating scientific questions into statistical questions

a. Check out the variables we have access to in the `campaigns` data, and consider our research Question 1. How might we translate this question into a statistical one? That is, how might we answer using the data we have available? What might you compare?

There is no one *right* answer to this! Brainstorm with your group.

```{r}
head(campaigns)
```

b. Question 2a is a bit more specific than Question 1. Translate this question into a statistical one. Specifically, write out the population model statement in $E[Y | X] = ...$ notation that would answer this question, and note which regression coefficient you would interpret to provide you with an answer.

$$
E[___ | ___] = ...
$$

c. Question 2b is also specific, and builds on Question 2a. Translate this question into a statistical one. Specifically, write out the population model statement for a **multiple** linear regression model in $E[Y | X] = ...$ notation that would answer this question, and note which regression coefficient you would interpret to provide you with an answer.

$$
E[___ | ___] = ...
$$

### Exercise 2: Visualizing Interaction

a. Visualize the relationship between campaign spending and number of votes a candidate received. Include an aesthetic to distinguish this relationship between incumbents and challengers. Do *not* yet include lines of best fit from any statistical model!

```{r}
# Visualization

```

b. Based on this visualization, what are your answers to research questions 2a and 2b? Write your answer in 2-3 sentences, describing general trends you notice, suitable for a general audience.

c. Add lines of best fit from a statistical model that includes an interaction term between incumbent status and spending to your plot from part (a), using `geom_smooth`. Based on your updated plot, do you think including an interaction between incumbent status and spending in a multiple linear regression model would be meaningful in this context? Why or why not?

```{r}
# Visualization with lines of best fit

```

### Exercise 3: Fitting and interpreting models with interaction terms

a. Fit the regression model you wrote out in Exercise 1 (c). *Report* the regression coefficients below and note what each physically represents in your plot of the model lines above. 

```{r}
# Model with interaction term

```

> `(Intercept)`:

> `incumbentYes`:

> `spending`:

> `incumbentYes:spending`:

b. Using the coefficient estimates from part (a), write out *two separate model statements*, one for incumbents and one for challengers. Combine terms (using algebra) when you can! *Hint*: remember the indicator variables video!

* For incumbents:

$$
E[votes | spending] = 
$$

* For challengers:

$$
E[votes | spending] = 
$$

c. Interpret the `incumbent` coefficient, in context. *Make sure to use non-causal language, include units, and talk about averages rather than individual cases.* Is this coefficient scientifically meaningful?

d. Interpret the `spending` and interaction coefficients in your model, in context. *Make sure to use non-causal language, include units, and talk about averages rather than individual cases.* PRO TIP: When at least one of the variables interacting is quantitative, it's often convenient to interpret the interaction coefficient alongside the quantitative variable's coefficient. The former modifies the latter.

e. Based on your interpretation in part (d), and the visualization you made including lines of best fit, do you think that including an interaction term for incumbent status and spending is meaningful, when predicting number of votes? Explain why or why not.

### Exercise 4: Interactions between two categorical variables

Let's return to our data on bike ridership. Suppose we are interested in the relationship between daily ridership (our response variable) and whether a user is a casual or registered rider **and** whether the day falls on a weekend. First, we need to create a binary variable indicating whether a user is a casual or registered rider.

```{r}
# Creating user variable, don't worry about syntax!
new_bikes <- bikes %>%
  dplyr::select(riders_casual, riders_registered, weekend, temp_actual) %>%
  pivot_longer(cols = riders_casual:riders_registered, names_to = "user",
               names_prefix = "riders_", values_to = "rides") %>%
  mutate(weekend = factor(weekend))
```

a. For each of our three relevant variables, `weekend`, `user`, and `rides`, classify them as quantitative or categorical.

> `weekend`:

> `user`:

> `rides`:

b. Make an appropriate visualization to explore the relationship between these three variables.

```{r}
# Visualization

```

c. Is the relationship between ridership and weekend status the same for both registered and casual users? Explain why or why not, referencing the visualization you made in part (b).

d. To reflect what you observed in your visualization, fit a multiple linear regression model with an interaction term between `weekend` and `user` in our model of ridership.

```{r}
# Multiple linear regression model

```

e. Use your model to make 2 predictions:

- Predict casual ridership on a weekday = 

- Predict registered ridership on a weekend = 


f. Use the appropriate coefficient estimate(s) to answer the following. (This is not easy! Your plot can help you assess whether your answers make sense!)

- On weekdays, how many *more* registered vs casual riders would we expect?


- Among casual riders, how many *more* riders would we expect on a weekend vs weekday?


- Among registered riders, how many *fewer* riders would we expect on a weekend vs weekday? NOTE: The answer is NOT 1714.


- Among casual riders, what's the expected ridership on a weekday?


g. Interpret the interaction term from your model, in context. *Make sure to use non-causal language, include units, and talk about averages rather than individual cases.* Just as in Exercise 3, you may find it useful to first write out multiple model statements for different categories defined by one of your categorical variables, and proceed from there! 

### Exercise 5: Interactions between two quantitative variables

```{r}
# A little bit of data wrangling code - let's not focus on this for now
cars <- read_csv("https://mac-stat.github.io/data/used_cars.csv") %>%
  mutate(milage = milage %>% str_replace_all(",","") %>% str_replace(" mi.","") %>% as.numeric(),
         price = price %>% str_replace_all(",","") %>% str_replace("\\$","") %>% as.numeric(),
         age = 2026 - model_year) # 2026 so that yr. 2024 cars are one year old
```

Here we'll explore the relationship between `price`, `milage`, and `age` of a used car. Below is a scatterplot of mileage vs. price, colored by age:

```{r}
cars %>% 
  ggplot(aes(x = milage, y = price, col = age)) +
  geom_point(alpha = 0.5) + # make the points less opaque
  scale_color_viridis_c(option = "H") + # a fun, colorblind-friendly palette!
  theme_classic() # removes the gray background and grid
```

It's a little difficult to tell what exactly is going on here. In particular, does the relationship between mileage and price vary with age of a used car? Let's try adding some fitted lines for cars of different ages.

```{r}
# Ignore where the numbers in geom_abline() came from for now... we'll get there
cars %>% 
  ggplot(aes(x = milage, y = price, col = age)) +
  geom_point(alpha = 0.5) + 
  scale_color_viridis_c(option = "H") + 
  theme_classic() +
  geom_abline(slope = -6.627e-01 + 1.775e-02, intercept = 8.628e+04 -1.395e+03, col = "black") +
  geom_abline(slope = -6.627e-01 + 10 * 1.775e-02, intercept = 8.628e+04 - 10 * 1.395e+03, col = "blue") +
  geom_abline(slope = -6.627e-01 + 30 * 1.775e-02, intercept = 8.628e+04 - 30 * 1.395e+03, col = "green") +
  ggtitle("Black: Age = 1yr, Blue: Age = 10yr, Green: Age = 30yr")
```

a. **Challenge question**: Based on the fitted lines in the plot above, anticipate what the *signs* (positive or negative) of the coefficients in the following interaction model will be:

$$
E[price | age, milage] = \beta_0 + \beta_1 milage + \beta_2 age + \beta_3 milage:age
$$
* $\beta_0$:  Put your response here...

* $\beta_1$:  Put your response here...

* $\beta_2$:  Put your response here...

* $\beta_3$:  Put your response here...

b. Fit a multiple linear regression model with an interaction term between `milage` and `age` in our model of used car `price`.


```{r}
# Multiple linear regression model


# ... now do you see where the numbers in geom_abline() came from?
```


As before, we could choose distinct ages, and interpret the relationship between mileage and price for each of those groups separately. However, since age is *quantitative* and not *categorical*, this doesn't quite give us the whole picture. Instead, we want to know how the relationship between mileage and price changes **for each additional year old** a car is. This is what the interaction coefficient estimates, when the interaction term is between two quantitative variables!

c. Interpret the interaction term, in context. *Make sure to use non-causal language, include units, and talk about averages rather than individual cases.*


### Reflection

Through the exercises above, you practiced visualizing, fitting, and interpreting multiple linear regression models with interaction terms between combinations of categorical and quantitative variables. Think about how the fitted lines looked in situations where you think there was a *meaningful* interaction taking place. How do you think the fitted lines would look if there was *no* meaningful interaction present? Explain your reasoning.


## Extra practice {-}


The `homes` data includes 2006 data on homes in Saratoga County, New York:

```{r}
# Import data
homes <- read_csv("https://mac-stat.github.io/data/homes.csv") %>% 
  mutate(price = Price, size = Lot.Size, new = (New.Construct == 1)) %>% 
  select(price, size, new)
head(homes)
```

Let's explore the relationship of home price ($) with lot size (acres) and whether or not it's new construction (TRUE/FALSE):

```{r}
homes %>% 
  ggplot(aes(y = price, x = size, color = new)) +
  geom_smooth(method = "lm", se = FALSE)
```


### Exercise 6: Concepts

Do any 2 of these 3 variables (price, size, new) *interact*?
If so:

- Which 2 variables interact?
- What's your evidence?
- What does this interaction mean in context?


### Exercise 7: More concepts

Consider the interaction model below:

E[`price` | `size`, `new`] = $\beta_0$ + $\beta_1$ `size` + $\beta_2$ `newTRUE` + $\beta_3$ `size*newTRUE`

Answer the following questions using the plot alone -- don't yet build any models.
Based on our sample data...


a. Is $\beta_0$ negative (< 0), 0, or positive (> 0)?
Provide evidence from the plot and explain what this means in context.

b. Is $\beta_1$ negative (< 0), 0, or positive (> 0)?
Provide evidence from the plot and explain what this means in context.

c. Is $\beta_2$ negative (< 0), 0, or positive (> 0)?
Provide evidence from the plot and explain what this means in context.

d. Is $\beta_3$ negative (< 0), 0, or positive (> 0)?
Provide evidence from the plot and explain what this means in context.


### Exercise 8: Use the model

Build the sample model:

```{r}
home_model <- lm(price ~ size * new, data = homes)
coef(summary(home_model))
```

a. Were your answers to the previous exercise correct?!

b. Provide 2 separate sub-model formulas for new and not new construction.

c. Interpret the `size` and `size:newTRUE` coefficients.


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!