
11 Multiple linear regression: exploring interaction
Settling In
- Sit where you’d like
- Introduce yourself
- Check in with each other.
- Convo Prompt: How is AI changing your life? How do you think AI will change society?
- Help each other get ready to take notes!
- Open your notebook. No computer needed today!
We’ll go through the exercises on paper, but the digital version is on the website (qmd here).
Recap
Thus far, our multiple linear regression models have assumed that multiple predictors have separate, additive associations with the response:
E[Y | X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2
Reality might be messier!
The association of Y with X_1 might differ depending on X_2. In other words, X_2 might modify the effect of X_1 on Y.
No problem!
Our multiple linear regression models can handle this reality via interaction terms.
By the end of this lesson, you should be able to:
- Describe when it would be useful to include an interaction term to a model
- Write a model formula for an interaction model
- Interpret the coefficients in an interaction model in the data context
Today is a day to discover ideas, so no readings or videos to go through before class.
There will be resources for the next class.
This qmd file is where you’ll type notes, code, etc. Directions:
- Save this file in the
inclass_activitiessub-folder of thestat155folder you created before today’s class. Use a file name related to the activity number and/or today’s date (eg: “activity_11” or “11_interaction_explore”).
Warm-up
Guiding question: What job sectors have the highest return on education?
We’ll use data from the 2018 Current Population Survey to explore. The codebook for this data is available here. For now we’ll focus on individuals who have jobs in the management or transportation sectors to simplify our explorations.
# Load packages and data
library(tidyverse)
cps <- read_csv("https://mac-stat.github.io/data/cps_2018.csv")
# Get data on just the management and transportation sectors
cps_sub <- cps %>%
filter(industry %in% c("management", "transportation"))It would be great to know the true effect of years of education on wages. Let’s start by looking at the relationship between these two variables.
ggplot(cps_sub, aes(x = education, y = wage)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)There is a positive correlation between years of education and wages, with a fair bit of spread about the line of best fit. We can fit a simple linear regression model to obtain the intercept and slope of that line.
wage_mod_1 <- lm(wage ~ education, data = cps_sub)
coef(summary(wage_mod_1))Note that our intercept estimate is negative which doesn’t make sense! People with 0 years of education still earn wages! Inspecting the plot, we likely could have better accounted for this with a nonlinear transformation of the education variable, but we will leave this issue aside for now.
Let’s focus on this question: does this simple linear regression model help us understand the true effect of years of education?
Example 1: Confounders
We’ll want to consider confounding variables in order to better answer that question. One possible confounder is industry. Draw a causal diagram showing how industry, years of education, and wage relate, and explain what unfair comparisons result from using a simple linear regression model.
Example 2: Multiple linear regression
Let’s adjust for industry by fitting a multiple linear regression model.
wage_mod_2 <- lm(wage ~ education + industry, cps_sub)
coef(summary(wage_mod_2))Interpret the education and industrytransportation coefficients in the context of the data. (Remember to include units.) How does the relationship between years of education and wages change after adjusting for industry?
Example 3: Visualization
Hold on! We sped ahead too quickly. It’s important to visualize our data thoroughly first. Let’s add industry to our original scatterplot. What do you notice about the lines of best fit for these two industries?
ggplot(cps_sub, aes(x = education, y = wage, color = industry)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)It would be nice to be able to capture the different trend for the two industries. Let’s see if our original multiple linear regression model (wage_mod_1) was able to do so.
# Visualize the relationships from the fitted model
ggplot(cps_sub, aes(y = wage, x = education, color = industry)) +
geom_line(aes(y = wage_mod_2$fitted.values))What do you notice about what our model produces? Based on your coefficient interpretations from earlier, is this behavior what you would have expected? How do you think our multiple linear regression model is limited? How might we try to fix this?
Example 4: Improving the model
In our causal diagram, both years of education and industry affect wages, and one way to capture this is with our model in wage_mod_2:
E[\text{wage} \mid \text{education}, \text{industry}) = \beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation}
Some other ways to capture how wages are affected by years of education and industry could look like this:
- \beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation} + \beta_3 \text{education}^2
- \beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation} + \beta_3 \log(\text{education})
- \beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation} + \beta_3 \text{education}*\text{industrytransportation}
That last type of model is called an interaction model. A general interaction model formula looks like this:
E[Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1*X_2
The outcome Y depends on X_1 and X_2 with the usual multiple linear regression part: \beta_1 X_1 + \beta_2 X_2. But it also includes an interaction term \beta_3 X_1*X_2.
Let’s fit an interaction model for our cps_sub data and explore what relationships our model estimates.
# Fit an interaction model
# NEW SYNTAX: Note the * instead of +
wage_mod_2 <- lm(wage ~ education * industry, cps_sub)
# Visualize the relationships from the interaction model
ggplot(cps_sub, aes(y = wage, x = education, color = industry)) +
geom_line(aes(y = wage_mod_2$fitted.values))
# By default, the interaction model is drawn by geom_smooth
ggplot(cps_sub, aes(y = wage, x = education, color = industry)) +
geom_smooth(method = "lm", se = FALSE)How does our new interaction model compare to our previous one?
Example 5: Interpreting the interaction model
This is a more complex new model! Let’s explore what is going on mathematically by examining the overall model formula and how we can use it to get model formulas for each industry.
# View coefficient estimates
coef(summary(wage_mod_2))Model formulas:
E[wage | education, industry] = -65590.606 + 8678.274 education + 90232.230 transportation - 7580.228 education * transportation
Broken down by industry:
Management:
E[wage | education, industry = management] = -65590.606 + 8678.274 education
Transportation:
E[wage | education, industry = transportation] = -65590.606 + 8678.274 education + 90232.230 - 7580.228 education = (-65590.606 + 90232.230) + (8678.274 - 7580.228)education
= 24641.62 + 1098.046education
Question 1: The intercept coefficient, -65590.606, corresponds to what property of the lines?
- management intercept
- transportation intercept
- how the transportation intercept compares to the management intercept
Thus, how can we interpret this coefficient in the context of the wage analysis?
Question 2: The transportation coefficient, 90232.230, corresponds to what property of the lines?
- management intercept
- transportation intercept
- how the transportation intercept compares to the management intercept
Thus, how can we interpret this coefficient in the context of the wage analysis?
Question 3: The education coefficient, 8678.274, corresponds to what property of the lines?
- management slope
- transportation slope
- how the transportation slope compares to the management slope
Thus, how can we interpret this coefficient in the context of the wage analysis?
Question 4: The interaction coefficient, -7580.228, corresponds to what property of the lines?
- management slope
- transportation slope
- how the transportation slope compares to the management slope
Thus, how can we interpret this coefficient in the context of the wage analysis?
Exercises
Exercise 1: Wages across all industries
The plot below illustrates the relationship between wage and education for all of the industries in our cps dataset.
# Plot
ggplot(cps, aes(y = wage, x = education, color = industry)) +
geom_smooth(method = "lm", se = FALSE)What about this plot indicates that it would be a good idea to fit an interaction model?
What industry will R use as the reference category?
(Challenge!) Before fitting the model in R, write down what you think the model formula will look like.
Fit a model that includes an interaction term between
educationandindustry.
# Fit an interaction model called wage_model
# Display summarized model outputIn what industry do wages increase the most per additional year of education? What is this increase?
Similarly, in what industry do wages increase the least per additional year of education? What is this increase?
Exercise 2: Thinking beyond
Do you think there are other variables (which may or may not be in our cps data) that have an interaction with industry in affecting wages? If you were to fit an interaction model, what results might you expect to find?
Reflection
Through the exercises above, we developed ideas about when to fit interaction models and how to interpret results. Describe what makes sense and what is still unclear about this topic.
Solutions
Warm Up
Example 1
Example 2
Solution
wage_mod_2 <- lm(wage ~ education + industry, cps_sub)
coef(summary(wage_mod_2)) Estimate Std. Error t value Pr(>|t|)
(Intercept) -54505.666 7274.6504 -7.492548 8.015884e-14
education 7973.563 456.6417 17.461311 3.244777e-66
industrytransportation -5420.432 3589.2564 -1.510182 1.310637e-01
After controlling for industry, each additional year of education is associated with a $7974 increase in the average wage.
After controlling for years of education, the average wage for individuals working in the transportation industry is $5420 less than that of those in the management industry.
The change in average wage for each additional year of education is less after accounting for industry
Example 3
Solution
ggplot(cps_sub, aes(x = education, y = wage, color = industry)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
# Visualize the relationships from the fitted model
ggplot(cps_sub, aes(y = wage, x = education, color = industry)) +
geom_line(aes(y = wage_mod_2$fitted.values))
The model with both education and industry by default creates parallel lines. We would need to allow the slopes for education to be different for each industry. The could be done by fitting two separate lines, one for transportation and one for management.
Example 4
Solution
# Fit an interaction model
# NEW SYNTAX: Note the * instead of +
wage_mod_2 <- lm(wage ~ education * industry, cps_sub)
# Visualize the relationships from the interaction model
ggplot(cps_sub, aes(y = wage, x = education, color = industry)) +
geom_line(aes(y = wage_mod_2$fitted.values))
# By default, the interaction model is drawn by geom_smooth
ggplot(cps_sub, aes(y = wage, x = education, color = industry)) +
geom_smooth(method = "lm", se = FALSE)This one model allows the slopes to differ!
Example 5
Solution
Model formulas:
E[wage | education, industry] = -65590.606 + 8678.274 education + 90232.230 transportation - 7580.228 education * transportation
Broken down by industry:
Management:
E[wage | education, industry = management] = -65590.606 + 8678.274 education
Transportation:
E[wage | education, industry = transportation] = -65590.606 + 8678.274 education + 90232.230 - 7580.228 education = (-65590.606 + 90232.230) + (8678.274 - 7580.228)education
= 24641.62 + 1098.046education
Question 1: The intercept coefficient, -65590.606, corresponds to what property of the lines?
- management intercept
In the context of the wage analysis, $-65590.606 is the average wage for workers in the management industry with 0 years of education. However, this is extrapolation, so it doesn’t make sense to interpret.
Question 2: The transportation coefficient, 90232.230, corresponds to what property of the lines?
- how the transportation intercept compares to the management intercept
In the context of the wage analysis, $90232.230 is the difference in average wage between workers in transportation vs. management industry with 0 years of education.
Question 3: The education coefficient, 8678.274, corresponds to what property of the lines?
- management slope
In the context of the wage analysis, for each additional year of education, there is an associated $8678.274 increase in average wage for workers in the management industry.
Question 4: The interaction coefficient, -7580.228, corresponds to what property of the lines?
- how the transportation slope compares to the management slope
In the context of the wage analysis, the is an associated change in average wage for each additional year of education is $7580.228 less for workers in the transportation as compared to the management industry.
Exercises
Exercise 1: Wages across all industries
Solution
The plot below illustrates the relationship between wage and education for all of the industries in our cps dataset.
# Plot
ggplot(cps, aes(y = wage, x = education, color = industry)) +
geom_smooth(method = "lm", se = FALSE)
The industry-specific lines all have different slopes.
ag (first in alphabetical order)
This is a challenge! Compare your prediction to what you see when fitting the model in part d.
# Fit an interaction model called wage_model
wage_model <- lm(wage ~ education*industry, data = cps)
# Display summarized model output
coef(summary(wage_model)) Estimate Std. Error t value
(Intercept) 31475.87521 22370.504 1.40702574
education 61.95396 2039.257 0.03038065
industryconstruction -14427.01189 25740.953 -0.56046923
industryinstallation_production -33208.72359 25346.017 -1.31021469
industrymanagement -97066.48097 23305.235 -4.16500759
industryservice -55462.76415 23229.134 -2.38763810
industrytransportation -6834.25066 27495.549 -0.24855844
education:industryconstruction 2295.51232 2297.659 0.99906577
education:industryinstallation_production 3759.05906 2244.792 1.67456904
education:industrymanagement 8616.31984 2080.190 4.14208220
education:industryservice 4384.72036 2092.523 2.09542317
education:industrytransportation 1036.09210 2409.093 0.43007562
Pr(>|t|)
(Intercept) 1.594509e-01
education 9.757641e-01
industryconstruction 5.751720e-01
industryinstallation_production 1.901533e-01
industrymanagement 3.139604e-05
industryservice 1.697551e-02
industrytransportation 8.037075e-01
education:industryconstruction 3.177870e-01
education:industryinstallation_production 9.405013e-02
education:industrymanagement 3.470009e-05
education:industryservice 3.615851e-02
education:industrytransportation 6.671499e-01
In the management industry, wages increase the most per year of education. The increase is 61.95396 + 8616.31984 = $8678.274 per year. That is, every additional year of education is associated with an average increase of $8678.27 in yearly wages in the management industry.
In the agriculture industry, wages increase the least per year of education. The increase is $61.95 per year. That is, every additional year of education is associated with an average increase of $61.95 in yearly wages in the ag industry.
Exercise 2: Thinking beyond
Solution
If a variable x has an interaction with the industry variable in affecting wages, then the relationship between x and wages must be different by industry. We might suspect that this could be the case for hours worked per week. We can make a plot to verify that this is actually the case:
ggplot(cps, aes(y = wage, x = hours, color = industry)) +
geom_smooth(method = "lm", se = FALSE)