---
title: "Simple linear regression: categorical predictor (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')
```

::: {.callout-note title = "Organize your files"}

This **qmd file** is where you'll type notes, code, etc.
Directions:

- Save this file in the `inclass_activities` sub-folder of the `stat155` folder you created before today's class. Use a file name related to the activity number and/or today's date (eg: "activity_7" or "7_categorical_predictors").
:::


## Warm-up {-}


**Context:** Today we'll explore data on thousands of diamonds to understand how physical characteristics relate to price.

```{r warning=FALSE, message=FALSE}
# Load packages and import data
library(tidyverse)
data(diamonds)

# A little bit of data wrangling code - let's not focus on this for now
diamonds <- diamonds %>% 
    mutate(
        cut = factor(cut, ordered = FALSE),
        color = factor(color, ordered = FALSE),
        clarity = factor(clarity, ordered = FALSE)
    )
```

We'll focus on 3 variables:

- `price` = price in US dollars
- `cut` = quality of the diamond cut or shape (`Fair`, `Good`, `Very Good`, `Premium`, `Ideal`)
- `color` = diamond color, from `D` (best) to `J` (worst)

\
\


**Research question:**

How can we predict or explain the variability in `price` from diamond to diamond using diamond `cut` or `color`?


\
\


**What's new?**

In this research question, our outcome / response variable (`price`) is still *quantitative* but our potential predictors (`cut`, `color`) are *categorical*!
We're in luck -- simple linear regression is a powerful tool that can be used no matter whether a predictor is quantitative or categorical.


### Example 1: Get to know the data

Write R code to answer the following:

```{r}
# What do the first few rows of the data look like?
# What does a case represent?

```

```{r}
# How many cases and variables do we have?

```

```{r}
# Construct and interpret 2 visualizations of price

```

Interpretation:

```{r}
# Calculate the central tendency and spread in price

```

```{r}
# Tabulate the number of diamonds of each cut

```

```{r}
# Construct and interpret a visualization of the cut variable

```

Interpretation:


### Example 2: Price vs cut (bad viz)

Let's now explore how diamond price might be explained by cut.
We'll start by visualizing this relationship.
Use the results below to explain why a scatterplot is *not* an effective visualization and a line is *not* an effective summary of the relationship in this example (or for exploring a quantitative Y variable vs categorical X predictor in general).

NOTE: R gives error messages for bad code, but not bad ideas!!

```{r}
# Try a scatterplot
diamonds %>% 
  ggplot(aes(y = price, x = cut)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```


### Example 3: Price vs cut (good viz)

Side-by-side *density plots* provide a better picture of the relationship between price and cut.

```{r}
# Check out a density plot of price alone
diamonds %>% 
  ggplot(aes(x = price)) + 
  geom_density()
```

```{r}
# Modify that code to separate the density plot by cut,
# using different line colors to distinguish between different cuts:
diamonds %>% 
  ggplot(aes(x = price)) + 
  geom_density()
```

```{r}
# Modify the code again: use different *fills* to distinguish cuts
diamonds %>% 
  ggplot(aes(x = price)) + 
  geom_density()
```

```{r}
# Modify this code to create side-by-side *boxplots* of price by cut
diamonds %>% 
  ggplot(aes(y = price)) + 
  geom_boxplot()
```

```{r}
# Modify this code to create separate *histograms*:
diamonds %>% 
  ggplot(aes(x = price)) + 
  geom_histogram(color = "white") + 
  facet_wrap(~ ___)
```


**FOLLOW-UPS**

- Describe the relationship of `price` with `cut`. Do you notice anything unexpected? What do you think might be happening here?
- What do you think was the least effective plot?


::: {.callout-note title = "Next up: modeling!"}

We can use linear regression to model the relationship of a quantitative response variable Y with a categorical predictor variable X!
This is a relief because predictors are often categorical.
Examples:

- Model time to campus (Y) by mode of transportation (X) -- walk, bike, bus, etc.
- Model film revenue (Y) by genre (X) -- rom com, sci-fi, etc.
- Model wages (Y) by job industry (X) -- agriculture, construction, etc.
- Model some health outcome (Y) by treatment -- A or B.

Though the linear regression model formula is *not* represented by a line, it still has a familiar form:

$E[Y | X]$ = $\beta_0$ + $\beta_1$ (something related to $X$)

:::


## Exercises {-}


### Exercise 1: Numerical summaries by category

In exploring the relationship of `price` with `cut`, we've explored some visual summaries.
Let's follow these up with some numerical summaries.

a. To warm up, first calculate the mean `price` across all diamonds.

```{r}
diamonds %>% 
    ___(mean(___))
```

b. To summarize the trends we observed in the grouped plots above, we can calculate the mean `price` for each type of `cut`. This requires the inclusion of the `group_by()` function:

```{r}
# Group the diamonds data by cut
# Show just the first 6 rows
# Notice that nothing obviously changes, just how the data are structured in the background
diamonds %>% 
    group_by(cut) %>% 
    head()
```

```{r}
# So we *almost always* follow up `group_by()` with `summarize()`
# For example: calculate the mean price by cut
diamonds %>% 
    group_by(cut) %>% 
    ___(mean(___))
```


c. Examine the group mean measurements, and make sure that you can match these numbers up with what you see in the plots.


### Exercise 2: Indicator variables

a. Based on the results above, we learn that, on average, diamonds with a "Fair" cut tend to cost more than higher-quality cuts. Let's construct a new variable named `cutFair`, using on the following criteria:

- `cutFair` = 1 if the diamond is of Fair cut
- `cutFair` = 0 otherwise (any other value of cut (Good, Very Good, Premium, Ideal))

The `ifelse` function allows to create a new variable from an existing one, based on whether or not the values in that variable meet a certain "condition" (remember, you can always look up function documentation in R by typing `?ifelse` in the Console, and hitting enter!). 

Fill in the following code to create `cutFair`. The condition was given to you already. Try to use this to complete the code.

```{r}
# In the first blank, put what value cutFair should have if the condition is "met", or TRUE
# In the second blank, put what value cutFair should have if the condition is "not met", or FALSE
diamonds <- diamonds %>%
  mutate(cutFair = ifelse(cut == "Fair", ___, ___))
```

Variables like `cutFair` that are coded as 0/1 to numerically indicate if a categorical variable is at a particular state are known as an **indicator variable**. You will sometimes see these referred to as a "binary variable" or "dichotomous variable"; you may also encounter the term "dummy variable" in older statistical literature.

b. Now, let's calculate the group means based on the new `cutFair` indicator variable. Examine the group means and store them in your brain for later:

```{r}
diamonds %>% 
    group_by(cutFair) %>% 
    summarize(mean(price))
```

c. Calculate the *difference* between the mean price for fair and not fair diamonds (mean for fair - mean for not fair).
Check out the actual number and keep this in your brain!

```{r}

```


### Exercise 3: Modeling trend using a categorical predictor with exactly 2 categories 

Next, let's *model* the trend in the relationship between the `cutFair` and `price` variables using a simple linear regression model:

```{r}
# Construct the model
diamond_mod0 <- lm(price ~ cutFair, data = diamonds)

# Summarize the model
coef(summary(diamond_mod0))
```

Compare these results to the output of the previous exercise. What do you notice? How do you interpret the intercept and `cutFair` coefficient terms from this model?


### Exercise 4: Modeling trend using a categorical predictor with >2 categories

Using a single binary predictor like the `cutFair` indicator variable is useful when there are two clearly delineated categories. However, the `cut` variable actually contains 5 categories! Because we've collapsed all non-Fair classifications into a single category (i.e. `cutFair = 0`), the model above can't tell us anything about the difference in expected price between, say, *Premium* and *Ideal* cuts. The good news is that it is straightforward to model categorical predictors with >2 categories. We can do this by using the `cut` variable as our predictor: 

```{r}
# Construct the model
diamond_mod <- lm(price ~ cut, data = diamonds)

# Summarize the model
coef(summary(diamond_mod))
```

a. Even though we specified a single predictor variable in the model, we get 4 coefficient estimates--why do you think this is the case?


NOTE: We observe 4 indicator variables (for Good, Very Good, Premium, and Ideal), but we do not see `cutFair` in the model output. This is because `Fair` is the **reference level** of the `cut` variable (it's first alphabetically). You'll learn below that it is, indeed, still in the model. You'll also learn why the term "reference level" makes sense!
  

b. After examining the summary table output from the code chunk above, complete the model formula:

\

E[price | cut] = ___ +/- ___ cutGood +/- ___ cutVery Good +/- ___ cutPremium +/- ___ cutIdeal

\   


### Exercise 5: Making sense of the model

Recall our model:  E[price | cut] = 4358.7578 - 429.8933 cutGood - 376.9979 cutVery Good + 225.4999 cutPremium - 901.2158 cutIdeal

a. Use the model formula to calculate the expected/typical price for diamonds of *Good* cut. HINT: Plug in 0s and 1s!!

```{r}
4358.7578 - 429.8933*___ - 376.9979*___ + 225.4999*___ - 901.2158*___
```

b. Similarly, calculate the expected/typical price for diamonds of *Fair* cut.

```{r}
4358.7578 - 429.8933*___ - 376.9979*___ + 225.4999*___ - 901.2158*___
```

c. Check your work with the `predict()` function:

```{r}
predict(diamond_mod, newdata = data.frame(cut = "Good"))
predict(diamond_mod, newdata = data.frame(cut = "Fair"))
predict(diamond_mod, newdata = data.frame(cut = c("Good", "Fair")))
```


d. Re-examine these 2 calculations. Where have you observed these numbers before?!


### Exercise 6: Interpreting coefficients

Recall that our model formula is *not a formula for a line*. Thus we can't interpret the coefficients as "slopes" as we have before. Taking this into account and reflecting upon your calculations above...   

a. Interpret the intercept coefficient (`4358.7578`) in terms of the data context. *Make sure to use non-causal language, include units, and talk about averages rather than individual cases.*

b. Interpret the `cutGood` and `cutVery Good` coefficients (`-429.8933` and `-376.9979`) in terms of the data context. Hint: where did you use these value in the prediction calculations above?


### Exercise 7: Modeling choices (CHALLENGE)

Why do we fit this model in this way (using 4 indicator variables `cutGood`,  `cutVery Good`, `cutPremium`, `cutIdeal`)? Instead, suppose that we created a single variable `cutCat` that gave each category a numerical value: 0 for Fair, 1 for Good, 2 for Very Good, 3 for Premium, and 4 for Ideal.

How would this change things? What are the pros and cons of each approach?


### Reflection

Through the exercises above, you learned how to build and interpret models that incorporate a categorical predictor variable. For the benefit of your future self, summarize how one can interpret the coefficients for a categorical predictor.


## Extra Practice {-}


### Exercise 8: The least squares criterion

The coefficient estimates `diamond_mod` were selected in the same way as when our predictor is quantitative: by minimizing the sum of the squared residuals.
Use this model to calculate the residual for the first diamond in the dataset:

```{r}
# Observed data on the first diamond
diamonds %>% 
  select(price, cut) %>% 
  head(1)
```


### Exercise 9: Diamond color

Consider modeling `price` by `color`.

- Before creating a visualization that shows the relationship between price and color, write down what you expect the plot to look like. Then construct and interpret an appropriate plot.
- Compute the average price for each color.
- Fit an appropriate linear model with `lm()` and display a short summary of the model.
- Write out the model formula from the above summary.
- Which color is the reference level? How can you tell from the model summary?
- Interpret the intercept and two other coefficients from the model in terms of the data context.


### Exercise 10: Diamond clarity

Repeat the steps from the previous exercise for the `clarity` variable.


### Exercise 11: Evaluating model strength

Let's study some penguin data!

```{r}
data(penguins)
penguins <- penguins %>% 
  filter(!is.na(sex), !is.na(species))
```

We'll focus on 3 variables with the goal of predicting `flipper_len`:

- `flipper_len` = the length of the penguin's flippers (arms) in mm
- `species` = Adelie, Chinstrap, or Gentoo
- `sex` = female, male

And build 2 models of `flipper_len`:

```{r}
# Model of flipper_len by sex
flipper_model_1 <- lm(flipper_len ~ sex, data = penguins)
flipper_model_2 <- lm(flipper_len ~ species, data = penguins)
```


Just as with models that use *quantitative* predictors, it's important to *evaluate* our (eventual) models of `flipper_len` by `species` and `sex`: are they correct? strong? fair?

Let's start with strength.
How strong are these models?
What's the best predictor?
Let's explore.

a.  Based on the boxplots below, what is the stronger predictor of `flipper_len`: `sex` or `species`?

```{r}
penguins %>% 
  ggplot(aes(y = flipper_len, x = sex)) + 
  geom_boxplot()
```

```{r}
penguins %>% 
  ggplot(aes(y = flipper_len, x = species)) + 
  geom_boxplot()
```

b. We used 2 approaches to measuring the strength of the simple linear regression models in previous activities: correlation and R-squared.
Unfortunately, we get errors when we try to calculate the correlation between `flipper_len` and `sex`, and between `flipper_len` and `species`.
Why?

```{r}
penguins %>% 
  summarize(cor(sex, flipper_len))
penguins %>% 
  summarize(cor(species, flipper_len))
```


c. Luckily, R-squared works no matter whether a predictor is quantitative or categorical! Interpret and compare the R-squared values for our separate models of `flipper_len` by `sex` and `species`. Which is the stronger predictor? Does this match your answer in part a?

```{r}
# Model of flipper_len by sex
summary(flipper_model_1)

# Model of flipper_len by species
summary(flipper_model_2)
```


### Exercise 12: Evaluating model correctness

Let's just focus on the stronger of our 2 models, that of `flipper_len` by `species` (`flipper_model_2`), and ask: is it correct (not wrong)?
Recall that when our predictor was *quantitative*, residual plots provided some insight.
Check out the residual plots below.
They look a little goofy!
Explain the goofiness (what's happening here) and describe what you learn about the model's "correctness".
Is the model "correct"?

```{r}
# Residual plot
flipper_model_2 %>% 
  ggplot(aes(x = .fitted, y = .resid)) + 
  geom_point() + 
  geom_hline(yintercept = 0)
```

```{r}
# Residual plot using boxes!
flipper_model_2 %>% 
  ggplot(aes(x = .fitted, y = .resid, group = .fitted)) + 
  geom_boxplot() + 
  geom_hline(yintercept = 0)
```


### Exercise 13: Really understand how the coefficients work

Next, consider the model of `flipper_len` by `island`.
The average `flipper_len` of penguins on each `island` is calculated below:

```{r}
penguins %>% 
  group_by(island) %>% 
  summarize(mean(flipper_len))
```

a. Using just the above averages and your understanding of categorical predictors, fill in the model coefficients and indicator variables below. Do NOT use `lm()` yet!!!

E[`flipper_len` | `island`] = ___ +/- ___ island??? +/- ___ island???


b. Check your work to Part a.

```{r}
flipper_model_3 <- lm(flipper_len ~ island, penguins)
summary(flipper_model_3)
```


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!