---
title: "Simple linear regression: model evaluation (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')
```

::: {.callout-note title = "Organize your files"}

This **qmd file** is where you'll type notes, code, etc.
Directions:

- Save this file in the `inclass_activities` sub-folder of the `stat155` folder you created before today's class. Use a file name related to the activity number and/or today's date (eg: "activity_5" or "5_model_evaluation").
:::


## Warm-up {-}

Let's explore the `high_peaks` data, which includes information on hiking trails in the 46 "high peaks" in the Adirondack mountains of northern New York state.^[original data source = `HighPeaks` data in the `Stat2Data` package; original image source = <https://commons.wikimedia.org/wiki/File:New_York_state_geographic_map-en.svg>]


![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/New_York_state_geographic_map-en.svg/1600px-New_York_state_geographic_map-en.svg.png)




\
\

### Example 1: Modeling review

Let's begin by modeling the `time` (in hours) that it takes to complete each hike by the hike's vertical `ascent` (in feet).

```{r}
# Load packages and data, and examine data structure
library(tidyverse)
peaks <- read_csv("https://mac-stat.github.io/data/high_peaks.csv")

# Check out the data


```



```{r}
# Fit a simple linear regression model
peaks_model <- ___(___, ___)

# Check out a model summary table
# (the big one, not the simplified version!)
___(peaks_model)
```


**Follow-up**

Write out the sample model equation.





\
\
\
\


::: {.callout-note title = "Interpreting model coefficients"}

When interpreting model coefficients, remember to:

1. interpret in context
2. use non-causal language
3. include units
4. talk about averages rather than individual cases

:::

\
\
\
\

### Example 2: More modeling review

Interpret the 2 model coefficients.
  
  


### Example 3: Is it correct?

Let's evaluate the quality of this model. First, is it correct (i.e. not wrong)?

Comment on the **LINE** assumptions, using the below **residual plot** where relevant.

```{r}
# Residual plot
# NOTE: We start our code with the MODEL not the DATA!!!
# That's where the residuals (`.resid`) & predictions (`.fitted`) are stored
# We'll RARELY do this.
peaks_model %>% 
  ___(___(___ = .fitted, ___ = .resid)) + 
  ___() + 
  geom_hline(yintercept = 0)
```






\
\
\
\

### Example 4: Plot the data

Let’s go back and plot the data, specifically the relationship of `time` with `ascent` (which we should always do before building the model!). Include a representation of the linear regression model.


```{r}
# NOTE: "lm" stands for "linear model"
___ %>% 
  ___(aes(y = ___, x = ___)) +
  geom___() + 
  geom___(method = "lm", se = FALSE)
```


### Example 5: Is it strong?
We can measure the strength of this model using R-squared ($R^2$). Report and interpret the R-squared value. HINT: No new R code needed! Just check out the model summary table.

```{r}
summary(peaks_model)
```


\
\
\
\


### Example 6: Is the model fair?

Stop to consider the data context. 


## Exercises {-}

You’ll work with a variety of datasets & scenarios in these exercises. The goal is to explore the different ways that a model can be bad (or good!), not to work through 1 cohesive analysis.

**DIRECTIONS**

- Work with your group:
  Be kind to yourself and others. Engage yourself and engage others.

- When you get stuck:
  Re-read the problem, discuss with your group, ask the instructor, check solutions in the online manual.
  

### Exercise 1: Is the model correct?

Let's check out some weather data from Hobart, Australia, including the `maxtemp` (maximum temperature in degrees Celsius) and `day_of_year` (1--365):

```{r}
# Import data
# Take note of the filter code!
weather <- read_csv("https://mac-stat.github.io/data/weather_3_locations.csv") %>% 
  filter(location == "Hobart") %>% 
  mutate(day_of_year = yday(date)) %>% 
  select(maxtemp, day_of_year)

# Check it out
head(weather)
```

Let's use this data to model `maxtemp` by `day_of_year`: 

```{r}
# Fit a linear model of 
weather_model <- lm(maxtemp ~ day_of_year, data = weather)

# Check it out
summary(weather_model)
```

a. The estimated model formula is

E[`maxtemp` | `day_of_year`] = 20.105 - 0.012 `day_of_year`

Discuss the implications of the slope coefficient -0.012 in this context. Does this make sense (what do you think is the actual relationship between these 2 variables)?



b. Now plot the relationship of `maxtemp` with `day_of_year` with both a curved and linear trend line. This plot indicates that the simple linear regression model (blue) is wrong. Specify which LINE assumption it violates.
  
```{r}
# Plot maxtemp vs day_of_year with...
# a linear regression model trend AND
# a smooth red curve (not necessarily linear)
___ %>% 
  ___(aes(y = ___, x = ___)) + 
  geom___() +
  geom___(method = "lm", se = FALSE) + 
  geom___(se = FALSE, color = "red") 
```


c. Check out the residual plot for this model. Convince yourself that you could come to the same conclusion from Part a using this plot alone.

```{r}
# Check out the residual plot for weather_model
___ %>% 
  ggplot(aes(x = .fitted, y = .resid)) + 
  geom_point() + 
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE)
```

    
Note: We can *fix* this model by adding a quadratic "transformation term". We'll discuss this idea in our next class.




### Exercise 2: What's incorrect about this model?

Consider another example.
The `mammals` data includes data on the average brain weight (g) and body weight (kg) for a variety of mammals:

```{r}
# Import the data
mammals <- read_csv("https://mac-stat.github.io/data/mammals.csv")

# Check it out
head(mammals)
```

Fit a model of `brain` vs `body` weight:

```{r}
# Construct the model
mammal_model <- lm(brain ~ body, mammals)

# Check it out
summary(mammal_model)
```


a. Construct two plots that will help us evaluate `mammal_model`:

```{r}
# Scatterplot of brain weight (y) vs body weight (x)
# Include a model trend line (i.e. a representation of mammal_model)

```

```{r}
# Residual plot for mammal_model

```

b. These two plots confirm that our model is wrong. *What* is wrong? That is, which of the LINE assumptions are violated? (NOTE: We again can *fix* this model by "transforming" one or both of the `brain` and `body` variables. We'll discuss this idea in our next class.)


c. Just for fun, let's dig into the mammals data.
Discuss what you observe:

```{r}
# Label the points by the animal name!
# Discuss: What 2 things are new in this code?
ggplot(mammals, aes(x = body, y = brain, label = animal)) + 
    geom_text() + 
    geom_smooth(method = "lm", se = FALSE) 
```

```{r}
# Zoom in
ggplot(mammals, aes(x = body, y = brain, label = animal)) + 
    geom_text() + 
    lims(y = c(0, 1500), x = c(0, 600))
```

```{r}
# Zoom in more
ggplot(mammals, aes(x = body, y = brain, label = animal)) + 
    geom_text() + 
    lims(y = c(0, 500), x = c(0, 200))
```



### Exercise 3: Is the model strong?

Let's revisit the `bikes` data, but focus just on *MONDAY* ridership:

```{r}
# Take a peak at this code for review!
mondays <- read_csv("https://mac-stat.github.io/data/bikeshare.csv") %>% 
  rename(rides = riders_registered) %>% 
  filter(day_of_week == "Mon")

# Check it out
head(mondays)
```

Check out the plot of `rides` vs `temp_feel`:

```{r}
mondays %>% 
  ggplot(aes(y = rides, x = temp_feel)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```

And the plot of `rides` vs `windspeed`:

```{r}
mondays %>% 
  ggplot(aes(y = rides, x = windspeed)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```

#### Part a

Based on the plots alone, which is the stronger predictor of ridership: temperature or windspeed?

#### Part b

Check your answer. Obtain, interpret, and compare the R-squared values for the 2 models of interest.

```{r}
# Fit the models
bike_model_1 <- lm(rides ~ temp_feel, data = mondays)
bike_model_2 <- lm(rides ~ windspeed, data = mondays)

# Obtain R-squared
___(bike_model_1)
___(bike_model_2)
```

#### Part c

Since the relationships modeled by `bike_model_1` and `bike_model_2` both include a single quantitative predictor, we can calculate their correlations.
Do that below *and* demonstrate how these correlations can be used to calculate the R-squared values from Part a.

```{r}
# Calculate correlation
mondays %>% 
  ___(___(rides, temp_feel), ___(rides, windspeed))

# Use the correlations to calculate the model R-squared values
# (within rounding)

```



### Exercise 4: Calculating R-squared (Part 1)

Let's dig into where the R-squared metric comes from!

a. Check out the `model_1_info` table below that, for each day in the dataset, includes info on the *observed* ridership, the *predicted* ridership using `bike_model_1`, and the *residual*. (You made similar calculations in the previous activity.) *Discuss with each other how these 3 columns are related!* For example, how can we get the `rides` column from the `predicted` and `residual` columns?

```{r}
# Don't worry about this code!
model_1_info <- mondays %>% 
  select(rides) %>% 
  mutate(predicted = bike_model_1$fitted.values, residual = bike_model_1$residuals)

# Check it out
head(model_1_info)
```


b. In Part a we observed that the observed `rides`, `predicted` values, and `residual` values all vary from day to day. Here, calculate the *variance* of the observed `rides`, `predicted` values, and `residual` values. *Identify and discuss with each other how these 3 values are related!* For example, how can we calculate `var(rides)` using `var(predicted)` and `var(residual)`?

```{r}
model_1_info %>% 
  ___(var(rides), var(predicted), var(residual))
```
        
        
c. Recall that `bike_model_1` had an R-squared of 0.3137. Challenge: How can you calculate this using the `var(rides)` and `var(predicted)` values from Part b? Hint: Remember that R-squared is the proportion of the variability in a response variable Y that is explained by its model (thus its predictions).

```{r}
# Calculate R-squared
```

     

### Exercise 5: Calculating R-squared (Part 2)

<!-- Recommended: Check out the rendered version of this exercise.  -->

Your work for `bike_model_1` is summarized below, along with additional calculations from `bike_model_2`.  
    
\

Model           Var(observed)  Var(predicted)  Var(residuals) R-squared
-------------- -------------- --------------- --------------- ----------
`bike_model_1`        2262666          709695         1552970 0.31
`bike_model_2`        2262666           65864         2196801 0.02

\

These results reflect 2 important facts:

1. We can **partition the variance** of the observed response values into the variability that's explained by the model (the variance of the predictions) and the variability that's left unexplained by the model (the variance of the residuals):   
    
$$\text{Var(observed) = Var(predicted) + Var(residuals)}$$

2. It follows that R-squared, the *proportion* of the variance in the observed response values that's explained the the model, can be calculated by:

$$R^2 = \frac{\text{Var(predicted)}}{\text{Var(observed)}} = 1 - \frac{\text{Var(residuals)}}{\text{Var(observed)}}$$


\

These properties are also reflected in the pictures below:

![](https://mac-stat.github.io/images/155/rsquared.png)



**FOLLOW-UP QUESTIONS:**

Reflecting on the above table of variances, R-squared formula, and plot...

- In terms of obtaining higher R-squared / stronger model, do we want the variance of the residuals to be close to 0 or close to the variance of the observed Y values? Why does this make sense? 

- In terms of obtaining higher R-squared / stronger model, do we want the variance of the prediction to be close to 0 or close to the variance of the observed Y values? Why does this make sense? 



### Exercise 6: Biased data, biased results: example 1

In the above exercises, we focused on exploring our first 2 model evaluation questions: Is it correct? Is it strong?
In the next exercises, let's explore the third question: Is it fair?

**Data are not neutral**.
Data can reflect personal biases, institutional biases, power dynamics, societal biases, the limits of our knowledge, etc.
And biased data can lead to biased analyses.
Consider the example of a large company that developed a model / algorithm to review the résumés of applicants for software developer & other tech positions.
The model then gave each applicant a score indicating their hire-ability or potential for success at the company.
You can think of this model as something like:            

$$E[\text{potential | résumé features}] = \beta_0 + \beta_1 (\text{résumé features})$$


Skim this [Reuter's article](https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G) about the company's résumé model.    

- Explain why the data used by this model are not neutral.

- What are the potential implications, personal or societal, of the results produced from this biased data?



### Exercise 7: Rigid data collection systems

When working with categorical variables, we've seen that our units of observation fall into neat groups. Reality isn't so discrete. For example, check out questions 6 and 9 on [page 2 of the 2020 US Census](https://www2.census.gov/programs-surveys/decennial/2020/technical-documentation/questionnaires-and-instructions/questionnaires/2020-informational-questionnaire-english_DI-Q1.pdf). With your group, discuss the following:    

- What are a couple of issues you see with these questions?

- What impact might this type of data collection have on a subsequent *analysis* of the census responses and the policies it might inform?

- Can you think of a better way to write these questions while still preserving the privacy of respondents?

**FOR A DEEPER DISCUSSION:** Read [Chapter 4 of Data Feminism](https://data-feminism.mitpress.mit.edu/pub/h1w0nbqp/release/3) on "What gets counted counts". 







### Reflection

What has stuck with you most in our exploration of model evaluation? Why?



## Extra practice {-}

### Exercise 8: Data drill

a. Let's practice some data wrangling with our `peaks` data. In addition to some basic functions (e.g. `head()`), use the `tidyverse` functions we've been accumulating: `select()`, `summarize()`, `filter()`.

```{r}
# How many hikes are in the data set?


# Show just the peak's name and its elevation for the first 6 hikes
___ %>% 
  ___ %>% 
  head()

# Calculate the average hiking time


# Show data for the hikes that are *at least* 17 miles long


# Calculate the average hiking time for hikes that are at least 17 miles long
# HINT: You'll need to use 2 tidyverse functions!
___ %>% 
  ___ %>% 
  ___

# Calculate the average hiking time for hikes with an "easy" rating
# HINT: You'll need to use 2 tidyverse functions again!


```

b. Let's practice some visualization!

```{r}
# Construct a visualization of hike rating


# Construct a visualization of hiking time


# Construct a visualization of the relationship between hiking time and length (distance)

```


### Exercise 9: Types of models


#### Part a

Is it possible to build a model that's correct but weak? If not, explain. If yes, sketch an example of what this might look like.

#### Part b

Is it possible to build a model that's wrong but strong? If not, explain. If yes, sketch an example of what this might look like.


Visualize the relationship between these 2 variables, including representations of the raw data *and* the simple linear regression model.
Describe your observations.

```{r}

```

#### Part c

Build the linear regression model.
Write out the estimated model formula and interpret the 2 coefficients.

```{r}

```




## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!


