---
title: "Simple linear regression: Formalization (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')
```

::: {.callout-note title = "Organize your files"}

This **qmd file** is where you'll type notes, code, etc.
Directions:

- Save this file in the `inclass_activities` sub-folder of the `stat155` folder you created before today's class. Use a file name related to the activity number and/or today's date (eg: "activity_4" or "4_simple_linear_regression").
:::

## Warm-up {-}

**Guiding question**

How is bikeshare ridership related to various weather factors?


\


**Context**

We'll explore data from [Capital Bikeshare](https://capitalbikeshare.com/), a company in Washington DC.
Our main goal will be to explore daily ridership among *registered* users, as measured by `rides`.

```{r}
# Load packages and import data
# Shorten the "riders_registered" variable to "rides"
# Keep only certain columns / variables
library(tidyverse)
bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv") %>% 
  rename(rides = riders_registered) %>% 
  select(date, day_of_week, weekend, temp_feel, humidity, windspeed, rides)
```


\
\

### EXAMPLE 1: Get to know the data

```{r}
# Check out the first 6 rows
# What does a case represent, i.e. what's the unit of observation?


# How much data do we have?

```

**NOTE**

It's important to think about the "who, what, when, where, why, and how" of a dataset.
In this example, the *where* and *when* are particularly relevant.
These data were collected in D.C. in 2011-12.
The patterns we observe may not extend to other places and/or time periods!


\
\
\
\


### EXAMPLE 2: Explore ridership

Let's get acquainted with the `rides` variable.
Remember: Code = communication.
Pay special attention to formatting / readability.

```{r}
# Calculate the average (central tendency) and standard deviation (variability) in daily ridership
bikes %>% 
  ___(mean(___), sd(___))
```

```{r}
# Construct a plot of how ridership varies from day to day
bikes %>% 
  ___(aes(___))
```


**Follow-up**

Summarize, in words, what you learned from the plot and numerical summaries.
Remember to comment on: *shape*, *central tendency*, *spread* / *variability*, and, if relevant, any *outliers*.


\
\
\
\


### EXAMPLE 3: Relationships

Now that we better understand ridership (`rides`), natural and important follow-up questions in nearly every statistical analysis are:

What *explains* why ridership varies from day to day, i.e. why there are more riders on some days than others?!

For example, is ridership *related to* / can it be *predicted by* various weather factors such as `windspeed` in miles per hour (mph), or the "feels like" temperature in degrees Fahrenheit (`temp_feel`)?

In this setting:

- response variable Y = `rides`
- potential predictors X = `windspeed`, `temp_feel`

Let's start with the relationship of `rides` with `windspeed`.

#### Part a: intuition

Quick:

Do you anticipate that `rides` is related to `windspeed`?
If so, do you anticipate that the relationship is positive or negative?
Strong, weak, or moderate?


#### Part b: visualization

It's always good to start our analysis with a picture!
We can *visualize* the relationship between these 2 quantitative variables using a **scatterplot**.

This will require some new code, but let's start with what we know.

```{r}
bikes %>% 
  ggplot(aes(___))
```

Add the following two lines to your plot.

```{r}
# Add a blue simple linear regression line (line of best fit)
# and a orange *curve* of best fit
PUT YOUR PLOT HERE +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(color = "orange", se = FALSE)
```

#### Part c: description

Summarize the relationship.

Remember to comment on: *shape*, *direction*, *strength*, and, if relevant, any *outliers*.


\
\
\
\


### EXAMPLE 4: Correlation

Some of our above description is a bit vague, and somewhat subjective.

We can more precisely *numerically measure* the *strength* and *direction* of a *linear* relationship using **correlation**.

Guess the correlation between ridership and windspeed.

(You'll calculate this in the exercises below.)

\
\
\
\

### EXAMPLE 5: Simple linear regression (Part 1)

We can also more accurately summarize this relationship by finding a formula for the linear trend:

```{r}
bikes %>% 
  ggplot(aes(y = rides, x = windspeed)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```

Consider the *sample* linear regression model of this relationship, i.e. the formula for the line in the scatterplot:

E[ridership | windspeed] = $\hat{\beta}_0$ + $\hat{\beta}_1$ windspeed

\
\

In the exercises, you'll show that:

E[ridership | windspeed] = 4490.10 - 65.34 windspeed

Consider the **intercept coefficient** 4490.10.
Physically, this is the *y-intercept* of our model line (where it crosses the y-axis).

What's the *contextual* meaning?

a. A 1 mph increase in wind speed is associated with a 4490.10 increase in expected ridership.
b. On days with wind speeds of 0 mph (no wind!), we expect 4490.10 riders. 
c. On average, there are 4490.10 riders per day.

\
\
\
\


### EXAMPLE 6: Simple linear regression (Part 2)

Consider the **windspeed coefficient**, -65.34
Physically, this is the *slope* of our model line (rise over run).

Pick the most effective / correct *interpretations* of this slope below.

For each other option, indicate what's wrong with the interpretation.

#### Part 1 

a. Increasing X by 1 is associated with a 65.34 decrease in the average Y value.
b. A 1 mph increase in windspeed is associated with a decrease of 65.34 riders.
c. A 1 mph increase in windspeed is associated with a decrease of 65.34 in average ridership.
d. Windier weather discourages ridership. A 1 mph increase in windspeed produces 65.34 fewer riders on average.
e. If tomorrow is 1 mph warmer than today, we should expect 65.34 fewer riders.
f. A windspeed change of 1 is associated with a 65.34 decrease in average ridership.
g. A 1 rider increase is associated with a 65.34 mph decrease in average windspeed.  


#### Part 2

Keeping in mind the *units* and *range* of our variables, what do you think about the *magnitude* of the slope?
Is a decrease of 65.34 in average ridership per 1-mph increase in windspeed *meaningful* in this context?

\
\
\
\


::: {.callout-note title = "Interpreting model coefficients"}

When interpreting model coefficients, remember to:

1. interpret in context
2. use non-causal language
3. include units
4. talk about averages rather than individual cases

:::

::: {.callout-note title = "Interpreting intercept and slope"}

Language is fluid -- there are many ways to phrase a correct interpretation of the intercept and slope!
Here's one general option, that can be translated into the context at hand:

E[Y|X] = a + bX

- Intercept: When X is 0 units, we expect Y to be a.
- Slope: A 1-unit increase in X is associated with a b-unit (increase / decrease) in the average Y value.

:::


\
\
\
\


## Exercises {-}

**DIRECTIONS**

You'll work on these exercises in your groups. 
Collaboration is a key learning goal in this course.

-   Why? Discussion & collaboration deepens your understanding of a topic while also improving confidence, communication, community, & more. (eg: Deeply learning a new language requires more than working through Duolingo alone. You need to talk with and listen to others!)

-   How? You are expected to:

    -   Use your group members' names & pronouns. It's ok to ask if you don’t remember!
    -   Actively contribute to discussion. Don't work on your own.
    -   Actively include all other group members in discussion.
    -   Create a space where others feel comfortable sharing ideas & questions.

-   We won't discuss these exercises as a class. With that, when you get stuck:    
    -   Carefully re-read the problem. Make sure you didn't miss any directions -- it can be tempting to skip words and go straight to an R chunk, but don't :).
    -   Discuss any questions you have with your group.
    -   If the question is unresolved by the group, be sure to ask the instructor!
    -   Remember that there are solutions in the online manual, at the bottom of the activity.


\
\
\
\


### Exercise 1: Correlation 

In the warm-up, you took a guess at the correlation between ridership and windspeed:

```{r}
bikes %>% 
  ggplot(aes(y = rides, x = windspeed)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```

Let's check your intuition!
Fill in the code below to calculate the correlation.

```{r}
# HINT: What function is useful for calculating statistics (eg: means)?
bikes %>% 
  ___(cor = cor(rides, windspeed))
```


\
\
\
\


### Exercise 2: Build the model 

You were told above that our (sample) simple linear regression model was:

E[ridership | windspeed] = 4490.10 - 65.34 windspeed

Verify this using the code below.
Step through code chunk slowly, make note of new code.

```{r}
# Construct and save the model as bike_model_1
# What's the purpose of "rides ~ windspeed"?
# What's the purpose of "data = bikes"?
bike_model_1 <- lm(rides ~ windspeed, data = bikes)
```

```{r}
# A long summary of the model stored in bike_model_1
summary(bike_model_1)
```

```{r}
# A simplified model summary
coef(summary(bike_model_1))
```


\
\
\
\


### Exercise 3: Predictions and residuals 

On March 8, 2012, the `windspeed` was 29.58472 mph and there were 4896 riders:
    
```{r}
# Note the use of filter() and select()
# You'll learn more about filter below!
bikes %>% 
  filter(date == "2012-03-08") %>% 
  select(rides, windspeed) 
```

#### Part a 

In the scatterplot below, locate the point that corresponds to March 8, 2012 (by eye, not using code).
Is it close to the trend?
Were there *more* riders than expected or *fewer* than expected?

```{r}
bikes %>% 
  ggplot(aes(y = rides, x = windspeed)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```


#### Part b 

Using the model formula below, calculate how many riders we would have *expected* or *predicted* on March 8, 2012 based on its windspeed:

E[ridership | windspeed] = 4490.10 - 65.34 windspeed

```{r}
# Calculate the prediction IN THIS CHUNK
# (Convince yourself that you can connect this to the scatterplot)
4490.10 - 65.34*___
```


#### Part c 

Check your prediction using the `predict()` function.

```{r}
# What is the purpose of including bike_model_1?
# What is the purpose of newdata = ___???
predict(bike_model_1, newdata = data.frame(windspeed = 29.58472))
```


#### Part d 

How far does the *observed* ridership on March 8, 2012 fall from its *model prediction*?
Calculate the **residual** or **prediction error**:

```{r}
# Recall: residual = observed Y - predicted Y

```


#### Part e 

Are *positive* residuals above or below the trend line?

When we have *positive* residuals, does the model over- or under-estimate ridership?

Are *negative* residuals above or below the trend line?

When we have *negative* residuals, does the model over- or under-estimate ridership?


\
\
\
\


### Exercise 4: Ridership vs temp_feel (visualization) 

Your turn!
Let's practice what you've learned in the preceding exercises to investigate the relationship between `rides` and `temp_feel`.

#### Part a 

Below, you'll plot the relationship of `rides` and `temp_feel`.
What do you *expect* this to look like?
Will it be...

- positive / negative?
- linear?
- strong / moderate / weak?

#### Part b

```{r}
# Construct a visualization of the relationship of rides with temp_feel
# Include a representation of:
# - the individual data points
# - a simple linear regression line (line of best fit)
# - a *curve* of best fit

```

#### Part c

Summarize, in words, your findings from the plot.
Remember: shape, direction, strength, and, if relevant, outliers.

\
\
\
\


### Exercise 5: Filtering our data

The relationship of `rides` with `temp_feel` is nonlinear!
Though we'll soon learn techniques for handling nonlinear relationships, right now let's focus on days with feels-like temperatures below 80 degrees (where the relationship is linear).
In the `tidyverse` package, whereas `select()` subsets our data to include only certain *columns / variables* of interest, `filter()` subsets our data to include only certain *rows / cases* of interest.
Let's practice:

```{r}
bikes %>% 
  filter(rides > 6911)
```

```{r}
bikes %>% 
  filter(rides >= 6911)
```

```{r}
bikes %>% 
  filter(rides == 6911)
```

Your turn: show only days where `temp_feel` is under 80 degrees.

```{r}
# There are many such days! Show only the first 6.
bikes %>% 
  filter(temp_feel ___) %>% 
  head()
```

Now that you've convinced yourself that the filter works, store the cases where `temp_feel` is under 80 degrees as a new dataset called `bikes_sub`:

```{r}
# Don't use head() here! Otherwise we'd only store the first 6 dates.
bikes_sub <- bikes %>% 
  filter(temp_feel ___)
```


\
\
\
\

### Exercise 6: Ridership vs temp_feel (model) 

Now that we've filtered the data, let's explore the linear relationship of `rides` with `temp_feel` on dates with temperatures below 80 degrees.
NOTE: We'll be using the `bikes_sub` data now!!!

```{r}
bikes_sub %>% 
  ggplot(aes(x = temp_feel, y = rides)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```


#### Part a 

```{r}
# Use lm() to construct a model of rides vs temp_feel
# Save this as bike_model_2
# !!!Remember to use the bikes_sub data!!!

```

```{r}
# Get a short summary of bike_model_2

```

#### Part b 

Write out a formula for the sample model trend.

E[??? | ???] = ???????


#### Part c 

Interpret the intercept *and* the `temp_feel` coefficients.
NOTE: Your interpretation of the intercept should be "correct" but won't make sense here since it **extrapolates** our model far beyond the range of our observed data.


#### Part d 

How well would our model predict the ridership on October 29, 2012?
Follow the steps below to calculate the residual on this date.
Interpret this residual in words!

```{r}
# Obtain the temp_feel and ridership on October 29, 2012 
# (Once you obtain that info, identify the corresponding point on the scatterplot!)


# Predict ridership on this date using bike_model_2 (hence the temperature info)
# NOTE: Use predict() but also convince yourself you could do this from scratch


# Calculate the residual


```


\
\
\
\


### Exercise 7: Data drill (`filter`, `select`, `summarize`) 

This exercise is designed to help you practice your tidy data wrangling skills.
We can't do statistics or really understand data without these skills!
We'll work with a simpler set of 10 data points:

```{r}
new_bikes <- bikes %>% 
  select(date, temp_feel, humidity, rides, day_of_week) %>% 
  head(10)
```

#### Part a 

Thus far, in the `tidyverse` grammar you've learned 3 **verbs** or action words: `summarize()`, `select()`, `filter()`.
First, refresh your memory.
Take note of the following code and the output it produces:
  
```{r}
# First check out the data for a point of reference
new_bikes
```

```{r}
# summarize()
new_bikes %>% 
  summarize(mean(temp_feel), mean(humidity))
```
    
```{r}
# select()
new_bikes %>%
  select(date, temp_feel)
```

```{r}
# select() again
new_bikes %>% 
  select(-date, -temp_feel)
```

```{r}
# filter()
new_bikes %>% 
  filter(rides > 850)
```

```{r}
# filter() again
new_bikes %>% 
  filter(day_of_week == "Sat")
```

```{r}
# filter() again
new_bikes %>% 
  filter(rides > 850, day_of_week == "Sat")
```


#### Part b 

Your turn.
Use `tidyverse` verbs to complete each task below.
   
```{r}
# Keep only information about the humidity and day of week

# Keep only information about the humidity and day of week using a different approach

# Calculate the maximum and minimum temperatures

# Keep only information for Sundays

# Keep only information for Sundays with temperatures below 50

# Calculate the median ridership on Sundays with temperatures below 50

```


\
\
\
\


### Reflection


**Learning goals**

Recall the learning goals for today's activity. 
Explore numerical and visual summaries for the relationship between a *quantitative* **response (aka outcome) variable** and a *quantitative* **predictor (aka explanatory variable)**:

- Construct and interpret a **scatterplot visualization** of the relationship.
- Compute and interpret the linear **correlation** between the 2 variables.
- Analyze and apply a simple linear regression model of the relationship:   
    - obtain and write a **model formula**
    - interpret the model's **intercept** and **slope** coefficients *in context*
    - compute and interpret **predictions** (aka **expected** or **fitted values**) and **residuals** *in context*
    - explain the connection between residuals and the **least squares criterion**

Among these learning goals:

- Which do you have a good handle on?

- Which are struggling with? What feels challenging right now?

- What are some wins from the day?

- Statistics is a particular kind of language and collection of tools for channeling **curiosity** to improve our world. How do the concepts we practiced today facilitate curiosity?


**Code**

In addition to exploring the learning goals, you learned some new code.

- If you haven't already, you're highly encouraged to start tracking and organizing new code in a cheat sheet (eg: a Google doc). This will be a handy reference for you, and the act of making it will help deepen your understanding and retention.

- Reflect upon the `lm()` function. This function takes 2 arguments. What are they and what's their purpose? Mainly, what information does R need in order to build a model?

- Reflect upon the `predict()` function. This function takes 2 arguments. What are they and what's their purpose? Mainly, what information does R need in order to make a prediction?


\
\
\
\


## Extra practice

After class, you're encouraged to work through the optional exercises below.
These explore the *bills* (beaks) of Chinstrap penguins:

```{r}
data(penguins)
chinstraps <- penguins %>% 
  filter(species == "Chinstrap")
```

Specifically, let's explore the relationship of bill length with bill depth, both measured in mm:

```{r}
# Visual summary
chinstraps %>% 
  ggplot(aes(y = bill_len, x = bill_dep)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```

```{r}
# Model summary
bill_model <- lm(bill_len ~ bill_dep, data = chinstraps)
coef(summary(bill_model))
```


### Exercise 8: model formulas

Which of the following is the correct (sample) model formula of this relationship:

a. bill_dep = 1.92 + 13.43 bill_len
b. bill_len = 1.92 + 13.43 bill_dep
c. E[bill_dep | bill_len] = 1.92 + 13.43 bill_len
d. E[bill_len | bill_dep] = 1.92 + 13.43 bill_dep
e. bill_dep = 13.43 + 1.92 bill_len
f. bill_len = 13.43 + 1.92 bill_dep
g. E[bill_dep | bill_len] = 13.43 + 1.92 bill_len
h. E[bill_len | bill_dep] = 13.43 + 1.92 bill_dep

\
\
\
\

### Exercise 9: Coefficient interpretations

#### Part a

Interpret the slope: A 1 ___ increase in ___ is associated with a 1.92 ___ [increase/decrease] in ___ ___.

#### Part b

Does it make sense to interpret the intercept in context in this example?

\
\
\
\

### Example 10: More predictions

**Challenge yourself to use just the model coefficients to answer the below questions.**

#### Part a

Predict the bill length of a penguin that has a bill depth of 0cm (which doesn't make sense :)).


#### Part b

You come across 2 groups of penguins.
In group "A", all penguins have a bill depth of 19cm.
In group "B", all penguins have a bill depth of 20cm.
Do you expect that the average bill length in group B is greater or less than that of group A?
How *much* greater / less (provide a specific number).


#### Part c

Among penguins with bill depths of 17mm, the average bill length is 46.07mm.
What's the average bill length among penguins with bill depths of 18mm?


\
\
\
\


### Example 11: Correlation

Using only the scatterplot, what do you think is the correlation between bill_len and bill_dep:
-1, -0.95, -0.65, -0.15, 0, 0.15, 0.65, 0.95, or 1?


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!