---
title: "Simple linear regression: Visualization and Introduction (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')
```


::: {.callout-note title="Organize your files"}

This **qmd file** is where you'll type notes, code, etc.
Directions:

- Save this file in the `inclass_activities` sub-folder of the `stat155` folder you created before today's class. Use a file name related to the activity number and/or today's date (e.g., "activity 3" or "3 simple linear regression").
:::


## Exercises {-}

**Context:** Today we'll explore powerlifting competition data. The data originally came from [Kaggle](https://www.kaggle.com/open-powerlifting/powerlifting-database) and [OpenPowerlifting](https://www.openpowerlifting.org/). Our main goal will be to explore the relationship between strength (`TotalKg`) and body weight (`BodyweightKg`). Read in the data below.

```{r}
# Load packages and import data
library(tidyverse)

lifts <- read_csv("https://mac-stat.github.io/data/powerlifting.csv")
```

### Exercise 1: Get to know the data

a. Create a new code chunk by clicking the green "C" button with a green + sign in the top right of the menu bar. In this code chunk, use an appropriate function to look at the first few rows of the data.

b. Create a new code chunk, and use an appropriate function to learn how much data we have (in terms of cases and variables).

c. What does a case represent?

d. Navigate to the [FAQ page](https://www.openpowerlifting.org/faq) and read the response to the "How does this site work? Do you just download results from the federations?" question. What do you learn about data quality and completeness from this response?


### Exercise 2: Get to know the outcome/response variable

`TotalKg` is of primary interest -- how much a competitor lifted in total.
Let's get acquainted with this variable.

a. Construct an appropriate plot to visualize the distribution of this variable, and compute appropriate numerical summaries. THINK: Is `TotalKg` categorical or quantitative?

```{r}

```


b. In the tidyverse, `select()` subsets our data to include only certain columns / variables of interest, whereas `filter()` subsets our data to include only certain rows / cases of interest. If we wanted to explore the lifters with the greatest totals, we could filter to see only those lifts where `TotalKg` is greater than 1000 kg:

```{r}
lifts %>% 
  filter(TotalKg > 1000)
```
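
To see the contrast between `select()` and `filter()` side by side, here is a minimal sketch on a small made-up data frame. The name `toy_lifts` and all of its values are hypothetical, invented for illustration; they are not taken from the powerlifting data.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_lifts <- tibble(
  Name = c("A", "B", "C"),
  BodyweightKg = c(60, 90, 120),
  TotalKg = c(400, 700, 1050)
)

# select() keeps columns: all 3 rows remain, but only 2 variables
toy_columns <- toy_lifts %>%
  select(Name, TotalKg)

# filter() keeps rows: all 3 variables remain, but only rows with TotalKg > 1000
toy_rows <- toy_lifts %>%
  filter(TotalKg > 1000)

# Compare the number of rows and columns in each result
dim(toy_columns)
dim(toy_rows)
```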


c. Write a good paragraph interpreting the plot and numerical summaries.


d. What follow-up questions do you have about `TotalKg`?!


### Exercise 3: Data visualization - two quantitative variables

A natural follow-up question is: what *explains* why `TotalKg` varies from competitor to competitor?
For example, is `TotalKg` *related* to body weight?
Let's explore by visualizing the *relationship* of `TotalKg` with body weight.
A **scatterplot** (or informally, a "point cloud") allows us to do this! 


a. Run all chunks below and add a brief comment on the outcome of the code:

```{r}
# ???
lifts %>%
  ggplot(aes(x = BodyweightKg, y = TotalKg))
```

```{r}
# ???
lifts %>%
  ggplot(aes(x = BodyweightKg, y = TotalKg)) +
  geom_point()
```

```{r}
# ???
lifts %>%
  ggplot(aes(x = BodyweightKg, y = TotalKg)) +
  geom_point(alpha = 0.1)
```


b. This is your first **bivariate** data visualization (visualization for two variables)! What differences do you notice in the code structure when creating a bivariate visualization, compared to univariate visualizations we've worked with before? 

c. What similarities do you notice in the code structure?

d. Does there appear to be some sort of **pattern** in the structure of the point cloud? Describe it, in no more than three sentences! Comment on:    
    - the *shape* of the relationship between the two variables (curved? linear?)
    - the *direction* of the relationship between the two variables (positive? negative?)
    - the *strength* of the relationship (are the points dispersed? close together?)


### Exercise 4: Scatterplots - patterns in point clouds

Adding a **smoothing** line to our scatterplot can sometimes help illustrate a pattern in our point cloud: 

```{r}
# scatterplot with smoothing line
lifts %>%
  ggplot(aes(x = BodyweightKg, y = TotalKg)) +
  geom_point(alpha = 0.5) +
  geom_smooth()
```

a. Review your answer to Exercise 3. Does the smoothing line assist you in seeing a pattern, or change your answer at all? Why or why not?

b. Does there appear to be a **linear** relationship between body weight and `TotalKg` (i.e. would a straight line do a decent job of summarizing the relationship between these two variables)? Why or why not?

### Exercise 5: Correlation

In a previous exercise you were asked about the *strength* and *direction* of the relationship between `TotalKg` and `BodyweightKg`.
We can more precisely answer these questions using a numerical summary known as **correlation** (sometimes known as a "correlation coefficient" or "Pearson's correlation").
It works as follows:

1. Properties   
    Correlation ranges from -1 to 1. A correlation of 0 indicates that there is *no* **linear** relationship between the two quantitative variables. A correlation of -1 or 1 indicates a *perfect* **linear** relationship. 

2. Strength   
    - Correlation measures whether the **linear** relationship between x and y is **strong**, **weak**, or **moderate**. This has to do with how **dispersed** our point clouds are around the trend line.
    - Stronger correlations will be *further* from 0 (closer to -1 or 1).

3. Direction    
    - Correlation indicates whether the **linear** relationship between x and y is **positive** or **negative**. That is, does y go "up" when x goes "up" (positive), or does y go "down" when x goes "up" (negative)?
    - *Positive* and *negative* correlations will have the appropriate respective sign (above or below zero).
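
These properties can be checked on tiny made-up vectors. The sketch below uses hypothetical toy values (the name `x_sym` and its values are invented for illustration, not taken from the powerlifting data) together with the base R `cor()` function:

```{r}
# Hypothetical toy values, invented for illustration only
x_sym <- c(-3, -2, -1, 0, 1, 2, 3)

# Perfect positive linear relationship: correlation of 1
r_pos <- cor(x_sym, 2 * x_sym + 1)

# Perfect negative linear relationship: correlation of -1
r_neg <- cor(x_sym, -2 * x_sym + 1)

# Perfect *curved* relationship (a parabola): y is completely determined by x,
# yet the correlation is essentially 0 because the relationship is not linear
r_curve <- cor(x_sym, x_sym^2)

c(positive = r_pos, negative = r_neg, curved = r_curve)
```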


Let's explore...

a. Rather than a smooth trend line, we can force the line we add to our scatterplots to be *linear* using `geom_smooth(method = 'lm')`, as below:

```{r}
# scatterplot with LINEAR trend line
lifts %>%
  ggplot(aes(x = BodyweightKg, y = TotalKg)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")
```

b. Based on the above scatterplot, how would you describe the correlation between body weight and `TotalKg`, in terms of strength and direction?
c. Make a guess as to what *numerical value* the correlation between body weight and `TotalKg` will have, based on your response to part (b).


### Pause: Theory Box

So how is correlation actually *calculated*? The "Theory Box" below provides the formula behind this number. As with other theory boxes, you are *not* required to memorize anything in it, nor will you be assessed on it. If you plan on *continuing* with Statistics courses at Macalester (or are interested in the fundamental theory behind everything!), these theory boxes are for you!

::: {.callout-note title="Theory Box: Calculating correlation"}

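The Pearson correlation coefficient, $r_{x, y}$, of $x$ and $y$ is (almost) the **average** of the products of the z-scores of variables $x$ and $y$:

$$
r_{x, y} = \frac{\sum z_x z_y}{n - 1}
$$

Here $n$ is the number of cases, and each z-score measures how many standard deviations a value is from its mean: $z_x = (x - \bar{x}) / s_x$ and $z_y = (y - \bar{y}) / s_y$, where $s_x$ and $s_y$ are the sample standard deviations. It is only "almost" an average because we divide by $n - 1$ rather than $n$.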
:::
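
For the curious, the formula in the Theory Box can be checked by hand in R. This is a minimal sketch on hypothetical toy vectors (`x_toy` and `y_toy` are made up for illustration), using `mean()` and `sd()` to build the z-scores and then comparing against the built-in `cor()` function:

```{r}
# Hypothetical toy values, invented for illustration only
x_toy <- c(1, 2, 3, 4, 5)
y_toy <- c(2, 1, 4, 3, 6)
n_toy <- length(x_toy)

# z-scores: subtract the mean, then divide by the standard deviation
z_x <- (x_toy - mean(x_toy)) / sd(x_toy)
z_y <- (y_toy - mean(y_toy)) / sd(y_toy)

# (almost) average of the products of the z-scores
r_by_hand <- sum(z_x * z_y) / (n_toy - 1)

# Compare with R's built-in function
r_builtin <- cor(x_toy, y_toy)
all.equal(r_by_hand, r_builtin)
```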


### Exercise 6: Computing correlation in R

We can compute the correlation between body weight and `TotalKg` using the `summarize()` and `cor()` functions.
Is the computed correlation close to what you guessed earlier?


```{r}
# correlation

# Note: the order in which you put your two quantitative variables into the cor
# function doesn't matter! Try switching them around to confirm this for yourself.
# Because of the missing data, we need to include use = "complete.obs";
# otherwise the correlation would be computed as NA.
lifts %>%
  summarize(cor(TotalKg, BodyweightKg, use = "complete.obs"))
```
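
To see why `use = "complete.obs"` matters, here is a small sketch with hypothetical toy vectors (`toy_weight` and `toy_total` are made up for illustration) in which one value is missing:

```{r}
# Hypothetical toy values, invented for illustration only; one weight is missing
toy_weight <- c(60, 75, NA, 90, 110)
toy_total  <- c(350, 420, 500, 510, 640)

# Default behavior: any missing value makes the result NA
r_default <- cor(toy_total, toy_weight)

# complete.obs: drop the pairs with a missing value, then compute the correlation
r_complete <- cor(toy_total, toy_weight, use = "complete.obs")

# Same answer as removing the incomplete pair (the 3rd one) by hand
r_manual <- cor(toy_total[-3], toy_weight[-3])

c(default = r_default, complete = r_complete, manual = r_manual)
```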


### Exercise 7: Limitations of correlation

We previously noted that correlation was a numerical summary of the **linear** relationship between two variables. We'll now go through some examples of relationships between quantitative variables to demonstrate why it is *incredibly* important to visualize our data *in addition to* just computing numerical summaries!

For this exercise, we'll be working with the `anscombe` dataset, which is built in to R. To load this dataset into our environment, we run the following code:

```{r}
# load anscombe data
data("anscombe")
```

The `anscombe` dataset contains four different pairs of quantitative variables:

- `x1`, `y1`
- `x2`, `y2`
- `x3`, `y3`
- `x4`, `y4`

Adapt the code we used in Exercise 6 to compute the correlation between each of these four pairs of variables, below:

```{r}
# correlation between x1, y1

# correlation between x2, y2

# correlation between x3, y3

# correlation between x4, y4

```

a. What do you notice about each of these correlations (if the answer to this isn't obvious, double-check your code)? 
b. Describe these correlations in terms of strength and direction, using only the numerical summary to assist you in your description.
c. Draw an example on the white board or at your tables of what you **think** the point clouds for these pairs of variables might look like. 
d. Make *four* distinct scatterplots, one for each pair of quantitative variables in the `anscombe` dataset. You do not need to add a smooth trend line or a linear trend line to these plots.

```{r}
# scatterplot: x1, y1

# scatterplot: x2, y2

# scatterplot: x3, y3

# scatterplot: x4, y4

```

e. Based on the above correlations and scatterplots, what is the *message* of this last exercise as it relates to the limits of correlation?


### Exercise 8: Discovery -- Lines of "best fit"

In this activity, we've learned how to fit straight lines to data, to help us visualize the relationship between two quantitative variables. So far, `ggplot` has chosen the line for us. How does it know which line is "best", and what does "best" even mean?

Consider the relationship between `x1` and `y1` in the `anscombe` dataset. Run the following code, which uses the function `geom_abline` to add a line to a scatterplot of our data:

```{r}
# scatterplot with a fitted line, whose slope is 0.4 and intercept is 3
anscombe %>%
  ggplot(aes(x = x1, y = y1)) +
  geom_point() +
  geom_abline(slope = 0.4, intercept = 3, col = "blue", linewidth = 1)
```

Describe the line that you see. Do you think the line is "good"? What are you using to define "good"?
Some things to think about:

- How many points are **above** the line?
- How many points are **below** the line?
- Are the **distances** of the points above and below the line roughly similar, or is there meaningful difference?

Now add *another* line to our plot. Which line do you think is *better* suited for this data? Why? Be specific!

```{r}
# scatterplot with two candidate lines:
# blue (slope 0.4, intercept 3) and orange (slope 0.5, intercept 4)
anscombe %>%
  ggplot(aes(x = x1, y = y1)) +
  geom_point() +
  geom_abline(slope = 0.4, intercept = 3, col = "blue", linewidth = 1) +
  geom_abline(slope = 0.5, intercept = 4, col = "orange", linewidth = 1)
```

It's usually quite simple to note when a line is *bad*, but more difficult to quantify when a line is a *good* fit for our data. Consider the following line:

```{r}
# scatterplot with a line whose slope is -0.5 and intercept is 10
anscombe %>%
  ggplot(aes(x = x1, y = y1)) +
  geom_point() +
  geom_abline(slope = -0.5, intercept = 10, col = "red", linewidth = 1)
```

In the next activity, we'll formalize the **principle of least squares**, which will give us one particular definition of a *line of best fit* that is commonly used in statistics! We'll take advantage of the vertical distances between each point and the fitted line (**residuals**), which will help us define (mathematically) a line that best fits our data:

```{r}
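anscombe %>%
  # add the predicted y1 value from the least squares line for each case
  mutate(.fitted = predict(lm(y1 ~ x1, .))) %>%
  ggplot(aes(x = x1, y = y1)) +
  geom_smooth(method = "lm", se = FALSE) +
  # draw each residual: a vertical segment from the point to the line
  geom_segment(aes(xend = x1, yend = .fitted), col = "red") +
  geom_point()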
```
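
As a preview, the idea of using residuals to judge a line can be sketched numerically. The helper `sum_sq_resid()` below is a hypothetical function written for this illustration (it is not part of any package); it adds up the squared vertical distances between the points and a line with a given slope and intercept:

```{r}
# Hypothetical helper, written for illustration only:
# sum of squared residuals for the line y = intercept + slope * x
sum_sq_resid <- function(slope, intercept, x, y) {
  predicted <- intercept + slope * x
  sum((y - predicted)^2)
}

# The blue line from earlier in this exercise (slope 0.4, intercept 3)
ssr_blue <- sum_sq_resid(0.4, 3, anscombe$x1, anscombe$y1)

# The line chosen by lm(), the same line geom_smooth(method = "lm") draws
fit <- lm(y1 ~ x1, data = anscombe)
ssr_lm <- sum_sq_resid(coef(fit)[["x1"]], coef(fit)[["(Intercept)"]],
                       anscombe$x1, anscombe$y1)

c(blue = ssr_blue, lm = ssr_lm)
```

Under the least squares definition, a smaller sum means a better-fitting line, and no other straight line has a smaller sum than the one from `lm()`.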


### Reflection

Much of statistics is about making (hopefully) reasonable assumptions in an attempt to summarize observed relationships in data. Today we started considering assumptions of *linear* relationships between quantitative variables.

Review the learning objectives at the top of this file and today's activity. How do you imagine assumptions of linearity might be useful in terms of quantifying relationships between quantitative variables? How do you imagine these assumptions could sometimes fall short, or even be unethical in certain cases?

> **Response:** Put your response here.


## Extra practice {-}

### Exercise 9: Correlation and **extreme** values

Let's explore how correlation changes with the addition of **extreme** values, or observations. We'll begin by generating a *toy* dataset called `dat` with two quantitative variables, `x` and `y`. Run the code below to create the dataset. 

**While not required, recall that you can look up function documentation in R by typing `?` in front of a function name to figure out what that function is doing!**

```{r}
# create a toy dataset
set.seed(1234)
dat <- data.frame(x = rnorm(100, mean = 5, sd = 2)) %>% 
  mutate(y = -3 * x + rnorm(100, sd = 4))
```

a. Make a scatterplot of `x` vs. `y`.

```{r}
# scatterplot

```

b. Based on your scatterplot, describe the correlation between `x` and `y` in terms of strength and direction.
c. Guess the correlation (the numerical value) between `x` and `y`.
d. Compute the correlation between `x` and `y`. Was your guess from part (c) close?

```{r}
# correlation

```

e. Suppose we observe an additional observation with `x = 15` and `y = -45`. We can create a new data frame, `dat_new1`, that contains this observation in addition to the original ones as follows:

```{r}
# creating dat_new1
new_observation <- data.frame(x = 15, y = -45)
dat_new1 <- dat %>% 
  rbind(new_observation)
```

Now make a scatterplot of `x` vs. `y` for this new data frame, and compute the correlation between `x` and `y`. Did your correlation change very much with the addition of this observation? Hypothesize why or why not.

```{r}
# scatterplot

# correlation
```

f. Suppose instead of our additional observation having values `x = 15` and `y = -45`, we instead observe `x = 15` and `y = -15`. We can create a new data frame, `dat_new2`, that contains this observation in addition to the original ones as follows:

```{r}
# creating dat_new2
new_observation <- data.frame(x = 15, y = -15)
dat_new2 <- dat %>% 
  rbind(new_observation)
```

Now make a scatterplot of `x` vs. `y` for this new data frame, and compute the correlation between `x` and `y`. Did your correlation change very much with the addition of *this* observation? Hypothesize why or why not.

```{r}
# scatterplot

# correlation
```

g. What do you think the takeaway message is of this exercise? 

h. **Challenge** Add linear trend lines to your scatterplots from parts (e) and (f). Does this give you any additional insight into why the correlations may have changed in different ways with the addition of a new observation?


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions on the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!