---
title: "The Normal model & sampling variation (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')

# Use a color blind friendly color palette throughout doc
library(tidyverse)
cb_palette <- c( "#0072B2", "#D55E00", "black", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#CC79A7")
scale_colour_discrete <- function(...) scale_colour_manual(values = cb_palette, ...)
scale_fill_discrete   <- function(...) scale_fill_manual(values = cb_palette, ...)
theme_set(theme_bw())
```


## Install new package!

We'll need a new package today: `gsheet`.
Install this package by typing the following in your **CONSOLE**. Do NOT put this in your qmd: 

`install.packages("gsheet")`


## Warm-up {-}


Today marks a significant shift in the semester!

Thus far, we’ve focused on **exploratory analysis**: visualizations, numerical summaries, and models that capture the *observed* patterns in our sample data.

Today, we’ll move on to **inference**: using our data to make conclusions about the broader population from which it was taken.

\
\

**Context:**

The `fandango` dataset in the `fivethirtyeight` package includes information about the ratings of 143 films released in 2014-2015 across the review sites Fandango, IMDB, RottenTomatoes, and MetaCritic.

Our goal will be to model the relationship between the RottenTomatoes "Tomatometer" rating (an aggregate of official film critic scores), and the RottenTomatoes "User Rating," which averages the ratings of anyone who uploaded their rating to the site.

```{r}
# Load the data
library(fivethirtyeight)
data(fandango)

fandango <- fandango %>%
  select(film,
          userscore_rt = rottentomatoes_user,
          criticscore_rt = rottentomatoes)

# Plot the relationship
fandango %>% 
  ggplot(aes(x = criticscore_rt, y = userscore_rt)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

# Model the relationship
m1 <- lm(userscore_rt ~ criticscore_rt, data = fandango)
summary(m1)
```

\
\


**POPULATION vs SAMPLE**

This dataset is by no means a comprehensive collection of films and their review scores--it does not contain every film that was released from 2014-2015, nor films released outside of that date range. The review scores are also frozen in time--all of these films have almost certainly accumulated additional reviews since the data were first collected. 

However, our stated goal is to make inferences about the overarching relationship between critic reviews and user reviews for all films (relatedly, we may want to use our model to make predictions about how user reviews are affected by critic reviews for films that may not even exist yet!). Can we actually make these inferences/predictions about a potentially infinite collection of films when all we have is a fairly limited subset of these?


This question points to two of the most important concepts in the field of statistical inference: *populations* and *samples*.

Statisticians have many different ways of defining a population (depending on the questions they are asking), but for the purposes of this exercise:

- *population* = the set of all possible films and all possible review scores that have been *or could be* catalogued on RottenTomatoes

- *sample* = our dataset of 146 films is a *sample* from this population, i.e. a subset of observations in the population. 


::: {.callout-tip icon=false collapse="false" title="Sampling Criteria"}

When we take a sample of data from a population, there is always some set of criteria used to determine how a sample is taken. This could be as simple as "we randomly selected 1% of all films catalogued on RottenTomatoes as of 4/1/2025", or a more complex set of specific criteria (for [this dataset](https://github.com/fivethirtyeight/data/tree/master/fandango), the sample was taken by selecting all films that had tickets for sale on Fandango on 8/24/2015, then further filtering to include films that have a Rotten Tomatoes rating, a RT User rating, a Metacritic score, a Metacritic User score, an IMDb score, and at least 30 fan reviews on Fandango.) 

:::

\
\


**PARAMETERS vs ESTIMATES**

In order to conduct statistical inference using linear regression, we must assume that there is some *true*, underlying, fixed intercept and slope $\beta_0$ and $\beta_1$, that describe the true linear relationship in the overall population that we're interested in.

For example, in modeling the relationship between UserScore and CriticScore on RottenTomatoes, the "true" underlying model we assume is:

$$
E[UserScore \mid CriticScore] = \beta_0 + \beta_1 CriticScore
$$

However, the "true" values of $\beta_0$ and $\beta_1$ are typically impossible to know! Knowing them would require access to our entire population of interest (in this case, the review scores for every film that has been or will be released). When we fit a regression model using the sample that we **do** have, we are actually obtaining **estimates** of those true population parameters (note the notation change of putting a $\hat{ }$ on top of the Betas, to indicate that this is an estimate):

$$
E[UserScore \mid CriticScore] = \hat{\beta}_0 + \hat{\beta}_1 CriticScore
$$

In summary:

- $\beta$ = the "true" but *unknown* **population parameter**
- $\hat{\beta}$ = a **sample estimate**, i.e. an *estimate* of $\beta$ calculated using our sample data


**!!PRETEND!!**

For the sake of this activity, let's assume that our sample estimates, obtained above, are identical to the true population parameters. That is, assume that the "true" population model is:

$$
E[UserScore \mid CriticScore] = \beta_0 + \beta_1 CriticScore = 32.3 + 0.52 CriticScore
$$

🚩🚩🚩 WARNING: Be very careful not to make this assumption in other models you encounter. For this dataset, recall the specific sampling criteria that were used, which means these 146 films likely aren't representative of the full population of films we're interested in. This means that the estimates we obtained probably *don't* match the true population parameters--they may or may not be close, but we don't know for certain! 🚩🚩🚩 


### Example 1: Intuition

Below, we'll *simulate* how parameter estimates are impacted by taking different samples. You'll each take a random sample of 10 films in the dataset, and we'll see if we can recover the presumed population parameters (i.e., the coefficient estimates we obtained from our model using all 146 films that were initially sampled).

What's your intuition:

- Do you think every student will get the same set of 10 films?

- Do you think that *your* coefficient estimates will be the same as your neighbors'?


### Example 2: Random sampling

Use the `sample_n()` function to take a *random sample* of 2 films   

```{r}
# Try running the following chunk A FEW TIMES
fandango %>% 
  sample_n(size = 2, replace = FALSE)
```

Reflect: 
 
- How do your results compare to your neighbors'?

- What is the role of `size = 2`? *HINT*: Remember you can look at function documentation by running `?sample_n` in the console!

- What is the role of `replace = FALSE`? *HINT*: Remember you can look at function documentation by running `?sample_n` in the console!


### Example 3: Setting the seed

Now, "set the seed" to 155 and re-try your sampling. 

```{r}
# Try running the following FULL chunk A FEW TIMES
set.seed(155)
fandango %>% 
  sample_n(size = 2, replace = FALSE)
```


What changed?


### Example 4: Take your own sample

The underlying *random number generator* plays a role in the random sample we happen to get.
If we `set.seed(some positive integer)` before taking a random sample, we'll get the same results.

This **reproducibility** is important:    
    
- we get the same results every time we render our qmd

- we can share our work with others & ensure they get our same answers

- it wouldn't be great if you submitted your work to, say, a journal and weren't able to back up / confirm / reproduce your results!    


Follow the chunks below to obtain and use your *own* unique sample.

```{r}
# DON'T SKIP THIS STEP! 
# Set the random number seed to your own phone number (just the numbers)
# NOTE: The point is to use a different number than other students.
# Any 7 digits of your number, that don't START with 0, are fine
set.seed(___)

# Take a sample of 10 films
my_sample <- fandango %>% 
  sample_n(size = 10, replace = FALSE)
my_sample                       
```    


```{r}
# Plot the relationship of UserScore with CriticScore among your sample
my_sample %>% 
  ggplot(aes(y = userscore_rt, x = criticscore_rt)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```

```{r}
# Model the relationship among your sample
my_model <- lm(userscore_rt ~ criticscore_rt, data = my_sample)
coef(summary(my_model))
```


**REPORT YOUR WORK**

Log your intercept and slope *sample estimates* in [this survey](https://forms.gle/UScGPwN3zCYshwih6).


## Exercises {-}

### Exercise 1: Using the Normal model

Before getting back to the experiment, let's practice the concepts you explored in today’s video.

First, run the following code which defines a `shaded_normal()` function which is specialized to this activity alone:

```{r}
library(tidyverse)
shaded_normal <- function(mean, sd, a = NULL, b = NULL){
  min_x <- mean - 4*sd
  max_x <- mean + 4*sd
  a <- max(a, min_x)
  b <- min(b, max_x)
  ggplot() + 
    scale_x_continuous(limits = c(min_x, max_x), breaks = c(mean - sd*(0:3), mean + sd*(1:3))) +
    stat_function(fun = dnorm, args = list(mean = mean, sd = sd)) + 
    stat_function(geom = "area", fun = dnorm, args = list(mean = mean, sd = sd), xlim = c(a, b), fill = "blue") + 
    labs(y = "density") + 
    theme_minimal()
}
```

Next, suppose that the speeds of cars on a highway, in miles per hour, can be reasonably represented by the Normal model with a *mean* of 55mph and a *standard deviation* of 5mph from car to car:

$$
X \sim N(55, 5^2)
$$

```{r}
shaded_normal(mean = 55, sd = 5)
```


a. Provide the (approximate) range of the middle 68% of speeds, and shade in the corresponding region on your Normal curve. NOTE: `a` is the lower end of the range and `b` is the upper end.    

```{r}
shaded_normal(mean = 55, sd = 5, a = ___, b = ___)
```
    
    
b. Use the 68-95-99.7 rule to estimate the probability that a car's speed exceeds 60mph.  

> Your response here

```{r}
# Visualize
shaded_normal(mean = 55, sd = 5, a = 60)
```    


c. Which of the following is the correct *range* for the probability that a car's speed exceeds 67mph? Explain your reasoning.

* less than 0.0015
* between 0.0015 and 0.025
* between 0.025 and 0.16
* greater than 0.16

> Explain your reasoning here

```{r}
# Visualize
shaded_normal(mean = 55, sd = 5, a = 67)
```


### Exercise 2: Z-scores

Inherently important to all of our calculations above is *how many standard deviations a value "X" is from the mean*.

This distance is called a **Z-score** and can be calculated as follows:    
    
$$
\text{Z-score} = \frac{X - \text{mean}}{\text{sd}}
$$
  

For example (from Exercise 1), if I'm traveling 40 miles an hour, my Z-score is -3. That is, my speed is 3 standard deviations *below* the average speed:    
    
```{r}
(40 - 55) / 5
```

a. Consider 2 other drivers. Both drivers are speeding. Who do you think is speeding *more*, relative to the distributions of speed in their area?  

    - Driver A is traveling at 60mph on the highway where speeds are N(55, 5^2) and the speed limit is 55mph.
    - Driver B is traveling at 36mph on a residential road where speeds are N(30, 3^2) and the speed limit is 30mph.
    
> Put your best guess (hypothesis) here

b. Calculate the Z-scores for Drivers A and B.

```{r}
# Driver A


# Driver B


```

c. Now, based on the Z-scores, who is speeding more? NOTE: The below plots might provide some insights.   

```{r}
# Driver A
shaded_normal(mean = 55, sd = 5) + 
  geom_vline(xintercept = 60) + 
  lims(x = c(18, 75), y = c(0, 0.15)) + 
  labs(title = "Driver A")

# Driver B
shaded_normal(mean = 30, sd = 3) + 
  geom_vline(xintercept = 36) + 
  lims(x = c(18, 75), y = c(0, 0.15)) + 
  labs(title = "Driver B")
```

    
> Your response here


### Exercise 3: Sampling variation

Recall that we are assuming the **population parameters** are equal to the estimates we obtained from the model we fit using the initial sample of 146 films:

$$
E[UserScore \mid CriticScore] = 32.3 + 0.52 CriticScore
$$

Let's explore how our **sample estimates** of these parameters *varied* from student to student:


```{r}
# Import the experiment results
library(gsheet)
results <- gsheet2tbl('https://docs.google.com/spreadsheets/d/11OT1VnLTTJasp5BHSKulgJiCbSLiutv8mKDOfvvXZSo/edit?usp=sharing')
```

    
Plot each student's *sample estimate* of the model line (gray).
How do these compare to the assumed *population* model (red)?    

```{r}
fandango %>% 
  ggplot(aes(y = userscore_rt, x = criticscore_rt)) +
  geom_abline(data = results, aes(intercept = sample_intercept, slope = sample_slope, linetype=section), color = "gray") + 
  geom_smooth(method = "lm", color = "red", se = FALSE)
```
    

### Exercise 4: Sample intercepts

Let's focus on just the *sample estimates* of the *intercept* parameter:

```{r}
results %>% 
  ggplot(aes(x = sample_intercept)) + 
  geom_density() + 
  geom_vline(xintercept = 32.3, color = "red")
```    


Comment on the shape, center, and spread of these sample estimates and how they relate to the (assumed) population intercept (red line).

> Your response here


### Exercise 5: Slopes

Suppose we were to construct a density plot of the sample estimates of the `criticscore_rt` coefficient (i.e. the slopes).

a. *Intuitively*, what *shape* do you think this plot will have?

> Your response here

b. *Intuitively*, around what value do you think this plot will be *centered*?

> Your response here

c. Check your intuition:        

```{r}
results %>% 
  ggplot(aes(x = sample_slope)) + 
  geom_density() + 
  geom_vline(xintercept = 0.52, color = "red")
```    


d. Thinking back to the 68-95-99.7 rule, visually approximate the *standard deviation* among the sample slopes.

> Your response here


### Exercise 6: Standard error

You've likely observed that the *typical* or mean slope estimate is roughly equal to the (assumed) population slope parameter of 0.52:

```{r}
results %>% 
  summarize(mean(sample_slope))
```

Thus the *standard deviation* of the slope estimates measures how *far* we might expect an estimate to fall from the (assumed) population slope parameter.

That is, it measures the typical or **standard error** in our sample estimates:

```{r}
results %>% 
  summarize(sd(sample_slope))
```

a. Recall *your* sample estimate of the slope. How far is it from the population slope, 0.52?
    
```{r}

```


b. How many *standard errors* does your estimate fall from the population slope? That is, what's your **Z-score**?        

```{r}

```
    

c. Reflecting upon your Z-score, do you think your sample estimate was one of the "lucky" ones, or one of the "unlucky" ones?

> Your response here


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!