---
title: "Sampling distributions, the CLT, and Bootstrapping (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')

# Use a color blind friendly color palette throughout doc
library(tidyverse)
cb_palette <- c("#0072B2", "#D55E00", "black", "#E69F00", "#56B4E9", "#009E73", "#F0E442",  "#CC79A7")
scale_colour_discrete <- function(...) scale_colour_manual(values = cb_palette, ...)
scale_fill_discrete   <- function(...) scale_fill_manual(values = cb_palette, ...)
theme_set(theme_bw())
```


## Warm-Up {-}


Rivers contain small concentrations of mercury which can accumulate in fish.
Scientists studied this phenomenon among largemouth bass in the Wacamaw and Lumber rivers of North Carolina.

One goal of this study was to explore the relationship of a fish's mercury concentration (Concen) with its size, specifically its Length:


$$
E(Concen | Length) = \beta_0 + \beta_1 Length
$$


To *estimate* this relationship, they caught and evaluated 171 fish, and recorded the following:


| variable | meaning                                                  |
|:---------|:---------------------------------------------------------|
| River    | Lumber or Wacamaw                                        |
| Station  | Station number where the fish was caught (0, 1, ..., 15) |
| Length   | Fish's length (in centimeters)                           |
| Weight   | Fish's weight (in grams)                                 |
| Concen   | Fish's mercury concentration (in parts per million; ppm) |


```{r eval = TRUE}
# Load the data & packages
library(tidyverse)
fish <- read_csv("https://Mac-STAT.github.io/data/Mercury.csv")
```


### Example 1: Population parameter vs sample estimate

Plot and model the relationship of mercury concentration with length:

```{r eval = TRUE}
fish %>% 
  ggplot(aes(y = Concen, x = Length)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

fish_model <- lm(Concen ~ Length, data = fish)
coef(summary(fish_model))
```

In the summary table, is the `Length` coefficient 0.058 the *population parameter* $\beta_1$ or a *sample estimate* $\hat{\beta}_1$?


### Example 2: Sampling distributions

Since we don't know $\beta_1$, we *can't* know the exact error in $\hat{\beta}_1$!
This is where **sampling distributions** come in.

They describe how estimates $\hat{\beta}_1$ might vary from sample to sample, thus how far these estimates might fall from $\beta_1$ (i.e. the **standard error**):

![](https://Mac-STAT.github.io/images/155/sampling_distribution_schema.png)

For each of the concepts in the diagram above (Superpopulation, Finite Population, Sample), these represent for this specific data example:

> Superpopulation: The true underlying process that governs the relationship between mercury concentration and length of fish, for every fish that has ever existed and ever will exist in these two rivers!

> Finite Population: At the time the data was collected, the true observed relationship between mercury concentration and length of fish, for all fish in the two rivers *at that time*. 

> Sample: In our data, the observed/estimated relationship between mercury concentration and length of fish (this is $\hat{\beta}_1$!).


### Example 3: Approximating the sampling distribution

In practice, we only take *1 sample* of data (e.g. 1 sample of 171 fish).
We can't go out and take every possible sample of 171 fish!

- Thus we don't have access to the sampling distribution...
- Thus we don't actually know the *standard error* of the sample estimates...
- Thus we need to *approximate* the sampling distribution / standard error using our 1 sample.

We'll take 2 different approaches: 

1. **Central Limit Theorem (CLT)** (theory / formula-based approach)

When our sample size n is "large enough", we might approximate the sampling distribution using the CLT:

$$\hat{\beta}_1 \sim N(\beta_1, \text{ standard error}^2)$$

The standard error in the CLT is approximated from our *sample* via some formula $c / \sqrt{n}$ where "c" is complicated.
It's reported in a model `summary()` table!


2. **Bootstrapping** (simulation approach)

In practice, we can't simulate the sampling distribution by repeatedly taking samples from the population -- we only take 1 sample of size n!!

But we *can* **REsample** our sample:   

- we sample our sample **WITH replacement** (you'll explore why below)
- each REsample is of size n, the same size as the original sample (so that we can assess the potential error associated with samples the same size as ours)

This helps us approximate the *shape* of the sampling distribution and the *standard error* of our estimate:

![](https://bcheggeseth.github.io/155_spring_2026/images/bootstrapping_flow.png)

If it feels somewhat magical to you that this works out, that's a very reasonable feeling.

The saying "*to pull oneself up by the bootstraps*" is often attributed to Rudolf Erich Raspe's 1781 *The Surprising Adventures of Baron Munchausen* in which the character pulls himself out of a swamp by his hair (not bootstraps).
In short, it means to get something from nothing, through your own effort:

![](https://larspsyll.wordpress.com/wp-content/uploads/2015/12/muenchhausen_herrfurth_7_500x789.jpg?w=300&h=500)


In this spirit, statistical bootstrapping **doesn't make any probability model assumptions**. It uses only the information from our one sample to approximate standard errors.


### REFLECT: Approximating the sampling distribution -- CLT vs Bootstrapping

Great! We have two options. Here are some things to think about / reflect on:

- CLT   
    - The *quality* of this approximation hinges upon the validity of the Central Limit *Theorem* which hinges upon the validity of the *theoretical* model assumptions, as well as a "large enough" sample size n.
    - The CLT uses theoretical formulas for the standard error estimates, thus can feel a little mysterious without a solid foundation in probability theory.
    - BUT: When the theory is "valid", nothing beats a CLT approximation!!!
    
- Bootstrapping
    - The statistical theory behind bootstrapping is quite complicated, and there are certain obscure cases (none that we will encounter in STAT 155) where the assumptions underlying bootstrapping fail to hold.
    - It cannot and should not *replace* the CLT. BUT, it but gives us some nice intuition behind the idea of *resampling*, which is fundamental for hypothesis testing (which we'll get to shortly!).
    

> **Reflect:** Before testing them out, what questions do you have about either approach? What do you think would help you build more intuition for the CLT and/or bootstrapping? Does one approach resonate with you more than the other?


## Exercises {-}

**Goals:**

- Learn how to implement the bootstrapping technique.
- Use bootstrapping to explore the impact of sample size on standard errors, and the sampling distribution more generally.
- Compare bootstrapping and CLT approximations.


### Exercise 1: Why "resampling" (`replace = TRUE`)?

Recall that in bootstrapping, we **REsample** our sample -- we sample it with replacement.
Let's wrap our minds around this idea using a small example of just the first 4 fish in our data:

```{r}
small_sample <- fish %>% 
  select(Concen, Length) %>% 
  head(4)

small_sample
```

This sample produces an estimated `Length` coefficient of 0.06532:

```{r}
lm(Concen ~ Length, data = small_sample)
```


a. The chunk below samples 4 fish *without* replacement from our `small_sample` of 4 fish. Run it several times, then discuss:

- What do you observe about the samples themselves?
- If we used each one of these samples to estimate the model of `Concen` by `Length`, what do you anticipate will happen?

```{r}
small_sample %>% 
  sample_n(size = 4, replace = FALSE)
```

b. Test your intuition. The chunk below samples 4 fish from the `small_sample` *without* replacement (`replace = FALSE`), then uses it to estimate the model of `Concen` by `Length`. Run it several times, then discuss:

- What do you observe about the sample models?
- What's the problem with this?

```{r}
small_sample %>% 
  sample_n(size = 4, replace = FALSE) %>% 
  with(lm(Concen ~ Length))
```


c. Sampling our sample *without* replacement merely returns our original sample, thus does NOT give us any sense of sampling variability!
Instead, *REsample* 4 fish from our `small_sample` *with* replacement.
Run it several times.
What do you notice about the samples themselves?


```{r}
small_sample %>% 
  sample_n(size = 4, replace = TRUE)
```

Similarly, what do you notice about the sample models built from the REsamples?

```{r}
small_sample %>% 
  sample_n(size = 4, replace = TRUE) %>% 
  with(lm(Concen ~ Length))
```


::: {.callout-note title = "Reflect: Resampling"}

*REsampling* our sample provides insight into sampling variability, hence potential error in our sample estimates.
Each REsample might include some original sample subjects several times, and others not at all.
But this is a good thing!

- Observing a subject multiple times in a REsample mimics the idea that there are several *separate but similar* subjects in the population!
- Sampling with replacement ensures that our resampled observations are *independent*, which we need in order for bootstrapping to "work"!


:::


### Exercise 2: Run a bootstrap simulation

Recall that we can sample, say, 10 observations from our dataset *with replacement* using `sample_n()`:
    
```{r}
# Run this chunk a few times to explore the different samples you get
fish %>% 
  sample_n(size = 10, replace = TRUE)
```
    
We can also take a sample and then use the data to estimate the model:
    
```{r}
# Run this chunk a few times to explore the different sample models you get
fish %>% 
  sample_n(size = 10, replace = TRUE) %>% 
  with(lm(Concen ~ Length))
```
    
We can also take *multiple* unique samples and build a sample model from each.

The code below obtains *500* separate samples of 10 fish, and stores the model estimates from each:
    
```{r}
# Set the seed so that we all get the same results
set.seed(155)

# Store the sample models
sample_models_10 <- map_df(1:500, function(i){
    fish %>% 
    sample_n(size = 10, replace = TRUE) %>% 
    lm(Concen ~ Length, data = .) %>% 
    coef()
})


# Check it out
head(sample_models_10)
dim(sample_models_10)
```
    

a. What's the point of the `map_df()` function?!? If you've taken any COMP classes, what process do you think `map_df()` is a shortcut for?


b. What is stored in the `Intercept` and `Length` columns of the results?
    
    
c. We'll obtain a **bootstrapping distribution** of $\hat{\beta}_1$ by taking many (500, in this case) different REsamples and exploring the degree to which $\hat{\beta}_1$ varies from sample to sample. REMEMBER: Each bootstrap resample must be the *same size as our original dataset*: 171 fish (not 10). Why? We're trying to understand the potential error in using our sample of 171 fish, not 10 fish! 

Edit the code below to obtain a bootstrapping distribution.

```{r}
# Set the seed so that we all get the same results
set.seed(155)

# Store the sample models
sample_models_boot <- map_df(1:___, function(i){
    fish %>% 
    sample_n(size = ___, replace = TRUE) %>% 
    lm(Concen ~ Length, data = .) %>% 
    coef()
})

```


### Exercise 3: Bootstrap distribution & standard error

Check out the resulting **500** bootstrapped sample models.
The *red line* represents the model estimate calculated from our *original* sample of 171 fish (not the population model, which we don't know!):

```{r}
fish %>% 
  ggplot(aes(x = Length, y = Concen)) + 
  geom_smooth(method = "lm", se = FALSE) +
  geom_abline(data = sample_models_boot, 
              aes(intercept = `(Intercept)`, slope = Length), 
              color = "gray", size = 0.25) + 
  geom_smooth(method = "lm", color = "red", se = FALSE)
```    

Let's focus on the slopes of these 500 sample models.

A plot of the 500 slopes *approximates* features of the **sampling distribution** of the sample slopes.

```{r}
sample_models_boot %>% 
  ggplot(aes(x = Length)) + 
  geom_density() + 
  geom_vline(xintercept = 0.05813, color = "red") 
```    

Describe the sampling distribution: 

a. What's its general shape? 

b. Where is it roughly centered? Is this value the "true" population slope $\beta_1$ or our sample slope estimate $\hat{\beta}_1$?

c. Roughly what's its spread / i.e. what’s the range of estimates you observed?

d. For a more rigorous assessment of the spread among the bootstrap slopes, calculate their standard deviation. This provides a bootstrap *approximation* of the **standard error** of the slope estimates. IMPORTANT RECALL: The standard error measures the typical distance of a sample estimate from the actual population value. Approximating the standard error, hence understanding the potential error in *our* sample estimate, is a key goal here!

```{r}
sample_models_boot %>% 
  ___(sd(Length))
```
 

### Exercise 4: CLT distribution & standard error

Recall that the CLT provides a formula / theory-based alternative to approximating features of the sampling distribution.
It assumes that, so long as our sample size is "big enough", the sampling distribution of the sample slope $\hat{\beta}_1$ will be Normally distributed around the population slope $\beta_1$ with some standard error:


$$\hat{\beta}_1 \sim N(\beta_1, \text{ standard error}^2)$$

The *CLT-related approximation* of the **standard error** of the slope estimates is calculated via a complicated *formula*, not simulation. This is reported in the model `summary()` table, in the `Std. Error` column.

```{r}
coef(summary(fish_model))
```


a. Is the CLT assumption of Normality consistent with the *shape* of the bootstrapping distribution in the previous exercise?

b. Are the CLT and bootstrap estimates of the standard error roughly equivalent? 

c. Want more intuition into the CLT? Watch this video explanation using bunnies and dragons: [https://www.youtube.com/watch?v=jvoxEYmQHNM](https://www.youtube.com/watch?v=jvoxEYmQHNM)


### Exercise 5: Using the CLT

Plugging in the estimated standard error of 0.005, the complete CLT approximation of the sampling distribution of $\hat{\beta}_1$ is:

$$\hat{\beta}_1 \sim N(\beta_1, 0.005^2)$$
Use this result with the 68-95-99.7 property of the Normal model to understand the potential error in a slope estimate.    
a. There are many possible samples of 171 fish. What percent of these will produce an estimate $\hat{\beta}_1$ that's within 0.010, i.e. *2 standard errors*, of the actual population slope $\beta_1$?


b. More than 2 standard errors from $\beta_1$?


c. More than 0.015, i.e. 3 standard errors, *above* $\beta_1$?
    
    
### Exercise 6: CLT and the 68-95-99.7 Rule

Fill in the blanks below to complete some general properties assumed by the CLT:    

- ___% of samples will produce $\hat{\beta}_1$ estimates within **1 st. err.** of $\beta_1$

- ___% of samples will produce $\hat{\beta}_1$ estimates within **2 st. err.** of $\beta_1$

- ___% of samples will produce $\hat{\beta}_1$ estimates within **3 st. err.** of $\beta_1$


### Exercise 7: Increasing sample size

Now that we trust bootstrapping simulations to provide reasonable insight into sampling distributions (they agree with the CLT!), let's use them to explore the impact of **sample size** on the quality of our sample estimates.
Recall our different (bootstrap) sample estimates when we started with a sample of 171 fish:

```{r}
# All bootstrap sample models
fish %>% 
  ggplot(aes(x = Length, y = Concen)) + 
  geom_smooth(method = "lm", se = FALSE) +
  geom_abline(data = sample_models_boot, 
              aes(intercept = Intercept, slope = Length), 
              color = "gray", size = 0.25)
```    

```{r}
# Slopes of all bootstrap sample models
sample_models_boot %>% 
  ggplot(aes(x = Length)) + 
  geom_density()
```    

Suppose that instead of starting with n = 171 fish, we only had a sample of n = 50 or n = 20 fish!
What impact do you *anticipate* this having on our sample estimates:          

- If we had a smaller sample, how would it impact the sample model lines (top plot): Do you expect there to be more or less variability among the sample model lines?

- If we had a smaller sample, how would it impact the sampling distribution of the sample slopes (bottom plot):   
    - Around what value to you expect it be centered?
    - What general shape do you expect it to have?
    - Do you expect it to be narrower (with smaller standard error) or wider (with larger standard error)?


### Exercise 8: 500 samples of size n

Let's decrease the sample size in our simulation.
First, take 500 REsamples from `fish`, but this time make each resample just 50 fish.
Then build a sample model from each sample:

```{r}
set.seed(155)
sample_models_50 <- map_df(1:500, function(i){
    fish %>% 
    sample_n(size = ___, replace = TRUE) %>% 
    ___(Concen ~ Length, data = .) %>% 
    ___()
})


# Check it out
# Make sure that your first Intercept estimate is -1.9990216	
head(sample_models_50)
```


Similarly, take 500 REsamples of size *20* from `fish`, then build a sample model from each sample:


```{r}
set.seed(155)
sample_models_20 <- map_df(1:500, function(i){
    fish %>% 
    sample_n(size = ___, replace = TRUE) %>% 
    ___(Concen ~ Length, data = .) %>% 
    ___()
})

# Check it out
# Make sure that your first Intercept estimate is -2.2383596	
head(sample_models_20)
```


### Exercise 9: Impact of sample size (part I)

Use the 3 plots below to compare and contrast the 500 sets of sample models when using samples of size 20, 50, and 171.
What happens as we increase sample size?!
Was this what you expected?

```{r}
# 500 sample models using samples of size 20
fish %>% 
  ggplot(aes(x = Length, y = Concen)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_abline(data = sample_models_20, 
              aes(intercept = `(Intercept)`, slope = Length), 
              color = "gray", size = 0.25) + 
  lims(x = c(25, 65), y = c(-1, 5))
```


```{r}
# 500 sample models using samples of size 50
fish %>% 
  ggplot(aes(x = Length, y = Concen)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_abline(data = sample_models_50, 
              aes(intercept = `(Intercept)`, slope = Length), 
              color = "gray", size = 0.25) + 
  lims(x = c(25, 65), y = c(-1, 5))
```

```{r}
# 500 sample models using samples of size 171
fish %>% 
  ggplot(aes(x = Length, y = Concen)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_abline(data = sample_models_boot, 
              aes(intercept = `(Intercept)`, slope = Length), 
              color = "gray", size = 0.25) + 
  lims(x = c(25, 65), y = c(-1, 5))
```


### Exercise 10: Impact of sample size (part II)

Using the 3 plots below, let's focus on just the sampling distributions of our 500 slope estimates $\hat{\beta}_1$, i.e. the slopes of the lines in the above plots.
How do the shapes, centers, and spreads of these sampling distributions compare?
Was this what you expected?

```{r}
# 500 sample slopes using samples of size 20
sample_models_20 %>% 
  ggplot(aes(x = Length)) + 
  geom_density() +
  lims(x = c(-0.01, 0.13), y = c(0, 70))
```

```{r}
# 500 sample slopes using samples of size 50
sample_models_50 %>% 
  ggplot(aes(x = Length)) + 
  geom_density() +
  lims(x = c(-0.01, 0.13), y = c(0, 70))
```

```{r}
# 500 sample slopes using samples of size 171
sample_models_boot %>% 
  ggplot(aes(x = Length)) + 
  geom_density() +
  lims(x = c(-0.01, 0.13), y = c(0, 70))
```


### Exercise 11: Properties of sampling distributions

In light of your observations, complete the following statements about the sampling distribution of the sample slope.    
  
a. For all sample sizes, the shape of the sampling distribution is roughly ___ and the sampling distribution is roughly centered around ___, the sample estimate from our original data.  

b. As sample size increases:    
    The average sample slope estimate INCREASES / DECREASES / IS FAIRLY STABLE.    
    The standard error of the sample slopes INCREASES / DECREASES / IS FAIRLY STABLE.
    
c. Thus, as sample size increases, our sample slopes become MORE RELIABLE / LESS RELIABLE.
  

## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!