---
title: "Hypothesis testing: discovery (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')

# Use a color blind friendly color palette throughout doc
library(tidyverse)
cb_palette <- c("#0072B2", "#D55E00","black", "#E69F00", "#56B4E9", "#009E73", "#F0E442",  "#CC79A7")
scale_colour_discrete <- function(...) scale_colour_manual(values = cb_palette, ...)
scale_fill_discrete   <- function(...) scale_fill_manual(values = cb_palette, ...)
theme_set(theme_bw())
```


## Warm-up {-}

Let's return to the `fish` dataset. Recall that rivers contain small concentrations of mercury which can accumulate in fish. Scientists studied this phenomenon among 171 largemouth bass in the Wacamaw and Lumber rivers of North Carolina, recording the following:


| variable | meaning                                                  |
|:---------|:---------------------------------------------------------|
| River    | Lumber or Wacamaw                                        |
| Station  | Station number where the fish was caught (0, 1, ..., 15) |
| Length   | Fish's length (in centimeters)                           |
| Weight   | Fish's weight (in grams)                                 |
| Concen   | Fish's mercury concentration (in parts per million; ppm) |

```{r eval=TRUE}
# Load the data & packages
library(tidyverse)
fish <- read_csv("https://mac-stat.github.io/data/Mercury.csv")

head(fish)
```



**GOAL**

Use this sample data to make *inferences* about the relationship of mercury concentration (`Concen`) with the size of a fish, as measured by its `Length`.
This relationship can be described by the *population model*:

$$
E[Concen | Length] = \beta_0 + \beta_1 Length
$$

Our *sample model* is below:

$$
E[Concen | Length] = \hat{\beta}_0 + \hat{\beta}_1 Length = -1.132 + 0.058 Length
$$


```{r eval=TRUE}
fish_mod_1 <- lm(Concen ~ Length, data = fish)
summary(fish_mod_1)
```


### Example 1: Review

Notation:

- $\beta_1$ = the *population parameter* of interest whose "true" value is **unknown**
- $\hat{\beta}_1$ = 0.058 is our sample *estimate* of $\beta_1$
- $SE(\hat{\beta}_1)$ = 0.005 is the *standard error* of $\hat{\beta}_1$


a. Interpret $\hat{\beta}_1$.


b. Using the 68-95-99.7 rule, construct an *approximate* 95% confidence interval for $\beta_1$. That is, calculate an *interval estimate* of $\beta_1$.


c. Compare your CI to an *exact* 95% CI:

```{r}
___(___, level = 0.95)
```

d. Interpret the 95% CI in context.



### Example 2: Translating a hypothesis to notation

Our work above provides point and interval **estimates** of $\beta_1$.
But we also want to test **hypotheses** related to $\beta_1$.

Our fundamental hypothesis here is that, among the fish *population*, mercury levels are associated with fish length -- there's a *relationship* between the two.

How can we state this hypothesis using notation? Choose 1.

- $\beta_1 = 0$
- $\beta_1 \ne 0$
- $\hat{\beta}_1 = 0$
- $\hat{\beta}_1 \ne 0$
- $\beta_0 = 0$
- $\beta_0 \ne 0$
- $\hat{\beta}_0 = 0$
- $\hat{\beta}_0 \ne 0$

NOTE: We'll refer to 0 as the **null value** to which we're comparing our population parameter of interest.


### Example 3: Evaluating a hypothesis using the CI

Since we don't have access to the entire fish population, we must evaluate our hypothesis using our sample data.
We have to ask: does our sample data provide enough *evidence* to support the hypothesis that, among the fish *population*, not just our sample, mercury levels are associated with fish length?

a. Explain why we *can't* use our simple estimate $\hat{\beta}_1$ to evaluate this hypothesis, but we *can* use the 95% CI for $\beta_1$.
(This is true in general, not just this fish example!)

b. So, does your 95% CI from Example 1 support the hypothesis that there's a relationship between mercury levels and fish length? Explain.

c. Similarly, do the *confidence bands* below support the hypothesis?

```{r eval=TRUE}
fish %>% 
  ggplot(aes(y = Concen, x = Length)) + 
  geom_smooth(method = "lm") 
```





**FORMAL HYPOTHESIS TESTS**

Though CIs can help us evaluate hypotheses, **hypothesis tests** provide a more formal approach.
The general idea:

1. Assume that the hypothesis is *not* true.    
2. Determine what sample observations would be *expected* under this assumption.    
3. Evaluate how compatible *our* observed sample data are with this assumption (are they consistent with what we'd expect?).

Or, in context:

1. Assume there were truly *no* association between mercury concentration and fish length, i.e. $\beta_1 = 0$ (the "null value").
2. Determine how our fish sample might have "behaved" if there were truly no association.
3. Evaluate how compatible our fish sample observations are with the null value (are they consistent with what we'd expect if there were no association?).


## Exercises {-}

**NOTE**

- The goal of these exercises is to explore *concepts* in hypothesis testing.
    There's some new code that helps with this exploration, but the *code is not the point*!!
    Throughout, focus on the *goal* of the code and what its *output* tells us, NOT the *details* of the code.
    
- Our results will start to use more **scientific notation**. Examples:   
    - `3e-02` = 0.03 (move the decimal 2 places to the *left*)
    - `3e02` = 300 (move the decimal 2 places to the *right*)
    - `4.12e-07` = 0.000000412 (move the decimal 7 places to the *left*)
    - `4.12e03` = 4120 (move the decimal 3 places to the *right*)
    - `< 2e-16` = the result is *less than* 0.0000000000000002 (it's really really small!)
    

### Exercise 1: Simulating data under the null value

To begin testing our *hypothesis* that there *is* an association between mercury concentration and fish length ($\beta_1 \ne 0$), let's explore how our sample data might behave IF in fact there were *no* association ($\beta_1 = 0$).
We'll first do this using *simulation*, and then through *theory*.

a. Read carefully:

- Suppose that when each individual fish in our sample was collected, the scientists wrote its `Concen` and `Length` on a piece of paper.

- They then ripped the paper in half, with the observed `Concen` on one half and the observed `Length` on the other.

- They then accidentally dropped the pieces of paper! In the resulting mess, they couldn't tell which `Length` observation went with each `Concen` observation.

- To cover up their mistake, they just randomly matched up the `Length` and `Concen` pieces of paper. Essentially, in this shuffling process, the scientists broke the bond of the original `Length` and `Concen` pairs.

Question: Intuitively, if the researchers plotted the new pairs of `Concen` and `Length` values, what do you think their scatterplot would look like? What about their sample model?


b. We can simulate this idea in RStudio! Throughout, focus on concepts over code.

```{r}
# First check out the first 6 fish
fish %>% 
  select(Concen, Length) %>% 
  head()
```

Now, randomly shuffle the `Length` values!
That is, break the original `Concen` and `Length` pairs, and just randomly assign `Length`:

```{r}
# Run this chunk a few times!!
# Shuffle the Length values
fish %>%
  select(Concen, Length) %>% 
  mutate(Length = sample(Length, size = length(Length), replace = FALSE)) %>% 
  head()
```

Let's check out the sample models that arise from the shuffled `Concen` and `Length` pairs.
Summarize your observations! Was your intuition in Part a correct?

```{r}
# Shuffle the Length values
# Run this chunk a few times!!
# Then plot the resulting sample data and model
fish %>% 
  select(Concen, Length) %>% 
  mutate(Length = sample(Length, size = length(Length), replace = FALSE)) %>% 
  ggplot(aes(x = Length, y = Concen)) + 
  geom_point() +
  geom_smooth(method = "lm")
```


c. These shuffled samples give us a sense of the sample models we'd expect IF $\beta_1$ were actually 0, i.e. if there were no relationship between `Concen` and `Length` in the broader population of fish.
To get a sense of the range of possible outcomes in this scenario, let's simulate a bunch of shuffled samples.
Specifically, take 500 different shuffled samples and use each to estimate the model of `Concen` by `Length` (gray lines).
Summarize your observations!!
If there were truly no relationship between `Concen` and `Length`, how would we expect the sample data to behave?

```{r}
set.seed(1)
shuffled_models <- map_df(1:500, function(i){
  fish %>% 
  select(Concen, Length) %>% 
  sample_n(size = length(Length), replace = TRUE) %>% 
  mutate(Length = sample(Length, size = length(Length), replace = FALSE)) %>%
  lm(Concen ~ Length, data = .) %>%
  coef()
}
)
head(shuffled_models)
```


```{r}
# Plot the 500 shuffled models
fish %>% 
  ggplot(aes(x = Length, y = Concen)) + 
  geom_abline(data = shuffled_models, 
              aes(intercept = `(Intercept)`, slope = Length), 
              color = "gray", size = 0.25) + 
  geom_smooth(method = "lm", se = FALSE, size = 0) # Ignore this line. It's a clunky workaround
```

d. Finally, focus on just the *slope* estimates $\hat{\beta}_1$ behind the lines above. Summarize your observations!! If there were truly no relationship between `Concen` and `Length`, i.e. $\beta_1$ were truly 0, how would we expect the sample slopes to behave?

```{r}
shuffled_models %>% 
  ggplot(aes(x = Length)) + 
  geom_density()
```



### Exercise 2: Comparing our sample results to the null value (intuition)

Now that we've simulated how sample data might behave if there were truly no relationship between `Concen` and `Length` (i.e. if $\beta_1$ were 0), let's consider the next important step in a **hypothesis test**:

Evaluate how compatible our observed sample data are with this assumption that $\beta_1 = 0$!


a. Let's start with a visual assessment. Is *our* sample model (blue line) consistent / compatible with the "null models" simulated under the assumption the $\beta_1 = 0$ (gray lines)?


```{r}
fish %>% 
  ggplot(aes(x = Length, y = Concen)) + 
  geom_abline(data = shuffled_models, 
              aes(intercept = `(Intercept)`, slope = Length), 
              color = "gray", size = 0.25) + 
  geom_smooth(method = "lm", se = FALSE)
```


b. Consider another visual assessment, focusing just on the *slopes* of the lines in the plot above, i.e. our sample estimates $\hat{\beta}_1$.
Is *our* sample slope of $\hat{\beta}_1 = 0.05813$ (blue line) consistent / compatible with the collection of slopes from the "null models" (density plot)?

```{r}
shuffled_models %>% 
  ggplot(aes(x = Length)) + 
  geom_density() + 
  geom_vline(xintercept = 0.05813, color = "blue")
```


c. In Part b, you evaluated the compatibility of *our* sample slope $\hat{\beta}_1 = 0.05813$ model with slopes of the "null models" using visual cues alone. But for a formal hypothesis test, we need some *numerical* measures of compatibility. Brainstorm some ideas!




### Exercise 3: CLT

Now that we've built up some intuition using *simulation*, let's formalize these concepts with some *theory*.

You shouldn't be surprised to know that sampling distributions are at the heart of hypothesis testing! The distribution of slopes from our shuffled sample models uses *simulation* to *approximate* the sampling distribution of slope estimates $\hat{\beta}_1$ we'd expect if the population slope $\beta_1$ were actually 0 (the null value):

```{r}
shuffled_models %>% 
  ggplot(aes(x = Length)) + 
  geom_density()
```

We can also approximate the sampling distribution using mathematical *theory*, specifically the Central Limit Theorem.
If $\beta_1$ were truly 0, then the distribution of possible sample estimates $\hat{\beta}_1$ would be Normally distributed around 0:

$$
\hat{\beta}_1 \sim N(0, SE(\hat{\beta}_1)^2)
$$

where in Example 1 we approximated the standard error to be 0.005.
Thus a plot of the CLT is below.
Confirm that this is *similar* to the simulated sampling distribution above:

```{r}
# Don't worry about this code!!!
# Save your plot
clt_plot <- data.frame(x = 0 + c(-4:4)*0.005) %>% 
  mutate(y = dnorm(x, sd = 0.005)) %>% 
  ggplot(aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 0.005)) +
  geom_segment(aes(x = x, xend = x, y = 0, yend = y), linetype = "dashed") + 
  scale_x_continuous(breaks = c(-4:4)*0.005) + 
  xlab("")
  
clt_plot
```



### Exercise 4: Comparing our sample results to the null value (test statistic)

*Our* sample slope estimate of $\hat{\beta}_1 = 0.05813$ is shown on the sampling distribution of sample slopes $\hat{\beta}_1$ that we'd expect if the population slope $\beta_1$ were actually 0 (the null value):

```{r fig.width = 6}
clt_plot + 
  geom_vline(xintercept = 0.05813, color = "blue")
```

To measure how compatible (or incompatible) our estimate is with the null value, we can measure the *distance* between the two.

a. *Our* sample slope estimate of $\hat{\beta}_1 = 0.05813$ is 0.05813 units (ppm/cm) from the null value of 0.

- In the context of this particular analysis, is that a big or a small number? *How can you tell?* That is, what info are you using to make this assessment?

- Would your answer be the same if you observed this estimate in another context?


b. In practice, we *standardize* our distance calculations so that they have the same meaning no matter the context of the analysis. We can do this by calculating a z-score:

$$
\frac{\hat{\beta}_1 - \text{null value}}{SE(\hat{\beta}_1))} = \text{ number of SE that $\hat{\beta}_1$ falls from the null value}
$$

This is called the **test statistic**.
Calculate *and interpret* the test statistic for our example.


c. Within rounding, the same test statistic appears in our model summary table!
Where is it?!

```{r}
coef(summary(fish_mod_1))
```

d. What can we conclude from your test statistic calculation and interpretation:

- Our sample data is *not* consistent with the null value of $\beta_1 = 0$, i.e. the idea that there's no significant association between mercury concentration and length.

- Our sample data *is* consistent with the null value of $\beta_1 = 0$.





### Exercise 5: Comparing our sample results to the null value (p-value)


Again, *our* sample slope estimate of $\hat{\beta}_1 = 0.05813$ is shown on the sampling distribution of sample slopes $\hat{\beta}_1$ that we'd expect if the population slope $\beta_1$ were actually 0 (the null value):

```{r fig.width = 6}
clt_plot + 
  geom_vline(xintercept = 0.05813, color = "blue")
```


a. Another way to evaluate the compatibility of our estimate with the null value of $\beta_1 = 0$ is to ask how likely we are to have gotten this estimate if the null value were indeed true! That is, what's the *probability* that we would have gotten a sample slope that's at least 0.058 above *or* below 0 IF in fact there were *no* association between mercury concentration and length? Equivalently, what's the *probability* that we would have gotten a test statistic so far from 0 IF in fact $\beta_1$ were 0? 

Use the 68-95-99.7 Rule with the plot above to *approximate* the p-value:

- less than 0.003
- between 0.003 and 0.05
- between 0.05 and 0.32
- bigger than 0.32


b. The calculation above is called a **p-value**. A more accurate p-value is reported in the model summary table in the `Pr(>|t|)` column. What is it?

```{r}
coef(summary(fish_mod_1))
```

c. How can we interpret the p-value? IMPORTANT: Think about the steps that went into calculating the p-value!!

- It's very unlikely that we’d have observed such a steep increase in `Concen` with `Length` among our sample fish "by chance", i.e. if in fact there were no relationship between mercury concentration and length in the broader fish population.

- Given what we observed in our sample data, it's very unlikely that mercury concentration is associated with length.

- Given what we observed in our sample data, it’s very unlikely that mercury concentration and length are unrelated.

PAUSE: Check the solutions to this question! p-values are commonly misinterpreted.



### Exercise 6: River test Part I

The fish in our sample come from two different rivers: Lumber and Wacamaw.
**Research question:** Is there evidence that the mercury concentration in fish differs between the 2 rivers?

To answer this question, we'll control for fish `Length` -- we don't want to mistakenly detect a difference in mercury levels just because we happened to sample bigger fish in one river and smaller fish in the other.
The relevant population model is:

$$
E[Concen | River, Length] = \beta_0 + \beta_1 RiverWacamaw + \beta_2 Length
$$

a. To address the above research question, which model coefficient / population parameter is of primary interest: $\beta_0$, $\beta_1$, or $\beta_2$? And what's the *null value* of this coefficient? That is, what would this coefficient be if there were truly no difference in mercury concentration between the 2 rivers (when controlling for length)?

b. For the parameter of interest, obtain and report a sample estimate and its corresponding standard error: 

```{r}
fish_mod_2 <- lm(___, data = fish)

coef(summary(fish_mod_2))
```

c. Let's use this data to evaluate our research question. To do so, remember the next step: Assume that the null value from Part a is correct and determine what sample estimates we would *expect* to get in this scenario.

Adjust the code below to sketch the appropriate sampling distribution.
Represent *our* sample estimate by a blue line.

```{r}
# Put OUR sample estimate here
# Round to 3 digits
est <- ___

# Put the corresponding standard error here
# Round to 3 digits
se <- ___

data.frame(x = 0 + c(-4:4)*se) %>% 
  mutate(y = dnorm(x, sd = se)) %>% 
  ggplot(aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = se)) +
  geom_segment(aes(x = x, xend = x, y = 0, yend = y), linetype = "dashed") + 
  scale_x_continuous(breaks = c(-4:4)*se) + 
  geom_vline(xintercept = est, color = "blue") +
  labs(y = "density", x = "possible estimates IF the null value were true") 
```


### Exercise 7: River test Part II

Remember the next step: We must evaluate how compatible our sample estimate is with the null value.
We'll take 3 approaches: using a visual assessment, test statistic, and p-value.

a. Visually, in Part c of the previous exercise, does our sample estimate seem compatible with the null value? Mainly, is our estimate consistent with what we'd expect to observe if there were truly no difference in the mercury concentration between the 2 rivers (when controlling for fish length)?

b. Let's focus on the test statistic:

- Obtain the *test statistic* from the model summary table below.
- Show *how* this was calculated from the sample estimate and its standard error.
- *Interpret* the test statistic.

```{r}
coef(summary(fish_mod_2))
```


c. Recall that the p-value calculates the *probability* that, IF in fact the null value were correct (i.e. there was actually *no* difference in mercury concentration by river when controlling for length), we would have gotten a sample estimate that's as far from the null value as ours, either above or below. 

Use the 68-95-99.7 Rule with the plot in Part c of the previous exercise to approximate the p-value:

- less than 0.003
- between 0.003 and 0.05
- between 0.05 and 0.32
- bigger than 0.32

Now obtain and report the exact p-value from the model summary table.

```{r}

```


d. Do your test statistic and p-value indicate that your estimate is or is not compatible with the null value? Explain.


e. Putting this all together, what do you conclude? When controlling for length...

- we have evidence of a statistically significant difference in mercury concentration by river.
- we do not have sufficient evidence to conclude that there's a statistically significant difference in mercury concentration by river.



### Exercise 8: River CI

a. Finally, construct a 95% CI for the population parameter of interest in our river analysis.

```{r}

```

b. What do you conclude from this CI? When controlling for length...

- we have evidence of a statistically significant difference in mercury concentration by river.
- we do not have sufficient evidence to conclude that there's a statistically significant difference in mercury concentration by river.

c. Your conclusions from the hypothesis test (test statistic and p-value) should agree with those from your CI! If this is not the case, review your work.



## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!


