---
title: "Confidence intervals (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')

# Use a color blind friendly color palette throughout doc
library(tidyverse)
cb_palette <- c("#0072B2", "#D55E00", "black", "#E69F00", "#56B4E9", "#009E73", "#F0E442",  "#CC79A7")
scale_colour_discrete <- function(...) scale_colour_manual(values = cb_palette, ...)
scale_fill_discrete   <- function(...) scale_fill_manual(values = cb_palette, ...)
theme_set(theme_bw())
```


## Exercises {-}

### Exercise 1: Confidence Interval Review

In the first set of exercises, we'll model the time it takes to complete a mountain hike.
To begin, let's explore the relationship of completion `time` (in hours) by hike `length` (in miles):

E[time | length] = $\beta_0$ + $\beta_1$ length

A sample *estimate* of this population model, obtained using our data on hiking trails in the Adirondack mountains, is below:

E[`time` | `length`] = $\hat{\beta}_0$ + $\hat{\beta}_1$ `length` = 2.048 + 0.684 `length`

```{r}
# Import the data & load important packages
library(tidyverse)
peaks <- read.csv("https://mac-stat.github.io/data/high_peaks.csv")

# Model the relationship
peaks_model_1 <- lm(time ~ length, data = peaks)
coef(summary(peaks_model_1))

# Visualize the relationship
peaks %>% 
  ggplot(aes(y = time, x = length)) + 
  geom_point() + 
  geom_smooth(method = "lm")
```

#### Part a

$\hat{\beta}_1$ simply provides a *point estimate*, or our single best guess, of $\beta_1$.
To also produce an *interval estimate*, use the 68-95-99.7 Rule to approximate a 95% CI for $\beta_1$.


#### Part b

We can calculate a more *accurate* CI by applying the `confint()` function to our *model*.
Your approximation from Part a should be close!

```{r}
confint(peaks_model_1, level = 0.95)
```    


### Exercise 2: Interpreting the CI

#### Part a

*Interpreting* the CI for $\beta_1$ in context requires that we can interpret $\beta_1$ itself!
So how can we interpret $\beta_1$ (in general, without assuming a specific value for the unknown $\beta_1$)? Choose 1.

- $\beta_1$ measures the expected completion time for hikes that are 0 miles long
- $\beta_1$ measures the difference in the expected completion time for hikes that long vs hikes that aren't long
- $\beta_1$ measures the change in the expected completion time for each additional 1 mile in length


#### Part b

Per the previous exercise: "We are 95% confident that $\beta_1$ is between 0.56 and 0.81".
Interpret this CI in *context*, drawing on your answer to Part a.


### Exercise 3: time versus elevation

Next, let's explore the relationship between the completion `time` and maximum `elevation` of a hike (in feet): 

E[`time` | `elevation`] = $\beta_0$ + $\beta_1$ `elevation`
        
We can gain insight into this relationship using our sample data:

```{r}
# Model the relationship
peaks_model_2 <- lm(time ~ elevation, data = peaks)

# Get confidence intervals
confint(peaks_model_2, level = 0.95)

# Plot the sample model
ggplot(peaks, aes(y = time, x = elevation)) + 
  geom_point() + 
  geom_smooth(method = "lm")
```

What can we conclude from *both* the CI for $\beta_1$ and the confidence bands in the plot above?    


### Exercise 4: time versus rating

Next, let's explore the relationship between the completion `time` and hike `rating` (difficult, easy, or moderate): 

E[`time` | `rating`] = $\beta_0$ + $\beta_1$ `ratingeasy` + $\beta_2$ `ratingmoderate`

**PAUSE:** Before analyzing this model, pause to reflect on what $\beta_0$, $\beta_1$, and $\beta_2$ *mean* in this context.
We can gain insight into these coefficients using our sample data:

```{r}
# Visualize the relationship
ggplot(peaks, aes(y = time, x = rating)) + 
  geom_boxplot()

# Model the relationship
peaks_model_3 <- lm(time ~ rating, data = peaks)

# Obtain CIs for the model coefficients
confint(peaks_model_3)
```


#### Part a

How can we interpret the confidence interval for $\beta_2$, the `ratingmoderate` coefficient:  (-5.92, -3.19)?
We're 95% confident that... (Choose 1)   

- the average completion time of a moderate hike is between 3.19 and 5.92 hours    
- the average completion time of a moderate hike is between 3.19 and 5.92 hours *less* than that of a difficult hike    
- the average completion time of moderate hikes in our sample was between 3.19 and 5.92 hours *less* than that of difficult hikes in our sample    

#### Part b

How can we interpret the confidence interval for the *intercept* $\beta_0$: (13.8, 16.2)?
We're 95% confident that... (Choose 1)      

- the average completion time of a hike with a 0 rating is between 13.8 and 16.2 hours.    
- the average completion time of a difficult hike is between 13.8 and 16.2 hours.
- the average completion time of a hike is between 13.8 and 16.2 hours.
- the average completion time of a difficult hike is between 13.8 and 16.2 hours *more* than that of an easy hike.

### Exercise 5: Multiple logistic regression REVIEW

Let's turn our attention to weather in 3 Australia locations, Hobart, Uluru, and Wollongong: When controlling for location, is today's 9am humidity level (0-100%) a "significant" predictor of whether or not it will rain tomorrow?
The following population model represents this relationship:

log(odds of `rain_tmrw`) = $\beta_0$ + $\beta_1$ `locationUluru` + $\beta_2$ `locationWollongong` + $\beta_3$ `humidity9am`

Below we build a sample model using a sample of 500 data points:


```{r}
# NOTE: We'll take a sub-sample of 500 days for TEACHING PURPOSES ONLY
# In practice, we'd use all data in our sample :)
set.seed(155)
weather <- read_csv("https://mac-stat.github.io/data/weather_3_locations.csv") %>% 
  mutate(rain_tmrw = (raintomorrow == "Yes")) %>% 
  sample_n(size = 500, replace = FALSE)

# Build the sample model
weather_model_1 <- glm(rain_tmrw ~ location + humidity9am, data = weather, family = "binomial")
coef(summary(weather_model_1))

# Exponentiate the coefficients
exp(coef(weather_model_1))
```

#### Part a

Why are we using logistic regression?


#### Part b

Interpret $\hat{\beta}_0$, the estimated intercept, on the odds scale.
Circle any option that's correct!
**On a day with 0 humidity at 9am in Hobart...**

- The odds of rain are 0.015.
- The chance of rain is 0.015.
- The chance of rain is 1.5% as high as the chance of no rain.
- The odds of rain increase by 0.15%.

#### Part c

Interpret $\hat{\beta}_1$, the estimated `locationUluru` coefficient, on the odds scale.
**When controlling for 9am humidity levels...**

- the odds of rain are 0.60 in Uluru.
- the odds of rain are 60% higher in Uluru than in Hobart.
- the odds of rain in Uluru are 60% as high as (hence 40% lower than) the odds in Hobart.
- the chance of rain in Uluru is 60% as high as (hence 40% lower than) the chance that it doesn't rain.


#### Part d

Interpret $\hat{\beta}_3$, the estimated `humidity9am` coefficient, on the odds scale.
**When controlling for location...**

- the odds of rain increase by 4% for every additional 1-percentage point increase in 9am humidity.
- the odds of rain increase by 1.04 for every additional 1-percentage point increase in 9am humidity.
- the odds of rain are 1.04.


\
\
\
\


### Exercise 6: Inference for logistic regression

Let's focus on $\beta_1$, the "true" `locationUluru` coefficient.
In the previous exercise, we *estimated* $\beta_1$ to be -0.504 (or 0.604 on the odds scale).

#### Part a

*On the log(odds) scale*, the 95% CI for $\beta_1$ is roughly

$$
-0.504 \pm 2*0.414 = (-1.332, 0.324)
$$

More accurately:

```{r}
confint(weather_model_1)
```

What can we conclude?

- We do *not* have evidence that the chance of rain significantly differs in Uluru and Hobart.
- We do *not* have evidence that the chance of rain significantly differs in Uluru and Hobart when controlling for 9am humidity (i.e. when the 9am humidity is the same in both locations).
- We *do* have evidence that the chance of rain significantly differs in Uluru and Hobart.
- We *do* have evidence that the chance of rain significantly differs in Uluru and Hobart when controlling for 9am humidity.


#### Part b

We can also convert the 95% CI to the *odds scale* by exponentiating the limits:

$$
(e^{-1.332}, e^{0.324}) = (0.264, 1.38)
$$

More accurately:

```{r}
exp(confint(weather_model_1))
```

What can we conclude?

- We do *not* have evidence that the chance of rain significantly differs in Uluru and Hobart when controlling for 9am humidity.
- We *do* have evidence that the chance of rain significantly differs in Uluru and Hobart.


#### Part c

Your answers to Parts a and b should agree!
Changing between the log(odds) and odds scales *does* change interpretations, but *doesn't* change our conclusions.
To answer Part a, it was important to check whether **0** was in the CI for $\beta_1$ (on the log(odds) scale).
Equivalently, in Part b, what value did you check for in the CI for $e^{\beta_1}$ (on the odds scale)?


\
\
\
\


### Exercise 7: Comparing models

#### Part a

As shown in Part b of the previous exercise, a 95% CI for the *exponentiated* `humidity9am` coefficient is (1.025, 1.058).
So, is `humidity9am` a "significant" predictor of rain in this model?
REMEMBER: Check for "1" not "0"!


#### Part b

Let's add another predictor to our model: `humidity3pm`, the humidity level at 3pm.

```{r}
weather_model_2 <- glm(rain_tmrw ~ location + humidity9am + humidity3pm, data = weather, family = "binomial")
exp(confint(weather_model_2))
```

Using the 95% CI for the `humidity9am` coefficient (exponentiated), is `humidity9am` a "significant" predictor of rain in this model?


#### Part c

In Parts a and b, you should have concluded that `humidity9am` was a "significant" predictor of rain in `weather_model_1` but *not* in `weather_model_2`.
How could this be?!?


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!