---
title: "Confidence intervals (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')

# Use a color blind friendly color palette throughout doc
library(tidyverse)
cb_palette <- c("#0072B2", "#D55E00", "black", "#E69F00", "#56B4E9", "#009E73", "#F0E442",  "#CC79A7")
scale_colour_discrete <- function(...) scale_colour_manual(values = cb_palette, ...)
scale_fill_discrete   <- function(...) scale_fill_manual(values = cb_palette, ...)
theme_set(theme_bw())
```


## Exercises {-}


**Goals**

- Build up our confidence with confidence intervals (!) by starting with some familiar data and the simple linear regression setting.
- Explore how to use CIs to assess the "significance" of our sample results.


### Exercise 1: Standard errors

In the first set of exercises, we'll explore daily bikeshare ridership.
To begin, let's explore the relationship between `riders_total` and `windspeed` (in mph):

E[`riders_total` | `windspeed`] = $\beta_0$ + $\beta_1$ `windspeed`

\
\

A sample *estimate* of this population model, obtained using our sample `bikes` data, is below:

E[`riders_total` | `windspeed`] = $\hat{\beta}_0$ + $\hat{\beta}_1$ `windspeed` = 5621.15 - 87.51 `windspeed`

```{r}
# Load packages and import data
library(tidyverse)
bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv")

# Model the relationship
bikes_model_1 <- lm(riders_total ~ windspeed, data = bikes)
coef(summary(bikes_model_1))

# Visualize the relationship
bikes %>% 
  ggplot(aes(y = riders_total, x = windspeed)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```


a. Since $\hat{\beta}_1 = -87.51$, we *estimate* that the expected number of riders decreases by 87.51 for every 1mph increase in windspeed. Report and *interpret* $SE(\hat{\beta}_1)$, the *standard error* of this estimate.


b. Considering context, units, and scale of our data (as illustrated in the plot), do you think this is a small, moderate, or large amount of error? (Mainly, do you think our slope estimate is pretty accurate or does the standard error make you skeptical?)


### Exercise 2: Constructing & interpreting a CI

Continue to let $\beta_1$ be the "true" population `windspeed` coefficient, and $\hat{\beta}_1 = -87.51$ be our *sample estimate* of $\beta_1$.

a. $\hat{\beta}_1$ simply provides a *point estimate*, or our single best guess, of $\beta_1$.
To also produce an *interval estimate*, use the 68-95-99.7 Rule to approximate a 95% CI for $\beta_1$.
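
The same kind of calculation can be sketched in code. The chunk below is only an illustration: it uses R's built-in `mtcars` data as a stand-in for `bikes`, and the names `cars_model`, `cars_coefs`, `est`, `se`, and `approx_ci` are made up for this example.

```{r}
# Illustration only: R's built-in mtcars data stands in for bikes
cars_model <- lm(mpg ~ wt, data = mtcars)

# Pull the slope estimate and its standard error from the coefficient table
cars_coefs <- coef(summary(cars_model))
est <- cars_coefs["wt", "Estimate"]
se  <- cars_coefs["wt", "Std. Error"]

# Approximate 95% CI via the 68-95-99.7 Rule: estimate +/- 2 standard errors
approx_ci <- c(lower = est - 2 * se, upper = est + 2 * se)
approx_ci

# More accurate 95% CI from confint(), for comparison
confint(cars_model, level = 0.95)["wt", ]
```

The two intervals should be close but not identical, since the Rule rounds the multiplier to 2.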


b. We can calculate a more *accurate* CI by applying the `confint()` function to our *model*.
Your approximation from Part a should be close!

```{r}
confint(bikes_model_1, level = 0.95)
```    


c. *Interpreting* the CI for $\beta_1$ in context requires that we can interpret $\beta_1$ itself! So how can we interpret $\beta_1$ (in general, without assuming a specific value for the unknown $\beta_1$)? Choose 1.

- $\beta_1$ measures the expected number of riders on days with 0mph windspeed
- $\beta_1$ measures the difference in the expected number of riders on days that have a lot of wind vs days that have little wind
- $\beta_1$ measures the change in the expected number of riders for each additional 1mph in windspeed


d. Per Part b: "We are 95% confident that $\beta_1$ is between -113.88 and -61.13".
Interpret this CI in *context*, drawing on your answer to Part c.


### Exercise 3: Misinterpretations

For each of the following **MISINTERPRETATIONS** of a 95% confidence interval (a,b), explain why the statement is a misinterpretation.

- Misinterpretation 1: "There is a 95% probability that the population parameter is within (a,b)."

- Misinterpretation 2: "There is a 5% probability that the population parameter is not within (a,b)."

- Misinterpretation 3: "There is a 95% chance that the sample estimate is in (a,b)."
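
When thinking about these, it can help to see what the "95%" describes. The chunk below is a rough simulation sketch, not part of the bikeshare analysis: it assumes a made-up population in which we *know* the slope (`true_slope`), draws many samples from it, and records how often the 95% CI from a sample covers that known slope. All names and settings here are hypothetical.

```{r}
# Illustration only: simulated data from a population where we KNOW the slope
set.seed(155)
true_slope <- 2
n_samples  <- 1000

# For each simulated sample, fit the model and check whether
# its 95% CI covers the true slope
covers <- replicate(n_samples, {
  x <- runif(50, min = 0, max = 10)
  y <- 1 + true_slope * x + rnorm(50, sd = 3)
  ci <- confint(lm(y ~ x), level = 0.95)["x", ]
  ci[[1]] <= true_slope && true_slope <= ci[[2]]
})

# Proportion of the CIs that cover the true slope (should be near 0.95)
mean(covers)
```

Notice that the true slope never changes from sample to sample. What changes is the interval.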


### Exercise 4: Changing the confidence level

Our 95% CI for $\beta_1$ is (-113.88, -61.13).
What would happen if we changed the **confidence level**?!

a. If we lower our confidence level from 95% to 68%, only 68% of samples would produce 68% CIs that cover $\beta_1$.
Intuitively, would the 68% CI be narrower or wider than a 95% CI?    

b. Use the 68-95-99.7 Rule to approximate the 68% CI for $\beta_1$.

c. What if we wanted to be VERY VERY confident that our CI covered $\beta_1$? Use the 68-95-99.7 Rule to approximate the 99.7% CI for $\beta_1$.

d. What if we wanted to be *100%* confident that our CI covered $\beta_1$?! What do you think the CI would have to be?! (Use logic -- the 68-95-99.7 Rule doesn't help in this scenario.)

e. *Check* your answers to Parts b-d using `confint()`.
(Your answers should be close but not exact.)

```{r}
confint(bikes_model_1, level = 0.68)
confint(bikes_model_1, level = 0.997)
confint(bikes_model_1, level = 1)
```   
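One reason the answers are close but not exact: the 68-95-99.7 Rule rounds the number of standard errors to 1, 2, and 3, and for an `lm` model `confint()` uses t-distribution quantiles rather than Normal ones. The sketch below compares the multipliers. The names `conf_levels`, `normal_mult`, and `t_mult` are made up here, and `df = 100` is an arbitrary value chosen for illustration, not the degrees of freedom of `bikes_model_1`.

```{r}
# Confidence levels from the 68-95-99.7 Rule
conf_levels <- c(0.68, 0.95, 0.997)

# Normal-based multipliers: how many standard errors to go
# on either side of the estimate
normal_mult <- qnorm((1 + conf_levels) / 2)
round(normal_mult, 2)

# t-based multipliers; df = 100 is a made-up value for illustration
t_mult <- qt((1 + conf_levels) / 2, df = 100)
round(t_mult, 2)
```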


### Exercise 5: Trade-offs

Summarize the trade-offs in increasing confidence levels, say from 95% to 99.7%, for a CI of some population parameter $\beta$.

a. Choose the correct words for both statements. As confidence level increases...

- the percent of CIs that cover $\beta$...increases / decreases / stays the same; and
- the *width* of the CI...increases / decreases / stays the same.

b. Why is a very wide CI less useful than a narrower CI? For example, what if a pollster reported with 99.7% confidence that the support for "Candidate A" in an upcoming election is between 5% and 85%?

c. Practitioners typically use a 95% confidence level.
Comment on why you think this is.


### Exercise 6: Using CIs to test hypotheses

Recall our population model of interest:

E[`riders_total` | `windspeed`] = $\beta_0$ + $\beta_1$ `windspeed`

A typical research question here might be whether, among the *population* of days (not just those in our sample), there's a "significant" relationship between ridership and windspeed (i.e. $\beta_1 \ne 0$). Though our sample estimate *suggested* there's a negative relationship ($\hat{\beta}_1 = -87.51$), there's *error* in this estimate. So...does our sample still suggest a relationship after accounting for this potential error?!

a. The *sample model* is plotted below along with *confidence bands* that reflect its potential *error*. Based on this plot alone, what do you think? When accounting for the potential error in our sample model, do we have evidence of a "significant" relationship between ridership and `windspeed`?

```{r}
bikes %>% 
  ggplot(aes(y = riders_total, x = windspeed)) + 
  geom_point() + 
  geom_smooth(method = "lm")
```

b. Recall that our 95% CI for $\beta_1$ was roughly $(-113.88, -61.13)$. Using this CI alone, do we have evidence of a "significant" relationship between ridership and `windspeed`?

c. To answer Parts a and b, you had to make up some "rules" for using plots and CIs to evaluate the significance of $\beta_1$. In general, what were these rules?!

- If (something about the plot), then our sample data provides evidence of a "significant" relationship between Y and X.
- If (something about the CI), then our sample data provides evidence of a "significant" relationship between Y and X.

d. Your work above suggests that there’s a *statistically* significant association between ridership and `windspeed`. This merely suggests that an association *exists* ($\beta_1 \ne 0$). It does *not* necessarily mean that the *magnitude* of the association is *meaningful*, or *practically* significant, in context. Do you think that the association between `riders_total` and `windspeed` is also *practically* significant? Mainly, in the bikeshare context, is the *magnitude* of the association (a decrease between 61.13 and 113.88 riders per 1mph increase in windspeed) actually meaningful?


## Extra Practice 

The exercises below provide more practice with confidence intervals *and* other course concepts: visualizations, model building, logistic regression, causal diagrams, ....
You will likely not get through all of these during class.
That's ok!
Just remember to come back and practice after class.


### Exercise 7: More practice

**Research question:** Is the relationship between wind speed (`windspeed`) (in miles per hour) and number of riders (`riders_total`) different across weekdays and weekends?

a. Construct and interpret a visualization that would address this question.

b. Fit a regression model that would address our research question. (Should it be a linear or a logistic regression model?) Interpret only the coefficient of interest.

```{r}
mod_bikes <- ___
```


c. 

- Construct an approximate 95% confidence interval (CI) for the coefficient of interest by hand using the 68-95-99.7 rule.
- Compare your confidence interval to the one given by `confint()`, which gives an exact confidence interval. (The columns give the lower and upper ends of the CI for each coefficient.)
- Interpret the exact confidence interval in context.
- Is zero in the interval? Do we have evidence for a real difference in the windspeed-riders relationship across weekends and weekdays?

```{r}
# By hand (you fill in)


# Using confint()
confint(mod_bikes, level = 0.95)
```


d. Let's see if these results agree when looking at adjusted R-squared.

Fit another regression model that does not have the coefficient of interest from your Part b model. Compare the adjusted R-squared values between this model and the Part b model. Explain your findings.
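
As a reminder of the mechanics, adjusted R-squared can be pulled out of a model summary. The sketch below is only an illustration on R's built-in `mtcars` data, with made-up model names (`full_model`, `reduced_model`). It compares a model with and without one term, and it is not the model for this exercise.

```{r}
# Illustration only: mtcars stands in for bikes
full_model    <- lm(mpg ~ wt + hp, data = mtcars)
reduced_model <- lm(mpg ~ wt, data = mtcars)

# Adjusted R-squared for each model
summary(full_model)$adj.r.squared
summary(reduced_model)$adj.r.squared
```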


### Exercise 8: Even more practice!

**Research question:** How different is holiday ridership from non-holidays, after accounting for confounding factors?

a. We believe that weather category (`weather_cat`), temperature (`temp_actual`), and wind speed (`windspeed`) confound the relationship of interest.

- Draw a causal graph that shows the 5 variables of interest. Based on your graph do you believe that the 3 potential confounders are indeed confounders (and not mediators or colliders)?
- Construct visualizations that allow you to explore how each potential confounder relates to `riders_total` and to `holiday`.

b. Based on your Part a explorations, fit an appropriate regression model to answer our research question. Interpret only the coefficient of interest.

**A note about scientific notation in R:** Sometimes you may see numbers with the letter `e` in the middle. This is R's way of expressing scientific notation. Whenever you see `e`, replace that with `10 to the power of ...`. So:

- 1.234e+02 is 1.234 x 10^2 = 123.4
- 1.234e-02 is 1.234 x 10^(-2) = 0.01234
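
You can check conversions like these directly in R. The short sketch below uses only base R: typing a number in `e` notation prints it as an ordinary number, and `format()` can switch between the two styles.

```{r}
# R reads e-notation as an ordinary number
1.234e+02
1.234e-02

# format() can force either style when printing
format(1.234e-02, scientific = FALSE)
format(123.4, scientific = TRUE)
```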

c. 

- Use `confint()` to construct a 95% confidence interval for the coefficient of interest.
- Interpret this confidence interval in context.
- Is zero in the interval? Do we have evidence for a real holiday effect on ridership?


### Exercise 9: CIs with logistic regression

The Western Collaborative Group Study (WCGS) was designed in order to investigate a possible link between Type A behavior and coronary heart disease (CHD), and to develop a framework to select patients for intervention in order to decrease risk of CHD. The study contained 3154 cis men between the ages of 39 and 59 in California who had no history of CHD. They were enrolled in the study in 1960 and 1961, underwent a medical examination and covered their medical history, and they were re-examined annually for interim cardiovascular history.

A full codebook is available [here](https://github.com/Mac-STAT/data/blob/main/wcgs_codebook.md). We will focus on the following variables:

- `chd`: Presence (1) or absence (0) of CHD over followup (outcome)
- `tabp`: Presence (1) or absence (0) of Type A behavior (main variable of interest)
- `age`: Age at time of enrollment in the study (years)
- `sbp`: Systolic blood pressure
- `dbp`: Diastolic blood pressure
- `chol`: Cholesterol (mg/dL)
- `ncigs`: Number of cigarettes smoked per day
- `arcus`: Presence (1) or absence (0) of arcus senilis (a colored ring around the cornea made up of lipids like cholesterol and believed to be a risk factor for CHD)
- `bmi`: BMI = weight * 703 / height^2

**Research question:** Is there a causal effect of Type A/B personality on developing coronary heart disease?

```{r}
wcgs <- read_csv("https://mac-stat.github.io/data/wcgs.csv")
```

a. We believe that the following variables are confounders of the relationship between Type A/B personality (`tabp`) and coronary heart disease (`chd`): `age + sbp + dbp + chol + ncigs + arcus + bmi`.

Fit a regression model that would address our research question. (Should it be a linear or a logistic regression model?) Interpret only the coefficient of interest.

```{r}
typea_mod <- ___
```

b. 

- Construct a 95% confidence interval for the odds ratio of interest using the following code.
- Interpret the confidence interval in context.
- Is 1 contained in the interval? Why is 1 a relevant value to look for here?
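
A minimal sketch of one way to get a CI for an odds ratio is below. It is an illustration only: `mtcars` stands in for `wcgs`, and the names `engine_mod` and `log_odds_ci` are made up for this example. The idea is that `confint()` works on the log-odds scale for a logistic regression model, so we exponentiate its endpoints to move to the odds ratio scale.

```{r}
# Illustration only: mtcars stands in for wcgs; vs (0/1) is the outcome
engine_mod <- glm(vs ~ mpg, data = mtcars, family = "binomial")

# confint() gives a CI for each coefficient on the log-odds scale
log_odds_ci <- confint(engine_mod, level = 0.95)

# Exponentiate to get a CI for the odds ratio
exp(log_odds_ci)
```

Assuming your Part a model is a logistic regression stored as `typea_mod`, the analogous call would be `exp(confint(typea_mod, level = 0.95))`.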

c. 

(On your own time)

The data context in this exercise has a fraught history with the smoking industry. Read [this article](https://www.thecut.com/2016/08/the-tobacco-industry-helped-create-the-type-a-personality.html) for some context about how the Type A personality came to be defined and studied. (One big takeaway: The smoking industry had a large incentive to find something to blame health problems on other than smoking!)


### Reflection

How are you feeling about your ability to translate research questions into appropriate statistical investigations and to address those questions using output from those investigations? What has gotten easier? What remains challenging?


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions on the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!