---
title: "Simple logistic regression (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')

# Use a color blind friendly color palette throughout doc
library(tidyverse)
cb_palette <- c("#0072B2", "#D55E00", "black", "#E69F00", "#56B4E9", "#009E73", "#F0E442",  "#CC79A7")
scale_colour_discrete <- function(...) scale_colour_manual(values = cb_palette, ...)
scale_fill_discrete   <- function(...) scale_fill_manual(values = cb_palette, ...)
theme_set(theme_bw())
```


## Warm-up {-}

We've been focusing on "Normal" linear regression models for **quantitative** response variables Y.

But not all response variables are quantitative!
We'll now explore logistic regression models for **binary categorical** response variables Y.

This *dramatically* expands the types of relationships we're able to explore.
Though the details of our models will change, the spirit and considerations are the same:

- we want to learn about Y from predictors X
- we want to build, interpret, and evaluate our models
- we need to be mindful of multicollinearity, overfitting, etc in this process


\
\
\
\




### Example 1: Normal or logistic?

For each scenario below, indicate whether "Normal" or logistic regression would be the appropriate modeling tool.

a. We want to model a person's commute time (Y) by their distance to work (X).

b. We want to model a person's commute time (Y) by whether or not they take public transit (X).

c. We want to model whether or not a person gets a speeding ticket (Y) based on their driving speed (X).




### Example 2: probability vs odds vs log(odds)

a. There's a 0.9 probability that your team wins its next game. Calculate the corresponding *odds* and *log(odds)* of this event. NOTE: `log()` is the natural log function in RStudio.

```{r}

```


b. The log(odds) that your bus is late are -1.0986. Calculate the corresponding *odds* and *probability* of this event. NOTE: Use `exp()`.

```{r}

```


### Example 3: Conditional odds

The *Titanic* was a British passenger ship that famously sank in 1912 after hitting an iceberg in the North Atlantic Ocean.

Approximately 2200 passengers were on board, and it's estimated that 1500 did not survive the crash.

The `titanic` dataset has info on a sample of these passengers:

```{r}
library(tidyverse)
titanic <- read_csv("https://mac-stat.github.io/data/titanic.csv") %>% 
  mutate(Survived = as.factor(Survived)) %>% 
  filter(!is.na(PClass))
head(titanic)
```

Let's use it to explore the relationship of:

- Y = `Survived` (1 = survived, 0 = did not survive)
- X = `PClass` (ticket class = 1st, 2nd, 3rd)

To this end, check out some tables of counts:

```{r}
titanic %>% 
  count(PClass, Survived)

library(janitor)
titanic %>% 
  tabyl(PClass, Survived) %>%
  adorn_totals(c("row", "col"))
```

And a plot of Y alone, and 2 plots of Y with X.

```{r}
titanic %>% 
  ggplot(aes(x = Survived)) +
  geom_bar() +
  theme_bw() 

titanic %>% 
  ggplot(aes(x = PClass, fill = Survived)) +
  geom_bar() + 
  theme_bw()  +
  theme(legend.position = "bottom")

titanic %>% 
  ggplot(aes(x = PClass, fill = Survived)) +
  geom_bar(position = "fill") + 
  theme_bw() +
  theme(legend.position = "bottom")
```


a. Calculate the *overall* odds that a passenger survived. (Which plot provides insight here?)


b. Y and X are both categorical. To visualize their relationship, we can make bar plots of X, split up by category of Y. Use plots 2 & 3 to describe the relationship of survival with ticket class. (What's the difference between these plots, and which plot makes it easier to answer this question?!)


c. How does Y depend upon X? Calculate the *conditional* odds that a 1st class passenger survived, i.e. the odds of survival *conditioned* on the passenger being 1st class: Odds(survive | 1st class)


d. Calculate the *conditional* odds that a 3rd class passenger survived: Odds(survive | 3rd class)




### Example 4: Odds ratios


Models help us *compare* the behavior of Y for different possible X values.

In logistic regression, these comparisons are often measured by **odds ratios**, a ratio of odds (unsurprisingly) that compares the odds of one group to another.

a. Calculate & interpret the odds ratio of surviving the Titanic, comparing those in the 1st class to those in the 3rd class.



b. Calculate & interpret the odds ratio of surviving the Titanic, comparing those in the 3rd class to those in the 1st class.








\
\
\
\





## Exercises {-}


**The story**

The `climbers` data is sub-sample of the [Himalayan Database](https://www.himalayandatabase.com/) distributed through the [R for Data Science TidyTuesday project](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-09-22).

This dataset includes information on the results and conditions for various Himalayan climbing expeditions.

Each row corresponds to a single member of a climbing expedition team:

```{r eval = TRUE}
# Load packages and data
library(tidyverse)
climbers <- read_csv("https://mac-stat.github.io/data/climbers_sub.csv") %>% 
  select(peak_name, success, age, oxygen_used, height_metres)

# Check out the first 6 rows
head(climbers)
```

Our goal will be to model whether or not a climber has `success` (Y) by their `age` (in years) or `oxygen_used` (TRUE or FALSE).

Since `success` is a binary categorical variable, i.e. a climber is either successful or they're not, we'll utilize **logistic regression**.  





### Exercise 1: Getting to know Y

Our response variable Y, `success`, is binary (hence categorical).
Thus we can explore Y alone using a bar plot and table of counts.
Note what you learn here.

```{r}
climbers %>% 
  count(success)

climbers %>% 
  ggplot(aes(x = success)) + 
  geom_bar()
```





### Exercise 2: Visualizing success vs age

Let's now explore the *relationship* of `success` with `age`.
Since `success` (Y) is categorical and `age` (X) is quantitative, it can be tempting to draw something like a side-by-side boxplot:

```{r}
climbers %>% 
  ggplot(aes(y = age, x = success)) + 
  geom_boxplot()
```

BUT this only gives us insight into the dependence of `age` on `success`, not `success` on `age` (it reverses the roles of Y and X).

Specifically, it helps us answer "how does age compare among success groups" not "how does success compare among age groups".

Instead, we want to calculate the observed *success rate* (probability of success) at each age, or in each *age bracket*.

We can calculate these success rates by age, i.e. the proportion of people of each age that succeeded, using `group_by()` and `summarize()`:

```{r}
# Calculate success rate by age
climbers %>% 
  group_by(age) %>% 
  summarize(success_rate = mean(success)) %>% 
  head(3)

# Split climbers into larger, more stable age brackets
# (This is good when we don't have many observations at each x!)
# Calculate success rate by age bracket
climbers %>% 
  mutate(age_bracket = cut(age, breaks = 20)) %>% 
  group_by(age_bracket) %>% 
  summarize(success_rate = mean(success)) %>% 
  head(3)
```


Now, plot the success rates by age and age bracket below.
Summarize what you learn about the relationship (note that we weren't able to observe this information from the boxplots!).
    

```{r}
climbers %>% 
  group_by(age) %>% 
  summarize(success_rate = mean(success)) %>% 
  ggplot(aes(x = age, y = success_rate)) +
  geom_point()

climbers %>% 
  mutate(age_bracket = cut(age, breaks = 20)) %>% 
  group_by(age_bracket) %>% 
  summarize(success_rate = mean(success)) %>% 
  ggplot(aes(x = age_bracket, y = success_rate)) +
  geom_point() + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
    
    
    


### Exercise 3: Logistic regression in R

To *model* the relationship between `success` and `age`, we can construct the **logistic regression model** using the code below.

Point out the **2** new things about this code.

```{r eval = TRUE}
climbing_model_1 <- glm(success ~ age, data = climbers, family = "binomial")

coef(summary(climbing_model_1))
```



### Exercise 4: Writing out & plotting the model

The estimated model formula is stated below on the log(odds), odds, and probability scales.    

- log(odds of success) = 0.42569 - 0.02397age    

- odds of success = $e^{0.42569 - 0.02397\text{age}}$    

- probability of success = $e^{0.42569 - 0.02397\text{age}}$ / ($e^{0.42569 - 0.02397\text{age}}$ + 1)
    

a. Convince yourselves that you know where the above formulas come from!

b. Our *one* model, expressed on different scales, is plotted below. As with the plot of our *data* on success rates in Exercise 3, these indicate that success rate decreases with age. (This is true on every scale, of course.) Compare the *shapes* of the models on these different scales.

```{r}
# Incorporate predictions on the probability, odds, & log-odds scales
climbers_predictions <- climbers %>% 
  mutate(probability = climbing_model_1$fitted.values) %>% 
  mutate(odds = probability / (1-probability)) %>% 
  mutate(log_odds = log(odds))

# Plot the model on the log-odds scale
ggplot(climbers_predictions, aes(x = age, y = log_odds)) + 
  geom_smooth(se = FALSE) 

# Plot the model on the odds scale
ggplot(climbers_predictions, aes(x = age, y = odds)) + 
  geom_smooth(se = FALSE) 

# Plot the model on the probability scale
ggplot(climbers_predictions, aes(x = age, y = probability)) + 
  geom_smooth(se = FALSE) 

#...and zoom out to see broader (and impossible) age range
ggplot(climbers_predictions, aes(x = age, y = as.numeric(success))) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, fullrange = TRUE) + 
  labs(y = "probability") + 
  lims(x = c(-200,200))
```
    
    
    




### Exercise 5: Predictions

Let's use the model formulas from the above exercise to make predictions.
Consider a 20-year-old climber.    

a. Calculate the *log(odds)* that this climber will succeed. HINT: use the log(odds) model.    

b. Calculate the *odds* that this climber will succeed. HINT: Either transform the log(odds) prediction or use the odds model.

c. Calculate the *probability* that this climber will succeed. HINT: Either transform the odds prediction or use the probability model.

d. Check your calculations 2 ways:

- Are they consistent with the model visualizations above?
- Do they match the results of the `predict()` function?

```{r}
# Check the log-odds prediction
predict(climbing_model_1, newdata = data.frame(age = 20))

# Check the odds prediction
exp(predict(climbing_model_1, newdata = data.frame(age = 20)))

# Check the probability prediction
predict(climbing_model_1, newdata = data.frame(age = 20), type = "response")
```    
    



### PAUSE: Interpreting the coefficients

Up next, we'll interpret the coefficients of our logistic regression model:

log(odds of success) = 0.42569 - 0.02397age

Since these are on the log(odds) scale, which isn't easy to understand, we can convert them to the *odds* scale by exponentiating. You'll need to reference these below:

```{r}
coef(climbing_model_1)
exp(coef(climbing_model_1))
```



### Exercise 6: Interpreting the intercept

Let's interpret the intercept of our logistic regression model:

log(odds of success) = *0.42569* - 0.02397age

*Translate* the below interpretation on the log(odds) scale to the *odds* scale:

- The *log(odds of success)* for 0-year-old climbers (lol) is 0.42569.
- The *odds of success* for 0-year-old climbers (lol) are ___.




### Exercise 7: Interpreting the age coefficient

Next, let's interpret the age coefficient:

log(odds of success) = 0.42569 - *0.02397*age

On the log(odds) scale (a line), this is just a *slope*: A 1-year increase in age is associated with a decrease of 0.02397 in the log(odds of success), on average.
BUT when we convert the model to the odds scale, the model is NOT linear, thus the exponentiated age coefficient is not a slope.
So what *does* it mean?

a. Below are the *odds* of success among various ages:

```{r eval = TRUE}
exp(predict(climbing_model_1, newdata = data.frame(age = 23)))
exp(predict(climbing_model_1, newdata = data.frame(age = 22)))
exp(predict(climbing_model_1, newdata = data.frame(age = 21)))
exp(predict(climbing_model_1, newdata = data.frame(age = 20)))
```

Calculate and interpret the *odds ratio* of success, comparing **23**-year-olds to **22**-year-olds.
 (Do not use rounded numbers in this calculation or those below!)


b. Calculate and interpret the *odds ratio* of success, comparing **22**-year-olds to **21**-year-olds.


c. Calculate and interpret the *odds ratio* of success, comparing **21**-year-olds to **20**-year-olds.


d. Summarize the punchline here.

- On the *log(odds)* scale, the `age` coefficient is 0.02397:   
    On average, for every 1-year increase in age, *log*(odds of success) decrease by 0.02397.

- On the *odds* scale, the `age` coefficient is $e^{-0.02397} = 0.976313$:    
    On average, for every 1-year increase in age, *odds* of success are ???% as high.    
    
    That is, they decrease by ???%




### Exercise 8: Another climbing model

Next, let's explore the relationship of `success` (Y) with `oxygen_used` (X).
Since both variables are *categorical*, we can fill in a bar plot of X with the categories of Y:

```{r}
# Plot the relationship
climbers %>% 
  ggplot(aes(x = oxygen_used, fill = success)) +
  geom_bar(position = "fill")

# Model the relationship
climbing_model_2 <- glm(success ~ oxygen_used, data = climbers, family = "binomial")
coef(summary(climbing_model_2))
```

Thus the estimated model formula is:

log(odds of `success`) = -1.327993 + 2.903864 `oxygen_usedTRUE`

a. Describe what you learn about the relationship from the plot alone.

b. Below are the odds and probabilities of success for climbers that do and don't use oxygen. Calculate and interpret the *odds ratio* of success, comparing climbers that *do* use oxygen to those that *don't*. NOTE: Your calculation should be above 1!!

```{r}
# odds of success for a climber that doesn't use oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE)))

# probability of success for a climber that doesn't use oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE), type = "response")

# odds of success for a climber that uses oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE)))

# probability of success for a climber that uses oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE), type = "response")
```





### Exercise 9: Interpreting coefficients

Let's interpret the coefficients of our logistic regression model:

log(odds of `success`) = -1.327993 + 2.903864 `oxygen_usedTRUE`

On the log(odds) scale:

- The log(odds of success) for climbers that don't use oxygen is -1.327993.
- The log(odds of success) is 2.903864 higher for climbers that use oxygen than for those that don't.

Yuk.
Convert those coefficients to the odds scale:

```{r}
coef(climbing_model_2)
exp(coef(climbing_model_2))
```



a. How can we interpret the intercept on the odds scale, 0.265? Where have you observed this number before?!


b. How can we interpret the `oxygen_usedTRUE` coefficient on the odds scale, 18.24? Where have you observed this number before?!






### Exercise 10: Connecting tables to logistic regression

For the *simple* logistic regression model with *1 predictor*, which is *categorical*, the model coefficients can be calculated from the table of counts!

Here *rows* represent whether oxygen was used and *columns* represent whether the climber was successful:

```{r}
# Rows correspond to oxygen_used
# Columns correspond to success
climbers %>% 
  tabyl(oxygen_used, success) %>%
  adorn_totals(c("row", "col"))
```

a. Use this table to calculate the odds of success for somebody that *didn't* use oxygen. Where have you observed this number before?! HINT: Use the FALSE *row*.

b. Use this table to calculate the *odds ratio* of success, comparing climbers that *do* use oxygen to those that *don't*. Where have you observed this number before?! HINT: You'll first need to calculate the odds of success for people that use oxygen.



### Reflection

What binary outcomes might be relevant in your project? What predictor(s) could be relevant in a logistic regression model for that outcome?



## Extra practice {-}


### Exercise 11: Calculating probabilities and odds

In Exercise 8, we used the `predict()` function to calculate the odds and probabilities of success for climbers that do and don't use oxygen:

```{r}
# odds of success for a climber that doesn't use oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE)))

# probability of success for a climber that doesn't use oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE), type = "response")

# odds of success for a climber that uses oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE)))

# probability of success for a climber that uses oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE), type = "response")
```

Calculate these 4 values *by hand*, starting from the model formula:

log(odds of `success`) = -1.327993 + 2.903864 `oxygen_usedTRUE`




### Exercise 12: Interpreting logistic regression coefficients

In general (not just for climbers), let's summarize how to interpret logistic regression coefficients.
Review the details below and convince yourself that these general properties are consistent with your work above.
Let $Y$ be a binary categorical response variable (eg: Y = 1 if the event of interest happens and is 0 otherwise).
Then a logistic regression model of $Y$ by $Y$ is

$$\log(\text{odds}) = \beta_0 + \beta_1 X$$

- Coefficient interpretation for quantitative X   
    $$\begin{split}    
    \beta_0     & = \text{ LOG(ODDS) when } X=0 \\
    e^{\beta_0} & = \text{ ODDS when } X=0 \\
    \beta_1     & = \text{ change in LOG(ODDS) per 1 unit increase in } X \\
    e^{\beta_1} & = \text{ multiplicative change in ODDS per 1 unit increase in } X \text{ (ie. the odds ratio, comparing the odds for X + 1 vs X)} \\
    \end{split}$$
- Coefficient interpretation for categorical X    
    $$\begin{split}    
    \beta_0     & = \text{ LOG(ODDS) at the reference category } \\
    e^{\beta_0} & = \text{ ODDS at the reference category } \\
    \beta_1     & = \text{ unit change in LOG(ODDS) relative to the reference} \\
    e^{\beta_1} & = \text{ multiplicative change in ODDS relative to the reference } \text{ (ie. the odds ratio, comparing the odds for the given category vs the reference category)} \\
    \end{split}$$






### Exercise 13: R code

A lot of R code was used to produce the plots and other output above.
The code is not the point.
It's more important to understand the logistic regression concepts, and the *output* of the code -- you can always look up code.
With that in mind, review the code to get a sense for the structure!



### Exercise 14: Optional math challenge

Consider a simple logistic regression model:

$$\log(\text{odds}) = \beta_0 + \beta_1 X$$

Prove that $e^{\beta_1}$ indicates the MULTIPLICATIVE change in ODDS for a 1-unit increase in X, i.e. that it's equivalent to the odds ratio for "X+1" vs "X".     
 






## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!


