16  Simple logistic regression

Settling In

  • Sit with your draft group (see Slack)
    • Introduce yourself
    • Chat about data sets that you were interested in.
  • Help each other get ready to take notes!
    • Open your notebook.
    • Open the online manual to the “Course Schedule” and click on today’s activity. That brings you here!
    • Download “16-logistic-univariate.qmd” and open it in RStudio. Read the “Organizing your files” directions at the top of the file!!





Recap

ImportantStatistical superpowers

Thus far we’ve used linear regression models (simple and multiple) to explore the relationship of a quantitative response variable Y with some predictor X (quantitative or categorical):

E[Y | X_1, X_2, ..., X_p] = \beta_0 + \beta_1 X_1 + \beta_2 X_3 + ... + \beta_p X_p

Another statistical superpower?! We can extend these ideas to the common scenario in which Y is categorical. In STAT 155 we’ll specifically focus on using logistic regression models for binary categorical response variables Y (with 2 categories). This dramatically expands the types of relationships we’re able to explore.



NoteLearning goals

By the end of this lesson, you should be able to:

  • Distinguish between probabilities and odds, and convert one to the other
  • Explain the differences between linear regression and logistic regression for modeling binary outcomes
  • Explore simple logistic regression models
    • visualize relationships
    • build models in R
    • interpret coefficients
    • make predictions
    • converting from log odds to odds to probability scales





Logistic Regression

If our outcome is categorical with 2 outcomes (A and B), we create an indicator version of the outcome, Y.

Y = \begin{cases}1 \quad \text{if outcome A happens}\\0 \quad \text{if outcome A doesn't happens}\end{cases}

The “average” value of the indicator Y is the probability of the event, Y=1,

E[Y] = P(Y = 1) = P(\text{the event}) = p

We model the transformed version of E[Y | X], the expected value of Y “given” explanatory variable X using a linear model,

\log \left(\frac{E[Y | X]}{1-E[Y | X ]}\right) = \beta_0 + \beta_1X

which can be written as,

\log \left(Odds[Y = 1 | X]\right) = \beta_0 + \beta_1X



Model Formula Rearrangements

If we exponentiate both sides, e^{LHS} = e^{RHS}, we get

Odds[Y = 1 | X] = \left(\frac{E[Y | X]}{1-E[Y | X ]}\right) = e^{\beta_0 + \beta_1X}



If we use the fact that Odds = \frac{p}{1-p} can be written as p = \frac{Odds}{1 + Odds}, we get

P(Y = 1 | X) = E[Y | X] = \frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}}





Interpretations / Comparisons

To do interpretations, we compare two groups that have 1 unit different in the X (eventually, when we have more X’s, we’ll keep all other variables in the model constant).

Here is the difference in the log(Odds) between those two groups,

\log \left(Odds[Y = 1 | X = x+1 ]\right) - \log \left(Odds[Y = 1 | X = x ]\right) = (\beta_0 + \beta_1(x+1)) - (\beta_0 + \beta_1x)



We can simplify the right hand side (RHS),

\log \left(Odds[Y = 1 | X = x+1 ]\right) - \log \left(Odds[Y = 1 | X = x ]\right) = \beta_1



Now, simplify the left hand side (LHS) using logarithm rules,

\log \left(\frac{Odds[Y = 1 | X = x+1 ] }{ Odds[Y = 1 | X = x ]}\right) = \beta_1



Exponentiate both sides,

\left(\frac{Odds[Y = 1 | X = x+1 ] }{ Odds[Y = 1 | X = x ]} \right)= e^{\beta_1}

We find that e^{\beta_1} gives us the Odds Ratio.







Odds ratios

Let “odds for group A” be the conditional odds of an event for some group A.

Let “odds for group B” be the conditional odds of an event for some group B.

Then the odds ratio (OR) of the event, comparing group A to group B is:

OR = (odds for group A) / (odds for group B)




Then OR >= 0 where:

  • If OR > 1, then the odds of the event are higher for group A than group B. Examples:

    • OR = 2: The odds of the event are 2 times higher (100% higher) for group A than group B.
    • OR = 1.5: The odds of the event are one and a half times higher (50% higher) for group A than group B.
  • If OR = 1, then the odds of the event are the same for groups A and B.

  • If OR < 1, then the odds of the event are lower for group A than group B. Examples:

    • OR = 0.5: The odds of the event are half as high (50% as high) for group A than group B.
    • OR = 0.2:
      • The odds of the event for group A are 0.2 times (20% as high as) the odds for group B.
      • The odds of the event are 80% lower for group A than group B.





English Language

Increase

  • “increase by 4 percentage points” implies + 4 (if the outcome unit was percent)
  • “increase by 4%” implies *1.04
  • “increase by a factor of 1.04” implies *1.04
  • “1.04 time higher” implies *1.04

Decrease

  • “decrease by 96 percentage points” implies - 96 (if the outcome unit was percent)
  • “decrease by 5%” implies *0.95
  • “decrease by a factor of 0.95” implies *0.95
  • “0.95 times that of” implies *0.95





Warm-up

We’ve been focusing on “Normal” linear regression models for quantitative response variables Y.

But not all response variables are quantitative! We’ll now explore logistic regression models for binary categorical response variables Y.

This dramatically expands the types of relationships we’re able to explore. Though the details of our models will change, the spirit and considerations are the same:

  • we want to learn about Y from predictors X
  • we want to build, interpret, and evaluate our models
  • we need to be mindful of multicollinearity, overfitting, etc in this process





Example 1: Normal or logistic?

For each scenario below, indicate whether “Normal” or logistic regression would be the appropriate modeling tool.

  1. We want to model a person’s commute time (Y) by their distance to work (X).

  2. We want to model a person’s commute time (Y) by whether or not they take public transit (X).

  3. We want to model whether or not a person gets a speeding ticket (Y) based on their driving speed (X).

Example 2: probability vs odds vs log(odds)

  1. There’s a 0.9 probability that your team wins its next game. Calculate the corresponding odds and log(odds) of this event. NOTE: log() is the natural log function in RStudio.
  1. The log(odds) that your bus is late are -1.0986. Calculate the corresponding odds and probability of this event. NOTE: Use exp().

Example 3: Conditional odds

The Titanic was a British passenger ship that famously sank in 1912 after hitting an iceberg in the North Atlantic Ocean.

Approximately 2200 passengers were on board, and it’s estimated that 1500 did not survive the crash.

The titanic dataset has info on a sample of these passengers:

library(tidyverse)
titanic <- read_csv("https://mac-stat.github.io/data/titanic.csv") %>% 
  mutate(Survived = as.factor(Survived)) %>% 
  filter(!is.na(PClass))
head(titanic)

Let’s use it to explore the relationship of:

  • Y = Survived (1 = survived, 0 = did not survive)
  • X = PClass (ticket class = 1st, 2nd, 3rd)

To this end, check out some tables of counts:

titanic %>% 
  count(PClass, Survived)

library(janitor)
titanic %>% 
  tabyl(PClass, Survived) %>%
  adorn_totals(c("row", "col"))

And a plot of Y alone, and 2 plots of Y with X.

titanic %>% 
  ggplot(aes(x = Survived)) +
  geom_bar() +
  theme_bw() 

titanic %>% 
  ggplot(aes(x = PClass, fill = Survived)) +
  geom_bar() + 
  theme_bw()  +
  theme(legend.position = "bottom")

titanic %>% 
  ggplot(aes(x = PClass, fill = Survived)) +
  geom_bar(position = "fill") + 
  theme_bw() +
  theme(legend.position = "bottom")
  1. Calculate the overall odds that a passenger survived. (Which plot provides insight here?)

  2. Y and X are both categorical. To visualize their relationship, we can make bar plots of X, split up by category of Y. Use plots 2 & 3 to describe the relationship of survival with ticket class. (What’s the difference between these plots, and which plot makes it easier to answer this question?!)

  3. How does Y depend upon X? Calculate the conditional odds that a 1st class passenger survived, i.e. the odds of survival conditioned on the passenger being 1st class: Odds(survive | 1st class)

  4. Calculate the conditional odds that a 3rd class passenger survived: Odds(survive | 3rd class)

Example 4: Odds ratios

Models help us compare the behavior of Y for different possible X values.

In logistic regression, these comparisons are often measured by odds ratios, a ratio of odds (unsurprisingly) that compares the odds of one group to another.

  1. Calculate & interpret the odds ratio of surviving the Titanic, comparing those in the 1st class to those in the 3rd class.

  2. Calculate & interpret the odds ratio of surviving the Titanic, comparing those in the 3rd class to those in the 1st class.





Exercises

The story

The climbers data is sub-sample of the Himalayan Database distributed through the R for Data Science TidyTuesday project.

This dataset includes information on the results and conditions for various Himalayan climbing expeditions.

Each row corresponds to a single member of a climbing expedition team:

# Load packages and data
library(tidyverse)
climbers <- read_csv("https://mac-stat.github.io/data/climbers_sub.csv") %>% 
  select(peak_name, success, age, oxygen_used, height_metres)

# Check out the first 6 rows
head(climbers)
# A tibble: 6 × 5
  peak_name  success   age oxygen_used height_metres
  <chr>      <lgl>   <dbl> <lgl>               <dbl>
1 Ama Dablam TRUE       28 FALSE                6814
2 Ama Dablam TRUE       27 FALSE                6814
3 Ama Dablam TRUE       35 FALSE                6814
4 Ama Dablam TRUE       37 FALSE                6814
5 Ama Dablam TRUE       43 FALSE                6814
6 Ama Dablam FALSE      38 FALSE                6814

Our goal will be to model whether or not a climber has success (Y) by their age (in years) or oxygen_used (TRUE or FALSE).

Since success is a binary categorical variable, i.e. a climber is either successful or they’re not, we’ll utilize logistic regression.

Exercise 1: Getting to know Y

Our response variable Y, success, is binary (hence categorical). Thus we can explore Y alone using a bar plot and table of counts. Note what you learn here.

climbers %>% 
  count(success)

climbers %>% 
  ggplot(aes(x = success)) + 
  geom_bar()

Exercise 2: Visualizing success vs age

Let’s now explore the relationship of success with age. Since success (Y) is categorical and age (X) is quantitative, it can be tempting to draw something like a side-by-side boxplot:

climbers %>% 
  ggplot(aes(y = age, x = success)) + 
  geom_boxplot()

BUT this only gives us insight into the dependence of age on success, not success on age (it reverses the roles of Y and X).

Specifically, it helps us answer “how does age compare among success groups” not “how does success compare among age groups”.

Instead, we want to calculate the observed success rate (probability of success) at each age, or in each age bracket.

We can calculate these success rates by age, i.e. the proportion of people of each age that succeeded, using group_by() and summarize():

# Calculate success rate by age
climbers %>% 
  group_by(age) %>% 
  summarize(success_rate = mean(success)) %>% 
  head(3)

# Split climbers into larger, more stable age brackets
# (This is good when we don't have many observations at each x!)
# Calculate success rate by age bracket
climbers %>% 
  mutate(age_bracket = cut(age, breaks = 20)) %>% 
  group_by(age_bracket) %>% 
  summarize(success_rate = mean(success)) %>% 
  head(3)

Now, plot the success rates by age and age bracket below. Summarize what you learn about the relationship (note that we weren’t able to observe this information from the boxplots!).

climbers %>% 
  group_by(age) %>% 
  summarize(success_rate = mean(success)) %>% 
  ggplot(aes(x = age, y = success_rate)) +
  geom_point()

climbers %>% 
  mutate(age_bracket = cut(age, breaks = 20)) %>% 
  group_by(age_bracket) %>% 
  summarize(success_rate = mean(success)) %>% 
  ggplot(aes(x = age_bracket, y = success_rate)) +
  geom_point() + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Exercise 3: Logistic regression in R

To model the relationship between success and age, we can construct the logistic regression model using the code below.

Point out the 2 new things about this code.

climbing_model_1 <- glm(success ~ age, data = climbers, family = "binomial")

coef(summary(climbing_model_1))
               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept)  0.42568563 0.169177382  2.516209 1.186249e-02
age         -0.02397199 0.004489504 -5.339564 9.317048e-08

Exercise 4: Writing out & plotting the model

The estimated model formula is stated below on the log(odds), odds, and probability scales.

  • log(odds of success) = 0.42569 - 0.02397age

  • odds of success = e^{0.42569 - 0.02397\text{age}}

  • probability of success = e^{0.42569 - 0.02397\text{age}} / (e^{0.42569 - 0.02397\text{age}} + 1)

  1. Convince yourselves that you know where the above formulas come from!

  2. Our one model, expressed on different scales, is plotted below. As with the plot of our data on success rates in Exercise 3, these indicate that success rate decreases with age. (This is true on every scale, of course.) Compare the shapes of the models on these different scales.

# Incorporate predictions on the probability, odds, & log-odds scales
climbers_predictions <- climbers %>% 
  mutate(probability = climbing_model_1$fitted.values) %>% 
  mutate(odds = probability / (1-probability)) %>% 
  mutate(log_odds = log(odds))

# Plot the model on the log-odds scale
ggplot(climbers_predictions, aes(x = age, y = log_odds)) + 
  geom_smooth(se = FALSE) 

# Plot the model on the odds scale
ggplot(climbers_predictions, aes(x = age, y = odds)) + 
  geom_smooth(se = FALSE) 

# Plot the model on the probability scale
ggplot(climbers_predictions, aes(x = age, y = probability)) + 
  geom_smooth(se = FALSE) 

#...and zoom out to see broader (and impossible) age range
ggplot(climbers_predictions, aes(x = age, y = as.numeric(success))) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, fullrange = TRUE) + 
  labs(y = "probability") + 
  lims(x = c(-200,200))

Exercise 5: Predictions

Let’s use the model formulas from the above exercise to make predictions. Consider a 20-year-old climber.

  1. Calculate the log(odds) that this climber will succeed. HINT: use the log(odds) model.

  2. Calculate the odds that this climber will succeed. HINT: Either transform the log(odds) prediction or use the odds model.

  3. Calculate the probability that this climber will succeed. HINT: Either transform the odds prediction or use the probability model.

  4. Check your calculations 2 ways:

  • Are they consistent with the model visualizations above?
  • Do they match the results of the predict() function?
# Check the log-odds prediction
predict(climbing_model_1, newdata = data.frame(age = 20))

# Check the odds prediction
exp(predict(climbing_model_1, newdata = data.frame(age = 20)))

# Check the probability prediction
predict(climbing_model_1, newdata = data.frame(age = 20), type = "response")

PAUSE: Interpreting the coefficients

Up next, we’ll interpret the coefficients of our logistic regression model:

log(odds of success) = 0.42569 - 0.02397age

Since these are on the log(odds) scale, which isn’t easy to understand, we can convert them to the odds scale by exponentiating. You’ll need to reference these below:

coef(climbing_model_1)
exp(coef(climbing_model_1))

Exercise 6: Interpreting the intercept

Let’s interpret the intercept of our logistic regression model:

log(odds of success) = 0.42569 - 0.02397age

Translate the below interpretation on the log(odds) scale to the odds scale:

  • The log(odds of success) for 0-year-old climbers (lol) is 0.42569.
  • The odds of success for 0-year-old climbers (lol) are ___.

Exercise 7: Interpreting the age coefficient

Next, let’s interpret the age coefficient:

log(odds of success) = 0.42569 - 0.02397age

On the log(odds) scale (a line), this is just a slope: A 1-year increase in age is associated with a decrease of 0.02397 in the log(odds of success), on average. BUT when we convert the model to the odds scale, the model is NOT linear, thus the exponentiated age coefficient is not a slope. So what does it mean?

  1. Below are the odds of success among various ages:
exp(predict(climbing_model_1, newdata = data.frame(age = 23)))
        1 
0.8819057 
exp(predict(climbing_model_1, newdata = data.frame(age = 22)))
        1 
0.9033021 
exp(predict(climbing_model_1, newdata = data.frame(age = 21)))
        1 
0.9252177 
exp(predict(climbing_model_1, newdata = data.frame(age = 20)))
       1 
0.947665 

Calculate and interpret the odds ratio of success, comparing 23-year-olds to 22-year-olds. (Do not use rounded numbers in this calculation or those below!)

  1. Calculate and interpret the odds ratio of success, comparing 22-year-olds to 21-year-olds.

  2. Calculate and interpret the odds ratio of success, comparing 21-year-olds to 20-year-olds.

  3. Summarize the punchline here.

  • On the log(odds) scale, the age coefficient is 0.02397:
    On average, for every 1-year increase in age, log(odds of success) decrease by 0.02397.

  • On the odds scale, the age coefficient is e^{-0.02397} = 0.976313:
    On average, for every 1-year increase in age, odds of success are ???% as high.

    That is, they decrease by ???%

Exercise 8: Another climbing model

Next, let’s explore the relationship of success (Y) with oxygen_used (X). Since both variables are categorical, we can fill in a bar plot of X with the categories of Y:

# Plot the relationship
climbers %>% 
  ggplot(aes(x = oxygen_used, fill = success)) +
  geom_bar(position = "fill")

# Model the relationship
climbing_model_2 <- glm(success ~ oxygen_used, data = climbers, family = "binomial")
coef(summary(climbing_model_2))

Thus the estimated model formula is:

log(odds of success) = -1.327993 + 2.903864 oxygen_usedTRUE

  1. Describe what you learn about the relationship from the plot alone.

  2. Below are the odds and probabilities of success for climbers that do and don’t use oxygen. Calculate and interpret the odds ratio of success, comparing climbers that do use oxygen to those that don’t. NOTE: Your calculation should be above 1!!

# odds of success for a climber that doesn't use oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE)))

# probability of success for a climber that doesn't use oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE), type = "response")

# odds of success for a climber that uses oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE)))

# probability of success for a climber that uses oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE), type = "response")

Exercise 9: Interpreting coefficients

Let’s interpret the coefficients of our logistic regression model:

log(odds of success) = -1.327993 + 2.903864 oxygen_usedTRUE

On the log(odds) scale:

  • The log(odds of success) for climbers that don’t use oxygen is -1.327993.
  • The log(odds of success) is 2.903864 higher for climbers that use oxygen than for those that don’t.

Yuk. Convert those coefficients to the odds scale:

coef(climbing_model_2)
exp(coef(climbing_model_2))
  1. How can we interpret the intercept on the odds scale, 0.265? Where have you observed this number before?!

  2. How can we interpret the oxygen_usedTRUE coefficient on the odds scale, 18.24? Where have you observed this number before?!

Exercise 10: Connecting tables to logistic regression

For the simple logistic regression model with 1 predictor, which is categorical, the model coefficients can be calculated from the table of counts!

Here rows represent whether oxygen was used and columns represent whether the climber was successful:

# Rows correspond to oxygen_used
# Columns correspond to success
climbers %>% 
  tabyl(oxygen_used, success) %>%
  adorn_totals(c("row", "col"))
  1. Use this table to calculate the odds of success for somebody that didn’t use oxygen. Where have you observed this number before?! HINT: Use the FALSE row.

  2. Use this table to calculate the odds ratio of success, comparing climbers that do use oxygen to those that don’t. Where have you observed this number before?! HINT: You’ll first need to calculate the odds of success for people that use oxygen.

Reflection

What binary outcomes might be relevant in your project? What predictor(s) could be relevant in a logistic regression model for that outcome?

Extra practice

Exercise 11: Calculating probabilities and odds

In Exercise 8, we used the predict() function to calculate the odds and probabilities of success for climbers that do and don’t use oxygen:

# odds of success for a climber that doesn't use oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE)))

# probability of success for a climber that doesn't use oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE), type = "response")

# odds of success for a climber that uses oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE)))

# probability of success for a climber that uses oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE), type = "response")

Calculate these 4 values by hand, starting from the model formula:

log(odds of success) = -1.327993 + 2.903864 oxygen_usedTRUE

Exercise 12: Interpreting logistic regression coefficients

In general (not just for climbers), let’s summarize how to interpret logistic regression coefficients. Review the details below and convince yourself that these general properties are consistent with your work above. Let Y be a binary categorical response variable (eg: Y = 1 if the event of interest happens and is 0 otherwise). Then a logistic regression model of Y by Y is

\log(\text{odds}) = \beta_0 + \beta_1 X

  • Coefficient interpretation for quantitative X
    \begin{split} \beta_0 & = \text{ LOG(ODDS) when } X=0 \\ e^{\beta_0} & = \text{ ODDS when } X=0 \\ \beta_1 & = \text{ change in LOG(ODDS) per 1 unit increase in } X \\ e^{\beta_1} & = \text{ multiplicative change in ODDS per 1 unit increase in } X \text{ (ie. the odds ratio, comparing the odds for X + 1 vs X)} \\ \end{split}
  • Coefficient interpretation for categorical X
    \begin{split} \beta_0 & = \text{ LOG(ODDS) at the reference category } \\ e^{\beta_0} & = \text{ ODDS at the reference category } \\ \beta_1 & = \text{ unit change in LOG(ODDS) relative to the reference} \\ e^{\beta_1} & = \text{ multiplicative change in ODDS relative to the reference } \text{ (ie. the odds ratio, comparing the odds for the given category vs the reference category)} \\ \end{split}

Exercise 13: R code

A lot of R code was used to produce the plots and other output above. The code is not the point. It’s more important to understand the logistic regression concepts, and the output of the code – you can always look up code. With that in mind, review the code to get a sense for the structure!

Exercise 14: Optional math challenge

Consider a simple logistic regression model:

\log(\text{odds}) = \beta_0 + \beta_1 X

Prove that e^{\beta_1} indicates the MULTIPLICATIVE change in ODDS for a 1-unit increase in X, i.e. that it’s equivalent to the odds ratio for “X+1” vs “X”.





Wrap Up

  • PS 5 is posted and due after spring break
    • You can complete all exercises now.
    • Just note that if you have more than 1 X variable, you’ll need to “keep all other variables fixed” in your interpretations.





Solutions

Warm Up

Example 1: Normal or logistic?

Solution
  1. Normal. Y is quantitative.
  2. Normal. Y is quantitative.
  3. Logistic. Y is binary.

Example 2: probability vs odds vs log(odds)

Solution
# a.
# odds
0.9 / (1 - 0.9)
# log(odds)
log(9)

# b. 
# odds
exp(-1.0986)
# prob
0.333 / (0.333 + 1)

Example 3: Conditional odds

Solution
library(tidyverse)
library(ggmosaic)

titanic <- read_csv("https://mac-stat.github.io/data/titanic.csv")
head(titanic)
# A tibble: 6 × 5
  Name                                          PClass   Age Sex    Survived
  <chr>                                         <chr>  <dbl> <chr>     <dbl>
1 Allen, Miss Elisabeth Walton                  1st    29    female        1
2 Allison, Miss Helen Loraine                   1st     2    female        0
3 Allison, Mr Hudson Joshua Creighton           1st    30    male          0
4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st    25    female        0
5 Allison, Master Hudson Trevor                 1st     0.92 male          1
6 Anderson, Mr Harry                            1st    47    male          1
titanic %>% 
  count(PClass, Survived)
# A tibble: 7 × 3
  PClass Survived     n
  <chr>     <dbl> <int>
1 1st           0   129
2 1st           1   193
3 2nd           0   160
4 2nd           1   119
5 3rd           0   573
6 3rd           1   138
7 <NA>          0     1
library(janitor)
titanic %>% 
  tabyl(PClass, Survived, show_na = FALSE) %>%
  adorn_totals(c("row", "col"))
 PClass   0   1 Total
    1st 129 193   322
    2nd 160 119   279
    3rd 573 138   711
  Total 862 450  1312
  1. .

    # probability of survival
    450 / 1312
    # odds of survival
    (450 / 1312) / (1 - (450 / 1312))
  2. Survival was most likely for 1st class passengers and least likely for 3rd class passengers. NOTE: The first plot makes more clear that there were more 3rd class passengers, but the second plot makes it easier to compare survival rates among each class.

titanic %>% 
  ggplot(aes(x = PClass, fill = factor(Survived))) +
  geom_bar()

titanic %>% 
  ggplot(aes(x = PClass, fill = factor(Survived))) +
  geom_bar(position = "fill")

  1. .

    # probability of survival for 1st class
    193 / 322
    [1] 0.5993789
    # odds of survival for 1st class
    (193 / 322) / (1 - (193 / 322))
    [1] 1.496124
  2. .

    # probability of survival for 3rd class
    138 / 711
    [1] 0.1940928
    # odds of survival for 3rd class
    (138 / 711) / (1 - (138 / 711))
    [1] 0.2408377

Example 4: Odds ratios

Solution
  1. The odds of survival were more than 6 times higher for 1st class vs 3rd class:

    # odds of survival for 1st class / odds of survival for 3rd class
    1.50 / 0.24
  2. The odds of survival for 3rd class passengers were only 16% as high as those for 1st class passengers (i.e. they were 84% lower):

    # odds of survival for 3rd class / odds of survival for 1st class
    0.24 / 1.50

Exercises

Exercise 1: Getting to know Y

Solution
# Load packages and data
library(tidyverse)
climbers <- read_csv("https://mac-stat.github.io/data/climbers_sub.csv") %>% 
  select(peak_name, success, age, oxygen_used, height_metres)

# Check out the first 6 rows
head(climbers)
# A tibble: 6 × 5
  peak_name  success   age oxygen_used height_metres
  <chr>      <lgl>   <dbl> <lgl>               <dbl>
1 Ama Dablam TRUE       28 FALSE                6814
2 Ama Dablam TRUE       27 FALSE                6814
3 Ama Dablam TRUE       35 FALSE                6814
4 Ama Dablam TRUE       37 FALSE                6814
5 Ama Dablam TRUE       43 FALSE                6814
6 Ama Dablam FALSE      38 FALSE                6814

Roughly 40% of climbers succeed.

climbers %>% 
  count(success)
# A tibble: 2 × 2
  success     n
  <lgl>   <int>
1 FALSE    1269
2 TRUE      807
climbers %>% 
  ggplot(aes(x = success)) + 
  geom_bar()

Exercise 2: Visualizing success vs age

Solution

Success rates start at nearly 60% for teenagers, then decrease with age – success rates approach 0 for climbers in their 70s.

# Calculate success rate by age
climbers %>% 
  group_by(age) %>% 
  summarize(success_rate = mean(success))
# A tibble: 59 × 2
     age success_rate
   <dbl>        <dbl>
 1    17        0    
 2    18        0.429
 3    19        0.667
 4    20        0.857
 5    21        0.577
 6    22        0.486
 7    23        0.526
 8    24        0.333
 9    25        0.463
10    26        0.413
# ℹ 49 more rows
# Split climbers into larger, more stable age brackets
# (This is good when we don't have many observations at each x!)
# Calculate success rate by age bracket
climbers %>% 
  mutate(age_bracket = cut(age, breaks = 20)) %>% 
  group_by(age_bracket)
# A tibble: 2,076 × 6
# Groups:   age_bracket [20]
   peak_name  success   age oxygen_used height_metres age_bracket
   <chr>      <lgl>   <dbl> <lgl>               <dbl> <fct>      
 1 Ama Dablam TRUE       28 FALSE                6814 (25.9,28.8]
 2 Ama Dablam TRUE       27 FALSE                6814 (25.9,28.8]
 3 Ama Dablam TRUE       35 FALSE                6814 (34.7,37.6]
 4 Ama Dablam TRUE       37 FALSE                6814 (34.7,37.6]
 5 Ama Dablam TRUE       43 FALSE                6814 (40.6,43.6]
 6 Ama Dablam FALSE      38 FALSE                6814 (37.6,40.6]
 7 Ama Dablam TRUE       23 FALSE                6814 (22.9,25.9]
 8 Ama Dablam FALSE      36 FALSE                6814 (34.7,37.6]
 9 Ama Dablam TRUE       48 FALSE                6814 (46.5,49.5]
10 Ama Dablam TRUE       25 FALSE                6814 (22.9,25.9]
# ℹ 2,066 more rows
climbers %>% 
  group_by(age) %>% 
  summarize(success_rate = mean(success)) %>% 
  ggplot(aes(x = age, y = success_rate)) +
  geom_point()

climbers %>% 
  mutate(age_bracket = cut(age, breaks = 20)) %>% 
  group_by(age_bracket) %>% 
  summarize(success_rate = mean(success)) %>% 
  ggplot(aes(x = age_bracket, y = success_rate)) +
  geom_point() + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Exercise 3: Logistic regression in R

Solution

We’re using glm instead of lm, and add family = "binomial" to specify that this is a logistic model:

climbing_model_1 <- glm(success ~ age, data = climbers, family = "binomial")
coef(summary(climbing_model_1))
               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept)  0.42568563 0.169177382  2.516209 1.186249e-02
age         -0.02397199 0.004489504 -5.339564 9.317048e-08

Exercise 4: Writing out & plotting the model

Exercise 5: Predictions

Solution
#a
log_odds <- 0.42568563 - 0.02397199 * 20
log_odds
[1] -0.05375417
#b
odds <- exp(log_odds)
odds
[1] 0.947665
#c
prob <- odds / (odds + 1)
prob
[1] 0.4865647
#d
predict(climbing_model_1, newdata = data.frame(age = 20))
          1 
-0.05375421 
exp(predict(climbing_model_1, newdata = data.frame(age = 20)))
       1 
0.947665 
predict(climbing_model_1, newdata = data.frame(age = 20), type = "response")
        1 
0.4865647 

Exercise 6: Interpreting the intercept

Solution

Among 0 year old climbers (lol), the odds of success is 1.5, i.e. 1.5 times more likely than not.

exp(0.42569)

Exercise 7: Interpreting the age coefficient

Solution
exp(predict(climbing_model_1, newdata = data.frame(age = 23)))
        1 
0.8819057 
exp(predict(climbing_model_1, newdata = data.frame(age = 22)))
        1 
0.9033021 
exp(predict(climbing_model_1, newdata = data.frame(age = 21)))
        1 
0.9252177 
exp(predict(climbing_model_1, newdata = data.frame(age = 20)))
       1 
0.947665 
Part a
Solution

The odds of success for a 23-year-old are 97.6% as high as those for a 22-year-old (i.e. the odds are 2.4% lower).

0.8819057 / 0.9033021
Part b
Solution

The odds of success for a 22-year-old are 97.6% as high as those for a 21-year-old (i.e. the odds are 2.4% lower).

0.9033021 / 0.9252177
Part c
Solution

The odds of success for a 21-year-old are 97.6% as high as those for a 20-year-old (i.e. the odds are 2.4% lower).

0.9252177 / 0.947665
Part d
Solution
  • On the log(odds) scale, the age coefficient is 0.02397:
    On average, for every 1-year increase in age, log(odds of success) decrease by 0.02397.

  • On the odds scale, the age coefficient is e^{-0.02397} = 0.976313:
    On average, for every 1-year increase in age, odds of success are 97.6% as high. That is, they decrease by 2.4%.

Exercise 8: Another climbing model

Solution
# Plot the relationship
climbers %>% 
  ggplot(aes(x = oxygen_used, fill = success)) +
  geom_bar(position = "fill")

# Model the relationship
climbing_model_2 <- glm(success ~ oxygen_used, data = climbers, family = "binomial")
coef(summary(climbing_model_2))
                 Estimate Std. Error   z value      Pr(>|z|)
(Intercept)     -1.327993  0.0639834 -20.75528  1.098514e-95
oxygen_usedTRUE  2.903864  0.1257404  23.09412 5.304291e-118
Part a
Solution

Success is more likely if you use oxygen. The probability of success appears to be between 85% and 90% for people that use oxygen, and under 25% for people that don’t.

Part b
Solution
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE)))
        1 
0.2650086 
predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE), type = "response")
        1 
0.2094915 
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE)))
       1 
4.834951 
predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE), type = "response")
       1 
0.828619 
4.834951 / 0.2650086
[1] 18.24451

Exercise 9: Interpreting coefficients

Part a
Solution

This is the odds of success for somebody that doesn’t use oxygen (the reference level).

Part b
Solution

This is the odds ratio of success, comparing climbers that use oxygen to those that don’t. The odds of success are 18 times higher if using oxygen.

Exercise 10: Connecting tables to logistic regression

Solution
# Rows correspond to oxygen_used
# Columns correspond to success
climbers %>% 
  tabyl(oxygen_used, success) %>%
  adorn_totals(c("row", "col"))
 oxygen_used FALSE TRUE Total
       FALSE  1166  309  1475
        TRUE   103  498   601
       Total  1269  807  2076
Part a
Solution

This is the intercept coefficient on the odds scale!

# probability of success (no oxygen)
309 / 1475
[1] 0.2094915
# odds of success (no oxygen)
(309 / 1475) / (1 - (309 / 1475))
[1] 0.2650086
Part b
Solution

This is the oxygen_usedTRUE coefficient on the odds scale!

# probability of success (oxygen)
498 / 601
[1] 0.828619
# odds of success (oxygen)
(498 / 601) / (1 - (498 / 601))
[1] 4.834951
# odds ratio 
4.834951 / 0.2650086
[1] 18.24451

Exercise 11: Calculating probabilities and odds

Solution

For people that don’t use oxygen:

# log(odds)
-1.327993

# odds
exp(-1.327993)

# probability
exp(-1.327993) / (exp(-1.327993) + 1)

For people that do use oxygen:

# log(odds)
-1.327993 + 2.903864

# odds
exp(-1.327993 + 2.903864)

# probability
exp(-1.327993 + 2.903864) / (exp(-1.327993 + 2.903864) + 1)