library(tidyverse)
titanic <- read_csv("https://mac-stat.github.io/data/titanic.csv") %>%
mutate(Survived = as.factor(Survived)) %>%
filter(!is.na(PClass))
head(titanic)16 Simple logistic regression
Settling In
- Sit with your draft group (see Slack)
- Introduce yourself
- Chat about data sets that you were interested in.
- Help each other get ready to take notes!
- Open your notebook.
- Open the online manual to the “Course Schedule” and click on today’s activity. That brings you here!
- Download “16-logistic-univariate.qmd” and open it in RStudio. Read the “Organizing your files” directions at the top of the file!!
- Open your notebook.
Recap
Thus far we’ve used linear regression models (simple and multiple) to explore the relationship of a quantitative response variable Y with some predictor X (quantitative or categorical):
E[Y | X_1, X_2, ..., X_p] = \beta_0 + \beta_1 X_1 + \beta_2 X_3 + ... + \beta_p X_p
Another statistical superpower?! We can extend these ideas to the common scenario in which Y is categorical. In STAT 155 we’ll specifically focus on using logistic regression models for binary categorical response variables Y (with 2 categories). This dramatically expands the types of relationships we’re able to explore.
By the end of this lesson, you should be able to:
- Distinguish between probabilities and odds, and convert one to the other
- Explain the differences between linear regression and logistic regression for modeling binary outcomes
- Explore simple logistic regression models
- visualize relationships
- build models in R
- interpret coefficients
- make predictions
- converting from log odds to odds to probability scales

Logistic Regression
If our outcome is categorical with 2 outcomes (A and B), we create an indicator version of the outcome, Y.
Y = \begin{cases}1 \quad \text{if outcome A happens}\\0 \quad \text{if outcome A doesn't happens}\end{cases}
The “average” value of the indicator Y is the probability of the event, Y=1,
E[Y] = P(Y = 1) = P(\text{the event}) = p
We model the transformed version of E[Y | X], the expected value of Y “given” explanatory variable X using a linear model,
\log \left(\frac{E[Y | X]}{1-E[Y | X ]}\right) = \beta_0 + \beta_1X
which can be written as,
\log \left(Odds[Y = 1 | X]\right) = \beta_0 + \beta_1X
Model Formula Rearrangements
If we exponentiate both sides, e^{LHS} = e^{RHS}, we get
Odds[Y = 1 | X] = \left(\frac{E[Y | X]}{1-E[Y | X ]}\right) = e^{\beta_0 + \beta_1X}
If we use the fact that Odds = \frac{p}{1-p} can be written as p = \frac{Odds}{1 + Odds}, we get
P(Y = 1 | X) = E[Y | X] = \frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}}
Interpretations / Comparisons
To do interpretations, we compare two groups that have 1 unit different in the X (eventually, when we have more X’s, we’ll keep all other variables in the model constant).
Here is the difference in the log(Odds) between those two groups,
\log \left(Odds[Y = 1 | X = x+1 ]\right) - \log \left(Odds[Y = 1 | X = x ]\right) = (\beta_0 + \beta_1(x+1)) - (\beta_0 + \beta_1x)
We can simplify the right hand side (RHS),
\log \left(Odds[Y = 1 | X = x+1 ]\right) - \log \left(Odds[Y = 1 | X = x ]\right) = \beta_1
Now, simplify the left hand side (LHS) using logarithm rules,
\log \left(\frac{Odds[Y = 1 | X = x+1 ] }{ Odds[Y = 1 | X = x ]}\right) = \beta_1
Exponentiate both sides,
\left(\frac{Odds[Y = 1 | X = x+1 ] }{ Odds[Y = 1 | X = x ]} \right)= e^{\beta_1}
We find that e^{\beta_1} gives us the Odds Ratio.
Odds ratios
Let “odds for group A” be the conditional odds of an event for some group A.
Let “odds for group B” be the conditional odds of an event for some group B.
Then the odds ratio (OR) of the event, comparing group A to group B is:
OR = (odds for group A) / (odds for group B)
Then OR >= 0 where:
If OR > 1, then the odds of the event are higher for group A than group B. Examples:
- OR = 2: The odds of the event are 2 times higher (100% higher) for group A than group B.
- OR = 1.5: The odds of the event are one and a half times higher (50% higher) for group A than group B.
If OR = 1, then the odds of the event are the same for groups A and B.
If OR < 1, then the odds of the event are lower for group A than group B. Examples:
- OR = 0.5: The odds of the event are half as high (50% as high) for group A than group B.
- OR = 0.2:
- The odds of the event for group A are 0.2 times (20% as high as) the odds for group B.
- The odds of the event are 80% lower for group A than group B.
English Language
Increase
- “increase by 4 percentage points” implies + 4 (if the outcome unit was percent)
- “increase by 4%” implies *1.04
- “increase by a factor of 1.04” implies *1.04
- “1.04 time higher” implies *1.04
Decrease
- “decrease by 96 percentage points” implies - 96 (if the outcome unit was percent)
- “decrease by 5%” implies *0.95
- “decrease by a factor of 0.95” implies *0.95
- “0.95 times that of” implies *0.95
Warm-up
We’ve been focusing on “Normal” linear regression models for quantitative response variables Y.
But not all response variables are quantitative! We’ll now explore logistic regression models for binary categorical response variables Y.
This dramatically expands the types of relationships we’re able to explore. Though the details of our models will change, the spirit and considerations are the same:
- we want to learn about Y from predictors X
- we want to build, interpret, and evaluate our models
- we need to be mindful of multicollinearity, overfitting, etc in this process
Example 1: Normal or logistic?
For each scenario below, indicate whether “Normal” or logistic regression would be the appropriate modeling tool.
We want to model a person’s commute time (Y) by their distance to work (X).
We want to model a person’s commute time (Y) by whether or not they take public transit (X).
We want to model whether or not a person gets a speeding ticket (Y) based on their driving speed (X).
Example 2: probability vs odds vs log(odds)
- There’s a 0.9 probability that your team wins its next game. Calculate the corresponding odds and log(odds) of this event. NOTE:
log()is the natural log function in RStudio.
- The log(odds) that your bus is late are -1.0986. Calculate the corresponding odds and probability of this event. NOTE: Use
exp().
Example 3: Conditional odds
The Titanic was a British passenger ship that famously sank in 1912 after hitting an iceberg in the North Atlantic Ocean.
Approximately 2200 passengers were on board, and it’s estimated that 1500 did not survive the crash.
The titanic dataset has info on a sample of these passengers:
Let’s use it to explore the relationship of:
- Y =
Survived(1 = survived, 0 = did not survive) - X =
PClass(ticket class = 1st, 2nd, 3rd)
To this end, check out some tables of counts:
titanic %>%
count(PClass, Survived)
library(janitor)
titanic %>%
tabyl(PClass, Survived) %>%
adorn_totals(c("row", "col"))And a plot of Y alone, and 2 plots of Y with X.
titanic %>%
ggplot(aes(x = Survived)) +
geom_bar() +
theme_bw()
titanic %>%
ggplot(aes(x = PClass, fill = Survived)) +
geom_bar() +
theme_bw() +
theme(legend.position = "bottom")
titanic %>%
ggplot(aes(x = PClass, fill = Survived)) +
geom_bar(position = "fill") +
theme_bw() +
theme(legend.position = "bottom")Calculate the overall odds that a passenger survived. (Which plot provides insight here?)
Y and X are both categorical. To visualize their relationship, we can make bar plots of X, split up by category of Y. Use plots 2 & 3 to describe the relationship of survival with ticket class. (What’s the difference between these plots, and which plot makes it easier to answer this question?!)
How does Y depend upon X? Calculate the conditional odds that a 1st class passenger survived, i.e. the odds of survival conditioned on the passenger being 1st class: Odds(survive | 1st class)
Calculate the conditional odds that a 3rd class passenger survived: Odds(survive | 3rd class)
Example 4: Odds ratios
Models help us compare the behavior of Y for different possible X values.
In logistic regression, these comparisons are often measured by odds ratios, a ratio of odds (unsurprisingly) that compares the odds of one group to another.
Calculate & interpret the odds ratio of surviving the Titanic, comparing those in the 1st class to those in the 3rd class.
Calculate & interpret the odds ratio of surviving the Titanic, comparing those in the 3rd class to those in the 1st class.
Exercises
The story
The climbers data is sub-sample of the Himalayan Database distributed through the R for Data Science TidyTuesday project.
This dataset includes information on the results and conditions for various Himalayan climbing expeditions.
Each row corresponds to a single member of a climbing expedition team:
# Load packages and data
library(tidyverse)
climbers <- read_csv("https://mac-stat.github.io/data/climbers_sub.csv") %>%
select(peak_name, success, age, oxygen_used, height_metres)
# Check out the first 6 rows
head(climbers)# A tibble: 6 × 5
peak_name success age oxygen_used height_metres
<chr> <lgl> <dbl> <lgl> <dbl>
1 Ama Dablam TRUE 28 FALSE 6814
2 Ama Dablam TRUE 27 FALSE 6814
3 Ama Dablam TRUE 35 FALSE 6814
4 Ama Dablam TRUE 37 FALSE 6814
5 Ama Dablam TRUE 43 FALSE 6814
6 Ama Dablam FALSE 38 FALSE 6814
Our goal will be to model whether or not a climber has success (Y) by their age (in years) or oxygen_used (TRUE or FALSE).
Since success is a binary categorical variable, i.e. a climber is either successful or they’re not, we’ll utilize logistic regression.
Exercise 1: Getting to know Y
Our response variable Y, success, is binary (hence categorical). Thus we can explore Y alone using a bar plot and table of counts. Note what you learn here.
climbers %>%
count(success)
climbers %>%
ggplot(aes(x = success)) +
geom_bar()Exercise 2: Visualizing success vs age
Let’s now explore the relationship of success with age. Since success (Y) is categorical and age (X) is quantitative, it can be tempting to draw something like a side-by-side boxplot:
climbers %>%
ggplot(aes(y = age, x = success)) +
geom_boxplot()BUT this only gives us insight into the dependence of age on success, not success on age (it reverses the roles of Y and X).
Specifically, it helps us answer “how does age compare among success groups” not “how does success compare among age groups”.
Instead, we want to calculate the observed success rate (probability of success) at each age, or in each age bracket.
We can calculate these success rates by age, i.e. the proportion of people of each age that succeeded, using group_by() and summarize():
# Calculate success rate by age
climbers %>%
group_by(age) %>%
summarize(success_rate = mean(success)) %>%
head(3)
# Split climbers into larger, more stable age brackets
# (This is good when we don't have many observations at each x!)
# Calculate success rate by age bracket
climbers %>%
mutate(age_bracket = cut(age, breaks = 20)) %>%
group_by(age_bracket) %>%
summarize(success_rate = mean(success)) %>%
head(3)Now, plot the success rates by age and age bracket below. Summarize what you learn about the relationship (note that we weren’t able to observe this information from the boxplots!).
climbers %>%
group_by(age) %>%
summarize(success_rate = mean(success)) %>%
ggplot(aes(x = age, y = success_rate)) +
geom_point()
climbers %>%
mutate(age_bracket = cut(age, breaks = 20)) %>%
group_by(age_bracket) %>%
summarize(success_rate = mean(success)) %>%
ggplot(aes(x = age_bracket, y = success_rate)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))Exercise 3: Logistic regression in R
To model the relationship between success and age, we can construct the logistic regression model using the code below.
Point out the 2 new things about this code.
climbing_model_1 <- glm(success ~ age, data = climbers, family = "binomial")
coef(summary(climbing_model_1)) Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.42568563 0.169177382 2.516209 1.186249e-02
age -0.02397199 0.004489504 -5.339564 9.317048e-08
Exercise 4: Writing out & plotting the model
The estimated model formula is stated below on the log(odds), odds, and probability scales.
log(odds of success) = 0.42569 - 0.02397age
odds of success = e^{0.42569 - 0.02397\text{age}}
probability of success = e^{0.42569 - 0.02397\text{age}} / (e^{0.42569 - 0.02397\text{age}} + 1)
Convince yourselves that you know where the above formulas come from!
Our one model, expressed on different scales, is plotted below. As with the plot of our data on success rates in Exercise 3, these indicate that success rate decreases with age. (This is true on every scale, of course.) Compare the shapes of the models on these different scales.
# Incorporate predictions on the probability, odds, & log-odds scales
climbers_predictions <- climbers %>%
mutate(probability = climbing_model_1$fitted.values) %>%
mutate(odds = probability / (1-probability)) %>%
mutate(log_odds = log(odds))
# Plot the model on the log-odds scale
ggplot(climbers_predictions, aes(x = age, y = log_odds)) +
geom_smooth(se = FALSE)
# Plot the model on the odds scale
ggplot(climbers_predictions, aes(x = age, y = odds)) +
geom_smooth(se = FALSE)
# Plot the model on the probability scale
ggplot(climbers_predictions, aes(x = age, y = probability)) +
geom_smooth(se = FALSE)
#...and zoom out to see broader (and impossible) age range
ggplot(climbers_predictions, aes(x = age, y = as.numeric(success))) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, fullrange = TRUE) +
labs(y = "probability") +
lims(x = c(-200,200))Exercise 5: Predictions
Let’s use the model formulas from the above exercise to make predictions. Consider a 20-year-old climber.
Calculate the log(odds) that this climber will succeed. HINT: use the log(odds) model.
Calculate the odds that this climber will succeed. HINT: Either transform the log(odds) prediction or use the odds model.
Calculate the probability that this climber will succeed. HINT: Either transform the odds prediction or use the probability model.
Check your calculations 2 ways:
- Are they consistent with the model visualizations above?
- Do they match the results of the
predict()function?
# Check the log-odds prediction
predict(climbing_model_1, newdata = data.frame(age = 20))
# Check the odds prediction
exp(predict(climbing_model_1, newdata = data.frame(age = 20)))
# Check the probability prediction
predict(climbing_model_1, newdata = data.frame(age = 20), type = "response")PAUSE: Interpreting the coefficients
Up next, we’ll interpret the coefficients of our logistic regression model:
log(odds of success) = 0.42569 - 0.02397age
Since these are on the log(odds) scale, which isn’t easy to understand, we can convert them to the odds scale by exponentiating. You’ll need to reference these below:
coef(climbing_model_1)
exp(coef(climbing_model_1))Exercise 6: Interpreting the intercept
Let’s interpret the intercept of our logistic regression model:
log(odds of success) = 0.42569 - 0.02397age
Translate the below interpretation on the log(odds) scale to the odds scale:
- The log(odds of success) for 0-year-old climbers (lol) is 0.42569.
- The odds of success for 0-year-old climbers (lol) are ___.
Exercise 7: Interpreting the age coefficient
Next, let’s interpret the age coefficient:
log(odds of success) = 0.42569 - 0.02397age
On the log(odds) scale (a line), this is just a slope: A 1-year increase in age is associated with a decrease of 0.02397 in the log(odds of success), on average. BUT when we convert the model to the odds scale, the model is NOT linear, thus the exponentiated age coefficient is not a slope. So what does it mean?
- Below are the odds of success among various ages:
exp(predict(climbing_model_1, newdata = data.frame(age = 23))) 1
0.8819057
exp(predict(climbing_model_1, newdata = data.frame(age = 22))) 1
0.9033021
exp(predict(climbing_model_1, newdata = data.frame(age = 21))) 1
0.9252177
exp(predict(climbing_model_1, newdata = data.frame(age = 20))) 1
0.947665
Calculate and interpret the odds ratio of success, comparing 23-year-olds to 22-year-olds. (Do not use rounded numbers in this calculation or those below!)
Calculate and interpret the odds ratio of success, comparing 22-year-olds to 21-year-olds.
Calculate and interpret the odds ratio of success, comparing 21-year-olds to 20-year-olds.
Summarize the punchline here.
On the log(odds) scale, the
agecoefficient is 0.02397:
On average, for every 1-year increase in age, log(odds of success) decrease by 0.02397.On the odds scale, the
agecoefficient is e^{-0.02397} = 0.976313:
On average, for every 1-year increase in age, odds of success are ???% as high.That is, they decrease by ???%
Exercise 8: Another climbing model
Next, let’s explore the relationship of success (Y) with oxygen_used (X). Since both variables are categorical, we can fill in a bar plot of X with the categories of Y:
# Plot the relationship
climbers %>%
ggplot(aes(x = oxygen_used, fill = success)) +
geom_bar(position = "fill")
# Model the relationship
climbing_model_2 <- glm(success ~ oxygen_used, data = climbers, family = "binomial")
coef(summary(climbing_model_2))Thus the estimated model formula is:
log(odds of success) = -1.327993 + 2.903864 oxygen_usedTRUE
Describe what you learn about the relationship from the plot alone.
Below are the odds and probabilities of success for climbers that do and don’t use oxygen. Calculate and interpret the odds ratio of success, comparing climbers that do use oxygen to those that don’t. NOTE: Your calculation should be above 1!!
# odds of success for a climber that doesn't use oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE)))
# probability of success for a climber that doesn't use oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE), type = "response")
# odds of success for a climber that uses oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE)))
# probability of success for a climber that uses oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE), type = "response")Exercise 9: Interpreting coefficients
Let’s interpret the coefficients of our logistic regression model:
log(odds of success) = -1.327993 + 2.903864 oxygen_usedTRUE
On the log(odds) scale:
- The log(odds of success) for climbers that don’t use oxygen is -1.327993.
- The log(odds of success) is 2.903864 higher for climbers that use oxygen than for those that don’t.
Yuk. Convert those coefficients to the odds scale:
coef(climbing_model_2)
exp(coef(climbing_model_2))How can we interpret the intercept on the odds scale, 0.265? Where have you observed this number before?!
How can we interpret the
oxygen_usedTRUEcoefficient on the odds scale, 18.24? Where have you observed this number before?!
Exercise 10: Connecting tables to logistic regression
For the simple logistic regression model with 1 predictor, which is categorical, the model coefficients can be calculated from the table of counts!
Here rows represent whether oxygen was used and columns represent whether the climber was successful:
# Rows correspond to oxygen_used
# Columns correspond to success
climbers %>%
tabyl(oxygen_used, success) %>%
adorn_totals(c("row", "col"))Use this table to calculate the odds of success for somebody that didn’t use oxygen. Where have you observed this number before?! HINT: Use the FALSE row.
Use this table to calculate the odds ratio of success, comparing climbers that do use oxygen to those that don’t. Where have you observed this number before?! HINT: You’ll first need to calculate the odds of success for people that use oxygen.
Reflection
What binary outcomes might be relevant in your project? What predictor(s) could be relevant in a logistic regression model for that outcome?
Extra practice
Exercise 11: Calculating probabilities and odds
In Exercise 8, we used the predict() function to calculate the odds and probabilities of success for climbers that do and don’t use oxygen:
# odds of success for a climber that doesn't use oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE)))
# probability of success for a climber that doesn't use oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE), type = "response")
# odds of success for a climber that uses oxygen
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE)))
# probability of success for a climber that uses oxygen
predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE), type = "response")Calculate these 4 values by hand, starting from the model formula:
log(odds of success) = -1.327993 + 2.903864 oxygen_usedTRUE
Exercise 12: Interpreting logistic regression coefficients
In general (not just for climbers), let’s summarize how to interpret logistic regression coefficients. Review the details below and convince yourself that these general properties are consistent with your work above. Let Y be a binary categorical response variable (eg: Y = 1 if the event of interest happens and is 0 otherwise). Then a logistic regression model of Y by Y is
\log(\text{odds}) = \beta_0 + \beta_1 X
- Coefficient interpretation for quantitative X
\begin{split} \beta_0 & = \text{ LOG(ODDS) when } X=0 \\ e^{\beta_0} & = \text{ ODDS when } X=0 \\ \beta_1 & = \text{ change in LOG(ODDS) per 1 unit increase in } X \\ e^{\beta_1} & = \text{ multiplicative change in ODDS per 1 unit increase in } X \text{ (ie. the odds ratio, comparing the odds for X + 1 vs X)} \\ \end{split} - Coefficient interpretation for categorical X
\begin{split} \beta_0 & = \text{ LOG(ODDS) at the reference category } \\ e^{\beta_0} & = \text{ ODDS at the reference category } \\ \beta_1 & = \text{ unit change in LOG(ODDS) relative to the reference} \\ e^{\beta_1} & = \text{ multiplicative change in ODDS relative to the reference } \text{ (ie. the odds ratio, comparing the odds for the given category vs the reference category)} \\ \end{split}
Exercise 13: R code
A lot of R code was used to produce the plots and other output above. The code is not the point. It’s more important to understand the logistic regression concepts, and the output of the code – you can always look up code. With that in mind, review the code to get a sense for the structure!
Exercise 14: Optional math challenge
Consider a simple logistic regression model:
\log(\text{odds}) = \beta_0 + \beta_1 X
Prove that e^{\beta_1} indicates the MULTIPLICATIVE change in ODDS for a 1-unit increase in X, i.e. that it’s equivalent to the odds ratio for “X+1” vs “X”.
Wrap Up
- PS 5 is posted and due after spring break
- You can complete all exercises now.
- Just note that if you have more than 1 X variable, you’ll need to “keep all other variables fixed” in your interpretations.
Solutions
Warm Up
Example 1: Normal or logistic?
Solution
- Normal. Y is quantitative.
- Normal. Y is quantitative.
- Logistic. Y is binary.
Example 2: probability vs odds vs log(odds)
Solution
# a.
# odds
0.9 / (1 - 0.9)
# log(odds)
log(9)
# b.
# odds
exp(-1.0986)
# prob
0.333 / (0.333 + 1)Example 3: Conditional odds
Solution
library(tidyverse)
library(ggmosaic)
titanic <- read_csv("https://mac-stat.github.io/data/titanic.csv")
head(titanic)# A tibble: 6 × 5
Name PClass Age Sex Survived
<chr> <chr> <dbl> <chr> <dbl>
1 Allen, Miss Elisabeth Walton 1st 29 female 1
2 Allison, Miss Helen Loraine 1st 2 female 0
3 Allison, Mr Hudson Joshua Creighton 1st 30 male 0
4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25 female 0
5 Allison, Master Hudson Trevor 1st 0.92 male 1
6 Anderson, Mr Harry 1st 47 male 1
titanic %>%
count(PClass, Survived)# A tibble: 7 × 3
PClass Survived n
<chr> <dbl> <int>
1 1st 0 129
2 1st 1 193
3 2nd 0 160
4 2nd 1 119
5 3rd 0 573
6 3rd 1 138
7 <NA> 0 1
library(janitor)
titanic %>%
tabyl(PClass, Survived, show_na = FALSE) %>%
adorn_totals(c("row", "col")) PClass 0 1 Total
1st 129 193 322
2nd 160 119 279
3rd 573 138 711
Total 862 450 1312
.
# probability of survival 450 / 1312 # odds of survival (450 / 1312) / (1 - (450 / 1312))Survival was most likely for 1st class passengers and least likely for 3rd class passengers. NOTE: The first plot makes more clear that there were more 3rd class passengers, but the second plot makes it easier to compare survival rates among each class.
titanic %>%
ggplot(aes(x = PClass, fill = factor(Survived))) +
geom_bar()
titanic %>%
ggplot(aes(x = PClass, fill = factor(Survived))) +
geom_bar(position = "fill")
.
# probability of survival for 1st class 193 / 322[1] 0.5993789# odds of survival for 1st class (193 / 322) / (1 - (193 / 322))[1] 1.496124.
# probability of survival for 3rd class 138 / 711[1] 0.1940928# odds of survival for 3rd class (138 / 711) / (1 - (138 / 711))[1] 0.2408377
Example 4: Odds ratios
Solution
The odds of survival were more than 6 times higher for 1st class vs 3rd class:
# odds of survival for 1st class / odds of survival for 3rd class 1.50 / 0.24The odds of survival for 3rd class passengers were only 16% as high as those for 1st class passengers (i.e. they were 84% lower):
# odds of survival for 3rd class / odds of survival for 1st class 0.24 / 1.50
Exercises
Exercise 1: Getting to know Y
Solution
# Load packages and data
library(tidyverse)
climbers <- read_csv("https://mac-stat.github.io/data/climbers_sub.csv") %>%
select(peak_name, success, age, oxygen_used, height_metres)
# Check out the first 6 rows
head(climbers)# A tibble: 6 × 5
peak_name success age oxygen_used height_metres
<chr> <lgl> <dbl> <lgl> <dbl>
1 Ama Dablam TRUE 28 FALSE 6814
2 Ama Dablam TRUE 27 FALSE 6814
3 Ama Dablam TRUE 35 FALSE 6814
4 Ama Dablam TRUE 37 FALSE 6814
5 Ama Dablam TRUE 43 FALSE 6814
6 Ama Dablam FALSE 38 FALSE 6814
Roughly 40% of climbers succeed.
climbers %>%
count(success)# A tibble: 2 × 2
success n
<lgl> <int>
1 FALSE 1269
2 TRUE 807
climbers %>%
ggplot(aes(x = success)) +
geom_bar()
Exercise 2: Visualizing success vs age
Solution
Success rates start at nearly 60% for teenagers, then decrease with age – success rates approach 0 for climbers in their 70s.
# Calculate success rate by age
climbers %>%
group_by(age) %>%
summarize(success_rate = mean(success))# A tibble: 59 × 2
age success_rate
<dbl> <dbl>
1 17 0
2 18 0.429
3 19 0.667
4 20 0.857
5 21 0.577
6 22 0.486
7 23 0.526
8 24 0.333
9 25 0.463
10 26 0.413
# ℹ 49 more rows
# Split climbers into larger, more stable age brackets
# (This is good when we don't have many observations at each x!)
# Calculate success rate by age bracket
climbers %>%
mutate(age_bracket = cut(age, breaks = 20)) %>%
group_by(age_bracket)# A tibble: 2,076 × 6
# Groups: age_bracket [20]
peak_name success age oxygen_used height_metres age_bracket
<chr> <lgl> <dbl> <lgl> <dbl> <fct>
1 Ama Dablam TRUE 28 FALSE 6814 (25.9,28.8]
2 Ama Dablam TRUE 27 FALSE 6814 (25.9,28.8]
3 Ama Dablam TRUE 35 FALSE 6814 (34.7,37.6]
4 Ama Dablam TRUE 37 FALSE 6814 (34.7,37.6]
5 Ama Dablam TRUE 43 FALSE 6814 (40.6,43.6]
6 Ama Dablam FALSE 38 FALSE 6814 (37.6,40.6]
7 Ama Dablam TRUE 23 FALSE 6814 (22.9,25.9]
8 Ama Dablam FALSE 36 FALSE 6814 (34.7,37.6]
9 Ama Dablam TRUE 48 FALSE 6814 (46.5,49.5]
10 Ama Dablam TRUE 25 FALSE 6814 (22.9,25.9]
# ℹ 2,066 more rows
climbers %>%
group_by(age) %>%
summarize(success_rate = mean(success)) %>%
ggplot(aes(x = age, y = success_rate)) +
geom_point()
climbers %>%
mutate(age_bracket = cut(age, breaks = 20)) %>%
group_by(age_bracket) %>%
summarize(success_rate = mean(success)) %>%
ggplot(aes(x = age_bracket, y = success_rate)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Exercise 3: Logistic regression in R
Solution
We’re using glm instead of lm, and add family = "binomial" to specify that this is a logistic model:
climbing_model_1 <- glm(success ~ age, data = climbers, family = "binomial")
coef(summary(climbing_model_1)) Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.42568563 0.169177382 2.516209 1.186249e-02
age -0.02397199 0.004489504 -5.339564 9.317048e-08
Exercise 4: Writing out & plotting the model
Exercise 5: Predictions
Solution
#a
log_odds <- 0.42568563 - 0.02397199 * 20
log_odds[1] -0.05375417
#b
odds <- exp(log_odds)
odds[1] 0.947665
#c
prob <- odds / (odds + 1)
prob[1] 0.4865647
#d
predict(climbing_model_1, newdata = data.frame(age = 20)) 1
-0.05375421
exp(predict(climbing_model_1, newdata = data.frame(age = 20))) 1
0.947665
predict(climbing_model_1, newdata = data.frame(age = 20), type = "response") 1
0.4865647
Exercise 6: Interpreting the intercept
Solution
Among 0 year old climbers (lol), the odds of success is 1.5, i.e. 1.5 times more likely than not.
exp(0.42569)Exercise 7: Interpreting the age coefficient
Solution
exp(predict(climbing_model_1, newdata = data.frame(age = 23))) 1
0.8819057
exp(predict(climbing_model_1, newdata = data.frame(age = 22))) 1
0.9033021
exp(predict(climbing_model_1, newdata = data.frame(age = 21))) 1
0.9252177
exp(predict(climbing_model_1, newdata = data.frame(age = 20))) 1
0.947665
Part a
Solution
The odds of success for a 23-year-old are 97.6% as high as those for a 22-year-old (i.e. the odds are 2.4% lower).
0.8819057 / 0.9033021Part b
Solution
The odds of success for a 22-year-old are 97.6% as high as those for a 21-year-old (i.e. the odds are 2.4% lower).
0.9033021 / 0.9252177Part c
Solution
The odds of success for a 21-year-old are 97.6% as high as those for a 20-year-old (i.e. the odds are 2.4% lower).
0.9252177 / 0.947665Part d
Solution
On the log(odds) scale, the
agecoefficient is 0.02397:
On average, for every 1-year increase in age, log(odds of success) decrease by 0.02397.On the odds scale, the
agecoefficient is e^{-0.02397} = 0.976313:
On average, for every 1-year increase in age, odds of success are 97.6% as high. That is, they decrease by 2.4%.
Exercise 8: Another climbing model
Solution
# Plot the relationship
climbers %>%
ggplot(aes(x = oxygen_used, fill = success)) +
geom_bar(position = "fill")
# Model the relationship
climbing_model_2 <- glm(success ~ oxygen_used, data = climbers, family = "binomial")
coef(summary(climbing_model_2)) Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.327993 0.0639834 -20.75528 1.098514e-95
oxygen_usedTRUE 2.903864 0.1257404 23.09412 5.304291e-118
Part a
Solution
Success is more likely if you use oxygen. The probability of success appears to be between 85% and 90% for people that use oxygen, and under 25% for people that don’t.
Part b
Solution
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE))) 1
0.2650086
predict(climbing_model_2, newdata = data.frame(oxygen_used = FALSE), type = "response") 1
0.2094915
exp(predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE))) 1
4.834951
predict(climbing_model_2, newdata = data.frame(oxygen_used = TRUE), type = "response") 1
0.828619
4.834951 / 0.2650086[1] 18.24451
Exercise 9: Interpreting coefficients
Part a
Solution
This is the odds of success for somebody that doesn’t use oxygen (the reference level).
Part b
Solution
This is the odds ratio of success, comparing climbers that use oxygen to those that don’t. The odds of success are 18 times higher if using oxygen.
Exercise 10: Connecting tables to logistic regression
Solution
# Rows correspond to oxygen_used
# Columns correspond to success
climbers %>%
tabyl(oxygen_used, success) %>%
adorn_totals(c("row", "col")) oxygen_used FALSE TRUE Total
FALSE 1166 309 1475
TRUE 103 498 601
Total 1269 807 2076
Part a
Solution
This is the intercept coefficient on the odds scale!
# probability of success (no oxygen)
309 / 1475[1] 0.2094915
# odds of success (no oxygen)
(309 / 1475) / (1 - (309 / 1475))[1] 0.2650086
Part b
Solution
This is the oxygen_usedTRUE coefficient on the odds scale!
# probability of success (oxygen)
498 / 601[1] 0.828619
# odds of success (oxygen)
(498 / 601) / (1 - (498 / 601))[1] 4.834951
# odds ratio
4.834951 / 0.2650086[1] 18.24451
Exercise 11: Calculating probabilities and odds
Solution
For people that don’t use oxygen:
# log(odds)
-1.327993
# odds
exp(-1.327993)
# probability
exp(-1.327993) / (exp(-1.327993) + 1)For people that do use oxygen:
# log(odds)
-1.327993 + 2.903864
# odds
exp(-1.327993 + 2.903864)
# probability
exp(-1.327993 + 2.903864) / (exp(-1.327993 + 2.903864) + 1)