Topic 4 Cross-validation
Learning Goals
- Inform and justify data analysis and modeling process and the resulting conclusions with clear, organized, logical, and compelling details that adapt to the background, values, and motivations of the audience and context in which communication occurs
- Accurately describe all steps of cross-validation to estimate the test/out-of-sample version of a model evaluation metric
- Explain what role CV has in a predictive modeling analysis and its connection to overfitting
- Explain the pros/cons of higher vs. lower v in v-fold CV in terms of sample size and computing time
- Implement cross-validation in R using the tidymodels package
Slides from today are available here.
Writing Good Sentences
From Chapter 9 of Communicating with Data:
When revising sentences, here are four possible actions you can take:
- Trim. Eliminate empty phrases, trim fat phrases, reduce modifiers, and drop redundant adjectives
- Straighten. Convert a convoluted sentence into a straightforward one, reorder phrases, and break the sentence into multiple sentences
- Emphasis. Order concepts by importance, balance general with specific, and define statistical terms
- Word choice. Replace weak nouns with concrete ones, replace passive verbs with active voice, and match the connotation of words with the context of the sentence
Instructions
- One person from the group makes a copy of this document and then shares that copy with everyone in the group and the instructor.
- Change “editing” (look for the pencil at the top right of document) to “suggesting”
- Each person focuses on one sentence (2, 3, 8, or 9) in the document to revise.
- Spend 3 minutes working to revise the sentence.
- Rotate to the next sentence and continue revising with the goal of improving the sentences as a group.
Exercises
You can download a template RMarkdown file to start from here.
Context
We’ll be working with a dataset containing physical measurements on 80 adult males. These measurements include body fat percentage estimates as well as body circumference measurements.
- fatBrozek: Percent body fat using Brozek’s equation: 457/Density - 414.2
- fatSiri: Percent body fat using Siri’s equation: 495/Density - 450
- density: Density determined from underwater weighing (gm/cm^3)
- age: Age (years)
- weight: Weight (lbs)
- height: Height (inches)
- neck: Neck circumference (cm)
- chest: Chest circumference (cm)
- abdomen: Abdomen circumference (cm)
- hip: Hip circumference (cm)
- thigh: Thigh circumference (cm)
- knee: Knee circumference (cm)
- ankle: Ankle circumference (cm)
- biceps: Biceps (extended) circumference (cm)
- forearm: Forearm circumference (cm)
- wrist: Wrist circumference (cm)
It takes a lot of effort to estimate body fat percentage accurately through underwater weighing. The goal is to build the best predictive model for fatSiri using just circumference measurements, which are more easily attainable. (We won’t use fatBrozek or density as predictors because they’re other outcome variables.)
library(readr)
library(ggplot2)
library(dplyr)
library(tidymodels)
tidymodels_prefer()

bodyfat_train <- read_csv("https://www.dropbox.com/s/js2gxnazybokbzh/bodyfat_train.csv?dl=1")

# Remove the fatBrozek and density variables
bodyfat_train <- bodyfat_train %>%
  select(-fatBrozek, -density, -hipin)
Consider the 4 models you’ve used before:
lm_spec <- linear_reg() %>%
  set_engine(engine = 'lm') %>%
  set_mode('regression')

mod1 <- fit(lm_spec,
  fatSiri ~ age+weight+neck+abdomen+thigh+forearm,
  data = bodyfat_train)

mod2 <- fit(lm_spec,
  fatSiri ~ age+weight+neck+abdomen+thigh+forearm+biceps,
  data = bodyfat_train)

mod3 <- fit(lm_spec,
  fatSiri ~ age+weight+neck+abdomen+thigh+forearm+biceps+chest+hip,
  data = bodyfat_train)

mod4 <- fit(lm_spec,
  fatSiri ~ ., # The . means all predictors
  data = bodyfat_train)
Exercise 1: Cross-validation in Concept
We are going to repeat what we did last week, but this time we will use cross-validation to evaluate models in terms of their predictive performance on “new” data and to choose a good model.
- In pairs or triplets, take turns explaining to each other the steps of cross-validation (CV) in concept, and then how you might use 10-fold CV with these 80 individual data points. (A rough by-hand sketch appears below for reference.)
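To make the steps concrete, here is a minimal by-hand sketch of 10-fold CV for mod1, assuming the bodyfat_train data from above. It uses base lm() rather than the tidymodels lm_spec, just to expose the loop, and the seed value is arbitrary:

set.seed(123)                                  # fold assignment is random
n <- nrow(bodyfat_train)                       # 80 cases, so each fold has 8
fold_id <- sample(rep(1:10, length.out = n))   # randomly assign each case to a fold

fold_rmse <- numeric(10)
for (k in 1:10) {
  train_k <- bodyfat_train[fold_id != k, ]     # 72 cases used for fitting
  test_k  <- bodyfat_train[fold_id == k, ]     # 8 held-out cases
  fit_k <- lm(fatSiri ~ age + weight + neck + abdomen + thigh + forearm, data = train_k)
  fold_rmse[k] <- sqrt(mean((test_k$fatSiri - predict(fit_k, newdata = test_k))^2))
}
mean(fold_rmse)   # the 10-fold CV estimate of the test RMSE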
Exercise 2: Cross-validation with tidymodels
- Complete the code below to perform 10-fold cross-validation for mod1 to estimate the test RMSE (\(\text{CV}_{(10)}\)). Do we need to use set.seed()? Why or why not? (Is there a number of folds for which we would not need to set the seed?)
# Do we need to use set.seed()?
bodyfat_cv <- vfold_cv(??, v = 10)

model_wf <- workflow() %>%
  add_formula(??) %>%
  add_model(lm_spec)

mod1_cv <- fit_resamples(model_wf,
  resamples = bodyfat_cv,
  metrics = metric_set(rmse, rsq, mae)
)
- Look at mod1_cv %>% unnest(.metrics), and use this to calculate the 10-fold cross-validated RMSE by hand. (Note: We haven’t done this together, but how can you adapt code that we’ve used before?)
- Check your answer to the previous part by directly printing out the CV metrics: mod1_cv %>% collect_metrics(). Interpret this metric.
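If you get stuck on the by-hand calculation, here is one possible approach (a sketch that assumes the ?? blanks above have been filled in so that mod1_cv exists); its result should match collect_metrics():

mod1_cv %>%
  unnest(.metrics) %>%                     # one row per fold per metric
  filter(.metric == "rmse") %>%            # keep the 10 fold-level RMSEs
  summarize(cv_rmse = mean(.estimate))     # average them to get CV_(10)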
Exercise 3: Looking at the evaluation metrics
Look at the completed table below of evaluation metrics for the 4 models.
- Which model performed the best on the training data?
- Which model performed best on the test set (through CV)?
- Explain why there’s a discrepancy between these 2 answers and why CV, in general, can help prevent overfitting.
Model | Training RMSE | \(\text{CV}_{(10)}\) |
---|---|---|
mod1 | 3.810712 | 4.389568 |
mod2 | 3.766645 | 4.438637 |
mod3 | 3.752362 | 4.517281 |
mod4 | 3.572299 | 4.543343 |
Exercise 4: Practical issues: choosing the number of folds
- In terms of sample size, what are the pros/cons of a low vs. a high number of folds?
- In terms of computational time, what are the pros/cons of a low vs. a high number of folds?
- If possible, it is advisable to choose the number of folds to be a divisor of the sample size. Why do you think that is? (The quick check below may help.)
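As a quick check (a sketch assuming the bodyfat_train data from above, with n = 80 and an arbitrary seed), you could compare the fold sizes for a v that divides the sample size against one that does not:

set.seed(123)
folds_10 <- vfold_cv(bodyfat_train, v = 10)
purrr::map_int(folds_10$splits, ~ nrow(assessment(.x)))   # every fold has 8 cases

folds_7 <- vfold_cv(bodyfat_train, v = 7)
purrr::map_int(folds_7$splits, ~ nrow(assessment(.x)))    # fold sizes are unequal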
Digging deeper
If you have time, consider these exercises to further explore concepts related to today’s ideas.
Consider leave-one-out cross-validation (LOOCV).
- Would we need set.seed()? Why or why not?
- How might you adapt the code above to implement this?
- Using the information from your_output %>% unnest(.metrics) (which is a dataset), construct a visualization to examine the variability of RMSE from case to case. What might explain any very large values? What does this highlight about the quality of estimation of the LOOCV process?
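If you want a starting point, the sketch below is one possible adaptation. It assumes the bodyfat_train data and the model_wf workflow from Exercise 2 (with the ?? blanks filled in) and uses vfold_cv() with v equal to the number of rows, so each fold holds out a single case; the histogram at the end is one way to look at the case-to-case variability.

# With v = n, each fold is a single held-out case, so the fold assignment involves
# no meaningful randomness.
bodyfat_loocv <- vfold_cv(bodyfat_train, v = nrow(bodyfat_train))

mod1_loocv <- fit_resamples(model_wf,
  resamples = bodyfat_loocv,
  metrics = metric_set(rmse, mae))   # rsq omitted: it can't be computed from one case

# Case-by-case RMSE (with one held-out case, this is just that case's absolute error)
mod1_loocv %>%
  unnest(.metrics) %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(x = .estimate)) +
  geom_histogram()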