Overfitting

Brianna Heggeseth

As we gather

  • Sit in the randomly assigned groups. Introduce yourselves and choose a team name (you will need this later).

Announcements

  • Thursday at 11:15am - MSCS Coffee Break & Summer Research Info
    • Smail Gallery + OLRI 254
  • Prepare to take notes.
    • Locate the Rmd for Part 1 of today’s activity in the Schedule of the course website (see bottom of slides for URL). Do NOT open Part 2 yet.
    • Save this Rmd in the “STAT 253 > Notes” folder.

Small Group Discussion

Go to https://bcheggeseth.github.io/253_spring_2024/overfitting.html


Go to > Small Group Discussion: Model Evaluation Experiment.

  • Let’s build and evaluate a predictive model of an adult’s height (\(y\)) using some predictors \(x_i\) (e.g., age, hip circumference, etc.).
  • Each group will be given a different sample of 40 adults.
  • Start by predicting height (in) using hip circumference (cm).
  • Evaluate the model on your sample.

Be prepared to share your answers to:

  • How good is your simple model?

  • What would happen if we added more predictors?

In-Class Activity - Part 1

Your group has 5 minutes to complete exercises 1 and 2 (choosing one of three models).


Reflection / Reactions to the Group Choices?


Now work on exercises 3 - 5.

Notes - Overfitting

When we add more and more predictors into a model, it can become overfit to the noise in our sample data:

  • our model loses the broader trend / big picture
  • thus does not generalize to new data
  • thus produces poor predictions and a misleading picture of the relationships in new data
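
A small simulation can make this concrete. The sketch below uses made-up data (not the class height data): the true relationship is a straight line plus noise, we fit polynomials of increasing degree to a sample of 40 points, and we compare the in-sample MAE to the MAE on fresh data from the same process. The names `simulate` and `mae_vec`, the seed, and the degrees chosen are all illustrative assumptions.

```r
# Simulated illustration of overfitting (hypothetical data, base R only)
set.seed(253)

# True relationship: a straight line plus random noise
simulate <- function(n) {
  x <- runif(n, 0, 10)
  data.frame(x = x, y = 60 + 0.5 * x + rnorm(n, sd = 2))
}

sample_data <- simulate(40)    # the sample we build models with
new_data    <- simulate(1000)  # fresh data the models never saw

# Mean absolute error between observed and predicted values
mae_vec <- function(truth, estimate) mean(abs(truth - estimate))

degrees <- c(1, 4, 10)
results <- data.frame(
  degree        = degrees,
  in_sample_mae = NA_real_,
  new_data_mae  = NA_real_
)

for (i in seq_along(degrees)) {
  # Fit a polynomial of the given degree to the sample
  fit <- lm(y ~ poly(x, degrees[i]), data = sample_data)
  # Evaluate on the same sample, then on data the model never saw
  results$in_sample_mae[i] <- mae_vec(sample_data$y, predict(fit, sample_data))
  results$new_data_mae[i]  <- mae_vec(new_data$y, predict(fit, new_data))
}

results
```

Typically the in-sample MAE shrinks as the degree grows, while the MAE on new data stops improving or gets worse. The exact numbers depend on the seed.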

Notes - Overfitting Prevention

Training and Testing

  • In-sample metrics, i.e. measures of how well the model performs on the same sample data that we used to build it, tend to be overly optimistic, and choosing models based on them leads to overfitting.
  • Instead, we should build and evaluate, or train and test, our model using different data.
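
One way to write the distinction, assuming MAE as the metric (the symbols \(\hat{f}\), \(n_{\text{train}}\), and \(n_{\text{test}}\) are notation introduced here for illustration): let \(\hat{f}\) be the model fit using the training cases only. Then

\[
\text{MAE}_{\text{train}} = \frac{1}{n_{\text{train}}} \sum_{i \in \text{train}} \left| y_i - \hat{f}(x_i) \right|,
\qquad
\text{MAE}_{\text{test}} = \frac{1}{n_{\text{test}}} \sum_{i \in \text{test}} \left| y_i - \hat{f}(x_i) \right|
\]

The training MAE re-uses the cases that shaped \(\hat{f}\), so it tends to understate the error on new cases. The test MAE uses cases that played no role in fitting, so it is a fairer estimate.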

Notes - R Code

Split the sample data into training and test sets

# Load the tidymodels packages (rsample, parsnip, yardstick, broom, ...)
library(tidymodels)

# Set the random number seed
set.seed(___)

# Split the sample_data
# "prop" is the proportion of data assigned to the training set
# it must be some number between 0 and 1
data_split <- initial_split(sample_data, strata = y, prop = ___)

# Get the training data from the split
data_train <- data_split %>% 
  training()

# Get the testing data from the split
data_test <- data_split %>% 
  testing()

Notes - R Code

Build a training model

# STEP 1: model specification
lm_spec <- linear_reg() %>% 
  set_mode("regression") %>% 
  set_engine("lm")

# STEP 2: model estimation using the training data
model_train <- lm_spec %>% 
  fit(y ~ x1 + x2, data = data_train)

Notes - R Code

Use the training model to make predictions for the test data

# Make predictions
model_train %>% 
  augment(new_data = data_test)

Evaluate the training model using the test data

# Calculate the test MAE
model_train %>% 
  augment(new_data = data_test) %>% 
  mae(truth = y, estimate = .pred)
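
To see the pieces work together, here is a minimal end-to-end sketch with the blanks filled in. Everything specific is an assumption for illustration: the data are simulated (200 hypothetical cases with predictors `x1` and `x2`), the seed is arbitrary, and `prop = 0.8` is just one possible choice.

```r
library(tidymodels)

set.seed(253)

# Hypothetical data: y depends on x1 and x2 plus random noise
sample_data <- tibble(
  x1 = rnorm(200),
  x2 = rnorm(200),
  y  = 3 + 2 * x1 - x2 + rnorm(200)
)

# Split: roughly 80% of cases for training, the rest for testing
data_split <- initial_split(sample_data, strata = y, prop = 0.8)
data_train <- data_split %>% training()
data_test  <- data_split %>% testing()

# STEP 1: model specification
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# STEP 2: model estimation using the training data only
model_train <- lm_spec %>%
  fit(y ~ x1 + x2, data = data_train)

# In-sample (training) MAE: evaluated on the data used to fit the model
train_mae <- model_train %>%
  augment(new_data = data_train) %>%
  mae(truth = y, estimate = .pred)

# Test MAE: evaluated on data the model has not seen
test_mae <- model_train %>%
  augment(new_data = data_test) %>%
  mae(truth = y, estimate = .pred)

# Show both metrics in one table
bind_rows(train = train_mae, test = test_mae, .id = "data")
```

With a simple, correctly specified model like this one, the two MAEs will usually be close. A large gap, with the test MAE much higher, is a common sign of overfitting.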

In-Class Activity - Part 2

Go back to https://bcheggeseth.github.io/253_spring_2024/schedule.html

Open Part 2 Rmd file.

  • Work through the exercises on cars data as a group.
  • Same directions as before
    • Be kind to yourself
    • Be kind to each other & collaborate
  • Ask me questions as I move around the room.

After Class

  • Finishing the activity

    • If you didn’t finish the activity, no problem! Be sure to complete the activity outside of class, review the solutions in the online manual, and ask any questions on Slack or in office hours.
    • Re-organize and review your notes to help deepen your understanding, solidify your learning, and make homework go more smoothly!
  • An R code video, posted in today’s section on Moodle, talks through the new code. This video is OPTIONAL. Decide what’s right for you.

  • Continue to check in on Slack. I’ll be posting announcements there from now on.

Upcoming due dates

  • Tuesday, 10 minutes before your section: Checkpoint 3. There are two (short) videos to watch in advance.
  • Thursday 2/1: Homework 2
    • Start today, even if you just review the directions and scan the exercises. You will be sad if you start too late – HW is not designed to be done in one sitting.
      • Using Slack, invite others to work on homework with you.