20  Principal Component Regression

Unsupervised & supervised learning are friends!

Settling In

  • Sit with the same group as last class
  • Hand in your Quiz 2 Revisions
  • Prepare to take notes
  • Catch up on any announcements you’ve missed on Slack



Announcements

Quiz 3 is coming up next week!

  • Format: same as Quizzes 1 and 2
  • Content: cumulative, but focus on unsupervised learning
  • Study Tips:
    • Create a study guide using the “Learning Goals” page on the course website
    • Fill out the STAT 253 Concept Maps (slides 9–11)
    • Work on Group Assignment 3
    • Review old CPs, HWs, and in-class exercises
    • Come to office hours with questions!




Notes: PC Regression

Context

We’ve been distinguishing 2 broad areas in machine learning:

  • supervised learning: when we want to predict / classify some outcome \(y\) using predictors \(x\)
  • unsupervised learning: when we don’t have any outcome variable \(y\), only features \(x\)
    • clustering: examine structure among the rows with respect to \(x\)
    • dimension reduction: examine structure among the columns of \(x\) & combine them into a smaller set of features

. . .

BUT sometimes we can combine these ideas.


Combining Forces: Clustering + Classification

  1. Use dimension reduction to visualize / summarize lots of features and notice interesting groups.
    Example: many physical characteristics of penguins, many characteristics of songs, etc

  2. Use clustering to identify interesting groups.
    Example: types (species) of penguins, types (genres) of songs, etc

  3. These groups might then become our \(y\) outcome variable in future analysis.
    Example: classify new songs as one of the “genres” we identified

EXAMPLE: K-means clustering + Classification of news articles
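
As a rough sketch of this pipeline in R (assuming a hypothetical songs data frame of numeric audio features — the data and variable names here are made up for illustration):

library(tidyverse)

# STEP 1: dimension reduction to visualize / summarize the features
song_pca <- prcomp(songs, scale. = TRUE)

# STEP 2: clustering to identify interesting groups ("genres")
set.seed(253)
song_kmeans <- kmeans(scale(songs), centers = 3)

# STEP 3: the cluster assignments become the y outcome for future classification
labeled_songs <- songs %>% 
  mutate(genre = factor(song_kmeans$cluster))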


Dimension Reduction + Regression: Dealing with lots of predictors

Suppose we have an outcome variable \(y\) (quantitative OR categorical) and lots of potential predictors \(x_1, x_2, \ldots, x_p\).

Perhaps we even have more predictors than data points (\(p > n\))!

. . .



This idea of measuring lots of things on a sample is common in genetics, image processing, video processing, or really any scenario where we can grab data on a bunch of different features at once.

For the sake of simplicity, computational efficiency, and avoiding overfitting, it can benefit us to simplify our set of predictors.

. . .



There are a few approaches:

  • variable selection (e.g., backward stepwise selection)
    Simply kick out some of the predictors. NOTE: This doesn’t work when \(p > n\).

  • regularization (e.g., LASSO)
    Shrink the coefficients toward / to 0. NOTE: This sorta works when \(p > n\).

  • feature extraction (e.g., PCA)
    Identify & utilize only the most salient features of the original predictors. Specifically, combine the original, possibly correlated predictors into a smaller set of uncorrelated predictors which retain most of the original information. NOTE: This does work when \(p > n\). (See the sketch below.)
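
To make feature extraction concrete, here’s a small illustration using base R’s prcomp() on the built-in mtcars data (a sketch, not course code); summary() reports how much of the original information (variance) each PC retains:

# Small illustration of PCA as feature extraction, using R's built-in mtcars data
pca_results <- prcomp(mtcars, scale. = TRUE)

# Proportion of the original variance retained by each PC
summary(pca_results)

# The PCs themselves: uncorrelated combinations of the original columns
head(pca_results$x)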


Principal Component Regression (PCR)

  • Step 1
    Ignore \(y\) for now. Use PCA to combine the \(p\) original, correlated predictors \(x\) into a set of \(p\) uncorrelated PCs.

  • Step 2
    Keep only the first \(k\) PCs which retain a “sufficient” amount of information from the original predictors.

  • Step 3
    Model \(y\) by these first \(k\) PCs (sketched below).
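
For intuition, here is a minimal “by hand” sketch of these three steps in base R, assuming a numeric predictor matrix x and a quantitative outcome y (the full tidymodels version appears under Future Reference):

# STEP 1: PCA on the predictors, ignoring y
pca <- prcomp(x, scale. = TRUE)

# STEP 2: keep only the first k PCs (k = 3 here, purely for illustration)
k <- 3
pc_data <- as.data.frame(pca$x[, 1:k])

# STEP 3: model y by these first k PCs
pcr_model <- lm(y ~ ., data = cbind(pc_data, y = y))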


PCR vs Partial Least Squares

When combining the original predictors \(x\) into a smaller set of PCs, PCA ignores \(y\). Thus PCA might not produce the strongest possible predictors of \(y\).

. . .



Partial least squares provides an alternative.

. . .



Like PCA, it combines the original predictors into a smaller set of uncorrelated features, but considers which predictors are most associated with \(y\) in the process.

Section 6.3.2 of ISLR provides an optional overview.
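
In tidymodels, one possible way to try PLS is to swap step_pca() for recipes’ step_pls() in the PCR recipe shown under Future Reference below. A hedged sketch (step_pls() needs an outcome argument and, in current versions of recipes, the mixOmics package installed):

# Sketch: a PLS variant of the PCR recipe under Future Reference
# (assumes the same sample_data setup; step_pls() requires the mixOmics package)
pls_recipe <- recipe(y ~ ., data = sample_data) %>% 
  update_role(data_id, new_role = "id") %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_predictors()) %>%
  step_pls(all_predictors(), outcome = "y", num_comp = tune())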





Small Group Discussion

EXAMPLE 1

For each scenario below, indicate which would (typically) be preferable in modeling \(y\) by a large set of predictors \(x\): (1) PCR; or (2) variable selection or regularization.

  1. We have more potential predictors than data points (\(p > n\)).
  2. It’s important to understand the specific relationships between \(y\) and \(x\).
  3. The \(x\) are NOT very correlated.





Exercises

  • Make the most of your work time in class!
  • These exercises are on HW7.
  • IMPORTANT: Remember to set.seed(253) on any exercises that involve randomness.
  • Save at least 15 minutes to get started on Group Assignment 3




Group Assignment 3

Before you leave class today:

  1. Get data on your local computers
  2. Start exploring the data:
    • Familiarize yourself with the variables
    • Create initial visualizations
    • Determine if any data cleaning is needed (remove, modify, or create variables; remove or fill in missing values; remove observations)
  3. Make a plan:
    • How to decide which features to use, how many and which algorithms to try, how to evaluate each algorithm
    • Set up communication avenues for out-of-class discussions (Slack channel? in-person meetings? etc.)
    • Divide / delegate leadership on tasks





Future Reference

Notes: R code

Suppose we have a dataset sample_data with multiple predictors x, a quantitative outcome y, and (possibly) a column named data_id which labels each data point. We could adjust this code if y were categorical.

RUN THE PCR ALGORITHM

library(tidymodels)
library(tidyverse)

# STEP 1: specify a linear regression model
lm_spec <- linear_reg() %>% 
  set_mode("regression") %>% 
  set_engine("lm")

# STEP 2: variable recipe
# Add a pre-processing step that does PCA on the predictors
# num_comp is the number of PCs to keep (we need to tune it!)
pcr_recipe <- recipe(y ~ ., data = sample_data) %>% 
  update_role(data_id, new_role = "id") %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = tune())

# STEP 3: workflow
pcr_workflow <- workflow() %>% 
  add_recipe(pcr_recipe) %>% 
  add_model(lm_spec)
  
# STEP 4: Estimate multiple PCR models trying out different numbers of PCs to keep
# For the range, the biggest number you can try is the number of predictors you started with
# Put the same number in levels
set.seed(___)
pcr_models <- pcr_workflow %>% 
  tune_grid(
    grid = grid_regular(num_comp(range = c(1, ___)), levels = ___),
    resamples = vfold_cv(sample_data, v = 10),
    metrics = metric_set(mae)
  )

FOLLOW-UP

Processing and applying the results is the same as for our other tidymodels algorithms!
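
For reference, a sketch of those typical follow-up steps, using the object names above:

# Compare the CV MAE across the numbers of PCs we tried
pcr_models %>% collect_metrics()

# Pick the number of PCs with the best CV MAE, then finalize & fit the PCR model
best_k <- pcr_models %>% select_best(metric = "mae")
final_pcr <- pcr_workflow %>% 
  finalize_workflow(best_k) %>% 
  fit(data = sample_data)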




Solutions

Small Group Discussion

EXAMPLE 1

Solution:
    1. PCR: we (typically) can’t do variable selection or regularization when \(p > n\).
    2. Variable selection or regularization: the PCs lose the original meaning of the predictors.
    3. Variable selection or regularization: since the \(x\) aren’t very correlated, PCR wouldn’t simplify things much (we’d need a lot of PCs to retain the original information).


Exercises

Solutions will not be provided. These exercises are part of your homework this week.





Wrapping Up

Upcoming Due Dates:

  • HW7: due TOMORROW
  • Quiz 3: next Thursday
  • Group Assignment 3: next Friday
  • Final Learning Reflection: finals week