20  Principal Component Regression

Unsupervised & supervised learning are friends!

Settling In

  • Sit with the same group as last class
  • Hand in your Quiz 2 Revisions
  • Prepare to take notes
  • Catch up on any announcements you’ve missed on Slack



Announcements

Quiz 3 is coming up next week!

  • Format: same as Quizzes 1 and 2
  • Content: cumulative, but focus on unsupervised learning
  • Study Tips:
    • Create a study guide using the “Learning Goals” page on the course website
    • Fill out the STAT 253 Concept Maps (slides 9–11)
    • Work on Group Assignment 3
    • Review old CPs, HWs, and in-class exercises
    • Come to office hours with questions!




Notes: PC Regression

Context

We’ve been distinguishing 2 broad areas in machine learning:

  • supervised learning: when we want to predict / classify some outcome \(y\) using predictors \(x\)
  • unsupervised learning: when we don’t have any outcome variable \(y\), only features \(x\)
    • clustering: examine structure among the rows with respect to \(x\)
    • dimension reduction: examine structure among the columns of \(x\) & combine them into a smaller set of features

. . .

BUT sometimes we can combine these ideas.


Combining Forces: Clustering + Classification

  1. Use dimension reduction to visualize / summarize lots of features and notice interesting groups.
    Example: many physical characteristics of penguins, many characteristics of songs, etc

  2. Use clustering to identify interesting groups.
    Example: types (species) of penguins, types (genres) of songs, etc

  3. These groups might then become our \(y\) outcome variable in future analysis.
    Example: classify new songs as one of the “genres” we identified

EXAMPLE: K-means clustering + Classification of news articles
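
As a rough sketch of this pipeline in R (assuming a hypothetical songs data frame of numeric audio features — the data and variable names here are made up for illustration):

library(tidyverse)

# STEP 1: dimension reduction to visualize / summarize the features
song_pca <- prcomp(songs, scale. = TRUE)

# STEP 2: clustering to identify interesting groups ("genres")
set.seed(253)
song_kmeans <- kmeans(scale(songs), centers = 3)

# STEP 3: the cluster assignments become the y outcome for future classification
labeled_songs <- songs %>% 
  mutate(genre = factor(song_kmeans$cluster))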


Dimension Reduction + Regression: Dealing with lots of predictors

Suppose we have an outcome variable \(y\) (quantitative OR categorical) and lots of potential predictors \(x_1, x_2, \ldots, x_p\).

Perhaps we even have more predictors than data points (\(p > n\))!

. . .



This idea of measuring lots of things on a sample is common in genetics, image processing, video processing, or really any scenario where we can grab data on a bunch of different features at once.

For the sake of simplicity, computational efficiency, and avoiding overfitting, it can benefit us to simplify our set of predictors.

. . .



There are a few approaches:

  • variable selection (e.g., backward stepwise selection)
    Simply kick out some of the predictors. NOTE: This doesn’t work when \(p > n\).

  • regularization (e.g., LASSO)
    Shrink the coefficients toward / to 0. NOTE: This sorta works when \(p > n\).

  • feature extraction (e.g., PCA)
    Identify & utilize only the most salient features of the original predictors. Specifically, combine the original, possibly correlated predictors into a smaller set of uncorrelated predictors which retain most of the original information. NOTE: This does work when \(p > n\). (See the sketch below.)
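
To make feature extraction concrete, here’s a small illustration using base R’s prcomp() on the built-in mtcars data (a sketch, not course code); summary() reports how much of the original information (variance) each PC retains:

# Small illustration of PCA as feature extraction, using R's built-in mtcars data
pca_results <- prcomp(mtcars, scale. = TRUE)

# Proportion of the original variance retained by each PC
summary(pca_results)

# The PCs themselves: uncorrelated combinations of the original columns
head(pca_results$x)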


Principal Component Regression (PCR)

  • Step 1
    Ignore \(y\) for now. Use PCA to combine the \(p\) original, correlated predictors \(x\) into a set of \(p\) uncorrelated PCs.

  • Step 2
    Keep only the first \(k\) PCs which retain a “sufficient” amount of information from the original predictors.

  • Step 3
    Model \(y\) by these first \(k\) PCs (sketched below).
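
For intuition, here is a minimal “by hand” sketch of these three steps in base R, assuming a numeric predictor matrix x and a quantitative outcome y (the full tidymodels version appears under Future Reference):

# STEP 1: PCA on the predictors, ignoring y
pca <- prcomp(x, scale. = TRUE)

# STEP 2: keep only the first k PCs (k = 3 here, purely for illustration)
k <- 3
pc_data <- as.data.frame(pca$x[, 1:k])

# STEP 3: model y by these first k PCs
pcr_model <- lm(y ~ ., data = cbind(pc_data, y = y))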


PCR vs Partial Least Squares

When combining the original predictors \(x\) into a smaller set of PCs, PCA ignores \(y\). Thus PCA might not produce the strongest possible predictors of \(y\).

. . .



Partial least squares provides an alternative.

. . .



Like PCA, it combines the original predictors into a smaller set of uncorrelated features, but considers which predictors are most associated with \(y\) in the process.

Section 6.3.2 of ISLR provides an optional overview.
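
In tidymodels, one possible way to try PLS is to swap step_pca() for recipes’ step_pls() in the PCR recipe shown under Future Reference below. A hedged sketch (step_pls() needs an outcome argument and, in current versions of recipes, the mixOmics package installed):

# Sketch: a PLS variant of the PCR recipe under Future Reference
# (assumes the same sample_data setup; step_pls() requires the mixOmics package)
pls_recipe <- recipe(y ~ ., data = sample_data) %>% 
  update_role(data_id, new_role = "id") %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_predictors()) %>%
  step_pls(all_predictors(), outcome = "y", num_comp = tune())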





Small Group Discussion

EXAMPLE 1

For each scenario below, indicate which would (typically) be preferable in modeling \(y\) by a large set of predictors \(x\): (1) PCR; or (2) variable selection or regularization.

  1. We have more potential predictors than data points (\(p > n\)).
  2. It’s important to understand the specific relationships between \(y\) and \(x\).
  3. The \(x\) are NOT very correlated.





Exercises

  • Make the most of your work time in class!
  • These exercises are on HW7.
  • IMPORTANT: Remember to set.seed(253) on any exercises that involve randomness.
  • Save at least 15 minutes to get started on Group Assignment 3




Group Assignment 3

Before you leave class today:

  1. Get data on your local computers
  2. Start exploring the data:
    • Familiarize yourself with the variables
    • Create initial visualizations
    • Determine if any data cleaning is needed (remove, modify, or create variables; remove or fill in missing values; remove observations)
  3. Make a plan:
    • How to decide which features to use, how many and which algorithms to try, how to evaluate each algorithm
    • Set up communication avenues for out-of-class discussions (Slack channel? in-person meetings? etc.)
    • Divide / delegate leadership on tasks





Future Reference

Notes: R code

Suppose we have a dataset sample_data with multiple predictors x, a quantitative outcome y, and (possibly) a column named data_id which labels each data point. We could adjust this code if y were categorical.

RUN THE PCR ALGORITHM

library(tidymodels)
library(tidyverse)

# STEP 1: specify a linear regression model
lm_spec <- linear_reg() %>% 
  set_mode("regression") %>% 
  set_engine("lm")

# STEP 2: variable recipe
# Add a pre-processing step that does PCA on the predictors
# num_comp is the number of PCs to keep (we need to tune it!)
pcr_recipe <- recipe(y ~ ., data = sample_data) %>% 
  update_role(data_id, new_role = "id") %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = tune())

# STEP 3: workflow
pcr_workflow <- workflow() %>% 
  add_recipe(pcr_recipe) %>% 
  add_model(lm_spec)
  
# STEP 4: Estimate multiple PCR models trying out different numbers of PCs to keep
# For the range, the biggest number you can try is the number of predictors you started with
# Put the same number in levels
set.seed(___)
pcr_models <- pcr_workflow %>% 
  tune_grid(
    grid = grid_regular(num_comp(range = c(1, ___)), levels = ___),
    resamples = vfold_cv(sample_data, v = 10),
    metrics = metric_set(mae)
  )

FOLLOW-UP

Processing and applying the results is the same as for our other tidymodels algorithms!
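
For reference, a sketch of those typical follow-up steps, using the object names above:

# Compare the CV MAE across the numbers of PCs we tried
pcr_models %>% collect_metrics()

# Pick the number of PCs with the best CV MAE, then finalize & fit the PCR model
best_k <- pcr_models %>% select_best(metric = "mae")
final_pcr <- pcr_workflow %>% 
  finalize_workflow(best_k) %>% 
  fit(data = sample_data)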




Solutions

Small Group Discussion

EXAMPLE 1

Solution:
    1. PCR: we (typically) can’t do variable selection or regularization when \(p > n\).
    2. Variable selection or regularization: the PCs lose the original meaning of the predictors.
    3. Variable selection or regularization: since the \(x\) aren’t very correlated, PCR wouldn’t simplify things much (we’d need a lot of PCs to retain the original information).


Exercises

Solutions will not be provided. These exercises are part of your homework this week.





Wrapping Up

Upcoming Due Dates:

  • HW7: due TOMORROW
  • Quiz 3: next Thursday
  • Group Assignment 3: next Friday
  • Final Learning Reflection: finals week