Principal Component Regression

Brianna Heggeseth

As we gather

  • Sit with a group of 2-3 people total to work with on Group Assignment 3:
    • If you are in a group of 3, you may only repeat 1 partner from Group Assignments 1 or 2. In other words, you need to have at least 1 person in the group you haven’t worked with on a group assignment.
    • If you are in a group of 2, you should work with someone you haven’t yet worked with on a group assignment.
    • Introduce yourself!

Announcements

MSCS Events

  • Thursday at 11:15am - MSCS Coffee Break
    • Today: MSCS Seminar (Prof. Laura Lyman)
  • Next Tuesday 11:30am - 12:50pm - MSCS faculty listening session
    • In OLRI 254. Pizza will be provided!

Concept Quiz 3

Part 1

  • on paper
  • closed people, closed laptop
  • you can bring in an 8.5x11 inch sheet with notes. you can type, write small, write big, etc. you will hand this in with Part 1
  • Part 1 is due by the end of the class
  • you might be asked to interpret some R output, but I won’t ask you to provide any code

Part 2

  • on computers
  • you can chat with any current STAT 253 student, but nobody else (including preceptors)
  • you can DM or email me clarifying questions, and if something is confusing, I’ll share my answer with the entire class
  • you can use any materials from this STAT 253 course (from course site or Moodle or textbook), but no internet, ChatGPT, etc
  • this is designed to finish during class, but you can hand it in any time within 24 hours of your class end time (eg: 11:10am the next day for the 9:40am section)

Content

  • All topics, with a focus on unsupervised learning
  • questions range in style, including multiple choice, fill in the blank, short response, matching, etc

Preparing for Concept Quiz 3

Q & A

Any questions about:

  • hierarchical clustering
    • algorithm steps
    • dendrogram construction
    • dendrogram to clusters (how to choose K)
    • 4 different definitions of distance between clusters & impacts on dendrogram
    • how to gain insight from clusters
    • pros and cons
  • kmeans clustering
    • algorithm steps
    • how to choose K
    • how to gain insight from clusters
    • pros and cons
  • principal component analysis
    • conceptual understanding of algorithm (matrix math is optional but highly encouraged)
    • implementation steps in R (see the sketch after this list)
    • loadings - what they mean
    • scores - what they mean
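
To make the last few PCA bullets concrete, here is a minimal sketch of the R implementation steps, using base R’s prcomp() (the mtcars data and variable choices stand in for course data and are purely illustrative):

    # minimal PCA sketch; mtcars stands in for the course data
    cars_sub <- mtcars[, c("mpg", "hp", "wt", "disp")]

    # standardize each feature, then compute the principal components
    pca_out <- prcomp(cars_sub, scale. = TRUE)

    pca_out$rotation   # loadings: each original feature's contribution to each PC
    head(pca_out$x)    # scores: each row's coordinates on the new PC axes
    summary(pca_out)   # proportion of the original variance retained by each PC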

Context

We’ve been distinguishing 2 broad areas in machine learning:

  • supervised learning: when we want to predict / classify some outcome \(y\) using predictors \(x\)
  • unsupervised learning: when we don’t have any outcome variable \(y\), only features \(x\)
    • clustering: examine structure among the rows with respect to \(x\)
    • dimension reduction: examine & combine structure among the columns \(x\)

BUT sometimes we can combine these ideas.

Combining Forces

  1. Use dimension reduction to visualize / summarize lots of features and notice interesting groups.
    Example: many physical characteristics of penguins, many characteristics of songs, etc

  2. Use clustering to identify interesting groups.
    Example: types (species) of penguins, types (genres) of songs, etc

  3. These groups might then become our \(y\) outcome variable in future analysis.
    Example: classify new songs as one of the “genres” we identified (see the sketch below)
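
A minimal sketch of this cluster-then-predict idea, assuming the palmerpenguins package (the penguin example above; every specific choice here is illustrative):

    library(palmerpenguins)

    # cluster penguins on physical characteristics with k-means
    features <- na.omit(penguins[, c("bill_length_mm", "bill_depth_mm",
                                     "flipper_length_mm", "body_mass_g")])
    set.seed(253)
    km <- kmeans(scale(features), centers = 3)

    # the cluster labels become a candidate outcome y for a future classifier
    features$group <- factor(km$cluster)
    table(features$group)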

Dimension Reduction + Prediction

Suppose we have an outcome variable \(y\) (quantitative OR categorical) and lots of potential predictors \(x_1, x_2, ..., x_p\).

Perhaps we even have more predictors than data points (\(p > n\))!

For simplicity, computational efficiency, avoiding overfitting, etc, it might benefit us to simplify our set of predictors.

There are a few approaches:

  • variable selection (eg: using backward stepwise)
    Simply kick out some of the predictors. NOTE: This doesn’t work when \(p > n\) (backward stepwise starts by fitting the model with all \(p\) predictors, which isn’t possible).

  • regularization (eg: using LASSO)
    Shrink the coefficients toward / to 0. NOTE: Unlike backward stepwise, this does work when \(p > n\), though LASSO can select at most \(n\) predictors (see the quick illustration after this list).

  • feature extraction (eg: using PCA)
    Identify & utilize only the most salient features of the original predictors. Specifically, combine the original, possibly correlated predictors into a smaller set of uncorrelated predictors which retain most of the original information. NOTE: This does work when \(p > n\).
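
As noted above, LASSO can still be fit when \(p > n\). A quick sketch, assuming the glmnet package (simulated data; all specific choices here are hypothetical):

    library(glmnet)

    set.seed(253)
    n <- 50; p <- 200                   # more predictors than data points
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 1] - 2 * x[, 2] + rnorm(n)

    # LASSO (alpha = 1) still fits; cross-validation picks the penalty lambda
    lasso_cv <- cv.glmnet(x, y, alpha = 1)
    head(coef(lasso_cv, s = "lambda.min"))   # most coefficients shrunk exactly to 0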

PCR

PRINCIPAL COMPONENT REGRESSION (PCR)

  • Step 1
    Ignore \(y\) for now. Use PCA to combine the \(p\) original, correlated predictors \(x\) into a set of \(p\) uncorrelated PCs.

  • Step 2
    Keep only the first \(k\) PCs which retain a “sufficient” amount of information from the original predictors.

  • Step 3
    Model \(y\) by these first \(k\) PCs.
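
Putting the three steps together: a minimal sketch, assuming the tidymodels framework used in this course (the data and the choice of \(k = 3\) are purely illustrative):

    library(tidymodels)

    # Steps 1 & 2: standardize the predictors, run PCA, keep the first k = 3 PCs
    pcr_recipe <- recipe(mpg ~ ., data = mtcars) %>%
      step_normalize(all_numeric_predictors()) %>%
      step_pca(all_numeric_predictors(), num_comp = 3)

    # Step 3: model y with ordinary least squares on those PCs
    pcr_workflow <- workflow() %>%
      add_recipe(pcr_recipe) %>%
      add_model(linear_reg() %>% set_engine("lm"))

    pcr_fit <- fit(pcr_workflow, data = mtcars)

In practice, \(k\) would be tuned (eg: with cross-validation) rather than fixed in advance.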

Partial Least Squares

When combining the original predictors \(x\) into a smaller set of PCs, PCA ignores \(y\). Thus PCA might not produce the strongest possible predictors of \(y\).

Partial least squares provides an alternative.

Like PCA, it combines the original predictors into a smaller set of uncorrelated features, but considers which predictors are most associated with \(y\) in the process.

Chapter 6.3.2 in ISLR provides an optional overview.
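
For the curious, a minimal sketch of partial least squares, assuming the pls package used in ISLR’s labs (data and settings are illustrative):

    library(pls)

    # PLS components are chosen with y in mind; compare them via cross-validation
    pls_fit <- plsr(mpg ~ ., data = mtcars, ncomp = 3,
                    scale = TRUE, validation = "CV")
    summary(pls_fit)   # CV error for 1, 2, and 3 components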

Example 1

For each scenario below, indicate which would (typically) be preferable in modeling \(y\) by a large set of predictors \(x\):

  a. PCR; or
  b. variable selection or regularization.

  1. We have more potential predictors than data points (\(p > n\)).
  2. It’s important to understand the specific relationships between \(y\) and \(x\).
  3. The \(x\) are NOT very correlated.

Small Group Work

  • Work on HW 7 exercises 9 and 10
  • If you finish Homework 7, start Group Assignment 3.

After Class

Upcoming due dates

  • 4/23 : HW7
  • 4/25 : Concept Quiz 3
  • 4/29 : Group Assignment 3