Principal Component Regression

Brianna Heggeseth

As we gather

  • All topics with focus on unsupervised learning
  • Questions range in style, including multiple choice, fill in the blank, short response, matching, etc.

  • hierarchical clustering
    • algorithm steps
    • dendrogram construction
    • dendrogram to clusters (how to choose K)
    • 4 different definitions of distance between clusters & impacts on dendrogram
    • how to gain insight from clusters
    • pros and cons
  • K-means clustering
    • algorithm steps
    • how to choose K
    • how to gain insight from clusters
    • pros and cons
  • principal component analysis
    • conceptual understanding of algorithm (matrix math is optional but highly encouraged)
    • implementation steps in R
    • loadings - what they mean
    • scores - what they mean


We’ve been distinguishing 2 broad areas in machine learning:

  • supervised learning: when we want to predict / classify some outcome \(y\) using predictors \(x\)
  • unsupervised learning: when we don’t have any outcome variable \(y\), only features \(x\)
    • clustering: examine structure among the rows with respect to \(x\)
    • dimension reduction: examine & combine structure among the columns \(x\)

BUT sometimes we can combine these ideas.

Combining Forces

  1. Use dimension reduction to visualize / summarize lots of features and notice interesting groups.
    Example: many physical characteristics of penguins, many characteristics of songs, etc

  2. Use clustering to identify interesting groups.
    Example: types (species) of penguins, types (genres) of songs, etc

  3. These groups might then become our \(y\) outcome variable in future analysis.
    Example: classify new songs as one of the “genres” we identified
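As a rough illustration of this workflow, the sketch below uses R's built-in `iris` measurements as a stand-in for the penguin or song data (an assumption: neither dataset appears in these notes). The choice of 3 clusters and all object names are also illustrative.

```r
# Stand-in data: 4 numeric flower measurements, standardized.
features <- scale(iris[, 1:4])

# 1. Dimension reduction: summarize the 4 features with PCs.
pca <- prcomp(features)

# 2. Clustering: look for groups among the rows (3 clusters is an arbitrary choice).
set.seed(253)
km <- kmeans(features, centers = 3, nstart = 20)

# Plot the first 2 PC scores, colored by cluster assignment.
plot(pca$x[, 1:2], col = km$cluster, pch = 19)

# 3. The cluster labels could now play the role of y in a later classification model.
cluster_label <- factor(km$cluster)
table(cluster_label)
```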

Dimension Reduction + Prediction

Suppose we have an outcome variable \(y\) (quantitative OR categorical) and lots of potential predictors \(x_1, x_2, \ldots, x_p\).

Perhaps we even have more predictors than data points (\(p > n\))!

To keep our model simple, computationally efficient, and less prone to overfitting, it might benefit us to simplify our set of predictors.

There are a few approaches:

  • variable selection (eg: using backward stepwise)
    Simply kick out some of the predictors. NOTE: Backward stepwise doesn’t work when \(p > n\), since the starting model with all \(p\) predictors can’t be fit by least squares.

  • regularization (eg: using LASSO)
    Shrink the coefficients toward / to 0. NOTE: Unlike backward stepwise, LASSO can still be fit when \(p > n\), though it keeps at most \(n\) predictors.

  • feature extraction (eg: using PCA)
    Identify & utilize only the most salient features of the original predictors. Specifically, combine the original, possibly correlated predictors into a smaller set of uncorrelated predictors which retain most of the original information. NOTE: This does work when \(p > n\).
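To make the \(p > n\) point concrete, here is a small simulated sketch in base R. The data are random noise and the sizes (20 rows, 50 predictors, 3 retained PCs) are arbitrary assumptions chosen only for illustration.

```r
# Illustrative simulation: more predictors than data points.
set.seed(253)
n <- 20
p <- 50
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rnorm(n)

# Least squares on all p predictors: there are more coefficients than
# data points, so lm() cannot estimate them all and reports the rest as NA.
full_fit <- lm(y ~ x)
sum(is.na(coef(full_fit)))

# PCA still runs: it returns at most min(n, p) uncorrelated PCs.
pca <- prcomp(x, center = TRUE, scale. = TRUE)
dim(pca$x)

# A regression on just the first 3 PCs (3 is arbitrary) can be estimated.
pc_data <- data.frame(y = y, pca$x[, 1:3])
pc_fit <- lm(y ~ ., data = pc_data)
coef(pc_fit)
```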



Principal Component Regression (PCR)

  • Step 1
    Ignore \(y\) for now. Use PCA to combine the \(p\) original, correlated predictors \(x\) into a set of (at most) \(p\) uncorrelated PCs.

  • Step 2
    Keep only the first \(k\) PCs which retain a “sufficient” amount of information from the original predictors.

  • Step 3
    Model \(y\) by these first \(k\) PCs.
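The three steps above can be sketched in base R with `prcomp()` and `lm()`. This is a minimal sketch on simulated data, not the course's own implementation: the data-generating setup, the 80% variance cutoff, and all object names are assumptions made for illustration.

```r
# Simulated data (illustrative): p correlated predictors and a quantitative y.
set.seed(253)
n <- 100
p <- 10
shared <- rnorm(n)   # common signal that makes the predictors correlated
x <- sapply(1:p, function(j) shared + rnorm(n, sd = 0.5))
colnames(x) <- paste0("x", 1:p)
y <- 2 * shared + rnorm(n)

# Step 1: ignore y and run PCA on the standardized predictors.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Step 2: keep the first k PCs that retain "sufficient" information.
# Here "sufficient" means at least 80% of the total variance (an arbitrary cutoff).
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
k <- which(cumsum(prop_var) >= 0.80)[1]

# Step 3: model y by the first k PC scores.
pcr_data <- data.frame(y = y, pca$x[, 1:k, drop = FALSE])
pcr_fit <- lm(y ~ ., data = pcr_data)

k
summary(pcr_fit)
```

In practice, \(k\) is often tuned by cross-validation on the prediction error for \(y\) rather than set by a fixed variance cutoff.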

Partial Least Squares

When combining the original predictors \(x\) into a smaller set of PCs, PCA ignores \(y\). Thus PCA might not produce the strongest possible predictors of \(y\).

Partial least squares provides an alternative.

Like PCA, it combines the original predictors into a smaller set of uncorrelated features, but considers which predictors are most associated with \(y\) in the process.

Section 6.3.2 of ISLR provides an optional overview.
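One way to see the difference is to compare how each method picks its first direction. The notation here is an assumption that does not appear elsewhere in these notes: \(X\) is the matrix of standardized predictors, \(\phi\) is a vector of weights with length 1, and \(y\) is taken to be quantitative.

\[
\text{PCA:}\quad \phi_1 = \arg\max_{\|\phi\| = 1} \mathrm{Var}(X\phi)
\qquad\qquad
\text{PLS:}\quad \phi_1 = \arg\max_{\|\phi\| = 1} \mathrm{Cov}(X\phi,\, y)^2
\]

PCA looks for the combination of predictors with the most variance, without reference to \(y\). PLS looks for the combination that varies most closely with \(y\).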

Example 1

For each scenario below, indicate which would (typically) be preferable in modeling \(y\) by a large set of predictors \(x\):

  a. PCR; or
  b. variable selection or regularization.

  1. We have more potential predictors than data points (\(p > n\)).
  2. It’s important to understand the specific relationships between \(y\) and \(x\).
  3. The \(x\) are NOT very correlated.

