Principal Component Analysis

Brianna Heggeseth

As we gather

  • Sit with new people!
    • Try to sit with at least 1 person you don’t know well
    • Introduce yourself

Announcements

MSCS Events

  • Thursday at 11:15am - MSCS Coffee Break

    • April 18: MSCS Seminar (Prof. Laura Lyman)

Context

GOALS

In unsupervised learning we don’t have any outcome variable y; we’re just exploring the structure of our data.

This can be divided into 2 types of tasks:

  • clustering
    • GOAL: examine structure & similarities among the individual observations (rows) of our dataset
    • METHODS: hierarchical and K-means clustering
  • dimension reduction
    • GOAL: examine & simplify structure among the features (columns) of our dataset
    • METHODS: principal component analysis (and many others, including singular value decomposition (SVD) and Uniform Manifold Approximation and Projection (UMAP))

Dimension Reduction

Especially when we have a lot of features, dimension reduction helps:

  • identify patterns among the features
  • conserve computational resources
  • feature engineering: create salient features to use in regression & classification (will discuss next class)

Principal Component Analysis

PCA Details

Suppose we start with high dimensional data with p correlated features: \(x_1\), \(x_2\), …, \(x_p\).

We want to turn these into a smaller set of k < p features or principal components \(PC_1\), \(PC_2\), …, \(PC_k\) that:

  • are uncorrelated (i.e. each contain unique information)
  • preserve the majority of information or variability in the original data

Step 1

Define the p principal components as linear combinations of the original x features.

These combinations are specified by loadings or coefficients notated as \(a_{ij}\)’s:

\[\begin{split} PC_1 & = a_{11} x_1 + a_{12} x_2 + \cdots + a_{1p} x_p \\ PC_2 & = a_{21} x_1 + a_{22} x_2 + \cdots + a_{2p} x_p \\ \vdots & \\ PC_p & = a_{p1} x_1 + a_{p2} x_2 + \cdots + a_{pp} x_p \\ \end{split}\]

The first PC \(PC_1\) is the direction of maximal variability – it retains the greatest variability or information in the original data.

The subsequent PCs are defined to have maximal variation among the directions orthogonal to / perpendicular to / uncorrelated with the previously constructed PCs.

Step 2

Keep only the subset of PCs which retain “enough” of the variability / information in the original dataset.
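
Though not needed to run PCA in practice, a minimal sketch of the math behind Steps 1 and 2 may help (using a toy matrix X as a placeholder, not our data): for standardized features, the loadings are the eigenvectors of the correlation matrix, and each PC’s variance is the corresponding eigenvalue.

# Sketch with toy data (X is a placeholder, not the weather data)
set.seed(253)
X <- matrix(rnorm(100 * 3), ncol = 3)      # 100 observations, p = 3 features
eig <- eigen(cor(X))
eig$vectors                    # columns = loadings defining PC1, PC2, PC3 (Step 1)
eig$values / sum(eig$values)   # proportion of variability captured by each PC (Step 2)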

Data Details

Recall the Australian weather data from Homework 2

# Import the data and load some packages
library(tidyverse)
library(rattle)
data(weatherAUS)

# Note that this has missing values
colSums(is.na(weatherAUS))
         Date      Location       MinTemp       MaxTemp      Rainfall 
            0             0          3285          3085          5930 
  Evaporation      Sunshine   WindGustDir WindGustSpeed    WindDir9am 
       109530        118202         15766         15659         16135 
   WindDir3pm  WindSpeed9am  WindSpeed3pm   Humidity9am   Humidity3pm 
         8668          3842          7254          4351          8329 
  Pressure9am   Pressure3pm      Cloud9am      Cloud3pm       Temp9am 
        23000         22981         89076         95092          3310 
      Temp3pm     RainToday       RISK_MM  RainTomorrow 
         7310          5930          5929          5929 

PCA cannot handle missing values.

We could simply eliminate days with any missing values, but this would discard a lot of useful information.

Instead, we’ll use KNN to impute the missing values using the VIM package.

# If your VIM package works, use this chunk to process the data
library(VIM)

# It would be better to impute before filtering & selecting
# BUT it's very computationally expensive in this case
weather_temp <- weatherAUS %>% 
  filter(Date == "2008-12-01") %>% 
  dplyr::select(-Date, -RainTomorrow, -Temp3pm, -WindGustDir, -WindDir9am, -WindDir3pm) %>% 
  VIM::kNN(imp_var = FALSE)

# Now convert Location to the row name (not a feature)
weather_temp <- weather_temp %>% 
  column_to_rownames("Location") 

# Create a new data frame that processes logical and factor features into dummy variables
weather_data <- data.frame(model.matrix(~ . - 1, data = weather_temp))
rownames(weather_data) <- rownames(weather_temp)
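
As a quick sanity check (not in the original notes), we can confirm that the processed data has no remaining missing values:

# Sanity check: imputation & processing should leave no missing values
sum(is.na(weather_data))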

Example 1

EXAMPLE 1: Research goals

Check out the weather_data:

head(weather_data)
           MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed
Albury        13.4    22.9      0.6         7.4     10.1            44
Newcastle     13.2    27.2      0.0         7.4     13.0            44
Penrith       15.2    32.6      0.0         7.4     10.9            59
Sydney        17.6    31.3      0.0         7.6     10.9            44
Wollongong     9.5    17.9      0.4         6.8     10.1            52
Canberra      13.6    25.2      0.0         9.6     13.0            80
           WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
Albury               20           24          71          22      1007.7
Newcastle             6           19          50          24      1013.9
Penrith              13           22          35          23      1009.1
Sydney                2           24          29          21      1009.1
Wollongong           20           24          52          44      1007.9
Canberra             26           43          31          28      1006.3
           Pressure3pm Cloud9am Cloud3pm Temp9am RainTodayNo RainTodayYes
Albury          1007.1        8        5    16.9           1            0
Newcastle       1010.1        3        4    21.8           1            0
Penrith         1007.1        3        4    24.4           1            0
Sydney          1004.6        3        7    24.9           1            0
Wollongong      1003.3        5        5    14.0           0            1
Canberra        1004.4        1        6    19.9           1            0
           RISK_MM
Albury           0
Newcastle        0
Penrith          0
Sydney           0
Wollongong       0
Canberra         0
  1. Identify a research goal that could be addressed using one of our clustering algorithms.

  2. Identify a research goal that could be addressed using our PCA dimension reduction algorithm.

Example 2

Let’s start with just 3 correlated features:

\(x_1\) (Temp9am), \(x_2\) (MinTemp), and \(x_3\) (WindSpeed9am)

The goal of PCA will be to combine these correlated features into a smaller set of uncorrelated principal components (PCs) without losing a significant amount of information.

  1. The first PC will be defined to retain the greatest variability, hence information, in the original features. What do you expect the first PC to be like?

  2. How many PCs do you think we’ll need to keep without losing too much of the original information?
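
The small_example data used in the next example isn’t constructed in these notes; a minimal sketch, assuming it’s simply these 3 columns of weather_data:

# Assumed construction of small_example: just the 3 features of interest
small_example <- weather_data %>% 
  dplyr::select(Temp9am, MinTemp, WindSpeed9am)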

Example 3

Perform a PCA on the small_example data:

# This code is nice and short!
# scale = TRUE, center = TRUE first standardizes the features
pca_small <- prcomp(small_example, scale = TRUE, center = TRUE)

This creates 3 PCs, each of which is a different combination of the (standardized) original features:

# Original (standardized) features
scale(small_example) %>% 
  head()
               Temp9am     MinTemp WindSpeed9am
Albury     -0.49164877 -0.10016777    0.5179183
Newcastle   0.25610853 -0.13455372   -1.3350783
Penrith     0.65287772  0.20930578   -0.4085800
Sydney      0.72917948  0.62193718   -1.8645059
Wollongong -0.93419901 -0.77069379    0.5179183
Canberra   -0.03383817 -0.06578182    1.3120597
# PCs
pca_small %>% 
  pluck("x") %>% 
  head()
                  PC1         PC2         PC3
Albury     -0.6119817  0.27733759 -0.26182765
Newcastle   0.6944764 -1.15487603  0.22381760
Penrith     0.7312569 -0.09374376  0.30573078
Sydney      1.7089610 -1.21413476  0.01482615
Wollongong -1.3091122 -0.09282229 -0.11200583
Canberra   -0.6683498  1.12907017  0.07404101

Specifically, these PCs are linear combinations of the (standardized) original x features, defined by loadings a:

\(PC_1 = a_{11}x_1 + a_{12}x_2 + a_{13}x_3\)

\(PC_2 = a_{21}x_1 + a_{22}x_2 + a_{23}x_3\)

\(PC_3 = a_{31}x_1 + a_{32}x_2 + a_{33}x_3\)

And these linear combinations are defined so that the PCs are uncorrelated, thus each contains unique weather information about the cities!
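
We can confirm this directly (a quick sketch, not in the original notes): the correlation matrix of the PC scores is the identity matrix, up to rounding.

# Quick check: the PC scores are uncorrelated with each other
pca_small %>% 
  pluck("x") %>% 
  cor() %>% 
  round(10)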

  1. Use the loadings below to specify the formula for the first PC.

    PC1 = ___*Temp9am + ___*MinTemp + ___*WindSpeed9am

                    PC1       PC2         PC3
Temp9am       0.6312659 0.2967160  0.71656333
MinTemp       0.6230387 0.3562101 -0.69637431
WindSpeed9am -0.4618725 0.8860440  0.03999775

SOLUTION: PC1 = 0.6312659*Temp9am + 0.6230387*MinTemp - 0.4618725*WindSpeed9am

  2. For just the first city, confirm that its PC1 coordinate or score can be calculated from its original coordinates using the formula from part 1:
# Original (standardized) coordinates
scale(small_example) %>% 
  head(1)
          Temp9am    MinTemp WindSpeed9am
Albury -0.4916488 -0.1001678    0.5179183
# PC coordinates
pca_small %>%   
  pluck("x") %>% 
  head(1)
              PC1       PC2        PC3
Albury -0.6119817 0.2773376 -0.2618276

SOLUTION:

(0.6312659 * -0.4916488)  + (0.6230387 * -0.1001678) - (0.4618725 * 0.5179183)
[1] -0.6119818
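
More generally (a sketch building on the solution above), every city’s PC scores can be obtained at once by multiplying the standardized features by the full loading matrix:

# All PC scores = standardized features %*% loadings (rotation matrix)
all_scores <- scale(small_example) %*% pca_small$rotation
head(all_scores, 1)   # matches Albury's PC coordinates above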

Example 4

Plots can help us interpret the above numerical loadings, hence the important components of each PC.
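
The loading plots themselves aren’t reproduced here; as a hedged sketch, one way such a plot could be made with ggplot2:

# Sketch: bar plot of the loadings, one panel per PC
pca_small %>% 
  pluck("rotation") %>% 
  as.data.frame() %>% 
  rownames_to_column("feature") %>% 
  pivot_longer(cols = starts_with("PC"), names_to = "PC", values_to = "loading") %>% 
  ggplot(aes(x = feature, y = loading)) + 
    geom_col() + 
    facet_wrap(~ PC)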

  1. Which features contribute the most, either positively or negatively, to the first PC?

  2. What about the second PC?

Example 5

When we have a lot of features x, the above plots get messy. A loadings plot or correlation circle is another way to visualize PC1 and PC2 (the most important PCs):

  • each arrow represents a feature x
  • the x-coordinate of an arrow reflects the correlation between x and PC1
  • the y-coordinate of an arrow reflects the correlation between x and PC2
  • arrow length reflects how much the feature contributes to the first 2 PCs

It is powerful in that it can provide a 2-dimensional visualization of high dimensional data (just 3 dimensions in our small example here)!
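
The correlation circle itself isn’t reproduced here; a hedged sketch of one way to draw it, using the fact that for standardized features the correlation between a feature and a PC equals that feature’s loading times the PC’s standard deviation:

# Sketch: hand-made correlation circle for PC1 & PC2
# cor(feature, PC) = loading * sd(PC) when features are standardized
cors <- sweep(pca_small$rotation[, 1:2], 2, pca_small$sdev[1:2], "*") %>% 
  as.data.frame() %>% 
  rownames_to_column("feature")

ggplot(cors, aes(x = PC1, y = PC2, label = feature)) + 
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2), arrow = arrow()) + 
  geom_text(nudge_y = 0.05) + 
  coord_fixed(xlim = c(-1, 1), ylim = c(-1, 1))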

  1. Positively correlated features point in similar directions. The opposite is true for negatively correlated features. What do you learn here?

  2. Which features are most highly correlated with, hence contribute the most to, the first PC (x-axis)? (Is this consistent with what we observed in the earlier plots?)

  3. What about the second PC?

Example 6

Now that we better understand the structures of the PCs, let’s examine the relative amount of information they each capture from the original set of features:

# Load package for tidy table
library(tidymodels)

# Measure information captured by each PC
# Calculate variance from standard deviation
pca_small %>% 
  tidy(matrix = "eigenvalues") %>% 
  mutate(var = std.dev^2)
# A tibble: 3 × 5
     PC std.dev percent cumulative    var
  <dbl>   <dbl>   <dbl>      <dbl>  <dbl>
1     1   1.50   0.753       0.753 2.26  
2     2   0.812  0.220       0.973 0.660 
3     3   0.284  0.0268      1     0.0805

NOTE:

  • var = amount of variability, hence information, in the original features captured by each PC
  • percent = % of original information captured by each PC
  • cumulative = cumulative % of original information captured by the PCs
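
For reference (a quick sketch), the percent column is just each PC’s variance divided by the total variance, which equals p for standardized features:

# Check: percent = var / total variance (= 3 here, since features are standardized)
pca_small %>% 
  tidy(matrix = "eigenvalues") %>% 
  mutate(var = std.dev^2, 
         percent_check = var / sum(var))
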
  1. What % of the original information is captured by PC1? Confirm using both the var and percent columns.

  2. What % of the original information is captured by PC2?

  3. In total, 100% of the original information is captured by PC1, PC2, and PC3. What % of the original info would we retain if we only kept PC1 and PC2, i.e. if we reduced the PC dimensions by 1? Confirm using both the percent and cumulative columns.

Example 7

Especially when we start with lots of features, graphical summaries of the above tidy summary can help us understand the variation captured by the PCs:
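
The plots aren’t reproduced here; as a hedged sketch, a scree plot of the percent of variance per PC could be made like this:

# Sketch: scree plot of % of variance explained by each PC
pca_small %>% 
  tidy(matrix = "eigenvalues") %>% 
  ggplot(aes(x = PC, y = percent)) + 
    geom_col() + 
    labs(x = "PC", y = "% of variance explained")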

Based on these summaries, how many and which of the 3 PCs does it make sense to keep?

Thus by how much can we reduce the dimensions of our dataset?

Example 8

Finally, now that we better understand the “meaning” of our 3 new PCs, let’s explore their scores for each city (row) in the dataset.

A score plot maps out each city’s scores on the first two (and most important) PCs, with PC1 on the x-axis and PC2 on the y-axis.
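
The score plot itself isn’t reproduced here; a hedged sketch of how it could be made:

# Sketch: score plot of PC1 vs PC2, labeled by city
pca_small %>% 
  pluck("x") %>% 
  as.data.frame() %>% 
  rownames_to_column("Location") %>% 
  ggplot(aes(x = PC1, y = PC2, label = Location)) + 
    geom_text(size = 3)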

Does there appear to be any geographical explanation of which cities are similar with respect to their PC1 and PC2 scores?

Small Group Work

For the rest of the class, work together on Examples 9 and 10 (solutions on site) and then on Ex 7-8 on HW7 (Rmd on Moodle).

After Class

Upcoming due dates

  • 4/23 : HW7
  • 4/25 : Concept Quiz 3
  • 4/29 : Group Assignment 3