Topic 18 Principal Components Analysis
Learning Goals
- Explain the goal of dimension reduction and how this can be useful in a supervised learning setting
- Interpret and use the information provided by principal component loadings and scores
- Interpret and use a scree plot to guide dimension reduction
Slides from today are available here.
Exercises
You can download a template RMarkdown file to start from here.
Exercise 1: Core concepts
For this first exercise, we will work through some key ideas and terminology related to PCA using the information below, which comes from a small data set of 3 variables: \(x_1, x_2, x_3\).
\[ \text{PC1} = 0.672 x_1 - 0.287 x_2 - 0.683 x_3 \] \[ \text{PC2} = -0.244 x_1 - 0.956 x_2 + 0.162 x_3 \] \[ \text{PC3} = 0.699 x_1 - 0.058 x_2 - 0.713 x_3 \]
- What are the loadings of principal components 1 to 3? In general, what information does a loading give us?
- What are the two most important variables for forming PC1? PC2? PC3?
- A case has variable values \((x_1, x_2, x_3) = (1, 1, 1)\). What are the PC1, PC2, and PC3 scores for this case? How can we interpret these scores?
- What can be said about the amount of variation in the dataset explained by the three PCs?
- Thinking of the PCs as defining new "directions" in the data, how is each subsequent PC chosen relative to the ones before it?
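To make the score computation concrete, here is a base R sketch that multiplies the case \((1, 1, 1)\) by the loading vectors from the equations above. (This ignores any centering and scaling that `prcomp()` would normally have applied before computing the loadings; the object names are illustrative.)

```r
# Loadings from the equations above, arranged as a 3x3 matrix
# (rows = variables x1..x3, columns = PC1..PC3)
loadings <- matrix(c(
   0.672, -0.244,  0.699,
  -0.287, -0.956, -0.058,
  -0.683,  0.162, -0.713
), nrow = 3, byrow = TRUE,
dimnames = list(c("x1", "x2", "x3"), c("PC1", "PC2", "PC3")))

# Scores for the case (1, 1, 1): a dot product with each loading vector
case <- c(1, 1, 1)
scores <- drop(case %*% loadings)
scores
```

Each score says how far this case lies along the corresponding PC direction.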
Now let's use PCA to explore a gene expression dataset. The `Khan` dataset in the `ISLR` package contains gene expression measurements in cancer tissue samples. (Khan is the first author's last name.) Such data are commonly part of biological studies to better understand the molecular basis for disease. You can find information about this dataset by entering `?Khan` in the Console.
```r
library(dplyr)
library(purrr)
library(ggplot2)
library(ISLR)

data(Khan)

# train_data contains 2308 gene expression measurements for 63 samples
train_data <- Khan %>% pluck('xtrain')

# Rename the variables to be gene1, gene2, etc.
colnames(train_data) <- paste0("gene", seq_len(ncol(train_data)))

# train_labels contains information on which of 4 cancer subtypes each sample comes from
train_labels <- Khan %>% pluck('ytrain')
```
Exercise 2: Exploring PC loadings
The `prcomp()` function performs PCA. Look at the help page for the `prcomp()` function under the "Value" section, and recall that `pluck('name')` or `$` extracts named components of objects (e.g., `list_object %>% pluck('name_of_component')` or `list_object$name_of_component`).

```r
pca_out <- prcomp(train_data, center = TRUE, scale = TRUE)
```
- Use the `head()` function to display the first few rows of the loadings matrix. Using just the first 3 genes, write out the equation for principal component 4.

\[PC4 = ??*gene1 + ??*gene2 + ??*gene3\]

- Describe how you would use the loadings matrix to find the genes that contribute most to the largest source of variation in the dataset.
- In R, we can extract the first column of a matrix object `mat` using `mat[,1]`, or we can convert the matrix to a data frame and use the name of the column: `mat %>% as.data.frame() %>% select(PC1)`. Use the `head()`, `arrange()` (for data frames) or `sort()` (for vectors), and `abs()` functions to display the 10 most important genes that contribute to the largest source of variation.
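The ranking idea can be sketched on a small synthetic matrix (the object names `toy` and `toy_pca` are illustrative; the same steps apply to `pca_out` from the Khan data):

```r
# Synthetic data: 50 samples, 10 "genes"
set.seed(1)
toy <- matrix(rnorm(50 * 10), nrow = 50)
colnames(toy) <- paste0("gene", 1:10)

toy_pca <- prcomp(toy, center = TRUE, scale = TRUE)

# PC1 loadings live in the first column of the rotation matrix
pc1_loadings <- toy_pca$rotation[, 1]

# Sort by absolute size to find the most influential variables
head(sort(abs(pc1_loadings), decreasing = TRUE), 5)
```

Sorting on the absolute value matters because a large negative loading contributes just as strongly to a PC as a large positive one.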
Exercise 3: Exploring PC scores
We can plot the PC1 and PC2 scores against each other in a scatterplot to see whether these new variables cluster the cases according to some other information. For example, in this dataset we have tumor type labels for each case (4 tumor types).
- The `x` component of `pca_out` contains these scores. Complete the code below to make a scatterplot of the PC2 scores versus the PC1 scores.

```r
# Color the points by the information in train_labels (the 4 cancer subtypes present)
pca_out %>%
    pluck('x') %>%
    as.data.frame() %>%
    mutate(labels = train_labels) %>%
    ggplot(aes(x = ??, y = ??, color = factor(??))) +
        geom_point() +
        labs(x = 'PC1', y = 'PC2') +
        scale_color_viridis_d() +
        theme_classic()
```
- Do you notice any clustering by tumor type?
- How could we use k-means and hierarchical clustering to see whether the cases (tissue samples) cluster by tumor type?
- How can we use loadings and the information in the score plot to understand what genes drive groupings of the tissue samples?
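One way to probe the clustering question is to run k-means on the first two PC scores and cross-tabulate the cluster assignments against the known labels. A sketch on synthetic data (the two-group structure and all object names here are illustrative, not from the Khan data):

```r
# Synthetic data with two known groups, shifted apart
set.seed(2)
grp <- rep(1:2, each = 25)
toy <- matrix(rnorm(50 * 5), nrow = 50) + 2 * grp

toy_pca <- prcomp(toy, center = TRUE, scale = TRUE)
scores <- toy_pca$x[, 1:2]

# k-means on the PC scores; compare clusters to the known labels
km <- kmeans(scores, centers = 2, nstart = 20)
table(cluster = km$cluster, label = grp)
```

If the PCs capture the group structure, most cases of each label land in a single cluster; the same comparison works with hierarchical clustering via `hclust()` and `cutree()`.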
Exercise 4: Scree plots and dimension reduction
Let’s explore how to use PCA for dimension reduction.
- The `sdev` component of `pca_out` gives the standard deviation of each principal component. Explain what the first 2 lines of code below are doing.

```r
var_explained <- (pca_out %>% pluck('sdev'))^2
pve <- var_explained/sum(var_explained)

var_data <- tibble(
    PC = seq_len(length(var_explained)),
    var_explained = var_explained,
    pve = pve
)

# Construct scree plots
p1 <- var_data %>%
    ggplot(aes(x = PC, y = pve)) +
        geom_point() +
        geom_line() +
        labs(x = 'Principal Component', y = 'Proportion of variance explained') +
        theme_classic()

p2 <- var_data %>%
    ggplot(aes(x = PC, y = cumsum(pve))) +
        geom_point() +
        geom_line() +
        labs(x = 'Principal Component', y = 'Cumulative proportion of variance explained') +
        theme_classic()

library(ggpubr)
ggarrange(p1, p2)
```
- Explain why the plots above look the way they do. (These plots are called scree plots.)
- We can think of principal components as new variables. PCA allows us to perform dimension reduction to use a smaller set of variables, often to accompany supervised learning. How can we use the plots above to guide a choice about the number of PCs to use?
- Carefully describe how we could also use cross-validation to pick the number of PCs. (For concreteness, suppose that we’re in a linear regression setting.)
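The cross-validation idea can be sketched by hand on synthetic data: fit a linear regression on the first \(k\) PC scores for each \(k\), estimate test error by CV, and pick the \(k\) with the lowest error. (All names here are illustrative; for simplicity, PCA is computed once on the full data, whereas a stricter version would recompute it within each training fold.)

```r
set.seed(3)
n <- 100
X <- matrix(rnorm(n * 6), nrow = n)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n, sd = 0.5)

pca <- prcomp(X, center = TRUE, scale = TRUE)
folds <- sample(rep(1:5, length.out = n))  # 5-fold assignment

# CV mean squared error for models using the first k PCs
cv_mse <- sapply(1:6, function(k) {
  mean(sapply(1:5, function(f) {
    train <- folds != f
    dat <- data.frame(y = y, pca$x[, 1:k, drop = FALSE])
    fit <- lm(y ~ ., data = dat[train, ])
    mean((y[!train] - predict(fit, dat[!train, ]))^2)
  }))
})
which.min(cv_mse)  # number of PCs with the lowest CV error
```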
Exercise 5: Variable scaling
You likely noticed the `scale = TRUE` argument within `prcomp()` above, which scales the variables to all have unit variance. Explain why this is often advisable by thinking generally about the ranges of different variables. In what other methods would scaling be important?
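A small sketch of why scaling matters: when one variable has a much larger variance than the others (for instance, only because of its measurement units), it dominates PC1 in an unscaled PCA. (The variable names and standard deviations below are made up for illustration.)

```r
set.seed(4)
small <- rnorm(100, sd = 1)    # e.g., a quantity measured in meters
large <- rnorm(100, sd = 100)  # a similar quantity in, say, millimeters
other <- rnorm(100, sd = 1)
X <- cbind(small, large, other)

unscaled <- prcomp(X, center = TRUE, scale = FALSE)
scaled   <- prcomp(X, center = TRUE, scale = TRUE)

abs(unscaled$rotation[, 1])  # 'large' dominates PC1
abs(scaled$rotation[, 1])    # loadings are comparable after scaling
```

Without scaling, PC1 mostly just reproduces the high-variance variable, so "largest source of variation" reflects units rather than structure in the data.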