Topic 18 Principal Components Analysis

Learning Goals

  • Explain the goal of dimension reduction and how this can be useful in a supervised learning setting
  • Interpret and use the information provided by principal component loadings and scores
  • Interpret and use a scree plot to guide dimension reduction


Slides from today are available here.




Exercises

You can download a template RMarkdown file to start from here.

Exercise 1: Core concepts

For this first exercise, we will work through some key ideas and terminology related to PCA using the information below, which comes from a small data set of 3 variables: \(x_1, x_2, x_3\).

\[ \text{PC1} = 0.672 x_1 - 0.287 x_2 - 0.683 x_3 \] \[ \text{PC2} = -0.244 x_1 - 0.956 x_2 + 0.162 x_3 \] \[ \text{PC3} = 0.699 x_1 - 0.058 x_2 + 0.713 x_3 \]

  1. What are the loadings of principal components 1 to 3? In general, what information does a loading give us?

  2. What are the two most important variables for forming PC1? PC2? PC3?

  3. A case has variable values \((x_1, x_2, x_3) = (1, 1, 1)\). What are the PC1, PC2, and PC3 scores for this case? How can we interpret these scores? (One way to compute them in R is sketched after this list.)

  4. What can be said about the amount of variation in the dataset explained by the three PCs?

  5. In thinking about the PCs as defining new “directions” in the data, how is each subsequent PC (PC2 and beyond) chosen relative to the PCs before it?
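For part 3, note that a PC score is just the weighted sum of a case’s variable values, with the loadings as weights. Here is a minimal sketch in R, assuming the values \((1, 1, 1)\) are already on the centered and scaled scale used to fit the PCA:

# Loadings matrix: columns are the loading vectors for PC1, PC2, PC3 (from the equations above)
loadings <- cbind(
    PC1 = c(0.672, -0.287, -0.683),
    PC2 = c(-0.244, -0.956, 0.162),
    PC3 = c(0.699, -0.058, 0.713)
)
rownames(loadings) <- c("x1", "x2", "x3")

# Each score is the dot product of the case's values with a loading vector
case_values <- c(x1 = 1, x2 = 1, x3 = 1)
case_values %*% loadings   # 1 x 3 matrix of the PC1, PC2, PC3 scores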




Now let’s use PCA to explore a gene expression dataset. The Khan dataset in the ISLR package contains gene expression measurements in cancer tissue samples. (Khan is the first author’s last name.) Such data are commonly part of biological studies to better understand the molecular basis for disease. You can find information about this dataset by entering ?Khan in the Console.

library(dplyr)
library(purrr)
library(ggplot2)
library(ISLR)
data(Khan)

# train_data contains 2308 gene expression measurements for 63 samples
train_data <- Khan %>% pluck('xtrain')

# Rename the variables to be gene1, gene2, etc.
colnames(train_data) <- paste0("gene", seq_len(ncol(train_data)))

# train_labels contains information on which of 4 cancer subtypes each sample comes from
train_labels <- Khan %>% pluck('ytrain')
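Before running PCA, it can help to confirm the data dimensions and label distribution (a quick check, not part of the exercises):

dim(train_data)      # should be 63 samples by 2308 genes
table(train_labels)  # counts of samples in each of the 4 cancer subtypes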

Exercise 2: Exploring PC loadings

The prcomp() function performs PCA. Look at the “Value” section of the prcomp() help page to see the named components it returns, and recall that pluck('name') or $ extracts named components of objects (e.g., list_object %>% pluck('name_of_component') or list_object$name_of_component).

pca_out <- prcomp(train_data, center = TRUE, scale = TRUE)
  1. Use the head() function to display the first few rows of the loadings matrix.

  2. Using just the first 3 genes, write out the equation for principal component 4.

\[ \text{PC4} = ?? \cdot \text{gene1} + ?? \cdot \text{gene2} + ?? \cdot \text{gene3} \]

  3. Describe how you would use the loadings matrix to find the genes that contribute most to the largest source of variation in the dataset.
  4. In R, we can extract the first column of a matrix object mat with mat[,1], or we can convert the matrix to a data frame and select the column by name: mat %>% as.data.frame() %>% select(PC1). Use head(), abs(), and either arrange() (for data frames) or sort() (for vectors) to display the 10 genes that contribute most to the largest source of variation. (One possible approach is sketched below.)
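Here is a minimal sketch of one approach, assuming (per the “Value” section of ?prcomp) that the loadings live in the rotation component:

# The loadings matrix is stored in the rotation component of the prcomp output
loadings_mat <- pca_out %>% pluck('rotation')

# PC1 captures the largest source of variation, so sort the genes
# by the absolute value of their PC1 loading
pc1_loadings <- loadings_mat[, 1]
head(sort(abs(pc1_loadings), decreasing = TRUE), 10)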

Exercise 3: Exploring PC scores

We can plot the PC1 and PC2 scores against each other in a scatterplot to see whether these new variables separate the cases according to other information. For example, in these data we have a tumor type label for each case (one of 4 tumor types).

  1. The x component of pca_out contains these scores. Complete the code below to make a scatterplot of the PC2 scores versus the PC1 scores.
# Color the points by the information in train_labels (the 4 cancer subtypes present)
pca_out %>% 
    pluck('x') %>%
    as.data.frame() %>%
    mutate(labels = train_labels) %>%
    ggplot(aes(x = ??, y = ??, color = factor(??))) + 
    geom_point() +
    labs(x = 'PC1', y = 'PC2') +
    scale_color_viridis_d() +
    theme_classic()
  2. Do you notice any clustering by tumor type?
  3. How could we use k-means and hierarchical clustering to see whether the cases (tissue samples) cluster by tumor type? (One approach is sketched after this list.)
  4. How can we use the loadings and the information in the score plot to understand which genes drive the groupings of the tissue samples?
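For part 3, one possible approach is to cluster the samples on their first several PC scores and compare the cluster assignments to the known labels. This is a sketch only; the choice of 10 PCs and 4 clusters below is an assumption for illustration:

# Cluster the samples using their scores on the first 10 PCs
scores_sub <- (pca_out %>% pluck('x'))[, 1:10]

# k-means with 4 clusters (matching the 4 known tumor subtypes)
set.seed(42)
km_out <- kmeans(scores_sub, centers = 4, nstart = 20)
table(km_out$cluster, train_labels)

# Hierarchical clustering, cutting the tree into 4 clusters
hc_out <- hclust(dist(scores_sub))
table(cutree(hc_out, k = 4), train_labels)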

Exercise 4: Scree plots and dimension reduction

Let’s explore how to use PCA for dimension reduction.

  1. The sdev component of pca_out gives the standard deviation of the data along each principal component. Explain what the first 2 lines of code below are doing.
var_explained <- (pca_out %>% pluck('sdev'))^2
pve <- var_explained/sum(var_explained)

var_data <- tibble(
    PC = seq_len(length(var_explained)),
    var_explained = var_explained,
    pve = pve
)
    
# Construct scree plots
p1 <- var_data %>%
    ggplot(aes(x = PC, y = pve)) +
    geom_point() + 
    geom_line() + 
    labs(x = 'Principal Component', y = 'Proportion of variance explained') +
    theme_classic()

p2 <- var_data %>%
    ggplot(aes(x = PC, y = cumsum(pve))) +
    geom_point() + 
    geom_line() + 
    labs(x = 'Principal Component', y = 'Cumulative proportion of variance explained') +
    theme_classic()

library(ggpubr) 
ggarrange(p1, p2)
  2. Explain why the plots above look the way they do. (These plots are called scree plots.)
  3. We can think of principal components as new variables. PCA allows us to perform dimension reduction to use a smaller set of variables, often to accompany supervised learning. How can we use the plots above to guide the choice of the number of PCs to use?
  4. Carefully describe how we could also use cross-validation to pick the number of PCs. (For concreteness, suppose that we’re in a linear regression setting. A sketch using the pls package follows this list.)
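As a sketch of part 4: principal components regression, as implemented by pcr() in the pls package, regresses the outcome on the first several PC scores and can estimate test error for each number of components by cross-validation. The outcome y and data frame reg_data here are hypothetical (the Khan labels are categorical, so a quantitative outcome is assumed for this linear regression illustration):

library(pls)

# Fit PCR for a hypothetical quantitative outcome y, estimating test error
# for each number of components with 10-fold cross-validation
pcr_out <- pcr(y ~ ., data = reg_data, scale = TRUE, validation = "CV")

# Plot cross-validated MSE versus the number of components; choose the
# number of PCs that (nearly) minimizes the estimated test error
validationplot(pcr_out, val.type = "MSEP")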

Exercise 5: Variable scaling

You likely noticed the scale = TRUE within prcomp() above. This scales the variables so that they all have unit variance. Explain why this is often advisable by thinking generally about the ranges of different variables. In what other methods would scaling be important? (A small demonstration follows.)
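To make the issue concrete, here is a minimal sketch using the built-in mtcars data, where disp (engine displacement, in cubic inches) has a far larger variance than wt (weight, in 1000s of lbs):

# Compare variances: disp varies on a much larger scale than wt
apply(mtcars[, c("disp", "wt")], 2, var)

# Without scaling, the large-variance variable dominates the PC1 loadings
prcomp(mtcars[, c("disp", "wt")], center = TRUE, scale = FALSE)$rotation

# With scaling, both variables contribute on an equal footing
prcomp(mtcars[, c("disp", "wt")], center = TRUE, scale = TRUE)$rotation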