Topic 15 K-Means Clustering
Learning Goals
- Clearly describe / implement by hand the k-means algorithm
- Describe the rationale for how clustering algorithms work in terms of within-cluster variation
- Describe the tradeoff of more vs. less clusters in terms of interpretability
- Implement strategies for interpreting / contextualizing the clusters
Slides from today are available here.
Exercises
You can download a template RMarkdown file to start from here.
In this paper, Gorman et al. study characteristics of penguin populations in the Antarctic. We’ll be looking at a dataset of penguin body measurements available in the palmerpenguins
package. (Make sure to install this package before beginning.)
Our goal in using this data is to better understand the following questions: What similarities are there among the penguins? Do there appear to be different species? If so, how many species are there?
library(dplyr)
library(ggplot2)
library(palmerpenguins)
data(penguins)
# Remove observations with missing data on key variables
<- penguins %>%
penguins filter(!is.na(bill_length_mm), !is.na(bill_depth_mm), !is.na(flipper_length_mm))
Exercise 1: Visual explorations
We’ll first explore clustering based on characteristics of the penguins’ bills/beaks. There are two variables that measure the length and depth of the penguins’ bills (in mm): bill_length_mm
and bill_depth_mm
.
- Make a scatterplot of these two measurements. If you had to visually designate 3 different penguin clusters (possible species), how would you designate them?
ggplot(penguins, aes(???)) +
geom_point()
- Based on the plot, are there any differences in scale that you might be concerned about?
Exercise 2: K-means clustering on bill length and depth
The kmeans()
function in R performs k-means clustering.
- Use the code below to run k-means for \(k = 3\) clusters. Why is it important to use
set.seed()
? (In practice, it’s best to run the algorithm for many values of the seed and compare results.)
# Select just the bill length and depth variables
<- penguins %>%
penguins_sub select(bill_length_mm, bill_depth_mm)
# Run k-means for k = centers = 3
set.seed(253)
<- kmeans(penguins_sub, centers = 3)
kclust_k3
# Display the cluter assignments
$cluster
kclust_k3
# Add a variable (kclust_3) to the original dataset
# containing the cluster assignments
<- penguins %>%
penguins mutate(kclust_3 = factor(kclust_k3$cluster))
- Update your original scatterplot to add a color aesthetic that corresponds to the
kclust_3
variable created above. Do the cluster assignments correspond to your intuition from exercise 1? Why might this be?
# Visualize the cluster assignments on the original scatterplot
Exercise 3: Addressing variable scale
We can use the code below to rerun k-means clustering on the scaled data. The scaled data have been rescaled so that the standard deviation of each variable is 1. Remake the scatterplot to visualize the updated cluster assignments. Do the cluster assignments correspond to your intuition from exercise 1?
# Run k-means on the *scaled* data (all variables have SD = 1)
set.seed(253)
<- kmeans(scale(penguins_sub), centers = 3)
kclust_k3_scale <- penguins %>%
penguins mutate(kclust_3_scale = factor(kclust_k3_scale$cluster))
# Visualize the new cluster assignments
Exercise 4: Clustering on more variables
We can use as many variables in our clustering as makes sense given our goals. The dataset contains another body measurement variable of interest to us: flipper_length_mm
(flipper length in mm).
Complete the code below to cluster on bill length and depth as well as flipper length. Looking at the summary statistics, do you think it would be best to scale the variables?
# Select the variables to be used in clustering
<- penguins %>%
penguins_sub select(???)
# Look at summary statistics of the 3 variables
summary(penguins_sub)
# Perform clustering: should you use scale()?
set.seed(253)
<- kmeans(???)
kclust_k3_3vars
<- penguins %>%
penguins mutate(kclust_3_3vars = factor(kclust_k3_3vars$cluster))
Exercise 5: Interpreting the clusters
One way to interpet the resulting clusters is to explore how variables differ across the clusters. We can look at the 3 variables used in the clustering as well as a body mass variable available in the dataset.
Run the code below to look at the mean bill length, bill depth, flipper length, and body mass across the 3 clusters. What characterizes each of the 3 clusters? Try to come up with contextual “names” for the clusters (e.g., “big beaks” or “small penguins”).
%>%
penguins group_by(kclust_3_3vars) %>%
summarize(across(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g), mean))
Exercise 6: Picking \(k\)
We’ve been using \(k = 3\) so far, but how can we pick \(k\) using a data-driven approach. One strategy is to compare the total squared distance of each case from its assigned centroid for different values of \(k\). (This measure is available within the $tot.withinss
component of objects resulting from kmeans()
.)
Run the code below to create this plot for choices of \(k\) from 1 to 15. Using this plot and thinking about data context and our scientific goals, what are some reasonable choices for the number of clusters?
# Data-specific function to cluster and calculate total within-cluster SS
<- function(k){
penguin_cluster_ss # Perform clustering
<- kmeans(scale(penguins_sub), centers = k)
kclust
# Return the total within-cluster sum of squares
return(kclust$tot.withinss)
}
tibble(
k = 1:15,
tot_wc_ss = purrr::map_dbl(1:15, penguin_cluster_ss)
%>%
) ggplot(aes(x = k, y = tot_wc_ss)) +
geom_point() +
labs(x = "Number of clusters",y = 'Total within-cluster sum of squares') +
theme_classic()
Go Deeper: Picking \(k\)
There generally three ways of picking \(k\).
- Elbow Method
- Silhouette Method
- Gap Static Method
The Elbow Method is based on the visualization in Q6. Pick the \(k\) that corresponds to the bend in the curve (the “elbow”) such that you are minimizing total within-cluster sum of squares while keeping the number of clusters small (simpler, more interpretable).
The Silhouette Method is based on minimizing within sum of squares while maximizing the difference between clusters. For each data point, we can calculate the silhouette value, \[\frac{b - a}{\max(a,b)}\] where \(b\) is the smallest mean distance of the data point to all points in any OTHER cluster and \(a\) is the mean distance between the data point and all other data points in the same cluster.
# Data-specific function to cluster and calculate total within-cluster SS
<- function(k){
penguin_cluster_silhouette # Perform clustering
<- kmeans(scale(penguins_sub), centers = k)
kclust
<- cluster::silhouette(kclust$cluster, dist(scale(penguins_sub)))
ss
# Return the silhouette measures
return(mean(ss[, 3]))
}
# Choose value that MAXIMIZES average silhouette
tibble(
k = 2:15,
avg_sil = purrr::map_dbl(2:15, penguin_cluster_silhouette)
%>%
) ggplot(aes(x = k, y = avg_sil)) +
geom_point() +
labs(x = "Number of clusters",y = 'Average Silhouette') +
theme_classic()
The Gap Method is based on statistical inference using Monte Carlo Simulations. This method is beyond the scope of the course but feel free to dig into other more mathematical resources