K-means Clustering

Brianna Heggeseth

As we gather

  • Sit with new people!
    • Try to sit with at least 1 person you don’t know well
    • Introduce yourself

Announcements

MSCS Events

  • Thursday at 11:15am - MSCS Coffee Break

    • Today: Beyond Mac (Felicia DeSmith ’05, software by way of religion/philosophy)
    • April 18: MSCS Seminar (Prof. Laura Lyman)

Concept Quiz 2

Part 1 Revisions due Friday

  • For each X, write a more correct answer and explain your reasoning: why this answer is more correct, what you were thinking during the quiz, and how your understanding has changed.

Context

GOALS

Suppose we have a set of feature variables (\(x_1,x_2,...,x_p\)) but NO outcome variable y. Thus instead of our goal being to predict/classify/explain y, we might simply want to…

  1. Examine the structure of our data.
  2. Utilize this examination as a jumping off point for further analysis.

Drawbacks of hierarchical clustering

  1. can be computationally expensive: we have to calculate the distance between each pair of objects in our sample to start the algorithm

  2. greedy algorithm: This algorithm makes the best local decision of which clusters to fuse at each step in the dendrogram. Once clusters are fused they remain fused throughout the dendrogram. However, meaningful clusters may not always be hierarchical or nested.

K-means Algorithm

Goal:

Split the cases into \(K\) non-overlapping, homogeneous clusters or groups of cases (mathematically, think sets), notated \(C_1, C_2, ..., C_K\), that minimize the total within-cluster variance:

\[\sum_{k=1}^K W(C_k)\]

where \(W(C_k)\) measures the within-cluster variance of cluster \(C_k\):

\[W(C_k)= \frac{1}{ |C_k|} \sum_{\mathbf{x_i},\mathbf{x_j}\in C_k} ||\mathbf{x_i}-\mathbf{x_j}||^2\]

\[ = \frac{1}{\text{# of cases in $C_k$}} (\text{total squared Euclidean distance between all pairs in $C_k$})\]

\(W(C_k)\) is the total squared Euclidean distance between all pairs of cases in cluster \(C_k\), scaled by the cluster size, i.e., roughly the average squared distance between cases in \(C_k\).
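
As a quick numerical illustration (made-up data, not the course code), the following R sketch computes \(W(C_k)\) for a single cluster directly from this definition:

# Hypothetical cluster C_k: 4 cases measured on 2 (scaled) features
C_k <- matrix(c(1.0, 0.2,
                1.2, 0.1,
                0.8, 0.3,
                1.1, 0.4),
              ncol = 2, byrow = TRUE)

# Squared Euclidean distance between every pair of cases (each pair counted in both orders)
sq_dists <- as.matrix(dist(C_k))^2

# W(C_k): total pairwise squared distance divided by the number of cases in the cluster
W_Ck <- sum(sq_dists) / nrow(C_k)
W_Ck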




Algorithm:

  1. Pick K
    Appropriate values of \(K\) are context-dependent and can be deduced from prior knowledge, the data, or the results of hierarchical clustering. Try multiple!

  2. Initialization
    Randomly select the locations of \(K\) centroids. Assign each case to the nearest centroid. This random assignment defines the first set of clusters \(C_1\) through \(C_K\).

  3. Centroid calculation
    Calculate the centroid, i.e., the average location of the cases \(x\) in each cluster \(C_i\).

  4. Cluster assignment
    Re-assign each case to the cluster \(C_i\) with the nearest centroid.

  5. Repeat!
    Iterate between steps 3 and 4 until the clusters stabilize, i.e., until the cluster centroids barely change from iteration to iteration.
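
To make these steps concrete, here is a bare-bones sketch of the algorithm in R. This is illustrative only (in practice we use the built-in kmeans() function); it initializes the centroids at K randomly chosen cases, ignores the possibility of an empty cluster, and runs on made-up data:

set.seed(253)                              # Step 2 is random, so results depend on the seed
x <- matrix(rnorm(100 * 2), ncol = 2)      # hypothetical data: 100 cases, 2 scaled features
K <- 3                                     # Step 1: pick K

# Step 2: initialize the centroids (here: K randomly chosen cases)
centroids <- x[sample(nrow(x), K), , drop = FALSE]

for (iter in 1:50) {
  # Cluster assignment: assign each case to the nearest centroid
  # (Step 4, and the initial assignment in Step 2)
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:K), 1:K]   # case-to-centroid distances
  cluster <- apply(d, 1, which.min)

  # Step 3: recompute each centroid as the average location of its cases
  new_centroids <- t(sapply(1:K, function(k) colMeans(x[cluster == k, , drop = FALSE])))

  # Step 5: stop once the centroids have stabilized
  if (max(abs(new_centroids - centroids)) < 1e-8) break
  centroids <- new_centroids
}

table(cluster)   # sizes of the final clusters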

Example 1

Recall the hierarchical algorithm.

  • Each data point starts as a leaf.
  • Compute the Euclidean distance between all pairs of data points with respect to their features x.
  • Fuse the 2 closest data points into a single cluster or branch.
  • Continue to fuse the 2 closest clusters until all cases are in 1 cluster.

Revisit the screenshots from the Shiny app to explore the results of this algorithm.

Why can the “greediness” of this algorithm sometimes produce strange results?
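
For reference, a minimal sketch of this hierarchical workflow in R, assuming the palmerpenguins data and the two bill features (the object names here are illustrative, not the course code):

library(palmerpenguins)
penguins_sub <- na.omit(penguins[, c("bill_length_mm", "bill_depth_mm")])

penguin_dist <- dist(scale(penguins_sub))                   # Euclidean distance between every pair of cases
penguin_hier <- hclust(penguin_dist, method = "complete")   # repeatedly fuse the 2 closest clusters

plot(penguin_hier)   # dendrogram: each penguin starts as a leaf and fuses into branches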

Example 2

Go to this interactive app and play around with this algorithm.

  • “How to pick the initial centroids?”: Select “I’ll Choose”
  • “What kind of data would you like?”: Select “Gaussian Mixture”

Challenge:

  • Can you identify a set of initial centroids that result in reasonable final clusters?

  • What about a set of initial centroids that don’t result in reasonable final clusters?

Example 3

Let’s do K-means clustering in R using our penguin data:

We’ll cluster these penguins based on their bill lengths and depths:

  1. Run the K-means algorithm using K = 3 (a code sketch follows this list). We didn't set the seed for hierarchical clustering. Why do we have to set the seed for K-means? Why do we have to scale the data?

  2. Repeat the K-means algorithm with K = 3, BUT with a different seed. How did the results change, and why is this possible?

  3. In practice, should we try out a variety of seeds?
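
A minimal sketch of this workflow, assuming the palmerpenguins data and illustrative object names (not necessarily the course code):

library(palmerpenguins)
penguins_sub <- na.omit(penguins[, c("bill_length_mm", "bill_depth_mm")])

# K-means depends on the random initialization, so set a seed;
# it is also scale-sensitive, so scale the features first
set.seed(253)
penguin_kmeans <- kmeans(scale(penguins_sub), centers = 3)
penguin_kmeans$size        # number of penguins in each cluster

# A different seed can lead to a different local optimum, i.e., different clusters
set.seed(2)
kmeans(scale(penguins_sub), centers = 3)$size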

Example 4

To implement K-means clustering, we must choose an appropriate K! Use the following example to discuss the Goldilocks challenge of picking K.
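
One common way to visualize this trade-off is an "elbow" plot of the total within-cluster sum of squares across a range of K values. A sketch using factoextra, assuming the same numeric penguin features as above (not necessarily the course code):

library(factoextra)
library(palmerpenguins)
penguins_sub <- na.omit(penguins[, c("bill_length_mm", "bill_depth_mm")])

# Total within-cluster sum of squares for K = 1, ..., 10: look for an "elbow"
fviz_nbclust(scale(penguins_sub), kmeans, method = "wss", k.max = 10)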

Example 5

Let’s compare and contrast the results of the hierarchical and K-means algorithms. To this end, let’s use both to identify 2 penguin clusters:

  1. Do the clusters defined by the two algorithms agree? NOTE: the actual cluster numbers are arbitrary.
# A tibble: 6 × 4
  bill_length_mm bill_depth_mm hier_cluster kmeans_cluster
           <dbl>         <dbl> <fct>        <fct>         
1           50.7          15   1            1             
2           37.6          19.1 2            2             
3           35.5          16.2 1            2             
4           51.3          19.9 2            1             
5           50.4          15.3 1            1             
6           40.9          16.8 2            2             
  2. Plot the cluster assignments (a code sketch follows). Which algorithm produces seemingly better clusters?
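
One way such a comparison might be coded, assuming the palmerpenguins data, dplyr, and ggplot2 (illustrative names, not necessarily the course code):

library(dplyr)
library(ggplot2)
library(palmerpenguins)

penguins_sub    <- na.omit(penguins[, c("bill_length_mm", "bill_depth_mm")])
penguins_scaled <- scale(penguins_sub)

set.seed(253)
compare <- penguins_sub %>%
  mutate(
    hier_cluster   = as.factor(cutree(hclust(dist(penguins_scaled)), k = 2)),
    kmeans_cluster = as.factor(kmeans(penguins_scaled, centers = 2)$cluster)
  )

# Cross-tabulate the two label sets (remember: the cluster numbering itself is arbitrary)
count(compare, hier_cluster, kmeans_cluster)

# Plot the K-means assignments on the original feature scale
ggplot(compare, aes(x = bill_length_mm, y = bill_depth_mm, color = kmeans_cluster)) +
  geom_point()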

Example 6

Pros and Cons of K-means?

Bonus

Choosing \(K\) - Average Silhouette

Goal: Choose \(K\) to maximize the distance between clusters and minimize the distance within clusters.

\[a(i) = \frac{1}{|C_I|-1} \sum_{j\in C_I, j\not=i} ||\mathbf{x}_i - \mathbf{x}_j||^2\] \[ = \text{avg. within-cluster squared distance from point $i$ to the other points in its cluster}\]

\[b(i) = \min_{J\not=I}\frac{1}{|C_J|} \sum_{j\in C_J} ||\mathbf{x}_i - \mathbf{x}_j||^2\] \[=\text{min avg. squared distance from point $i$ to the points in another cluster}\]

Silhouette for point \(i\) (\(-1\leq s(i) \leq 1\)):

\[s(i) = \frac{b(i) - a(i)}{\max\{a(i),b(i)\}}\] \[ = \text{relative difference in within and between distance}\] \[ = \text{measure of how tightly clustered a group of points is relative to other groups }\]

Properties of Silhouette

  • If \(b(i) >> a(i)\) (the distance to other clusters is large relative to the distance within its own cluster), then \(s(i) \approx 1\): point \(i\) is well matched to its cluster.

  • If \(b(i) << a(i)\), then \(s(i) \approx -1\): point \(i\) is more similar to a neighboring cluster than to its own (not great).

  • If \(b(i) = a(i)\), then \(s(i) = 0\): it is essentially a coin flip which cluster point \(i\) should be in (it sits on the border between 2 clusters).

We plot the average silhouette, \(\frac{1}{n}\sum_{i=1}^n s(i)\), and choose \(K\) that maximizes the average silhouette.

library(factoextra)
# Note: penguins here should contain only the numeric features used for clustering
# (e.g., bill_length_mm and bill_depth_mm), with missing values removed
fviz_nbclust(scale(penguins), kmeans, method = 'silhouette')
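
For a single choice of \(K\), the per-point silhouette values \(s(i)\) behind this plot can be inspected with the cluster package. A sketch with illustrative names (note that silhouette() uses ordinary, unsquared distances):

library(cluster)
library(palmerpenguins)

penguins_sub    <- na.omit(penguins[, c("bill_length_mm", "bill_depth_mm")])
penguins_scaled <- scale(penguins_sub)

set.seed(253)
penguin_kmeans <- kmeans(penguins_scaled, centers = 3)

# silhouette() takes the cluster labels and the pairwise distance object
sil <- silhouette(penguin_kmeans$cluster, dist(penguins_scaled))
head(sil)                   # columns: cluster, nearest neighboring cluster, s(i)
mean(sil[, "sil_width"])    # average silhouette for this choice of K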

Small Group Work

For the rest of the class, work together on Ex 4-6 on HW7 (Rmd on Moodle).

After Class

Upcoming due dates

  • 4/11 : Group Assignment 2
  • 4/12 : Quiz 2 Revisions
  • 4/23 : HW7