Topic 17 Catch-up Clustering Day
Learning Goals
- Implement k-means and hierarchical clustering and interpret their outputs and algorithms
- Synthesize and apply concepts covered so far on real data
Slides from today are available here.
Real Data Clustering
As a group, choose one of the following three datasets to work with:
- Wine Attributes (download here)
- 178 Italian wines were analyzed
- Variables (from Chemical Analysis)
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
library(readr)
wine <- read_csv('wine.csv')
- Mall Customers (download here)
- 200 individuals
- Variables
- Binary Gender
- Age
- Annual Income (in $1000’s)
- Spending Score (summary of buying behavior)
library(readr)
customers <- read_csv('mall_customers.csv')
- Credit Card Clients (download here)
- Almost 9000 credit card holders
- Variables based on 6 months of time
- CUSTID: Identification of Credit Card holder
- BALANCE : Balance amount left in their account to make purchases
- BALANCEFREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES : Amount of purchases made from account
- ONEOFFPURCHASES : Maximum purchase amount done in one-go
- INSTALLMENTSPURCHASES : Amount of purchase done in installment
- CASHADVANCE : Cash in advance given by the user
- PURCHASESFREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFFPURCHASESFREQUENCY : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
- PURCHASESINSTALLMENTSFREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- CASHADVANCEFREQUENCY : How frequently cash in advance is being paid
- CASHADVANCETRX : Number of transactions made with “Cash in Advance”
- PURCHASESTRX : Number of purchase transactions made
- CREDITLIMIT : Limit of Credit Card for user
- PAYMENTS : Amount of Payment done by user
- MINIMUM_PAYMENTS : Minimum amount of payments made by user
- PRCFULLPAYMENT : Percent of full payment paid by user
- TENURE : Tenure of credit card service for user
library(readr)
credit <- read_csv('creditcard.csv')
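Whichever dataset you pick, it usually helps to check for missing values, drop ID columns, and standardize the quantitative variables before clustering. The sketch below shows one way to do that. The data frame `raw` and its columns are hypothetical toy values standing in for a freshly loaded CSV, and dropping incomplete rows is only one of several reasonable choices.

```r
# Hypothetical toy data standing in for a freshly loaded CSV (not real records)
raw <- data.frame(
  id      = c("C1", "C2", "C3", "C4", "C5"),
  balance = c(40.9, 3202.5, 2495.1, NA, 817.7),
  limit   = c(1000, 7000, 7500, 7500, 1200)
)

# Count missing values per column
colSums(is.na(raw))

# Keep only numeric columns (this drops the ID), then drop incomplete rows
num <- raw[, sapply(raw, is.numeric)]
num <- na.omit(num)

# Standardize to mean 0 and sd 1 so large-valued variables don't dominate distances
num_scaled <- scale(num)
round(colMeans(num_scaled), 10)
```

Standardizing matters most when variables are on very different scales, such as a 0 to 1 frequency score next to a dollar amount.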
Your Goal
Goal: Cluster the data to discover insight and patterns in the data
Available Methods
- K-means with all quantitative variables
- Partitioning around Medoids (pam) as a robust version of K-means
- If you have at least one categorical variable, daisy() will calculate Gower’s distance
- Hierarchical clustering
- If you have at least one categorical variable, daisy() will calculate Gower’s distance
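# K-means: quantitative variables only
kmeans(data, centers = k)
# PAM and hierarchical clustering work from the distances computed by daisy()
library(cluster)
pam(daisy(data), k = k)
hclust(daisy(data))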
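As a fuller illustration of the three calls above, here is a minimal sketch on a small simulated dataset with one categorical and two quantitative variables. The data frame `toy`, its columns, and the choice of 3 clusters are assumptions made for illustration, not part of the course datasets.

```r
library(cluster)  # provides daisy() and pam()

# Hypothetical simulated data with mixed variable types (not the real mall data)
set.seed(253)
toy <- data.frame(
  gender = factor(sample(c("F", "M"), 60, replace = TRUE)),
  age    = round(runif(60, 18, 70)),
  income = round(rnorm(60, mean = 60, sd = 20))
)

# Gower's distance handles factor and numeric columns together
d <- daisy(toy, metric = "gower")

# PAM: medoids are actual observed cases
pam_fit <- pam(d, k = 3)
table(pam_fit$clustering)
toy[pam_fit$id.med, ]  # one representative row per cluster

# Hierarchical clustering on the same distances, cut into 3 clusters
hc <- hclust(d, method = "complete")
hc_clusters <- cutree(hc, k = 3)
plot(hc, labels = FALSE)

# K-means needs numeric input only, so use the standardized quantitative columns
km <- kmeans(scale(toy[, c("age", "income")]), centers = 3, nstart = 20)
table(km$cluster)
```

Note that categorical columns need to be stored as factors for daisy() to treat them as nominal, and that k-means starts from random centers, so setting a seed and using several starts (`nstart`) makes results more stable.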
- Insights
- Based on clustering, you’ll want to interpret/visualize the resulting clusters to gain insight
- Deliverable
- Create 1 graphic that demonstrates your insight
- Add it to this Shared Folder with your names in the filename
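One possible way to move from cluster labels to an interpretable graphic is sketched below. It uses R's built-in iris measurements as a stand-in dataset and assumes k = 3; swap in your own data, variables, and number of clusters.

```r
# Stand-in data: the numeric columns of R's built-in iris data
dat <- iris[, 1:4]

set.seed(253)
km <- kmeans(scale(dat), centers = 3, nstart = 20)
dat$cluster <- factor(km$cluster)

# Cluster profiles: mean of each variable (in original units) within each cluster
profiles <- aggregate(. ~ cluster, data = dat, FUN = mean)
print(profiles)

# One graphic: two variables, with points colored by cluster
plot(dat$Petal.Length, dat$Sepal.Width,
     col = dat$cluster, pch = 19,
     xlab = "Petal length", ylab = "Sepal width",
     main = "K-means clusters (k = 3)")
legend("topright", legend = levels(dat$cluster),
       col = seq_along(levels(dat$cluster)), pch = 19, title = "Cluster")
```

Comparing the profile table with the plot is a quick way to put a label on each cluster, for example "long petals, narrow sepals".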
Clustering in the Wild
To give you a taste of how these methods get used in “the wild” world of science, here are a few papers (quality varies):
- Image Segmentation (https://link.springer.com/article/10.1007/s11042-021-10594-9)
- Bacteria Clustering (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0002843)
- Document Clustering (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7790388/)
- Clustering ICD10 Diagnosis Codes (https://arxiv.org/abs/1909.00306)
- Clustering Activity Sequences (https://www.sciencedirect.com/science/article/abs/pii/S0968090X21000395?via%3Dihub)
R Coding Challenges
The best way to learn new things about R is to work on a data project.
- The goals of the project drive what code is needed.
- Learn new functions and tools as you need them.
What things have come up so far for you?
- What has been the most frustrating?
- When do you get stuck?
- What are you wanting to do with your data?
Besides class projects, you can practice visualizing data:
- TidyTuesday Challenges
- Check out David Robinson’s TidyTuesday screencasts