Topic 17 Catch up Clustering Day

Learning Goals

  • Implement k-means and hierarchical clustering and interpret in their outputs and algorithms
  • Synthesize and apply concepts covered so far on real data


Slides from today are available here.




Real Data Clustering

As a group, choose one of the following three datasets to work with:

  1. Wine Attributes (download here)
  • 178 Italian wines were analyzed
  • Variables (from Chemical Analysis)
    • Alcohol
    • Malic acid
    • Ash
    • Alcalinity of ash
    • Magnesium
    • Total phenols
    • Flavanoids
    • Nonflavanoid phenols
    • Proanthocyanins
    • Color intensity
    • Hue
    • OD280/OD315 of diluted wines
    • Proline
library(readr)
wine <- read_csv('wine.csv')
  1. Mall Customers (download here)
  • 200 individuals
  • Variables
    • Binary Gender
    • Age
    • Annual Income (in $1000’s)
    • Spending Score (summary of buying behavior)
library(readr)
customers <- read_csv('mall_customers.csv')
  1. Credit Card Clients (download here)
  • Almost 9000 credit card holders
  • Variables based on 6 months of time
    • CUSTID: Identification of Credit Card holder
    • BALANCE : Balance amount left in their account to make purchases
    • BALANCEFREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
    • PURCHASES : Number of purchases made from account
    • ONEOFFPURCHASES : Maximum purchase amount done in one-go
    • INSTALLMENTSPURCHASES : Amount of purchase done in installment
    • CASHADVANCE : Cash in advance given by the user
    • PURCHASESFREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
    • ONEOFFPURCHASESFREQUENCY : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
    • PURCHASESINSTALLMENTSFREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
    • CASHADVANCEFREQUENCY : How frequently the cash in advance being paid
    • CASHADVANCETRX : Number of Transactions made with “Cash in Advanced”
    • PURCHASESTRX : Number of purchase transactions made
    • CREDITLIMIT : Limit of Credit Card for user
    • PAYMENTS : Amount of Payment done by user
    • MINIMUM_PAYMENTS : Minimum amount of payments made by user
    • PRCFULLPAYMENT : Percent of full payment paid by user
    • TENURE : Tenure of credit card service for user
library(readr)
credit <- read_csv('creditcard.csv')

Your Goal

  • Goal: Cluster the data to discover insight and patterns in the data

  • Available Methods

    • K-means with all quantitative variables
    • Partitioning around Medoids (pam) as a robust version of K-means
      • If you have at least one categorical variable, daisy() will calculate Gower’s distance
    • Hierarchical clustering
      • If you have at least one categorical variable, daisy() will calculate Gower’s distance
kmeans(data, centers = k)

library(cluster)
pam(daisy(data), k = k)

hclust(daisy(data))
  • Insights
    • Based on clustering, you’ll want to interpret/visualize the resulting clusters to gain insight
  • Deliverable
    • Create 1 graphic that demonstrates your insight
    • Add it to this Shared Folder with your names in the filename

Clustering in the Wild

To give you a taste of how these methods get use in “the wild” world of science, here are a few papers (quality varies):

R Coding Challenges

The best way to learn new things about R is to work on a data project.

  • The goals drive what code is needed.
  • Learn them as you need them.

What things have come up so far for you?

  • What has been the most frustrating?
  • When do you get stuck?
  • What are you wanting to do with your data?

Besides class projects, you can practice visualizing data: