Motivating Question

Where are we?

Within the supervised learning framework, we have a categorical response variable \(y\) and a set of potential predictors \(x\). For example:

- \(y\) = vote / don’t vote, \(x\) = (age, party id, …)
- \(y\) = spam / not spam, \(x\) = (# of $, # of !, …)
- \(y\) = human / car / plant, \(x\) = (speed, shape, …)

We have the following goals:

Build a classification model
We’ll use the following techniques to build classification models of \(y\) from predictors \(x\):
- parametric techniques
  - logistic regression (with or without LASSO!)
  - support vector machines (optional)
- nonparametric techniques
  - K Nearest Neighbors (KNN)
  - classification trees
  - random forests and bagging

Evaluate the quality of a classification model
We’ll use the following metrics and tools to evaluate the quality of a classification model:
- overall accuracy, sensitivity, & specificity
  We can estimate these metrics using either in-sample or cross-validation techniques.
- ROC (receiver operating characteristic) curves
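
To make the evaluation metrics above concrete, here is a minimal sketch of how overall accuracy, sensitivity, and specificity could be computed by hand for a binary response. The function name `classification_metrics`, its `positive` argument, and the spam labels are all hypothetical, invented for illustration; they are not taken from any particular library or data set.

```python
def classification_metrics(y_true, y_pred, positive):
    """Return overall accuracy, sensitivity, and specificity.

    Assumes a binary response in which both classes appear at least once
    in y_true (otherwise a denominator below would be zero).
    """
    pairs = list(zip(y_true, y_pred))

    # Tally the four cells of the confusion matrix
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    return {
        # Share of all cases that were classified correctly
        "accuracy": (tp + tn) / len(pairs),
        # Share of actual positives that were classified as positive
        "sensitivity": tp / (tp + fn),
        # Share of actual negatives that were classified as negative
        "specificity": tn / (tn + fp),
    }


# Hypothetical observed classes and model predictions (made up for illustration)
y_true = ["spam"] * 4 + ["not spam"] * 6
y_pred = ["spam", "spam", "spam", "not spam",
          "not spam", "not spam", "not spam", "not spam", "spam", "spam"]

metrics = classification_metrics(y_true, y_pred, positive="spam")
print(metrics)
```

If `y_pred` comes from the same data used to build the model, these are in-sample metrics. Computing them on held-out folds instead gives the cross-validated versions. An ROC curve repeats this calculation across a range of probability thresholds and plots sensitivity against 1 - specificity.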