Learning Goals

The goal of this course is for you to further develop general skills necessary for statistics and data science and to gain a working understanding of a set of machine learning algorithms.

Specific course topics and general skills are listed below. Use these to guide your synthesis of course material for your portfolio and project throughout the entire semester.

General Skills

Computational Thinking

  • Be able to perform the following tasks:

    • Decomposition: Break a task into smaller tasks to be able to explain the process to another person or computer
    • Pattern Recognition: Recognize patterns in tasks by noticing similarities and differences
    • Abstraction: Represent an idea or process in general terms so that you can use it to solve other problems that are similar in nature
    • Algorithmic Thinking: Develop a step-by-step strategy for solving a problem


Ethical Data Thinking

  • Identify ethical issues associated with applications of statistical machine learning in a variety of settings
  • Assess and critique the actions of individuals and organizations as they relate to the ethical use of data


Data Communication

  • In written and oral formats:

    • Inform and justify the data analysis and modeling process and the resulting conclusions with clear, organized, logical, and compelling details that adapt to the background, values, and motivations of the audience and to the context in which the communication occurs.

Collaborative Learning

  • Understand and demonstrate characteristics of effective collaboration (team roles, interpersonal communication, self-reflection, awareness of social dynamics, advocating for yourself and others).
  • Develop a common purpose and agreement on goals.
  • Be able to contribute questions or concerns in a respectful way.
  • Share and contribute to the group’s learning in an equitable manner.

Course Topics

Specific learning objectives for our course topics are listed below. Use these to guide your synthesis of video and reading material for specific topics.

Introduction to Statistical Machine Learning

  • Formulate research questions that align with regression, classification, or unsupervised learning tasks


Evaluating Regression Models

  • Create and interpret residuals vs. fitted and residuals vs. predictor plots to identify improvements in modeling and address ethical concerns
  • Interpret MSE, RMSE, MAE, and R-squared in a contextually meaningful way
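
For concreteness, here is a minimal sketch of these metrics computed by hand; the observed and predicted values are made up, and Python/NumPy is just one possible tool (the course's own tooling may differ).

```python
# A minimal sketch: computing regression evaluation metrics by hand
# on hypothetical observed and predicted values.
import numpy as np

y = np.array([3.0, 5.0, 7.5, 9.0])        # observed responses (made-up)
y_hat = np.array([2.5, 5.5, 7.0, 10.0])   # model predictions (made-up)

residuals = y - y_hat
mse = np.mean(residuals**2)                   # mean squared error
rmse = np.sqrt(mse)                           # same units as y
mae = np.mean(np.abs(residuals))              # mean absolute error
r2 = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)  # R-squared

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
```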


Overfitting and cross-validation

  • Explain why training/in-sample model evaluation metrics can provide a misleading view of true test/out-of-sample performance
  • Accurately describe all steps of cross-validation to estimate the test/out-of-sample version of a model evaluation metric (see the sketch after this list)
  • Explain the role CV plays in a predictive modeling analysis and its connection to overfitting
  • Explain the pros/cons of higher vs. lower k in k-fold CV in terms of sample size and computing time
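
As a reference point, here is one possible sketch of k-fold CV using scikit-learn's KFold splitter; the linear model, simulated data, and variable names are illustrative assumptions, not the course's required workflow.

```python
# A minimal sketch of 10-fold CV estimating test MSE for linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)  # simulated data

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_mses = []
for train_idx, test_idx in kf.split(X):
    fit = LinearRegression().fit(X[train_idx], y[train_idx])  # train on k-1 folds
    preds = fit.predict(X[test_idx])                          # predict held-out fold
    fold_mses.append(np.mean((y[test_idx] - preds) ** 2))

print(f"10-fold CV estimate of test MSE: {np.mean(fold_mses):.3f}")
# Higher k: larger training sets (less bias) but more model fits (more compute).
```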


Subset selection

  • Clearly describe the forward and backward stepwise selection algorithms and why they are examples of greedy algorithms (see the sketch after this list)
  • Compare best subset and stepwise algorithms in terms of optimality of output and computational time
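
To make the greedy structure concrete, here is a hypothetical by-hand sketch of forward stepwise selection; for simplicity it uses training RSS as the step criterion, though in practice the criterion might be CV error or an information criterion.

```python
# A by-hand sketch of forward stepwise selection: starting from the
# intercept-only model, greedily add the predictor that most reduces RSS.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 6))
y = 2 * X[:, 0] - 3 * X[:, 3] + rng.normal(size=80)  # only predictors 0, 3 matter

def rss(features):
    fit = LinearRegression().fit(X[:, features], y)
    return np.sum((y - fit.predict(X[:, features])) ** 2)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(X.shape[1]):
    # Greedy step: evaluate each remaining predictor, keep the single best one.
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)
    print(f"added predictor {best}, RSS = {rss(selected):.1f}")
```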


LASSO (shrinkage/regularization)

  • Explain how ordinary and penalized least squares are similar and different with regard to (1) the form of the objective function and (2) the goal of variable selection
  • Explain how the lambda tuning parameter affects model performance and how this is related to overfitting
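
The sketch below illustrates the shrinkage behavior with scikit-learn's Lasso, which names the lambda tuning parameter alpha; the simulated data and the grid of penalty values are arbitrary choices for illustration.

```python
# A minimal sketch: larger lambda shrinks LASSO coefficients and
# zeroes some out entirely (automatic variable selection).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=100)  # predictors 2-4 are pure noise

for lam in [0.01, 0.1, 1.0]:
    coefs = Lasso(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:<5} coefficients: {np.round(coefs, 2)}")
# Larger lambda -> sparser model: less variance but more bias (underfitting risk);
# tiny lambda approaches ordinary least squares (overfitting risk).
```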


KNN Regression and the Bias-Variance Tradeoff

  • Clearly describe / implement by hand the KNN algorithm for making a regression prediction (see the sketch after this list)
  • Explain how the number of neighbors relates to the bias-variance tradeoff
  • Explain the difference between parametric and nonparametric methods
  • Explain how the curse of dimensionality relates to the performance of KNN
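
A by-hand sketch of the prediction step, with a tiny made-up training set; Euclidean distance and an unweighted average of neighbors are assumptions, and other choices exist.

```python
# A by-hand sketch of KNN regression: predict with the mean response
# of the k nearest training points.
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # made-up data
y_train = np.array([1.2, 1.9, 3.2, 3.9, 5.1])

def knn_predict(x_new, k):
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to each training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    return y_train[nearest].mean()                   # average their responses

print(knn_predict(np.array([2.5]), k=2))  # small k: flexible, low bias, high variance
print(knn_predict(np.array([2.5]), k=5))  # large k: smoother, higher bias, lower variance
```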


Modeling Nonlinearity: Polynomial Regression and Splines

  • Explain the advantages of splines over global transformations and other types of piecewise polynomials
  • Explain how splines are constructed by drawing connections to variable transformations and least squares (see the sketch after this list)
  • Explain how the number of knots relates to the bias-variance tradeoff
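
One classic construction that makes the connection explicit is the truncated power basis: build new columns from x, then fit with ordinary least squares. The sketch below assumes a cubic spline with hand-picked knots on simulated data; packaged spline bases (e.g., natural splines) differ in details.

```python
# A sketch: a cubic spline as least squares on transformed columns of x.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, size=120))
y = np.sin(x) + rng.normal(scale=0.3, size=120)  # simulated nonlinear signal

knots = [2.5, 5.0, 7.5]  # more knots -> more flexibility (bias-variance tradeoff)
# Columns: x, x^2, x^3, plus one truncated cubic term (x - knot)_+^3 per knot.
B = np.column_stack([x, x**2, x**3] + [np.maximum(0, x - k) ** 3 for k in knots])

spline_fit = LinearRegression().fit(B, y)  # a spline is just OLS on these columns
print(np.round(spline_fit.coef_, 3))
```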


Local Regression and Generalized Additive Models

  • Clearly describe the local regression algorithm for making a prediction (see the sketch after this list)
  • Explain how the bandwidth (span) relates to the bias-variance tradeoff
  • Describe some different formulations for a GAM (how the arbitrary functions are represented)
  • Explain how to make a prediction from a GAM
  • Interpret the output from a GAM
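
To ground the algorithm, here is a by-hand sketch of a local linear fit at a single target point, loosely following the LOESS idea with a tricube kernel; the span values and simulated data are illustrative.

```python
# A by-hand sketch of local regression: weighted least squares around x0,
# with tricube weights that vanish outside the span's window.
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, size=100))
y = np.sin(x) + rng.normal(scale=0.3, size=100)  # simulated data

def local_fit(x0, span):
    k = int(span * len(x))                           # points inside the window
    dists = np.abs(x - x0)
    h = np.sort(dists)[k - 1]                        # bandwidth: distance to kth neighbor
    w = np.clip(1 - (dists / h) ** 3, 0, None) ** 3  # tricube weights
    Xmat = np.column_stack([np.ones_like(x), x])     # local *linear* model
    W = np.diag(w)
    beta = np.linalg.solve(Xmat.T @ W @ Xmat, Xmat.T @ W @ y)  # weighted least squares
    return beta[0] + beta[1] * x0

print(local_fit(5.0, span=0.2))  # small span: wiggly fit (low bias, high variance)
print(local_fit(5.0, span=0.8))  # large span: smooth fit (higher bias, lower variance)
```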


Logistic regression

  • Use a logistic regression model to make hard (class) and soft (probability) predictions (see the sketch after this list)
  • Interpret non-intercept coefficients from logistic regression models in the data context
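
The sketch below shows both prediction types with scikit-learn's LogisticRegression on simulated data; the 0.5 threshold behind the hard prediction is the library default, not a requirement.

```python
# A minimal sketch: soft (probability) and hard (class) predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)  # simulated classes

model = LogisticRegression().fit(X, y)
x_new = np.array([[0.5, -0.5]])

print(model.predict_proba(x_new))  # soft: estimated P(y=0) and P(y=1)
print(model.predict(x_new))        # hard: class label at the 0.5 threshold
# exp(coefficient) = multiplicative change in odds per 1-unit predictor increase.
print(np.exp(model.coef_))
```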


Evaluating classification models

  • Calculate (by hand from confusion matrices) and contextually interpret overall accuracy, sensitivity, and specificity (see the sketch after this list)
  • Construct and interpret plots of predicted probabilities across classes
  • Explain how a ROC curve is constructed and the rationale behind AUC as an evaluation metric
  • Appropriately use and interpret the no-information rate to evaluate accuracy metrics
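
A by-hand sketch from a hypothetical confusion matrix; the counts below are invented, and "positive" denotes the class of interest.

```python
# A by-hand sketch: accuracy, sensitivity, specificity, and the
# no-information rate from hypothetical confusion-matrix counts.
tp, fn = 40, 10   # true positives, false negatives
fp, tn = 20, 130  # false positives, true negatives
n = tp + fn + fp + tn

accuracy = (tp + tn) / n           # overall fraction correct
sensitivity = tp / (tp + fn)       # true positive rate
specificity = tn / (tn + fp)       # true negative rate
nir = max(tp + fn, fp + tn) / n    # accuracy of always guessing the majority class

print(f"accuracy={accuracy:.3f}  sensitivity={sensitivity:.3f}  "
      f"specificity={specificity:.3f}  no-information rate={nir:.3f}")
```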


Decision trees

  • Clearly describe the recursive binary splitting algorithm for tree building for both regression and classification
  • Compute the weighted average Gini index to measure the quality of a classification tree split (see the sketch after this list)
  • Compute the sum of squared residuals to measure the quality of a regression tree split
  • Explain how recursive binary splitting is a greedy algorithm
  • Explain how different tree parameters relate to the bias-variance tradeoff
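
For the Gini computation, here is a by-hand sketch scoring one hypothetical candidate split; the class counts are made up.

```python
# A by-hand sketch: the weighted average Gini index for a candidate split;
# smaller values mean purer child nodes.
import numpy as np

def gini(counts):
    p = np.array(counts) / sum(counts)
    return 1 - np.sum(p ** 2)

left, right = [8, 2], [3, 7]          # class counts (A, B) in each child node
n_left, n_right = sum(left), sum(right)
n = n_left + n_right

weighted = (n_left / n) * gini(left) + (n_right / n) * gini(right)
print(f"weighted average Gini index: {weighted:.3f}")
# Recursive binary splitting greedily picks the split that minimizes this score.
```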


Bagging and random forests

  • Explain the rationale for bagging
  • Explain the rationale for selecting a random subset of predictors at each split (random forests)
  • Explain how the size of the random subset of predictors at each split relates to the bias-variance tradeoff
  • Explain the rationale for and implement out-of-bag error estimation for both regression and classification (see the sketch after this list)
  • Explain the rationale behind the random forest variable importance measure and why it is biased towards quantitative predictors (in class)
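
A minimal sketch of OOB estimation with scikit-learn's RandomForestClassifier on simulated data; here max_features plays the role of the random predictor subset size, and all settings are illustrative.

```python
# A minimal sketch: out-of-bag (OOB) accuracy from a random forest.
# Each tree is evaluated on the cases left out of its bootstrap sample,
# so no separate test set is needed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # simulated classes

forest = RandomForestClassifier(
    n_estimators=500,
    max_features=2,    # size of the random predictor subset tried at each split
    oob_score=True,    # track OOB predictions while fitting
    random_state=0,
).fit(X, y)

print(f"OOB accuracy: {forest.oob_score_:.3f}")  # OOB error = 1 - OOB accuracy
```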


K-means clustering

  • Clearly describe / implement by hand the k-means algorithm (see the sketch after this list)
  • Describe the rationale for how clustering algorithms work in terms of within-cluster variation
  • Describe the tradeoff of more vs. fewer clusters in terms of interpretability
  • Implement strategies for interpreting / contextualizing the clusters
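
A by-hand sketch of the algorithm (Lloyd's algorithm) on two simulated blobs; the fixed iteration count and random initialization are simplifying assumptions.

```python
# A by-hand sketch of k-means: alternate between assigning points to the
# nearest centroid and moving each centroid to the mean of its points.
import numpy as np

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two blobs

k = 2
centroids = X[rng.choice(len(X), size=k, replace=False)]  # random starting centers
for _ in range(10):
    # Assignment step: label each point with its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: recompute each centroid as the mean of its cluster.
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# Total within-cluster variation: the quantity k-means tries to minimize.
wcv = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(k))
print(np.round(centroids, 2), f"within-cluster variation: {wcv:.1f}")
```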


Hierarchical clustering

  • Clearly describe / implement by hand the hierarchical clustering algorithm (see the sketch after this list)
  • Compare and contrast k-means and hierarchical clustering in their outputs and algorithms
  • Interpret cuts of the dendrogram for single and complete linkage
  • Describe the rationale for how clustering algorithms work in terms of within-cluster variation
  • Describe the tradeoff of more vs. fewer clusters in terms of interpretability
  • Implement strategies for interpreting / contextualizing the clusters
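
The sketch below uses SciPy's agglomerative clustering to contrast single and complete linkage and to cut the resulting dendrograms; the two simulated blobs and the choice of two clusters are illustrative.

```python
# A minimal sketch: hierarchical clustering with single vs. complete
# linkage, cutting each dendrogram to obtain two clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])  # two blobs

for method in ["single", "complete"]:
    Z = linkage(X, method=method)                    # build the full dendrogram
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut to get 2 clusters
    print(method, "cluster sizes:", np.bincount(labels)[1:])
```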


Principal Component Analysis

  • Explain the goal of dimension reduction and how this can be useful in a supervised learning setting
  • Interpret and use the information provided by principal component loadings and scores
  • Interpret and use a scree plot to guide dimension reduction
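
Finally, a minimal sketch of loadings, scores, and the explained-variance ratios a scree plot displays, using scikit-learn's PCA on simulated correlated predictors; standardizing first is a common convention, not a universal rule.

```python
# A minimal sketch: PCA loadings, scores, and scree-plot information.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
latent = rng.normal(size=(100, 1))
X = np.hstack([latent + rng.normal(scale=0.1, size=(100, 1)) for _ in range(4)])

X_std = StandardScaler().fit_transform(X)  # put predictors on a common scale
pca = PCA().fit(X_std)

print(np.round(pca.components_[0], 2))             # loadings of the first PC
print(np.round(pca.transform(X_std)[:3, 0], 2))    # PC1 scores for first 3 cases
print(np.round(pca.explained_variance_ratio_, 3))  # heights of the scree plot
```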