Final Project

Requirements

You will analyze a dataset using a regression analysis, a classification analysis, and an unsupervised learning analysis.

Collaboration:

  • You will work in teams of 2-3 members.
  • The weekly homework assignments will note whether that week's work should be submitted individually or by just one team member.
  • At the end of the course, your group will complete a required synthesis of the weekly homework investigations.



Final deliverables:

Only one team member needs to submit these materials to Moodle. A draft of the HTML file is due on December 9th; the final HTML code appendix file and the video presentation are due Thursday, December 16, at midnight CST.

  • Submit a final knitted HTML file of a CODE APPENDIX (it must knit without errors) that is CLEARLY organized and easy to navigate, plus the corresponding Rmd file used to create the HTML file.

    • To ensure code is run and output is provided, make sure that you have the following code in the first R chunk at the top of the Rmd file.
    knitr::opts_chunk$set(echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE, tidy = TRUE)
  • A 10-15 minute video presentation of your project that addresses the items in the Expectations below. (Recording the presentation over Zoom is a good option for creating the video. You can record to your computer or to the cloud.)

    • Upload the video to Google Drive & share a link in Moodle (with Mac Viewer sharing)
    • All team members should have an equal speaking role in the presentation.

In order to record your presentation,

  1. Start a Zoom meeting and invite your project mates.
  2. One of you should share your screen with the presentation slides (recommended: Google Slides or PowerPoint).
  3. Please have everyone turn their video on so that we can see who is speaking.
  4. When you are ready to start, the host of the meeting (whoever started the meeting) can click Record on this Computer. I highly recommend that someone else start a timer to make sure you keep the presentation to 15 minutes max.
  5. Start presenting!
  6. You can pause the recording as needed and then resume it when you are ready to continue.
  7. When you have finished recording, you can press Stop Recording. When you end the meeting, the recording (an mp4 file) will be downloaded to the computer of the individual who pressed Record.
  8. Upload the video to Google Drive & share a link in Moodle



Expectations

In your final presentation, you should address the following things:

  • Data context
    • Clearly describe what the cases in the final clean dataset represent.
    • Broadly describe the variables used in your analyses.
    • Who collected the data? When, why, and how? Answer as much of this as the available information allows.
  • Research questions
    • Research question(s)/motivation for the regression task; make clear the outcome variable and its units.
    • Research question(s)/motivation for the classification task; make clear the outcome variable and its possible categories.
    • Research question(s)/motivation for the unsupervised learning task.
  • Regression: Methods
    • Describe the models used.
    • Describe what you did to evaluate models.
      • Indicate how you estimated quantitative evaluation metrics.
      • Indicate what plots you used to evaluate models.
    • Describe the goals / purpose of the methods used in the overall context of your research investigations.
  • Regression: Results
    • Summarize your final model and justify your model choice (see below for ways to justify your choice).
      • Compare the different models in light of evaluation metrics, plots, variable importance, and data context.
      • Display evaluation metrics for the different models in a clean, organized way. This display should include both the estimated CV metric and its standard deviation (see the regression CV sketch after this list).
      • Broadly summarize conclusions from looking at these CV evaluation metrics and their measures of uncertainty.
      • Summarize conclusions from residual plots of initial models (you don't have to display them, though).
    • Show and interpret some representative examples of residual plots for your final model (see the residual-plot sketch after this list). Does the model show acceptable results in terms of any systematic biases?
  • Regression: Conclusions
    • Interpret your final model (show plots of estimated non-linear functions, or slope coefficients) for important predictors, and provide some general interpretations of what you learn from these.
    • Interpret evaluation metric(s) for the final model in context with units. Does the model show an acceptable amount of error?
    • Your summary should show that you considered the data context when thinking about whether these results are sensible.
  • Classification: Methods
    • Indicate at least 2 different methods used to answer your classification research question.
    • Describe what you did to evaluate the models explored.
      • Indicate how you estimated quantitative evaluation metrics.
    • Describe the goals / purpose of the methods used in the overall context of your research investigations.
  • Classification: Results
    • Summarize your final model and justify your model choice (see below for ways to justify your choice).
      • Compare the different classification models tried in light of evaluation metrics, variable importance, and data context.
      • Display evaluation metrics for the different models in a clean, organized way. This display should include both the estimated metric and its standard deviation. (The SD won't be available from OOB error estimation; if using OOB, don't worry about reporting it.)
      • Broadly summarize conclusions from looking at these evaluation metrics and their measures of uncertainty.
  • Classification: Conclusions
    • Interpret evaluation metric(s) for the final model in context. Does the model show an acceptable amount of error?
      • If using OOB error estimation, display the test (OOB) confusion matrix and use it to interpret the strengths and weaknesses of the final model (see the random forest sketch after this list).
    • Your summary should show that you considered the data context when thinking about whether these results are sensible.
  • Unsupervised Learning: Clustering
    • Choose one method for clustering
    • Justify the choice of features included in a distance measure based on the research goals
    • Justify the choice of k and summarize the resulting clusters
      • Interpret the clusters qualitatively
      • Evaluate clusters quantitatively (kmeans: within-cluster sum of squares; pam: silhouette; hclust: height of the cut on the dendrogram); see the clustering sketch after this list.
      • If appropriate, show visuals to justify your choices.
    • Summarize what information you gain from the clustering in context (tell a story)
  • Code
    • Knitted, error-free HTML and corresponding Rmd file submitted
    • Code corresponding to all analyses above is present and correct
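

Example code sketches

The sketches below are optional starting points for a few of the items above. They use placeholder names (dat, y, class_y, feat, final_mod) and example packages (caret, ggplot2, randomForest, cluster); you are not required to use these packages, and you should adapt everything to your own data, variables, and modeling choices.

Regression CV metrics (referenced from Regression: Results). A minimal sketch, assuming the caret package and a cleaned data frame dat with a quantitative outcome y; it fits two illustrative models with the same 10-fold CV scheme and collects each model's CV estimate and standard deviation in one table.

    library(caret)

    set.seed(123)
    cv_ctrl <- trainControl(method = "cv", number = 10)

    # Two illustrative candidate models fit with the same 10-fold CV splits
    mod_lm  <- train(y ~ ., data = dat, method = "lm",  trControl = cv_ctrl)
    mod_knn <- train(y ~ ., data = dat, method = "knn", trControl = cv_ctrl,
                     preProcess = c("center", "scale"))

    # Keep only the best tuning-parameter row for the KNN model
    best_knn <- mod_knn$results[which.min(mod_knn$results$RMSE), ]

    # One table with the CV estimate and its SD (across folds) for each model
    metric_cols <- c("RMSE", "RMSESD", "MAE", "MAESD")
    cv_table <- rbind(
      data.frame(model = "lm",  mod_lm$results[, metric_cols]),
      data.frame(model = "knn", best_knn[, metric_cols])
    )
    knitr::kable(cv_table, digits = 2)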
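
Residual plots for the final regression model (referenced from Regression: Results). A minimal sketch, assuming the ggplot2 package, a fitted final model final_mod that works with predict(), and the data frame dat with outcome y; adjust the names to your analysis.

    library(ggplot2)

    # Residuals vs fitted values for the final model
    resid_df <- data.frame(
      fitted   = predict(final_mod, newdata = dat),
      residual = dat$y - predict(final_mod, newdata = dat)
    )

    # Look for systematic curvature or uneven spread around the dashed zero line
    ggplot(resid_df, aes(x = fitted, y = residual)) +
      geom_point(alpha = 0.5) +
      geom_smooth(se = FALSE) +
      geom_hline(yintercept = 0, linetype = "dashed") +
      labs(x = "Fitted values", y = "Residual")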
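
OOB confusion matrix and variable importance (referenced from Classification: Conclusions). A minimal sketch, assuming the randomForest package and a data frame dat with a factor outcome class_y; the names are placeholders, and a random forest is only one possible final model.

    library(randomForest)

    set.seed(123)
    rf_mod <- randomForest(class_y ~ ., data = dat, importance = TRUE)

    # Overall OOB error rate after all trees have been grown
    rf_mod$err.rate[nrow(rf_mod$err.rate), "OOB"]

    # OOB (test-like) confusion matrix: rows = true classes, columns = predictions,
    # plus a class.error column for each class
    rf_mod$confusion

    # Variable importance to support interpretation of the final model
    varImpPlot(rf_mod)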
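
Quantitative cluster evaluation (referenced from Unsupervised Learning: Clustering). A minimal sketch, assuming the cluster package and a data frame feat containing only the numeric features chosen for your distance measure; the numbers of clusters shown are illustrative only.

    library(cluster)

    # Scale the chosen features so no single variable dominates the distance
    feat_scaled <- scale(feat)

    # k-means: total within-cluster sum of squares across a range of k (elbow plot)
    set.seed(123)
    wss <- sapply(2:10, function(k) kmeans(feat_scaled, centers = k, nstart = 25)$tot.withinss)
    plot(2:10, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

    # PAM: average silhouette width for one candidate k
    pam_fit <- pam(feat_scaled, k = 3)
    pam_fit$silinfo$avg.width

    # Hierarchical clustering: dendrogram, then cut the tree to get cluster labels
    hc <- hclust(dist(feat_scaled), method = "complete")
    plot(hc)
    clusters <- cutree(hc, k = 3)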