Homework 5

Due Tuesday, November 23 at midnight CST on Moodle

Deliverables: Each person should submit a PDF containing their responses for Reflections. If your team wants feedback, one group member should also submit the project work separately on behalf of the team.




Project Work

Look at the final requirements on the Final Project page. There is nothing you are required to submit for this homework assignment, but you should start on the following:

  • Explore 2 different classification models to answer your classification research question
  • Explore clustering with your data
  • Clean, organize, and consolidate code into one final Rmd file.

If you would like feedback on any part of the final deliverables, you may submit them with HW5, but this is not a requirement for HW5.





Portfolio Work

Length requirements: Detailed for each section below.

Organization: To help the instructor and preceptors grade, please organize your document with clear section headers and start new pages for each method. Thank you!

Deliverables: Continue writing your responses in the same Google Doc that you set up for Homework 1. Include the URL for that Google Doc in your submission.

Note: Some prompts below may seem very open-ended. This is intentional. Crafting good responses requires looking back through our material to organize the concepts in a coherent, thematic way, which is extremely useful for your learning.

Revisions:

  • Make any revisions desired to previous concepts. Important note: When making revisions, please change from “editing” to “suggesting” so that we can easily see what you’ve added to the document since we gave feedback (we will “accept” the changes when we give feedback). If you don’t do this, we won’t know to reread that section and give new feedback.

  • General guidance for past homeworks is available on Moodle (under the Solutions section). Look at it to guide your revisions. You can always ask for guidance in office hours as well.


New concepts to address:

K-Means Clustering:

  • Algorithmic understanding: Explain the general algorithm by showing how you would perform two iterations of k-means with k = 2 on the dataset below. The data has just 1 variable, x1, and the random initial cluster assignment is shown in the cluster column. Show your work: in particular, show the centroids computed for iterations 1 and 2 and the updated cluster assignments for iterations 1 and 2.

 x1   cluster
---- ---------
  1      1
  1      2
  3      2
  4      1
  5      1

  • Bias-variance tradeoff: (This is a prompt about clustering in general, but put your response in the K-Means section.) In clustering, we don’t quite have the same concepts of bias and variance as we do with supervised learning methods, but a similar type of tradeoff exists. Discuss the pros and cons of a high vs. a low number of clusters in terms of (1) ease of learning more about each cluster and (2) within-cluster homogeneity (closeness of cases within clusters). (5 sentences max.)

  • Parametric / nonparametric: SKIP

  • Scaling of variables: (This is a prompt that pertains to both k-means and hierarchical clustering, but put your response in the K-Means section.) Does the scale on which variables are measured matter for the performance of clustering? Why or why not? If scale does matter, how should this be addressed? (3 sentences max.)

  • Computational time: Consider a single round of the cluster reassignment step of k-means with \(n\) cases and \(k\) clusters. How many distance calculations are required in this step? Explain in at most 2 sentences.

  • Interpretation of output: (This is a prompt about clustering in general, but put your response in the K-Means section.) Describe data explorations we could use to interpret / learn more about the cluster assignments that clustering algorithms produce.
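
For reference, the two steps that k-means alternates between (compute centroids, then reassign cases) can be sketched in R roughly as follows. This is only a minimal illustration under assumptions: the values in `x` and the starting assignment in `cluster` are made up and are not the homework dataset, and the loop does not handle a cluster becoming empty.

```r
# Hypothetical 1-variable data and random initial assignment (NOT the homework data)
x <- c(2, 6, 7, 10, 11, 15)
cluster <- c(1, 2, 1, 2, 1, 2)

for (iter in 1:2) {
  # Step 1: the centroid of each cluster is the mean of its current members
  centroids <- as.numeric(tapply(x, cluster, mean))

  # Step 2: reassign each case to the cluster with the nearest centroid
  dists <- abs(outer(x, centroids, "-"))   # rows = cases, columns = centroids
  cluster <- apply(dists, 1, which.min)

  cat("Iteration", iter, "\n")
  cat("  centroids:  ", centroids, "\n")
  cat("  assignments:", cluster, "\n")
}
```

With these made-up values, the assignments settle at 1 1 1 2 2 2 after the first iteration, and the centroids computed in the second iteration are 5 and 12. In practice you would use `kmeans()`, which repeats these steps until the assignments stop changing.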


Hierarchical clustering:

  • Algorithmic understanding: We have a dataset with 4 cases, and the Euclidean distance between every pair of cases is shown below. (The column labeled 1 gives the distances of case 1 to cases 2, 3, and 4 from top to bottom.) Draw and explain the dendrogram that would result from single-linkage clustering. Clearly label which cases are at each leaf and the heights at which fusions occur. Show any intermediate work.

     |   1       2       3
-----|------- ------- -------
  2  |  0.69       
  3  |  1.23    0.55 
  4  |  0.94    1.39    1.75

  • Bias-variance tradeoff: How does the tree cutting height relate to the tradeoff you discussed in the K-Means section? (2 sentences max.)

  • Parametric / nonparametric: SKIP

  • Scaling of variables: SKIP

  • Computational time: Consider the very first step of hierarchical clustering in which all \(n\) cases are in their own cluster. How many distance calculations are required as a function of \(n\)? (Note: \(1 + 2 + \cdots + n = n(n+1)/2\).) Explain in at most 2 sentences.

  • Interpretation of output: SKIP
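
If you want to check a hand-drawn dendrogram, one possible approach in R is sketched below. The distances here are made up for illustration and are not the values in the table above; the cut height of 4.5 is likewise an arbitrary choice for the example.

```r
# Hypothetical pairwise distances for 4 cases (NOT the homework values)
d <- matrix(0, nrow = 4, ncol = 4)
# lower.tri() fills column by column: d21, d31, d41, d32, d42, d43
d[lower.tri(d)] <- c(2, 6, 10, 5, 9, 4)
d <- d + t(d)   # make the matrix symmetric

# Single linkage: distance between clusters = smallest pairwise distance
hc <- hclust(as.dist(d), method = "single")

hc$height   # heights at which fusions occur
plot(hc)    # dendrogram

# Cutting the tree at a chosen height gives cluster assignments
clusters <- cutree(hc, h = 4.5)
clusters
```

For these made-up distances, cases 1 and 2 fuse at height 2, cases 3 and 4 fuse at height 4, and the two resulting clusters fuse at height 5, so cutting at 4.5 leaves two clusters.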


Principal components analysis:

  • Algorithmic understanding: In no more than 4 sentences, summarize the goal of principal components analysis and how it allows us to perform dimension reduction. Use the following terms / ideas in your response: linear combination, variance.

  • Bias-variance tradeoff: How is dimension reduction related to the bias-variance tradeoff for some of the supervised methods we’ve covered? How is the use of the scree plot from PCA related to the tradeoff?

  • Parametric / nonparametric: SKIP

  • Scaling of variables: Does the scale on which variables are measured matter for the performance of this algorithm? Why or why not? If scale does matter, how should this be addressed when using this method? (3 sentences max.)

  • Computational time: SKIP

  • Interpretation of output: What information can we gain by looking at the loadings of the first few principal components? Explain in at most 3 sentences.
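
As a reminder of where loadings and the scree plot come from in R, here is a minimal sketch. It assumes the built-in `USArrests` dataset as a stand-in for your own data, and it standardizes the variables via `scale. = TRUE`, which is a choice made for this example.

```r
# USArrests ships with base R (datasets package); it stands in for your own data
pca <- prcomp(USArrests, scale. = TRUE)   # scale. = TRUE standardizes each variable first

# Loadings: the coefficients of the linear combination that defines each PC
round(pca$rotation[, 1:2], 2)

# Proportion of variance explained by each PC (the quantity shown in a scree plot)
pve <- pca$sdev^2 / sum(pca$sdev^2)
plot(pve, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")
```

Each column of `pca$rotation` holds the loadings for one principal component, and the entries of `pve` sum to 1 across all components.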




Reflection

Ethics: (optional) In case it’s of interest and you have the bandwidth for it, the article “Race,” Racism, and the Practice of Epidemiology is a very illuminating one.

Reflection: Write a short, thoughtful reflection about how things went the past few weeks. Feel free to use whichever prompts below resonate most with you, but don’t feel limited to these prompts.

  • How was Content Conversation 2 for you? What have you learned about yourself in these conversations?
  • How are class-related things going? Is there anything that you need from the instructor?
  • How is collaborative learning going in class? In project groups? Is there anything the instructor could help you with?
  • How is your work/life balance going? What do you want to make time for this next week?