Homework 4

Due Thursday, November 4 at 11:59pm CST on Moodle

Deliverables: Submit a single PDF containing your responses to the Reflection prompts, and have one group member submit the project work for the team.




Project Work

Open Communication about Expectations and Goals

Before you start working, plan to have a 20-minute conversation to discuss the following prompts. Make sure each person gets a chance to answer the questions and is fully heard by the group.

  • What do you hope to gain from doing this project in this class?
  • What group roles (project manager, communication director, resource manager, reflector) do you feel like you’ve taken on so far? Which role(s) do you want to take on going forward?
  • What do you want to change about how you’re approaching this project going forward?

Delegation of Analysis

  • Divide the coding tasks among team members.
    • No one member should be doing all of the coding.
  • Decide on a deadline (ahead of the due date) for sharing completed work with the group.

Required Analysis

Start working on building a classification model to answer a research question about your data set. For HW4, only include your classification model work (keep your regression model work in another file).

For this homework,

  1. Specify the research question for a classification task.

  2. Try to implement at least 2 different classification methods to answer your research question (a starter sketch follows this list).

  3. Reflect on the information gained from these two methods and how you might justify your chosen method to others.
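
To make step 2 concrete, here is a minimal sketch in R of fitting two different classification methods. It assumes a data frame named my_data with a categorical (factor) outcome y; both names are hypothetical placeholders, so substitute your own data set and variables, and adapt the code to whatever modeling workflow your group is using.

    # Hypothetical setup: `my_data` is your cleaned data set and `y` is
    # the factor outcome tied to your research question.
    library(rpart)          # single classification tree
    library(randomForest)   # ensemble of bagged/randomized trees

    # Method 1: a classification tree
    tree_mod <- rpart(y ~ ., data = my_data, method = "class")

    # Method 2: a random forest
    set.seed(452)
    rf_mod <- randomForest(y ~ ., data = my_data)

    printcp(tree_mod)   # cross-validated error across tree sizes
    print(rf_mod)       # OOB error estimate and confusion matrix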

Keep in mind that the final project will require you to complete the pieces below. Use this as a guide for your work, but don’t try to accomplish everything for HW4:

  • Classification - Methods
    • Indicate at least 2 different methods used to answer your classification research question.
    • Describe what you did to evaluate the models explored.
      • Indicate how you estimated quantitative evaluation metrics.
    • Describe the goals / purpose of the methods used in the overall context of your research investigations.
  • Classification - Results
    • Summarize your final model and justify your model choice (see below for ways to justify your choice).
      • Compare the different classification models tried in light of evaluation metrics, variable importance, and data context.
      • Display evaluation metrics for different models in a clean, organized way. This display should include both the estimated metric and its standard deviation (see the cross-validation sketch after this list). The SD won’t be available from OOB error estimation; if using OOB, don’t worry about reporting it.
      • Broadly summarize conclusions from looking at these evaluation metrics and their measures of uncertainty.
  • Classification - Conclusions
    • Interpret evaluation metric(s) for the final model in context. Does the model show an acceptable amount of error?
    • If using OOB error estimation, display the test (OOB) confusion matrix, and use it to interpret the strengths and weaknesses of the final model.
    • Summarization should show evidence of acknowledging the data context in thinking about the sensibility of these results.
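
For the metric display described above, one approach (sketched under the same hypothetical my_data / y assumptions as before) is to estimate a metric and its standard deviation across cross-validation folds by hand:

    # Estimate accuracy and its SD via 10-fold cross-validation
    library(rpart)

    set.seed(452)
    k <- 10
    folds <- sample(rep(1:k, length.out = nrow(my_data)))
    acc <- numeric(k)

    for (i in 1:k) {
      train_fold <- my_data[folds != i, ]
      test_fold  <- my_data[folds == i, ]
      fit  <- rpart(y ~ ., data = train_fold, method = "class")
      pred <- predict(fit, newdata = test_fold, type = "class")
      acc[i] <- mean(pred == test_fold$y)   # accuracy on the held-out fold
    }

    c(estimate = mean(acc), sd = sd(acc))   # report both numbers

Repeating this for each model and collecting the results in a small table (one row per model, with columns for the estimate and its SD) makes for a clean, organized display; for OOB-based models, report the OOB error alone, as noted above.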





Portfolio Work

Length requirements: Detailed for each section below.

Organization: To help the instructor and preceptors grade, please organize your document with clear section headers and start a new page for each method. Thank you!

Deliverables: Continue writing your responses in the same Google Doc that you set up for Homework 1. Include that URL for the Google Doc in your submission.

Note: Some prompts below may seem very open-ended. This is intentional. Crafting good responses requires looking back through our material to organize the concepts in a coherent, thematic way, which is extremely useful for your learning.

Revisions:

  • Make any desired revisions to previous concepts. Important note: When making revisions, please change from “editing” to “suggesting” so that we can easily see what you’ve added to the document since we gave feedback (we will “accept” the changes when we give feedback). If you don’t do this, we won’t know to reread that section and give new feedback.

  • General guidance for past homeworks is available on Moodle (under the Solutions section); use it to guide your revisions. You can always ask for guidance in office hours as well.


New concepts to address:

Decision trees:

  • Algorithmic understanding:

    • Consider a dataset with two predictors: x1 is categorical with levels A, B, or C. x2 is quantitative with integer values from 1 to 100. How many different splits must be considered when recursive binary splitting attempts to make a split? Explain. (2 sentences max.)
    • Explain the “recursive”, “binary”, and “splitting” parts of the recursive binary splitting algorithm. Make sure to discuss the concept of node (im)purity and how it is measured for classification and regression trees.
  • Bias-variance tradeoff: What tuning parameters control the performance of the method? How do low/high values of the tuning parameters relate to bias and variance of the learned model? (3 sentences max.)

    • Note: There are several specific named tuning parameters as part of the rpart method. Don’t focus on these. Just discuss the broader feature of trees that these specific parameters affect.
  • Parametric / nonparametric: Where (roughly) does this method fall on the parametric-nonparametric spectrum, and why? (3 sentences max.)

  • Scaling of variables: Does the scale on which variables are measured matter for the performance of this algorithm? Why or why not? If scale does matter, how should this be addressed when using this method? (3 sentences max.)

  • Computational time: Recursive binary splitting does not find the overall optimal sequence of splits for a tree. What type of behavior is this? What method have we seen before that also exhibits this type of behavior? Briefly explain the parallels between these methods and what implications this has for computational time. (5 sentences max.)

  • Interpretation of output: Explain the rationale behind the variable importance measures that decision trees provide. (4 sentences max; a code sketch for extracting these measures follows below.)
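
If it helps to ground the interpretation-of-output prompt, here is a hedged sketch (same hypothetical my_data / y as in the Project Work section) of where rpart reports its variable importance measures:

    library(rpart)

    tree_mod <- rpart(y ~ ., data = my_data, method = "class")

    # Total improvement in node purity credited to each predictor, summed
    # over all splits (including surrogate splits) in which it appears
    tree_mod$variable.importance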

Bagging & Random Forests:

  • Algorithmic understanding: Explain the rationale for extending single decision trees to bagging models and then to random forest models. What specific improvements to predictive performance are being sought? (5 sentences max.)

  • Bias-variance tradeoff: What tuning parameters control the performance of the method? How do low/high values of the tuning parameters relate to bias and variance of the learned model? (3 sentences max.)

  • Parametric / nonparametric: Where (roughly) does this method fall on the parametric-nonparametric spectrum, and why? (3 sentences max.)

  • Scaling of variables: Does the scale on which variables are measured matter for the performance of this algorithm? Why or why not? If scale does matter, how should this be addressed when using this method? (3 sentences max.)

  • Computational time: Explain why cross-validation is computationally intensive for many-tree algorithms. What method do we have to reduce this computational burden, and why is it faster? (5 sentences max.)

  • Interpretation of output: Explain the rationale behind the variable importance measures that random forest models provide. (4 sentences max; see the sketch below.)
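
And the analogous output for a random forest, under the same hypothetical setup:

    library(randomForest)

    set.seed(452)
    rf_mod <- randomForest(y ~ ., data = my_data, importance = TRUE)

    rf_mod$confusion     # OOB confusion matrix (no separate CV loop needed)
    importance(rf_mod)   # permutation- and impurity-based importance
    varImpPlot(rf_mod)   # plot both importance measures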




Reflection

Ethics: Read the article How to Support your Data Interpretations. Write a short (roughly 250 words), thoughtful response about the ideas that the article brings forth. Which pillar(s) do you think is/are hardest to do well for groups that rely on data analytics, and why?

Reflection: Write a short, thoughtful reflection about how things went this week. Feel free to use whichever prompts below resonate most with you, but don’t feel limited to these prompts.

  • How are class-related things going? Is there anything that you need from the instructor?
  • How is collaborative learning going in class? In your project groups? Is there anything I could help you with?
  • How is your work/life balance going? What do you want to make time for this next week?