Homework 7: Exploratory data analysis

Author

PUT YOUR NAME HERE


DIRECTIONS





Context

The running_112.csv data file posted in the Homework 7 link on Moodle contains data on people who have run in the annual 10-mile “Cherry Blossom” race in Washington, D.C. Our general goal will be to use these data to explore the broad research question:

What’s the relationship between running time and age?

You will address this question through guided EDA. That is, I’ll tell you what steps to take in this process. The steps provided are merely one approach, not the only approach. With that, pay special attention to:

  • What follow-up questions you have along the way / how your EDA might deviate from that here.
  • The general vibe behind the EDA process.





Exercise 1: Import and get to know the data

The first step in our EDA is to import and get to know some basics about the data structure.

Part a

Import the running_112.csv data into RStudio. NOTE: This data is a modified version of the Cherry data in the mdsr package, and contains different info than the running data we’ve used elsewhere. You must use a local copy of the provided running_112.csv dataset for these exercises. You may not import this dataset from the internet, an R package, etc.

Check out and familiarize yourself with the data in a View() window. Included in each row is data on:

  • name.yob = name and year of birth (yob) for the runner
  • sex = binary sex
  • net = net running time to complete the 10-mile race, in minutes
  • year = year of the race
  • previous = how many times the runner participated in the years from 1999 to 2008, previous to this race
  • nruns = number of times the runner participated in the years from 1999 to 2008
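As a minimal sketch of what a local import can look like, the code below writes a tiny made-up file to a temporary folder and reads it back in. Everything in it is hypothetical: the toy rows, the `toy_running.csv` file name, and the assumed look of `name.yob`. For the homework, you'd point `read.csv()` at wherever you saved running_112.csv.

```r
# Toy stand-in for running_112.csv: three MADE-UP rows with the documented columns.
# ASSUMPTION: the "name then birth year" look of name.yob is a guess, not taken from the real file.
toy <- data.frame(
  name.yob = c("a runner 1970", "b runner 1985", "c runner 1962"),
  sex      = c("F", "M", "F"),
  net      = c(92.5, 78.1, 101.3),
  year     = c(2005, 2006, 2006),
  previous = c(1, 0, 3),
  nruns    = c(3, 1, 5)
)

# Save the toy data to a temporary local file (hypothetical file name)
toy_path <- file.path(tempdir(), "toy_running.csv")
write.csv(toy, toy_path, row.names = FALSE)

# Import from a local path
running_toy <- read.csv(toy_path)

# Open the spreadsheet-style viewer only when working interactively
if (interactive()) View(running_toy)
```

Wrapping `View()` in `if (interactive())` is a design choice for the sketch: it keeps the viewer out of the way when the document is rendered.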

Part b

Run some code below that helps you familiarize yourself with the basics of this data. Mainly, after this exercise you should have a sense of:

  • how many data points you have
  • what the first several rows of the dataset look like
  • a quick summary() of each variable in the dataset. (Each quantitative variable should be summarized with some simple summary statistics (eg: mean), and each categorical variable should be summarized by its categories and the number of observations per category.)
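Here's a small sketch of those three checks on a made-up data frame (the values are invented, not from running_112.csv). One thing it illustrates: `summary()` only reports category counts when a categorical variable is stored as a factor, so the sketch converts `sex` first. Whether your imported data needs this conversion depends on how you read it in.

```r
# MADE-UP toy data, not the real running data
toy <- data.frame(
  sex  = c("F", "M", "F", "M"),
  net  = c(92.5, 78.1, 101.3, 85.0),
  year = c(2005, 2005, 2006, 2006)
)

# How many data points (rows)?
n_points <- nrow(toy)

# What do the first several rows look like?
first_rows <- head(toy, 3)

# Store the categorical variable as a factor so summary() counts its categories
toy$sex <- factor(toy$sex)

# Quantitative columns get summary statistics; factor columns get counts per category
summary(toy)

# Counts per category for a single factor
sex_counts <- summary(toy$sex)
```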

Part c

Reflect on what you’ve noticed about the data thus far. For example: Are there any interesting patterns? Anything that looks like it will need to be cleaned before addressing our research questions? Any obvious outliers? (The only wrong answers here are answers that lack reflection about data context.)

YOUR REFLECTION:





Exercise 2: Prep the variables we’ll need

Recall that our general research question is about the relationship between running time and age. In this exercise, prep any variables you’ll need for this analysis and show the head() of your new dataset to illustrate your results. NOTE: After completing this exercise, you should have the same number of rows that you started with, but you might have a different number of columns.
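If you're unsure how to pull a number out of a text variable, the sketch below shows one possible approach on made-up values. It rests on two assumptions that you should check against the real data: that the birth year sits in the last four characters of `name.yob`, and that race year minus birth year is an acceptable approximation of age.

```r
# MADE-UP toy data; the name.yob layout is an ASSUMPTION to verify in the real file
toy <- data.frame(
  name.yob = c("a runner 1970", "b runner 1985"),
  year     = c(2005, 2006)
)

# ASSUMPTION: the birth year is the last four characters of name.yob
n_char  <- nchar(toy$name.yob)
toy$yob <- as.numeric(substr(toy$name.yob, n_char - 3, n_char))

# Approximate age at the time of the race
toy$age <- toy$year - toy$yob

# Same number of rows as before, two more columns
head(toy)
```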





Exercise 3: Check out the most important variables, alone…& do more cleaning

Before examining the relationship between running time and age, we should check these 2 variables out on their own in order to understand their scale, range, typical outcomes, etc.

Part a

Construct a plot of age. You don’t need to write anything, but take note of the range, shape, and typical outcome of the age distribution among runners.

Part b

Similarly, construct and think about the plot of running time.

Part c

Follow-up observation: The plot in part b revealed a data point that needs to be removed from our analysis. (If this isn’t clear yet, look back at the numerical summaries in exercise 1 and/or move on to the next exercise and then come back.) Do 3 things here:

  • Show all data points for the runner associated with this odd data point.
  • Explain what you think the odd data point is supposed to be.
  • Explain why removing this particular data point isn’t a deceptive practice (i.e. we’re not being dishonest).
# Show all data points on the runner of interest 
# (There should be 5 data points!)

YOUR EXPLANATION:

Part d

Permanently remove the data point in question from your dataset. You shouldn’t use it in any of the remaining exercises! Be careful in this process to not remove more than 1 data point. Confirm that your new dataset has 41127 data points, only 1 fewer than your original.
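To see how an OR condition can drop exactly one row, here's a sketch on made-up data. It assumes the dplyr package is installed, and the `runner` column and its values are hypothetical. The idea: a row is kept if it fails at least one of the identifying conditions, so only the row that matches all of them is removed.

```r
library(dplyr)

# MADE-UP toy data with a hypothetical runner identifier
toy <- data.frame(
  runner = c("a", "a", "b", "b"),
  year   = c(2005, 2006, 2005, 2006),
  net    = c(90, 91, 80, 82)
)

# Keep a row if it is NOT runner "a" OR it is NOT from 2006.
# Only the single row that is both runner "a" and from 2006 gets dropped.
toy_clean <- toy %>%
  filter(runner != "a" | year != 2006)

# Confirm that exactly one row was removed
nrow(toy)        # 4
nrow(toy_clean)  # 3
```

Note that filtering on just one of the two conditions (for example, `runner != "a"` alone) would remove every row for that runner, which is the mistake to watch out for here.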

# Remove the data point
# HINT: "a | b" indicates condition "a" OR "b" (or look to Google for other approaches)


# Confirm that your dataset now has 41127 data points





Exercise 4: Simple calculations & aggregate explorations

In studying the relationship between running time and age, let’s start by obtaining and plotting some simple calculations.

Part a

In a single plot, plot the fastest running time by age, the average running time by age, and the slowest running time by age.
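One possible shape for this kind of plot, sketched on made-up data: first compute the three summaries per group, then draw one line per summary. This assumes the dplyr and ggplot2 packages are installed, and the `fastest`, `average`, and `slowest` column names are just illustrative choices.

```r
library(dplyr)
library(ggplot2)

# MADE-UP toy data, not the real running data
toy <- data.frame(
  age = c(30, 30, 30, 40, 40, 50),
  net = c(80, 90, 100, 85, 95, 110)
)

# One row per age, with the minimum, mean, and maximum time at that age
by_age <- toy %>%
  group_by(age) %>%
  summarize(
    fastest = min(net),
    average = mean(net),
    slowest = max(net)
  )

# One line layer per summary, all sharing the same x-axis
p <- ggplot(by_age, aes(x = age)) +
  geom_line(aes(y = fastest), color = "blue") +
  geom_line(aes(y = average)) +
  geom_line(aes(y = slowest), color = "red") +
  labs(x = "age", y = "net running time (minutes)")
p
```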

Part b

Provide a list of 3 things that your plot from Part a indicates about the relationship between running time and age.

YOUR ANSWER:

  • put thing 1 here

  • put thing 2 here

  • put thing 3 here

Part c

Follow-up observation: I noticed that there’s more volatility in the fastest, average, and slowest running times across older runners. This is probably an artifact of our data, not evidence of actual volatility among older runners. Explain why this is happening, and support your answer with numerical evidence from the data.

# Numerical evidence

YOUR EXPLANATION:

Part d

This isn’t the end of the EDA! Reflecting on what you did and didn’t learn from the plot above, identify some follow-up questions that you have about the relationship between running time and age. (The only wrong answers here are answers that lack reflection.)

YOUR REFLECTION:





Exercise 5: Follow-up plots at a different level of information

In Exercise 4, we used aggregate information (i.e. summaries by groups, not individual observations) to explore the relationship between running time and age. This group-level information provides a nice, first glimpse into the relationship. BUT it oversimplifies the information in our dataset, and so we don’t want to stop with Exercise 4. Let’s now consider the relationship between running time and age using the individual observations.

Part a

Construct a plot of running time vs age that reflects the outcome for each data point in our dataset.

Part b

Follow-up observation: The plot in part a puts all runners into the same bucket, thus the relationship between running time and age doesn’t seem very strong. Perhaps this is because we ignored information about prior running experience. Do the following:

  • Create a new version of the plot from part a to include information related to a person’s prior running experience.
  • Provide a one sentence caveat. Specifically, explain why the additional variable you used is a proxy for prior experience but not a perfect measure. (THINK: Can you imagine a runner for whom your proxy measure is high even though prior experience is actually low? Or a runner for whom your proxy measure is low even though prior experience is actually high?)

YOUR CAVEAT:

Part c

In one sentence, explain what new information you’ve gained about the relationship between running time and age through the plots in this exercise.

YOUR ANSWER:

Part d

This isn’t the end of the EDA! Each new step in the process is helping us refine our research questions and introducing even more research questions. Pick the option below which best reflects the refined research questions we do have insight into at this point in our analysis:

  1. How do older runners compare to younger runners?
  2. How does a person’s running time change with age?
  3. Both 1 and 2.






Exercise 6: Yet more follow-up plots at yet a different level of information

At this point, we’ve “sliced” the data in two different ways or at different levels to examine the relationship between running time and age:

  1. at the aggregate level at each age
  2. at the individual race outcome level at each age

Let’s slice the data yet another way by exploring running time and age dynamics among individual runners.

Part a

Some people appear in our dataset more than once. To that end, address some related properties of our data.

# How many unique runners are there in the dataset?
# Address this with code and provide just one number (e.g. do not print out a long table)



# What is the fewest number of times that somebody appears in the dataset?
# Again, address this with code and provide just one number



# What is the greatest number of times that somebody appears in the dataset?
# Again, address this with code and provide just one number
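For reference, here's a sketch of how single-number answers like these can be computed in base R, using made-up data. The `runner` column is a hypothetical identifier. In the real data you'd need to decide which variable identifies a unique person.

```r
# MADE-UP toy data with a hypothetical runner identifier
toy <- data.frame(
  runner = c("a", "a", "a", "b", "c", "c"),
  year   = c(2004, 2005, 2006, 2005, 2005, 2006)
)

# Number of unique runners: one number, not a table
n_runners <- length(unique(toy$runner))

# Number of rows per runner, kept as an intermediate object rather than printed
appearances <- table(toy$runner)

# Smallest and largest number of appearances
fewest <- min(appearances)
most   <- max(appearances)
```

Storing the table in `appearances` and reporting only `min()` and `max()` avoids printing a long table in the rendered document.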

Part b

Let’s focus only on the runners that participated in the Cherry Blossom Race at least 7 times (i.e. 7 or more times) in the years from 1999 to 2008. Construct a plot that represents the relationship between running time and age for each of these individual runners.

  • Show a smooth trend line (not curve) for each runner.
  • Put these trend lines all on the same frame.
  • Show only the trend lines for each runner, otherwise it would be very messy.
  • Use group, not color, fill, or facets, to represent the individual runners. Otherwise, we’d get messy legends. Since we don’t necessarily care about these particular runners, just runners in general, we don’t need a legend.
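The requirements above can be sketched as follows on made-up data, assuming the ggplot2 package is installed. The `runner` identifier is hypothetical. The `group` aesthetic tells ggplot to fit a separate line per runner without creating a legend, `method = "lm"` gives straight lines rather than curves, and `se = FALSE` drops the confidence bands.

```r
library(ggplot2)

# MADE-UP toy data: three hypothetical runners with three races each
toy <- data.frame(
  runner = rep(c("a", "b", "c"), each = 3),
  age    = c(30, 31, 32, 45, 46, 47, 52, 53, 54),
  net    = c(80, 79, 81, 90, 92, 95, 100, 103, 104)
)

# One straight trend line per runner, all on the same frame, no points, no legend
p <- ggplot(toy, aes(x = age, y = net, group = runner)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)
p
```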

Part c

Our intention in restricting our data to runners with at least 7 nruns was to focus on people for whom we had at least 7 observed data points. But our plot makes clear that this isn’t always the case – some lines span only 2 or 3 years. Explain how this happened. That is, how can somebody have fewer data points than suggested by their nruns?

YOUR ANSWER:





Exercise 7: A “final” follow-up plot

Let’s improve upon our analysis from Exercise 6.

Part a

Obtain data on only the runners for whom we observed at least 7 race outcomes, and only on their races for which we have complete information. Again, since we’re missing some data, this is different than the runners that completed at least 7 races.

# Create new dataset


# Confirm that we have 511 data points
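Here's a sketch of the general pattern for keeping only the groups with enough observed rows, on made-up data and assuming the dplyr package is installed. Two things are assumptions: the hypothetical `runner` identifier, and the idea that incomplete information shows up as `NA` values (check what it looks like in the real data). The toy threshold is 3, whereas the exercise uses 7.

```r
library(dplyr)

# MADE-UP toy data; runner "b" has one incomplete record
toy <- data.frame(
  runner = c("a", "a", "a", "b", "b", "b", "c"),
  age    = c(30, 31, 32, 40, NA, 42, 50),
  net    = c(80, 79, 81, 90, 92, 95, 100)
)

min_obs <- 3  # toy threshold; the exercise uses 7

toy_sub <- toy %>%
  filter(!is.na(age), !is.na(net)) %>%  # ASSUMPTION: incomplete records appear as NA
  group_by(runner) %>%
  filter(n() >= min_obs) %>%            # keep runners with enough observed rows
  ungroup()

# Confirm the number of remaining data points
nrow(toy_sub)  # 3
```

The order matters: dropping incomplete rows before counting means `n()` reflects the races we actually observed, not the number a runner supposedly ran.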

Part b

Ignoring individual runners and focusing on the individual race outcomes, construct a scatterplot of running time vs age using your new dataset.

Part c

Now focusing on individual runners, recreate the plot from Exercise 6 Part b using your new dataset.

Part d

Summarize your combined observations from Parts b and c. Your summary should address:

  • What you learned about the relationship between running time and age in Part b.
  • What you learned about the relationship between running time and age in Part c.
  • What explains any conflicting messages between these 2 plots.

YOUR ANSWER:





Reflection

The exercises above represent one possible EDA of the relationship between running time and age. As we did here, any EDA has the following steps:

  • import the data
  • do some quick data checks
  • iterate between the following:
    • clean the data
    • construct some summaries and plots
    • ask yourself what questions these summaries and plots do answer, what questions don’t they answer, and what follow-up questions they provoke
    • repeat as necessary, letting your reflections on the previous questions inspire your next steps

Further, this is the beginning, not the end of our analysis!





Finalize your homework

  • Render your qmd one more time.

    • If the formatting is amiss, if there are long datasets printed out, or if you have code but no output, we can’t grade it :/
      • Confirm that it appears as you expect it and that it’s correctly formatted.
      • Confirm that you haven’t accidentally printed out long datasets.
    • Review your answers and make sure you addressed each question. For example, several questions ask for both some code / plot and a discussion or summary in words.
  • Submit your html file to the Homework 7 assignment on Moodle. Do NOT submit a .qmd or pdf or any other file type – we will not be able to grade them.

  • You’re done with Homework 7. Congrats!!