Homework 4: Data Wrangling

Author

PUT YOUR NAME HERE

DIRECTIONS

Save this file as homework_4.qmd in your “DS 112 > Homework” folder.
Type your name in line 3 above (where it says “author”).
Type your responses in this template.
Do not modify the structure of this document (e.g. don’t change section headers, spacing, etc).
There are lots of ways to do things in R. In these exercises, be sure to use the tidyverse code and style / structure we’ve learned in this class.
Submit your rendered HTML file. We cannot grade qmd or pdf files.

GOALS

Practice some data wrangling and data viz in guided settings. The content of the exercises is not necessarily in the order that we learned it, so you’ll need to practice identifying appropriate tools for a given task.

Exercise 1: Birthdays

In Exercises 1-4 we’ll return to the daily Birthdays dataset in the mosaic package:

library(mosaic)
data("Birthdays")
head(Birthdays)
##   state year month day       date wday births
## 1    AK 1969     1   1 1969-01-01  Wed     14
## 2    AL 1969     1   1 1969-01-01  Wed    174
## 3    AR 1969     1   1 1969-01-01  Wed     78
## 4    AZ 1969     1   1 1969-01-01  Wed     84
## 5    CA 1969     1   1 1969-01-01  Wed    824
## 6    CO 1969     1   1 1969-01-01  Wed    100

Address each prompt below using our wrangling tools.

Part a

# How many total births were there in the U.S. during this time period (1969-1988)?


# Show the 6 data points (state/date pairs) with the fewest number of births
# (What do these have in common?!)


# February 29, "leap day", occurs once every 4 years
# Report data on leap day births in Alaska (AK) during this time period
# Your dataset should have 5 rows!

Part b

# Show the 6 states with the most total births during this time period
# (your answer should indicate both the state and its number of births)


# Show the 6 states with the fewest total births in 1988
# (your answer should indicate both the state and its number of births)

Part c

Open https://gemini.google.com/. You should have access to a free version of this tool (that doesn’t use your data to train it).

Give it the prompt “Show the 6 states with the fewest total births in 1988” and see what it generates.

Record any additional information you had to give it so that it understood the prompt and returned a useful response.

Additional information:

Evaluate the quality of the response and how it compares to your response to the question above.

Evaluate the quality of response:

Reflect on the strengths and weaknesses of this tool to help you understand and learn the data science tools.

Strengths of this tool for your learning:

Weaknesses of this tool for your learning:

Develop a rule for yourself to decide when to use this tool for your learning while maintaining your creativity and integrity in your work.

Your rule for using this tool:

Exercise 2: Temporal & geographical trends

Let’s examine some temporal trends in births (using Birthdays). For each part, you will need to wrangle and plot some data.

You can decide whether you want to: (1) wrangle and store data, then plot; or (2) wrangle data and pipe directly into ggplot. For example:

Birthdays %>% 
  filter(state == "MN") %>% 
  ggplot(aes(y = births, x = date)) + 
    geom_smooth()
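For comparison, approach (1) could be sketched as follows. This is only an illustration on a small made-up data frame: toy_births, its values, and the object names mn_births and mn_plot are hypothetical and are not taken from Birthdays. It assumes dplyr and ggplot2 are installed.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for Birthdays (values are made up)
toy_births <- data.frame(
  state  = c("MN", "MN", "MN", "WI", "WI", "WI"),
  date   = as.Date(c("1969-01-01", "1969-01-02", "1969-01-03",
                     "1969-01-01", "1969-01-02", "1969-01-03")),
  births = c(170, 165, 180, 190, 185, 200)
)

# Step 1: wrangle and store the result
mn_births <- toy_births %>%
  filter(state == "MN")

# Step 2: plot the stored data
# (geom_line() is used only because the toy data is too small to smooth)
mn_plot <- ggplot(mn_births, aes(y = births, x = date)) +
  geom_line()
```

Storing the wrangled data can be handy when you want to inspect it or reuse it in several plots; piping directly into ggplot avoids creating an extra object.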

Part a: monthly trends

Calculate the total number of births in each month and year (e.g. Jan 1969, Feb 1969, …), combining all states. Label month by names not numbers (Jan not 1). Then plot the relationship of births with month and summarize (in words) what you learn. (NOTE: You should have / use multiple data points for each month!)
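As a reminder of how month labels can be produced, here is a minimal sketch using lubridate::month() on a single date. This is one possible tool, assuming lubridate is installed; it is not necessarily the only approach that fits this exercise.

```r
library(lubridate)

# A date that appears in the preview of Birthdays above
d <- as.Date("1969-01-01")

month(d)                # month as a number
month(d, label = TRUE)  # month as an abbreviated, ordered label
```

Because the labeled version is an ordered factor, months keep their calendar order when used on a plot axis.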

Summary:

Part b: weekly trends

Construct a line plot of the total number of births per week in 1988 for each state. It should have 51 lines (for the 50 states + DC) and eliminate week “53” (which isn’t a complete week). Then summarize (in words) what you learn. For example, do you notice any seasonal trends? Are these the same in every state? Any outliers? NOTE: It’s tough to identify individual states, so focusing on bigger trends is fine.
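To see why week “53” isn’t a complete week, here is a small sketch. It assumes weeks are numbered with lubridate::week(), which is one option and may not be the numbering you end up using.

```r
library(lubridate)

# week() returns the number of complete 7-day periods since January 1, plus one
week(as.Date("1988-01-01"))  # 1
week(as.Date("1988-12-29"))  # 52 (day 364 of the year)
week(as.Date("1988-12-30"))  # 53 (day 365)
week(as.Date("1988-12-31"))  # 53 (day 366, since 1988 is a leap year)
```

Under this numbering, week 53 of 1988 holds only 2 days, so its total is not comparable to the totals for weeks 1 through 52.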

Summary:

Part c: geography

Repeat Part b for just Minnesota (MN) and Louisiana (LA). MN has one of the coldest climates, and LA has one of the warmest. Discuss how their seasonal trends compare.

Discussion:

OPTIONAL exercise

Dig deeper into the geography of birth patterns. Are the trends you observed in MN vs LA similar in other colder and warmer states? Which states have the largest increases or decreases in their proportion of US births over time? Is the weekend effect less strong for states with a higher percentage of their populations living in rural areas?

Exercise 3: Daily trends and anomalies (Part 1)

Part a

Define a new dataset, daily_births, which has the following variables for each date in the study period:

date
total = total number of births on that date
year = the corresponding year
week_day = day of week, labeled “Mon”, “Tue”, etc
month_day = day of the month (e.g. 1–31)

Use code to show the first 6 rows and confirm that your dataset has 7305 rows (one per date in the dataset) and 5 columns.
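The date-related variables listed above can be derived from a date in more than one way. As a hedged sketch of one option (assuming lubridate is installed), these helpers extract each piece from a single date:

```r
library(lubridate)

# This date is listed as a Wednesday in the preview of Birthdays
d <- as.Date("1969-01-01")

year(d)                # the year
wday(d, label = TRUE)  # day of week as an abbreviated label
mday(d)                # day of the month
```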

# Wrangle and store your data


# Show first 6 rows


# Confirm there are 7305 rows and 5 columns

Part b

This article from FiveThirtyEight demonstrates that fewer babies are born on the 13th of each month, and the effect is even stronger when the 13th falls on a Friday. Let’s see if that theory holds up in our data. There are lots of ways to do this. Consider just one.

Start from the daily_births dataset you made in Part a, not Birthdays. From there, calculate and plot the average number of births (y-axis) per day of month (x-axis) in the U.S. Your plot should include 31 points. Discuss your observations. Does your plot support the theory that fewer babies tend to be born on the 13th day of the month? Any other data points that stand out?

Discussion:

Part c

Starting from daily_births, plot the total number of babies born (y-axis) per day (x-axis) in 1980. Color each date according to its day of the week.

Part d

As in Homework 2, one thing that stands out in this plot is that fewer babies are born on weekends. BUT there are some exceptions – relative to day of the week, there are significantly fewer births than expected on some days. Explain what you think is happening here.

Explanation:

Exercise 4: Daily trends and anomalies (Part 2)

In Exercise 3, you might have hypothesized that the anomalous births are explained by holidays. To test this hypothesis, import the data on U.S. federal holidays using the code below. NOTE: lubridate::dmy() parses the character-string date stored in the CSV into a Date, which as.POSIXct() then converts to a “POSIXct” date-time.

holidays <- read.csv("https://mac-stat.github.io/data/US-Holidays.csv") %>%
  mutate(date = as.POSIXct(lubridate::dmy(date)))
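To make the two conversion steps concrete, here is a small sketch on a single string. The string raw is hypothetical; the exact text format used in the CSV may differ, as long as it is in day-month-year order.

```r
library(lubridate)

# Hypothetical day-month-year string (the format in the CSV may differ)
raw <- "4/7/1980"

as_date <- dmy(raw)             # a Date: July 4, 1980
as_dt   <- as.POSIXct(as_date)  # the same day as a POSIXct date-time

class(as_date)
class(as_dt)
```

Two date columns generally need to be of the same type before they can be matched up across datasets, which is why the conversion matters.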

Part a

Create a new dataset, daily_births_1980, which:

keeps only daily_births related to 1980
adds a variable called is_holiday which is TRUE when the day is a holiday, and FALSE otherwise. NOTE: !is.na(x) is TRUE for each value of column x that is not NA, and FALSE for each value that is NA.

Print out the first 6 rows and confirm that your dataset has 366 rows (1 per day in 1980) and 7 columns. HINT: You’ll need to combine 2 different datasets.
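As a generic illustration of how !is.na() behaves after two datasets are combined, here is a sketch on made-up tables. The tables days and events, and all of their columns, are hypothetical and unrelated to the homework data.

```r
library(dplyr)

# Hypothetical toy tables (names and values are made up)
days   <- data.frame(day_id = 1:4)
events <- data.frame(day_id = c(2L, 4L), event = c("Parade", "Fair"))

# left_join() keeps every row of days;
# rows with no match in events get NA for event
joined <- days %>%
  left_join(events, by = "day_id") %>%
  mutate(has_event = !is.na(event))
```

Here has_event is FALSE for days 1 and 3 (no match, so event is NA) and TRUE for days 2 and 4.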

# Define daily_births_1980


# Check out the first 6 rows


# Confirm that daily_births_1980 has 366 rows and 7 columns

Part b

Plot the total number of babies born (y-axis) per day (x-axis) in 1980. Color each date according to its day of the week, and shape each date according to whether or not it’s a holiday. (This is a modified version of 3c!)

Part c

Discuss your observations. For example: To what degree does the theory that there tend to be fewer births on holidays hold up? What holidays stand out the most?

Part d

Some holidays stand out more than others. It would be helpful to label them. Use geom_text to add labels to each of the holidays. NOTE: You can set the orientation of a label with the angle argument; e.g., geom_text(angle = 40, ...).
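Here is a minimal sketch of geom_text with the angle argument, on made-up data. The data frame toy and its columns are hypothetical; vjust is an optional extra used here only to nudge the labels off the points.

```r
library(ggplot2)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  x     = c(1, 2, 3),
  y     = c(10, 4, 8),
  label = c("A", "B", "C")
)

# Points with rotated text labels drawn on top
labeled_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_text(aes(label = label), angle = 40, vjust = -1)
```

To label only some of the points, geom_text also accepts its own data argument, so you can pass it a filtered version of the dataset.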

Exercise 5: Baby names

Let’s switch our attention to the babynames dataset. This dataset, provided by the U.S. Social Security Administration, records the names given to babies born in the U.S. from 1880 to 2017. Along with names, there’s information on the sex assigned at birth, as collected by the U.S. government at birth. We’ll refer to sex assigned at birth as sex throughout.

library(babynames)
data(babynames)
head(babynames)
## # A tibble: 6 × 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

Part a

Let’s do some preliminary exploration. Address each prompt below using our wrangling tools. Be mindful that some names were assigned to both male and female babies.

# Find the 6 most popular names for male babies in 2017


# Find the 6 most popular names overall, i.e. combining all years and combining male & female babies


# Find the 6 most popular names for female babies during this time period, i.e. combining all years

Part b

Create a new dataset that records the most popular name by sex for each year. Print out the data for the years 2013-2017 only. NOTES:

You can use the slice_max(___) verb which pulls out the row in each group that has the maximum value with respect to the variable provided.
Your dataset should have 10 rows and 5 columns.
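As a quick illustration of slice_max() with groups, here is a sketch on made-up data. The data frame toy_scores and its columns are hypothetical and unrelated to babynames.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_scores <- data.frame(
  team   = c("a", "a", "b", "b"),
  player = c("p1", "p2", "p3", "p4"),
  points = c(3, 7, 5, 2)
)

# Keep the row with the most points within each team
top_scorers <- toy_scores %>%
  group_by(team) %>%
  slice_max(points)
```

The result has one row per team (p2 for team a, p3 for team b) and keeps all of the original columns. If several rows tie for the maximum, slice_max() keeps all of them by default.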

Part c

Construct a line plot of the total number of babies per year that were named “Alicia”, no matter the sex assigned at birth.

Part d

Repeat Part c using whatever name and whatever consideration of sex assigned at birth you wish. Discuss your observations.

Discussion:

Finalize your homework

Render your qmd one more time.
- If the formatting is amiss, or if there are long datasets printed out, we can’t grade it :/
  - Confirm that it appears as you expect it and that it’s correctly formatted.
  - Confirm that you haven’t accidentally printed out long datasets.
- Review your answers and make sure you addressed each question. For example, several questions ask for both some code / plot and a discussion or summary in words.
If you’re working on Mac’s RStudio server, you have one more step that you should take at the end of each activity / assignment: export your files to your computer. To do so:
- Go to the Files tab in the lower right pane.
- Click the boxes next to the two homework files: homework_4.qmd and homework_4.html.
- Still within the Files tab, click on the “More” button that has a gear symbol next to it.
- Click “Export” then “Download”.
- The files were likely exported from the RStudio server to the Downloads folder on your computer. It’s important to now move them to the “DS 112 > Homework” folder that you created at the beginning of class. They are now there for safe keeping :)
Submit your HTML file to the Homework 4 assignment on Moodle. Do NOT submit a .qmd or pdf or any other file type – we will not be able to grade them.
You’re done with Homework 4. Congrats!!