Homework 4: Data Wrangling

Author

PUT YOUR NAME HERE


DIRECTIONS



GOALS

Practice some data wrangling and data viz in guided settings. The content of the exercises is not necessarily in the order that we learned it, so you’ll need to practice identifying appropriate tools for a given task.





Exercise 1: Birthdays

In Exercises 1-4 we’ll return to the daily Birthdays dataset in the mosaic package:

library(mosaic)
data("Birthdays")
head(Birthdays)
##   state year month day       date wday births
## 1    AK 1969     1   1 1969-01-01  Wed     14
## 2    AL 1969     1   1 1969-01-01  Wed    174
## 3    AR 1969     1   1 1969-01-01  Wed     78
## 4    AZ 1969     1   1 1969-01-01  Wed     84
## 5    CA 1969     1   1 1969-01-01  Wed    824
## 6    CO 1969     1   1 1969-01-01  Wed    100

Address each prompt below using our wrangling tools.
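
As a reminder of what the main wrangling verbs do, here is a minimal sketch on a made-up data frame. The `toy_births` data and its values are hypothetical (they are not the Birthdays data), and the native pipe `|>` assumes R 4.1 or later; `%>%` works the same way here.

```r
library(dplyr)

# Hypothetical toy data (NOT the Birthdays data): counts for two made-up states
toy_births <- tibble(
  state  = c("AA", "AA", "AA", "BB", "BB", "BB"),
  year   = c(2000, 2000, 2001, 2000, 2001, 2001),
  births = c(10, 12, 9, 30, 28, 31)
)

# filter(): keep only the rows that meet every condition listed
toy_births |>
  filter(state == "AA", year == 2000)

# summarize(): collapse all rows into a single summary value
toy_births |>
  summarize(total = sum(births))

# group_by() + summarize() + arrange(): one total per group, sorted high to low
state_totals <- toy_births |>
  group_by(state) |>
  summarize(total = sum(births)) |>
  arrange(desc(total))

# head(): show only the first few rows of a result
head(state_totals, 1)
```

Each prompt below can likely be answered by combining a few of these verbs; the work is in deciding which ones, and in what order.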

Part a

# How many total births were there in the U.S. during this time period (1969-1988)?


# Show the 6 data points (state/date pairs) with the fewest number of births
# (What do these have in common?!)


# February 29, "leap day", occurs once every 4 years
# Report data on leap day births in Alaska (AK) during this time period
# Your dataset should have 5 rows!

Part b

# Show the 6 states with the most total births during this time period
# (your answer should indicate both the state and its number of births)


# Show the 6 states with the fewest total births in 1988
# (your answer should indicate both the state and its number of births)

Part c

Open https://gemini.google.com/. You should have access to a free version of this tool (that doesn’t use your data to train it).

Give it the prompt “Show the 6 states with the fewest total births in 1988” and see what it generates.

Record any additional information you give it so that it understands the prompt and returns a useful response.

Additional information:

Evaluate the quality of the response and how it compares to your response to the question above.

Evaluate the quality of response:

Reflect on the strengths and weaknesses of this tool for helping you understand and learn data science tools.

Strengths of this tool for your learning:

Weaknesses of this tool for your learning:

Develop a rule for yourself to decide when to use this tool for your learning while maintaining your creativity and integrity in your work.

Your rule for using this tool:





OPTIONAL exercise

Dig deeper into the geography of birth patterns. Are the trends you observed in MN vs LA similar in other colder and warmer states? Which states have the largest increases or decreases in their proportion of US births over time? Is the weekend effect less strong for states with a higher percentage of their populations living in rural areas?





Exercise 5: Baby names

Let’s switch our attention to the babynames dataset. This dataset, provided by the U.S. Social Security Administration, records the names given to babies born in the U.S. from 1880 to 2017 (a name is listed for a given year and sex only if it was used at least 5 times). Along with names, there’s information on the sex assigned at birth, as recorded by the U.S. government. We’ll refer to sex assigned at birth as sex throughout.

library(babynames)
data(babynames)
head(babynames)
## # A tibble: 6 × 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

Part a

Let’s do some preliminary exploration. Address each prompt below using our wrangling tools. Be mindful that some names were assigned to both male and female babies.

# Find the 6 most popular names for male babies in 2017


# Find the 6 most popular names overall, i.e. combining all years and combining male & female babies


# Find the 6 most popular names for female babies during this time period, i.e. combining all years

Part b

Create a new dataset that records the most popular name by sex for each year. Print out the data for the years 2013-2017 only. NOTES:

  • You can use the slice_max(___) verb, which pulls out the row in each group that has the maximum value of the variable provided.
  • Your dataset should have 10 rows and 5 columns.
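
To illustrate how slice_max() behaves within groups, here is a minimal sketch on a made-up data frame. The `toy_scores` data, its column names, and its values are hypothetical and unrelated to babynames.

```r
library(dplyr)

# Hypothetical toy data (NOT babynames): scores for players on two made-up teams
toy_scores <- tibble(
  team   = c("red", "red", "red", "blue", "blue"),
  player = c("a", "b", "c", "d", "e"),
  score  = c(5, 9, 7, 4, 8)
)

# Within each team, keep the single row with the largest score
top_scorers <- toy_scores |>
  group_by(team) |>
  slice_max(score, n = 1) |>
  ungroup()

top_scorers
```

Note that slice_max() keeps ties by default (`with_ties = TRUE`), so a group can return more than one row if several rows share the maximum value.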

Part c

Construct a line plot of the total number of babies per year that were named “Alicia”, no matter the sex assigned at birth.

Part d

Repeat Part c using whatever name and whatever consideration of sex assigned at birth you wish. Discuss your observations.

Discussion:





Finalize your homework

  • Render your qmd one more time.

    • If the formatting is amiss, or if there are long datasets printed out, we can’t grade it :/
      • Confirm that it appears as you expect it and that it’s correctly formatted.
      • Confirm that you haven’t accidentally printed out long datasets.
    • Review your answers and make sure you addressed each question. For example, several questions ask for both some code / plot and a discussion or summary in words.
  • If you’re working on Mac’s RStudio server, you have one more step that you should take at the end of each activity / assignment: export your files to your computer. To do so:

    • Go to the Files tab in the lower right pane.
    • Click the boxes next to the two homework files: homework_4.qmd and homework_4.html.
    • Still within the Files tab, click on the “More” button that has a gear symbol next to it.
    • Click “Export” then “Download”.
    • The files were likely exported from the RStudio server to the Downloads folder on your computer. It’s important to now move them to the “DS 112 > Homework” folder that you created at the beginning of class. They are now there for safe keeping :)
  • Submit your html file to the Homework 4 assignment on Moodle. Do NOT submit a .qmd, pdf, or any other file type: we will not be able to grade them.

  • You’re done with Homework 4. Congrats!!