```r
# Load the tidyverse
library(tidyverse)

# Import the data
library(babynames)
data(babynames)
head(babynames)
## # A tibble: 6 × 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162
```
Homework 5: MORE Data Wrangling
DIRECTIONS
- Save this file as `homework_5.qmd` in your “DS 112 > Homework” folder.
- Type your name in line 3 above (where it says “author”).
- Type your responses in this template.
- Do not modify the structure of this document (e.g., don’t change section headers, spacing, etc.).
- There are lots of ways to do things in R. In these exercises, be sure to use the `tidyverse` code and style / structure we’ve learned in this class.
GOALS
Practice some data wrangling and data viz in guided settings. The content of the exercises is not necessarily in the order that we learned it, so you’ll need to practice identifying appropriate tools for a given task.
Exercise 1: More names
In this exercise, let’s revisit the `babynames` dataset from the previous homework, imported above. This dataset, provided by the U.S. Social Security Administration, provides information on the names of every baby born in the U.S. from 1880-2017.

Along with names, there’s information on the `sex` assigned at birth. This information reflects that collected by the U.S. government at birth. We’ll refer to sex assigned at birth as `sex` throughout. In this exercise, we’ll examine the neutrality or non-neutrality of names assigned to babies by `sex`.
Part a
Create a dataset that has one row per name observed during the study period. For each observed name, calculate the total number of females and males born with that name. Use `values_fill = 0` to replace NAs with 0s. Store this as `babynames_total` and print out the first 3 rows, which should match the following:

| name  | M   | F  |
|-------|-----|----|
| Aaban | 107 | 0  |
| Aabha | 0   | 35 |
| Aabid | 10  | 0  |
```r
# Define babynames_total

# Print out the first 3 rows
```
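If you get stuck, here is a sketch of one possible tidyverse approach (not necessarily the intended solution — there are other routes, e.g. via `count()`):

```r
# Sketch: total births by name and sex, then pivot sex into columns,
# filling combinations that never occur with 0
babynames_total <- babynames %>%
  group_by(name, sex) %>%
  summarize(total = sum(n), .groups = "drop") %>%
  pivot_wider(names_from = sex, values_from = total, values_fill = 0)

head(babynames_total, 3)
```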
Part b
After completing Part a, run the following code. Fill in the blanks (`___`) below to comment on what each wrangling row does. (You can check your claims by running the code, but you should first try to do this without running any code.)
```r
# popular_names <- babynames_total %>%
#   filter(M > 25000, F > 25000) %>%  # ___
#   mutate(ratio = F / (F + M)) %>%   # ___
#   arrange(desc(ratio))              # ___

# head(popular_names)
```
Part c
Starting from `popular_names`, identify the names that are most popular among both male and female babies: specifically, names for which 45-55% of babies born with that name are female (and thus 45-55% are male).
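Assuming `ratio` is defined as in the Part b code, one way to express the 45-55% window is a filter like this sketch:

```r
# Sketch: keep names whose female share falls between 0.45 and 0.55
# (assumes popular_names and its ratio column from Part b)
neutral_names <- popular_names %>%
  filter(ratio >= 0.45, ratio <= 0.55)
```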
Part d
Pick one “neutral” name from Part c. Construct a line plot of the number of babies with this name by year and sex. You should have 2 lines. Discuss your observations.
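For example, a sketch of such a line plot might look like the following ("Riley" is just an illustrative choice — substitute your own name from Part c):

```r
# Sketch: 2 lines (one per sex) of yearly counts for one name
babynames %>%
  filter(name == "Riley") %>%
  ggplot(aes(x = year, y = n, color = sex)) +
  geom_line()
```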
Discussion:
Exercise 2: Laughing
In Laughing On Line, The Pudding analyzed the use of different laughter expressions (e.g. “lol”, “haha”) in Reddit comments, and how these have changed over time.
```r
# Import data (shared by The Pudding at https://github.com/the-pudding/data/tree/master/laugh)
laughs <- read.csv("https://mac-stat.github.io/data/reddit-laughs.csv")

# Check it out
head(laughs)
##         id family             description count_2009 share_2009 count_2010
## 1      lol    lol                              63719 0.24246748     241274
## 2     haha     ha                              50889 0.19364597     196208
## 3      LOL    lol                              26038 0.09908141      70148
## 4       ha     ha                              20930 0.07964413      56167
## 5      heh  other                              20283 0.07718213      45803
## 6 haahahha     ha   keyboard smash "haha"      11282 0.04293096      33157
##   share_2010 count_2011 share_2011 count_2012 share_2012 count_2013 share_2013
## 1 0.28669647     882509 0.33900448    2083298 0.35558406    3281948 0.37271772
## 2 0.23314630     647712 0.24881023    1629267 0.27808857    2507590 0.28477698
## 3 0.08335413     196477 0.07547411     368135 0.06283448     466509 0.05297956
## 4 0.06674105     139746 0.05368163     301229 0.05141474     463033 0.05258481
## 5 0.05442592      98599 0.03787554     177181 0.03024183     244618 0.02778029
## 6 0.03939917      94499 0.03630058     195952 0.03344572     276116 0.03135739
##   count_2014 share_2014 count_2015 share_2015 count_2016 share_2016 count_2017
## 1    5020118 0.39813165    7548949 0.43376563   10467043 0.47207181   15189198
## 2    3349881 0.26566978    4059995 0.23328894    4445584 0.20049931    5328650
## 3     633184 0.05021607     859447 0.04938417    1076547 0.04855311    1319891
## 4     647988 0.05139013     813667 0.04675363     841035 0.03793134     955275
## 5     314737 0.02496092     366573 0.02106343     366292 0.01652006     362392
## 6     371280 0.02944519     471774 0.02710832     591297 0.02666796     842984
##   share_2017 count_2018 share_2018 count_2019 share_2019
## 1 0.50227697   21404140 0.52110200   12263438 0.52432395
## 2 0.17620800    6502422 0.15830700    3413330 0.14593711
## 3 0.04364620    1574734 0.03833824     779184 0.03331405
## 4 0.03158907    1131222 0.02754056     586156 0.02506113
## 5 0.01198359     365398 0.00889593     170043 0.00727020
## 6 0.02787583    1045384 0.02545076     558478 0.02387776
```
Part a
What are the units of observation in `laughs`?
ANSWER:
Part b
Our goal is to create a line plot similar to the one in the article’s section titled “The evolution of lol”.
First, create a new `laugh_plot_data` set that you’ll need to make this plot. Show the first 6 rows of the data and confirm that it includes 55 rows (one per laugh type / year combination) and 3 columns: `id`, `year`, and `share`.

HINTS:

- Remove any columns you don’t need for this analysis.
- Only keep track of 5 laugh types: “lol”, “haha”, “lmao”, “ha”, “heh”.
- Make sure that `share` is on the scale from 0-100, not 0-1.
```r
# Define your dataset

# Show the first 6 rows

# Confirm that the dataset has 55 rows and 3 columns
```
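One possible reshaping sketch (column names assumed from the `head(laughs)` output above) uses `pivot_longer()`:

```r
# Sketch: keep 5 laugh types, reshape the share_* columns to long format,
# and rescale share to 0-100
laugh_plot_data <- laughs %>%
  filter(id %in% c("lol", "haha", "lmao", "ha", "heh")) %>%
  select(id, starts_with("share")) %>%
  pivot_longer(starts_with("share"),
               names_to = "year", names_prefix = "share_",
               values_to = "share") %>%
  mutate(year = as.numeric(year), share = 100 * share)
```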
Part c
Use your data from Part b to create a plot that’s similar to that from the article. It doesn’t need to match exactly; for example, the colors and styling don’t have to be the same. Discuss your observations.
Discussion:
Exercise 3: Bikes
In the next 5 exercises, you’ll explore two datasets related to Capital Bikeshare, a bike share company in Washington, DC. To begin, the `trips` data includes information about 10,000 bike rentals during the last quarter of 2014:
```r
# Import the data
# Only keep certain variables of interest
trips <- readRDS(gzcon(url("https://mac-stat.github.io/data/2014-Q4-Trips-History-Data-Small.rds"))) %>%
  select(client, sstation, sdate, duration) %>%
  mutate(duration = as.numeric(hms(duration)) / 60)

# Check it out
head(trips)
##            client                            sstation               sdate
## 344758 Registered                      15th & L St NW 2014-11-06 16:26:00
## 113251 Registered                       3rd & D St SE 2014-10-12 11:30:00
## 633756     Casual                      10th & E St NW 2014-12-27 14:24:00
## 466862     Casual                       4th & M St SW 2014-11-23 16:42:00
## 474332 Registered 1st & Washington Hospital Center NW 2014-11-24 17:29:00
## 581597 Registered                 11th & Kenyon St NW 2014-12-15 13:11:00
##        duration
## 344758   9.2500
## 113251  47.3500
## 633756 166.3667
## 466862  15.2500
## 474332  18.5500
## 581597   2.6000
```
For each rental, `trips` has information on:

- `client` = whether the renter is a `Registered` member of the bike share service or a `Casual` renter
- `sstation` = the name of the station where the rental started
- `sdate` = the date and time that a rental started
- `duration` = the duration of the rental in minutes
Part a
The `sdate` variable contains a lot of information! Use it to define (and store) 5 new variables in the `trips` data:

- `s_date = as_date(sdate)`: the date (not including time) that the rental started
- `s_day_of_week` = day of week that the rental started, labeled `Sun`, `Mon`, etc
- `s_hour` = hour of day that the rentals started (0-23)
- `s_minute` = minute within the hour that the rentals started (0-59)
- `s_time_of_day` = time of day that the rentals started in decimal notation (e.g. 3:30 should be 3.5)
HINTS:

- Recall that the `lubridate` package (part of the `tidyverse`) has some handy functions, including `hour()`, `minute()`, and `wday(label = TRUE)`.
- The time of day can be calculated by hour + minute/60.
```r
# Define the new variables
# Once you're confident, store them under trips

# Confirm that your new dataset has 10000 rows and 9 columns
dim(trips)
## [1] 10000     4

# Convince yourself that the first entry has
# s_date = 2014-11-06, s_day_of_week = Thu, s_hour = 16, s_minute = 26, s_time_of_day = 16.43333
```
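A sketch of the `lubridate`-based wrangling (one approach among several):

```r
# Sketch: derive the 5 time variables from sdate and store under trips
trips <- trips %>%
  mutate(s_date = as_date(sdate),
         s_day_of_week = wday(sdate, label = TRUE),
         s_hour = hour(sdate),
         s_minute = minute(sdate),
         s_time_of_day = s_hour + s_minute / 60)
```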
Part b
Let’s warm up by exploring some basic patterns in the data.
```r
# Calculate the shortest, average, and longest rental duration


# Show the rentals that left from Lincoln Memorial on a Monday and lasted under 10 minutes
# HINT: You should get two data points!


# Calculate the total number of rides taken on each day of the week
# Sort from highest to lowest


# Identify the 2 dates (s_date) on which there were the fewest rentals
# (Think about the significance of these dates)
```
Exercise 4: When do people ride?
Let’s explore temporal patterns in bike rentals using data viz. You will make a series of plots and summarize them in the last part of the exercise.
Part a
How did the volume of bike trips vary throughout the study period? Construct a univariate plot of `s_date`.
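For instance, a histogram is one reasonable univariate choice here (a sketch; the bin width is an arbitrary assumption):

```r
# Sketch: trip volume over the study period, binned by week
ggplot(trips, aes(x = s_date)) +
  geom_histogram(binwidth = 7)  # binwidth in days; adjust as you see fit
```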
Part b
How did the volume of bike trips vary at different times of the day? Construct a univariate plot of `s_time_of_day`.
Part c
How do ridership patterns vary by both time of day and day of week? Facet your plot from Part b by `s_day_of_week`.
Part d
Summarize, in words, the temporal patterns you observed in Parts a, b, and c.
Exercise 5: Registered vs casual riders
Let’s explore how ridership patterns might differ among registered and casual riders. You will make a series of plots and calculations, and summarize them in the last part of the problem.
Part a
How do trip durations compare among registered and casual riders? Calculate the shortest, average, and longest rental duration among registered and casual riders.
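A `group_by()` + `summarize()` sketch for these summaries:

```r
# Sketch: duration summaries, one row per client type
trips %>%
  group_by(client) %>%
  summarize(shortest = min(duration),
            average  = mean(duration),
            longest  = max(duration))
```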
Part b
On what days of the week do registered and casual riders tend to ride more? Construct a plot of `s_day_of_week`, faceted by `client`.
Part c
At what times of day and days of the week do registered and casual riders tend to ride more? Construct density plots of `s_time_of_day` faceted by `s_day_of_week` and colored by `client` status.
Part d
Summarize some key takeaways from Parts a, b, and c. What do they tell us about registered vs casual riders? What is your explanation for why these observations make sense?
Exercise 6: Where do people ride?
Beyond information about when the bike trips started, the `trips` data includes information about `sstation`, the stations where each bike trip started. Let’s explore.
Part a
Make a table with the ten stations with the highest number of departures (`head(10)`). Define this as `popular_stations` and print / show the data table. HINT: Your first row should have 2 entries: `Columbus Circle / Union Station`, `241`.
```r
# Define popular_stations

# Print out popular_stations (just type popular_stations)
```
Part b
Get a dataset of only the `trips` that departed from the ten most `popular_stations`. Store this as `popular_trips`. HINT: Use a join operation.
```r
# Define popular_trips

# Confirm that popular_trips has 1525 rows and 9 columns
```
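Assuming `popular_stations` shares the `sstation` column with `trips` (a guess about how you named things in Part a), a `semi_join()` is one sketch:

```r
# Sketch: keep only trips whose sstation appears in popular_stations
popular_trips <- trips %>%
  semi_join(popular_stations, by = "sstation")
```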
Part c
Get a dataset of the `trips` that did not depart from the most `popular_stations`. Store this as `unpopular_trips`.
```r
# Define unpopular_trips

# Confirm that unpopular_trips has 8475 rows and 9 columns
```
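The complement of Part b can be sketched with `anti_join()` (same column-name assumption as in Part b):

```r
# Sketch: keep trips whose sstation does NOT appear in popular_stations
unpopular_trips <- trips %>%
  anti_join(popular_stations, by = "sstation")
```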
Exercise 7: Spatial trends
Let’s bring in spatial information about the bike stations! The `stations` dataset includes the latitude / longitude coordinates and other details for the bike rental stations:
```r
# Import data on the bike stations
# Only keep certain variables of interest
stations <- read_csv("https://mac-stat.github.io/data/DC-Stations.csv") %>%
  select(name, lat, long)

# Check it out
head(stations)
## # A tibble: 6 × 3
##   name                                         lat  long
##   <chr>                                      <dbl> <dbl>
## 1 20th & Bell St                              38.9 -77.1
## 2 18th & Eads St.                             38.9 -77.1
## 3 20th & Crystal Dr                           38.9 -77.0
## 4 15th & Crystal Dr                           38.9 -77.0
## 5 Aurora Hills Community Ctr/18th & Hayes St  38.9 -77.1
## 6 Pentagon City Metro / 12th & S Hayes St     38.9 -77.1
```
Part a
To the `popular_stations` data, tack on new information about the latitude and longitude of the 10 most popular stations.
```r
# Define popular_stations with the additional lat / long variables
# NOTE: Don't store the data until you know it's what you want!

# Confirm that popular_stations has 10 rows and 4 columns
```
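Because the station-name columns likely differ between the two datasets (`sstation` in `popular_stations` vs `name` in `stations` — an assumption about your Exercise 6 naming), a `left_join()` with a named `by` is one sketch:

```r
# Sketch: attach lat/long to the 10 popular stations,
# matching sstation in popular_stations to name in stations
popular_stations %>%
  left_join(stations, by = c("sstation" = "name"))
```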
Part b
Construct a leaflet map of the `stations` data:

- include circles representing each station in the Capital Bikeshare system
- add markers (the upside-down teardrop / flag symbols) that indicate the locations of the 10 most popular stations. NOTE: Since these 10 locations are in the `popular_stations` data, not the `stations` data, you’ll need to add a `data = popular_stations` argument to this layer.

```r
library(leaflet)
```
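A minimal leaflet sketch with the two layers described above (the marker layer assumes `popular_stations` has `lat` / `long` columns from Part a):

```r
# Sketch: circles for all stations, markers for the top 10
leaflet(data = stations) %>%
  addTiles() %>%                           # background map tiles
  addCircles(lng = ~long, lat = ~lat) %>%  # one circle per station
  addMarkers(data = popular_stations,      # flags for the 10 most popular
             lng = ~long, lat = ~lat)
```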
Part c
Write a 1-sentence summary of what you learned from this map.
Finalize your homework
Render your qmd one more time and check out the rendered html.
- Confirm that the html appears as you expect it and that it’s correctly formatted.
- Confirm that you haven’t accidentally printed out long datasets.
- Review your answers and make sure you addressed each question. For example, several questions ask for both some code / plot and a discussion or summary in words.
Submit your html file to the Homework 5 assignment on Moodle.
You’re done with Homework 5. Congrats!!