Homework 5: MORE Data Wrangling

Author

PUT YOUR NAME HERE

DIRECTIONS

Save this file as homework_5.qmd in your “DS 112 > Homework” folder.
Type your name in line 3 above (where it says “author”).
Type your responses in this template.
Do not modify the structure of this document (e.g. don’t change section headers, spacing, etc).
There are lots of ways to do things in R. In these exercises, be sure to use the tidyverse code and style / structure we’ve learned in this class.

GOALS

Practice some data wrangling and data viz in guided settings. The content of the exercises is not necessarily in the order that we learned it, so you’ll need to practice identifying appropriate tools for a given task.

Exercise 1: More names

In this exercise, let’s revisit the babynames dataset from the previous homework. This dataset, provided by the U.S. Social Security Administration, provides information on the names of every baby born in the U.S. from 1880-2017:

# Load the tidyverse 
library(tidyverse)

# Import the data
library(babynames)
data(babynames)
head(babynames)
## # A tibble: 6 × 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

Along with names, there’s information on the sex assigned at birth. This information reflects that collected by the U.S. government at birth. We’ll refer to sex assigned at birth as sex throughout. In this exercise, we’ll examine the neutrality or non-neutrality of names assigned to babies by sex.

Part a

Create a dataset that has one row per name observed during the study period. For each observed name, calculate the total number of females and males born with that name. Use values_fill = 0 to replace NAs with 0s. Store this as babynames_total and print out the first 3 rows which should match the following:

name	M	F
Aaban	107	0
Aabha	0	35
Aabid	10	0

# Define babynames_total


# Print out the first 3 rows

Part b

After completing Part a, run the following code. Fill in the blanks (___) below to comment on what each wrangling row does. (You can check your claims by running the code, but should first try to do this without running any code.)

# popular_names <- babynames_total %>%
#   filter(M > 25000, F > 25000) %>%  # ___
#   mutate(ratio = F / (F + M)) %>%   # ___
#   arrange(desc(ratio))              # ___
# 
# head(popular_names)

Part c

Starting from popular_names, identify the names that are most popular among both male and female babies. Specifically, names for which 45-55% of babies born with that name are female (thus 45-55% are male.)

Part d

Pick one “neutral” name from Part c. Construct a line plot of the number of babies with this name by year and sex. You should have 2 lines. Discuss your observations.

Discussion:

Exercise 2: Laughing

In Laughing On Line, The Pudding analyzed the use of different laughter expressions (e.g. “lol”, “haha”) in Reddit comments. And how these have changed over time.

# Import data (shared by The Pudding at https://github.com/the-pudding/data/tree/master/laugh)
laughs <- read.csv("https://mac-stat.github.io/data/reddit-laughs.csv")

# Check it out
head(laughs)
##         id family           description count_2009 share_2009 count_2010
## 1      lol    lol                            63719 0.24246748     241274
## 2     haha     ha                            50889 0.19364597     196208
## 3      LOL    lol                            26038 0.09908141      70148
## 4       ha     ha                            20930 0.07964413      56167
## 5      heh  other                            20283 0.07718213      45803
## 6 haahahha     ha keyboard smash "haha"      11282 0.04293096      33157
##   share_2010 count_2011 share_2011 count_2012 share_2012 count_2013 share_2013
## 1 0.28669647     882509 0.33900448    2083298 0.35558406    3281948 0.37271772
## 2 0.23314630     647712 0.24881023    1629267 0.27808857    2507590 0.28477698
## 3 0.08335413     196477 0.07547411     368135 0.06283448     466509 0.05297956
## 4 0.06674105     139746 0.05368163     301229 0.05141474     463033 0.05258481
## 5 0.05442592      98599 0.03787554     177181 0.03024183     244618 0.02778029
## 6 0.03939917      94499 0.03630058     195952 0.03344572     276116 0.03135739
##   count_2014 share_2014 count_2015 share_2015 count_2016 share_2016 count_2017
## 1    5020118 0.39813165    7548949 0.43376563   10467043 0.47207181   15189198
## 2    3349881 0.26566978    4059995 0.23328894    4445584 0.20049931    5328650
## 3     633184 0.05021607     859447 0.04938417    1076547 0.04855311    1319891
## 4     647988 0.05139013     813667 0.04675363     841035 0.03793134     955275
## 5     314737 0.02496092     366573 0.02106343     366292 0.01652006     362392
## 6     371280 0.02944519     471774 0.02710832     591297 0.02666796     842984
##   share_2017 count_2018 share_2018 count_2019 share_2019
## 1 0.50227697   21404140 0.52110200   12263438 0.52432395
## 2 0.17620800    6502422 0.15830700    3413330 0.14593711
## 3 0.04364620    1574734 0.03833824     779184 0.03331405
## 4 0.03158907    1131222 0.02754056     586156 0.02506113
## 5 0.01198359     365398 0.00889593     170043 0.00727020
## 6 0.02787583    1045384 0.02545076     558478 0.02387776

Part a

What are the units of observation in laughs?

ANSWER:

Part b

Our goal is to create a line plot similar to that in the section titled “The evolution of lol”:

First, create a new laugh_plot_data set that you’ll need to make this plot. Show the first 6 rows of the data and confirm that it includes 55 rows (one per laugh type / year combination), and 3 columns: id, year, and share. HINTS:

Remove any columns you don’t need for this analysis.
Only keep track of 5 laugh types: “lol”, “haha”, “lmao”, “ha”, “heh”
Make sure that share is on the scale from 0-100, not 0-1.

# Define your dataset


# Show the first 6 rows


# Confirm that the dataset has 55 rows and 3 columns

Part c

Use your data from Part b to create a plot that’s similar to that from the article. For example, the colors and styling don’t have to be the same. Discuss your observations.

Discussion:

Exercise 3: Bikes

In the next 5 exercises, you’ll explore two datasets related to Capital Bikeshare, a bike share company, in Washington, DC. To begin, the trips data includes information about 10,000 bike rentals during the last quarter of 2014:

# Import the data
# Only keep certain variables of interest
trips <- readRDS(gzcon(url("https://mac-stat.github.io/data/2014-Q4-Trips-History-Data-Small.rds"))) %>% 
  select(client, sstation, sdate, duration) %>% 
  mutate(duration = as.numeric(hms(duration))/60)

# Check it out
head(trips)
##            client                            sstation               sdate
## 344758 Registered                      15th & L St NW 2014-11-06 16:26:00
## 113251 Registered                       3rd & D St SE 2014-10-12 11:30:00
## 633756     Casual                      10th & E St NW 2014-12-27 14:24:00
## 466862     Casual                       4th & M St SW 2014-11-23 16:42:00
## 474332 Registered 1st & Washington Hospital Center NW 2014-11-24 17:29:00
## 581597 Registered                 11th & Kenyon St NW 2014-12-15 13:11:00
##        duration
## 344758   9.2500
## 113251  47.3500
## 633756 166.3667
## 466862  15.2500
## 474332  18.5500
## 581597   2.6000

For each rental, trips has information on:

client = whether the renter is a Registered member of the bike share service or a Casual renter
sstation = the name of the station where the rental started
sdate = the date and time that a rental started
duration = the duration of the rental in minutes

Part a

The sdate variable contains a lot of information! Use it to define (and store) 5 new variables in the trips data:

s_date = as_date(sdate): the date (not including time) that the rental started
s_day_of_week = day of week that the rental started, labeled Sun, Mon, etc
s_hour = hour of day that the rentals started (0-23)
s_minute = minute within the hour that the rentals started (0-59)
s_time_of_day = time of day that the rentals started in decimal notation (e.g. 3:30 should be 3.5)

HINTS:

Recall that the lubridate package (part of the tidyverse) has some handy functions, including hour(), minute(), wday(label = TRUE).
The time of day can be calculated by hour + minute/60.

# Define the new variables
# Once you're confident, store them under trips


# Confirm that your new dataset has 10000 rows and 9 columns
dim(trips)
## [1] 10000     4

# Convince yourself that the first entry has
# s_date = 2014-11-06, s_day_of_week = Thu, s_hour = 16, s_minute = 26, s_time_of_day = 16.43333

Part b

Let’s warm up by exploring some basic patterns in the data.

# Calculate the shortest, average, and longest rental duration 


# Show the rentals that left from Lincoln Memorial on a Monday and lasted under 10 minutes
# HINT: You should get two data points!


# Calculate the total number of rides taken on each day of the week
# Sort from highest to lowest


# Identify the 2 dates (s_date) on which there were the fewest rentals
# (Think about the significance of these dates)

Exercise 4: When do people ride?

Let’s explore temporal patterns in bike rentals using data viz. You will make a series of plots and summarize them in the last part of the exercise.

Part a

How did the volume of bike trips vary throughout the study period? Construct a univariate plot of s_date.

Part b

How did the volume of bike trips vary at different times of the day? Construct a univariate plot of s_time_of_day.

Part c

How do ridership patterns vary by both time of day and day of week? Facet your plot from Part b by s_day_of_week.

Part d

Summarize, in words, the temporal patterns you observed in Parts a, b, and c.

Exercise 5: Registered vs casual riders

Let’s explore how ridership patterns might differ among registered and casual riders. You will make a series of plots and calculations, and summarize them in the last part of the problem.

Part a

How do trip durations compare among registered and casual riders? Calculate the shortest, average, and longest rental duration among registered and casual riders.

Part b

On what days of the week do registered and casual riders tend to ride more? Construct a plot of s_day_of_week, faceted by client.

Part c

At what times of day and days of the week do registered and casual riders tend to ride more? Construct density plots of s_time_of_day faceted by s_day_of_week and colored by client status.

Part d

Summarize some key takeaways from Parts a, b, and c. What do they tell us about registered vs casual riders? What is your explanation for why these observations make sense?

Exercise 6: Where do people ride?

Beyond information about when the bike trips started, the trips data includes information about sstation, the stations where each bike trip started. Let’s explore.

Part a

Make a table with the ten stations with the highest number of departures (head(10)). Define this as popular_stations and print / show the data table. HINT: Your first row should have 2 entries: Columbus Circle / Union Station, 241

# Define popular_stations

# Print out popular_stations (just type popular_stations)

Part b

Get a dataset of only the trips that departed from the ten most popular_stations. Store this as popular_trips. HINT: Use a join operation.

# Define popular_trips

# Confirm that popular_trips has 1525 rows and 9 columns

Part c

Get a dataset of the trips that did not depart from the most popular_stations. Store this as unpopular_trips.

# Define unpopular_trips

# Confirm that unpopular_trips has 8475 rows and 9 columns

Exercise 7: Spatial trends

Let’s bring in spatial information about the bike stations! The stations dataset includes the latitude / longitude coordinates and other details for the bike rental stations:

# Import data on the bike stations
# Only keep certain variables of interest
stations <- read_csv("https://mac-stat.github.io/data/DC-Stations.csv") %>% 
  select(name, lat, long)

# Check it out
head(stations)
## # A tibble: 6 × 3
##   name                                         lat  long
##   <chr>                                      <dbl> <dbl>
## 1 20th & Bell St                              38.9 -77.1
## 2 18th & Eads St.                             38.9 -77.1
## 3 20th & Crystal Dr                           38.9 -77.0
## 4 15th & Crystal Dr                           38.9 -77.0
## 5 Aurora Hills Community Ctr/18th & Hayes St  38.9 -77.1
## 6 Pentagon City Metro / 12th & S Hayes St     38.9 -77.1

Part a

To the popular_stations data, tack on new information about the latitude and longitude of the 10 most popular stations.

# Define popular_stations with the additional lat / long variables
# NOTE: Don't store the data until you know it's what you want!


# Confirm that popular_stations has 10 rows and 4 columns

Part b

Construct a leaflet map of the stations data:

include circles representing each station in the Capital Bikeshare system
add markers (the upside-down teardrop / flag symbols) that indicate the locations of the 10 most popular stations. NOTE: Since these 10 locations are in the popular_stations not stations data, you’ll need to add a data = popular_stations argument to this layer.

library(leaflet)

Part c

Write a 1-sentence summary of what you learned from this map.

Finalize your homework

Render your qmd one more time and check out the rendered html.
- Confirm that the html appears as you expect it and that it’s correctly formatted.
- Confirm that you haven’t accidentally printed out long datasets.
- Review your answers and make sure you addressed each question. For example, several questions ask for both some code / plot and a discussion or summary in words.
Submit your html file to the Homework 5 assignment on Moodle.
You’re done with Homework 5. Congrats!!