Topic 7 Six Main Verbs

Learning Goals

  • Understand and be able to use the following verbs appropriate: select, mutate, filter, arrange, summarize, group_by
  • Develop working knowledge of working with dates and lubridate functions

You can download a template .Rmd of this activity here. Add it to a folder called Assignment_05 in your COMP_STAT_112 folder.

Data Wrangling Introduction

Example: US Births

The number of daily births in the US varies over the year and from day to day. What’s surprising to many people is that the variation from one day to the next can be huge: some days have only about 80% as many births as others. Why? In this activity we’ll use basic data wrangling skills to understand some drivers of daily births.

The data table Birthdays in the mosaicData package gives the number of births recorded on each day of the year in each state from 1969 to 1988.3

Table 7.1: A subset of the initial birthday data.
state date year births
AK 1969-01-01 1969 14
AL 1969-01-01 1969 174
AR 1969-01-01 1969 78
AZ 1969-01-01 1969 84
CA 1969-01-01 1969 824
CO 1969-01-01 1969 100

Tidy Data

Additional reading:

There are different ways to store and represent the same data. In order to be consistent and to also take advantage of the vectorized nature of R, the tidyverse packages we’ll use provide a set of three interrelated rules/conventions for a dataset to be tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

One of the first things we’ll often do when acquiring new data is to “tidy it” into this form. For now, we can already start thinking of a data frame (tibble) as a table whose rows are the individual cases and whose columns are the variables on which we have information for each individual case. The first figure from the tidyr cheat sheet summarizes this principle.

Data Verbs

Additional reading:

There are six main data transformation verbs in the dplyr library. Each verb takes an input data frame along with additional arguments specifying the action, and returns a new data frame. We’ll examine them in three pairs.

Verbs that change the variables (columns) but not the cases (rows)

The first two verbs change which variables (columns) are included in the data frame, but preserve the same set of cases (rows).

  • select() chooses which columns to keep, or put another way, deletes those colummns that are not selected. To specify the columns, we can either list them out, or use functions like starts_with(), ends_with(), or contains() to specify the titles of the variables we wish to keep.

  • mutate() adds one or more columns to the data frame. Each column is a function of the other columns that is applied on a row by row basis. For example, we can use arithmetic operations like adding two other variables or logical operations like checking if two columns are equal, or equal to a target number.

Example 7.1 (select and mutate) Consider the Birthdays data

  1. Add two new variables to the Birthdays data: one that has only the last two digits of the year, and one that states whether there were more than 100 births in the given state on the given date.
  2. Then form a new table that only has three columns: the state and your two new columns.
  3. What does the following operation return: select(Birthdays, ends_with("te"))?
Solution

The commands for the first two parts are

BirthdaysExtra <- mutate(Birthdays,
  year_short = year - 1900,
  busy_birthday = (births > 100)
)

BirthdaysExtraTable <- select(
  BirthdaysExtra, state,
  year_short, busy_birthday
)

select(Birthdays, ends_with("te")) %>% head()
##   state       date
## 1    AK 1969-01-01
## 2    AL 1969-01-01
## 3    AR 1969-01-01
## 4    AZ 1969-01-01
## 5    CA 1969-01-01
## 6    CO 1969-01-01
The operation in (c) selects only the first two columns state and date because they end in ‘te’.


Verbs that change the cases (rows) but not the variables (columns)

The next two verbs change which cases (rows) are included in the data frame, but preserve the same set of variables (columns).

  • filter() deletes some of the rows by specifying which rows to keep.

  • arrange() reorders the rows according to a specified criteria. To sort in reverse order based on the variable x, use arrange(desc(x)).

Example 7.2 (filter and arrange) Create a table with only births in Massachusetts in 1979, and sort the days from those with the most births to those with the fewest.

Solution

We want to filter and then arrange:

MABirths1979 <- filter(Birthdays, state == "MA", year == 1979)
MABirths1979Sorted <- arrange(MABirths1979, desc(births))
Table 7.2: Birthdays in Massachusetts in 1979, sorted from those dates with the most births to those dates with the fewest births.
state date year births
MA 1979-09-28 1979 262
MA 1979-09-11 1979 252
MA 1979-12-28 1979 249
MA 1979-09-26 1979 246
MA 1979-07-24 1979 245
MA 1979-04-27 1979 243


When filtering, we often use logical comparison operators like ==, >, <, >= (greater than or equal to), <= (less than or equal to), and %in%, which compares the value to a list of entries.4 For example, if we want all births in AK, CA, and MA, we can write

filter(Birthdays, state %in% c("AK","CA","MA"))

The c() here is for concatenate, which is how we form vectors in R.

Grouped summaries

  • summarize() (or equivalently summarise()) takes an entire data frame as input and outputs a single row with one or more summary statistics, such as mean(), sum(), sd(), n_distinct(), or n() (which just counts the number of entries/rows).
summarize(Birthdays,
  total_births = sum(births),
  average_births = mean(births),
  nstates = n_distinct(state), ncases = n()
)
##   total_births average_births nstates ncases
## 1     70486538       189.0409      51 372864

So summarize changes both the cases and the variables. Alone, summarize is not all that useful, because we can also access individual variables directly with the dollar sign. For example, to find the total and average births, we can write

sum(Birthdays$births)
## [1] 70486538
mean(Birthdays$births)
## [1] 189.0409

Rather, we will mostly use it to create grouped summaries, which brings us to the last of the six main data verbs.

  • group_by() groups the cases of a data frame by a specified set of variables. The size of the stored data frame does not actually change (neither the cases nor the variables change), but then other functions can be applied to the specified groups instead of the entire data set. We’ll often use group_by in conjunction with summarize to get a grouped summary.

Example 7.3 (grouped summary) Consider the Birthdays data again.

  1. Find the average number of daily births in each year (average across states).
  2. Find the average number of daily births in each year, by state.
Solution

We have to first group by the desired grouping and then perform a summarize.

BirthdaysYear <- group_by(Birthdays, year)
summarize(BirthdaysYear, average = mean(births))
## # A tibble: 20 × 2
##     year average
##    <int>   <dbl>
##  1  1969    192.
##  2  1970    200.
##  3  1971    191.
##  4  1972    175.
##  5  1973    169.
##  6  1974    170.
##  7  1975    169.
##  8  1976    170.
##  9  1977    179.
## 10  1978    179.
## 11  1979    188.
## 12  1980    194.
## 13  1981    195.
## 14  1982    198.
## 15  1983    196.
## 16  1984    197.
## 17  1985    202.
## 18  1986    202.
## 19  1987    205.
## 20  1988    210.
BirthdaysYearState <- group_by(Birthdays, year, state)
summarize(BirthdaysYearState, average = mean(births))
## # A tibble: 1,020 × 3
## # Groups:   year [20]
##     year state average
##    <int> <chr>   <dbl>
##  1  1969 AK       18.6
##  2  1969 AL      174. 
##  3  1969 AR       91.3
##  4  1969 AZ       93.3
##  5  1969 CA      954. 
##  6  1969 CO      110. 
##  7  1969 CT      134. 
##  8  1969 DC       75.3
##  9  1969 DE       27.6
## 10  1969 FL      292. 
## # … with 1,010 more rows


Piping

Additional reading:

Pipes offer an efficient way to execute multiple operations at once. Here is a more efficient way to redo Example 7.2 with the pipe:

QuickMABirths1979 <-
  Birthdays %>%
  filter(state == "MA", year == 1979) %>%
  arrange(desc(births))

With the pipe notation, x %>% f(y) becomes f(x,y), where in the first line here, x is Birthdays, the function f is filter, and y is state == "MA", year == 1979. The really nice thing about piping is that you can chain together a bunch of different operations without having to save the intermediate results. This is what we have done above by chaining together a filter followed by an arrange.

Manipulating Dates

Additional reading:

The date variable in Birthdays prints out in the conventional, human-readable way. But it is actually in a format (called POSIX date format) that automatically respects the order of time. The lubridate package contains helpful functions that will extract various information about any date. Here are some you might find useful:

  • year()
  • month()
  • week()
  • yday() — gives the day of the year as a number 1-366. This is often called the “Julian day.”
  • mday() — gives the day of the month as a number 1-31
  • wday() — gives the weekday (e.g. Monday, Tuesday, …). Use the optional argument label = TRUE to have the weekday spelled out rather than given as a number 1-7.

Using these lubridate functions, you can easily look at the data in more detail. For example, we can add columns to the date table for month and day of the week:5

Birthdays <-
  Birthdays %>%
  mutate(
    month = month(date, label = TRUE),
    weekday = wday(date, label = TRUE)
  )

Here is what the data table looks like with our new columns:

Table 7.3: A subset of the birthday data with additional variables.
state date year births month weekday
AK 1969-01-01 1969 14 Jan Wed
AL 1969-01-01 1969 174 Jan Wed
AR 1969-01-01 1969 78 Jan Wed
AZ 1969-01-01 1969 84 Jan Wed
CA 1969-01-01 1969 824 Jan Wed
CO 1969-01-01 1969 100 Jan Wed

Example 7.4 Make a table showing the five states with the most births between September 9, 1979 and September 11, 1979, inclusive. Arrange the table in descending order of births.

Solution

The plan of attack is to first filter the dates, then group by state, then use a summarize to add up totals for each state, and finally arrange them in descending order to find the top 5.6

SepTable <-
  Birthdays %>%
  filter(date >= ymd("1979-09-09"), date <= ymd("1979-09-11")) %>%
  group_by(state) %>%
  summarize(total = sum(births)) %>%
  arrange(desc(total)) %>%
  head(n = 5)
knitr::kable(
  SepTable[, ],
  caption = "States with the
  most births between September 9, 1979
  and September 11, 1979, inclusive."
)
Table 7.4: States with the most births between September 9, 1979 and September 11, 1979, inclusive.
state total
CA 3246
TX 2347
NY 1943
IL 1673
OH 1408


Practice Part 1: Baby Names

We are going to practice the six data verbs on the babynames dataset:

Table 7.5: A subset of the babynames data, which runs from 1880-2015 and is provided by the US Social Security Administration.
year sex name n prop
1880 F Mary 7065 0.0723836
1880 F Anna 2604 0.0266790
1880 F Emma 2003 0.0205215
1880 F Elizabeth 1939 0.0198658
1880 F Minnie 1746 0.0178884
1880 F Margaret 1578 0.0161672

Exercise 7.1 Add a new boolean (true or false) variable called has2000 that indicates whether there were more than 2000 babies of that sex assigned at birth with that name in each year. Display the first six rows of your new table.

Exercise 7.2 Find the number of total babies per year, sorted by most babies to least babies.

Exercise 7.3 Find the twelve most popular names overall (i.e., totaled over all year and sexes), ordered by popularity.

Exercise 7.4 Find the most popular names for males, over all years and ordered by popularity.

Exercise 7.5 Find the most popular names for females, over all years and ordered by popularity.

Exercise 7.6 Calculate the number of babies born each decade, and arrange them in descending order. Calculating the decade may be the trickiest part of this question!

Exercise 7.7 Calculate the most popular name for each year. Print out the answer for the years 2006-2015. This is tricky, but try Googling for hints.

Practice Part 2: US Births

Now we are ready to return to the Birthdays data set to investigate some drivers of daily births in the US.

Seasonality

For this activity, we need to work with data aggregated across the states.

Exercise 7.8 (Total Across States) Create a new data table, DailyBirths, that adds up all the births for each day across all the states. Plot out daily births vs date.

For all of the remaining exercises, start with your DailyBirths data frame.

Exercise 7.9 (Examine Seasonality) To examine seasonality in birth rates, look at the daily number of births using DailyBirths (don’t aggregate) by

  1. week of the year (1-53)
  2. month of the year (January to December)
  3. Julian day (1-366)

When are the most babies born? The fewest?

Day of the Week

Exercise 7.10 (Examine Patterns within the Week) To examine patterns within the week, make a box plot showing the daily number of births by day of the week (use DailyBirths). Interpret your results.

Holidays

Exercise 7.11 (Two Year Sample) Pick a two-year span of the DailyBirths that falls in the 1980s, say, 1980/1981. Extract out the data just in this interval, calling it MyTwoYears. (Hint: filter(), year()). Plot out the births in this two-year span day by day. Color each date according to its day of the week. Make sure to choose your font size, line widths, and color scheme to make your figure legible. Explain the pattern that you see.

The plot you generate for Exercise 7.11 should be generally consistent with the weekend effect and seasonal patterns we have already seen; however, a few days each year stand out as exceptions. We are going to examine the hypothesis that these are holidays. You can find a data set listing US federal holidays here. Read it in as follows:7

Holidays <- read_csv("https://bcheggeseth.github.io/112_fall_2022/data/US-Holidays.csv") %>%
  mutate(date = as.POSIXct(lubridate::dmy(date)))

Exercise 7.12 (Holidays) Now let’s update the plot from Exercise 7.11 to include the holidays.

  1. Add a variable to MyTwoYears called is_holiday. It should be TRUE when the day is a holiday, and FALSE otherwise. One way to do this is with the transformation verb %in%, for instance, is_holiday = date %in% Holidays$date.
  2. Add a geom_point layer to your plot that sets the color of the points based on the day of the week and the shape of the points based on whether or not the day is a holiday.
  3. Finally, some holidays seem to have more of an effect than others. It would be helpful to label them. Use geom_text with the holiday data to add labels to each of the holidays. Hints: 1. Start by making a new data table called MyHolidays that just contains the holidays in your selected two year window. 2. Start your geometry line with geom_text(data=MyHolidays). 3. You’ll have to make up a y-coordinate for each label. 4. You can set the orientation of each label with the angle argument; e.g., geom_text(data=MyHolidays, angle=40, ...).

Geography

In any way you choose, explore the effect of geography on birth patterns using the original Birthdays data. For example, do parents in Minnesota have fewer winter babies than in other states? Which states have the largest increases or decreases in their portion of US births over time? Is the weekend effect less strong for states with a higher percentage of their populations living in rural areas?

If you have extra time or want some extra practice, pick any issue (not all of these) that interests you, explore it, and create a graphic to illustrate your findings.

Superstition

This article from FiveThirtyEight demonstrates that fewer babies are born on the 13th of each month, and the effect is even stronger when the 13th falls on a Friday. If you have extra time or want some extra practice, you can try to recreate the first graphic in the article.

Appendix: R Functions

Six Main Verbs

Function/Operator Action Example
select() Provides a subset of variables select(Birthdays, state, date, year, births)
mutate() Creates a new variable mutate(Birthdays, year_short = year - 1900)
filter() Provides a subset of rows filter(Birthdays, state %in% c("AK","CA","MA"))
arrange() Sorts the rows of a dataset arrange(Birthdays, desc(births))
summarize() Collapses rows into summaries of variables across rows summarize(Birthdays,total_births = sum(births), average_births = mean(births), nstates = n_distinct(state), ncases = n())
group_by() Collapses rows into summaries of variables across rows group_by(Birthdays, year, state) %>% summarize(average = mean(births))

Logical/Boolean Operators

Function/Operator Action Example
== Checks whether two items are equal year == 2000
> Checks whether left is greater than the right year > 2000
< Checks whether left is less than the right year < 2000
>= Checks whether left is greater than or equal to right year >= 2000
<= Checks whether left is less than or equal to right year <= 2000
!= Checks whether left is not equal to right year != 2000
%in% Checks whether left is in vector on right state %in% c("AK","CA","MA")

  1. The fivethirtyeight package has more recent data.↩︎

  2. Important note about = vs. ==: A single = is an assignment operator that assigns the value after the equal sign to the variable before the equal sign. We saw an example of this above with year_short = year - 1900. In order to compare whether two values are the same, we need to use the double equal == as in year == 1979.↩︎

  3. The label = TRUE argument tells month to return a string abbreviation for the month instead of the month’s number.↩︎

  4. The verbs head(n = 5), tail(n = 3) are often used just after an arrange to keep, e.g., only the first 5 entries or last 3 entries, where n specifies the number of entries to keep.↩︎

  5. The point of the lubridate::dmy() function is to convert the character-string date stored in the CSV to a POSIX date-number.↩︎