Topic 7 Six Main Verbs
Learning Goals
- Understand and be able to use the following verbs appropriate:
select
,mutate
,filter
,arrange
,summarize
,group_by
- Develop an understanding what code will do conceptually without running it
- Develop working knowledge of working with dates and
lubridate
functions
You can download a template .Rmd of this activity here. Add it to a folder called Assignment_05
in your COMP_STAT_112
folder.
Data Wrangling Introduction
Example: US Births
The number of daily births in the US varies over the year and from day to day. What’s surprising to many people is that the variation from one day to the next can be huge: some days have only about 80% as many births as others. Why? In this activity we’ll use basic data wrangling skills to understand some drivers of daily births.
The data table Birthdays
in the mosaicData
package gives the number of births recorded on each day of the year in each state from 1969 to 1988.3
state | date | year | births |
---|---|---|---|
AK | 1969-01-01 | 1969 | 14 |
AL | 1969-01-01 | 1969 | 174 |
AR | 1969-01-01 | 1969 | 78 |
AZ | 1969-01-01 | 1969 | 84 |
CA | 1969-01-01 | 1969 | 824 |
CO | 1969-01-01 | 1969 | 100 |
Tidy Data
Additional reading:
There are different ways to store and represent the same data. In order to be consistent and to also take advantage of the vectorized nature of R
, the tidyverse
packages we’ll use provide a set of three interrelated rules/conventions for a dataset to be tidy
:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
One of the first things we’ll often do when acquiring new data is to “tidy it” into this form. For now, we can already start thinking of a data frame (tibble) as a table whose rows are the individual cases and whose columns are the variables on which we have information for each individual case. The first figure from the tidyr
cheat sheet summarizes this principle.
Data Verbs
Additional reading:
There are six main data transformation verbs in the dplyr
library. Each verb takes an input data frame along with additional arguments specifying the action, and returns a new data frame. We’ll examine them in three pairs.
Verbs that change the variables (columns) but not the cases (rows)
The first two verbs change which variables (columns) are included in the data frame, but preserve the same set of cases (rows).
select()
chooses which columns to keep, or put another way, deletes those colummns that are not selected. To specify the columns, we can either list them out, or use functions likestarts_with()
,ends_with()
, orcontains()
to specify the titles of the variables we wish to keep.mutate()
adds one or more columns to the data frame. Each column is a function of the other columns that is applied on a row by row basis. For example, we can use arithmetic operations like adding two other variables or logical operations like checking if two columns are equal, or equal to a target number.
Example 7.1 (select and mutate) Consider the Birthdays
data
- Add two new variables to the
Birthdays
data: one that has only the last two digits of the year, and one that states whether there were more than 100 births in the given state on the given date.
- Then form a new table that only has three columns: the state and your two new columns.
- What does the following operation return (describe the output):
select(Birthdays, ends_with("te"))
?
Solution
The commands for the first two parts are
BirthdaysExtra <- mutate(Birthdays,
year_short = year - 1900,
busy_birthday = (births > 100)
)
BirthdaysExtraTable <- select(
BirthdaysExtra, state,
year_short, busy_birthday
)
select(Birthdays, ends_with("te")) %>% head()
## state date
## 1 AK 1969-01-01
## 2 AL 1969-01-01
## 3 AR 1969-01-01
## 4 AZ 1969-01-01
## 5 CA 1969-01-01
## 6 CO 1969-01-01
The operation in (c) provides a data set with only the two columns state
and date
because they end in ‘te’.
Verbs that change the cases (rows) but not the variables (columns)
The next two verbs change which cases (rows) are included in the data frame, but preserve the same set of variables (columns).
filter()
deletes some of the rows by specifying which rows to keep.arrange()
reorders the rows according to a specified criteria. To sort in reverse order based on the variablex
, usearrange(desc(x))
.
Example 7.2 (filter and arrange) Create a table with only births in Massachusetts in 1979, and sort the days from those with the most births to those with the fewest.
Solution
We want to filter
and then arrange
:
MABirths1979 <- filter(Birthdays, state == "MA", year == 1979)
MABirths1979Sorted <- arrange(MABirths1979, desc(births))
state | date | year | births |
---|---|---|---|
MA | 1979-09-28 | 1979 | 262 |
MA | 1979-09-11 | 1979 | 252 |
MA | 1979-12-28 | 1979 | 249 |
MA | 1979-09-26 | 1979 | 246 |
MA | 1979-07-24 | 1979 | 245 |
MA | 1979-04-27 | 1979 | 243 |
When filtering, we often use logical comparison operators like ==
, >
, <
, >=
(greater than or equal to), <=
(less than or equal to), and %in%
, which compares the value to a list of entries.4 For example, if we want all births in AK, CA, and MA, we can write
filter(Birthdays, state %in% c("AK","CA","MA"))
The c()
here is for concatenate, which is how we form vectors in R
.
Grouped summaries
summarize()
(or equivalentlysummarise()
) takes an entire data frame as input and outputs a single row with one or more summary statistics, such asmean()
,sum()
,sd()
,n_distinct()
, orn()
(which just counts the number of entries/rows).
summarize(Birthdays,
total_births = sum(births),
average_births = mean(births),
nstates = n_distinct(state), ncases = n()
)
## total_births average_births nstates ncases
## 1 70486538 189.0409 51 372864
So summarize
changes both the cases and the variables. Alone, summarize
is not all that useful, because we can also access individual variables directly with the dollar sign. For example, to find the total and average births, we can write
## [1] 70486538
## [1] 189.0409
Rather, we will mostly use it to create grouped summaries, which brings us to the last of the six main data verbs.
group_by()
groups the cases of a data frame by a specified set of variables. The size of the stored data frame does not actually change (neither the cases nor the variables change), but then other functions can be applied to the specified groups instead of the entire data set. We’ll often usegroup_by
in conjunction withsummarize
to get a grouped summary.
Example 7.3 (grouped summary) Consider the Birthdays
data again.
- Find the average number of daily births in each year (average across states).
- Find the average number of daily births in each year, by state.
Solution
We have to first group by the desired grouping and then perform a summarize
.
## # A tibble: 20 × 2
## year average
## <int> <dbl>
## 1 1969 192.
## 2 1970 200.
## 3 1971 191.
## 4 1972 175.
## 5 1973 169.
## 6 1974 170.
## 7 1975 169.
## 8 1976 170.
## 9 1977 179.
## 10 1978 179.
## 11 1979 188.
## 12 1980 194.
## 13 1981 195.
## 14 1982 198.
## 15 1983 196.
## 16 1984 197.
## 17 1985 202.
## 18 1986 202.
## 19 1987 205.
## 20 1988 210.
BirthdaysYearState <- group_by(Birthdays, year, state)
summarize(BirthdaysYearState, average = mean(births))
## # A tibble: 1,020 × 3
## # Groups: year [20]
## year state average
## <int> <chr> <dbl>
## 1 1969 AK 18.6
## 2 1969 AL 174.
## 3 1969 AR 91.3
## 4 1969 AZ 93.3
## 5 1969 CA 954.
## 6 1969 CO 110.
## 7 1969 CT 134.
## 8 1969 DC 75.3
## 9 1969 DE 27.6
## 10 1969 FL 292.
## # ℹ 1,010 more rows
Piping
Additional reading:
Pipes offer an efficient way to execute multiple operations at once. Here is a more efficient way to redo Example 7.2 with the pipe:
With the pipe notation, x %>% f(y)
becomes f(x,y)
, where in the first line here, x
is Birthdays
, the function f
is filter
, and y
is state == "MA", year == 1979
. The really nice thing about piping is that you can chain together a bunch of different operations without having to save the intermediate results. This is what we have done above by chaining together a filter
followed by an arrange
.
Manipulating Dates
Additional reading:
The date
variable in Birthdays
prints out in the conventional, human-readable way. But it is actually in a format (called POSIX
date format) that automatically respects the order of time. The lubridate
package contains helpful functions that will extract various information about any date. Here are some you might find useful:
year()
month()
week()
yday()
— gives the day of the year as a number 1-366. This is often called the “Julian day.”mday()
— gives the day of the month as a number 1-31wday()
— gives the weekday (e.g. Monday, Tuesday, …). Use the optional argumentlabel = TRUE
to have the weekday spelled out rather than given as a number 1-7.
Using these lubridate
functions, you can easily look at the data in more detail. For example, we can add columns to the date table for month and day of the week:5
Birthdays <-
Birthdays %>%
mutate(
month = month(date, label = TRUE),
weekday = wday(date, label = TRUE)
)
## Error in `mutate()`:
## ℹ In argument: `month = month(date, label = TRUE)`.
## Caused by error in `month()`:
## ! unused argument (label = TRUE)
Here is what the data table looks like with our new columns:
state | date | year | births |
---|---|---|---|
AK | 1969-01-01 | 1969 | 14 |
AL | 1969-01-01 | 1969 | 174 |
AR | 1969-01-01 | 1969 | 78 |
AZ | 1969-01-01 | 1969 | 84 |
CA | 1969-01-01 | 1969 | 824 |
CO | 1969-01-01 | 1969 | 100 |
Example 7.4 Make a table showing the five states with the most births between September 9, 1979 and September 11, 1979, inclusive. Arrange the table in descending order of births.
Solution
The plan of attack is to first filter the dates, then group by state, then use a summarize
to add up totals for each state, and finally arrange them in descending order to find the top 5.6
SepTable <-
Birthdays %>%
filter(date >= ymd("1979-09-09"), date <= ymd("1979-09-11")) %>%
group_by(state) %>%
summarize(total = sum(births)) %>%
arrange(desc(total)) %>%
head(n = 5)
knitr::kable(
SepTable[, ],
caption = "States with the
most births between September 9, 1979
and September 11, 1979, inclusive."
)
state | total |
---|---|
CA | 3246 |
TX | 2347 |
NY | 1943 |
IL | 1673 |
OH | 1408 |
Practice Part 1: Baby Names
We are going to practice the six data verbs on the babynames
dataset:
year | sex | name | n | prop |
---|---|---|---|---|
1880 | F | Mary | 7065 | 0.0723836 |
1880 | F | Anna | 2604 | 0.0266790 |
1880 | F | Emma | 2003 | 0.0205215 |
1880 | F | Elizabeth | 1939 | 0.0198658 |
1880 | F | Minnie | 1746 | 0.0178884 |
1880 | F | Margaret | 1578 | 0.0161672 |
Exercise 7.1 Add a new boolean (true or false) variable called has2000
that indicates whether there were more than 2000 babies of that sex assigned at birth with that name in each year. Display the first six rows of your new table.
Exercise 7.2 Find the number of total babies per year, sorted by most babies to least babies.
Exercise 7.3 Find the twelve most popular names overall (i.e., totaled over all year and sexes), ordered by popularity.
Exercise 7.4 Find the most popular names for males, over all years and ordered by popularity.
Exercise 7.5 For each line of code below, describe the output passed to the next function. Write your answers as comments in the code. Try to see if you can describe the output without running the code, an important skill to develop!
babynames %>%
filter( year >= 1900, year < 2000) %>%
mutate(YearCat = cut(year, c(1900,1950,2000), right = FALSE, labels = c('Early 1900','Late 1900'))) %>%
group_by(sex,YearCat,name) %>%
summarize(n = sum(n)) %>%
arrange(sex,YearCat, n) %>%
group_by(sex,YearCat) %>%
mutate(rank = dense_rank(-n)) %>%
arrange(sex,rank) %>%
head(10)
Exercise 7.6 Calculate the number of babies born each decade, and arrange them in descending order. Calculating the decade may be the trickiest part of this question!
Exercise 7.7 Calculate the most popular name for each year. Print out the answer for the years 2006-2015. This is tricky, but try Googling for hints.
Practice Part 2: US Births
Now we are ready to return to the Birthdays
data set to investigate some drivers of daily births in the US.
Seasonality
For this activity, we need to work with data aggregated across the states.
Exercise 7.8 (Total Across States) Create a new data table, DailyBirths
, that adds up all the births for each day across all the states. Plot out daily births vs date.
For all of the remaining exercises, start with your DailyBirths
data frame.
Exercise 7.9 (Examine Seasonality) To examine seasonality in birth rates, look at the daily number of births using DailyBirths
(don’t aggregate) by
- week of the year (1-53)
- month of the year (January to December)
- Julian day (1-366)
When are the most babies born? The fewest?
Day of the Week
Exercise 7.10 (Examine Patterns within the Week) To examine patterns within the week, make a box plot showing the daily number of births by day of the week (use DailyBirths
). Interpret your results.
Holidays
Exercise 7.11 (Two Year Sample) Pick a two-year span of the DailyBirths
that falls in the 1980s, say, 1980/1981. Extract out the data just in this interval, calling it MyTwoYears
. (Hint: filter()
, year()
). Plot out the births in this two-year span day by day. Color each date according to its day of the week. Make sure to choose your font size, line widths, and color scheme to make your figure legible. Explain the pattern that you see.
The plot you generate for Exercise 7.11 should be generally consistent with the weekend effect and seasonal patterns we have already seen; however, a few days each year stand out as exceptions. We are going to examine the hypothesis that these are holidays. You can find a data set listing US federal holidays here. Read it in as follows:7
Holidays <- read_csv("https://bcheggeseth.github.io/112_fall_2022/data/US-Holidays.csv") %>%
mutate(date = as.POSIXct(lubridate::dmy(date)))
Exercise 7.12 (Holidays) Now let’s update the plot from Exercise 7.11 to include the holidays.
- Add a variable to
MyTwoYears
calledis_holiday
. It should beTRUE
when the day is a holiday, andFALSE
otherwise. One way to do this is with the transformation verb%in%
, for instance,is_holiday = date %in% Holidays$date
.
- Add a
geom_point
layer to your plot that sets the color of the points based on the day of the week and the shape of the points based on whether or not the day is a holiday.
- Finally, some holidays seem to have more of an effect than others. It would be helpful to label them. Use
geom_text
with the holiday data to add labels to each of the holidays. Hints: 1. Start by making a new data table calledMyHolidays
that just contains the holidays in your selected two year window. 2. Start your geometry line withgeom_text(data=MyHolidays)
. 3. You’ll have to make up a y-coordinate for each label. 4. You can set the orientation of each label with theangle
argument; e.g.,geom_text(data=MyHolidays, angle=40, ...)
.
Geography
In any way you choose, explore the effect of geography on birth patterns using the original Birthdays
data. For example, do parents in Minnesota have fewer winter babies than in other states? Which states have the largest increases or decreases in their portion of US births over time? Is the weekend effect less strong for states with a higher percentage of their populations living in rural areas?
If you have extra time or want some extra practice, pick any issue (not all of these) that interests you, explore it, and create a graphic to illustrate your findings.
Superstition
This article from FiveThirtyEight demonstrates that fewer babies are born on the 13th of each month, and the effect is even stronger when the 13th falls on a Friday. If you have extra time or want some extra practice, you can try to recreate the first graphic in the article.
Appendix: R Functions
Six Main Verbs
Function/Operator | Action | Example |
---|---|---|
select() |
Provides a subset of variables | select(Birthdays, state, date, year, births) |
mutate() |
Creates a new variable | mutate(Birthdays, year_short = year - 1900) |
filter() |
Provides a subset of rows | filter(Birthdays, state %in% c("AK","CA","MA")) |
arrange() |
Sorts the rows of a dataset | arrange(Birthdays, desc(births)) |
summarize() |
Collapses rows into summaries of variables across rows | summarize(Birthdays,total_births = sum(births), average_births = mean(births), nstates = n_distinct(state), ncases = n()) |
group_by() |
Collapses rows into summaries of variables across rows | group_by(Birthdays, year, state) %>% summarize(average = mean(births)) |
Logical/Boolean Operators
Function/Operator | Action | Example |
---|---|---|
== |
Checks whether two items are equal | year == 2000 |
> |
Checks whether left is greater than the right | year > 2000 |
< |
Checks whether left is less than the right | year < 2000 |
>= |
Checks whether left is greater than or equal to right | year >= 2000 |
<= |
Checks whether left is less than or equal to right | year <= 2000 |
!= |
Checks whether left is not equal to right | year != 2000 |
%in% |
Checks whether left is in vector on right | state %in% c("AK","CA","MA") |
The
fivethirtyeight
package has more recent data.↩︎Important note about
=
vs.==
: A single=
is an assignment operator that assigns the value after the equal sign to the variable before the equal sign. We saw an example of this above withyear_short = year - 1900
. In order to compare whether two values are the same, we need to use the double equal==
as inyear == 1979
.↩︎The
label = TRUE
argument tellsmonth
to return a string abbreviation for the month instead of the month’s number.↩︎The verbs
head(n = 5)
,tail(n = 3)
are often used just after anarrange
to keep, e.g., only the first 5 entries or last 3 entries, wheren
specifies the number of entries to keep.↩︎The point of the
lubridate::dmy()
function is to convert the character-string date stored in the CSV to a POSIX date-number.↩︎