Data Wrangling

Six Main Verbs

Brianna Heggeseth

Announcements

This week in MSCS

  • Wednesday (4:45-7pm): MSCS Block Party! Outside OLRI near Tennis Courts
  • Thursday 11:15am: Coffee Break!

Looking Ahead

  • Week 5: Data Wrangling
  • Week 6: Data Wrangling
  • Week 7: Mini Project/Midterm Review
  • Week 8: Midterm/Fall Break

Due this Week

  • Assignment 5 (Spatial Viz) on Tuesday [via Moodle]
  • Self-Reflection on Friday [individual Google Doc]
  • At least 1 Tidy Tuesday (TT) by Friday [via Moodle, TT4]
    • Choose 1 TT to iterate on by Friday [Via Moodle, IV0]

Data Wrangling

Getting the data in the tidy format that we want…

  • to visualize (glyph-ready: one row per glyph)
  • to summarize
  • to learn more about the data

Learning Goals

  • Understand and be able to use the following verbs appropriate: select, mutate, filter, arrange, summarize, group_by
  • Develop working knowledge of working with dates and lubridate functions

Six Main Verbs

Verbs that change the variables (columns) but not the cases (rows)

  • select
  • mutate

Verbs that change the cases (rows) but not the variables (columns)

  • filter
  • arrange

Grouped summaries

  • summarize
  • group_by

Six Main Verbs

Verbs that change the variables (columns) but not the cases (rows)

  • select
    • Action: Provides a subset of variables
    • Inputs: data, variable names
  • mutate
    • Action: creates new variables
    • Inputs: data, new_variable_name = how_you_define_new_var
    • Examples: mutate(data, var2 = var^2)

Verbs that change the cases (rows) but not the variables (columns)

  • filter
    • Action: shows subset of rows
    • Inputs: data, Boolean conditions based on variables
    • Examples: filter(data, year > 2000)
  • arrange
    • Action: sorts rows
    • Inputs: data, variable names, desc(variable name) if by descending order
    • Examples: arrange(data, desc(n))

Grouped summaries

  • summarize
    • Action: collapses rows and calculates a summary
    • Inputs: data, new_variable_name = expression_used_to_summarize
    • Example: summarize(data, avgHeight = mean(height))
  • group_by
    • Action: creates a grouping structure within data
    • Inputs: data, names of variables to define grouping structure
    • Example: data %>% group_by(sport) %>% summarize(avgHeight = mean(height))

Visual Illustration

https://tidydatatutor.com/vis.html#

Template File

Download a template .Rmd of this activity. Put the file in a Day_07 folder within your COMP_STAT_112 folder.

  • This .Rmd only contains exercises that we’ll work on in class and you’ll finish for Assignment 6.

Data Example

The data table Birthdays in the mosaicData package gives the number of births recorded on each day of the year in each state from 1969 to 1988.

library(mosaicData)
Birthdays <- Birthdays %>% select(state, date, year, births)

First Exercise

Consider the Birthdays data

  1. Add two new variables to the Birthdays data: one that has only the last two digits of the year, and one that states whether there were more than 100 births in the given state on the given date.
BirthdaysExtra <- mutate(Birthdays, 
      year_short = year - 1900,
      busy_birthday = (births > 100)
)
  1. Then form a new table that only has three columns: the state and your two new columns.
BirthdaysExtraTable <- select(BirthdaysExtra, 
                         state,
                         year_short,
                         busy_birthday
                         )
  1. What does the following operation return: select(Birthdays, ends_with("te"))?
select(Birthdays, ends_with("te")) %>% head()
  state       date
1    AK 1969-01-01
2    AL 1969-01-01
3    AR 1969-01-01
4    AZ 1969-01-01
5    CA 1969-01-01
6    CO 1969-01-01

Second Exercise

Create a table with only births in Massachusetts in 1979, and sort the days from those with the most births to those with the fewest.

MABirths1979 <- filter(Birthdays, state == "MA", year == 1979)
MABirths1979Sorted <- arrange(MABirths1979, desc(births))

head(MABirths1979Sorted)
  state       date year births
1    MA 1979-09-28 1979    262
2    MA 1979-09-11 1979    252
3    MA 1979-12-28 1979    249
4    MA 1979-09-26 1979    246
5    MA 1979-07-24 1979    245
6    MA 1979-04-27 1979    243

Third Exercise

Consider the Birthdays data again.

  1. Find the average number of daily births (per state) in each year.
  2. Find the average number of daily births in each year, by state.
BirthdaysYear <- group_by(Birthdays, year)
summarise(BirthdaysYear, average = mean(births))
# A tibble: 20 × 2
    year average
   <int>   <dbl>
 1  1969    192.
 2  1970    200.
 3  1971    191.
 4  1972    175.
 5  1973    169.
 6  1974    170.
 7  1975    169.
 8  1976    170.
 9  1977    179.
10  1978    179.
11  1979    188.
12  1980    194.
13  1981    195.
14  1982    198.
15  1983    196.
16  1984    197.
17  1985    202.
18  1986    202.
19  1987    205.
20  1988    210.
BirthdaysYearState <- group_by(Birthdays, year, state)
summarise(BirthdaysYearState, average = mean(births))
# A tibble: 1,020 × 3
# Groups:   year [20]
    year state average
   <int> <chr>   <dbl>
 1  1969 AK       18.6
 2  1969 AL      174. 
 3  1969 AR       91.3
 4  1969 AZ       93.3
 5  1969 CA      954. 
 6  1969 CO      110. 
 7  1969 CT      134. 
 8  1969 DC       75.3
 9  1969 DE       27.6
10  1969 FL      292. 
# … with 1,010 more rows

Piping

QuickMABirths1979 <-
  Birthdays %>%
  filter(state == "MA", year == 1979) %>%
  arrange(desc(births))

With the pipe notation, x %>% f(y) becomes f(x,y), where in the first line here, x is Birthdays, the function f is filter, and y is state == "MA", year == 1979.

Dates

Birthdays <-
  Birthdays %>%
  mutate(
    month = month(date, label = TRUE),
    weekday = wday(date, label = TRUE)
  )

Make a table showing the five states with the most births between September 9, 1979 and September 11, 1979, inclusive. Arrange the table in descending order of births.

  Birthdays %>%
  filter(date >= ymd("1979-09-09"), date <= ymd("1979-09-11")) %>%
  group_by(state) %>%
  summarise(total = sum(births)) %>%
  arrange(desc(total)) %>%
  head(n = 5)
# A tibble: 5 × 2
  state total
  <chr> <int>
1 CA     3246
2 TX     2347
3 NY     1943
4 IL     1673
5 OH     1408

Rest of Class

Continue working on the activity; check in with your classmates.

Don’t leave anyone left struggling alone!

After Class

This activity is all code, no interpretations.

There are many exercises to give you plenty of practice with these important six tasks!

You’ll finish the activity for Assignment 6.