Data Wrangling

Six Main Verbs

Brianna Heggeseth

Announcements

This week in MSCS

  • Thursday 11:15am: Coffee Break!

Looking Ahead

  • Week 5: Data Wrangling
  • Week 6: Data Wrangling
  • Week 7: Mini Project/Capstone Days!
  • Week 8: Midterm Review/Midterm

Due this Week

  • Assignment 4 (Effective, Spatial Viz) on Weds [via Moodle]
  • At least 1 Tidy Tuesday (TT) by next Friday [via Moodle, TT4 or TT5]

Data Wrangling

Getting the data in the tidy format that we want…

  • to visualize (glyph-ready: one row per glyph)
  • to summarize
  • to learn more about the data

Learning Goals

  • Understand and be able to use the following verbs appropriate: select, mutate, filter, arrange, summarize, group_by
  • Develop working knowledge of working with dates and lubridate functions

Six Main Verbs

Verbs that change the variables (columns) but not the cases (rows)

  • select
  • mutate

Verbs that change the cases (rows) but not the variables (columns)

  • filter
  • arrange

Grouped summaries

  • summarize
  • group_by

Six Main Verbs

Verbs that change the variables (columns) but not the cases (rows)

  • select
    • Action: Provides a subset of variables
    • Inputs: data, variable names
  • mutate
    • Action: creates new variables
    • Inputs: data, new_variable_name = how_you_define_new_var
    • Examples: mutate(data, var2 = var^2)

Verbs that change the cases (rows) but not the variables (columns)

  • filter
    • Action: shows subset of rows
    • Inputs: data, Boolean conditions based on variables
    • Examples: filter(data, year > 2000)
  • arrange
    • Action: sorts rows
    • Inputs: data, variable names, desc(variable name) if by descending order
    • Examples: arrange(data, desc(n))

Grouped summaries

  • summarize
    • Action: collapses rows and calculates a summary
    • Inputs: data, new_variable_name = expression_used_to_summarize
    • Example: summarize(data, avgHeight = mean(height))
  • group_by
    • Action: creates a grouping structure within data
    • Inputs: data, names of variables to define grouping structure
    • Example: data %>% group_by(sport) %>% summarize(avgHeight = mean(height))

Visual Illustration

https://tidydatatutor.com/vis.html#

Template File

Download a template .Rmd of this activity. Put the file in a Assignment_05 folder within your COMP_STAT_112 folder.

  • This .Rmd only contains examples that we’ll work on in class and exercises you’ll finish for Assignment 5.

Data Example

The data table Birthdays in the mosaicData package gives the number of births recorded on each day of the year in each state from 1969 to 1988.

library(mosaicData)

#select() is not a required step but makes the data simpler to work with right now
Birthdays <- Birthdays %>% select(state, date, year, births) 

#?Birthdays in Console for documentation
head(Birthdays)
  state       date year births
1    AK 1969-01-01 1969     14
2    AL 1969-01-01 1969    174
3    AR 1969-01-01 1969     78
4    AZ 1969-01-01 1969     84
5    CA 1969-01-01 1969    824
6    CO 1969-01-01 1969    100
tail(Birthdays)
       state       date year births
372859    VA 1988-12-31 1988    201
372860    VT 1988-12-31 1988     21
372861    WA 1988-12-31 1988    157
372862    WI 1988-12-31 1988    167
372863    WV 1988-12-31 1988     45
372864    WY 1988-12-31 1988     18

First Example

Consider the Birthdays data

  1. Add two new variables to the Birthdays data: one that has only the last two digits of the year, and one that states whether there were more than 100 births in the given state on the given date.
BirthdaysExtra <- mutate(Birthdays, 
      year_short = year - 1900,
      busy_birthday = (births > 100)
)
  1. Then form a new table that only has three columns: the state and your two new columns.
BirthdaysExtraTable <- select(BirthdaysExtra, 
                         state,
                         year_short,
                         busy_birthday
                         )
  1. What does the following operation return: select(Birthdays, ends_with("te"))?
select(Birthdays, ends_with("te")) %>% head()
  state       date
1    AK 1969-01-01
2    AL 1969-01-01
3    AR 1969-01-01
4    AZ 1969-01-01
5    CA 1969-01-01
6    CO 1969-01-01

Second Example

Create a table with only births in Massachusetts in 1979, and sort the days from those with the most births to those with the fewest.

MABirths1979 <- filter(Birthdays, state == "MA", year == 1979)
MABirths1979Sorted <- arrange(MABirths1979, desc(births))

head(MABirths1979Sorted)
  state       date year births
1    MA 1979-09-28 1979    262
2    MA 1979-09-11 1979    252
3    MA 1979-12-28 1979    249
4    MA 1979-09-26 1979    246
5    MA 1979-07-24 1979    245
6    MA 1979-04-27 1979    243

Third Example

Consider the Birthdays data again.

  1. Find the average number of daily births (per state) in each year.
  2. Find the average number of daily births in each year, by state.
BirthdaysYear <- group_by(Birthdays, year)
summarise(BirthdaysYear, average = mean(births))
# A tibble: 20 × 2
    year average
   <int>   <dbl>
 1  1969    192.
 2  1970    200.
 3  1971    191.
 4  1972    175.
 5  1973    169.
 6  1974    170.
 7  1975    169.
 8  1976    170.
 9  1977    179.
10  1978    179.
11  1979    188.
12  1980    194.
13  1981    195.
14  1982    198.
15  1983    196.
16  1984    197.
17  1985    202.
18  1986    202.
19  1987    205.
20  1988    210.
BirthdaysYearState <- group_by(Birthdays, year, state)
summarise(BirthdaysYearState, average = mean(births))
# A tibble: 1,020 × 3
# Groups:   year [20]
    year state average
   <int> <chr>   <dbl>
 1  1969 AK       18.6
 2  1969 AL      174. 
 3  1969 AR       91.3
 4  1969 AZ       93.3
 5  1969 CA      954. 
 6  1969 CO      110. 
 7  1969 CT      134. 
 8  1969 DC       75.3
 9  1969 DE       27.6
10  1969 FL      292. 
# … with 1,010 more rows

Piping

QuickMABirths1979 <-
  Birthdays %>%
  filter(state == "MA", year == 1979) %>%
  arrange(desc(births))

With the pipe notation, x %>% f(y) becomes f(x,y), where in the first line here, x is Birthdays, the function f is filter, and y is state == "MA", year == 1979.

Dates

Birthdays <-
  Birthdays %>%
  mutate(
    month = month(date, label = TRUE),
    weekday = wday(date, label = TRUE)
  )

Make a table showing the five states with the most births between September 9, 1979 and September 11, 1979, inclusive. Arrange the table in descending order of births.

  Birthdays %>%
  filter(date >= ymd("1979-09-09"), date <= ymd("1979-09-11")) %>%
  group_by(state) %>%
  summarise(total = sum(births)) %>%
  arrange(desc(total)) %>%
  head(n = 5)
# A tibble: 5 × 2
  state total
  <chr> <int>
1 CA     3246
2 TX     2347
3 NY     1943
4 IL     1673
5 OH     1408

Rest of Class

Continue working on the activity; check in with your classmates.

Don’t leave anyone left struggling alone!

After Class

This activity is all code, no interpretations.

There are 12 exercises to give you plenty of practice with these important six tasks!

You’ll finish the activity for Assignment 5.