Data Wrangling

Six Main Verbs

Brianna Heggeseth

Announcements

MSCS Events

  • Thursday 11:15am: Coffee Break!
  • What events would you like? Let me know!

Looking Ahead

  • Week 5: Data Wrangling
  • Week 6: Data Wrangling
  • Week 7: Mini Project/Midterm Review
  • Week 8: Midterm/Fall Break!

Due next Week

  • Assignment 4 (Spatial Viz) on Weds [via Moodle]
  • September 2023 Reflection [write in Google Doc shared with you]
  • Tidy Tuesday on Friday (at least 1 of TT1-TT5)

Reflection

Let’s take 5 minutes to start that reflection.

  • Write under Sept 2023.
  • See the prompts at the top of the document. Respond to any combo of them.
  • Goal: You are sharing your perspective about your learning that may not be reflected in what you turn in.
  • Shared only between you and Brianna (you can/should be vulnerable about struggles/barriers so I can support you)

Data Wrangling

What is it?

Data wrangling is getting the data in the tidy format that we want…

  • to visualize (glyph-ready: one row per glyph)
  • to summarize
  • to learn more about the data

Learning Goals

  • Understand and be able to use the following verbs appropriate: select, mutate, filter, arrange, summarize, group_by
  • Develop working knowledge of working with dates and lubridate functions

Six Main Verbs

Verbs that change the variables (columns) but not the cases (rows)

  • select
  • mutate

Verbs that change the cases (rows) but not the variables (columns)

  • filter
  • arrange

Grouped summaries

  • summarize
  • group_by

Six Main Verbs

Verbs that change the variables (columns) but not the cases (rows)

  • select
    • Action: Provides a subset of variables
    • Inputs: data, variable names
    • Example: select(data,var1,var2,var3)
  • mutate
    • Action: creates new variables
    • Inputs: data, new_variable_name = how_you_define_new_var
    • Examples: mutate(data, var2 = var^2)

Verbs that change the cases (rows) but not the variables (columns)

  • filter
    • Action: shows subset of rows
    • Inputs: data, Boolean conditions based on variables
    • Examples: filter(data, year > 2000)
  • arrange
    • Action: sorts rows
    • Inputs: data, variable names, desc(variable name) if by descending order
    • Examples: arrange(data, desc(n))

Grouped summaries

  • summarize
    • Action: collapses rows and calculates a summary
    • Inputs: data, new_variable_name = expression_used_to_summarize
    • Example: summarize(data, avgHeight = mean(height)
  • group_by
    • Action: creates a grouping structure within data
    • Inputs: data, names of variables to define grouping structure
    • Example: data %>% group_by(sport) %>% summarize(avgHeight = mean(height))

WAIT! What is that?

%>% is called a pipe.

  • It serves as a way to “pass” objects (usually datasets) on the left to a function on the right as the 1st input.

LEFT_OBJECT %>% RIGHT_FUNCTION()

is the same as RIGHT_FUNCTION(LEFT_OBJECT)

Visual Illustration

https://tidydatatutor.com/vis.html#

Template File

Download a template .Rmd of this activity. Put the file in a Assignment_05 folder within your COMP_STAT_112 folder.

  • This .Rmd only contains examples that we’ll work on in class and exercises you’ll finish for Assignment 5.

Data Example

The data table Birthdays in the mosaicData package gives the number of births recorded on each day of the year in each state from 1969 to 1988.

library(mosaicData)

#select() is not a required step but makes the data simpler to work with right now
Birthdays <- Birthdays %>% select(state, date, year, births) 

#?Birthdays in Console for documentation
head(Birthdays)
  state       date year births
1    AK 1969-01-01 1969     14
2    AL 1969-01-01 1969    174
3    AR 1969-01-01 1969     78
4    AZ 1969-01-01 1969     84
5    CA 1969-01-01 1969    824
6    CO 1969-01-01 1969    100
tail(Birthdays)
       state       date year births
372859    VA 1988-12-31 1988    201
372860    VT 1988-12-31 1988     21
372861    WA 1988-12-31 1988    157
372862    WI 1988-12-31 1988    167
372863    WV 1988-12-31 1988     45
372864    WY 1988-12-31 1988     18

First Example

Consider the Birthdays data

  1. Add two new variables to the Birthdays data: one that has only the last two digits of the year, and one that states whether there were more than 100 births in the given state on the given date.
BirthdaysExtra <- mutate(Birthdays, 
      year_short = year - 1900,
      busy_birthday = (births > 100)
)
  1. Then form a new table that only has three columns: the state and your two new columns.
BirthdaysExtraTable <- select(BirthdaysExtra, 
                         state,
                         year_short,
                         busy_birthday
                         )
  1. What does the following operation return: select(Birthdays, ends_with("te"))?
select(Birthdays, ends_with("te")) %>% head()
  state       date
1    AK 1969-01-01
2    AL 1969-01-01
3    AR 1969-01-01
4    AZ 1969-01-01
5    CA 1969-01-01
6    CO 1969-01-01

Second Example

Create a table with only births in Massachusetts in 1979, and sort the days from those with the most births to those with the fewest.

MABirths1979 <- filter(Birthdays, state == "MA", year == 1979)
MABirths1979Sorted <- arrange(MABirths1979, desc(births))

head(MABirths1979Sorted)
  state       date year births
1    MA 1979-09-28 1979    262
2    MA 1979-09-11 1979    252
3    MA 1979-12-28 1979    249
4    MA 1979-09-26 1979    246
5    MA 1979-07-24 1979    245
6    MA 1979-04-27 1979    243

Third Example

Consider the Birthdays data again.

  1. Find the average number of daily births in each year (average across states).
  2. Find the average number of daily births in each year, by state.
BirthdaysYear <- group_by(Birthdays, year)
summarise(BirthdaysYear, average = mean(births))
# A tibble: 20 × 2
    year average
   <int>   <dbl>
 1  1969    192.
 2  1970    200.
 3  1971    191.
 4  1972    175.
 5  1973    169.
 6  1974    170.
 7  1975    169.
 8  1976    170.
 9  1977    179.
10  1978    179.
11  1979    188.
12  1980    194.
13  1981    195.
14  1982    198.
15  1983    196.
16  1984    197.
17  1985    202.
18  1986    202.
19  1987    205.
20  1988    210.
BirthdaysYearState <- group_by(Birthdays, year, state)
summarise(BirthdaysYearState, average = mean(births))
# A tibble: 1,020 × 3
# Groups:   year [20]
    year state average
   <int> <chr>   <dbl>
 1  1969 AK       18.6
 2  1969 AL      174. 
 3  1969 AR       91.3
 4  1969 AZ       93.3
 5  1969 CA      954. 
 6  1969 CO      110. 
 7  1969 CT      134. 
 8  1969 DC       75.3
 9  1969 DE       27.6
10  1969 FL      292. 
# ℹ 1,010 more rows

Piping

QuickMABirths1979 <-
  Birthdays %>%
  filter(state == "MA", year == 1979) %>%
  arrange(desc(births))

With the pipe notation, x %>% f(y) becomes f(x,y), where in the first line here, x is Birthdays, the function f is filter, and y is state == "MA", year == 1979.

Dates

Birthdays <-
  Birthdays %>%
  mutate(
    month = month(date, label = TRUE),
    weekday = wday(date, label = TRUE)
  )

Make a table showing the five states with the most births between September 9, 1979 and September 11, 1979, inclusive. Arrange the table in descending order of births.

  Birthdays %>%
  filter(date >= ymd("1979-09-09"), date <= ymd("1979-09-11")) %>%
  group_by(state) %>%
  summarise(total = sum(births)) %>%
  arrange(desc(total)) %>%
  head(n = 5)
# A tibble: 5 × 2
  state total
  <chr> <int>
1 CA     3246
2 TX     2347
3 NY     1943
4 IL     1673
5 OH     1408

Rest of Class

Continue working on the activity; check in with your classmates.

Don’t leave anyone left struggling alone!

After Class

This activity is all code, no interpretations.

There are 12 exercises to give you plenty of practice with these important six tasks!

You’ll finish the activity for Assignment 5 (due in 1.5 weeks!).