6  Data wrangling - Part 1

Settling In

You choose where you want to sit today.

  • Introduce yourself
  • Check in as human beings

As we settle in,

  • Go to Project Brainstorming, make adjustments to category titles, and add your name to categories you’d be interested in working on.

  • Download a template Quarto file to start from here. Put this file in a folder called wrangling within the activities folder for this course.

Data Storytelling Moment

Go to https://www.abc.net.au/news/2018-12-13/how-life-has-changed-for-people-your-age/10303912?nw=0&r=HtmlFragment#

  • What is the data story?
  • What is effective?
  • What could be improved?

Learning goals

After this lesson, you should be able to:

  • Determine the class of a given object and identify concerns to be wary of when manipulating an object of that class (numerics, logicals, factors, dates, strings, data.frames)
  • Explain what vector recycling is, when it can be a problem, and how to avoid those problems
  • Use a variety of functions to wrangle numerical and logical data
  • Extract date-time information using the lubridate package
  • Use the forcats package to wrangle factor data





Helpful cheatsheets

RStudio (Posit) maintains a collection of wonderful cheatsheets. The following will be helpful:

Data Wrangling Verbs (from Stat/Comp 112)

  • mutate(): creates/changes columns/elements in a data frame/tibble
  • select(): keeps subset of columns/elements in a data frame/tibble
  • filter(): keeps subsets of rows in a data frame/tibble
  • arrange(): sorts rows in a data frame/tibble
  • group_by(): internally groups rows in a data frame/tibble by values in 1 or more columns/elements
  • summarize(): collapses/combines information across rows using functions such as n(), sum(), mean(), min(), max(), median(), sd()
  • count(): shortcut for group_by() %>% summarize(n = n())
  • left_join(): mutating join of two data frames/tibbles keeping all rows in left data frame
  • full_join(): mutating join of two data frames/tibbles keeping all rows in both data frames
  • inner_join(): mutating join of two data frames/tibbles keeping rows in left data frame that find match in right
  • semi_join(): filtering join of two data frames/tibbles keeping rows in left data frame that find match in right
  • anti_join(): filtering join of two data frames/tibbles keeping rows in left data frame that do not find match in right
  • pivot_wider(): rearrange values from two columns to many (one column becomes the names of new variables, one column becomes the values of the new variables)
  • pivot_longer(): rearrange values from many columns to two (the names of the columns go to one new variable, the values of the columns go to a second new variable)
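
The verbs above chain together with the pipe. As a refresher, here is a minimal sketch using the built-in mtcars dataset (the cyl and mpg columns come from that dataset):

```r
library(dplyr)

# Average mpg and count of cars within each cylinder group,
# sorted from most to least efficient
mtcars %>%
    filter(mpg > 15) %>%                         # keep a subset of rows
    group_by(cyl) %>%                            # group rows by cylinder count
    summarize(avg_mpg = mean(mpg), n = n()) %>%  # collapse each group to one row
    arrange(desc(avg_mpg))                       # sort the result
```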





Vectors

An atomic vector is a storage container in R where all elements in the container are of the same type. The types that are relevant to data science are:

  • logical (also known as boolean)
  • numbers
    • integer
    • numeric floating point (also known as double)
  • character string
  • Date and date-time (saved as POSIXct)
  • factor
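
A quick illustration of these classes (a sketch; the expected class names are shown in comments):

```r
library(lubridate)

class(c(TRUE, FALSE))            # "logical"
class(1L:3L)                     # "integer"
class(c(1.5, 2))                 # "numeric"
class(c("a", "b"))               # "character"
class(ymd("2024-01-01"))         # "Date"
class(factor(c("low", "high")))  # "factor"
```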

. . .

Function documentation will refer to vectors frequently.

See examples below:

  • ggplot2::scale_x_continuous()
    • breaks: A numeric vector of positions
    • labels: A character vector giving labels (must be same length as breaks)
  • shiny::sliderInput()
    • value: The initial value of the slider […] A length one vector will create a regular slider; a length two vector will create a double-ended range slider.
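
For instance, the two slider variants might be created like this (a sketch; the inputId and label values are made up):

```r
library(shiny)

# Regular slider: `value` is a length-one vector
sliderInput("price", "Price:", min = 0, max = 100, value = 50)

# Double-ended range slider: `value` is a length-two vector
sliderInput("price_range", "Price range:", min = 0, max = 100, value = c(20, 80))
```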

. . .





When you need a vector, you can create one manually using

  • c(): the combine function

Or you can create one based on available data using

  • dataset %>% mutate(newvar = variable > 5) %>% pull(newvar): creating a new column in a dataset and extracting it as a vector
  • dataset %>% pull(variable) %>% unique(): extracting one column from a dataset and finding its unique values
c("Fair", "Good", "Very Good", "Premium", "Ideal")
[1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"    
diamonds %>% pull(cut) %>% unique()
[1] Ideal     Premium   Good      Very Good Fair     
Levels: Fair < Good < Very Good < Premium < Ideal

Logicals

Notes

What does a logical vector look like?

x <- c(TRUE, FALSE, NA)
x
[1]  TRUE FALSE    NA
class(x)
[1] "logical"

. . .

You will often create logical vectors with comparison operators: >, <, <=, >=, ==, !=.

x <- c(1, 2, 9, 12)
x < 2
[1]  TRUE FALSE FALSE FALSE
x <= 2
[1]  TRUE  TRUE FALSE FALSE
x > 9
[1] FALSE FALSE FALSE  TRUE
x >= 9
[1] FALSE FALSE  TRUE  TRUE
x == 12
[1] FALSE FALSE FALSE  TRUE
x != 12
[1]  TRUE  TRUE  TRUE FALSE

. . .

When you want to check for set membership, the %in% operator is the correct way to do this (as opposed to ==).

x <- c(1, 2, 9, 4)
x == c(1, 2, 4)
Warning in x == c(1, 2, 4): longer object length is not a multiple of shorter
object length
[1]  TRUE  TRUE FALSE FALSE
x %in% c(1, 2, 4)
[1]  TRUE  TRUE FALSE  TRUE

. . .

The Warning: longer object length is not a multiple of shorter object length is a manifestation of vector recycling.

In R, if two vectors are combined or compared, the shorter one is repeated (recycled) to match the length of the longer one, even when the longer vector's length isn't a multiple of the shorter vector's length. We can see the exact recycling that happens below:

x <- c(1, 2, 9, 4)
x == c(1, 2, 4)
[1]  TRUE  TRUE FALSE FALSE
x == c(1, 2, 4, 1) # This line demonstrates the recycling that happens on the previous line
[1]  TRUE  TRUE FALSE FALSE

. . .

Logical vectors can also be created with functions. is.na() is one useful example:

x <- c(1, 4, 9, NA)
x == NA
[1] NA NA NA NA
is.na(x)
[1] FALSE FALSE FALSE  TRUE

. . .

We can negate a logical object with !. We can combine logical objects with & (and) and | (or).

x <- c(1, 2, 4, 9)
x > 1 & x < 5
[1] FALSE  TRUE  TRUE FALSE
!(x > 1 & x < 5)
[1]  TRUE FALSE FALSE  TRUE
x < 2 | x > 8
[1]  TRUE FALSE FALSE  TRUE

. . .

We can summarize logical vectors with:

  • any(): Are ANY of the values TRUE?
  • all(): Are ALL of the values TRUE?
  • sum(): How many of the values are TRUE?
  • mean(): What fraction of the values are TRUE?
x <- c(1, 2, 4, 9)
any(x == 1)
[1] TRUE
all(x < 10)
[1] TRUE
sum(x == 1)
[1] 1
mean(x == 1)
[1] 0.25

if_else() and case_when() are functions that allow you to return values depending on the value of a logical vector. You’ll explore the documentation for these in the following exercises.

Note: ifelse() (from base R) and if_else() (from tidyverse) are different functions. We prefer if_else() for many reasons (examples below).

  • Noisy (it errors out) to make sure you catch issues/bugs
  • Can explicitly handle missing values
  • Keeps dates as dates

Examples
x <- c(-1, -2, 4, 9, NA)

ifelse(x > 0, 'positive', 'negative')
[1] "negative" "negative" "positive" "positive" NA        
if_else(x > 0, 'positive', 'negative')
[1] "negative" "negative" "positive" "positive" NA        
ifelse(x > 0, 1, 'negative') # Bad: doesn't complain with combo of data types
[1] "negative" "negative" "1"        "1"        NA        
if_else(x > 0, 1, 'negative') # Good: noisy to make sure you catch issues
Error in `if_else()`:
! Can't combine `true` <double> and `false` <character>.
if_else(x > 0, 'positive', 'negative', missing = 'missing') # Good: can explicitly handle NA
[1] "negative" "negative" "positive" "positive" "missing" 
fun_dates <- mdy('1-1-2025') + 0:365
ifelse(fun_dates < today(), fun_dates + years(), fun_dates) # Bad: converts dates to integers
  [1] 20454 20455 20456 20457 20458 20459 20460 20461 20462 20463 20464 20465
 [13] 20466 20467 20468 20469 20470 20471 20472 20473 20474 20475 20476 20477
 [25] 20478 20479 20480 20481 20482 20483 20484 20485 20486 20487 20488 20489
 [37] 20490 20491 20492 20493 20494 20495 20496 20497 20498 20499 20500 20501
 [49] 20502 20503 20504 20505 20506 20507 20508 20509 20510 20511 20512 20513
 [61] 20514 20515 20516 20517 20518 20519 20520 20521 20522 20523 20524 20525
 [73] 20526 20527 20528 20529 20530 20531 20532 20533 20534 20535 20536 20537
 [85] 20538 20539 20540 20541 20542 20543 20544 20545 20546 20547 20548 20549
 [97] 20550 20551 20552 20553 20554 20555 20556 20557 20193 20194 20195 20196
[109] 20197 20198 20199 20200 20201 20202 20203 20204 20205 20206 20207 20208
[121] 20209 20210 20211 20212 20213 20214 20215 20216 20217 20218 20219 20220
[133] 20221 20222 20223 20224 20225 20226 20227 20228 20229 20230 20231 20232
[145] 20233 20234 20235 20236 20237 20238 20239 20240 20241 20242 20243 20244
[157] 20245 20246 20247 20248 20249 20250 20251 20252 20253 20254 20255 20256
[169] 20257 20258 20259 20260 20261 20262 20263 20264 20265 20266 20267 20268
[181] 20269 20270 20271 20272 20273 20274 20275 20276 20277 20278 20279 20280
[193] 20281 20282 20283 20284 20285 20286 20287 20288 20289 20290 20291 20292
[205] 20293 20294 20295 20296 20297 20298 20299 20300 20301 20302 20303 20304
[217] 20305 20306 20307 20308 20309 20310 20311 20312 20313 20314 20315 20316
[229] 20317 20318 20319 20320 20321 20322 20323 20324 20325 20326 20327 20328
[241] 20329 20330 20331 20332 20333 20334 20335 20336 20337 20338 20339 20340
[253] 20341 20342 20343 20344 20345 20346 20347 20348 20349 20350 20351 20352
[265] 20353 20354 20355 20356 20357 20358 20359 20360 20361 20362 20363 20364
[277] 20365 20366 20367 20368 20369 20370 20371 20372 20373 20374 20375 20376
[289] 20377 20378 20379 20380 20381 20382 20383 20384 20385 20386 20387 20388
[301] 20389 20390 20391 20392 20393 20394 20395 20396 20397 20398 20399 20400
[313] 20401 20402 20403 20404 20405 20406 20407 20408 20409 20410 20411 20412
[325] 20413 20414 20415 20416 20417 20418 20419 20420 20421 20422 20423 20424
[337] 20425 20426 20427 20428 20429 20430 20431 20432 20433 20434 20435 20436
[349] 20437 20438 20439 20440 20441 20442 20443 20444 20445 20446 20447 20448
[361] 20449 20450 20451 20452 20453 20454
if_else(fun_dates < today(), fun_dates + years(), fun_dates) # Good: keeps dates as dates
  [1] "2026-01-01" "2026-01-02" "2026-01-03" "2026-01-04" "2026-01-05"
  [6] "2026-01-06" "2026-01-07" "2026-01-08" "2026-01-09" "2026-01-10"
 [11] "2026-01-11" "2026-01-12" "2026-01-13" "2026-01-14" "2026-01-15"
 [16] "2026-01-16" "2026-01-17" "2026-01-18" "2026-01-19" "2026-01-20"
 [21] "2026-01-21" "2026-01-22" "2026-01-23" "2026-01-24" "2026-01-25"
 [26] "2026-01-26" "2026-01-27" "2026-01-28" "2026-01-29" "2026-01-30"
 [31] "2026-01-31" "2026-02-01" "2026-02-02" "2026-02-03" "2026-02-04"
 [36] "2026-02-05" "2026-02-06" "2026-02-07" "2026-02-08" "2026-02-09"
 [41] "2026-02-10" "2026-02-11" "2026-02-12" "2026-02-13" "2026-02-14"
 [46] "2026-02-15" "2026-02-16" "2026-02-17" "2026-02-18" "2026-02-19"
 [51] "2026-02-20" "2026-02-21" "2026-02-22" "2026-02-23" "2026-02-24"
 [56] "2026-02-25" "2026-02-26" "2026-02-27" "2026-02-28" "2026-03-01"
 [61] "2026-03-02" "2026-03-03" "2026-03-04" "2026-03-05" "2026-03-06"
 [66] "2026-03-07" "2026-03-08" "2026-03-09" "2026-03-10" "2026-03-11"
 [71] "2026-03-12" "2026-03-13" "2026-03-14" "2026-03-15" "2026-03-16"
 [76] "2026-03-17" "2026-03-18" "2026-03-19" "2026-03-20" "2026-03-21"
 [81] "2026-03-22" "2026-03-23" "2026-03-24" "2026-03-25" "2026-03-26"
 [86] "2026-03-27" "2026-03-28" "2026-03-29" "2026-03-30" "2026-03-31"
 [91] "2026-04-01" "2026-04-02" "2026-04-03" "2026-04-04" "2026-04-05"
 [96] "2026-04-06" "2026-04-07" "2026-04-08" "2026-04-09" "2026-04-10"
[101] "2026-04-11" "2026-04-12" "2026-04-13" "2026-04-14" "2025-04-15"
[106] "2025-04-16" "2025-04-17" "2025-04-18" "2025-04-19" "2025-04-20"
[111] "2025-04-21" "2025-04-22" "2025-04-23" "2025-04-24" "2025-04-25"
[116] "2025-04-26" "2025-04-27" "2025-04-28" "2025-04-29" "2025-04-30"
[121] "2025-05-01" "2025-05-02" "2025-05-03" "2025-05-04" "2025-05-05"
[126] "2025-05-06" "2025-05-07" "2025-05-08" "2025-05-09" "2025-05-10"
[131] "2025-05-11" "2025-05-12" "2025-05-13" "2025-05-14" "2025-05-15"
[136] "2025-05-16" "2025-05-17" "2025-05-18" "2025-05-19" "2025-05-20"
[141] "2025-05-21" "2025-05-22" "2025-05-23" "2025-05-24" "2025-05-25"
[146] "2025-05-26" "2025-05-27" "2025-05-28" "2025-05-29" "2025-05-30"
[151] "2025-05-31" "2025-06-01" "2025-06-02" "2025-06-03" "2025-06-04"
[156] "2025-06-05" "2025-06-06" "2025-06-07" "2025-06-08" "2025-06-09"
[161] "2025-06-10" "2025-06-11" "2025-06-12" "2025-06-13" "2025-06-14"
[166] "2025-06-15" "2025-06-16" "2025-06-17" "2025-06-18" "2025-06-19"
[171] "2025-06-20" "2025-06-21" "2025-06-22" "2025-06-23" "2025-06-24"
[176] "2025-06-25" "2025-06-26" "2025-06-27" "2025-06-28" "2025-06-29"
[181] "2025-06-30" "2025-07-01" "2025-07-02" "2025-07-03" "2025-07-04"
[186] "2025-07-05" "2025-07-06" "2025-07-07" "2025-07-08" "2025-07-09"
[191] "2025-07-10" "2025-07-11" "2025-07-12" "2025-07-13" "2025-07-14"
[196] "2025-07-15" "2025-07-16" "2025-07-17" "2025-07-18" "2025-07-19"
[201] "2025-07-20" "2025-07-21" "2025-07-22" "2025-07-23" "2025-07-24"
[206] "2025-07-25" "2025-07-26" "2025-07-27" "2025-07-28" "2025-07-29"
[211] "2025-07-30" "2025-07-31" "2025-08-01" "2025-08-02" "2025-08-03"
[216] "2025-08-04" "2025-08-05" "2025-08-06" "2025-08-07" "2025-08-08"
[221] "2025-08-09" "2025-08-10" "2025-08-11" "2025-08-12" "2025-08-13"
[226] "2025-08-14" "2025-08-15" "2025-08-16" "2025-08-17" "2025-08-18"
[231] "2025-08-19" "2025-08-20" "2025-08-21" "2025-08-22" "2025-08-23"
[236] "2025-08-24" "2025-08-25" "2025-08-26" "2025-08-27" "2025-08-28"
[241] "2025-08-29" "2025-08-30" "2025-08-31" "2025-09-01" "2025-09-02"
[246] "2025-09-03" "2025-09-04" "2025-09-05" "2025-09-06" "2025-09-07"
[251] "2025-09-08" "2025-09-09" "2025-09-10" "2025-09-11" "2025-09-12"
[256] "2025-09-13" "2025-09-14" "2025-09-15" "2025-09-16" "2025-09-17"
[261] "2025-09-18" "2025-09-19" "2025-09-20" "2025-09-21" "2025-09-22"
[266] "2025-09-23" "2025-09-24" "2025-09-25" "2025-09-26" "2025-09-27"
[271] "2025-09-28" "2025-09-29" "2025-09-30" "2025-10-01" "2025-10-02"
[276] "2025-10-03" "2025-10-04" "2025-10-05" "2025-10-06" "2025-10-07"
[281] "2025-10-08" "2025-10-09" "2025-10-10" "2025-10-11" "2025-10-12"
[286] "2025-10-13" "2025-10-14" "2025-10-15" "2025-10-16" "2025-10-17"
[291] "2025-10-18" "2025-10-19" "2025-10-20" "2025-10-21" "2025-10-22"
[296] "2025-10-23" "2025-10-24" "2025-10-25" "2025-10-26" "2025-10-27"
[301] "2025-10-28" "2025-10-29" "2025-10-30" "2025-10-31" "2025-11-01"
[306] "2025-11-02" "2025-11-03" "2025-11-04" "2025-11-05" "2025-11-06"
[311] "2025-11-07" "2025-11-08" "2025-11-09" "2025-11-10" "2025-11-11"
[316] "2025-11-12" "2025-11-13" "2025-11-14" "2025-11-15" "2025-11-16"
[321] "2025-11-17" "2025-11-18" "2025-11-19" "2025-11-20" "2025-11-21"
[326] "2025-11-22" "2025-11-23" "2025-11-24" "2025-11-25" "2025-11-26"
[331] "2025-11-27" "2025-11-28" "2025-11-29" "2025-11-30" "2025-12-01"
[336] "2025-12-02" "2025-12-03" "2025-12-04" "2025-12-05" "2025-12-06"
[341] "2025-12-07" "2025-12-08" "2025-12-09" "2025-12-10" "2025-12-11"
[346] "2025-12-12" "2025-12-13" "2025-12-14" "2025-12-15" "2025-12-16"
[351] "2025-12-17" "2025-12-18" "2025-12-19" "2025-12-20" "2025-12-21"
[356] "2025-12-22" "2025-12-23" "2025-12-24" "2025-12-25" "2025-12-26"
[361] "2025-12-27" "2025-12-28" "2025-12-29" "2025-12-30" "2025-12-31"
[366] "2026-01-01"

Exercises

Load the diamonds dataset, and filter to the first 1000 diamonds.

data(diamonds)
diamonds <- diamonds %>% 
    slice_head(n = 1000)

Using tidyverse functions, complete the following:

  1. Subset to diamonds that are less than 400 dollars or more than 10000 dollars.
  2. Subset to diamonds that are between 500 and 600 dollars (inclusive).
  3. How many diamonds are of either Fair, Premium, or Ideal cut (a total count)? What fraction of diamonds are of Fair, Premium, or Ideal cut?
    • First, do this a wrong way with ==. Predict the warning message that you will receive.
    • Second, do this the correct way with an appropriate logical operator.
  4. Are there any diamonds of Fair cut that are more than $3000? Are all diamonds of Ideal cut more than $2000?
  5. Create two new categorized versions of price by looking up the documentation for if_else() and case_when():
    • price_cat1: “low” if price is less than 500 and “high” otherwise
    • price_cat2: “low” if price is less than 500, “medium” if price is between 500 and 1000 dollars inclusive, and “high” otherwise.
#1

#2

#3

#4

#5

Numerics

Notes

Numerical data can be of class integer or numeric (representing real numbers).

x <- 1:3
x
[1] 1 2 3
class(x)
[1] "integer"
x <- c(1+1e-9, 2, 3)
x
[1] 1 2 3
class(x)
[1] "numeric"

. . .

The Numbers chapter in R4DS covers the following functions that are all useful for wrangling numeric data:

  • n(), n_distinct(): Counting and counting the number of unique values
  • sum(is.na()): Counting the number of missing values
  • min(), max()
  • pmin(), pmax(): Get the min and max across several vectors
  • Integer division: %/%. Remainder: %%
    • 121 %/% 100 = 1 and 121 %% 100 = 21
  • round(), floor(), ceiling(): Rounding functions (to a specified number of decimal places, to the largest integer below a number, to the smallest integer above a number)
  • cut(): Cut a numerical vector into categories
  • cumsum(), cummean(), cummin(), cummax(): Cumulative functions
  • rank(): Provide the ranks of the numbers in a vector
  • lead(), lag(): shift a vector by padding with NAs
  • Numerical summaries: mean, median, min, max, quantile, sd, IQR
    • Note that all numerical summary functions have an na.rm argument that should be set to TRUE if you have missing data.
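
A few of these in action (a quick sketch; expected results are shown in comments):

```r
library(dplyr)

x <- c(121, 350, 47)
x %/% 100          # integer division: 1 3 0
x %% 100           # remainder: 21 50 47
round(3.14159, 2)  # 3.14
floor(3.7)         # 3
ceiling(3.2)       # 4
cumsum(x)          # running total: 121 471 518
lag(x)             # shift right, padding with NA: NA 121 350
```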

Exercises

Exercises will be on HW4.

The best way to add these functions and operators to your vocabulary is to need to recall them. Refer to the list of functions above as you try the exercises.

You will need to reference function documentation to look at arguments and look in the Examples section.

Dates

Notes

The lubridate package contains useful functions for working with dates and times. The lubridate function reference is a useful resource for finding the functions you need. We’ll take a brief tour of this reference page.

. . .

We’ll use the lakers dataset in the lubridate package to illustrate some examples.

lakers <- as_tibble(lakers)
head(lakers)
# A tibble: 6 × 13
     date opponent game_type time  period etype team  player result points type 
    <int> <chr>    <chr>     <chr>  <int> <chr> <chr> <chr>  <chr>   <int> <chr>
1  2.01e7 POR      home      12:00      1 jump… OFF   ""     ""          0 ""   
2  2.01e7 POR      home      11:39      1 shot  LAL   "Pau … "miss…      0 "hoo…
3  2.01e7 POR      home      11:37      1 rebo… LAL   "Vlad… ""          0 "off"
4  2.01e7 POR      home      11:25      1 shot  LAL   "Dere… "miss…      0 "lay…
5  2.01e7 POR      home      11:23      1 rebo… LAL   "Pau … ""          0 "off"
6  2.01e7 POR      home      11:22      1 shot  LAL   "Pau … "made"      2 "hoo…
# ℹ 2 more variables: x <int>, y <int>

. . .

Below we use date-time parsing functions to represent the date and time variables with date-time classes:

lakers <- lakers %>%
    mutate(
        date = ymd(date),
        time = ms(time)
    )

. . .

Below we use extraction functions to get components of the date-time objects:

lakers_clean <- lakers %>%
    mutate(
        year = year(date),
        month = month(date),
        day = day(date),
        day_of_week = wday(date, label = TRUE),
        minute = minute(time),
        second = second(time)
    )
lakers_clean %>% select(year:second)
# A tibble: 34,624 × 6
    year month   day day_of_week minute second
   <dbl> <dbl> <int> <ord>        <dbl>  <dbl>
 1  2008    10    28 Tue             12      0
 2  2008    10    28 Tue             11     39
 3  2008    10    28 Tue             11     37
 4  2008    10    28 Tue             11     25
 5  2008    10    28 Tue             11     23
 6  2008    10    28 Tue             11     22
 7  2008    10    28 Tue             11     22
 8  2008    10    28 Tue             11     22
 9  2008    10    28 Tue             11      0
10  2008    10    28 Tue             10     53
# ℹ 34,614 more rows
lakers_clean <- lakers_clean %>%
    group_by(date, opponent, period) %>%
    arrange(date, opponent, period, desc(time)) %>%
    mutate(
        diff_btw_plays_sec = as.numeric(time - lag(time, 1))
    )
lakers_clean %>% select(date, opponent, time, period, diff_btw_plays_sec)
# A tibble: 34,624 × 5
# Groups:   date, opponent, period [314]
   date       opponent time     period diff_btw_plays_sec
   <date>     <chr>    <Period>  <int>              <dbl>
 1 2008-10-28 POR      12M 0S        1                 NA
 2 2008-10-28 POR      11M 39S       1                -21
 3 2008-10-28 POR      11M 37S       1                 -2
 4 2008-10-28 POR      11M 25S       1                -12
 5 2008-10-28 POR      11M 23S       1                 -2
 6 2008-10-28 POR      11M 22S       1                 -1
 7 2008-10-28 POR      11M 22S       1                  0
 8 2008-10-28 POR      11M 22S       1                  0
 9 2008-10-28 POR      11M 0S        1                -22
10 2008-10-28 POR      10M 53S       1                 -7
# ℹ 34,614 more rows

Exercises

Exercises will be on HW4.

Factors

Notes

Creating factors

In R, factors are made up of two components: the actual values of the data and the possible levels within the factor. Creating a factor requires supplying both pieces of information.

months <- c("Mar", "Dec", "Jan",  "Apr", "Jul")

. . .

However, if we were to sort this vector, R would sort this vector alphabetically.

# alphabetical sort
sort(months)
[1] "Apr" "Dec" "Jan" "Jul" "Mar"

. . .

We can fix this sorting by creating a factor version of months. The levels argument is a character vector that specifies the unique values that the factor can take. The order of the values in levels defines the sorting of the factor.

months_fct <- factor(months, levels = month.abb) # month.abb is a built-in variable
months_fct
[1] Mar Dec Jan Apr Jul
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(months_fct)
[1] Jan Mar Apr Jul Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

. . .

What if we try to create a factor with values that aren’t in the levels? (e.g., a typo in a month name)

months2 <- c("Jna", "Mar")
factor(months2, levels = month.abb)
[1] <NA> Mar 
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

. . .

Because the NA is introduced silently (without any error or warnings), this can be dangerous. It might be better to use the fct() function in the forcats package instead:

fct(months2, levels = month.abb)
Error in `fct()`:
! All values of `x` must appear in `levels` or `na`
ℹ Missing level: "Jna"

. . .

Reordering factors

We’ll use a subset of the General Social Survey (GSS) dataset available in the forcats package.

data(gss_cat)
head(gss_cat)
# A tibble: 6 × 9
   year marital         age race  rincome        partyid     relig denom tvhours
  <int> <fct>         <int> <fct> <fct>          <fct>       <fct> <fct>   <int>
1  2000 Never married    26 White $8000 to 9999  Ind,near r… Prot… Sout…      12
2  2000 Divorced         48 White $8000 to 9999  Not str re… Prot… Bapt…      NA
3  2000 Widowed          67 White Not applicable Independent Prot… No d…       2
4  2000 Never married    39 White Not applicable Ind,near r… Orth… Not …       4
5  2000 Divorced         25 White Not applicable Not str de… None  Not …       1
6  2000 Married          25 White $20000 - 24999 Strong dem… Prot… Sout…      NA

. . .

Reordering the levels of a factor can be useful in plotting when categories would benefit from being sorted in a particular way:

relig_summary <- gss_cat %>%
    group_by(relig) %>%
    summarize(
        tvhours = mean(tvhours, na.rm = TRUE),
        n = n()
    )

ggplot(relig_summary, aes(x = tvhours, y = relig)) + 
    geom_point() +
    theme_classic()

. . .

We can use fct_reorder() in forcats.

  • The first argument is the factor that you want to reorder the levels of
  • The second argument determines how the factor is sorted (analogous to what you put inside arrange() when sorting the rows of a data frame).
ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
    geom_point() +
    theme_classic()

. . .

For bar plots, we can use fct_infreq() to reorder levels from most to least common. This can be combined with fct_rev() to reverse the order (least to most common):

gss_cat %>%
    ggplot(aes(x = marital)) +
    geom_bar() +
    theme_classic()

gss_cat %>%
    mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
    ggplot(aes(x = marital)) +
    geom_bar() +
    theme_classic()

. . .

Modifying factor levels

We talked about reordering the levels of a factor. What about changing the values of the levels themselves?

For example, the names of the political parties in the GSS could use elaboration (“str” isn’t a great label for “strong”) and some cleanup:

gss_cat %>% count(partyid)
# A tibble: 10 × 2
   partyid                n
   <fct>              <int>
 1 No answer            154
 2 Don't know             1
 3 Other party          393
 4 Strong republican   2314
 5 Not str republican  3032
 6 Ind,near rep        1791
 7 Independent         4119
 8 Ind,near dem        2499
 9 Not str democrat    3690
10 Strong democrat     3490

. . .

We can use fct_recode() on partyid with the new level names going on the left and the old levels on the right. Any levels that aren’t mentioned explicitly (i.e., “Don’t know” and “Other party”) will be left as is:

gss_cat %>%
    mutate(
        partyid = fct_recode(partyid,
            "Republican, strong"    = "Strong republican",
            "Republican, weak"      = "Not str republican",
            "Independent, near rep" = "Ind,near rep",
            "Independent, near dem" = "Ind,near dem",
            "Democrat, weak"        = "Not str democrat",
            "Democrat, strong"      = "Strong democrat"
        )
    ) %>%
    count(partyid)
# A tibble: 10 × 2
   partyid                   n
   <fct>                 <int>
 1 No answer               154
 2 Don't know                1
 3 Other party             393
 4 Republican, strong     2314
 5 Republican, weak       3032
 6 Independent, near rep  1791
 7 Independent            4119
 8 Independent, near dem  2499
 9 Democrat, weak         3690
10 Democrat, strong       3490

. . .

To combine groups, we can assign multiple old levels to the same new level (“Other” maps to “No answer”, “Don’t know”, and “Other party”):

gss_cat %>%
    mutate(
        partyid = fct_recode(partyid,
            "Republican, strong"    = "Strong republican",
            "Republican, weak"      = "Not str republican",
            "Independent, near rep" = "Ind,near rep",
            "Independent, near dem" = "Ind,near dem",
            "Democrat, weak"        = "Not str democrat",
            "Democrat, strong"      = "Strong democrat",
            "Other"                 = "No answer",
            "Other"                 = "Don't know",
            "Other"                 = "Other party"
        )
    )
# A tibble: 21,483 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Independe… Prot… Sout…      12
 2  2000 Divorced         48 White $8000 to 9999  Republica… Prot… Bapt…      NA
 3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 4  2000 Never married    39 White Not applicable Independe… Orth… Not …       4
 5  2000 Divorced         25 White Not applicable Democrat,… None  Not …       1
 6  2000 Married          25 White $20000 - 24999 Democrat,… Prot… Sout…      NA
 7  2000 Never married    36 White $25000 or more Republica… Chri… Not …       3
 8  2000 Divorced         44 White $7000 to 7999  Independe… Prot… Luth…      NA
 9  2000 Married          44 White $25000 or more Democrat,… Prot… Other       0
10  2000 Married          47 White $25000 or more Republica… Prot… Sout…       3
# ℹ 21,473 more rows

. . .

We can use fct_collapse() to collapse many levels:

gss_cat %>%
    mutate(
        partyid = fct_collapse(partyid,
            "Other" = c("No answer", "Don't know", "Other party"),
            "Republican" = c("Strong republican", "Not str republican"),
            "Independent" = c("Ind,near rep", "Independent", "Ind,near dem"),
            "Democrat" = c("Not str democrat", "Strong democrat")
        )
    ) %>%
    count(partyid)
# A tibble: 4 × 2
  partyid         n
  <fct>       <int>
1 Other         548
2 Republican   5346
3 Independent  8409
4 Democrat     7180

Exercises

  1. Create a factor version of the following data with the levels in a sensible order.
ratings <- c("High", "Medium", "Low")

More exercises will be on HW4.

Solutions

Logical Exercises

Solution
# 1
diamonds %>% 
    filter(price < 400 | price > 10000)
# A tibble: 30 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 20 more rows
# 2
diamonds %>% 
    filter(price >= 500, price <= 600)
# A tibble: 90 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.35 Ideal     I     VS1      60.9  57     552  4.54  4.59  2.78
 2  0.3  Premium   D     SI1      62.6  59     552  4.23  4.27  2.66
 3  0.3  Ideal     D     SI1      62.5  57     552  4.29  4.32  2.69
 4  0.3  Ideal     D     SI1      62.1  56     552  4.3   4.33  2.68
 5  0.42 Premium   I     SI2      61.5  59     552  4.78  4.84  2.96
 6  0.28 Ideal     G     VVS2     61.4  56     553  4.19  4.22  2.58
 7  0.32 Ideal     I     VVS1     62    55.3   553  4.39  4.42  2.73
 8  0.31 Very Good G     SI1      63.3  57     553  4.33  4.3   2.73
 9  0.31 Premium   G     SI1      61.8  58     553  4.35  4.32  2.68
10  0.24 Premium   E     VVS1     60.7  58     553  4.01  4.03  2.44
# ℹ 80 more rows
# 3
## Wrong way with ==
diamonds %>% 
    mutate(is_fpi = cut==c("Fair", "Premium", "Ideal")) %>% 
    summarize(num_fpi = sum(is_fpi), frac_fpi = mean(is_fpi))
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `is_fpi = cut == c("Fair", "Premium", "Ideal")`.
Caused by warning in `==.default`:
! longer object length is not a multiple of shorter object length
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
# A tibble: 1 × 2
  num_fpi frac_fpi
    <int>    <dbl>
1     226    0.226
## Right way with %in%
diamonds %>% 
    mutate(is_fpi = cut %in% c("Fair", "Premium", "Ideal")) %>% 
    summarize(num_fpi = sum(is_fpi), frac_fpi = mean(is_fpi))
# A tibble: 1 × 2
  num_fpi frac_fpi
    <int>    <dbl>
1     685    0.685
# 4
diamonds %>% 
    filter(cut == "Fair") %>% 
    summarize(any_high = any(price > 3000))
# A tibble: 1 × 1
  any_high
  <lgl>   
1 FALSE   
diamonds %>% 
    filter(cut == "Ideal") %>% 
    summarize(all_high = all(price > 2000))
# A tibble: 1 × 1
  all_high
  <lgl>   
1 FALSE   
# 5
diamonds %>% 
    mutate(
        price_cat1 = if_else(price < 500, "low", "high"),
        price_cat2 = case_when(
            price < 500 ~ "low",
            price >= 500 & price <= 1000 ~ "medium",
            price > 1000 ~ "high"
        )
    )
# A tibble: 1,000 × 12
   carat cut       color clarity depth table price     x     y     z price_cat1
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>     
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 low       
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 low       
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 low       
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 low       
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 low       
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 low       
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 low       
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 low       
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 low       
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 low       
# ℹ 990 more rows
# ℹ 1 more variable: price_cat2 <chr>

Factor Exercises

Solution
ratings_fct <- fct(ratings, levels = c("Low", "Medium", "High"))
ratings_fct
[1] High   Medium Low   
Levels: Low Medium High

Reflection

What was challenging? What was easier? What ideas do you have for keeping track of the many functions relevant to data wrangling?





After Class

  • Take a look at the Schedule page to see how to prepare for the next class.
  • Finish Homework 3.
  • Continue narrowing your project work; Milestone 1 is due with HW4.