Topic 14 Regular Expressions

Learning Goals

Develop comfort in working with strings of text data
Use regular expressions to search and replace, detect patterns, locate patterns, extract patterns, and separate text with the stringr package.

Create a new Rmd file (save it as 14-Regex.Rmd). Put this file in a folder Assignment_09 in your COMP_STAT_112 folder.

Make sure to add alt text using fig.alt!

Regular Expressions and Character Strings

Regular expressions describe character patterns. They allow us to:¹⁶

Search for particular items within a large body of text. For example, you may wish to identify and extract all email addresses.
Replace particular items. For example, you may wish to clean up some poorly formatted HTML by replacing all uppercase tags with lowercase equivalents.
Validate input. For example, you may want to check that a password meets certain criteria, such as a mix of uppercase and lowercase letters, digits, and punctuation.
Coordinate actions. For example, you may wish to process certain files in a directory, but only if they meet particular conditions.
Reformat text. For example, you may want to split strings into different parts, each to form new variables.
and more…

Start by doing this interactive tutorial. Note that neither the tutorial nor regular expressions more generally are specific to R.

Some of the syntax in the tutorial is slightly different from what we’ll use in R, but it will still help you get acclimated to the main ideas of regular expressions.
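
One difference worth seeing early: in R, a regular expression is written inside a character string, so every backslash in the pattern has to be doubled. The short sketch below illustrates this with a made-up string (`phone` is hypothetical, not part of the course data) and assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical example text (not from the course data)
phone <- "Call 651-555-0199 or 612-555-0100."

# A tutorial would write \d{3}-\d{3}-\d{4}; in an R string each "\" becomes "\\"
numbers <- str_extract_all(phone, "\\d{3}-\\d{3}-\\d{4}")[[1]]
numbers
```

Here `\\d` matches a digit and `{3}` repeats it three times, so `numbers` should hold the two phone numbers in the string.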

Wrangling with Regular Expressions in `R`

Now that we have some idea how regular expressions work, let’s examine how to use them to achieve various tasks in R. It will be helpful to have your cheat sheet handy.

Note: Many of these tasks can either be accomplished with functions from the base (built-in) package in R or from the stringr package, which is part of the Tidyverse. In general, the stringr functions are faster, which will be noticeable when processing a large amount of text data.
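
As a rough illustration of that correspondence, the sketch below pairs two base R functions with their stringr counterparts. The string `x` is only a stand-in for real text data. The main practical difference shown is argument order: base R takes the pattern first, while stringr takes the string first.

```r
library(stringr)

x <- "The quick brown fox jumps over the lazy dog."

# Replace every match: base R puts the pattern first, stringr puts the string first
base_replaced    <- gsub("o", "0", x)
stringr_replaced <- str_replace_all(x, "o", "0")

# Detect a match: both return TRUE or FALSE
base_found    <- grepl("fox", x)
stringr_found <- str_detect(x, "fox")

identical(base_replaced, stringr_replaced)
identical(base_found, stringr_found)
```

Putting the string first is what lets stringr functions slot into a pipe after a data step.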

library(tidyverse) # loads stringr, tidyr, and tibble, which are used below

example <- "The quick brown fox jumps over the lazy dog."
example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"

Search and replace patterns with `str_replace` or `str_replace_all` (`stringr`)

To search for a pattern and replace it, we can use the functions str_replace and str_replace_all in the stringr package. Note that str_replace replaces only the first match, while str_replace_all replaces every match. Here are some examples:

str_replace(example, "quick", "really quick")

## [1] "The really quick brown fox jumps over the lazy dog."

str_replace_all(example, "(fox|dog)", "****") # | reads as OR

## [1] "The quick brown **** jumps over the lazy ****."

str_replace_all(example, "(fox|dog).", "****") # "." for any character

## [1] "The quick brown ****jumps over the lazy ****"

str_replace_all(example, "(fox|dog)\\.$", "****") # "$" anchors the match to the end of the string; "\\." matches only a literal period

## [1] "The quick brown fox jumps over the lazy ****"

str_replace_all(example, "the", "a") # case-sensitive, so only the lowercase "the" matches

## [1] "The quick brown fox jumps over a lazy dog."

str_replace_all(example, "[Tt]he", "a") # [Tt] matches either "t" or "T"; the replacement could also be made conditional on the capitalization

## [1] "a quick brown fox jumps over a lazy dog."

str_replace(example, "[Tt]he", "a") # first match only

## [1] "a quick brown fox jumps over the lazy dog."
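
One of the comments above notes that the replacement could depend on the capitalization of the match. A minimal sketch of one way to do that is to pass str_replace_all a named vector of pattern = replacement pairs. The same block also shows backreferences (`\\1`, `\\2`), which reuse captured groups in the replacement. The variable names `cased` and `swapped` are just for illustration.

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# A named vector applies several pattern = replacement pairs in turn,
# so "The" and "the" can get differently capitalized replacements
cased <- str_replace_all(example, c("The" = "A", "the" = "a"))
cased

# Parentheses capture groups; "\\1" and "\\2" refer back to them in the replacement
swapped <- str_replace(example, "(quick) (brown)", "\\2 \\1")
swapped
```

The first call should keep the sentence-initial capital, and the second should swap the two adjectives.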

Detect patterns with `str_detect` (`stringr`)

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"
examples <- c(example, example2, example3)

pat <- "[^aeiouAEIOU ]{3}" # Three consecutive characters that are neither vowels nor spaces: a rough pattern for three straight consonants (punctuation and digits would also count)

str_detect(examples, pat) # TRUE/FALSE if it detects pattern

## [1]  TRUE  TRUE FALSE

str_subset(examples, pat) # Pulls out the strings in which the pattern is detected

## [1] "The quick brown fox jumps over the lazy dog."                                                                                                        
## [2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
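
Because str_detect returns a logical vector, it combines naturally with filter() from dplyr. The sketch below uses a small made-up table (`words` is hypothetical) to show two refinements that often matter in practice: `\\b` for word boundaries and `regex(ignore_case = TRUE)` for case-insensitive matching.

```r
library(dplyr)
library(stringr)

# Hypothetical data, made up for illustration
words <- tibble(word = c("cat", "category", "Cat nap", "concatenate"))

# "\\b" marks a word boundary, so "cat" must appear as a whole word;
# regex(ignore_case = TRUE) lets "Cat" match as well
matches <- words %>%
  filter(str_detect(word, regex("\\bcat\\b", ignore_case = TRUE)))
matches$word
```

Without the word boundaries, "category" and "concatenate" would also be kept.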

Locate patterns with `str_locate`

str_locate(example, pat) # starting position and ending position of first match

##      start end
## [1,]    23  25

Let’s check the answer:

str_sub(example, 23, 25)

## [1] "mps"
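
str_locate reports only the first match. To find every match, stringr also provides str_locate_all, which returns a list containing one start/end matrix per input string. A small sketch, using the letter "o" as a simple pattern:

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# str_locate_all() returns a list with one start/end matrix per input string
o_positions <- str_locate_all(example, "o")[[1]]
o_positions

# str_sub() is vectorized over start and end, so all matches can be checked at once
found <- str_sub(example, o_positions[, "start"], o_positions[, "end"])
found
```

Each row of `o_positions` gives the start and end of one match, and `found` should contain one "o" per row.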

Extract patterns with `str_extract` and `str_extract_all`

pat2 <- "[^aeiouAEIOU ][aeiouAEIOU]{2}[^aeiouAEIOU ]{1}" # consonant followed by two vowels followed by a consonant
str_extract(example2, pat2) # extract first match

## [1] "road"

str_extract_all(example2, pat2, simplify = TRUE) # extract all matches

##      [,1]   [,2]   [,3]   [,4]   [,5]   [,6]  
## [1,] "road" "wood" "coul" "tood" "look" "coul"
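
When only part of a match is wanted, str_match can be used instead of str_extract: it returns the full match plus each parenthesized group in separate columns. The sketch below pulls out the word that comes right before "fox". The names `m` and `adjective` are just for illustration.

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# Column 1 holds the full match; column 2 holds the first parenthesized group
m <- str_match(example, "(\\w+) fox")
m

adjective <- m[1, 2]
adjective
```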

Count the number of characters with `str_length`

str_length(example2)

## [1] 148

Convert a string to all lower case letters with `str_to_lower`

str_to_lower(example2)

## [1] "two roads diverged in a yellow wood, / and sorry i could not travel both / and be one traveler, long i stood / and looked down one as far as i could"

Split strings with `separate` (`tidyr`)

df <- tibble(ex = example2)
df <- separate(df, ex, c("line1", "line2", "line3", "line4"), sep = " / ")
df$line1

## [1] "Two roads diverged in a yellow wood,"

df$line2

## [1] "And sorry I could not travel both"

df$line3

## [1] "And be one traveler, long I stood"

df$line4

## [1] "And looked down one as far as I could"

Note: The function separate() is in the tidyr package.
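
For a plain character vector rather than a data table, a comparable split can be sketched with str_split from stringr. The pattern is a regular expression, so it can also be made tolerant of varying whitespace around the slash. The name `poem_lines` is just for illustration.

```r
library(stringr)

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"

# str_split() returns a list with one character vector per input string
poem_lines <- str_split(example2, " / ")[[1]]
poem_lines

# "\\s*" matches zero or more whitespace characters on either side of the slash
poem_lines_flexible <- str_split(example2, "\\s*/\\s*")[[1]]
identical(poem_lines, poem_lines_flexible)
```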

Practice: Fall 2022 Enrollment Exploration

The tibble courses contains the Fall 2022 enrollment information from the Macalester Registrar’s website, which can be obtained with web scraping tools. See the code below if you are interested in trying out web scraping!

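library(tidyverse) # tibble(), purrr::possibly(), readr::write_csv(), %>%
library(rvest)     # read_html(), html_nodes(), html_text()

fall2022 <- read_html("https://www.macalester.edu/registrar/schedules/2022fall/class-schedule")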

# Retrieve and inspect course numbers
course_nums <-
  fall2022 %>%
  html_nodes(".class-schedule-course-number") %>%
  html_text()

# Retrieve and inspect course names
course_names <-
  fall2022 %>%
  html_nodes(".class-schedule-course-title") %>%
  html_text()

course_nums_clean <- stringr::str_sub(course_nums, end = nchar(course_nums) - 6)

crn <- stringr::str_sub(course_nums, start = nchar(course_nums) - 4)

course_instructors <-
  fall2022 %>%
  html_nodes(".class-schedule-label:nth-child(6)") %>%
  html_text()
course_instructors_short <- stringr::str_sub(trimws(course_instructors), start = 13)

course_days <-
  fall2022 %>%
  html_nodes(".class-schedule-label:nth-child(3)") %>%
  html_text()
course_days_short <- trimws(stringr::str_sub(trimws(course_days), start = 7))

course_times <-
  fall2022 %>%
  html_nodes(".class-schedule-label:nth-child(4)") %>%
  html_text()
course_times_short <- stringr::str_sub(trimws(course_times), start = 7)

course_rooms <-
  fall2022 %>%
  html_nodes(".class-schedule-label:nth-child(5)") %>%
  html_text()
course_rooms_short <- stringr::str_sub(trimws(course_rooms), start = 7)

course_avail <-
  fall2022 %>%
  html_nodes(".class-schedule-label:nth-child(7)") %>%
  html_text()
course_avail_short <- stringr::str_sub(trimws(course_avail), start = 14)

# Wrap read_html() so that a page that fails to load returns NA instead of stopping the loop
safe_html <- possibly(.f = read_html, otherwise = NA)
SITES <- paste0("https://webapps.macalester.edu/registrardata/classdata/Fall2022/", crn) %>%
  purrr::map(~ safe_html(.x))

safe_read <- possibly(.f = function(x) html_nodes(x, "p:nth-child(1)") %>%
    html_text() %>%
    trimws(), otherwise = NA)
course_desc <- SITES %>%
  purrr::map_chr(~ safe_read(.x))

safe_read2 <- possibly(.f = function(x) html_nodes(x, "p:nth-child(2)") %>%
    html_text() %>%
    trimws() %>% stringr::str_sub(start = 32) %>%
    trimws(), otherwise = NA)
gen_ed <- SITES %>%
  purrr::map_chr(~ safe_read2(.x))


courses <-
  tibble(
    number = course_nums_clean,
    crn = crn,
    name = course_names,
    days = course_days_short,
    time = course_times_short,
    room = course_rooms_short,
    instructor = course_instructors_short,
    avail_max = course_avail_short,
    desc = course_desc,
    gen_ed = gen_ed
  )

write_csv(courses, file = 'Mac2022Courses.csv')
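
The scraping code above relies on possibly() from purrr so that one failed page does not stop the whole loop. As a minimal sketch of that idea with no network access, the toy example below wraps log() instead of read_html(). The names `safe_log`, `inputs`, and `results` are hypothetical.

```r
library(purrr)

# possibly() wraps a function so that an error returns `otherwise` instead of stopping
safe_log <- possibly(log, otherwise = NA_real_)

# The second input makes log() throw an error
inputs <- list(1, "not a number", 100)

results <- map_dbl(inputs, safe_log)
results
```

The failing element should come back as NA while the others are computed normally, which is the same behavior the scraper depends on when a course page is missing.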

Table 14.1: First six entries in the Fall 2022 Macalester course info data.
number	crn	name	days	time	room	instructor	avail_max
AMST 194-01	10002	Introduction to Asian American Studies	M W F	01:10 pm-02:10 pm	HUM 212	Jake Nagasawa	-3 / 18
AMST 194-F1	10001	What’s After White Empire (and is it already here)?	T R	03:00 pm-04:30 pm	HUM 212	Karin Aguilar-San Juan	0 / 16
AMST 203-01	10545	Politics and Inequality: The American Welfare State	M W F	09:40 am-10:40 am	CARN 305	Lesley Lavery	Closed 0 / 25
AMST 225-01	10352	Native History to 1871	M W F	10:50 am-11:50 am	HUM 111	Jacob Jurss	Closed 3 / 20
AMST 231-01	10820	Sovereignty Matters: Critical Indigeneity, Gender and Governance	M	07:00 pm-10:00 pm	ARTCOM 102	Kirisitina Sailiata	Closed 1 / 16
AMST 237-01	10003	Environmental Justice	T R	01:20 pm-02:50 pm	HUM 212	Kirisitina Sailiata	-1 / 20

You can read in the data created by the web scraping code above from the course site:

courses <- read_csv('https://bcheggeseth.github.io/112_spring_2023/data/Mac2022Courses.csv')

Exercise 14.1 (Rearrange data table) Make the following changes to the courses data table (save the updated table as courses):

Split number into three separate columns: dept, number, and section.
Split the avail_max variable into two separate variables: avail and max. It might be helpful to first remove all appearances of “Closed”.
Use avail and max to generate a new variable called enrollment.
Split the time variable into two separate columns: start_time and end_time. Convert all of these times into continuous 24 hour times (e.g., 2:15 pm should become 14.25). Hint: check out the documentation for the function parse_date_time().

Exercise 14.2 (WA courses) Make a bar plot showing the number of Fall 2022 sections satisfying the Writing WA requirement, sorted by department code.¹⁷ Note: some courses satisfy multiple requirements.

In the next series of exercises, we are going to build up an analysis to examine the number of student enrollments for each faculty member.

Exercise 14.3 (Filter cases) For this particular analysis, we do not want to consider certain types of sections. Remove all of the following from the data table (save subset dataset as courses2):

All sections in PE or INTD.
All music ensembles and dance practicum sections (these are all of the MUSI and THDA classes with numbers less than 100).
All lab sections. This one is a bit tricky. You can search for “Lab” or “Laboratory”, but be careful not to eliminate courses with words such as “Labor”. Some of these have section numbers that end in “-L1”, for example.

Exercise 14.4 (Handle cross-listed courses) Some sections are listed under multiple different departments (cross-listed courses), and you will find the same instructor, day, time, etc. Note: they may have different course numbers.

For this activity, we only want to include each actual section once and it doesn’t really matter which department code we associate with this section. Eliminate all duplicated cross-listed courses from courses2, keeping each actual section just once (save updated table as courses3). Hint: look into the R function distinct, and think carefully about how to find duplicates.

Exercise 14.5 (Co-taught courses) Using courses3 (i.e. after removing non 4-credit courses and cross-listed duplicates), make a table with all Fall 2022 co-taught courses (i.e., more than one instructor). You don’t need to save the new table. Hint: There was a class in Fall 2022 called Land/Water that was co-taught. Look for patterns in the data that could let you distinguish which courses were co-taught.

Exercise 14.6 (Faculty enrollments) Using courses3 (i.e. after removing non 4-credit courses and cross-listed duplicates), make a table where each row contains a faculty member, the number of sections they are teaching in Fall 2022, and the total enrollments in those sections. Sort the table from highest total enrollments to lowest. You should find two current Comp/Stat 112 instructors in the top 3.¹⁸

Exercise 14.7 (Evening courses) Create and display a new table with all night courses (i.e., a subset of courses3). Also make a bar plot showing the number of these courses by day of the week.

¹⁶ Source: regular expression tutorial.
¹⁷ For this exercise, you can count cross-listed courses towards both departments’ WA counts.
¹⁸ For the purposes of this exercise, we are just going to leave co-taught courses as is so that you will have an extra row for each pair or triplet of instructors. Alternatives would be to allocate the enrollment number to each of the faculty members or to split it up between the members. The first option would usually be the most appropriate, although it might depend on the course.
