Topic 14 Regular Expressions

Learning Goals

Develop comfort in working with strings of text data
Use regular expressions to search and replace, detect patterns, locate patterns, extract patterns, and separate text with the stringr package.

You can download a template .Rmd of this activity here.

Regular Expressions and Character Strings

Regular expressions describe patterns of characters. They allow us to:¹⁶

Search for particular items within a large body of text. For example, you may wish to identify and extract all email addresses.
Replace particular items. For example, you may wish to clean up some poorly formatted HTML by replacing all uppercase tags with lowercase equivalents.
Validate input. For example, you may want to check that a password meets certain criteria, such as containing a mix of uppercase and lowercase letters, digits, and punctuation.
Coordinate actions. For example, you may wish to process certain files in a directory, but only if they meet particular conditions.
Reformat text. For example, you may want to split strings into different parts, each to form new variables.
and more…
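As a minimal sketch of the first and third uses in the list above, the following uses stringr functions that are introduced later in this topic. The sample text, the password, and both patterns are made up for illustration; the email pattern in particular is deliberately simplified and would not handle every valid address.

```r
library(stringr)

# Hypothetical sample text (not from the course data)
msg <- "Contact ana@example.org or the help desk at help.desk@example.com."

# Simplified email pattern: name characters, "@", domain characters, a dot, letters
email_pat <- "[[:alnum:]._]+@[[:alnum:].]+\\.[[:alpha:]]+"
emails <- str_extract_all(msg, email_pat)[[1]]
emails

# Hypothetical password check: one test per required character type
pw <- "Tr1cky!pass"
pw_ok <- str_detect(pw, "[[:upper:]]") &
  str_detect(pw, "[[:lower:]]") &
  str_detect(pw, "[[:digit:]]") &
  str_detect(pw, "[[:punct:]]")
pw_ok
```

Splitting the password check into one `str_detect` call per criterion keeps each pattern short and makes it easy to report which criterion failed.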

Start by doing this interactive tutorial. Note that neither the tutorial nor regular expressions more generally are specific to R. Some of the syntax in the tutorial is slightly different from what we’ll use in R, but it will still help you get acclimated to the main ideas of regular expressions.

Wrangling with Regular Expressions in `R`

Now that we have some idea how regular expressions work, let’s examine how to use them to achieve various tasks in R. It will be helpful to have your cheat sheet handy. Many of these tasks can either be accomplished with functions from the base (built-in) package in R or from the stringr package, which is part of the Tidyverse. In general, the stringr functions are faster, which will be noticeable when processing a large amount of text data.
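To make the correspondence between base R and stringr concrete, here is a small side-by-side sketch. The string `s` is an illustrative value, not part of the activity; `grepl` and `gsub` are the base R counterparts of `str_detect` and `str_replace_all`.

```r
library(stringr)

s <- "The quick brown fox" # illustrative string

# Detect a pattern: base R takes the pattern first, stringr takes the string first
base_found <- grepl("qu", s)
stringr_found <- str_detect(s, "qu")

# Replace every match of a pattern
base_replaced <- gsub("o", "0", s)
stringr_replaced <- str_replace_all(s, "o", "0")
stringr_replaced
```

Because stringr functions take the string as their first argument, they fit naturally into pipelines built with `%>%`.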

library(tidyverse) # loads stringr, tidyr, and tibble, which are used below

example <- "The quick brown fox jumps over the lazy dog."
example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"

Search and replace patterns with `str_replace` or `str_replace_all` (`stringr`)

To search for a pattern and replace it, we can use the functions str_replace and str_replace_all in the stringr package. Note that str_replace replaces only the first match, while str_replace_all replaces every match. Here are some examples:

str_replace(example, "quick", "really quick")

## [1] "The really quick brown fox jumps over the lazy dog."

str_replace_all(example, "(fox|dog)", "****")

## [1] "The quick brown **** jumps over the lazy ****."

str_replace_all(example, "(fox|dog).", "****") # "." for any character

## [1] "The quick brown ****jumps over the lazy ****"

str_replace_all(example, "(fox|dog)\\.$", "****") # at end of sentence only, "\\." only for a period

## [1] "The quick brown fox jumps over the lazy ****"

str_replace_all(example, "the", "a") # case-sensitive, so only the lowercase "the" matches

## [1] "The quick brown fox jumps over a lazy dog."

str_replace_all(example, "[Tt]he", "a") # will match either t or T; could also make "a" conditional on capitalization of t

## [1] "a quick brown fox jumps over a lazy dog."

str_replace(example, "[Tt]he", "a") # first match only

## [1] "a quick brown fox jumps over the lazy dog."
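The comment above mentions making the replacement depend on the capitalization of the match. One possible way to do that, sketched here rather than prescribed, is to pass `str_replace_all` a named vector that maps each pattern to its own replacement. The second call shows a backreference, where `\\1` stands for whatever the first parenthesized group matched.

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# A named vector pairs each pattern with its own replacement,
# so "The" and "the" can be replaced differently
by_case <- str_replace_all(example, c("The" = "A", "the" = "a"))
by_case

# "\\1" inserts the text matched by the first group (quick or lazy)
emphasized <- str_replace_all(example, "(quick|lazy)", "very \\1")
emphasized
```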

Detect patterns with `str_detect` (`stringr`)

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"
examples <- c(example, example2, example3)

pat <- "[^aeiouAEIOU ]{3}" # Regular expression for three straight consonants. Note that I've excluded spaces as well

str_detect(examples, pat) # TRUE/FALSE if it detects pattern

## [1]  TRUE  TRUE FALSE

str_subset(examples, pat) # Pulls out the strings that contain the pattern

## [1] "The quick brown fox jumps over the lazy dog."                                                                                                        
## [2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
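In a data-wrangling setting, `str_detect` is often used inside `filter` to keep or drop rows. The following is a small sketch with a made-up tibble; the column names `id` and `text` and the third string are assumptions for illustration only.

```r
library(stringr)
library(dplyr)

# Hypothetical table of strings
texts <- tibble(
  id = 1:3,
  text = c(
    "The quick brown fox jumps over the lazy dog.",
    "This is a test",
    "strength"
  )
)

pat <- "[^aeiouAEIOU ]{3}" # three straight non-vowel, non-space characters

# Keep rows whose text contains the pattern
with_run <- filter(texts, str_detect(text, pat))

# Keep rows whose text does not contain the pattern
without_run <- filter(texts, !str_detect(text, pat))

with_run$id
without_run$id
```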

Locate patterns with `str_locate`

str_locate(example, pat) # starting position and ending position of first match

##      start end
## [1,]    23  25

Let’s check the answer:

str_sub(example, 23, 25)

## [1] "mps"
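`str_locate` reports only the first match. As a sketch of how to find every match, `str_locate_all` returns a list with one matrix of start and end positions per input string. The pattern `"And"` is chosen here only because it occurs several times in the poem.

```r
library(stringr)

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"

# One matrix per input string; [[1]] picks the matrix for our single string
locs <- str_locate_all(example2, "And")[[1]]
locs

# str_sub is vectorized over positions, so this returns each located match
found <- str_sub(example2, locs[, "start"], locs[, "end"])
found
```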

Extract patterns with `str_extract` and `str_extract_all`

pat2 <- "[^aeiouAEIOU ][aeiouAEIOU]{2}[^aeiouAEIOU ]{1}" # consonant followed by two vowels followed by a consonant
str_extract(example2, pat2) # extract first match

## [1] "road"

str_extract_all(example2, pat2, simplify = TRUE) # extract all matches

##      [,1]   [,2]   [,3]   [,4]   [,5]   [,6]  
## [1,] "road" "wood" "coul" "tood" "look" "coul"
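The matches above include word fragments such as "coul" and "tood" because the pattern describes only four characters. One possible way to extract the whole words instead is to allow extra letters on either side and anchor the match at word boundaries with `\\b`. This is a sketch built on the same `pat2`; the name `pat2_word` is introduced here for illustration.

```r
library(stringr)

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"

# Same core as pat2, surrounded by optional letters and word boundaries
pat2_word <- "\\b[[:alpha:]]*[^aeiouAEIOU ][aeiouAEIOU]{2}[^aeiouAEIOU ][[:alpha:]]*\\b"

# Without simplify = TRUE the result is a list; [[1]] is the character vector
whole_words <- str_extract_all(example2, pat2_word)[[1]]
whole_words
```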

Count the number of characters with `str_length`

str_length(example2)

## [1] 148

Convert a string to all lower case letters with `str_to_lower`

str_to_lower(example2)

## [1] "two roads diverged in a yellow wood, / and sorry i could not travel both / and be one traveler, long i stood / and looked down one as far as i could"

Split strings with `separate`

df <- tibble(ex = example2)
df <- separate(df, ex, c("line1", "line2", "line3", "line4"), sep = " / ")
df$line1

## [1] "Two roads diverged in a yellow wood,"

df$line2

## [1] "And sorry I could not travel both"

df$line3

## [1] "And be one traveler, long I stood"

df$line4

## [1] "And looked down one as far as I could"

Note: The function separate is in the tidyr package.
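When the text is a plain character vector rather than a column in a data table, `str_split` from stringr does a similar job. The sketch below splits the poem at the same separator and then shows a regular expression separator on a made-up string with inconsistent spacing.

```r
library(stringr)

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"

# str_split returns a list with one character vector per input string
poem_lines <- str_split(example2, " / ")[[1]]
length(poem_lines)
poem_lines[2]

# A regular expression separator can absorb variable whitespace around the slash
messy <- "a/b / c  /  d" # hypothetical input
parts <- str_split(messy, "\\s*/\\s*")[[1]]
parts
```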

Practice: Fall 2021 Enrollment Exploration

The tibble courses contains the Fall 2021 enrollment information from the Macalester Registrar’s website, which we can obtain with web scraping tools.

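library(tidyverse) # tibble, write_csv (readr), stringr, purrr, and the pipe
library(rvest)     # read_html, html_nodes, html_text

fall2021 <- read_html("https://www.macalester.edu/registrar/schedules/2021fall/class-schedule")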

# Retrieve and inspect course numbers
course_nums <-
  fall2021 %>%
  html_nodes(".class-schedule-course-number") %>%
  html_text()

# Retrieve and inspect course names
course_names <-
  fall2021 %>%
  html_nodes(".class-schedule-course-title") %>%
  html_text()

course_nums_clean <- stringr::str_sub(course_nums, end = nchar(course_nums) - 6)

crn <- stringr::str_sub(course_nums, start = nchar(course_nums) - 4)

course_instructors <-
  fall2021 %>%
  html_nodes(".class-schedule-label:nth-child(6)") %>%
  html_text()
course_instructors_short <- stringr::str_sub(trimws(course_instructors), start = 13)

course_days <-
  fall2021 %>%
  html_nodes(".class-schedule-label:nth-child(3)") %>%
  html_text()
course_days_short <- stringr::str_sub(trimws(course_days), start = 7)

course_times <-
  fall2021 %>%
  html_nodes(".class-schedule-label:nth-child(4)") %>%
  html_text()
course_times_short <- stringr::str_sub(trimws(course_times), start = 7)

course_rooms <-
  fall2021 %>%
  html_nodes(".class-schedule-label:nth-child(5)") %>%
  html_text()
course_rooms_short <- stringr::str_sub(trimws(course_rooms), start = 7)

course_avail <-
  fall2021 %>%
  html_nodes(".class-schedule-label:nth-child(7)") %>%
  html_text()
course_avail_short <- stringr::str_sub(trimws(course_avail), start = 14)


SITES <- paste0("https://webapps.macalester.edu/registrardata/classdata/Fall2021/", crn) %>%
  purrr::map(~ read_html(.x))

course_desc <- SITES %>%
  purrr::map_chr(~ html_nodes(.x, "p:nth-child(1)") %>%
    html_text() %>%
    trimws())

gen_ed <- SITES %>%
  purrr::map_chr(~ html_nodes(.x, "p:nth-child(2)") %>%
    html_text() %>%
    trimws() %>%
    stringr::str_sub(start = 32) %>%
    trimws())


courses <-
  tibble(
    number = course_nums_clean,
    crn = crn,
    name = course_names,
    days = course_days_short,
    time = course_times_short,
    room = course_rooms_short,
    instructor = course_instructors_short,
    avail_max = course_avail_short,
    desc = course_desc,
    gen_ed = gen_ed
  )

write_csv(courses, file = 'Mac2021Courses.csv')
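The scraping code above trims fixed-width pieces off each string with `str_sub`. As a small sketch of how those calls behave, assume a raw course number string looks like the hypothetical value below (the actual scraped text may be spaced differently). Negative positions, which count back from the end of the string, give the same result without calling `nchar`.

```r
library(stringr)

# Hypothetical raw string; the real scraped text may differ
raw <- "AMST 194-01 10068"

# Positions computed from the string length, as in the code above
course_a <- str_sub(raw, end = nchar(raw) - 6)   # drops the last 6 characters
crn_a <- str_sub(raw, start = nchar(raw) - 4)    # keeps the last 5 characters

# Equivalent negative positions
course_b <- str_sub(raw, end = -7)
crn_b <- str_sub(raw, start = -5)

course_b
crn_b
```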

Table 14.1: First six entries in the Fall 2021 Macalester course info data.
number	crn	name	days	time	room	instructor	avail_max
AMST 194-01	10068	The Obama Presidency	M W F	10:50 am-11:50 am	THEATR 002	Duchess Harris	0 / 16
AMST 194-02	10069	Introduction to Asian American Studies	M W F	01:10 pm-02:10 pm	HUM 212	Jake Nagasawa	Closed 7 / 18
AMST 209-01	10781	Civil Rights Movement	T R	01:20 pm-02:50 pm	MAIN 111	Walter Greason	4 / 26
AMST 219-01	10782	In Motion: African Americans in the US (African Americans in Digital Technologies)	T R	03:00 pm-04:30 pm	MAIN 111	Walter Greason	4 / 25
AMST 225-01	10420	Native History to 1871	T R	09:40 am-11:10 am	THEATR 203	Katrina Phillips	1 / 25
AMST 240-01	10271	Race, Culture, and Ethnicity in Education	T R	01:20 pm-02:50 pm	THEATR 201	Jonathan Hamilton	Closed 0 / 25

Exercise 14.1 (Rearrange data table) Make the following changes to the courses data table:

Split number into three separate columns: dept, number, and section.
Split the avail_max variable into two separate variables: avail and max. It might be helpful to first remove all appearances of “Closed”.
Use avail and max to generate a new variable called enrollment.
Split the time variable into two separate columns: start_time and end_time. Convert all of these times into continuous 24-hour times expressed in decimal hours (e.g., 2:15 pm should become 14.25). Hint: check out the function parse_date_time in the lubridate package.

Exercise 14.2 (WA courses) Make a bar plot showing the number of Fall 2021 sections satisfying the Writing WA requirement, sorted by department code.¹⁷

In the next series of exercises, we are going to build up an analysis to examine the number of student enrollments for each faculty member.

Exercise 14.3 (Filter cases) For this particular analysis, we do not want to consider certain types of sections. Remove all of the following from the data table:

All sections in PE or INTD.
All music ensembles and dance practicum sections (these are all of the MUSI and THDA classes with numbers less than 100).
All lab sections. This one is a bit tricky. You can search for “Lab” or “Laboratory”, but be careful not to eliminate courses with words such as “Labor”. Some of these lab sections have section numbers that end in “-L1”, for example.

Exercise 14.4 (Handle cross-listed courses) Some sections are listed under multiple different departments, and you will find the same instructor, time, enrollment data, etc. For this activity, we only want to include each actual section once and it doesn’t really matter which department code we associate with this section. Eliminate all duplicates, keeping each actual section just once. Hint: look into the R command distinct, and think carefully about how to find duplicates.

Exercise 14.5 (Co-taught courses) Make a table with all Fall 2021 co-taught courses (i.e., more than one instructor).

Exercise 14.6 (Faculty enrollments) Make a table where each row contains a faculty member, the number of sections they are teaching in Fall 2021, and the total enrollments in those sections. Sort the table from highest total enrollments to lowest.¹⁸

Exercise 14.7 (Evening courses) Create and display a new table with all night courses (i.e., a subset of the table you wrangled by the end of Exercise 14.4). Also make a bar plot showing the number of these courses by day of the week.

¹⁶ Source: regular expression tutorial.
¹⁷ For this exercise, you can count cross-listed courses towards both departments’ WA counts.
¹⁸ For the purposes of this exercise, we are just going to leave co-taught courses as is so that you will have an extra row for each pair or triplet of instructors. Alternatives would be to allocate the enrollment number to each of the faculty members or to split it up between the members. The first option would usually be the most appropriate, although it might depend on the course.
