Topic 14 Regular Expressions
Learning Goals
- Develop comfort in working with strings of text data
- Use regular expressions to search and replace, detect patterns, locate patterns, extract patterns, and separate text with the stringr package.
Create a new Rmd file (save it as 14-Regex.Rmd). Put this file in a folder named Assignment_09 in your COMP_STAT_112 folder.
- Make sure to add alt text using fig.alt!
Regular Expressions and Character Strings
Regular expressions describe character patterns. They allow us to:[16]
- Search for particular items within a large body of text. For example, you may wish to identify and extract all email addresses.
- Replace particular items. For example, you may wish to clean up some poorly formatted HTML by replacing all uppercase tags with lowercase equivalents.
- Validate input. For example, you may want to check that a password meets certain criteria, such as a mix of uppercase and lowercase letters, digits, and punctuation.
- Coordinate actions. For example, you may wish to process certain files in a directory, but only if they meet particular conditions.
- Reformat text. For example, you may want to split strings into different parts, each to form new variables.
- and more…
Start by doing this interactive tutorial. Note that neither the tutorial nor regular expressions more generally are specific to R.
- Some of the syntax in the tutorial is slightly different from what we’ll use in R, but it will still help you get acclimated to the main ideas of regular expressions.
Wrangling with Regular Expressions in R
Now that we have some idea how regular expressions work, let’s examine how to use them to achieve various tasks in R. It will be helpful to have your cheat sheet handy.
Note: Many of these tasks can be accomplished either with functions from the base (built-in) package in R or with functions from the stringr package, which is part of the Tidyverse. In general, the stringr functions are faster, which will be noticeable when processing a large amount of text data.
library(tidyverse) # loads stringr, tidyr, tibble, readr, and purrr, all used below
example <- "The quick brown fox jumps over the lazy dog."
example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"
Search and replace patterns with str_replace or str_replace_all (stringr)
To search for a pattern and replace it, we can use the functions str_replace and str_replace_all in the stringr package. Note that str_replace replaces only the first match, while str_replace_all replaces every match. Here are some examples:
str_replace(example, "quick", "really quick")
## [1] "The really quick brown fox jumps over the lazy dog."
str_replace_all(example, "(fox|dog)", "****") # | reads as OR
## [1] "The quick brown **** jumps over the lazy ****."
str_replace_all(example, "(fox|dog).", "****") # "." for any character
## [1] "The quick brown ****jumps over the lazy ****"
str_replace_all(example, "(fox|dog)\\.$", "****") # "$" anchors the match to the end of the string; "\\." matches only a literal period
## [1] "The quick brown fox jumps over the lazy ****"
str_replace_all(example, "the", "a") # case-sensitive, so only the lowercase "the" matches
## [1] "The quick brown fox jumps over a lazy dog."
str_replace_all(example, "[Tt]he", "a") # matches either t or T; the replacement could also be made conditional on the capitalization of t
## [1] "a quick brown fox jumps over a lazy dog."
str_replace(example, "[Tt]he", "a") # first match only
## [1] "a quick brown fox jumps over the lazy dog."
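The comment above mentions making the replacement depend on the capitalization of the match. Here is a minimal sketch of two ways to handle case, assuming the stringr package is installed: a named vector pairs each pattern with its own replacement, and regex() with ignore_case = TRUE matches either capitalization with one pattern. The names case_aware and case_blind are illustrative only.

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# Named vector: each name is a pattern and each value is its replacement,
# so "The" and "the" receive differently capitalized replacements
case_aware <- str_replace_all(example, c("The" = "A", "the" = "a"))
case_aware
## [1] "A quick brown fox jumps over a lazy dog."

# regex() with ignore_case = TRUE matches "the" in any capitalization
case_blind <- str_replace_all(example, regex("the", ignore_case = TRUE), "a")
case_blind
## [1] "a quick brown fox jumps over a lazy dog."
```

The named-vector form is handy when each pattern needs a different replacement; the ignore_case form is simpler when one replacement fits every match.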
Detect patterns with str_detect (stringr)
example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"
examples <- c(example, example2, example3)
pat <- "[^aeiouAEIOU ]{3}" # Regular expression for three straight consonants (more precisely, any three characters in a row that are neither vowels nor spaces)
str_detect(examples, pat) # TRUE/FALSE if it detects pattern
## [1] TRUE TRUE FALSE
str_subset(examples, pat) # Returns the strings that contain the pattern
## [1] "The quick brown fox jumps over the lazy dog."
## [2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
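Because str_detect returns TRUE/FALSE for each string, it fits naturally inside dplyr's filter(). The sketch below uses a small made-up tibble (lines and its contents are hypothetical, not part of the course data) to keep only the rows whose text matches the pattern.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data for illustration
lines <- tibble(
  id = 1:3,
  text = c("The fox jumps", "This is a test", "And sorry I could not")
)

# Same pattern as above: three non-vowel, non-space characters in a row
pat <- "[^aeiouAEIOU ]{3}"

# Keep only the rows where the pattern is detected ("mps" and "rry")
hits <- lines %>% filter(str_detect(text, pat))
hits$id
## [1] 1 3
```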
Locate patterns with str_locate
str_locate(example, pat) # starting position and ending position of first match
## start end
## [1,] 23 25
Let’s check the answer:
str_sub(example, 23, 25)
## [1] "mps"
Extract patterns with str_extract and str_extract_all
pat2 <- "[^aeiouAEIOU ][aeiouAEIOU]{2}[^aeiouAEIOU ]{1}" # consonant followed by two vowels followed by a consonant
str_extract(example2, pat2) # extract first match
## [1] "road"
str_extract_all(example2, pat2, simplify = TRUE) # extract all matches
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "road" "wood" "coul" "tood" "look" "coul"
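str_extract returns the whole match. When only pieces of the match are wanted, parentheses in the pattern create capture groups, and stringr's str_match returns them as columns of a character matrix. A minimal sketch with made-up date strings (the dates vector is hypothetical):

```r
library(stringr)

# Hypothetical date strings for illustration
dates <- c("2022-09-05", "2022-12-16")

# Three capture groups: year, month, day.
# Column 1 of the result is the full match; columns 2-4 are the groups.
parts <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")
parts[, 2] # first capture group (year)
## [1] "2022" "2022"
parts[, 3] # second capture group (month)
## [1] "09" "12"
```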
Convert a string to all lower case letters with str_to_lower
str_to_lower(example2)
## [1] "two roads diverged in a yellow wood, / and sorry i could not travel both / and be one traveler, long i stood / and looked down one as far as i could"
Split strings with separate (tidyr)
df <- tibble(ex = example2)
df <- separate(df, ex, c("line1", "line2", "line3", "line4"), sep = " / ")
df$line1
## [1] "Two roads diverged in a yellow wood,"
df$line2
## [1] "And sorry I could not travel both"
df$line3
## [1] "And be one traveler, long I stood"
df$line4
## [1] "And looked down one as far as I could"
Note: The function separate() is in the tidyr package.
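The sep argument of separate() is itself interpreted as a regular expression when it is a character string, so one call can handle inconsistent delimiters. A small sketch under that assumption, with a made-up tibble (df2 and its column names are hypothetical):

```r
library(tidyr)
library(tibble)

# Hypothetical strings with inconsistent delimiters
df2 <- tibble(ex = c("left - right", "up/down"))

# sep as a regex: optional spaces, then "-" or "/", then optional spaces
df2 <- separate(df2, ex, c("first", "second"), sep = "\\s*[-/]\\s*")
df2$first
## [1] "left" "up"
df2$second
## [1] "right" "down"
```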
Practice: Fall 2022 Enrollment Exploration
The tibble courses has the Fall 2022 enrollment information from the Macalester Registrar’s website, which we can obtain with web scraping tools. See the code below if you are interested in trying out web scraping!
library(rvest) # read_html(), html_nodes(), html_text()
fall2022 <- read_html("https://www.macalester.edu/registrar/schedules/2022fall/class-schedule")
# Retrieve and inspect course numbers
course_nums <-
  fall2022 %>%
  html_nodes(".class-schedule-course-number") %>%
  html_text()
# Retrieve and inspect course names
course_names <-
  fall2022 %>%
  html_nodes(".class-schedule-course-title") %>%
  html_text()
# Course number: everything except the last six characters
course_nums_clean <- stringr::str_sub(course_nums, end = nchar(course_nums) - 6)
# CRN: the last five characters
crn <- stringr::str_sub(course_nums, start = nchar(course_nums) - 4)
# For each labelled field below, str_sub() drops the leading label text
course_instructors <-
  fall2022 %>%
  html_nodes(".class-schedule-label:nth-child(6)") %>%
  html_text()
course_instructors_short <- stringr::str_sub(trimws(course_instructors), start = 13)
course_days <-
  fall2022 %>%
  html_nodes(".class-schedule-label:nth-child(3)") %>%
  html_text()
course_days_short <- trimws(stringr::str_sub(trimws(course_days), start = 7))
course_times <-
  fall2022 %>%
  html_nodes(".class-schedule-label:nth-child(4)") %>%
  html_text()
course_times_short <- stringr::str_sub(trimws(course_times), start = 7)
course_rooms <-
  fall2022 %>%
  html_nodes(".class-schedule-label:nth-child(5)") %>%
  html_text()
course_rooms_short <- stringr::str_sub(trimws(course_rooms), start = 7)
course_avail <-
  fall2022 %>%
  html_nodes(".class-schedule-label:nth-child(7)") %>%
  html_text()
course_avail_short <- stringr::str_sub(trimws(course_avail), start = 14)
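# possibly() (from purrr) returns a version of a function that yields `otherwise` instead of an error
safe_html <- possibly(.f = read_html, otherwise = NA)
# Read the detail page for each CRN
SITES <- paste0("https://webapps.macalester.edu/registrardata/classdata/Fall2022/", crn) %>%
  purrr::map(~ safe_html(.x))
# Course description: text of the first paragraph (character NA so map_chr() gets a string type)
safe_read <- possibly(.f = function(x) html_nodes(x, "p:nth-child(1)") %>%
  html_text() %>%
  trimws(), otherwise = NA_character_)
course_desc <- SITES %>%
  purrr::map_chr(~ safe_read(.x))
# General education requirements: second paragraph, with the leading label text dropped
safe_read2 <- possibly(.f = function(x) html_nodes(x, "p:nth-child(2)") %>%
  html_text() %>%
  trimws() %>% stringr::str_sub(start = 32) %>%
  trimws(), otherwise = NA_character_)
gen_ed <- SITES %>%
  purrr::map_chr(~ safe_read2(.x))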
# Assemble the scraped pieces into one table and save it
courses <-
  tibble(
    number = course_nums_clean,
    crn = crn,
    name = course_names,
    days = course_days_short,
    time = course_times_short,
    room = course_rooms_short,
    instructor = course_instructors_short,
    avail_max = course_avail_short,
    desc = course_desc,
    gen_ed = gen_ed
  )
write_csv(courses, file = 'Mac2022Courses.csv')
number | crn | name | days | time | room | instructor | avail_max |
---|---|---|---|---|---|---|---|
AMST 194-01 | 10002 | Introduction to Asian American Studies | M W F | 01:10 pm-02:10 pm | HUM 212 | Jake Nagasawa | -3 / 18 |
AMST 194-F1 | 10001 | What’s After White Empire (and is it already here)? | T R | 03:00 pm-04:30 pm | HUM 212 | Karin Aguilar-San Juan | 0 / 16 |
AMST 203-01 | 10545 | Politics and Inequality: The American Welfare State | M W F | 09:40 am-10:40 am | CARN 305 | Lesley Lavery | Closed 0 / 25 |
AMST 225-01 | 10352 | Native History to 1871 | M W F | 10:50 am-11:50 am | HUM 111 | Jacob Jurss | Closed 3 / 20 |
AMST 231-01 | 10820 | Sovereignty Matters: Critical Indigeneity, Gender and Governance | M | 07:00 pm-10:00 pm | ARTCOM 102 | Kirisitina Sailiata | Closed 1 / 16 |
AMST 237-01 | 10003 | Environmental Justice | T R | 01:20 pm-02:50 pm | HUM 212 | Kirisitina Sailiata | -1 / 20 |
You can read in the data created by the web scraping code above from the course site:
courses <- read_csv('https://bcheggeseth.github.io/112_spring_2023/data/Mac2022Courses.csv')
Exercise 14.1 (Rearrange data table) Make the following changes to the courses data table (save the updated table as courses):
- Split number into three separate columns: dept, number, and section.
- Split the avail_max variable into two separate variables: avail and max. It might be helpful to first remove all appearances of “Closed”.
- Use avail and max to generate a new variable called enrollment.
- Split the time variable into two separate columns: start_time and end_time. Convert all of these times into continuous 24 hour times (e.g., 2:15 pm should become 14.25). Hint: check out the documentation for the function parse_date_time().
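As a rough illustration of the hint, here is one way a single clock time could be turned into a decimal hour with lubridate's parse_date_time(). This is a sketch on one made-up string, assuming an English locale for the am/pm indicator; the names parsed and decimal_time are hypothetical, and applying the idea to the whole time column is left to the exercise.

```r
library(lubridate)

# Orders: "I" = hour on a 12-hour clock, "M" = minute, "p" = AM/PM indicator
parsed <- parse_date_time("02:15 pm", orders = "I:M p")

# Hours plus minutes expressed as a fraction of an hour
decimal_time <- hour(parsed) + minute(parsed) / 60
decimal_time
## [1] 14.25
```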
Exercise 14.2 (WA courses) Make a bar plot showing the number of Fall 2022 sections satisfying the Writing WA requirement, sorted by department code.[17] Note: some courses satisfy multiple requirements.
In the next series of exercises, we are going to build up an analysis to examine the number of student enrollments for each faculty member.
Exercise 14.3 (Filter cases) For this particular analysis, we do not want to consider certain types of sections. Remove all of the following from the data table (save the subset as courses2):
- All sections in PE or INTD.
- All music ensembles and dance practicum sections (these are all of the MUSI and THDA classes with numbers less than 100).
- All lab sections. This one is a bit tricky. You can search for “Lab” or “Laboratory”, but be careful not to eliminate courses with words such as “Labor”. Some of these have section numbers that end in “-L1”, for example.
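One regular expression tool that helps with the “Lab” versus “Labor” problem is the word boundary \\b, which matches the position between a word character and a non-word character. A minimal sketch on made-up course titles (the titles vector is hypothetical, not taken from the data):

```r
library(stringr)

# Hypothetical course titles for illustration
titles <- c("Organic Chemistry Lab", "Labor Economics", "Laboratory Methods")

# "Lab" must be a whole word, optionally extended to "Laboratory";
# "Labor" fails because no word boundary follows "Lab"
is_lab <- str_detect(titles, "\\bLab(oratory)?\\b")
is_lab
## [1]  TRUE FALSE  TRUE
```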
Exercise 14.4 (Handle cross-listed courses) Some sections are listed under multiple different departments (cross-listed courses), and you will find the same instructor, day, time, etc. Note: they may have different course numbers.
For this activity, we only want to include each actual section once and it doesn’t really matter which department code we associate with this section. Eliminate all duplicated cross-listed courses from courses2, keeping each actual section just once (save the updated table as courses3). Hint: look into the R function distinct, and think carefully about how to find duplicates.
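To show how distinct behaves, here is a sketch on a small made-up table (toy and all of its values are hypothetical). It treats rows as duplicates when a chosen set of columns all match; which columns identify a duplicate in the real data is the part of the exercise left for you to decide.

```r
library(dplyr)

# Hypothetical toy sections: the first two rows are one cross-listed section
toy <- tibble(
  dept = c("AMST", "HIST", "MATH"),
  instructor = c("A. Smith", "A. Smith", "B. Jones"),
  days = c("M W F", "M W F", "T R"),
  time = c("09:40 am-10:40 am", "09:40 am-10:40 am", "01:20 pm-02:50 pm")
)

# Rows are duplicates when instructor, days, and time all match;
# .keep_all = TRUE keeps the other columns from the first row of each group
toy_unique <- distinct(toy, instructor, days, time, .keep_all = TRUE)
nrow(toy_unique)
## [1] 2
toy_unique$dept
## [1] "AMST" "MATH"
```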
Exercise 14.5 (Co-taught courses) Using courses3 (i.e. after removing non 4-credit courses and cross-listed duplicates), make a table with all Fall 2022 co-taught courses (i.e., more than one instructor). You don’t need to save the new table. Hint: There was a class in Fall 2022 called Land/Water that was co-taught. Look for patterns in the data that could let you distinguish which courses were co-taught.
Exercise 14.6 (Faculty enrollments) Using courses3 (i.e. after removing non 4-credit courses and cross-listed duplicates), make a table where each row contains a faculty member, the number of sections they are teaching in Fall 2022, and the total enrollments in those sections. Sort the table from highest total enrollments to lowest. You should find two current Comp/Stat 112 instructors in the top 3.[18]
Exercise 14.7 (Evening courses) Create and display a new table with all night courses (i.e., a subset of courses3). Also make a bar plot showing the number of these courses by day of the week.
[16] Source: regular expression tutorial.
[17] For this exercise, you can count cross-listed courses towards both departments’ WA counts.
[18] For the purposes of this exercise, we are just going to leave co-taught courses as is so that you will have an extra row for each pair or triplet of instructors. Alternatives would be to allocate the enrollment number to each of the faculty members or to split it up between the members. The first option would usually be the most appropriate, although it might depend on the course.