Regular Expressions/Cleaning Text Data

Brianna Heggeseth

Announcements

MSCS Happenings

MSCS Student Instagram

Looking Ahead to Registration

MSCS Waitlist Info

Projects

Please submit 1 idea for a data science project as soon as possible at this link

Assignments

Assignment 8 (Data Import + EDA) was due last night
- EDA on flights, practice process of EDA -> tell a story about flight delays
Assignment 9 (Regex) due next Wednesday (last coding assignment)
Assignment 10 (1 Number Story)

Tidy Tuesday

Complete 3 (minimum) before the end of the semester
- 5 more weeks left!

Iterative Viz

IV1: Updated graphic by Friday to Moodle

Describing Patterns in Text

Regular expressions (regex) allow us to describe character patterns.

After class, try: Interactive Regex Tutorial

For now, we’ll use our cheatsheet (the back side covers regular expressions)!

Example Regex:

“a(b|c)” means a, and then b OR c (e.g., ab, ac)
“ab|c” means ab OR c, the same as “(ab)|c”, because | applies to everything on either side of it (e.g., ab, c)
“[abc]” means one of a, b, or c (e.g., a, b, c)
“[^abc]” means one character that is anything but a, b, or c (e.g., d, e, f, g, etc.)
“a*” means a zero or more times (e.g., the empty string, a, aa, aaa, aaaaaa, etc.)
“a+” means a one or more times (e.g., a, aa, aaa, aaaaaa, etc.)
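To check how these patterns behave, here is a minimal sketch using str_detect. The test strings and the helper full_match are made up for illustration, and the anchors ^ and $ are added so that the whole string has to match the pattern.

```r
library(stringr)

# Hypothetical test strings, chosen only for illustration
x <- c("ab", "ac", "c", "d", "", "aaa")

# Hypothetical helper: wrap the pattern in ^( )$ so the entire string must match
full_match <- function(strings, pattern) {
  str_detect(strings, str_c("^(", pattern, ")$"))
}

full_match(x, "a(b|c)")  # a, then b OR c: TRUE for "ab" and "ac"
full_match(x, "ab|c")    # ab OR c: TRUE for "ab" and "c"
full_match(x, "[^abc]")  # one character other than a, b, c: TRUE for "d"
full_match(x, "a*")      # zero or more a's: TRUE for "" and "aaa"
full_match(x, "a+")      # one or more a's: TRUE for "aaa" only
```

Without the anchors, str_detect only asks whether the pattern appears somewhere in the string, so "a*" would be TRUE for every string.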

Text Examples

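library(tidyverse) # loads stringr (the str_* functions), tibble, tidyr, and the %>% pipe used below
(example <- "The quick brown fox jumps over the lazy dog.")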

[1] "The quick brown fox jumps over the lazy dog."

We’ll practice:

Replacing text patterns
Detecting text patterns
Locating text patterns
Extracting text patterns
Counting characters
Changing case
Splitting text

Search and replace patterns

To search for a pattern and replace it, we can use the stringr functions str_replace (replaces only the first match) and str_replace_all (replaces every match).

example

[1] "The quick brown fox jumps over the lazy dog."

str_replace(example, pattern = "quick", replacement = "really quick")

[1] "The really quick brown fox jumps over the lazy dog."

str_replace_all(example, pattern = "(fox|dog)",  replacement = "****")

[1] "The quick brown **** jumps over the lazy ****."

str_replace_all(example, "(fox|dog).", "****") # "." for any character

[1] "The quick brown ****jumps over the lazy ****"

str_replace_all(example, "(fox|dog)\\.$", "****") # "$" anchors the match to the end of the string; "\\." matches only a literal period

[1] "The quick brown fox jumps over the lazy ****"

str_replace(example, "[Tt]he", "a") # only first match

[1] "a quick brown fox jumps over the lazy dog."

str_replace_all(example, "[Tt]he", "a") # all matches

[1] "a quick brown fox jumps over a lazy dog."
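The difference between "." and "\\." is easy to miss, so here is a small sketch of it. The string prices is a made-up example, not part of the class data. fixed() is a stringr helper that treats the pattern as plain text instead of a regular expression.

```r
library(stringr)

# Hypothetical string, for illustration only
prices <- "3.50 or 3x50"

# "." matches any character, so both "3.50" and "3x50" are replaced
any_char <- str_replace_all(prices, "3.50", "N")

# "\\." matches only a literal period
literal_dot <- str_replace_all(prices, "3\\.50", "N")

# fixed() turns off regex interpretation, so "." is just a period
plain_text <- str_replace_all(prices, fixed("3.50"), "N")

any_char     # "N or N"
literal_dot  # "N or 3x50"
plain_text   # "N or 3x50"
```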

Detect patterns

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"
(examples <- c(example, example2, example3))

[1] "The quick brown fox jumps over the lazy dog."                                                                                                        
[2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
[3] "This is a test"

pat <- "[^aeiouAEIOU ]{3}" # three consecutive characters that are neither vowels nor spaces (mostly consonants, but punctuation and digits match too)

str_detect(examples, pat)

[1]  TRUE  TRUE FALSE

str_subset(examples, pat)

[1] "The quick brown fox jumps over the lazy dog."                                                                                                        
[2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
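As a rough sketch of how the two functions relate: str_subset gives the same result as indexing with str_detect. This assumes a reasonably recent stringr (the negate argument was added in version 1.4.0), and it uses only two of the example strings to stay short.

```r
library(stringr)

examples <- c(
  "The quick brown fox jumps over the lazy dog.",
  "This is a test"
)
pat <- "[^aeiouAEIOU ]{3}"

# Indexing with the logical vector keeps the strings that match
kept <- examples[str_detect(examples, pat)]

# negate = TRUE keeps the strings that do NOT match
dropped <- str_subset(examples, pat, negate = TRUE)

# Summing TRUE/FALSE counts how many strings contain the pattern
n_matching <- sum(str_detect(examples, pat))

kept        # the fox sentence
dropped     # "This is a test"
n_matching  # 1
```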

Locate patterns

example

[1] "The quick brown fox jumps over the lazy dog."

str_locate(example, pat) # starting position and ending position of first match

     start end
[1,]    23  25

Let’s check the answer:

str_sub(example, 23, 25)

[1] "mps"
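str_locate only reports the first match. If every match is needed, str_locate_all is one option; a minimal sketch, reusing the "[Tt]he" pattern from the replacement examples:

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# str_locate_all returns a list with one start/end matrix per input string
locs <- str_locate_all(example, "[Tt]he")[[1]]
locs

# Pull out each located piece to check the positions
pieces <- str_sub(example, locs[, "start"], locs[, "end"])
pieces  # "The" "the"
```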

Extract patterns

example2

[1] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"

pat2 <- "[^aeiouAEIOU ][aeiouAEIOU]{2}[^aeiouAEIOU ]{1}" # consonant, then two vowels, then a consonant (here "consonant" means any character that is not a vowel or a space)

str_extract(example2, pat2) # extract first match

[1] "road"

str_extract_all(example2, pat2, simplify = TRUE) # extract all matches

     [,1]   [,2]   [,3]   [,4]   [,5]   [,6]  
[1,] "road" "wood" "coul" "tood" "look" "coul"
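Like the other stringr functions, str_extract works on a whole vector of strings. A short sketch, using two of the example strings: a string with no match gives NA, and str_count (not used above) reports how many matches each string has.

```r
library(stringr)

examples <- c(
  "The quick brown fox jumps over the lazy dog.",
  "This is a test"
)
pat2 <- "[^aeiouAEIOU ][aeiouAEIOU]{2}[^aeiouAEIOU ]"

first_match <- str_extract(examples, pat2)  # NA when a string has no match
n_matches <- str_count(examples, pat2)      # number of matches per string

first_match  # "quic" NA
n_matches    # 1 0
```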

Count the number of characters

example2

[1] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"

str_length(example2)

[1] 148
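str_length also works on a vector, and it counts every character, including spaces and punctuation. A quick sketch with two of the example strings:

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."
example3 <- "This is a test"

# One length per string; spaces and the final period are counted
lengths <- str_length(c(example, example3))
lengths  # 44 14
```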

Change case

example2

[1] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"

str_to_lower(example2)

[1] "two roads diverged in a yellow wood, / and sorry i could not travel both / and be one traveler, long i stood / and looked down one as far as i could"
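stringr has companion functions for other cases as well. A minimal sketch with str_to_upper and str_to_title on the short example string:

```r
library(stringr)

example3 <- "This is a test"

upper <- str_to_upper(example3)  # every letter upper case
title <- str_to_title(example3)  # first letter of each word upper case

upper  # "THIS IS A TEST"
title  # "This Is A Test"
```

Converting to one case before detecting or replacing patterns can make a pattern like "[Tt]he" simpler to write.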

Split strings

df <- tibble(ex = example2)
df <- df %>% separate(ex, c("line1", "line2", "line3", "line4"), sep = " / ")
df

# A tibble: 1 × 4
  line1                                line2                         line3 line4
  <chr>                                <chr>                         <chr> <chr>
1 Two roads diverged in a yellow wood, And sorry I could not travel… And … And …
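When the text lives in a plain character vector rather than a data frame column, str_split is one alternative to separate. A rough sketch, using a shortened two-line version of the poem (an assumption made to keep the example small):

```r
library(stringr)

# Shortened version of example2, for illustration only
poem <- "Two roads diverged in a yellow wood, / And sorry I could not travel both"

# str_split returns a list with one character vector per input string
poem_lines <- str_split(poem, " / ")[[1]]
poem_lines

# simplify = TRUE returns a character matrix instead (one row per input string)
poem_matrix <- str_split(poem, " / ", simplify = TRUE)
dim(poem_matrix)  # 1 2
```

Unlike separate, str_split does not need the number of pieces to be known in advance.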

Practice

Go to our course website, create a new Rmd template file, and save it in a folder called Assignment_09.

After Class

Regular Expressions

Try Interactive Regex Tutorial
Continue working on Regex activity (due next week)

Other Assignments

Iterative Viz