Regular Expressions/Cleaning Text Data

Brianna Heggeseth

Announcements

MSCS Happenings

Looking Ahead to Registration

Projects

  • Please submit 1 idea for a data science project as soon as possible at this link

Assignments

  • 8 (Data Import + EDA) was due last night
    • EDA on flights, practice process of EDA -> tell a story about flight delays
  • 9 (Regex) due next Wednesday (last coding assignment)
  • 10 (1 Number Story)

Tidy Tuesday

  • Complete 3 (minimum) before the end of the semester
    • 5 more weeks left!

Iterative Viz

  • IV1: Updated graphic by Friday to Moodle

Describing Patterns in Text

Regular expressions (regex) allow us to describe character patterns.

After class, try: Interactive Regex Tutorial

For now, we’ll use our cheatsheet (the back side covers regular expressions)!

Example Regex:

  • “a(b|c)” means a, and then b OR c (e.g. ab, ac)
  • “(ab)|c” means ab OR c (e.g. ab, c); “ab|c” without parentheses means the same thing, because | applies to everything on either side of it
  • “[abc]” means one of a, b, or c (e.g. a, b, c)
  • “[^abc]” means one of anything but a, b, or c (e.g. d, e, f, g, etc.)
  • “a*” means a zero or more times (e.g. nothing at all, a, aa, aaa, aaaaaa, etc.)
  • “a+” means a one or more times (e.g. a, aa, aaa, aaaaaa, etc.)
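
To check how these patterns behave, here is a minimal sketch using str_extract and str_detect from stringr. The test strings ("ac", "xyz", "baaad", "b") are made up for illustration and are not part of the examples used in the rest of these slides.

```r
library(stringr)

# Test strings below are invented for this sketch
alt_plain   <- str_extract("ac", "ab|c")     # "c": the choices are "ab" or "c"
alt_grouped <- str_extract("ac", "a(b|c)")   # "ac": "a", then "b" or "c"
not_abc     <- str_extract("xyz", "[^abc]")  # "x": first character that is not a, b, or c
one_or_more <- str_extract("baaad", "a+")    # "aaa": the whole run of a's
zero_ok     <- str_detect("b", "a*")         # TRUE: zero a's still counts as a match
```

Because “a*” can match zero characters, it is detected in every string, which is rarely what you want on its own.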

Text Examples

library(tidyverse) # loads stringr (str_* functions), tidyr (separate), and tibble
(example <- "The quick brown fox jumps over the lazy dog.")
[1] "The quick brown fox jumps over the lazy dog."



We’ll practice:

  • Replacing text patterns
  • Detecting text patterns
  • Locating text patterns
  • Changing case
  • Separating/splitting text

Search and replace patterns

To search for a pattern and replace it, we can use the functions str_replace (replaces only the first match) and str_replace_all (replaces every match).


example
[1] "The quick brown fox jumps over the lazy dog."
str_replace(example, pattern = "quick", replacement = "really quick")
[1] "The really quick brown fox jumps over the lazy dog."
str_replace_all(example, pattern = "(fox|dog)",  replacement = "****") 
[1] "The quick brown **** jumps over the lazy ****."
str_replace_all(example, "(fox|dog).", "****") # "." for any character
[1] "The quick brown ****jumps over the lazy ****"
str_replace_all(example, "(fox|dog)\\.$", "****") # at end of sentence only, "\\." only for a period
[1] "The quick brown fox jumps over the lazy ****"
str_replace(example, "[Tt]he", "a") # only first match
[1] "a quick brown fox jumps over the lazy dog."
str_replace_all(example, "[Tt]he", "a") # all matches
[1] "a quick brown fox jumps over a lazy dog."
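
As an alternative to writing [Tt]he, stringr's regex() helper accepts ignore_case = TRUE. A small sketch, re-creating example so it runs on its own (the name no_case is only for illustration):

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# regex(..., ignore_case = TRUE) matches "The" and "the" alike
no_case <- str_replace_all(example, regex("the", ignore_case = TRUE), "a")
no_case
```

Unlike [Tt]he, this version would also match spellings such as "THE" or "tHe".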

Detect patterns

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"
(examples <- c(example, example2, example3))
[1] "The quick brown fox jumps over the lazy dog."                                                                                                        
[2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
[3] "This is a test"                                                                                                                                      
pat <- "[^aeiouAEIOU ]{3}" # Three characters in a row that are neither vowels nor spaces. This finds runs of three consonants, but punctuation and digits count too
str_detect(examples, pat) 
[1]  TRUE  TRUE FALSE
str_subset(examples, pat)
[1] "The quick brown fox jumps over the lazy dog."                                                                                                        
[2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
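
One thing to watch: pat excludes vowels and spaces, so anything else (digits, commas, periods) is treated like a consonant. The sketch below compares it with a stricter, hypothetical pattern built from letter ranges; strict_pat and the test strings are assumptions made for this illustration.

```r
library(stringr)

pat <- "[^aeiouAEIOU ]{3}"   # pattern from above: anything except vowels and spaces

# Hypothetical stricter pattern: letter ranges that skip a, e, i, o, u
strict_pat <- regex("[b-df-hj-np-tv-z]{3}", ignore_case = TRUE)

loose_hit  <- str_detect("1,2", pat)           # TRUE: digits and commas are not vowels
strict_hit <- str_detect("1,2", strict_pat)    # FALSE: no letters at all
word_hit   <- str_detect("jumps", strict_pat)  # TRUE: "mps"
```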

Locate patterns

example
[1] "The quick brown fox jumps over the lazy dog."
str_locate(example, pat) # starting position and ending position of first match
     start end
[1,]    23  25

Let’s check the answer:

str_sub(example, 23, 25)
[1] "mps"
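
str_locate only reports the first match. To get every match, stringr also provides str_locate_all, which returns a list with one start/end matrix per input string. A minimal sketch using the [Tt]he pattern from earlier (locs and found are names chosen for this example):

```r
library(stringr)

example <- "The quick brown fox jumps over the lazy dog."

# One matrix per input string; [[1]] pulls out the matrix for example
locs <- str_locate_all(example, "[Tt]he")[[1]]
locs

# Check the answer by pulling out each located piece
found <- str_sub(example, locs[, "start"], locs[, "end"])
found
```

Here locs has two rows, one for "The" at positions 1 to 3 and one for "the" at positions 32 to 34.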

Extract patterns

example2
[1] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
pat2 <- "[^aeiouAEIOU ][aeiouAEIOU]{2}[^aeiouAEIOU ]{1}" # consonant followed by two vowels followed by a consonant

str_extract(example2, pat2) # extract first match
[1] "road"
str_extract_all(example2, pat2, simplify = TRUE) # extract all matches
     [,1]   [,2]   [,3]   [,4]   [,5]   [,6]  
[1,] "road" "wood" "coul" "tood" "look" "coul"

Count the number of characters

example2
[1] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
str_length(example2)
[1] 148
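
Like the other str_ functions, str_length works on a whole vector at once. A short sketch with two of the earlier strings re-created so it runs on its own:

```r
library(stringr)

examples <- c("The quick brown fox jumps over the lazy dog.", "This is a test")

# One length per element; spaces and punctuation are counted
lens <- str_length(examples)
lens
```

This returns 44 and 14.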

Change case

example2
[1] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
str_to_lower(example2)
[1] "two roads diverged in a yellow wood, / and sorry i could not travel both / and be one traveler, long i stood / and looked down one as far as i could"
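
stringr has companion functions for other cases, str_to_upper and str_to_title. A quick sketch on the short test string:

```r
library(stringr)

example3 <- "This is a test"

upper <- str_to_upper(example3)  # "THIS IS A TEST"
title <- str_to_title(example3)  # "This Is A Test"
```

Converting to one case before matching is a simple way to avoid patterns like [Tt]he.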

Split strings

df <- tibble(ex = example2)
df <- df %>% separate(ex, c("line1", "line2", "line3", "line4"), sep = " / ")
df
# A tibble: 1 × 4
  line1                                line2                         line3 line4
  <chr>                                <chr>                         <chr> <chr>
1 Two roads diverged in a yellow wood, And sorry I could not travel… And … And …
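
separate works on a column of a data frame. If the text is just a character vector, str_split is an alternative; it returns a list with one character vector per input string. A minimal sketch (poem_lines is a name chosen for this example):

```r
library(stringr)

example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"

# Split on the " / " separator; [[1]] pulls out the pieces for example2
poem_lines <- str_split(example2, " / ")[[1]]

length(poem_lines)  # 4
poem_lines[1]       # "Two roads diverged in a yellow wood,"
```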

Practice

Go to our course website, create a new Rmd file from the template, and save it in a folder called Assignment_09.

After Class

Regular Expressions

Other Assignments

  • Iterative Viz