7 Data wrangling - Part 2

Settling In

Sit with your proposed project group.

Introduce yourself.

Name, preferred pronouns
Favorite food
Favorite winter activity

Discuss project ideas.

Project overview
Milestone 1 due with HW4 to Moodle
Download a template Quarto file to start from here. Put this file in a folder called wrangling within the activities folder for this course.

Data Storytelling Moment

Go to https://geonarrative.usgs.gov/muledeer255/

What is the data story?
What is effective?
What could be improved?

Next Class - Quiz 1

Who: You!
What: File Organization, Git/GitHub, Data Visualization (Adv. ggplot + Spatial)
Where: Classroom
When: Weds 8am (you have 90 mins)
Why: External motivation to internalize concepts
How: Short answer and multiple choice; No note sheet; Bring a pencil

Learning goals

After this lesson, you should be able to:

Manipulate and explore strings using the stringr package
Construct regular expressions to find patterns in strings

The stringr cheatsheet (HTML, PDF) will be useful to keep open for reference.

Motivation: 30 Years of American Anxieties

In 2018 the data journalism organization The Pudding featured a story called 30 Years of American Anxieties about themes in 30 years of posts to the Dear Abby column (an American advice column).

One way to understand themes in text data is to conduct a qualitative analysis, a methodology in which multiple readers read through instances of text several times to reach a consensus about themes.

Another way to understand themes in text data is computational text analysis.

This is what we will explore today.

Both qualitative analysis and computational tools can be used in tandem. Often, using computational tools can help focus a close reading of select texts, which parallels the spirit of a qualitative analysis.

To prepare ourselves for a computational analysis, let’s learn about strings.

Strings

Strings are objects of the character class (abbreviated as <chr> in tibbles).

When you print out strings, they display with double quotes:

some_string <- "banana"
some_string

[1] "banana"

Working with strings generally will involve the use of regular expressions, a tool for finding patterns in strings.

Regular expressions (regex, for short) look like the following:

"^the" (Strings that start with "the")
"end$" (Strings that end with "end")

Before getting to regular expressions, let’s go over some fundamentals about working with strings. The stringr package (available within tidyverse) is great for working with strings.
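
As a quick preview, the two patterns listed above can be tried on a few strings. This is a minimal sketch: the example strings are made up, and str_detect(), which returns TRUE or FALSE for each string, is covered later in this lesson.

```r
library(stringr)

# Hypothetical example strings, made up for illustration
x <- c("the end", "then again", "weekend", "in the end zone")

# Does each string start with "the"?
starts_with_the <- str_detect(x, "^the")
starts_with_the

# Does each string end with "end"?
ends_with_end <- str_detect(x, "end$")
ends_with_end
```

Note that "then again" counts as starting with "the": the pattern looks at characters, not whole words.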

Creating strings

Creating strings by hand is useful for testing out regular expressions.

To create a string, type any text in either double quotes (") or single quotes ('). The choice between double and single quotes doesn’t matter unless the string itself contains single or double quotes.

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
string3 <- c(string1, string2) # a character vector of length greater than 1


class(string1)

[1] "character"

class(string2)

[1] "character"

class(string3)

[1] "character"

length(string1)

[1] 1

length(string2)

[1] 1

length(string3)

[1] 2

We can view these strings “naturally” (without the opening and closing quotes) with str_view():

str_view(string1)

[1] │ This is a string

str_view(string2)

[1] │ If I want to include a "quote" inside a string, I use single quotes

str_view(string3)

[1] │ This is a string
[2] │ If I want to include a "quote" inside a string, I use single quotes

Exercise: Create the string It's Thursday. What happens if you put the string inside single quotes? Double quotes?

# Your code

Because " and ' are special characters in the creation of strings, R offers another way to put them inside a string. We can escape these special characters by putting a \ in front of them:

string1 <- "This is a string with \"double quotes\""
string2 <- "This is a string with \'single quotes\'"
str_view(string1)

[1] │ This is a string with "double quotes"

str_view(string2)

[1] │ This is a string with 'single quotes'
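
It may help to see that the backslashes are only part of how the string is typed and printed, not part of the string itself. The sketch below uses writeLines() from base R (not otherwise used in this lesson) alongside print() and str_length() to compare the two views.

```r
library(stringr)

x <- "This is a string with \"double quotes\""

# print() shows the string the way you would type it, escapes included
print(x)

# writeLines() shows the actual contents, much like str_view()
writeLines(x)

# The backslashes are not stored in the string, so they are not counted
n_chars <- str_length(x)
n_chars
```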

Given that \ is a special character, how can we put a literal \ in a string? We have to escape it with another backslash and write \\.

Exercise: Create the string C:\Users. What happens when you don’t escape the \?

# Your code

Other special characters include:

\t (Creates a tab)
\n (Creates a newline)

Both can be useful in plots to more neatly arrange text.
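
For instance, a long plot title can be broken across two lines with \n. This is only a sketch: it assumes ggplot2 is installed, uses the built-in mtcars data, and the title text is made up for illustration.

```r
library(ggplot2)

# The \n splits the title into two lines when the plot is drawn
plot_title <- "Fuel efficiency by weight\n(heavier cars tend to get fewer miles per gallon)"

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    labs(title = plot_title)
p
```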

string1 <- "Record temp:\t102"
string2 <- "Record temp:\n102"

str_view(string1)

[1] │ Record temp:{\t}102

str_view(string2)

[1] │ Record temp:
    │ 102

Can we get str_view() to show the tab instead of {\t}? We can use the html argument to have the string displayed as if on a webpage:

str_view(string1, html = TRUE)

Often we will want to create new strings within data frames. We can use str_c() or str_glue(), both of which are vectorized functions: they take vectors as inputs and return vectors as outputs, so they can be used within mutate():

With str_c() the strings to be combined are all separate arguments separated by commas.
With str_glue() the desired string is written as a template with variable names inside curly braces {}.

df <- tibble(
    first_name = c("Arya", "Olenna", "Tyrion", "Melisandre"),
    last_name = c("Stark", "Tyrell", "Lannister", NA)
)
df

# A tibble: 4 × 2
  first_name last_name
  <chr>      <chr>    
1 Arya       Stark    
2 Olenna     Tyrell   
3 Tyrion     Lannister
4 Melisandre <NA>

df %>%
    mutate(
        full_name1 = str_c(first_name, " ", last_name),
        full_name2 = str_glue("{first_name} {last_name}")
    )

# A tibble: 4 × 4
  first_name last_name full_name1       full_name2      
  <chr>      <chr>     <chr>            <glue>          
1 Arya       Stark     Arya Stark       Arya Stark      
2 Olenna     Tyrell    Olenna Tyrell    Olenna Tyrell   
3 Tyrion     Lannister Tyrion Lannister Tyrion Lannister
4 Melisandre <NA>      <NA>             Melisandre NA
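
Notice in the last row that the two functions treat the missing last name differently: str_c() returns NA, while str_glue() pastes in the text "NA". If neither is what you want, one possible approach (a sketch, assuming you would rather show the first name alone) is to swap the missing value for an empty string with dplyr's coalesce() and trim the leftover space with str_trim().

```r
library(tidyverse)

df <- tibble(
    first_name = c("Arya", "Melisandre"),
    last_name = c("Stark", NA)
)

result <- df %>%
    mutate(
        # str_c(): one missing piece makes the whole result missing
        with_na = str_c(first_name, " ", last_name),
        # Replace the missing last name with "", then trim the trailing space
        without_na = str_trim(str_c(first_name, " ", coalesce(last_name, "")))
    )
result
```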

Exercise: In the following data frame, create a full date string in month-day-year format using both str_c() and str_glue().

df_dates <- tibble(
    year = c(2000, 2001, 2002),
    month = c("Jan", "Feb", "Mar"),
    day = c(3, 4, 5)
)

Extracting information from strings

The str_length() function counts the number of characters in a string.

comments <- tibble(
    name = c("Alice", "Bob"),
    comment = c("The essay was well organized around the core message and had good transitions.", "Good job!")
)

comments %>%
    mutate(
        comment_length = str_length(comment)
    )

# A tibble: 2 × 3
  name  comment                                                   comment_length
  <chr> <chr>                                                              <int>
1 Alice The essay was well organized around the core message and…             78
2 Bob   Good job!                                                              9

The str_sub() function gets a substring of a string. The 2nd and 3rd arguments indicate the beginning and ending position to extract.

Negative positions count backward from the end of the string (e.g., -3 is the third character from the end).
Specifying a position beyond the end of the string won’t result in an error. str_sub() will just go as far as possible.

x <- c("Apple", "Banana", "Pear")

str_sub(x, start = 1, end = 3)

[1] "App" "Ban" "Pea"

str_sub(x, start = -3, end = -1)

[1] "ple" "ana" "ear"

str_sub(x, start = 2, end = -1)

[1] "pple"  "anana" "ear"

str_sub("a", start = 1, end = 15)

[1] "a"

Exercise: Using str_sub(), create a new variable with only the middle letter of each word in the data frame below. (Challenge: How would you handle words with an even number of letters?)

df <- tibble(
    word_id = 1:3,
    word = c("replace", "match", "pattern")
)

Finding patterns in strings with regular expressions

Suppose that you’re exploring text data looking for places where people describe happiness. There are many ways to search. We could search for the word “happy”, but that misses “happiness”. Searching for “happi” has the opposite problem and misses “happy”. A shorter stem like “happ” catches both.
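
To see how much the choice of search term matters, here is a small sketch comparing three candidates. The text snippets are made up for illustration, and str_detect() (covered later in this lesson) returns TRUE or FALSE for each string.

```r
library(stringr)

# Hypothetical snippets of text, made up for illustration
x <- c("I am so happy", "happiness is fleeting", "she left unhappily", "I feel sad")

found_happy <- str_detect(x, "happy")
found_happi <- str_detect(x, "happi")
found_happ <- str_detect(x, "happ")

found_happy
found_happi
found_happ
```

The shortest pattern finds the most, but it also picks up "unhappily", so a broader pattern is not automatically a better one.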

Regular expressions (regex) are a powerful language for describing patterns within strings.

data(fruit)
data(words)
data(sentences)

We can use str_view() with the pattern argument to see what parts of a string match the regex supplied in the pattern argument. (Matches are enclosed in <>.)

str_view(fruit, "berry")

 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>

Essentials of forming a regex

Letters and numbers in a regex are matched exactly and are called literal characters.
Most punctuation characters, like ., +, *, [, ], and ?, have special meanings and are called metacharacters.
Quantifiers come after a regex and control how many times a pattern can match:
- ?: match the preceding pattern 0 or 1 times
- +: match the preceding pattern at least once
- *: match the preceding pattern at least 0 times (any number of times)
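
As a minimal sketch of the three quantifiers, the patterns below apply each one to the "u" in a spelling of "colour". The example strings are made up for illustration.

```r
library(stringr)

# Made-up strings for illustration
x <- c("color", "colour", "colouur", "colr")

str_view(x, "colou?r")  # "u" zero or one time
str_view(x, "colou+r")  # "u" one or more times
str_view(x, "colou*r")  # "u" any number of times, including zero

# The same comparisons as TRUE/FALSE vectors
zero_or_one <- str_detect(x, "colou?r")
one_or_more <- str_detect(x, "colou+r")
zero_or_more <- str_detect(x, "colou*r")
```

In each pattern the quantifier applies only to the single character right before it, here the "u".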

Exercise: Before running the code below, predict what matches will be made. Run the code to check your guesses. Note that in all of the regexes below, the ?, +, and * apply to the b only (not the a).

str_view(c("a", "ab", "abb"), "ab?")
str_view(c("a", "ab", "abb"), "ab+")
str_view(c("a", "ab", "abb"), "ab*")

We can match any of a set of characters with [] (called a character class), e.g., [abcd] matches “a”, “b”, “c”, or “d”.
- We can invert the match by starting with ^: [^abcd] matches anything except “a”, “b”, “c”, or “d”.

# Match words that have vowel-x-vowel
str_view(words, "[aeiou]x[aeiou]")

[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st

# Match words that have not_vowel-y-not_vowel
str_view(words, "[^aeiou]y[^aeiou]")

[836] │ <sys>tem
[901] │ <typ>e

Exercise: Using the words data, find words that have two vowels in a row followed by an “m”.

# Your code

The alternation operator | can be read just like the logical operator | (“OR”) to pick between two or more alternative patterns. For example, apple|banana searches for “apple” or “banana”.

str_view(fruit, "apple|melon|nut")

 [1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
[52] │ <nut>
[62] │ pine<apple>
[72] │ rock <melon>
[80] │ water<melon>

Exercise: Using the fruit data, find fruits that have a repeated vowel (“aa”, “ee”, “ii”, “oo”, or “uu”).

# Your code

The ^ operator indicates the beginning of a string, and the $ operator indicates the end of a string. For example, ^a matches strings that start with “a”, and a$ matches strings that end with “a”.
Parentheses group together parts of a regular expression that should be taken as a bundle. (Much like parentheses in arithmetic statements.)
- e.g., ab+ is a little confusing. Does it match “ab” one or more times? Or does it match “a” first, then just “b” one or more times? (The latter, as we saw in an earlier example.) We can be very explicit and use a(b)+.
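
The effect of anchors and grouping can be sketched on a few short strings. The vectors below are made up for illustration, and str_extract() and str_subset() are described later in this lesson.

```r
library(stringr)

# Made-up strings for illustration
x <- c("ab", "abab", "abbb")

str_view(x, "ab+")    # "a", then one or more "b"
str_view(x, "(ab)+")  # the whole bundle "ab", one or more times

# First match of each pattern in each string
ungrouped <- str_extract(x, "ab+")
grouped <- str_extract(x, "(ab)+")

# Anchors tie a pattern to the start or end of the string
y <- c("apple", "banana", "pear")
starts_with_a <- str_subset(y, "^a")
ends_with_a <- str_subset(y, "a$")
```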

Exercise: Using the words data, find (1) words that start with “y” and (2) words that don’t start with “y”.

# Your code

Exploring stringr functions

Read in the “Dear Abby” data underlying The Pudding’s 30 Years of American Anxieties article.

posts <- read_csv("https://raw.githubusercontent.com/the-pudding/data/master/dearabby/raw_da_qs.csv")

Take a couple minutes to scroll through the 30 Years of American Anxieties article to get ideas for themes that you might want to search for using regular expressions.

The following are core stringr functions that use regular expressions:

str_view() - View the strings that match the regex, with each match highlighted
str_count() - Count the number of times a regex matches within a string
str_detect() - Determine if (TRUE/FALSE) the regex is found within string
str_subset() - Return subset of strings that match the regex
str_extract(), str_extract_all() - Return portion of each string that matches the regex. str_extract() extracts the first instance of the match. str_extract_all() extracts all matches.
str_replace(), str_replace_all() - Replace portion of string that matches the regex with something else. str_replace() replaces the first instance of the match. str_replace_all() replaces all instances of the match.
str_remove(), str_remove_all() - Remove the portion of the string that matches the regex. str_remove() is equivalent to str_replace(x, "THE REGEX PATTERN", ""), and str_remove_all() is equivalent to str_replace_all(x, "THE REGEX PATTERN", "").
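
Before turning to the real data, it can help to see each function on a tiny input. The sketch below uses three made-up sentences (not taken from the Dear Abby data) and a single pattern so that the outputs are easy to compare.

```r
library(stringr)

# Made-up example strings (not from the Dear Abby data)
x <- c(
    "My mother and my mother-in-law disagree.",
    "My husband snores.",
    "No family here."
)
pattern <- "mother|husband"

n_matches <- str_count(x, pattern)       # number of matches in each string
has_match <- str_detect(x, pattern)      # TRUE/FALSE: is there any match?
matching <- str_subset(x, pattern)       # keep only the strings with a match
first_match <- str_extract(x, pattern)   # first match in each string (NA if none)
all_matches <- str_extract_all(x, pattern)  # a list holding every match per string

replaced_first <- str_replace(x, pattern, "relative")    # first match only
replaced_all <- str_replace_all(x, pattern, "relative")  # every match
removed_all <- str_remove_all(x, pattern)                # delete every match

n_matches
first_match
replaced_first
```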

Exercise: Starting from str_count(), explore each of these functions by pulling up the function documentation page and reading through the arguments. Try out each function using the posts data.

Solutions

Creating strings

Solution

x <- "It's Thursday" # We need double quotes because of the apostrophe
x
x <- 'It's Thursday'

Error in parse(text = input): <text>:3:10: unexpected symbol
2: x
3: x <- 'It's
            ^

x <- "C:\\Users"
str_view(x)

[1] │ C:\Users

# \U is the start of special escape characters for Unicode characters
# The \U is expected to be followed by certain types of letters and numbers--like \U0928
x <- "C:\Users"

Error: '\U' used without hex digits in character string (<input>:3:10)

df_dates <- tibble(
    year = c(2000, 2001, 2002),
    month = c("Jan", "Feb", "Mar"),
    day = c(3, 4, 5)
)

df_dates %>%
    mutate(
        date1 = str_c(month, "-", day, "-", year),
        date2 = str_glue("{month}-{day}-{year}")
    )

# A tibble: 3 × 5
   year month   day date1      date2     
  <dbl> <chr> <dbl> <chr>      <glue>    
1  2000 Jan       3 Jan-3-2000 Jan-3-2000
2  2001 Feb       4 Feb-4-2001 Feb-4-2001
3  2002 Mar       5 Mar-5-2002 Mar-5-2002

Extracting information from strings

Solution

df <- tibble(
    word_id = 1:3,
    word = c("replace", "match", "pattern")
)

df %>%
    mutate(
        word_length = str_length(word),
        middle_pos = ceiling(word_length/2),
        middle_letter = str_sub(word, middle_pos, middle_pos)
    )

# A tibble: 3 × 5
  word_id word    word_length middle_pos middle_letter
    <int> <chr>         <int>      <dbl> <chr>        
1       1 replace           7          4 l            
2       2 match             5          3 t            
3       3 pattern           7          4 t
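
With ceiling(), a word with an even number of letters gets the left of its two middle letters. One possible alternative for the challenge (a sketch, not the only reasonable answer) is to return both middle letters for even-length words. The word "string" below is a made-up addition to the exercise data so that there is an even-length case to look at.

```r
library(tidyverse)

# "string" is a made-up even-length word added for illustration
df_even <- tibble(word = c("replace", "match", "pattern", "string"))

result <- df_even %>%
    mutate(
        word_length = str_length(word),
        # For odd lengths these two positions are equal (one middle letter);
        # for even lengths they are the two middle positions
        middle_start = ceiling(word_length / 2),
        middle_end = floor(word_length / 2) + 1,
        middle = str_sub(word, middle_start, middle_end)
    )
result
```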

Finding patterns in strings with regular expressions

Solution

# This regex matches "a" followed by at most one "b", so in "abb" only the first b is part of the match
str_view(c("a", "ab", "abb"), "ab?")

[1] │ <a>
[2] │ <ab>
[3] │ <ab>b

# There has to be an "a" followed by at least one b
# This is why the first string "a" isn't matched
str_view(c("a", "ab", "abb"), "ab+")

[2] │ <ab>
[3] │ <abb>

# There must be an "a" and then any number of b's (including zero)
str_view(c("a", "ab", "abb"), "ab*")

[1] │ <a>
[2] │ <ab>
[3] │ <abb>

str_view(words, "[aeiou][aeiou]m")

[154] │ cl<aim>
[714] │ r<oom>
[735] │ s<eem>
[844] │ t<eam>

str_view(fruit, "aa|ee|ii|oo|uu")

 [9] │ bl<oo>d orange
[33] │ g<oo>seberry
[47] │ lych<ee>
[66] │ purple mangost<ee>n

# Words that start with y
str_view(words, "^y")

[975] │ <y>ear
[976] │ <y>es
[977] │ <y>esterday
[978] │ <y>et
[979] │ <y>ou
[980] │ <y>oung

# Words that don't start with y
str_view(words, "^[^y]")

 [1] │ <a>
 [2] │ <a>ble
 [3] │ <a>bout
 [4] │ <a>bsolute
 [5] │ <a>ccept
 [6] │ <a>ccount
 [7] │ <a>chieve
 [8] │ <a>cross
 [9] │ <a>ct
[10] │ <a>ctive
[11] │ <a>ctual
[12] │ <a>dd
[13] │ <a>ddress
[14] │ <a>dmit
[15] │ <a>dvertise
[16] │ <a>ffect
[17] │ <a>fford
[18] │ <a>fter
[19] │ <a>fternoon
[20] │ <a>gain
... and 954 more

Reflection

What was challenging? What was easier? What ideas do you have for keeping track of the many functions relevant to data wrangling?

After Class

Take a look at the Schedule page to see how to prepare for the next class.
Finish Homework 3.
Work on Homework 4.
Continue narrowing your project work; Milestone 1 is due with HW4.
