13 Working with character data: Strings

SETTLING IN

Welcome Back!

It’s Tuesday new-sday again!

Sit with new people / people you haven’t worked with much. Meet each other. Remember that you’ll be working in groups on the course project, so the more people you get to know now, the better!

Help each other with the following:

Prepare to take notes. Open the activity qmd file.

Learning goals

Learn some fundamentals of working with strings of text data.
Learn functions to search and replace, detect patterns, locate patterns, extract patterns, and separate text with the stringr package.

Additional resources

For more on this topic

Watch:

Working with strings (Lisa Lendway)

Read:

strings cheat sheet
Strings (Wickham, Çetinkaya-Rundel, & Grolemund)
Regular expressions (Baumer, Kaplan, and Horton)

Additional tutorials and tools:

RegExplain RStudio addin (Garrick Aden-Buie)
regexr exploration tool

13.1 Warm-up

WHERE ARE WE?

We’re in the last day of our “data preparation” unit:

Before spring break, we started discussing some considerations in working with special types of “categorical” variables: characters and factors.

Converting characters to factors (and factors to meaningful factors) (last time)
When categorical information is stored as a character variable, the categories of interest might not be labeled or ordered in a meaningful way. We can fix that!
Strings (today!)
When working with character strings, we might want to detect, replace, or extract certain patterns. For example, recall our data on courses:

courses_old <- read.csv("https://mac-stat.github.io/data/courses.csv")
    
# Check out the data
head(courses_old)
##     sessionID dept level    sem enroll     iid
## 1 session1784    M   100 FA1991     22 inst265
## 2 session1785    k   100 FA1991     52 inst458
## 3 session1791    J   100 FA1993     22 inst223
## 4 session1792    J   300 FA1993     20 inst235
## 5 session1794    J   200 FA1993     22 inst234
## 6 session1795    J   200 SP1994     26 inst230
    
# Check out the structure of each variable
# Many of these are characters!
str(courses_old)
## 'data.frame':    1718 obs. of  6 variables:
##  $ sessionID: chr  "session1784" "session1785" "session1791" "session1792" ...
##  $ dept     : chr  "M" "k" "J" "J" ...
##  $ level    : int  100 100 100 300 200 200 200 100 300 100 ...
##  $ sem      : chr  "FA1991" "FA1991" "FA1993" "FA1993" ...
##  $ enroll   : int  22 52 22 20 22 26 25 38 16 43 ...
##  $ iid      : chr  "inst265" "inst458" "inst223" "inst235" ...

Focusing on just the sem character variable, we might want to…

change FA to fall_ and SP to spring_
keep only courses taught in fall
split the variable into 2 new variables: semester (FA or SP) and year

Much more! (maybe in your projects or COMP/STAT 212)
There are a lot of ways to process character variables. For example, we might have a variable that records the text for a sample of news articles. We might want to analyze things like the articles’ sentiments, word counts, typical word lengths, most common words, etc.

ESSENTIAL STRING FUNCTIONS

The stringr package within tidyverse contains lots of functions to help process strings. We’ll focus on the most common. Letting x be a string variable…

function	arguments	returns
`str_replace()`	`x, pattern, replacement`	a modified string
`str_replace_all()`	`x, pattern, replacement`	a modified string
`str_to_lower()`	`x`	a modified string
`str_sub()`	`x, start, end`	a modified string
`str_extract()`	`x, pattern`	a modified string
`str_length()`	`x`	a number
`str_detect()`	`x, pattern`	TRUE/FALSE

EXAMPLE 1

Consider the following data with string variables:

library(tidyverse)

classes <- data.frame(
  sem        = c("SP2023", "FA2023", "SP2024"),
  area       = c("History", "Math", "Anthro"),
  enroll     = c("30 - people", "20 - people", "25 - people"),
  instructor = c("Ernesto Capello", "Lori Ziegelmeier", "Arjun Guneratne")
)

classes
##      sem    area      enroll       instructor
## 1 SP2023 History 30 - people  Ernesto Capello
## 2 FA2023    Math 20 - people Lori Ziegelmeier
## 3 SP2024  Anthro 25 - people  Arjun Guneratne

Using only your intuition, use our str_ functions to complete the following. NOTE: You might be able to use other wrangling verbs in some cases, but focus on the new functions here.

# Define a new variable "num" that adds up the number of characters in the area label

# Change the areas to "history", "math", "anthro" instead of "History", "Math", "Anthro"

# Create a variable that id's which courses were taught in spring

# Change the semester labels to "fall2023", "spring2024", "spring2023"

# Use sem to create 2 new variables, one with only the semester (SP/FA) and 1 with the year

# In the enroll variable, keep only the number and convert to a numeric variable

If you finish quickly, see if you can complete the same tasks above using a different approach (different pattern or different str_* function). Think about the assumptions you are making about the character patterns.

SUMMARY

Here’s what we learned about each function:

str_replace(x, pattern, replacement) finds the first part of x that matches the pattern and replaces it with replacement
str_replace_all(x, pattern, replacement) finds all parts of x that match the pattern and replaces each of them with replacement
str_to_lower(x) converts all upper case letters in x to lower case
str_sub(x, start, end) only keeps a subset of characters in x, from start (a number indexing the first letter to keep) to end (a number indexing the last letter to keep)
str_extract(x, pattern) finds the first part of x that matches the pattern and extracts it
str_length(x) records the number of characters in x
str_detect(x, pattern) is TRUE if x contains the given pattern and FALSE otherwise
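
As a quick check of these descriptions, here is a minimal sketch that runs each function on a small vector. The vector x and its values are made up for illustration and are not part of the course data.

```r
library(stringr)

# Hypothetical strings, invented for illustration
x <- c("banana split", "Alaska pie")

str_replace(x, "a", "_")       # only the first lower case "a" in each string changes
## [1] "b_nana split" "Al_ska pie"

str_replace_all(x, "a", "_")   # every lower case "a" changes
## [1] "b_n_n_ split" "Al_sk_ pie"

str_to_lower(x)                # "Alaska pie" becomes "alaska pie"
str_sub(x, 1, 5)               # keeps characters 1 through 5
str_extract(x, "[a-z]+")       # first run of lower case letters in each string
str_length(x)                  # spaces count as characters
str_detect(x, "pie")           # FALSE for the first string, TRUE for the second
```

Note that the patterns are case-sensitive: the capital "A" in "Alaska" is left alone by both replace functions.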

EXAMPLE 2

Suppose we only want the spring courses:

# How can we do this after mutating?
classes %>% 
  mutate(spring = str_detect(sem, "SP"))
##      sem    area      enroll       instructor spring
## 1 SP2023 History 30 - people  Ernesto Capello   TRUE
## 2 FA2023    Math 20 - people Lori Ziegelmeier  FALSE
## 3 SP2024  Anthro 25 - people  Arjun Guneratne   TRUE

# We don't have to mutate first!
classes %>% 
  filter(str_detect(sem, "SP"))
##      sem    area      enroll      instructor
## 1 SP2023 History 30 - people Ernesto Capello
## 2 SP2024  Anthro 25 - people Arjun Guneratne

# Yet another way
classes %>% 
  filter(!str_detect(sem, "FA"))
##      sem    area      enroll      instructor
## 1 SP2023 History 30 - people Ernesto Capello
## 2 SP2024  Anthro 25 - people Arjun Guneratne

EXAMPLE 3

Suppose we wanted to get separate columns for the first and last names of each course instructor in classes. Try doing this using str_sub(). But don’t try too long! Explain what trouble you ran into.

How would you describe what you want to do in words (think about describing the pattern of characters)?

EXAMPLE 4

We can use regular expressions to help us describe patterns in characters. For example, if we describe the pattern of a full name as “a set of lower and uppercase letters” and then “a space” and then “a set of lower and uppercase letters”, we can use the following regular expression to describe that whole pattern:

[a-zA-Z]+ [a-zA-Z]+ # + means 1 or more
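
To see what this pattern accepts and rejects, here is a small sketch using str_detect(). The first string is one of the instructor names from classes. The other two are made-up variations, used only to show when the pattern fails.

```r
library(stringr)

# Pattern from the text: letters, then one space, then letters
full_name <- "[a-zA-Z]+ [a-zA-Z]+"

# First value is from classes; the other two are invented for illustration
candidates <- c("Lori Ziegelmeier", "Lori", "Lori_Ziegelmeier")

str_detect(candidates, full_name)
## [1]  TRUE FALSE FALSE
```

The second and third strings fail because neither contains a space with letters on both sides.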

To extract the first name, we could use the following regular expression that says to look at the beginning of the string (^) for a set of lower and upper case letters:

^[a-zA-Z]+

classes %>% 
  mutate(first = str_extract(instructor, "^[a-zA-Z]+"))
##      sem    area      enroll       instructor   first
## 1 SP2023 History 30 - people  Ernesto Capello Ernesto
## 2 FA2023    Math 20 - people Lori Ziegelmeier    Lori
## 3 SP2024  Anthro 25 - people  Arjun Guneratne   Arjun

To extract the last name, we could use the following regular expression that says to look at the end of the string ($) for a set of lower and upper case letters:

[a-zA-Z]+$

classes %>% 
  mutate(last = str_extract(instructor, "[a-zA-Z]+$"))
##      sem    area      enroll       instructor        last
## 1 SP2023 History 30 - people  Ernesto Capello     Capello
## 2 FA2023    Math 20 - people Lori Ziegelmeier Ziegelmeier
## 3 SP2024  Anthro 25 - people  Arjun Guneratne   Guneratne

What does this assume about the structure of the instructor values?

EXAMPLE 5

Alternatively, we can use separate() to split a column into 2+ new columns:

classes %>% 
  separate(instructor, c("first", "last"), sep = " ")
##      sem    area      enroll   first        last
## 1 SP2023 History 30 - people Ernesto     Capello
## 2 FA2023    Math 20 - people    Lori Ziegelmeier
## 3 SP2024  Anthro 25 - people   Arjun   Guneratne

# Sometimes the function can "intuit" how we want to separate the variable
classes %>% 
  separate(instructor, c("first", "last"))
##      sem    area      enroll   first        last
## 1 SP2023 History 30 - people Ernesto     Capello
## 2 FA2023    Math 20 - people    Lori Ziegelmeier
## 3 SP2024  Anthro 25 - people   Arjun   Guneratne

Separate enroll into 2 separate columns: students and people. (These columns don’t make sense; this is just practice.)

# classes %>% 
#   separate(___, c(___, ___), sep = "___")

We separated sem into semester and year above using str_sub(). Why would this be hard using separate()?
When we want to split a column into 2+ new columns (or do other types of string processing), but there’s no consistent pattern by which to do this, we can use regular expressions (an optional topic):

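# (?<=SP|FA): the characters just *before* the split point are "SP" or "FA"
# (NOTE: square brackets, as in [SP|FA], would instead mean "any one of the characters S, P, |, F, A")
# (?=2): the first character *after* the split point is a 2
classes %>% 
  separate(sem, 
          c("semester", "year"),
          "(?<=SP|FA)(?=2)")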
##   semester year    area      enroll       instructor
## 1       SP 2023 History 30 - people  Ernesto Capello
## 2       FA 2023    Math 20 - people Lori Ziegelmeier
## 3       SP 2024  Anthro 25 - people  Arjun Guneratne

# More general:
# (?<=[A-Z]): the character just *before* the split point is an upper case letter
# (?=[0-9]): the first character *after* the split point is a digit
classes %>% 
  separate(sem, 
          c("semester", "year"),
          "(?<=[A-Z])(?=[0-9])")
##   semester year    area      enroll       instructor
## 1       SP 2023 History 30 - people  Ernesto Capello
## 2       FA 2023    Math 20 - people Lori Ziegelmeier
## 3       SP 2024  Anthro 25 - people  Arjun Guneratne
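
If the lookaround syntax feels heavy, the same split can be sketched with two str_extract() calls instead. This is an alternative for illustration, not the approach used above, and it assumes every value is a run of upper case letters followed by a run of digits. The data frame sems is a stand-in that copies the sem values from classes.

```r
library(dplyr)
library(stringr)

# Stand-in data: the same sem values as in classes
sems <- data.frame(sem = c("SP2023", "FA2023", "SP2024"))

sem_split <- sems %>% 
  mutate(semester = str_extract(sem, "^[A-Z]+"),  # upper case letters at the start
         year     = str_extract(sem, "[0-9]+$"))  # digits at the end

sem_split
##      sem semester year
## 1 SP2023       SP 2023
## 2 FA2023       FA 2023
## 3 SP2024       SP 2024
```

Unlike separate(), this keeps the original sem column. In both approaches year is still a character variable, so wrap it in as.numeric() if you need to do arithmetic with it.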

13.2 Exercises

Exercise 1: Time slots

The courses data includes actual data scraped from Mac’s class schedule. (Thanks to Prof Leslie Myint for the scraping code!!)

If you want to learn how to scrape data, take COMP/STAT 212, Intermediate Data Science! NOTE: For simplicity, I removed classes that had “TBA” for the days.

courses <- read.csv("https://mac-stat.github.io/data/registrar.csv")

# Check it out
head(courses)
##        number   crn                                                name  days
## 1 AMST 112-01 10318         Introduction to African American Literature M W F
## 2 AMST 194-01 10073              Introduction to Asian American Studies M W F
## 3 AMST 194-F1 10072 What’s After White Empire - And Is It Already Here?  T R 
## 4 AMST 203-01 10646 Politics and Inequality: The American Welfare State M W F
## 5 AMST 205-01 10842                         Trans Theories and Politics  T R 
## 6 AMST 209-01 10474                   Civil Rights in the United States   W  
##              time      room             instructor avail_max
## 1 9:40 - 10:40 am  MAIN 009       Daylanne English    3 / 20
## 2  1:10 - 2:10 pm MUSIC 219          Jake Nagasawa   -4 / 16
## 3  3:00 - 4:30 pm   HUM 214 Karin Aguilar-San Juan    0 / 14
## 4 9:40 - 10:40 am  CARN 305          Lesley Lavery    3 / 25
## 5  3:00 - 4:30 pm  MAIN 009              Myrl Beam   -2 / 20
## 6 7:00 - 10:00 pm  MAIN 010         Walter Greason   -1 / 15

Use our more familiar wrangling tools to warm up.

# Construct a table that indicates the number of classes offered in each day/time slot
# Print only the 6 most popular time slots

Exercise 2: Prep the data

So that we can analyze it later, we want to wrangle the courses data:

Let’s get some enrollment info:
- Split avail_max into 2 separate variables: avail and max.
- Use avail and max to define a new variable called enroll. HINT: You’ll need as.numeric()
Split the course number into 3 separate variables: dept, number, and section. HINT: You can use separate() to split a variable into 3, not just 2 new variables.

Store this as courses_clean so that you can use it later.

Exercise 3: Courses by department

Using courses_clean…

# Identify the 6 departments that offered the most sections


# Identify the 6 departments with the longest average course titles

Exercise 4: STAT courses

Part a

Get a subset of courses_clean that only includes courses taught by Alicia Johnson.

Part b

Create a new dataset from courses_clean, named stat, that only includes STAT sections. In this dataset:

In the course names:
- Remove “Introduction to” from any name.
- Shorten “Statistical” to “Stat” where relevant.
Define a variable that records the start_time for the course.
Keep only the number, name, start_time, enroll columns.
The result should have 19 rows and 4 columns.

Exercise 5: More cleaning

In the next exercises, we’ll dig into enrollments. Let’s get the data ready for that analysis here. Make the following changes to the courses_clean data. Because they have different enrollment structures, and we don’t want to compare apples and oranges, remove the following:

all sections in PE and INTD (interdisciplinary studies courses)
all music ensembles and dance practicums, i.e. all MUSI and THDA classes with numbers less than 100. HINT: !(dept == "MUSI" & as.numeric(number) < 100)
all lab sections. Be careful which variable you use here. For example, you don’t want to search by “Lab” and accidentally eliminate courses with words such as “Labor”.

Save the results as enrollments (don’t overwrite courses_clean).

Exercise 6: Enrollment & departments

Explore enrollments by department. You decide what research questions to focus on. Use both visual and numerical summaries.

Exercise 7: Enrollment & faculty

Let’s now explore enrollments by instructor. In doing so, we have to be cautious of cross-listed courses that are listed under multiple different departments. For example:

enrollments %>%
  filter(dept %in% c("STAT", "COMP"), number == 112, section == "01")
##   dept number section   crn                         name  days           time
## 1 COMP    112      01 10248 Introduction to Data Science  T R  3:00 - 4:30 pm
## 2 STAT    112      01 10249 Introduction to Data Science  T R  3:00 - 4:30 pm
##       room        instructor avail max enroll
## 1 OLRI 254 Brianna Heggeseth     1  28     27
## 2 OLRI 254 Brianna Heggeseth     1  28     27

Notice that these are the exact same section! In order to not double count an instructor’s enrollments, we can keep only the courses that have distinct() combinations of days, time, instructor values:

enrollments_2 <- enrollments %>% 
  distinct(days, time, instructor, .keep_all = TRUE)

# NOTE: distinct() keeps the first row it finds for each combination, which here is the first department alphabetically
# That's fine because we won't use this to analyze department enrollments!
enrollments_2 %>% 
  filter(instructor == "Brianna Heggeseth", name == "Introduction to Data Science")
##   dept number section   crn                         name  days           time
## 1 COMP    112      01 10248 Introduction to Data Science  T R  3:00 - 4:30 pm
##       room        instructor avail max enroll
## 1 OLRI 254 Brianna Heggeseth     1  28     27

Now, explore enrollments by instructor. You decide what research questions to focus on. Use both visual and numerical summaries.

CAVEAT: The above code doesn’t deal with co-taught courses that have more than one instructor. Thus instructors that co-taught are recorded as a pair, and their co-taught enrollments aren’t added to their total enrollments. This is tough to get around with how the data were scraped as the instructor names are smushed together, not separated by a comma!

Optional extra practice

# Make a bar plot showing the number of night courses by day of the week
# Use courses_clean

Dig Deeper: regex

Example 4 gave 1 small example of a regular expression.

These are handy when we want to process a string variable, but there’s no consistent pattern by which to do this. You must think about the structure of the string and how you can use regular expressions to capture the patterns you want (and exclude the patterns you don’t want).

For example, how would you describe the pattern of a 10-digit phone number? Limit yourself to just a US phone number for now.

The first 3 digits are the area code.
The next 3 digits are the exchange code.
The last 4 digits are the subscriber number.

Thus, a regular expression for a US phone number could be:

[:digit:]{3}-[:digit:]{3}-[:digit:]{4} which limits you to the XXX-XXX-XXXX pattern or
\\([:digit:]{3}\\) [:digit:]{3}-[:digit:]{4} which limits you to the (XXX) XXX-XXXX pattern or
[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4} which limits you to the XXX.XXX.XXXX pattern

The following would include the three patterns above in addition to the XXXXXXXXXX pattern (no dashes or periods). The space inside the second set of square brackets is what lets the (XXX) XXX-XXXX form match:

[\\(]*[:digit:]{3}[-.\\) ]*[:digit:]{3}[-.]*[:digit:]{4}
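
To check which formats each pattern accepts, here is a minimal sketch with str_detect(). The phone numbers are made up. The patterns are written with [0-9] standing in for [:digit:], and the combined pattern allows a space after the closing parenthesis so that the (XXX) XXX-XXXX form can match. Both choices are assumptions of this sketch.

```r
library(stringr)

# Made-up phone numbers: one per format, plus a 7-digit number that should fail
phones <- c("555-123-4567", "(555) 123-4567", "555.123.4567", "5551234567", "123-4567")

# One pattern per format, with [0-9] in place of [:digit:]
p_dash  <- "[0-9]{3}-[0-9]{3}-[0-9]{4}"
p_paren <- "\\([0-9]{3}\\) [0-9]{3}-[0-9]{4}"
p_dot   <- "[0-9]{3}\\.[0-9]{3}\\.[0-9]{4}"

# Combined pattern: optional "(", then optional separators between the digit groups
p_any   <- "[\\(]*[0-9]{3}[-.\\) ]*[0-9]{3}[-.]*[0-9]{4}"

str_detect(phones, p_dash)
## [1]  TRUE FALSE FALSE FALSE FALSE
str_detect(phones, p_paren)
## [1] FALSE  TRUE FALSE FALSE FALSE
str_detect(phones, p_dot)
## [1] FALSE FALSE  TRUE FALSE FALSE
str_detect(phones, p_any)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE
```

The combined pattern is also more permissive than the three it replaces. For example, it would accept a mix such as 555.123-4567, so decide whether that is a pattern you want to include.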

In order to write a regular expression, you first need to consider what patterns you want to include and exclude.

Work through the following examples, and the tutorial after them to learn about the syntax.

EXAMPLES

# Define some strings to play around with
example <- "The quick brown fox jumps over the lazy dog."

str_replace(example, "quick", "really quick")
## [1] "The really quick brown fox jumps over the lazy dog."

str_replace_all(example, "(fox|dog)", "****") # | reads as OR
## [1] "The quick brown **** jumps over the lazy ****."

str_replace_all(example, "(fox|dog).", "****") # "." for any character
## [1] "The quick brown ****jumps over the lazy ****"

str_replace_all(example, "(fox|dog)\\.$", "****") # at end of sentence only, "\\." only for a period
## [1] "The quick brown fox jumps over the lazy ****"

str_replace_all(example, "the", "a") # case-sensitive only matches one
## [1] "The quick brown fox jumps over a lazy dog."

str_replace_all(example, "[Tt]he", "a") # will match either t or T; could also make "a" conditional on capitalization of t
## [1] "a quick brown fox jumps over a lazy dog."

str_replace(example, "[Tt]he", "a") # first match only
## [1] "a quick brown fox jumps over the lazy dog."

# More examples
example2 <- "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"
example3 <- "This is a test"

# Store the examples in 1 place
examples <- c(example, example2, example3)

pat <- "[^aeiouAEIOU ]{3}" # Regular expression for three straight consonants. Note that I've excluded spaces as well

str_detect(examples, pat) # TRUE/FALSE if it detects pattern
## [1]  TRUE  TRUE FALSE

str_subset(examples, pat) # Pulls out the strings in which the pattern is detected
## [1] "The quick brown fox jumps over the lazy dog."                                                                                                        
## [2] "Two roads diverged in a yellow wood, / And sorry I could not travel both / And be one traveler, long I stood / And looked down one as far as I could"

pat2 <- "[^aeiouAEIOU ][aeiouAEIOU]{2}[^aeiouAEIOU ]{1}" # consonant followed by two vowels followed by a consonant

str_extract(example2, pat2) # extract first match
## [1] "road"

str_extract_all(example2, pat2, simplify = TRUE) # extract all matches
##      [,1]   [,2]   [,3]   [,4]   [,5]   [,6]  
## [1,] "road" "wood" "coul" "tood" "look" "coul"

TUTORIAL

Try out this interactive tutorial. Note that neither the tutorial nor regular expressions more generally are specific to R, but it still illustrates the main ideas of regular expressions.

13.3 Wrap-up

Our quiz is next Tuesday. Remember to be on time and review the quiz info on the syllabus and quiz practice. We’ll review on Thursday.
- The focus will be on wrangling but visualizations may be involved.
Due dates:
- Homework 6 is due Thursday.

13.4 Solutions


EXAMPLE 1

# Define a new variable "num" that adds up the number of characters in the area label
classes %>% 
  mutate(num = str_length(area))
##      sem    area      enroll       instructor num
## 1 SP2023 History 30 - people  Ernesto Capello   7
## 2 FA2023    Math 20 - people Lori Ziegelmeier   4
## 3 SP2024  Anthro 25 - people  Arjun Guneratne   6

# Change the areas to "history", "math", "anthro"
classes %>% 
  mutate(area = str_to_lower(area))
##      sem    area      enroll       instructor
## 1 SP2023 history 30 - people  Ernesto Capello
## 2 FA2023    math 20 - people Lori Ziegelmeier
## 3 SP2024  anthro 25 - people  Arjun Guneratne

# Create a variable that id's which courses were taught in spring 
classes %>% 
  mutate(spring = str_detect(sem, "SP"))
##      sem    area      enroll       instructor spring
## 1 SP2023 History 30 - people  Ernesto Capello   TRUE
## 2 FA2023    Math 20 - people Lori Ziegelmeier  FALSE
## 3 SP2024  Anthro 25 - people  Arjun Guneratne   TRUE

# Change the semester labels to "fall2023", "spring2024", "spring2023"
classes %>% 
  mutate(sem = str_replace(sem, "SP", "spring")) %>% 
  mutate(sem = str_replace(sem, "FA", "fall"))
##          sem    area      enroll       instructor
## 1 spring2023 History 30 - people  Ernesto Capello
## 2   fall2023    Math 20 - people Lori Ziegelmeier
## 3 spring2024  Anthro 25 - people  Arjun Guneratne


# Use sem to create 2 new variables, one with only the semester (SP/FA) and 1 with the year
classes %>% 
  mutate(semester = str_sub(sem, 1, 2),
         year = str_sub(sem, 3, 6))
##      sem    area      enroll       instructor semester year
## 1 SP2023 History 30 - people  Ernesto Capello       SP 2023
## 2 FA2023    Math 20 - people Lori Ziegelmeier       FA 2023
## 3 SP2024  Anthro 25 - people  Arjun Guneratne       SP 2024

# In the enroll variable, keep only the number and convert to a numeric variable

classes %>%
  mutate(enroll = as.numeric(str_extract(enroll, "[0-9]+")))
##      sem    area enroll       instructor
## 1 SP2023 History     30  Ernesto Capello
## 2 FA2023    Math     20 Lori Ziegelmeier
## 3 SP2024  Anthro     25  Arjun Guneratne

EXAMPLE 2

# How can we do this after mutating?
classes %>% 
  mutate(spring = str_detect(sem, "SP")) %>% 
  filter(spring == TRUE)
##      sem    area      enroll      instructor spring
## 1 SP2023 History 30 - people Ernesto Capello   TRUE
## 2 SP2024  Anthro 25 - people Arjun Guneratne   TRUE

EXAMPLE 3

The lengths of first and last names are not consistent, so there are no fixed start and end positions to give str_sub().

How would you describe what you want to do in words (think about describing the pattern of characters)?

If we assume that an instructor’s name is made of two words separated by a space, we could describe the pattern as “a set of lower and uppercase letters” followed by “a space” followed by “a set of lower and uppercase letters”.

EXAMPLE 4

We are assuming:

an instructor’s first name does not have a space or apostrophe in it
an instructor’s first name is listed first in the string
an instructor’s last name does not have a space or apostrophe in it
an instructor’s last name is listed last in the string
we ignore any middle names or initials
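
Here is a short sketch of what happens when those assumptions fail. "Karin Aguilar-San Juan" appears in the registrar data shown in the exercises. "Mary O'Brien" is a made-up name added for illustration.

```r
library(stringr)

# One name from the registrar data and one invented name
tricky <- c("Karin Aguilar-San Juan", "Mary O'Brien")

# First names still come out as expected
str_extract(tricky, "^[a-zA-Z]+")
## [1] "Karin" "Mary"

# Last names are cut off at the space or the apostrophe
str_extract(tricky, "[a-zA-Z]+$")
## [1] "Juan"  "Brien"
```

Neither result is an error from R's point of view, so problems like this are easy to miss unless you inspect the extracted values.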

Exercise 1: Popular time slots

# Construct a table that indicates the number of classes offered in each day/time slot
# Print only the 6 most popular time slots
courses %>% 
  count(days, time) %>% 
  arrange(desc(n)) %>% 
  head()
##    days             time  n
## 1 M W F 10:50 - 11:50 am 76
## 2  T R   9:40 - 11:10 am 71
## 3 M W F  9:40 - 10:40 am 68
## 4 M W F   1:10 - 2:10 pm 66
## 5  T R    3:00 - 4:30 pm 62
## 6  T R    1:20 - 2:50 pm 59

Exercise 2: Prep the data

courses_clean <- courses %>% 
  separate(avail_max, c("avail", "max"), sep = " / ") %>% 
  mutate(enroll = as.numeric(max) - as.numeric(avail)) %>% 
  separate(number, c("dept", "number", "section"))
  
head(courses_clean)
##   dept number section   crn                                                name
## 1 AMST    112      01 10318         Introduction to African American Literature
## 2 AMST    194      01 10073              Introduction to Asian American Studies
## 3 AMST    194      F1 10072 What’s After White Empire - And Is It Already Here?
## 4 AMST    203      01 10646 Politics and Inequality: The American Welfare State
## 5 AMST    205      01 10842                         Trans Theories and Politics
## 6 AMST    209      01 10474                   Civil Rights in the United States
##    days            time      room             instructor avail max enroll
## 1 M W F 9:40 - 10:40 am  MAIN 009       Daylanne English     3  20     17
## 2 M W F  1:10 - 2:10 pm MUSIC 219          Jake Nagasawa    -4  16     20
## 3  T R   3:00 - 4:30 pm   HUM 214 Karin Aguilar-San Juan     0  14     14
## 4 M W F 9:40 - 10:40 am  CARN 305          Lesley Lavery     3  25     22
## 5  T R   3:00 - 4:30 pm  MAIN 009              Myrl Beam    -2  20     22
## 6   W   7:00 - 10:00 pm  MAIN 010         Walter Greason    -1  15     16

Exercise 3: Courses offered by department

# Identify the 6 departments that offered the most sections
courses_clean %>% 
  count(dept) %>% 
  arrange(desc(n)) %>% 
  head()
##   dept  n
## 1 SPAN 45
## 2 BIOL 44
## 3 ENVI 38
## 4 PSYC 37
## 5 CHEM 33
## 6 COMP 31

# Identify the 6 departments with the longest average course titles
courses_clean %>% 
  mutate(length = str_length(name)) %>% 
  group_by(dept) %>% 
  summarize(avg_length = mean(length)) %>% 
  arrange(desc(avg_length)) %>% 
  head()
## # A tibble: 6 × 2
##   dept  avg_length
##   <chr>      <dbl>
## 1 WGSS        46.3
## 2 INTL        41.4
## 3 EDUC        39.4
## 4 MCST        39.4
## 5 POLI        37.4
## 6 AMST        37.3

Exercise 4: STAT courses

Part a

courses_clean %>% 
  filter(str_detect(instructor, "Alicia Johnson")) 
##   dept number section   crn                         name  days            time
## 1 STAT    253      01 10806 Statistical Machine Learning  T R  9:40 - 11:10 am
## 2 STAT    253      02 10807 Statistical Machine Learning  T R   1:20 - 2:50 pm
## 3 STAT    253      03 10808 Statistical Machine Learning  T R   3:00 - 4:30 pm
##         room     instructor avail max enroll
## 1 THEATR 206 Alicia Johnson    -3  20     23
## 2 THEATR 206 Alicia Johnson    -3  20     23
## 3 THEATR 206 Alicia Johnson     2  20     18

Part b

stat <- courses_clean %>% 
  filter(dept == "STAT") %>% 
  mutate(name = str_replace(name, "Introduction to ", "")) %>%
  mutate(name = str_replace(name, "Statistical", "Stat")) %>% 
  mutate(start_time = str_sub(time, 1, 5)) %>% 
  select(number, name, start_time, enroll)

stat
##    number                      name start_time enroll
## 1     112              Data Science      3:00      27
## 2     112              Data Science      9:40      21
## 3     112              Data Science      1:20      25
## 4     125              Epidemiology      12:00     26
## 5     155             Stat Modeling      1:10      32
## 6     155             Stat Modeling      9:40      24
## 7     155             Stat Modeling      10:50     26
## 8     155             Stat Modeling      3:30      25
## 9     155             Stat Modeling      1:20      30
## 10    155             Stat Modeling      3:00      27
## 11    212 Intermediate Data Science      9:40      11
## 12    212 Intermediate Data Science      1:20      11
## 13    253     Stat Machine Learning      9:40      23
## 14    253     Stat Machine Learning      1:20      23
## 15    253     Stat Machine Learning      3:00      18
## 16    354               Probability      3:00      22
## 17    452           Correlated Data      9:40       7
## 18    452           Correlated Data      1:20       8
## 19    456  Projects in Data Science      9:40      11

dim(stat)
## [1] 19  4
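
One thing to notice about str_sub(time, 1, 5): when the hour has a single digit, the 5th character is the space after the minutes, so values like "3:00 " carry a trailing space. As a possible alternative (a sketch, not the solution used above), str_extract() can pull out just the digits and colon at the start. The two time values below are copied from output shown earlier.

```r
library(stringr)

# Two time values that appear in the course data output
times <- c("3:00 - 4:30 pm", "10:50 - 11:50 am")

# Fixed positions: the single-digit hour keeps a trailing space
str_sub(times, 1, 5)
## [1] "3:00 " "10:50"

# Pattern-based: digits, a colon, then digits, anchored at the start of the string
str_extract(times, "^[0-9]+:[0-9]+")
## [1] "3:00"  "10:50"
```

Either version drops the am/pm label, so start_time alone can't distinguish morning from evening times.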

Exercise 5: More cleaning

enrollments <- courses_clean %>% 
  filter(dept != "PE", dept != "INTD") %>% 
  filter(!(dept == "MUSI" & as.numeric(number) < 100)) %>% 
  filter(!(dept == "THDA" & as.numeric(number) < 100)) %>% 
  filter(!str_detect(section, "L"))
  
head(enrollments)
##   dept number section   crn                                                name
## 1 AMST    112      01 10318         Introduction to African American Literature
## 2 AMST    194      01 10073              Introduction to Asian American Studies
## 3 AMST    194      F1 10072 What’s After White Empire - And Is It Already Here?
## 4 AMST    203      01 10646 Politics and Inequality: The American Welfare State
## 5 AMST    205      01 10842                         Trans Theories and Politics
## 6 AMST    209      01 10474                   Civil Rights in the United States
##    days            time      room             instructor avail max enroll
## 1 M W F 9:40 - 10:40 am  MAIN 009       Daylanne English     3  20     17
## 2 M W F  1:10 - 2:10 pm MUSIC 219          Jake Nagasawa    -4  16     20
## 3  T R   3:00 - 4:30 pm   HUM 214 Karin Aguilar-San Juan     0  14     14
## 4 M W F 9:40 - 10:40 am  CARN 305          Lesley Lavery     3  25     22
## 5  T R   3:00 - 4:30 pm  MAIN 009              Myrl Beam    -2  20     22
## 6   W   7:00 - 10:00 pm  MAIN 010         Walter Greason    -1  15     16
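
Why search section instead of name for the lab sections? The sketch below uses made-up course names and section codes. It assumes, as the solution above does, that lab sections are the ones with an "L" in their section code.

```r
library(stringr)

# Hypothetical course names and section codes, invented for illustration
name    <- c("Labor Economics", "Chemistry Lab", "Cell Biology")
section <- c("01", "L1", "02")

# Searching the name for "Lab" also flags "Labor Economics"
str_detect(name, "Lab")
## [1]  TRUE  TRUE FALSE

# Searching the section code flags only the section coded with an L
str_detect(section, "L")
## [1] FALSE  TRUE FALSE
```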

Optional extra practice

# Make a bar plot showing the number of night courses by day of the week.
courses_clean %>% 
  filter(str_detect(time, "7:00")) %>% 
  ggplot(aes(x = days)) + 
    geom_bar()