Data Import

Brianna Heggeseth

Announcements

MSCS Happenings

MSCS Student Instagram
MSCS Mentoring Program
3:30pm Wednesday Problem solving
11:15am Thursday Coffee Break
12-1pm Thursday (OLRI 241) Iowa State Grad School Info Session

Outside Mac

Fall Data Challenge sponsored by the American Statistical Association (Deadline: Nov 6)
Electronic Undergraduate Statistics Conference (Nov 4th, free!)

Learning Goals

Develop comfort in finding an existing data set to import into R
Develop comfort in importing data of a variety of file types into R
Understand and implement the data cleaning process to make values consistent
Understand the implications of different ways of dealing with missing values with replace_na and drop_na

Key Ideas

Finding Existing Data Sets
Loading Data Sets
Checking Imported Data
Cleaning Data

More thorough notes available at https://bcheggeseth.github.io/112_fall_2022/data-import.html

Finding Existing Data Sets

Google: “movie reviews filetype:csv”
awesome-public-datasets GitHub repo
Kaggle data
data.world website
Brianna’s website

Loading Data Sets

Depending on the file type (csv, tsv, excel, Google sheet, stata file, shapefile, etc.), you’ll need to adjust the function you use. Here are some of the most common:

read_csv()
read_delim()
read_sheet()
st_read()

Loading Data Sets

The Import Wizard can help you write the code!

Try importing data from:

https://bcheggeseth.github.io/112_fall_2022/data/imdb_5000_messy.csv

Note: When using the Import Wizard, make sure to copy and paste the code into a Rmd file.

Checking Imported Data

Always look at the data after importing with View()
Do a quick summary of all variables:

dataset_name %>% 
  mutate(across(where(is.character), as.factor)) %>% 
  summary()

Cleaning Data

Cleaning Categorical Variables

“Clean” data has consistent values in terms of spelling and capitalization.

How could we clean this up?

imdbMessy <- read_csv("https://bcheggeseth.github.io/112_fall_2022/data/imdb_5000_messy.csv")
imdbMessy %>% count(color)

# A tibble: 6 × 2
  color               n
  <chr>           <int>
1 B&W                10
2 Black and White   199
3 color              30
4 Color            4755
5 COLOR              30
6 <NA>               19

imdbMessy <- imdbMessy %>% 
  mutate(color = factor(color)) %>%  #convert to a factor
  mutate(color = fct_recode(color, "B&W" = "Black and White", "Color" = "color", "Color" = "COLOR")) #recode levels

imdbMessy %>% count(color)

# A tibble: 3 × 2
  color     n
  <fct> <int>
1 B&W     209
2 Color  4815
3 <NA>     19

Cleaning Data

Study the individual observations with NAs carefully.

Why do you think they are missing?
Are certain types of observations more likely to have NAs?

Addressing Missing Data

You have several options for dealing with NAs (and they have different consequences):

You can remove observations with one or more NAs (see drop_na).
You can remove columns with many NA values with select.
You can replace NAs with a reasonable value (called imputing values). This could be a default value (like zero), or the average for a column. (see replace_na)

Cleaning Data

Let’s check to see how many values are missing per variable.

colSums(is.na(imdbMessy)) # Number of values that are missing per variable

                     ...1                     color             director_name 
                        0                        19                       104 
   num_critic_for_reviews                  duration   director_facebook_likes 
                       50                        15                       104 
   actor_3_facebook_likes              actor_2_name    actor_1_facebook_likes 
                       23                        13                         7 
                    gross                    genres              actor_1_name 
                      884                         0                         7 
              movie_title           num_voted_users cast_total_facebook_likes 
                        0                         0                         0 
             actor_3_name      facenumber_in_poster             plot_keywords 
                       23                        13                       153 
          movie_imdb_link      num_user_for_reviews                  language 
                        0                        21                        12 
                  country            content_rating                    budget 
                        5                       303                       492 
               title_year    actor_2_facebook_likes                imdb_score 
                      108                        13                         0 
             aspect_ratio      movie_facebook_likes 
                      329                         0

Consider the actor_1_facebook_likes column. Take a look at a few of the records that have NA values. Why do you think there are NAs?

imdbMessy %>% filter(is.na(actor_1_facebook_likes)) %>% select(movie_title,actor_1_name,actor_1_facebook_likes) %>% head()

# A tibble: 6 × 3
  movie_title              actor_1_name actor_1_facebook_likes
  <chr>                    <chr>                         <dbl>
1 Pink Ribbons, Inc.       <NA>                             NA
2 Sex with Strangers       <NA>                             NA
3 The Harvest/La Cosecha   <NA>                             NA
4 Ayurveda: Art of Being   <NA>                             NA
5 The Brain That Sings     <NA>                             NA
6 The Blood of My Brother  <NA>                             NA

To remove observations (rows) that are missing actor_1_facebook_likes,

imdbMessysub <- imdbMessy %>% filter(!is.na(actor_1_facebook_likes))  #Notice how I saved this new smaller dataset to a new name

To replace missing values of actor_1_facebook_likes with 0,

imdbMessysub2 <- imdbMessy %>% mutate(actor_1_facebook_likes = replace_na(actor_1_facebook_likes, 0))

Practice/Assignment

Find a dataset that is not built into R and is related to one of the following topics:

Macalester (FYC: there is a dataset on Moodle for Assignment 12 from Institutional Research)
a hobby or passion
hometown or place you have lived

Load the data into R, make sure it is clean, and construct one interesting visualization of the data.

Note: this might help you brainstorm ideas for projects

After Class

Assignment 11 (EDA on Flights) due Sunday
Assignment 12 (1 exercise) due Tuesday
Brainstorm Activity due Friday
Midterm Revisions Part 2 due Friday
IV1 due next week
- If you haven’t turned in IV0, you don’t have feedback to work from.