Data Import

Brianna Heggeseth

Switch it Up

Sit with someone you don’t know well
Introduce yourself
Ask them about their Fall Break!

Announcements

MSCS Happenings

MSCS Student Instagram [upcoming events!]
11:15am Thursday Coffee Break
Thursday MSCS Registration Event 11:30-1pm!!
- Come ask questions about any courses/curriculum in MSCS
MSCS & Society Talk next Monday at 4:45pm

Midterm Revisions

For each problem I marked with an X,

write a more correct answer and 1 sentence description of why that is a more correct answer (maybe why the original answer wasn’t correct).
If there isn’t room on the original page, feel free to add a blank piece of paper and mark the number.

Talk with others in the class, use online resources; help each other understand the WHY.

Turn into me by Thursday.

Learning Goals

Develop comfort in finding an existing data set to import into R
Develop comfort in importing data of a variety of file types into R
Understand and implement the data cleaning process to make values consistent
Understand the implications of different ways of dealing with missing values with replace_na and drop_na

Key Ideas

Finding Existing Data Sets
Saving Data Sets Locally
Loading Data Sets
Checking Imported Data
Cleaning Data

More thorough notes available at https://bcheggeseth.github.io/112_fall_2023/data-import.html

Finding Existing Data Sets

Google: “movie reviews filetype: csv”
awesome-public-datasets GitHub repo
Kaggle data
data.world website
Brianna’s website

Loading Data Sets

Depending on the file type (csv, tsv, excel, Google sheet, Stata file, shapefile, etc.), you’ll need to adjust the function you use. Here are some of the most common:

read_csv()
read_delim()
read_sheet()
st_read()

Saving Data Sets Locally

When working with data sets, you need to:

know where you are saving the files
know how to tell the computer program where the files are located (a file path)

Absolute file path describes the location of a file from the root directory or folder, typically the user directory.

On a Mac, ~ refers to the user root directory.
On a Windows, the root directory is typically C:\

Example: A file called data.csv is located in the Assignment_08 folder in Comp_Stat_112 folder on the Desktop.
- On a Mac, the absolute file path is ~/Desktop/Comp_Stat_112/Assignment_08/data.csv
- On a Windows, the absolute file path is C:/Desktop/Comp_Stat_112/Assignment_08/data.csv

Relative file path describes the location of a file from the current working directory.

When working within an Rmd file, it will first look for files relative to the location of the Rmd file. Therefore, it is good practice to put the data file in the same folder as the Rmd file you are working on.
If you are working in the Console, you can change the working directory (Session > Set Working Directory).

Example: A file called data.csv is located in a data folder within Comp_Stat_112 folder on the Desktop
- If the working directory is ~/Desktop/Comp_Stat_112/Assignment_08/, the relative path is ../data/data.csv. The .. refers to the parent directory (go up one level to the folder containing Assignment_08).
- If the working directory is ~/Desktop/Comp_Stat_112/, the relative path is data/data.csv.
- If the working directory is ~/Desktop/Comp_Stat_112/data, the relative path is data.csv.

The best location to put a dataset is within a folder that is dedicated to the project or assignment.

Loading Data Sets

Try downloading the csv file from:

https://bcheggeseth.github.io/112_fall_2023/data/imdb_5000_messy.csv

Right-click and Save As

Put the data file in Assignment_08 folder.

Create a new Rmd file called Data_Import.Rmd (save it to the same folder) and load the data in with read_csv().

Checking Imported Data

Always look at the data after importing with View()
Do a quick summary of all variables:

dataset_name %>% 
  mutate(across(where(is.character), as.factor)) %>% 
  summary()

Cleaning Data

Cleaning Categorical Variables

“Clean” data has consistent values in terms of spelling and capitalization.

How could we clean this up?

imdbMessy <- read_csv("imdb_5000_messy.csv")

imdbMessy %>% count(color)

# A tibble: 6 × 2
  color               n
  <chr>           <int>
1 B&W                10
2 Black and White   199
3 COLOR              30
4 Color            4755
5 color              30
6 <NA>               19

imdbMessy <- imdbMessy %>% 
  mutate(color = factor(color)) %>%  #convert to a factor
  mutate(color = fct_recode(color, "B&W" = "Black and White", "Color" = "color", "Color" = "COLOR")) #recode levels

imdbMessy %>% count(color)

# A tibble: 3 × 2
  color     n
  <fct> <int>
1 B&W     209
2 Color  4815
3 <NA>     19

Cleaning Data

Study the individual observations with NAs carefully.

Why do you think they are missing?
Are certain types of observations more likely to have NAs?

Addressing Missing Data

You have several options for dealing with NAs (and they have different consequences):

You can remove observations/rows with one or more NAs (see drop_na).
You can remove columns with many NA values with select.
You can replace NAs with a reasonable value (called imputing values). This could be a default value (like zero), or the average for a column. (see replace_na)

Cleaning Data

Let’s check to see how many values are missing per variable.

colSums(is.na(imdbMessy)) # Number of values that are missing per variable

                     ...1                     color             director_name 
                        0                        19                       104 
   num_critic_for_reviews                  duration   director_facebook_likes 
                       50                        15                       104 
   actor_3_facebook_likes              actor_2_name    actor_1_facebook_likes 
                       23                        13                         7 
                    gross                    genres              actor_1_name 
                      884                         0                         7 
              movie_title           num_voted_users cast_total_facebook_likes 
                        0                         0                         0 
             actor_3_name      facenumber_in_poster             plot_keywords 
                       23                        13                       153 
          movie_imdb_link      num_user_for_reviews                  language 
                        0                        21                        12 
                  country            content_rating                    budget 
                        5                       303                       492 
               title_year    actor_2_facebook_likes                imdb_score 
                      108                        13                         0 
             aspect_ratio      movie_facebook_likes 
                      329                         0

Consider the actor_1_facebook_likes column. Take a look at a few of the records that have NA values. Why do you think there are NAs?

imdbMessy %>% filter(is.na(actor_1_facebook_likes)) %>% select(movie_title,actor_1_name,actor_1_facebook_likes) %>% head()

# A tibble: 6 × 3
  movie_title              actor_1_name actor_1_facebook_likes
  <chr>                    <chr>                         <dbl>
1 Pink Ribbons, Inc.       <NA>                             NA
2 Sex with Strangers       <NA>                             NA
3 The Harvest/La Cosecha   <NA>                             NA
4 Ayurveda: Art of Being   <NA>                             NA
5 The Brain That Sings     <NA>                             NA
6 The Blood of My Brother  <NA>                             NA

To remove observations (rows) that are missing actor_1_facebook_likes,

imdbMessysub <- imdbMessy %>% filter(!is.na(actor_1_facebook_likes))  #Notice how I saved this new smaller dataset to a new name

To replace missing values of actor_1_facebook_likes with 0,

imdbMessysub2 <- imdbMessy %>% mutate(actor_1_facebook_likes = replace_na(actor_1_facebook_likes, 0))

Practice/Assignment

Find a dataset that is not built into R and is related to one of the following topics:

a hobby or passion
hometown or place you have lived

Load the data into R, make sure it is clean, and construct one interesting visualization of the data.

Note: this might help you brainstorm ideas for projects

After Class

Only 1 exercise for Data Import [Assignment 8]
Midterm Revisions due Thursday in class
TT9 due Friday (data: Horror Legends)
IV1 due next Friday (see feedback on spreadsheet)