Topic 12 Data Import
Learning Goals
- Develop comfort in finding an existing data set to import into R
- Develop comfort in importing data of a variety of file types into R
- Understand and implement the data cleaning process to make values consistent
- Understand the implications of different ways of dealing with missing values with replace_na and drop_na
Create a new Rmd file (save it as 12-Data Import.Rmd). Put this file in a folder Assignment_08 in your COMP_STAT_112 folder.
- Make sure to add alt text to visualizations using fig.alt!
Finding, Importing, and Cleaning Data
Additional resources and readings:
1. Data Import Cheat Sheet
2. readr documentation
3. Data import from Wickham and Grolemund
4. Missing data from Wickham and Grolemund
5. Data intake from Baumer, Kaplan, and Horton
6. Using the import wizard from Prof. Lendway
In practice, data science is not as glamorous as building fancy classifiers and creating visualizations all the time. Data scientists spend 80% of their time acquiring and cleaning data. While the skill of data acquisition is best learned through experience, this section of the course will outline the most common approaches to acquiring data.
When importing and cleaning a dataset, take careful notes in your R Markdown. Explain where you found the dataset (the source). Record the steps you took to clean and import the data in case somebody else needs to replicate your analysis. You should also make sure to cite and credit the creator of the dataset.
Finding Existing Data Sets
The easiest way to get data is by finding an existing dataset that has been created by somebody else. Search engines such as Google can be excellent tools, especially when using file type filters. For example, if you are looking for a dataset about movie reviews, you might search for “movie reviews filetype:csv”. You could also try searching for other common filetypes that are compatible with R, such as .tsv, .xls, .xlsx, or .rds.
Other good resources for datasets are compendiums of datasets such as the excellent and continuously evolving awesome-public-datasets GitHub repo, Kaggle datasets, or the data.world website. You can find links to other similar compendiums at the end of the awesome-public-datasets page.
Saving Datasets Locally
Once you’ve found a dataset you are interested in, you need to download the file and save it to a location on your computer.
The best location to put a dataset is within a folder that is dedicated to the project or assignment. For example, you’ve created a folder for Assignment_08, and any datasets you use for this assignment should be saved in that folder.
Quick Note: One key idea when you are working with datasets on your local machine is that you need to:
- know where the files are located
- know how to tell the computer program where the files are located (a file path)
There are two common ways we refer to file locations: absolute file path and relative file path.
Absolute file path
- An absolute file path describes the location of a file starting from the root directory or folder, typically the user directory.
  - On a Mac, ~ refers to the user root directory.
  - On Windows, the root directory is typically C:\
- Example: A file called data.csv is located in the Assignment_08 folder in the Comp_Stat_112 folder on the Desktop.
  - On a Mac, the absolute file path is ~/Desktop/Comp_Stat_112/Assignment_08/data.csv
  - On Windows, the absolute file path is C:/Desktop/Comp_Stat_112/Assignment_08/data.csv
    - Windows Note: Switch from backslashes to forward slashes in R, or use \\ in place of a single \.
Relative file path
- A relative file path describes the location of a file from the current working directory (or in the case of an Rmd, the location of the Rmd file).
- When working within an Rmd file, R will first look for files relative to the location of the Rmd file. Therefore, it is good practice to put the data file in the same folder as the Rmd file you are working on.
- If you are working in the Console, you can change the working directory (Session > Set Working Directory).
- Example: A file called data.csv is located in a data folder within the Comp_Stat_112 folder on the Desktop (see the code sketch after this list).
  - If the working directory is ~/Desktop/Comp_Stat_112/Assignment_08/, the relative path is ../data/data.csv. The .. refers to the parent directory (go up one level to the folder containing Assignment_08).
  - If the working directory is ~/Desktop/Comp_Stat_112/, the relative path is data/data.csv.
  - If the working directory is ~/Desktop/Comp_Stat_112/data, the relative path is data.csv.
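For instance, here is a minimal sketch of reading the hypothetical data.csv from the example above, first with an absolute path and then with a relative path:

```r
library(readr)

# Absolute path: always points to the same place on *your* computer,
# so it will break if anyone else (or a different computer) runs the code
my_data <- read_csv("~/Desktop/Comp_Stat_112/data/data.csv")

# Relative path: assumes the working directory (or the Rmd file's folder)
# is ~/Desktop/Comp_Stat_112/Assignment_08/
my_data <- read_csv("../data/data.csv")

# If you are unsure, getwd() prints the current working directory
getwd()
```

Relative paths are generally preferred because they keep working if the whole Comp_Stat_112 folder is moved or shared with someone else.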
Loading Datasets
Once you have a dataset, it’s time to load it into R. Don’t be frustrated if this step takes some time.
The table below lists some common data import functions and when you would use them.
| Function | Use when |
|---|---|
| read_csv() | data are saved in .csv (comma delimited or comma separated values) format; you can save Excel files and Google Sheets in this format |
| read_delim() | data are saved in other delimited formats (tab, space, etc.) |
| read_sheet() | data are in a Google Sheet |
| st_read() | data are saved in a shapefile |
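Note that these functions come from different packages, so you need to load (and possibly install) the right one first. Below is a minimal sketch; the file and sheet names are made up for illustration:

```r
library(readr)          # read_csv(), read_delim()
library(googlesheets4)  # read_sheet()
library(sf)             # st_read()

cars <- read_csv("cars.csv")                      # comma separated
cars_tab <- read_delim("cars.txt", delim = "\t")  # tab separated
# cars_sheet <- read_sheet("https://docs.google.com/spreadsheets/d/...")  # a Google Sheet you can access
# cars_map <- st_read("cars_map.shp")             # a shapefile
```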
A few tips:
- When reading in data from a file, one tip is to initially use the Import Wizard to help write the code and file path. DO NOT use it to import the data, as you will need the code that reads in the data in order to knit your document. Check out a video tutorial on the Import Wizard.
- The import functions read_csv, read_csv2, and read_tsv from the readr package are faster than their base counterparts read.csv, read.csv2, and read.delim for large files. They also have more flexible parsers (e.g., for dates, times, percentages). We recommend you use the readr functions instead of the base functions like read.csv. The data.table package (with its fread function) is also fast for large datasets. For smaller datasets (say 1MB or less), there won’t be much difference in speed among the three packages. Note that read_csv2 is for semicolon-delimited files, whereas read_csv is for comma-delimited files.
- The readr functions automatically guess the type of data in each column (e.g., character, double, integer). You will often see a message just after the import telling you what it chose for each column. If you think there is an issue, you can use the function problems() to detect problems, and/or specify how the columns should be imported. See the section on “column specification” in the Data Import Cheat Sheet for more info.
- If you have trouble importing a dataset, try to first import it into a different tool such as Google Sheets or Excel and then export it as a TSV or CSV before reading it into R.
- For really messy data, OpenRefine is complicated but powerful (YouTube demo).
- When you are importing a large file, you might want to first try importing a subset of the data. For example, if you want to take the first 17 rows only, you can write read_csv("file.csv", n_max = 17).
- Similarly, you might want to skip the first \(n\) lines of the file when importing, select only certain columns to read in, or choose a random subset of the rows. See the cheat sheet for instructions on these tasks or just google! (A short sketch of some of these options follows this list.)
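Here is a minimal sketch of a few of these options; the file name big_file.csv and the column names are placeholders:

```r
library(readr)

# Read only the first 17 rows
small <- read_csv("big_file.csv", n_max = 17)

# Skip the first 3 lines of the file (e.g., notes above the header row)
skipped <- read_csv("big_file.csv", skip = 3)

# Tell readr how specific columns should be parsed instead of letting it guess
typed <- read_csv("big_file.csv",
                  col_types = cols(title_year = col_integer(),
                                   budget = col_double()))
```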
Checking the Imported Datasets
After reading in new data, it is ALWAYS a good idea to do some quick checks of the data. Here are two first steps that are especially useful:
1. Open the data in the spreadsheet-like viewer with View(dataset_name) and take a look at it. Sort it by different variables by clicking on the arrows next to the variable name. Make sure there isn’t anything unexpected.
2. Do a quick summary of the data. The code below is one way to do this. For quantitative variables, it provides summary statistics and will let you know if there are missing values. For factors (they need to be factors, not just character variables; the mutate() changes them to factors), it shows you counts for the top categories and tells you if there are any missing values.
dataset_name %>%
mutate(across(where(is.character), as.factor)) %>%
summary()
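If you just want a compact overview of the column types and the first few values, glimpse() from dplyr is another handy check (shown here with the same placeholder name dataset_name):

```r
library(dplyr)

# One row per column: name, type, and the first few values
glimpse(dataset_name)
```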
Cleaning Datasets
Cleaning Categorical Variables
First we want to make sure the factors are “clean”, meaning the values are consistent and in the correct format. For example, true, TRUE, and T will be treated as three different levels. The easiest way to manage this is to look at the levels of the factor and replace messy values with clean ones. For example, the following code cleans up the true/false values in column X in a data set called df:
df <- df %>% mutate(X = fct_recode(X, "TRUE" = "T", "TRUE" = "true", "FALSE" = "f", "FALSE" = "N", "FALSE" = "No"))
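Before recoding, it helps to see exactly which values occur. This short sketch uses the same hypothetical df and column X, and assumes the tidyverse (dplyr and forcats) is loaded, as in the code above:

```r
# Count how often each distinct value of X occurs, including NAs
df %>% count(X)

# Or list the levels directly
levels(factor(df$X))
```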
Example 12.1 (Clean up the levels on the Messy IMDB 5000 dataset) We will use a slightly “messied” version of the IMDB 5000 Dataset, collected by chuansun76 on Kaggle.12
a. Download the csv file of the IMDB 5000 dataset from “https://bcheggeseth.github.io/112_spring_2023/data/imdb_5000_messy.csv” (right-click, save file as), put it in your Assignment_08 folder, use read_csv to load it into RStudio, and save it as imdbMessy in R.
Solution
imdbMessy <- read_csv("imdb_5000_messy.csv") # Relative path: works if your Rmd file and csv file are in the same folder
b. Print out the variable names.
Solution
names(imdbMessy) #order = order in dataset
## [1] "...1" "color"
## [3] "director_name" "num_critic_for_reviews"
## [5] "duration" "director_facebook_likes"
## [7] "actor_3_facebook_likes" "actor_2_name"
## [9] "actor_1_facebook_likes" "gross"
## [11] "genres" "actor_1_name"
## [13] "movie_title" "num_voted_users"
## [15] "cast_total_facebook_likes" "actor_3_name"
## [17] "facenumber_in_poster" "plot_keywords"
## [19] "movie_imdb_link" "num_user_for_reviews"
## [21] "language" "country"
## [23] "content_rating" "budget"
## [25] "title_year" "actor_2_facebook_likes"
## [27] "imdb_score" "aspect_ratio"
## [29] "movie_facebook_likes"
ls(imdbMessy) #order = alphabetical order
## [1] "actor_1_facebook_likes" "actor_1_name"
## [3] "actor_2_facebook_likes" "actor_2_name"
## [5] "actor_3_facebook_likes" "actor_3_name"
## [7] "aspect_ratio" "budget"
## [9] "cast_total_facebook_likes" "color"
## [11] "content_rating" "country"
## [13] "director_facebook_likes" "director_name"
## [15] "duration" "facenumber_in_poster"
## [17] "genres" "gross"
## [19] "imdb_score" "language"
## [21] "movie_facebook_likes" "movie_imdb_link"
## [23] "movie_title" "num_critic_for_reviews"
## [25] "num_user_for_reviews" "num_voted_users"
## [27] "plot_keywords" "title_year"
c. Examine the color variable. What are the existing values?
Solution
imdbMessy %>% select(color) %>% head()
## # A tibble: 6 × 1
## color
## <chr>
## 1 Color
## 2 Color
## 3 Color
## 4 Color
## 5 <NA>
## 6 Color
levels(factor(imdbMessy$color))
## [1] "B&W" "Black and White" "color"
## [4] "Color" "COLOR"
unique(imdbMessy$color)
## [1] "Color" NA "Black and White"
## [4] "COLOR" "color" "B&W"
d. How often does each color occur? Hint: table or count (which is shorthand for a group_by/summarize(n = n()))
Solution
imdbMessy %>% count(color)
## # A tibble: 6 × 2
## color n
## <chr> <int>
## 1 B&W 10
## 2 Black and White 199
## 3 color 30
## 4 Color 4755
## 5 COLOR 30
## 6 <NA> 19
table(imdbMessy$color)
##
## B&W Black and White color
## 10 199 30
## Color COLOR
## 4755 30
e. The read_csv function read in the color values as strings. For this exercise, let’s convert them to factors using the code: imdbMessy <- imdbMessy %>% mutate(color = factor(color)).
Solution
imdbMessy <- imdbMessy %>% mutate(color = factor(color))
f. Select what you think is the best value for each level and replace the “messy” versions of the value with clean ones using the fct_recode function as shown above. How many entries are there for each level now?
Solution
imdbMessy <- imdbMessy %>% mutate(color = fct_recode(color, "B&W" = "Black and White", "Color" = "color", "Color" = "COLOR"))
imdbMessy %>% count(color)
## # A tibble: 3 × 2
## color n
## <fct> <int>
## 1 B&W 209
## 2 Color 4815
## 3 <NA> 19
Addressing Missing Data
Finally, you should look for and address missing data, encoded as NA (not available) in R. There is no single formula for dealing with NAs. You should first look to see how many NAs appear in each column:
colSums(is.na(imdbMessy))
Study the individual observations with NAs carefully. Why do you think they are missing? Are certain types of observations more likely to have NAs?
You have several options for dealing with NAs (and they have different consequences):
- You can remove observations with one or more NAs (see drop_na).
- You can remove columns with many NA values.
- You can replace NAs with a reasonable value (called imputing values). This could be a default value (like zero), or the average for a column (see replace_na).
- You can use packages such as missForest that fill in missing values with statistical predictions.13
There is no perfect approach to dealing with NAs, and you must think carefully about how removing or replacing missing data may affect your work.
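As a minimal sketch of the first and third options, using a hypothetical data frame df with columns X and Y:

```r
library(dplyr)
library(tidyr)

# Option 1: drop rows that have an NA in any column...
df_complete <- df %>% drop_na()

# ...or only rows with an NA in a specific column
df_complete_X <- df %>% drop_na(X)

# Option 3: impute a value, e.g., replace NAs in Y with 0
df_filled <- df %>% mutate(Y = replace_na(Y, 0))
```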
Example 12.2 (Address NA values in the Messy IMDB 5000 dataset) Consider imdbMessy.
- Print out the number of NAs in each of the columns.
Solution
colSums(is.na(imdbMessy))
## ...1 color
## 0 19
## director_name num_critic_for_reviews
## 104 50
## duration director_facebook_likes
## 15 104
## actor_3_facebook_likes actor_2_name
## 23 13
## actor_1_facebook_likes gross
## 7 884
## genres actor_1_name
## 0 7
## movie_title num_voted_users
## 0 0
## cast_total_facebook_likes actor_3_name
## 0 23
## facenumber_in_poster plot_keywords
## 13 153
## movie_imdb_link num_user_for_reviews
## 0 21
## language country
## 12 5
## content_rating budget
## 303 492
## title_year actor_2_facebook_likes
## 108 13
## imdb_score aspect_ratio
## 0 329
## movie_facebook_likes
## 0
- Consider the actor_1_facebook_likes column. Take a look at a few of the records that have NA values. Why do you think there are NAs?
Solution
This variable is missing if actor_1_name is missing, which suggests that this movie doesn’t have an actor 1 listed.
imdbMessy %>% filter(is.na(actor_1_facebook_likes)) %>% head()
## # A tibble: 6 × 29
## ...1 color direc…¹ num_c…² durat…³ direc…⁴ actor…⁵ actor…⁶
## <dbl> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 4503 Color Léa Po… 23 97 4 NA <NA>
## 2 4520 Color Harry … 12 105 0 NA <NA>
## 3 4721 Color U. Rob… 3 80 6 NA <NA>
## 4 4838 Color Pan Na… 15 102 95 NA <NA>
## 5 4946 Color Amal A… NA 62 58 NA <NA>
## 6 4947 Color Andrew… 12 90 0 NA <NA>
## # … with 21 more variables: actor_1_facebook_likes <dbl>,
## # gross <dbl>, genres <chr>, actor_1_name <chr>,
## # movie_title <chr>, num_voted_users <dbl>,
## # cast_total_facebook_likes <dbl>, actor_3_name <chr>,
## # facenumber_in_poster <dbl>, plot_keywords <chr>,
## # movie_imdb_link <chr>, num_user_for_reviews <dbl>,
## # language <chr>, country <chr>, content_rating <chr>, …
- Create a new dataframe that removes observations that have NAs for actor_1_facebook_likes.
Solution
imdbMessysub <- imdbMessy %>% filter(!is.na(actor_1_facebook_likes)) # Notice how I saved this new smaller dataset to a new name
- Create a second new data frame that replaces NAs in actor_1_facebook_likes with 0.
Solution
imdbMessysub2 <- imdbMessy %>% mutate(actor_1_facebook_likes = replace_na(actor_1_facebook_likes, 0))
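Note that drop_na from tidyr would give the same result as the filter(!is.na(...)) approach above; the name imdbMessysub3 below is just introduced here for illustration:

```r
library(dplyr)
library(tidyr)

# Equivalent to imdbMessysub above, created with drop_na instead of filter
imdbMessysub3 <- imdbMessy %>% drop_na(actor_1_facebook_likes)

# Quick check: no NAs remain in that column in either new data frame
sum(is.na(imdbMessysub3$actor_1_facebook_likes))
sum(is.na(imdbMessysub2$actor_1_facebook_likes))
```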
Additional Practice
Exercise 12.1 Find a dataset that is not built into R and is related to one of the following topics:
- A personal hobby or passion
- Your hometown, or a place you have lived
Load the data into R, make sure it is clean, and construct one interesting visualization of the data and include alt text.