Topic 13 Data Import
Learning Goals
- Develop comfort in finding an existing data set to import into R
- Develop comfort in importing data of a variety of file types into R
- Understand and implement the data cleaning process to make values consistent
- Understand the implications of different ways of dealing with missing values with `replace_na` and `drop_na`
You can download a template .Rmd of this activity here.
Finding, Importing, and Cleaning Data
Additional resources and readings:
1. Data Import Cheat Sheet
2. readr
documentation
3. Data import from Wickham and Grolemund
4. Missing data from Wickham and Grolemund
5. Data intake from Baumer, Kaplan, and Horton
6. Using the import wizard from Prof. Lendway
In practice, data science is not as glamorous as building fancy classifiers and creating visualizations all the time. Data scientists often report spending around 80% of their time acquiring and cleaning data. While the skill of data acquisition is best learned through experience, this section of the course will outline the most common approaches to acquiring data.
When importing and cleaning a dataset, take careful notes in your R Markdown. Explain where you found the dataset (the source). Record the steps you took to clean and import the data in case somebody else needs to replicate your analysis. You should also make sure to cite and credit the creator of the dataset.
Finding Existing Data Sets
The easiest way to get data is by finding an existing dataset that has been created by somebody else. Search engines such as Google can be excellent tools, especially when using file type filters. For example, if you are looking for a dataset about movie reviews, you might search for “movie reviews filetype:csv”. You could also try searching for other common file types that are compatible with R, such as `.tsv`, `.xls`, `.xlsx`, or `.rds`.
Other good resources are compendiums of datasets, such as the excellent and continuously evolving awesome-public-datasets GitHub repo, Kaggle datasets, or the data.world website. You can find links to other similar compendiums at the end of the awesome-public-datasets page.
Loading Datasets
Once you have a dataset, it’s time to load it into R. Don’t be frustrated if this step takes some time.
The table below lists some common data import functions and when you would use them.
| Function | Use when |
|---|---|
| `read_csv()` | data are saved in `.csv` (comma-delimited or comma-separated values) format; you can save Excel files and Google Sheets in this format |
| `read_delim()` | data are saved in other delimited formats (tab, space, etc.) |
| `read_sheet()` | data are in a Google Sheet |
| `st_read()` | data are in a shapefile |
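For instance, `read_delim()` covers any single-character delimiter. A minimal, self-contained sketch (the file name `scores.tsv` and its contents are invented for illustration):

```r
library(readr)

# Write a small tab-delimited file so the example is runnable
writeLines(c("name\tscore", "Ana\t90", "Ben\t85"), "scores.tsv")

# read_delim() handles any single-character delimiter via `delim`
scores <- read_delim("scores.tsv", delim = "\t")
scores
```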
A few tips:
- When reading in data from a file, initially use the Import Wizard to help write the code, but DO NOT use it to import the data: you will need the code itself in order to knit your document. Prof. Lendway has posted a video tutorial on the Import Wizard.
- The import functions `read_csv`, `read_csv2`, and `read_tsv` from the `readr` package are faster than their `base` counterparts `read.csv`, `read.csv2`, and `read.delim` for large files. They also have more flexible parsers (e.g., for dates, times, percentages). We recommend you use these functions instead of the `base` functions like `read.csv`. The `fread` function from the `data.table` package is also fast for large datasets. For smaller datasets (say 1 MB or less), there won’t be much difference in time among the three packages. Note that `read_csv2` is for semicolon-delimited files, whereas `read_csv` is for comma-delimited files.
- The `readr` functions automatically guess the type of data in each column (e.g., character, double, integer). You will often see a message just after the import telling you what it chose for each column. If you think there is an issue, you can use the function `problems()` to detect problems, and/or specify how the columns should be imported. See the section on “column specification” in the Data Import Cheat Sheet for more info.
- If you have trouble importing a dataset, try first importing it into a different tool such as Google Sheets or Excel, and then export it as a TSV or CSV before reading it into R.
- For really messy data, OpenRefine is complicated but powerful (YouTube demo).
- When you are importing a large file, you might want to first try importing a subset of the data. For example, if you want to take the first 17 rows only, you can write `read_csv("file.csv", n_max = 17)`.
- Similarly, you might want to skip the first \(n\) lines of the file when importing, select only certain columns to read in, or choose a random subset of the rows. See the cheat sheet for instructions on these tasks, or just google!
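As a quick sketch of the last two tips (the file and column names are invented), the readr arguments `n_max`, `skip`, and `col_select` control how much of a file is read:

```r
library(readr)

# Create a small file so the example is self-contained
write_csv(data.frame(a = 1:5, b = letters[1:5]), "toy.csv")

# Read only the first 2 rows
first_rows <- read_csv("toy.csv", n_max = 2)

# Skip the header and first 2 data rows, supplying column names ourselves
rest <- read_csv("toy.csv", skip = 3, col_names = c("a", "b"))

# Read only column `a`
just_a <- read_csv("toy.csv", col_select = a)
```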
Checking the Imported Datasets
After reading in new data, it is ALWAYS a good idea to do some quick checks of the data. Here are two first steps that are especially useful:
1. Open the data in the spreadsheet-like viewer with `View(dataset_name)` and take a look at it. Sort it by different variables by clicking on the arrows next to the variable names. Make sure there isn’t anything unexpected.
2. Do a quick summary of the data. The code below is one way to do this. For quantitative variables, it provides summary statistics and will let you know if there are missing values. For factors (they need to be factors, not just character variables; the `mutate()` changes them to factors), it shows you counts for the top categories and tells you if there are any missing values.
dataset_name %>%
  mutate(across(where(is.character), as.factor)) %>%
  summary()
Cleaning Datasets
Cleaning Categorical Variables
First we want to make sure the factors are “clean,” meaning the values are consistent and in the correct format. For example, `true`, `TRUE`, and `T` will be three different factor levels. The easiest way to manage this is to look at the levels of the factor and replace messy values with clean ones. For example, the following code cleans up the true/false values in a column `X` in a data set called `df`:
df <- df %>%
  mutate(X = fct_recode(X,
    "TRUE"  = "T", "TRUE"  = "true",
    "FALSE" = "f", "FALSE" = "N", "FALSE" = "No"
  ))
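As a self-contained illustration of the same pattern (the values below are invented), `fct_recode` collapses the inconsistent spellings into two clean levels:

```r
library(dplyr)
library(forcats)

# Toy data with inconsistent true/false coding
df <- tibble(X = factor(c("T", "true", "TRUE", "f", "No")))

df <- df %>%
  mutate(X = fct_recode(X,
    "TRUE"  = "T", "TRUE"  = "true",
    "FALSE" = "f", "FALSE" = "No"
  ))

levels(df$X)
count(df, X)
```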
Example 13.1 (Clean up the levels on the Messy IMDB 5000 dataset) We will use a slightly “messied” version of the IMDB 5000 Dataset, collected by chuansun76 on Kaggle.
a. Use `read_csv` to load the IMDB 5000 dataset from “https://bcheggeseth.github.io/112_fall_2022/data/imdb_5000_messy.csv”, and save it as `imdbMessy`.
Solution
imdbMessy <- read_csv("https://bcheggeseth.github.io/112_fall_2022/data/imdb_5000_messy.csv")
b. Print out the variable names.
Solution
names(imdbMessy) # order = order in dataset
## [1] "...1" "color" "director_name"
## [4] "num_critic_for_reviews" "duration" "director_facebook_likes"
## [7] "actor_3_facebook_likes" "actor_2_name" "actor_1_facebook_likes"
## [10] "gross" "genres" "actor_1_name"
## [13] "movie_title" "num_voted_users" "cast_total_facebook_likes"
## [16] "actor_3_name" "facenumber_in_poster" "plot_keywords"
## [19] "movie_imdb_link" "num_user_for_reviews" "language"
## [22] "country" "content_rating" "budget"
## [25] "title_year" "actor_2_facebook_likes" "imdb_score"
## [28] "aspect_ratio" "movie_facebook_likes"
ls(imdbMessy) # order = alphabetical order
## [1] "actor_1_facebook_likes" "actor_1_name" "actor_2_facebook_likes"
## [4] "actor_2_name" "actor_3_facebook_likes" "actor_3_name"
## [7] "aspect_ratio" "budget" "cast_total_facebook_likes"
## [10] "color" "content_rating" "country"
## [13] "director_facebook_likes" "director_name" "duration"
## [16] "facenumber_in_poster" "genres" "gross"
## [19] "imdb_score" "language" "movie_facebook_likes"
## [22] "movie_imdb_link" "movie_title" "num_critic_for_reviews"
## [25] "num_user_for_reviews" "num_voted_users" "plot_keywords"
## [28] "title_year"
c. Examine the `color` variable. What are the existing values?
Solution
imdbMessy %>% select(color) %>% head()
## # A tibble: 6 × 1
## color
## <chr>
## 1 Color
## 2 Color
## 3 Color
## 4 Color
## 5 <NA>
## 6 Color
levels(factor(imdbMessy$color))
## [1] "B&W" "Black and White" "color" "Color" "COLOR"
unique(imdbMessy$color)
## [1] "Color" NA "Black and White" "COLOR" "color"
## [6] "B&W"
d. How often does each color occur? Hint: `table` or `count`.
Solution
imdbMessy %>% count(color)
## # A tibble: 6 × 2
## color n
## <chr> <int>
## 1 B&W 10
## 2 Black and White 199
## 3 color 30
## 4 Color 4755
## 5 COLOR 30
## 6 <NA> 19
table(imdbMessy$color)
##
## B&W Black and White color Color COLOR
## 10 199 30 4755 30
e. The `read_csv` function read in the `color` values as strings. For this exercise, let’s convert them to a factor using the code: `imdbMessy <- imdbMessy %>% mutate(color = factor(color))`.
Solution
imdbMessy <- imdbMessy %>% mutate(color = factor(color))
f. Select what you think is the best value for each level and replace “messy” versions of the value with clean ones using the `fct_recode` function as shown above. How many entries are there for each level now?
Solution
imdbMessy <- imdbMessy %>%
  mutate(color = fct_recode(color, "B&W" = "Black and White", "Color" = "color", "Color" = "COLOR"))
imdbMessy %>% count(color)
## # A tibble: 3 × 2
## color n
## <fct> <int>
## 1 B&W 209
## 2 Color 4815
## 3 <NA> 19
Addressing Missing Data
Finally, you should look for and address missing data, encoded as `NA` (not available) in R. There is no single formula for dealing with NAs. You should first look to see how many NAs appear in each column:
colSums(is.na(imdbMessy))
Study the individual observations with NAs carefully. Why do you think they are missing? Are certain types of observations more likely to have NAs?
You have several options for dealing with NAs (and they have different consequences):
- You can remove observations with one or more NAs (see `drop_na`).
- You can remove columns with many NA values.
- You can replace NAs with a reasonable value (called imputing values). This could be a default value (like zero) or the average for a column (see `replace_na`).
- You can use packages such as `missForest` that fill in missing values with statistical predictions.
There is no perfect approach to dealing with NAs, and you must think carefully about how removing or replacing missing data may affect your work.
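To see how removing versus imputing plays out, here is a small sketch on invented data using `drop_na` and `replace_na` from the tidyr package:

```r
library(dplyr)
library(tidyr)

likes <- tibble(movie = c("A", "B", "C"),
                fb_likes = c(100, NA, 250))

# Option 1: drop rows with an NA -- movie B disappears entirely
dropped <- likes %>% drop_na(fb_likes)

# Option 3: impute a default -- movie B stays, but a 0 pulls down averages
imputed <- likes %>% mutate(fb_likes = replace_na(fb_likes, 0))

mean(dropped$fb_likes)  # 175
mean(imputed$fb_likes)  # about 116.7
```

Neither result is “correct”; the right choice depends on why the values are missing.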
Example 13.2 (Address NA values in the Messy IMDB 5000 dataset) Consider `imdbMessy`.
a. Print out the number of NAs in each of the columns.
Solution
colSums(is.na(imdbMessy))
## ...1 color director_name
## 0 19 104
## num_critic_for_reviews duration director_facebook_likes
## 50 15 104
## actor_3_facebook_likes actor_2_name actor_1_facebook_likes
## 23 13 7
## gross genres actor_1_name
## 884 0 7
## movie_title num_voted_users cast_total_facebook_likes
## 0 0 0
## actor_3_name facenumber_in_poster plot_keywords
## 23 13 153
## movie_imdb_link num_user_for_reviews language
## 0 21 12
## country content_rating budget
## 5 303 492
## title_year actor_2_facebook_likes imdb_score
## 108 13 0
## aspect_ratio movie_facebook_likes
## 329 0
b. Consider the `actor_1_facebook_likes` column. Take a look at a few of the records that have NA values. Why do you think there are NAs?
Solution
This variable is missing whenever `actor_1_name` is missing, which suggests that these movies don’t have an actor 1 listed.
imdbMessy %>% filter(is.na(actor_1_facebook_likes)) %>% head()
## # A tibble: 6 × 29
## ...1 color directo…¹ num_c…² durat…³ direc…⁴ actor…⁵ actor…⁶ actor…⁷ gross genres actor…⁸
## <dbl> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 4503 Color Léa Pool 23 97 4 NA <NA> NA 24784 Docum… <NA>
## 2 4520 Color Harry Ga… 12 105 0 NA <NA> NA 247740 Docum… <NA>
## 3 4721 Color U. Rober… 3 80 6 NA <NA> NA 2245 Docum… <NA>
## 4 4838 Color Pan Nalin 15 102 95 NA <NA> NA 16892 Docum… <NA>
## 5 4946 Color Amal Al-… NA 62 58 NA <NA> NA NA Docum… <NA>
## 6 4947 Color Andrew B… 12 90 0 NA <NA> NA NA Docum… <NA>
## # … with 17 more variables: movie_title <chr>, num_voted_users <dbl>,
## # cast_total_facebook_likes <dbl>, actor_3_name <chr>, facenumber_in_poster <dbl>,
## # plot_keywords <chr>, movie_imdb_link <chr>, num_user_for_reviews <dbl>, language <chr>,
## # country <chr>, content_rating <chr>, budget <dbl>, title_year <dbl>,
## # actor_2_facebook_likes <dbl>, imdb_score <dbl>, aspect_ratio <dbl>,
## # movie_facebook_likes <dbl>, and abbreviated variable names ¹director_name,
## # ²num_critic_for_reviews, ³duration, ⁴director_facebook_likes, ⁵actor_3_facebook_likes, …
c. Create a new dataframe that removes observations that have NAs for `actor_1_facebook_likes`.
Solution
imdbMessysub <- imdbMessy %>% filter(!is.na(actor_1_facebook_likes)) # Notice how I saved this new smaller dataset to a new name
d. Create a second new dataframe that replaces NAs in `actor_1_facebook_likes` with 0.
Solution
imdbMessysub2 <- imdbMessy %>% mutate(actor_1_facebook_likes = replace_na(actor_1_facebook_likes, 0))
Additional Practice
Exercise 13.1 Find a dataset that is not built into R and is related to one of the following topics:
- Macalester College (On Moodle Assignment 12, there is a dataset available from Macalester College IR you could use)
- A personal hobby or passion
- Your hometown, or a place you have lived
Load the data into R, make sure it is clean, and construct one interesting visualization of the data.