MSCS Happenings
For each problem I marked with an X, talk with others in the class and use online resources; help each other understand the WHY.
Turn it in to me by Thursday.
replace_na and drop_na
More thorough notes available at https://bcheggeseth.github.io/112_fall_2023/data-import.html
Depending on the file type (csv, tsv, excel, Google sheet, Stata file, shapefile, etc.), you’ll need to adjust the function you use. Here are some of the most common:
read_csv()
read_delim()
read_sheet()
st_read()
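A minimal sketch of two of these functions, assuming the readr package is installed. The toy files and their contents are made up for illustration and are written to a temporary location so the code runs anywhere.

```r
library(readr)

# Hypothetical toy data written to a temporary csv file
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("title,year", "Movie A,1999", "Movie B,2005"), tmp_csv)

# read_csv() assumes comma-separated values
movies_csv <- read_csv(tmp_csv, show_col_types = FALSE)

# read_delim() takes the delimiter explicitly (here a tab, as in a tsv file)
tmp_tsv <- tempfile(fileext = ".tsv")
writeLines(c("title\tyear", "Movie A\t1999", "Movie B\t2005"), tmp_tsv)
movies_tsv <- read_delim(tmp_tsv, delim = "\t", show_col_types = FALSE)

movies_csv
movies_tsv
```

Note that read_sheet() comes from the googlesheets4 package and st_read() from the sf package, so those need their own library() calls.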
When working with data sets, you need to:
Absolute file path describes the location of a file from the root directory or folder, typically the user directory.
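On macOS and Linux, ~ refers to the user root (home) directory. On Windows, the root directory is typically C:\.
Example: A file called data.csv is located in the Assignment_08 folder in the Comp_Stat_112 folder on the Desktop.
With ~ as the root: ~/Desktop/Comp_Stat_112/Assignment_08/data.csv
With C:\ as the root: C:/Desktop/Comp_Stat_112/Assignment_08/data.csv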
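To see what an absolute path with ~ turns into on your own machine, here is a small sketch in base R. The path is the example path from above; the file probably does not exist on your computer, so file.exists() is only there to show how you would check.

```r
# path.expand() replaces a leading ~ with the user's home directory
abs_path <- path.expand("~/Desktop/Comp_Stat_112/Assignment_08/data.csv")
abs_path

# file.exists() reports whether a file is actually at that location
found <- file.exists(abs_path)
found
```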
Relative file path describes the location of a file from the current working directory.
Example: A file called data.csv is located in a data folder within the Comp_Stat_112 folder on the Desktop.
If the working directory is ~/Desktop/Comp_Stat_112/Assignment_08/, the relative path is ../data/data.csv. The .. refers to the parent directory (go up one level to the folder containing Assignment_08).
If the working directory is ~/Desktop/Comp_Stat_112/, the relative path is data/data.csv.
If the working directory is ~/Desktop/Comp_Stat_112/data, the relative path is data.csv.
The best location to put a dataset is within a folder that is dedicated to the project or assignment.
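The three relative paths above can be tried out with a sketch like the following. It rebuilds the example folder layout inside a temporary directory (an assumption made so the code runs anywhere, not your real Desktop) and checks each relative path from the matching working directory.

```r
# Recreate the example folder layout inside a temporary directory
root <- file.path(tempdir(), "Comp_Stat_112")
dir.create(file.path(root, "Assignment_08"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data"), showWarnings = FALSE)
writeLines("x,y", file.path(root, "data", "data.csv"))

old_wd <- getwd()

# Working directory: Comp_Stat_112/Assignment_08
setwd(file.path(root, "Assignment_08"))
from_assignment <- file.exists("../data/data.csv")

# Working directory: Comp_Stat_112
setwd(root)
from_course <- file.exists("data/data.csv")

# Working directory: Comp_Stat_112/data
setwd(file.path(root, "data"))
from_data <- file.exists("data.csv")

setwd(old_wd)  # restore the original working directory
c(from_assignment, from_course, from_data)
```

The same file is found each time; only the relative path changes with the working directory.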
Try downloading the csv file from:
https://bcheggeseth.github.io/112_fall_2023/data/imdb_5000_messy.csv
Right-click and Save As
Put the data file in the Assignment_08 folder.
Create a new Rmd file called Data_Import.Rmd (save it to the same folder) and load the data in with read_csv().
Always look at the data after importing with View()
Do a quick summary of all variables:
dataset_name %>%
mutate(across(where(is.character), as.factor)) %>%
summary()
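Why convert characters to factors first? For a character column, summary() only reports length, class, and mode, while for a factor it reports counts per level (including NAs). Here is a sketch on a made-up toy table, assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical toy data with one character and one numeric column
toy <- tibble(
  color = c("Color", "color", NA, "B&W"),
  duration = c(120, 95, NA, 88)
)

# Same pattern as above: characters become factors, then summarize
toy_summary <- toy %>%
  mutate(across(where(is.character), as.factor)) %>%
  summary()
toy_summary
```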
Cleaning Categorical Variables
“Clean” data has consistent values in terms of spelling and capitalization.
How could we clean this up?
# A tibble: 6 × 2
  color               n
  <chr>           <int>
1 B&W                10
2 Black and White   199
3 COLOR              30
4 Color            4755
5 color              30
6 <NA>               19
Study the individual observations with NAs carefully.
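One possible way to clean up the color values, sketched on a toy column that contains just the spellings from the table. Using case_when() is one choice among several (recoding functions or string tools would also work), and the choice of "Black and White" and "Color" as the final labels is an assumption.

```r
library(dplyr)

# Toy stand-in for the color column, using the spellings shown in the table
toy <- tibble(color = c("B&W", "Black and White", "COLOR", "Color", "color", NA))

cleaned <- toy %>%
  mutate(color = case_when(
    color %in% c("B&W", "Black and White") ~ "Black and White",
    color %in% c("COLOR", "Color", "color") ~ "Color",
    TRUE ~ color  # leaves NA (and any unexpected value) unchanged
  ))

# Check the result: only two spellings plus NA should remain
cleaned %>% count(color)
```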
Addressing Missing Data
You have several options for dealing with NAs (and they have different consequences):
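Remove observations (rows) with one or more NAs (see drop_na).
Remove variables (columns) that have many NAs, using select.
Replace NAs with a reasonable value (see replace_na).
Let’s check to see how many values are missing per variable.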
                     ...1                     color             director_name
                        0                        19                       104
   num_critic_for_reviews                  duration   director_facebook_likes
                       50                        15                       104
   actor_3_facebook_likes              actor_2_name    actor_1_facebook_likes
                       23                        13                         7
                    gross                    genres              actor_1_name
                      884                         0                         7
              movie_title           num_voted_users cast_total_facebook_likes
                        0                         0                         0
             actor_3_name      facenumber_in_poster             plot_keywords
                       23                        13                       153
          movie_imdb_link      num_user_for_reviews                  language
                        0                        21                        12
                  country            content_rating                    budget
                        5                       303                       492
               title_year    actor_2_facebook_likes                imdb_score
                      108                        13                         0
             aspect_ratio      movie_facebook_likes
                      329                         0
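The counts above have the shape of a named vector, which is what colSums(is.na(...)) returns. That this is how they were produced is an assumption. Here is a sketch of the idea on a made-up toy data frame, using only base R:

```r
# Hypothetical toy data with some missing values
toy <- data.frame(
  color = c("Color", NA, "Color"),
  gross = c(NA, NA, 1000),
  imdb_score = c(7.1, 6.5, 8.0)
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs in each column
na_counts <- colSums(is.na(toy))
na_counts
```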
Consider the actor_1_facebook_likes column. Take a look at a few of the records that have NA values. Why do you think there are NAs?
imdbMessy %>%
  filter(is.na(actor_1_facebook_likes)) %>%
  select(movie_title, actor_1_name, actor_1_facebook_likes) %>%
  head()
# A tibble: 6 × 3
  movie_title             actor_1_name actor_1_facebook_likes
  <chr>                   <chr>                         <dbl>
1 Pink Ribbons, Inc.      <NA>                             NA
2 Sex with Strangers      <NA>                             NA
3 The Harvest/La Cosecha  <NA>                             NA
4 Ayurveda: Art of Being  <NA>                             NA
5 The Brain That Sings    <NA>                             NA
6 The Blood of My Brother <NA>                             NA
To remove observations (rows) that are missing actor_1_facebook_likes, use drop_na.
To replace missing values of actor_1_facebook_likes with 0, use replace_na.
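Both operations can be sketched as follows, assuming dplyr and tidyr are installed. The toy table is made up for illustration; with the real data you would start from imdbMessy instead.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing value
toy <- tibble(
  movie_title = c("Movie A", "Movie B", "Movie C"),
  actor_1_facebook_likes = c(1000, NA, 250)
)

# Remove rows where actor_1_facebook_likes is missing
dropped <- toy %>%
  drop_na(actor_1_facebook_likes)

# Replace missing actor_1_facebook_likes with 0
replaced <- toy %>%
  replace_na(list(actor_1_facebook_likes = 0))

dropped
replaced
```

Keep the consequences in mind: dropping rows shrinks the dataset, while replacing with 0 treats "unknown" as "zero likes", which may or may not be reasonable.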
Find a dataset that is not built into R and is related to one of the following topics:
Load the data into R, make sure it is clean, and construct one interesting visualization of the data.
Note: this might help you brainstorm ideas for projects
Only 1 exercise for Data Import [Assignment 8]
Midterm Revisions due Thursday in class
TT9 due Friday (data: Horror Legends)
IV1 due next Friday (see feedback on spreadsheet)