MSCS Happenings
For each problem I marked with an X,
Talk with others in the class, use online resources; help each other understand the WHY.
Turn in to me by Thursday.
replace_na and drop_na
More thorough notes available at https://bcheggeseth.github.io/112_fall_2023/data-import.html
Depending on the file type (csv, tsv, excel, Google sheet, Stata file, shapefile, etc.), you’ll need to adjust the function you use. Here are some of the most common:
read_csv()
read_delim()
read_sheet()
st_read()
When working with data sets, you need to:
An absolute file path describes the location of a file starting from the root directory or folder, typically the user directory.
~ refers to the user root directory.
On Windows, the root directory is typically C:\.
Example: A file called data.csv is located in the Assignment_08 folder in the Comp_Stat_112 folder on the Desktop.
~/Desktop/Comp_Stat_112/Assignment_08/data.csv
C:/Desktop/Comp_Stat_112/Assignment_08/data.csv
A relative file path describes the location of a file starting from the current working directory.
Example: A file called data.csv is located in a data folder within Comp_Stat_112 folder on the Desktop
If the working directory is ~/Desktop/Comp_Stat_112/Assignment_08/, the relative path is ../data/data.csv. The .. refers to the parent directory (go up one level to the folder containing Assignment_08).
If the working directory is ~/Desktop/Comp_Stat_112/, the relative path is data/data.csv.
If the working directory is ~/Desktop/Comp_Stat_112/data, the relative path is data.csv.
The best location to put a dataset is within a folder that is dedicated to the project or assignment.
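To see how the same file is reached from different working directories, here is a small sketch. It assumes a throwaway copy of the folder layout built under R's temporary directory (a hypothetical stand-in for the real Desktop), so nothing on your computer is changed.

```r
# Build a throwaway copy of the folder layout under a temporary directory
# (hypothetical stand-in for ~/Desktop).
root <- file.path(tempdir(), "Desktop", "Comp_Stat_112")
dir.create(file.path(root, "Assignment_08"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("x,y", "1,2"), file.path(root, "data", "data.csv"))

old_wd <- getwd()

# Working directory = Assignment_08: go up one level, then into data/
setwd(file.path(root, "Assignment_08"))
from_assignment <- file.exists("../data/data.csv")

# Working directory = Comp_Stat_112: data/ is directly below
setwd(root)
from_course <- file.exists("data/data.csv")

# Working directory = data: the file is right here
setwd(file.path(root, "data"))
from_data <- file.exists("data.csv")

setwd(old_wd)  # restore the original working directory
```

All three checks should come back TRUE, since each relative path points at the same file from a different starting folder.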
Try downloading the csv file from:
https://bcheggeseth.github.io/112_fall_2023/data/imdb_5000_messy.csv
Right-click and Save As
Put the data file in Assignment_08 folder.
Create a new Rmd file called Data_Import.Rmd (save it to the same folder) and load the data in with read_csv().
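A minimal sketch of the import step. So that it runs anywhere, it writes a tiny made-up CSV to a temporary folder instead of using the downloaded file; the file name imdb_example.csv and its three rows are assumptions for illustration only.

```r
library(readr)

# Hypothetical stand-in for the downloaded file: a tiny CSV in a temporary folder
csv_path <- file.path(tempdir(), "imdb_example.csv")
writeLines(c("movie_title,color,duration",
             "Movie A,Color,120",
             "Movie B,,95"), csv_path)

# read_csv() guesses column types; empty cells are read as NA by default
imdb_example <- read_csv(csv_path, show_col_types = FALSE)
```

If Data_Import.Rmd and the csv file sit in the same folder, the relative path in your own file is probably just the file name, e.g. read_csv("imdb_5000_messy.csv").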
Always look at the data after importing with View()
Do a quick summary of all variables:
dataset_name %>%
  mutate(across(where(is.character), as.factor)) %>%
  summary()
Cleaning Categorical Variables
“Clean” data has consistent values in terms of spelling and capitalization.
How could we clean this up?
# A tibble: 6 × 2
  color               n
  <chr>           <int>
1 B&W                10
2 Black and White   199
3 COLOR              30
4 Color            4755
5 color              30
6 <NA>               19
Study the individual observations with NAs carefully.
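One possible way to make the color labels consistent, sketched on a hypothetical toy tibble with the same kinds of values. The choice of "Black and White" and "Color" as the final labels is an assumption; other approaches (such as forcats functions) would also work.

```r
library(dplyr)
library(stringr)

# Toy data with the same kinds of inconsistent labels as the color table
movies <- tibble(color = c("B&W", "Black and White", "COLOR", "Color", "color", NA))

movies_clean <- movies %>%
  mutate(color = case_when(
    # compare in lower case so capitalization no longer matters
    str_to_lower(color) %in% c("b&w", "black and white") ~ "Black and White",
    str_to_lower(color) == "color" ~ "Color",
    TRUE ~ NA_character_  # anything unrecognized, including NA, becomes NA
  ))

# Re-count to confirm only the intended categories remain
movies_clean %>% count(color)
```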
Addressing Missing Data
You have several options for dealing with NAs (and they have different consequences):
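Remove observations (rows) with missing values (drop_na).
Remove variables (columns) with missing values using select.
Replace missing values with another value (replace_na).
Let’s check to see how many values are missing per variable.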
                     ...1                     color             director_name
                        0                        19                       104
   num_critic_for_reviews                  duration   director_facebook_likes
                       50                        15                       104
   actor_3_facebook_likes              actor_2_name    actor_1_facebook_likes
                       23                        13                         7
                    gross                    genres              actor_1_name
                      884                         0                         7
              movie_title           num_voted_users cast_total_facebook_likes
                        0                         0                         0
             actor_3_name      facenumber_in_poster             plot_keywords
                       23                        13                       153
          movie_imdb_link      num_user_for_reviews                  language
                        0                        21                        12
                  country            content_rating                    budget
                        5                       303                       492
               title_year    actor_2_facebook_likes                imdb_score
                      108                        13                         0
             aspect_ratio      movie_facebook_likes
                      329                         0
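Counts like these can be produced in more than one way. The sketch below shows two common options on a hypothetical three-row tibble; it is not necessarily the code used to make the output above.

```r
library(dplyr)

# Hypothetical toy data with a few missing values
toy <- tibble(
  color = c("Color", NA, "Color"),
  gross = c(NA, NA, 100),
  imdb_score = c(7.1, 6.5, 8.0)
)

# Base R: is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))

# Tidyverse alternative: one row holding the NA count for every column
na_counts_tbl <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
```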
Consider the actor_1_facebook_likes column. Take a look at a few of the records that have NA values. Why do you think there are NAs?
imdbMessy %>%
  filter(is.na(actor_1_facebook_likes)) %>%
  select(movie_title, actor_1_name, actor_1_facebook_likes) %>%
  head()
# A tibble: 6 × 3
  movie_title             actor_1_name actor_1_facebook_likes
  <chr>                   <chr>                         <dbl>
1 Pink Ribbons, Inc.      <NA>                             NA
2 Sex with Strangers      <NA>                             NA
3 The Harvest/La Cosecha  <NA>                             NA
4 Ayurveda: Art of Being  <NA>                             NA
5 The Brain That Sings    <NA>                             NA
6 The Blood of My Brother <NA>                             NA
To remove observations (rows) that are missing actor_1_facebook_likes, use drop_na().
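One way to do this, sketched on a hypothetical toy tibble, is to give tidyr's drop_na() the column name. Naming the column matters: without it, drop_na() removes every row that has an NA in any column.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: row B is missing actor_1_facebook_likes, row A is missing gross
toy <- tibble(
  movie_title = c("A", "B", "C"),
  actor_1_facebook_likes = c(1000, NA, 250),
  gross = c(NA, 5e6, 2e6)
)

# Drop only rows where actor_1_facebook_likes is NA; NAs in other columns are kept
toy_dropped <- toy %>%
  drop_na(actor_1_facebook_likes)
```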
To replace missing values of actor_1_facebook_likes with 0, use replace_na().
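A possible version of the replacement, again on a hypothetical toy tibble, uses tidyr's replace_na() inside mutate():

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing value
toy <- tibble(
  movie_title = c("A", "B", "C"),
  actor_1_facebook_likes = c(1000, NA, 250)
)

# Replace NA with 0 in this one column; all rows are kept
toy_replaced <- toy %>%
  mutate(actor_1_facebook_likes = replace_na(actor_1_facebook_likes, 0))
```

Keep the consequences in mind: filling in 0 treats "unknown" as "zero likes", which will pull down averages. Check whether that is reasonable for the records you looked at before doing it.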
Find a dataset that is not built into R and is related to one of the following topics:
Load the data into R, make sure it is clean, and construct one interesting visualization of the data.
Note: this might help you brainstorm ideas for projects
Only 1 exercise for Data Import [Assignment 8]
Midterm Revisions due Thursday in class
TT9 due Friday (data: Horror Legends)
IV1 due next Friday (see feedback on spreadsheet)
