MSCS Happenings
For each problem I marked with an X,
Talk with others in the class; help each other understand the WHY.
Turn it in to me by next class.
UPDATE: You should have been notified of a shared PDF with feedback.
Talk through some of the stumbling blocks with your classmates. Take notes for yourself.
By the end of THIS week, submit an updated version of the Midterm Part 2 to Moodle and write a reflection about the midterm in your reflection Google Doc for March.
My Deal: You may talk to others in the class (not preceptors, not people who have previously taken it) but you may not directly share code with each other. Instead, talk about the actions more conceptually and point each other to resources.
replace_na and drop_na
More thorough notes available at https://bcheggeseth.github.io/112_spring_2023/data-import.html
Depending on the file type (csv, tsv, excel, Google sheet, stata file, shapefile, etc.), you’ll need to adjust the function you use. Here are some of the most common:
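read_csv() for comma-separated files (csv)
read_delim() for other delimited text files, such as tsv
read_sheet() for Google Sheets (googlesheets4 package)
st_read() for shapefiles and other spatial data (sf package)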
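As a minimal sketch of the first two functions, the example below writes a tiny comma-separated file and a tiny tab-separated file to temporary locations and reads each one back in. The file contents and the names csv_path, tsv_path, from_csv, and from_tsv are made up for illustration; read_sheet() and st_read() are left out because they need a Google Sheet and a spatial file to point at.

```r
library(readr)

# Hypothetical toy files, written to temporary locations for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(c("title,year", "Alpha,2001", "Beta,2005"), csv_path)

tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("title\tyear", "Alpha\t2001", "Beta\t2005"), tsv_path)

# read_csv() assumes commas separate the values
from_csv <- read_csv(csv_path, show_col_types = FALSE)

# read_delim() lets you name the delimiter yourself (here, a tab)
from_tsv <- read_delim(tsv_path, delim = "\t", show_col_types = FALSE)

from_csv
from_tsv
```

Both calls should give the same two-row table; only the delimiter handling differs.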
When working with data sets, you need to:
Absolute file path describes the location of a file from the root directory or folder, typically the user directory.
On a Mac, ~ refers to the user root directory.
On Windows, the root directory is typically C:\
Example: A file called data.csv is located in the Assignment_08 folder in the Comp_Stat_112 folder on the Desktop.
On a Mac, the absolute file path is ~/Desktop/Comp_Stat_112/Assignment_08/data.csv
On Windows, the absolute file path is C:/Desktop/Comp_Stat_112/Assignment_08/data.csv
Relative file path describes the location of a file from the current working directory.
Example: A file called data.csv is located in a data folder within the Comp_Stat_112 folder on the Desktop.
If the working directory is ~/Desktop/Comp_Stat_112/Assignment_08/, the relative path is ../data/data.csv. The .. refers to the parent directory (go up one level to the folder containing Assignment_08).
If the working directory is ~/Desktop/Comp_Stat_112/, the relative path is data/data.csv.
If the working directory is ~/Desktop/Comp_Stat_112/data, the relative path is data.csv.
The best location to put a dataset is within a folder that is dedicated to the project or assignment.
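To see how the same file gets a different relative path from each working directory, here is a rough sketch that rebuilds the example folder layout inside a temporary directory (not on your real Desktop) and checks each path with file.exists(). The object names root, old_wd, and found_from_* are made up for this illustration.

```r
# Build a throwaway copy of the example layout in a temporary directory
root <- file.path(tempdir(), "Comp_Stat_112")
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "Assignment_08"), showWarnings = FALSE)
writeLines(c("x,y", "1,2"), file.path(root, "data", "data.csv"))

# Working directory = Assignment_08: go up one level, then into data
old_wd <- setwd(file.path(root, "Assignment_08"))
found_from_assignment <- file.exists("../data/data.csv")

# Working directory = Comp_Stat_112: go straight into data
setwd(root)
found_from_course <- file.exists("data/data.csv")

# Working directory = data: the file is right here
setwd(file.path(root, "data"))
found_from_data <- file.exists("data.csv")

# Put the working directory back where it started
setwd(old_wd)

c(found_from_assignment, found_from_course, found_from_data)
```

You can check your own working directory at any time with getwd(). When an Rmd file is knit, the working directory defaults to the folder containing the Rmd file, which is one reason keeping the data next to the Rmd keeps paths short.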
Try downloading the csv file from:
https://bcheggeseth.github.io/112_spring_2023/data/imdb_5000_messy.csv
Right-click and Save As
Put the data file in the Assignment_08 folder.
Create a new Rmd file called Data_Import.Rmd (save it to the same folder) and load the data in with read_csv().
Always look at the data after importing with View()
Do a quick summary of all variables:
dataset_name %>%
  mutate(across(where(is.character), as.factor)) %>%
  summary()
Cleaning Categorical Variables
“Clean” data has consistent values in terms of spelling and capitalization.
How could we clean this up?
# A tibble: 6 × 2
  color               n
  <chr>           <int>
1 B&W                10
2 Black and White   199
3 color              30
4 Color            4755
5 COLOR              30
6 <NA>               19
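One possible answer, sketched on a made-up tibble that holds just the six distinct values above (not the full data set): collapse the two black-and-white spellings into one label and the three capitalizations of "color" into another, leaving NA alone. The names movies and cleaned are hypothetical, and the choice of "Black and White" and "Color" as the final labels is an assumption; other labels would work just as well.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data with the six distinct values from the table
movies <- tibble(
  color = c("B&W", "Black and White", "color", "Color", "COLOR", NA)
)

cleaned <- movies %>%
  mutate(
    color = case_when(
      # Two spellings of the same category
      color %in% c("B&W", "Black and White") ~ "Black and White",
      # Same word, different capitalization
      str_to_lower(color) == "color" ~ "Color",
      # Anything else (including NA) is left unchanged
      TRUE ~ color
    )
  )

cleaned %>% count(color)
```

After a step like this, rerunning count() is a quick way to confirm that only the intended categories remain.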
Study the individual observations with NAs carefully.
Addressing Missing Data
You have several options for dealing with NAs (and they have different consequences):
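Remove observations (rows) that have missing values (drop_na).
Remove variables (columns) that have many missing values, using select.
Replace missing values with a reasonable value (replace_na).
Let’s check to see how many values are missing per variable.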
                     ...1                     color             director_name
                        0                        19                       104
   num_critic_for_reviews                  duration   director_facebook_likes
                       50                        15                       104
   actor_3_facebook_likes              actor_2_name    actor_1_facebook_likes
                       23                        13                         7
                    gross                    genres              actor_1_name
                      884                         0                         7
              movie_title           num_voted_users cast_total_facebook_likes
                        0                         0                         0
             actor_3_name      facenumber_in_poster             plot_keywords
                       23                        13                       153
          movie_imdb_link      num_user_for_reviews                  language
                        0                        21                        12
                  country            content_rating                    budget
                        5                       303                       492
               title_year    actor_2_facebook_likes                imdb_score
                      108                        13                         0
             aspect_ratio      movie_facebook_likes
                      329                         0
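The call that produced the counts above is not shown, so the following is only one plausible way to get a count of missing values per variable, sketched on a small made-up data frame. The names toy, na_counts, and na_counts_tbl and the toy columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with a few missing values
toy <- tibble(
  title = c("A", "B", "C"),
  gross = c(10, NA, NA),
  actor = c("x", NA, "z")
)

# Base R: is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts

# dplyr alternative: one row, one column of counts per variable
na_counts_tbl <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts_tbl
```

The base R version returns a named vector, which prints in the same name-over-value layout as the output above.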
Consider the actor_1_facebook_likes column. Take a look at a few of the records that have NA values. Why do you think there are NAs?
imdbMessy %>%
  filter(is.na(actor_1_facebook_likes)) %>%
  select(movie_title, actor_1_name, actor_1_facebook_likes) %>%
  head()
# A tibble: 6 × 3
  movie_title             actor_1_name actor_1_facebook_likes
  <chr>                   <chr>                         <dbl>
1 Pink Ribbons, Inc.      <NA>                             NA
2 Sex with Strangers      <NA>                             NA
3 The Harvest/La Cosecha  <NA>                             NA
4 Ayurveda: Art of Being  <NA>                             NA
5 The Brain That Sings    <NA>                             NA
6 The Blood of My Brother <NA>                             NA
To remove observations (rows) that are missing actor_1_facebook_likes, use drop_na().
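A minimal sketch of the row removal, run on a small made-up tibble rather than the full data set (toy and toy_dropped are hypothetical names, and the values are invented):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: row B is missing actor_1_facebook_likes,
# row A is missing gross
toy <- tibble(
  movie_title = c("A", "B", "C"),
  actor_1_facebook_likes = c(1000, NA, 250),
  gross = c(NA, 5, 7)
)

# Naming the column drops only rows where THAT column is NA
toy_dropped <- toy %>%
  drop_na(actor_1_facebook_likes)

toy_dropped
```

Row A survives even though its gross is missing, because only actor_1_facebook_likes was named. Calling drop_na() with no column names would instead drop every row that has an NA anywhere.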
To replace missing values of actor_1_facebook_likes with 0, use replace_na().
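A minimal sketch of the replacement, again on a small made-up tibble (toy and toy_filled are hypothetical names, and the values are invented):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: row B is missing actor_1_facebook_likes
toy <- tibble(
  movie_title = c("A", "B", "C"),
  actor_1_facebook_likes = c(1000, NA, 250)
)

# Inside mutate(), replace_na() swaps each NA in the column for the given value
toy_filled <- toy %>%
  mutate(actor_1_facebook_likes = replace_na(actor_1_facebook_likes, 0))

toy_filled
```

Unlike dropping rows, this keeps all three movies, but it treats "unknown" as "zero likes", so whether it is reasonable depends on why the values are missing in the first place.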
Find a dataset that is not built into R and is related to one of the following topics:
Load the data into R, make sure it is clean, and construct one interesting visualization of the data.
Note: this might help you brainstorm ideas for projects
Only 1 exercise for Data Import [Assignment 8]
Midterm Revisions Part 2 due Friday
TT9 due Friday (data: Programming Languages)
IV1 due Friday