Brianna Heggeseth
MSCS Happenings
Outside Mac
replace_na and drop_na
More thorough notes available at https://bcheggeseth.github.io/112_fall_2022/data-import.html
Depending on the file type (CSV, TSV, Excel, Google Sheet, Stata file, shapefile, etc.), you’ll need to adjust the function you use. Here are some of the most common:
read_csv()
read_delim()
read_sheet()
st_read()
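As a rough sketch of how the first two functions are called, the example below writes two tiny hypothetical files to a temporary location (so it runs offline) and reads them back. The file contents and the object names `movies_csv` and `movies_tsv` are made up for illustration.

```r
library(readr)

# Hypothetical toy files, written to temp paths so no download is needed.
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("title,year", "Movie A,2001", "Movie B,2005"), csv_path)
writeLines(c("title\tyear", "Movie A\t2001", "Movie B\t2005"), tsv_path)

# Comma-separated file
movies_csv <- read_csv(csv_path, show_col_types = FALSE)

# Other delimiters: name the delimiter explicitly with `delim`
movies_tsv <- read_delim(tsv_path, delim = "\t", show_col_types = FALSE)

# read_sheet() comes from the googlesheets4 package and st_read() from sf;
# they are not run here because they need those packages plus a sheet or shapefile.
movies_csv
```

Both calls should return the same two-row tibble; only the delimiter differs.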
The Import Wizard can help you write the code!
Try importing data from:
https://bcheggeseth.github.io/112_fall_2022/data/imdb_5000_messy.csv
Note: When using the Import Wizard, make sure to copy and paste the generated code into an Rmd file.
After importing, always look at the data with View().
Do a quick summary of all variables:
dataset_name %>%
mutate(across(where(is.character), as.factor)) %>%
summary()
Cleaning Categorical Variables
“Clean” data has consistent values in terms of spelling and capitalization.
How could we clean this up?
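One possible approach, sketched on made-up values: standardize the capitalization first, then map the remaining spelling variants onto a single label. The `toy` tibble and its values are hypothetical and are not taken from the IMDB file.

```r
library(dplyr)
library(stringr)

# Hypothetical messy categorical values (assumption, for illustration only).
toy <- tibble(color = c("Color", "color", "COLOR", "B&W", "Black and White", NA))

toy_clean <- toy %>%
  mutate(
    # Step 1: remove capitalization differences
    color = str_to_lower(color),
    # Step 2: collapse spelling variants into one label per category
    color = case_when(
      color %in% c("b&w", "black and white") ~ "Black and White",
      color == "color" ~ "Color",
      TRUE ~ NA_character_  # missing values stay missing
    )
  )

# Check the result: each category should now appear under one spelling.
toy_clean %>% count(color)
```

Counting the values before and after cleaning is a quick way to confirm that no unexpected spellings remain.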
Study the individual observations with NAs carefully.
Addressing Missing Data
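You have several options for dealing with NAs (and they have different consequences):
Remove observations (rows) that have missing values (see drop_na).
Remove variables (columns) that have many missing values (see select).
Replace missing values with a reasonable value (see replace_na).
Let’s check to see how many values are missing per variable.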
                     ...1                     color             director_name
                        0                        19                       104
   num_critic_for_reviews                  duration   director_facebook_likes
                       50                        15                       104
   actor_3_facebook_likes              actor_2_name    actor_1_facebook_likes
                       23                        13                         7
                    gross                    genres              actor_1_name
                      884                         0                         7
              movie_title           num_voted_users cast_total_facebook_likes
                        0                         0                         0
             actor_3_name      facenumber_in_poster             plot_keywords
                       23                        13                       153
          movie_imdb_link      num_user_for_reviews                  language
                        0                        21                        12
                  country            content_rating                    budget
                        5                       303                       492
               title_year    actor_2_facebook_likes                imdb_score
                      108                        13                         0
             aspect_ratio      movie_facebook_likes
                      329                         0
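Counts in this shape (a named vector with one entry per variable) can be produced by summing `is.na()` over each column. The sketch below does this on a small hypothetical stand-in for `imdbMessy`; the values in `toy` are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for imdbMessy (assumption, not the real data).
toy <- tibble(
  movie_title  = c("A", "B", "C"),
  gross        = c(100, NA, NA),
  actor_1_name = c("X", NA, "Z")
)

# is.na() marks each cell TRUE/FALSE; colSums() adds up the TRUEs per column.
na_counts <- colSums(is.na(toy))
na_counts
```

Variables with a count of 0 are complete; large counts flag variables that deserve a closer look before any analysis.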
Consider the actor_1_facebook_likes column. Take a look at a few of the records that have NA values. Why do you think there are NAs?
imdbMessy %>%
  filter(is.na(actor_1_facebook_likes)) %>%
  select(movie_title, actor_1_name, actor_1_facebook_likes) %>%
  head()
# A tibble: 6 × 3
  movie_title             actor_1_name actor_1_facebook_likes
  <chr>                   <chr>                         <dbl>
1 Pink Ribbons, Inc.      <NA>                             NA
2 Sex with Strangers      <NA>                             NA
3 The Harvest/La Cosecha  <NA>                             NA
4 Ayurveda: Art of Being  <NA>                             NA
5 The Brain That Sings    <NA>                             NA
6 The Blood of My Brother <NA>                             NA
To remove observations (rows) that are missing actor_1_facebook_likes, use drop_na.
To replace missing values of actor_1_facebook_likes with 0, use replace_na.
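A minimal sketch of both options for actor_1_facebook_likes, run on a hypothetical three-row stand-in for `imdbMessy`. The object names `toy`, `toy_dropped`, and `toy_filled` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for imdbMessy (assumption, not the real data).
toy <- tibble(
  movie_title            = c("A", "B", "C"),
  actor_1_facebook_likes = c(1000, NA, 250)
)

# Option 1: remove rows where this variable is missing
toy_dropped <- toy %>%
  drop_na(actor_1_facebook_likes)

# Option 2: keep every row and replace the missing values with 0
toy_filled <- toy %>%
  mutate(actor_1_facebook_likes = replace_na(actor_1_facebook_likes, 0))

toy_dropped
toy_filled
```

The two choices have different consequences: dropping rows shrinks the dataset, while filling with 0 keeps every movie but treats "unknown" as "no likes", which pulls down averages.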
Find a dataset that is not built into R and is related to one of the following topics:
Load the data into R, make sure it is clean, and construct one interesting visualization of the data.
Note: this might help you brainstorm ideas for projects
Assignment 11 (EDA on Flights) due Sunday
Assignment 12 (1 exercise) due Tuesday
Brainstorm Activity due Friday
Midterm Revisions Part 2 due Friday
IV1 due next week