MSCS Happenings
For each problem I marked with an X,
Talk with others in the class; help each other understand the WHY.
Turn it in to me by next class.
UPDATE: You should have been notified of a shared PDF with feedback.
Talk through some of the stumbling blocks with your classmates. Take notes for yourself.
By the end of THIS week, submit an updated version of the Midterm Part 2 to Moodle and write a reflection about the midterm in your reflection Google Doc for March.
My Deal: You may talk to others in the class (not preceptors, not people who have previously taken it) but you may not directly share code with each other. Instead, talk about the actions more conceptually and point each other to resources.
replace_na and drop_na
More thorough notes available at https://bcheggeseth.github.io/112_spring_2023/data-import.html
Depending on the file type (csv, tsv, excel, Google sheet, stata file, shapefile, etc.), you’ll need to adjust the function you use. Here are some of the most common:
read_csv()
read_delim()
read_sheet()
st_read()
When working with data sets, you need to:
Absolute file path describes the location of a file from the root directory or folder, typically the user directory.
~ refers to the user root directory.
On Windows, the root directory is typically C:\
Example: A file called data.csv is located in the Assignment_08 folder in the Comp_Stat_112 folder on the Desktop.
~/Desktop/Comp_Stat_112/Assignment_08/data.csv
C:/Desktop/Comp_Stat_112/Assignment_08/data.csv
Relative file path describes the location of a file from the current working directory.
Example: A file called data.csv is located in a data folder within the Comp_Stat_112 folder on the Desktop.
If the working directory is ~/Desktop/Comp_Stat_112/Assignment_08/, the relative path is ../data/data.csv. The .. refers to the parent directory (go up one level to the folder containing Assignment_08).
If the working directory is ~/Desktop/Comp_Stat_112/, the relative path is data/data.csv.
If the working directory is ~/Desktop/Comp_Stat_112/data, the relative path is data.csv.
The best location to put a dataset is within a folder that is dedicated to the project or assignment.
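The relative paths above can be checked with a small sketch. It rebuilds the example folder layout inside a temporary directory (an assumption made so the code runs anywhere; your real folders would live on the Desktop) and confirms that the same file is reachable under a different relative path from each working directory. The object names are made up for illustration.

```r
# Recreate the example layout in a throwaway temp folder (hypothetical location)
root <- file.path(tempdir(), "Comp_Stat_112")
dir.create(file.path(root, "Assignment_08"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data"), showWarnings = FALSE)
writeLines(c("x,y", "1,2"), file.path(root, "data", "data.csv"))

# Working directory = Assignment_08: go up one level with "..", then into data/
old_wd <- setwd(file.path(root, "Assignment_08"))
found_from_assignment <- file.exists("../data/data.csv")

# Working directory = Comp_Stat_112: the data folder is directly below
setwd(root)
found_from_course <- file.exists("data/data.csv")

# Working directory = data: the file is right here
setwd(file.path(root, "data"))
found_from_data <- file.exists("data.csv")

# Restore the original working directory
setwd(old_wd)
```

In an Rmd file you usually do not call setwd() yourself; when you knit, the working directory defaults to the folder that contains the Rmd file, which is one reason to keep the data file next to it.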
Try downloading the csv file from:
https://bcheggeseth.github.io/112_spring_2023/data/imdb_5000_messy.csv
Right-click and Save As
Put the data file in Assignment_08 folder.
Create a new Rmd file called Data_Import.Rmd (save it to the same folder) and load the data in with read_csv().
Always look at the data after importing with View()
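As a rough sketch of the import-then-look step, the code below writes a tiny stand-in csv to a temporary folder and reads it back with read_csv(). The file name, column names, and values are invented for illustration; in the assignment you would instead pass the path to the downloaded file. glimpse() is used here as a non-interactive stand-in for View().

```r
library(readr)
library(dplyr)

# Write a tiny stand-in csv (hypothetical file and values, not the real IMDB data)
csv_path <- file.path(tempdir(), "movies_example.csv")
writeLines(
  c("movie_title,color,duration",
    "Movie A,Color,120",
    "Movie B,,95"),
  csv_path
)

# Read it in; empty cells are treated as NA by default
movies_example <- read_csv(csv_path, show_col_types = FALSE)

# Look at the data right after importing
glimpse(movies_example)
```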
Do a quick summary of all variables:
dataset_name %>% 
  mutate(across(where(is.character), as.factor)) %>% 
  summary()
Cleaning Categorical Variables
“Clean” data has consistent values in terms of spelling and capitalization.
How could we clean this up?
# A tibble: 6 × 2
  color               n
  <chr>           <int>
1 B&W                10
2 Black and White   199
3 color              30
4 Color            4755
5 COLOR              30
6 <NA>               19
Study the individual observations with NAs carefully.
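One possible way to clean up the color values, sketched on a toy tibble holding the six distinct values from the table (the name color_clean and the choice of "Black and White" and "Color" as the final labels are assumptions, not the only reasonable answer):

```r
library(dplyr)

# Toy data: one row per distinct value seen in the table above
colors_example <- tibble(
  color = c("B&W", "Black and White", "color", "Color", "COLOR", NA)
)

cleaned <- colors_example %>%
  mutate(color_clean = case_when(
    color %in% c("B&W", "Black and White") ~ "Black and White",  # merge the two spellings
    tolower(color) == "color"              ~ "Color",            # ignore capitalization
    TRUE                                   ~ NA_character_       # keep NAs as NA
  ))

# Check the result: only two categories (plus NA) should remain
cleaned %>% count(color_clean)
```

The missing values are deliberately left as NA here, so they can be studied before deciding what to do with them.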
Addressing Missing Data
You have several options for dealing with NAs (and they have different consequences):
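Remove observations (rows) with missing values (drop_na).
Remove variables (columns) with many missing values (select).
Replace missing values with another value (replace_na).
Let’s check to see how many values are missing per variable.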
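A minimal sketch of one way to count missing values per variable, using base R on a made-up data frame (the name toy and its values are illustrative only):

```r
# Toy data frame with a few missing values (hypothetical values)
toy <- data.frame(
  title = c("A", "B", "C"),
  gross = c(100, NA, NA),
  color = c("Color", NA, "Color")
)

# is.na() marks each cell TRUE/FALSE; colSums() adds up the TRUEs in each column
na_counts <- colSums(is.na(toy))
na_counts
```

On a full dataset, a call of this shape returns one count per column, printed as a named vector.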
                     ...1                     color             director_name 
                        0                        19                       104 
   num_critic_for_reviews                  duration   director_facebook_likes 
                       50                        15                       104 
   actor_3_facebook_likes              actor_2_name    actor_1_facebook_likes 
                       23                        13                         7 
                    gross                    genres              actor_1_name 
                      884                         0                         7 
              movie_title           num_voted_users cast_total_facebook_likes 
                        0                         0                         0 
             actor_3_name      facenumber_in_poster             plot_keywords 
                       23                        13                       153 
          movie_imdb_link      num_user_for_reviews                  language 
                        0                        21                        12 
                  country            content_rating                    budget 
                        5                       303                       492 
               title_year    actor_2_facebook_likes                imdb_score 
                      108                        13                         0 
             aspect_ratio      movie_facebook_likes 
                      329                         0 
Consider the actor_1_facebook_likes column. Take a look at a few of the records that have NA values. Why do you think there are NAs?
imdbMessy %>%
  filter(is.na(actor_1_facebook_likes)) %>%
  select(movie_title, actor_1_name, actor_1_facebook_likes) %>%
  head()
# A tibble: 6 × 3
  movie_title              actor_1_name actor_1_facebook_likes
  <chr>                    <chr>                         <dbl>
1 Pink Ribbons, Inc.       <NA>                             NA
2 Sex with Strangers       <NA>                             NA
3 The Harvest/La Cosecha   <NA>                             NA
4 Ayurveda: Art of Being   <NA>                             NA
5 The Brain That Sings     <NA>                             NA
6 The Blood of My Brother  <NA>                             NA
To remove observations (rows) that are missing actor_1_facebook_likes, use drop_na(actor_1_facebook_likes).
To replace missing values of actor_1_facebook_likes with 0, use replace_na(list(actor_1_facebook_likes = 0)).
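Both options can be sketched on a toy tibble. The data below are invented; only the column names match the IMDB dataset. Note the different consequences: dropping rows shrinks the dataset, while replacing with 0 keeps every row but treats "unknown" as "zero likes".

```r
library(dplyr)
library(tidyr)

# Toy data with one missing value (hypothetical values)
toy_movies <- tibble(
  movie_title            = c("A", "B", "C"),
  actor_1_facebook_likes = c(1000, NA, 250)
)

# Option 1: remove rows where actor_1_facebook_likes is NA
dropped <- toy_movies %>%
  drop_na(actor_1_facebook_likes)

# Option 2: keep all rows, replacing NA in that column with 0
filled <- toy_movies %>%
  replace_na(list(actor_1_facebook_likes = 0))
```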
Find a dataset that is not built into R and is related to one of the following topics:
Load the data into R, make sure it is clean, and construct one interesting visualization of the data.
Note: this might help you brainstorm ideas for projects
Only 1 exercise for Data Import [Assignment 8]
Midterm Revisions Part 2 due Friday
TT9 due Friday (data: Programming Languages)
IV1 due Friday
