library(tidyverse)
weather <- read_csv("../../data/raw/weather.csv")
2 File organization, GitHub
Settling In
Data Storytelling Moment
Debrief on course learning goals
In reviewing your reflections about our course learning goals, I noted the following:
- You want to become more confident working with complex, messy data
  - Don’t worry! We’ll have lots of practice.
- You want to improve your communication skills with individuals outside your field!
  - Consider every day in class as an opportunity to ask each other questions and practice explaining concepts to each other. We’ll do paired programming today!
  - Consider your audience; what priorities/knowledge do they bring?
  - We’ll continue with Data Storytelling Moments; consider the target audience for each…
- You want to learn new tools!
  - We’ll start learning GitHub today!
  - We’ll see SQL and how it relates to what you already know…
Learning goals
After this lesson, you should be able to:
- Set up an organized directory structure for data science projects
- Explain the difference between absolute and relative file paths and why relative file paths are preferred for reading in data
- Construct relative file paths to read in data
- Use the following git verbs (via GitHub.com and GitHub Desktop): clone, add, commit, push, pull
File organization
For a data science class
If you haven’t already, I suggest that you create a series of nested folders or directories on your computer to organize your work for this class. Here is a suggested directory structure for this data science class (Sub-bullets indicate folders that are inside other folders.):
- Documents or Desktop
  - STAT212 or COMP212 or DS212
    - activities
    - homeworks
    - course_project
    - website
For a data science project
At minimum, a data science project should have a code, data, and results folder. Not having these folders and mixing code, data, and results files all in one folder can quickly get hard to navigate for even small projects.
- Documents (This should be some place you can find easily through your Finder (Mac) or File Explorer (Windows).)
  - descriptive_project_name
    - code: All code files (.R, .Rmd, .qmd) should go here. Recommendation:
      - raw: For messy code that you’re actively working on or used for exploration
        - explore_visualizations.qmd: for exploratory plots
        - explore_modeling.qmd: for any statistical or predictive modeling
      - clean: For code that you have cleaned up, documented, organized, and tested to run as expected
        - cleaning.qmd: for data acquisition and wrangling. Save (write) the cleaned dataset at the end of this file with readr::write_csv().
        - final_visualizations.qmd: for final plots
        - final_modeling.qmd: for any statistical or predictive modeling
    - data: All data files go here (raw and cleaned versions)
      - raw: Original data that hasn’t been cleaned (this might involve large files that can’t be pushed to GitHub)
      - clean: Any non-original data that has been processed in some way
    - results: e.g., written narratives, plots saved as images, results tables
      - report.qmd: for written narrative
      - figures: Saved plots (e.g., png, jpeg, svg, pdf, tiff) from using ggsave() that will be used in communicating your project conclusions should go here. (Using screenshots of output in RStudio is not a good practice.)
      - tables: Any sort of plain text file results (e.g., CSVs)
      - interactive: for interactive shiny apps
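One way to set up this folder skeleton from R instead of clicking through Finder or File Explorer is sketched below. It is only an illustration: the project name is hypothetical, and the folders are created inside a temporary directory so the example is safe to run. To build the structure for real, swap tempdir() for the path to your own Documents folder.

```r
# Hypothetical project name; replace with your own descriptive name.
# tempdir() is used here only so the example does not touch your real files.
project <- file.path(tempdir(), "descriptive_project_name")

# Sub-folders taken from the suggested layout.
folders <- c(
  "code/raw", "code/clean",
  "data/raw", "data/clean",
  "results/figures", "results/tables"
)

# recursive = TRUE also creates the parent folders (code, data, results).
for (f in folders) {
  dir.create(file.path(project, f), recursive = TRUE, showWarnings = FALSE)
}

# List the folders that now exist, relative to the project root.
list.dirs(project, full.names = FALSE, recursive = TRUE)
```

Creating the folders once at the start of a project means every later file has an obvious home.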
File paths
A file path is a text string that tells a computer how to navigate from one location to another. We use file paths to read in (and write out) data.
Essentially, file paths are what go inside read_csv().
There are two types of paths: absolute and relative.
Absolute file paths start at the “root” directory in a computer system.
For reading in data, absolute paths are not a good idea because if the code file is shared, the path will not work on a different computer.
Relative file paths start wherever you are right now (the working directory (WD)). The working directory when you’re working in a code file (.Rmd, .qmd) may be different from the working directory in the Console.
For reading in data, relative paths are preferred because if the project directory structure is used on a different computer, the relative paths will still work.
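The difference between the two kinds of paths can be seen in a small base R sketch. Everything here is an assumption made for illustration: the project folder is rebuilt inside a temporary directory, the CSV is a tiny made-up stand-in for real data, and setwd() is used only to mimic working from a file stored in code/clean.

```r
# Rebuild part of the suggested layout in a temporary folder
# (hypothetical project name) so the example is safe to run anywhere.
project <- file.path(tempdir(), "descriptive_project_name")
dir.create(file.path(project, "code", "clean"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data", "raw"), recursive = TRUE, showWarnings = FALSE)

# A tiny made-up stand-in for a raw data file.
write.csv(data.frame(Month = 1, Day = 1),
          file.path(project, "data", "raw", "weather.csv"),
          row.names = FALSE)

# Mimic working from a code file stored in code/clean.
old_wd <- setwd(file.path(project, "code", "clean"))

# Relative path: up two levels (code/clean -> code -> project),
# then down into data/raw.
rel_path <- "../../data/raw/weather.csv"
rel_ok <- file.exists(rel_path)

# The absolute path this resolves to on *this* computer. It starts at the
# root directory and would look different on someone else's machine.
abs_path <- normalizePath(rel_path)
abs_path

weather_demo <- read.csv(rel_path)

setwd(old_wd)  # restore the original working directory
```

The relative path stays the same no matter where the project folder is stored, while the printed absolute path changes from computer to computer.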
More information available in file organization notes.
Pair Programming
In this class, we will practice pair programming on some days.
Pair programming is a software development technique in which two programmers work together at one computer.
- One person (the “driver”) writes code while
- The other person (the “navigator”) reviews each line of code as it is typed.
- The two programmers switch roles frequently.
For more information, read Best Practices for Pair Programming.
Benefits
- Fewer code errors
- Better code quality
- Faster learning
- Communication growth
Roles
- Driver: The person who is in charge of operating the computer (keyboard/mouse) and in charge of the details of code
- Navigator: The person who is thinking about next steps and carefully reviewing and guiding what the Driver is doing
You’ll only type code on the first Driver’s computer today. Put the other computer to the side and use it only as a reference.
GitHub Setup
For more information, see the following resources:
Exercise Instructions
Driver:
- Go to https://github.com. Log in with your username.
- Click on Repositories. Click on New. Name your repository file_org. Click Create Repository.
- Go to Settings. Click on Collaborators. Add your partner (the Navigator) as a collaborator. They will get an email inviting them to this repository (repo for short). They’ll have to accept the invitation by opening their email.
- Clone the repository to your computer using GitHub Desktop (click Set up in Desktop). Set the Local Path to your activities folder in your class folder (called STAT212 or COMP212 or DS212). The folder called file_org will be empty!
Switch roles (Driver becomes Navigator, Navigator becomes Driver).
New Driver:
- Download this Zip file and save it to the new file_org folder. Unzip the file.
- Go to GitHub Desktop and notice that the files appear there (this automatically adds files to be followed/tracked).
- Click on the Changes tab. Write a summary and description of the changes (e.g., Added initial files). Click Commit to main.
- Click on the Publish button to push the changes to the remote repository on GitHub.com. From now on, that button will say Push origin instead of Publish.
- Open https://github.com and go to your file_org repository. You should see the files there online.
Exercises
Switch roles (Driver becomes Navigator, Navigator becomes Driver).
Open the file_org_activity folder and navigate to the code/clean/cleaning.qmd file. Open it in RStudio. Follow the instructions in that file.
Exercise goals:
- Practice using relative paths in a realistic project context.
- Review data wrangling from 112.
- Practice using GitHub.
- Practice keyboard shortcuts.
Every 5 minutes, switch roles.
Solutions
- Load packages and read in data.
Solutions
- Clean the PrecipYr variable by replacing 99999 with NA.
Solutions
weather_clean <- weather %>%
  mutate(PrecipYr = na_if(PrecipYr, 99999))
- Add a dateInYear variable.
Solutions
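# Option 1: sort by date, then number the rows
# (assumes exactly one row per day of a 365-day year)
weather_clean <- weather_clean %>%
  arrange(Month, Day) %>%
  mutate(dateInYear = 1:365)
# Option 2: parse the date string and extract the day of the year
weather_clean <- weather_clean %>%
  mutate(dateInYear = yday(mdy(date)))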
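To see what Option 2 computes, here is a minimal sketch on three made-up date strings. These values are illustrative only and are not taken from weather.csv; the sketch assumes the lubridate package is installed and that dates are written in month/day/year order, as the use of mdy() in the solution suggests.

```r
library(lubridate)

# Made-up date strings in month/day/year order (not from weather.csv).
date <- c("1/1/2015", "3/1/2015", "12/31/2015")

# mdy() parses the strings into Date objects;
# yday() then returns each date's position within its year.
parsed <- mdy(date)
dateInYear <- yday(parsed)
dateInYear
```

Because 2015 is not a leap year, March 1 is day 60 and December 31 is day 365. In a leap year every date after February 29 shifts by one, so Options 1 and 2 agree only when the data hold exactly one row per day of a 365-day year.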
- Add in 3-letter month abbreviations.
Solutions
# Option 1: via joins
# Build a lookup table with one row per month
months <- tibble(
  Month = 1:12,
  month_name = month.abb
)
# left_join() matches rows on the shared Month column
weather_clean <- weather_clean %>%
  left_join(months)
# Option 2: via vector subsetting
weather %>%
  mutate(month_name = month.abb[Month]) %>% head()
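The vector subsetting in Option 2 can be tried on its own in base R. The month numbers below are made up to stand in for the Month column; month.abb is a constant that ships with R.

```r
# Made-up month numbers standing in for the Month column.
Month <- c(1, 1, 3, 12)

# month.abb is a built-in character vector of length 12 ("Jan", ..., "Dec").
# Indexing it with Month picks one abbreviation per element of Month.
month_name <- month.abb[Month]
month_name
```

This works because each month number doubles as a position in month.abb, so no lookup table or join is needed.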
- Write out clean data to a CSV file.
Solutions
write_csv(weather_clean, file = "../../data/clean/weather_clean.csv")
After Class
- Take a look at the Solutions at the end of the day’s activity on the course website.
- Take a look at the Schedule page to see how to prepare for the next class (readings + checkpoints)
- Post an introduction of yourself in the #introductions channel on Slack to see who you connect with and who might be a potential good project partner.
- Work on Homework 1; there are many separate tasks so start now.
- Practice the keyboard shortcuts described here.