2  File organization, GitHub

Settling In

Choose a card. Sit at that table today.

Introduce yourself

  • Name, pronouns
  • Macalester connections (e.g., majors/minors/concentrations, clubs, teams, events regularly attended)
  • How are you feeling about the first week?
  • What do you still keep with you from your childhood?

Data Storytelling Moment

Go to https://rhythm-of-food.net/#explore-foods

  • What is the data story?
  • What is effective?
  • What could be improved?

Debrief on course learning goals

In reviewing your reflections about our course learning goals, I noted the following:

  • You want to become more confident working with complex, messy data
    • Don’t worry! We’ll have lots of practice.
  • You want to improve your skills in communicating with individuals outside your field!
    • Consider every day in class as an opportunity to ask each other questions and practice explaining concepts to each other. We’ll do pair programming today!
    • Consider your audience; what priorities/knowledge do they bring?
    • We’ll continue with Data Storytelling Moments; consider the target audience for each…
  • You want to learn new tools!
    • We’ll start learning GitHub today!
    • We’ll see SQL and how it relates to what you already know…





Learning goals

After this lesson, you should be able to:

  • Set up an organized directory structure for data science projects
  • Explain the difference between absolute and relative file paths and why relative file paths are preferred for reading in data
  • Construct relative file paths to read in data
  • Use the following Git verbs (via GitHub.com and GitHub Desktop): clone, add, commit, push, pull





File organization

For a data science class

If you haven’t already, I suggest that you create a series of nested folders (directories) on your computer to organize your work for this class. Here is a suggested directory structure for this data science class (sub-bullets indicate folders that are inside other folders):

  • Documents or Desktop
    • STAT212 or COMP212 or DS212
      • activities
      • homeworks
      • course_project
      • website



For a data science project

At minimum, a data science project should have a code, data, and results folder. Mixing code, data, and results files in a single folder quickly becomes hard to navigate, even for small projects.

  • Documents (This should be some place you can find easily through your Finder (Mac) or File Explorer (Windows).)
    • descriptive_project_name
      • code: All code files (.R, .Rmd, .qmd) should go here. Recommendation:
        • raw: For messy code that you’re actively working on or used for exploration
          • explore_visualizations.qmd: for exploratory plots
          • explore_modeling.qmd: for any statistical or predictive modeling
        • clean: For code that you have cleaned up, documented, organized, and tested to run as expected
          • cleaning.qmd: for data acquisition and wrangling. Save (write) the cleaned dataset at the end of this file with readr::write_csv().
          • final_visualizations.qmd: for final plots
          • final_modeling.qmd: for any statistical or predictive modeling
      • data: All data files go here (raw and cleaned versions)
        • raw: Original data that hasn’t been cleaned (this might involve large files that can’t be pushed to GitHub)
        • clean: Any non-original data that has been processed in some way
      • results: e.g., written narratives, plots saved as images, results tables
        • report.qmd: for written narrative
        • figures: Plots saved with ggsave() (e.g., png, jpeg, svg, pdf, tiff) that you will use to communicate your project conclusions should go here. (Using screenshots of output in RStudio is not a good practice.)
        • tables: Any sort of plain text file results (e.g., CSVs)
        • interactive: for interactive shiny apps
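The folder structure above can also be created from R instead of by hand. Here is a minimal sketch, not a required step. It builds the folders inside a temporary directory (tempdir()) so that nothing in your real Documents folder is touched, and descriptive_project_name is just the placeholder name from the list above. For a real project, you would point project_root at a folder you can find easily.

```r
# Assumption: we build inside tempdir() purely for illustration.
project_root <- file.path(tempdir(), "descriptive_project_name")

# Subfolders from the suggested project structure
folders <- c(
  "code/raw", "code/clean",
  "data/raw", "data/clean",
  "results/figures", "results/tables", "results/interactive"
)

# recursive = TRUE also creates the parent folders (code, data, results)
for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root
list.dirs(project_root, full.names = FALSE)
```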


File paths

What are file paths?

A file path is a text string that tells a computer how to navigate from one location to another. We use file paths to read in (and write out) data.

Essentially, file paths are what go inside read_csv().

There are two types of paths: absolute and relative.



Absolute file paths start at the “root” directory in a computer system.

DON’T use absolute paths

For reading in data, absolute paths are not a good idea: if the code file is shared, the path will not work on a different computer.
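To make this concrete, here is a small sketch of what absolute paths tend to look like. The helper is_absolute_path() is hypothetical (it is not part of base R or the tidyverse) and only checks a few common prefixes. The example paths and user names are made up for illustration.

```r
# Hypothetical helper: TRUE if a path starts at a root instead of the working directory.
# It only recognizes "/" (Mac/Linux root), "~" (home shortcut), and "C:"-style drive letters.
is_absolute_path <- function(path) {
  grepl("^(/|~|[A-Za-z]:)", path)
}

example_paths <- c(
  "/Users/alice/Documents/weather_project/data/raw/weather.csv",  # made-up Mac path
  "C:/Users/bob/Desktop/weather_project/data/raw/weather.csv",    # made-up Windows path
  "../../data/raw/weather.csv"                                    # relative path
)

path_is_absolute <- is_absolute_path(example_paths)
path_is_absolute
```

The first two paths contain a user name and a folder layout that exist on one computer only, which is why they break as soon as the code is shared.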



Relative file paths start wherever you are right now: the working directory (WD). Note that the working directory when you’re working in a code file (.Rmd, .qmd) may be different from the working directory in the Console.
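One way to see this is to ask R where it currently is with getwd() and then check where a relative path leads. The sketch below uses a throwaway folder inside tempdir() (the name wd_demo_project is hypothetical). It temporarily moves the working directory into code/clean to imitate working from a code file saved in that folder.

```r
# Assumption: a throwaway project skeleton in a temporary folder.
project <- file.path(tempdir(), "wd_demo_project")
dir.create(file.path(project, "code", "clean"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data", "raw"), recursive = TRUE, showWarnings = FALSE)

# Move into code/clean; setwd() returns the previous working directory.
old_wd <- setwd(file.path(project, "code", "clean"))
getwd()

# ".." means "go up one folder", so "../.." climbs from code/clean to the project root,
# and "../../data/raw" then goes down into data/raw.
resolved <- normalizePath(file.path("..", "..", "data", "raw"))
resolved

# Put the working directory back where it was.
setwd(old_wd)
```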

DO use relative paths

For reading in data, relative paths are preferred: as long as the project directory structure is kept the same on a different computer, the relative paths will still work.
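Here is a rough sketch of why that works. It imitates two different computers by building the same project structure in two different places inside tempdir(). The folder names (alice, bob, weather_project) and the tiny dataset are made up. Base R’s read.csv() and write.csv() are used so the sketch needs no packages; read_csv() would take the same relative path.

```r
# Build the same project structure under a given root and add a small raw data file.
make_project <- function(root) {
  dir.create(file.path(root, "code", "clean"), recursive = TRUE, showWarnings = FALSE)
  dir.create(file.path(root, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
  toy_weather <- data.frame(Month = c(1, 1), Day = c(1, 2), PrecipYr = c(0, 0.1))
  write.csv(toy_weather, file.path(root, "data", "raw", "weather.csv"), row.names = FALSE)
  root
}

# Read the data as if we were working from code/clean, using ONLY a relative path.
read_from_code_folder <- function(root) {
  old_wd <- setwd(file.path(root, "code", "clean"))
  on.exit(setwd(old_wd))  # restore the working directory afterwards
  read.csv("../../data/raw/weather.csv")
}

# Two "computers" whose absolute locations differ
computer_a <- make_project(file.path(tempdir(), "alice", "Documents", "weather_project"))
computer_b <- make_project(file.path(tempdir(), "bob", "Desktop", "weather_project"))

from_a <- read_from_code_folder(computer_a)
from_b <- read_from_code_folder(computer_b)
identical(from_a, from_b)
```

The absolute locations of the two projects differ, but the same relative path reads the data in both because the structure inside the project folder is the same.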

More information available in file organization notes.

Pair Programming

In this class, we will practice pair programming on some days.

Pair programming is a software development technique in which two programmers work together at one computer.

  • One person (the “driver”) writes code while
  • The other person (the “navigator”) reviews each line of code as it is typed.
  • The two programmers switch roles frequently.

For more information, read Best Practices for Pair Programming.

Benefits

  • Fewer code errors
  • Better code quality
  • Faster learning
  • Communication growth

Roles

  • Driver: The person who is in charge of operating the computer (keyboard/mouse) and in charge of the details of code
  • Navigator: The person who is thinking about next steps and carefully reviewing and guiding what the Driver is doing

You’ll only type code on the first Driver’s computer today. Put the other computer to the side and use it only as a reference.

GitHub Setup

For more information, see the following resources:

Exercise Instructions

Driver:

  • Go to https://github.com. Log in with your username.
  • Click on Repositories. Click on New. Name your repository file_org. Click Create Repository.
  • Go to settings. Click on Collaborators. Add your partner (the Navigator) as a collaborator. They will get an email inviting them to this repository (repo for short). They’ll have to accept the invitation by opening their email.
  • Clone the repository to your computer using GitHub Desktop (click Set up in Desktop). Set the Local Path to the activities folder in your class folder (called STAT212 or COMP212 or DS212). The folder called file_org will be empty!

Switch roles (Driver becomes Navigator, Navigator becomes Driver).

New Driver:

  • Download this Zip file and save it to the new file_org folder. Unzip the file.
  • Go to GitHub Desktop and notice that the files appear there (this automatically adds files to be followed/tracked).
  • Click on the Changes tab. Write a summary and description of the changes (e.g., Added initial files). Click Commit to main.
  • Click on the Publish button to push the changes to the remote repository on GitHub.com. From now on, that button will say Push origin instead of Publish.
  • Open https://github.com and go to your file_org repository. You should see the files there online.

Exercises

Switch roles (Driver becomes Navigator, Navigator becomes Driver).

Open the file_org_activity folder and navigate to the code/clean/cleaning.qmd file. Open it in RStudio. Follow the instructions in that file.

Exercise goals:

  • Practice using relative paths in a realistic project context.
  • Review data wrangling from 112.
  • Practice using GitHub.
  • Practice keyboard shortcuts.

Every 5 minutes, switch roles.

Solutions

  1. Load packages and read in data.
library(tidyverse)
# The path is relative to code/clean/, where cleaning.qmd lives
weather <- read_csv("../../data/raw/weather.csv")
  2. Clean the PrecipYr variable by replacing 99999 with NA.
weather_clean <- weather %>%
    mutate(PrecipYr = na_if(PrecipYr, 99999))
  3. Add dateInYear variable.
# Option 1: sort by date, then number the rows (requires exactly 365 rows, one per day)
weather_clean <- weather_clean %>%
    arrange(Month, Day) %>%
    mutate(dateInYear = 1:365)
# Option 2: yday() and mdy() come from lubridate (attached by tidyverse 2.0.0 and later)
weather_clean <- weather_clean %>%
    mutate(dateInYear = yday(mdy(date)))
  4. Add in 3-letter month abbreviations.
# Option 1: via joins
months <- tibble(
    Month = 1:12,
    month_name = month.abb
)
weather_clean <- weather_clean %>%
    left_join(months, by = "Month")

# Option 2: via vector subsetting (previewed with head(), not saved)
weather %>%
    mutate(month_name = month.abb[Month]) %>% head()
  5. Write out clean data to a CSV file.
write_csv(weather_clean, file = "../../data/clean/weather_clean.csv")
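If the vector subsetting option feels mysterious, this toy sketch shows the idea on three made-up rows. It uses only base R so it runs without any packages. The data frame toy is invented for illustration, and the base R replacement of 99999 stands in for na_if() from the solutions above.

```r
# Made-up rows; 99999 plays the role of the "missing" code in PrecipYr
toy <- data.frame(
  Month = c(1, 7, 12),
  Day = c(15, 4, 31),
  PrecipYr = c(0.5, 99999, 30.2)
)

# Base R stand-in for mutate(PrecipYr = na_if(PrecipYr, 99999))
toy$PrecipYr[toy$PrecipYr == 99999] <- NA

# month.abb is a built-in vector of 12 abbreviations ("Jan", ..., "Dec").
# Indexing it with the month numbers picks out one abbreviation per row.
toy$month_name <- month.abb[toy$Month]

toy
```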





After Class

  • Take a look at the Solutions at the end of the day’s activity on the course website.
  • Take a look at the Schedule page to see how to prepare for the next class (readings + checkpoints)
  • Post an introduction of yourself in the #introductions channel on Slack to see who you connect with and who might be a potential good project partner.
  • Work on Homework 1; there are many separate tasks so start now.
  • Practice the keyboard shortcuts described here.