2 RStudio workshop
RStudio warm-up (together)
- Continue touring RStudio and its basic functionality
- Learn about a reproducible workflow in RStudio
Exercises, aka Homework 1 (in groups)
- Tidy data
- Basic features of tidy data
Watch some videos prepared by Lisa Lendway, a Mac alum, former Mac prof who helped shape this course, and R/RStudio whiz:
- Setting up for success in COMP/STAT 112 (Lisa Lendway)
Read:
2.1 Warm-up
Background
What’s the point of this course?
Build knowledge from data within a particular domain of inquiry, and particular contexts.
Why will we use R/RStudio as a tool in this course?
It’s open access (free!), open source (anyone can contribute to its development), and used broadly. Just some examples where R is used:
- Logan Pratico: making “eviction data accessible to the legal aid community”
- Ahmadou Dicko: humanitarians creating “life saving data products”
- Shelmith Kariuki: Kenyan government census
- Laura DeCicco: U.S. Geological Survey (USGS) discovery and retrieval of hydrologic data
- Nick Snellgrove & Uli Muellner: studying aquatic invasive species in MN
RStudio layout
Last class, we spent most of our time in the console:
Console is good for… | Console is bad for most everything else, including… |
---|---|
quick calculations | documenting our work |
trying out code | editing our work |
pulling up help files | communicating our work |
 | being able to reproduce our work |
Quarto
Reproducibility with Quarto
It’s important to document and communicate every step in a data analysis (e.g. data collection, cleaning, and analysis), so that others and ourselves can reproduce (hence verify and build upon) our work.
RStudio includes tools for creating reproducible and lovely documents, webpages, books (like this online manual!), etc that allow us to interleave text, code, output, images, tables, etc:
- Quarto = A technology that incorporates code from many programming languages including R along with styled text (e.g. headers, bold, italics, links) using markdown, and is integrated into RStudio. If you’ve used R Markdown, it looks very similar.
Quarto EXAMPLE
Download and open this Quarto document in RStudio, i.e. a .qmd file.
Follow the prompts therein.
NOTE: This document explores the basics. We’ll pick up more details as we go, often by making and learning from mistakes. The Quarto cheatsheets at the top of this activity present more features of Quarto.
R/RStudio hot tips
.qmd vs console
The console does not communicate with Quarto.
Things you define or type in the console are NOT defined, stored, or run in the .qmd.
Quarto can communicate with the console, but only if you tell it to.
- YES: If you run a chunk inside your .qmd (by clicking the green arrow), it is also run and stored in the console.
- NO: If you render your .qmd to html but do not also run the chunks inside the .qmd, the results will be displayed in the html but not run or stored in the console.
COMMENTING CODE: code = communication
Leaving short notes about what our code is doing is an important aspect of communication. It reminds our future selves, and communicates to others, what our thought and code process was (hence is important to reproducibility!). In our R chunks, we will write comments after a pound sign: # This is my comment
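As a small illustration of commenting, here is a sketch of a commented R chunk. The object names and values (side_a, side_b, hypotenuse) are made up for this example and are not part of the homework.

```r
# Store the two shorter side lengths of a right triangle (made-up values)
side_a <- 3
side_b <- 4

# Use the Pythagorean theorem to calculate the longest side
hypotenuse <- sqrt(side_a^2 + side_b^2)

# Print the result
hypotenuse
```

Notice that each comment says what the step is for, not just what the code literally does.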
STYLE GUIDES: code = communication
This emphasis on communication is not specific to this class; it is a general expectation. Further, the code structure we’ll use this semester reflects common practice, but it is not the only practice. Various companies / entities have their own R “style guides”:
- https://style.tidyverse.org/ (e.g. R developers do not like CamelCase)
- https://google.github.io/styleguide/Rguide.html (e.g. Google likes CamelCase!)
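To make the contrast concrete, here is a rough sketch of the two naming styles. The names mean_height and CalculateMean are hypothetical, chosen only to illustrate snake_case versus CamelCase.

```r
# tidyverse style: snake_case (lowercase words separated by underscores)
mean_height <- mean(c(150, 160, 170))

# Google style: CamelCase for function names
CalculateMean <- function(values) {
  mean(values)
}

# Both approaches work; what matters is being consistent
CalculateMean(c(150, 160, 170))
```

R runs either style equally well, so the choice is about readability for the people you work with.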
2.2 Exercises (HW1)
Goals
The following exercises will be due as homework 1. You should work on the exercises in groups, but write up your own work. The goals are to:
- explore tidy data…
- in RStudio…
- and communicate your work in a reproducible Quarto document.
Directions
- Be kind to yourself. Making (and learning from) mistakes is expected and important.
- Collaboration
- Actively contribute to discussion.
- Actively include all other group members in discussion.
- Create a space where others feel comfortable making mistakes & sharing ideas.
- Stay in sync while respecting that everybody has different learning & work rhythms.
- Growth
- If you are new to R: I hope you leave class today feeling confident about the purpose and general structure of Quarto documents.
- If you are familiar with R: I hope you think more deeply about concepts you might have taken for granted in the past, support those new to R in your group (explaining ideas deepens your own understanding and retention), and challenge yourself. There are optional challenge exercises at the end.
- Ask questions
We will not discuss these exercises as a class. Your group should ask me questions as I walk around the room.
Getting started
For most homeworks and in-class activities, I’ll provide a .qmd template. However, it’s also important to practice starting your own .qmd from scratch. You’ll do that here. Before starting the exercises, take the following steps:
Open a new .qmd file, starting in the top toolbar: File > New File > Quarto Document…
Replace the top “yaml” with the following. Don’t change ANYTHING other than the author line. The toc and toc-depth lines will add a nice floating table of contents (toc) to your html. The embed-resources line ensures that when you send your rendered html file to others, they are able to view all of your code and output – definitely do NOT remove this.
---
title: "Homework 1"
author: "PUT YOUR NAME HERE"
format:
  html:
    toc: true
    toc-depth: 1
    embed-resources: true
---
Below the yaml section, add section headers for each homework exercise:
# Exercise 1: Warming up
# Exercise 2: Import tidy data
# Exercise 3: Tidy data properties
# Exercise 4: Get to know the data
# Exercise 5: Data structure
# Exercise 6: Your turn
# Exercise 7: Brainstorm
Render the document to html. This will require that you name and save your document:
- If using RStudio desktop: Save this to the “DS 112 > Homework” folder you created as “homework_1”
- If using Mac’s server: Save this to the “DS 112 > Homework” folder you created as “homework_1”. You’ll later export this to your computer, but right now it’s living on the server.
Put your answers to each exercise below under the appropriate # Exercise section. You do not need to write out the question/prompt itself.
Exercise 1: Warming up
Below the “Exercise 1” section header:
- Write 1 sentence about one of your favorite foods at Cafe Mac. Make sure to include an italicized word and a bold word.
- Show a .png image of the food from the web. In Google, you can add filetype:png to the beginning of your search term, click on the photo you want, and copy the image address.
- Render the document again. Make sure it looks like you want it to.
Exercise 2: Import tidy data
You filled out a quick survey before class. Let’s work with this data! The first step to working with data in RStudio is getting it in there! How we do this depends on:
- file format (eg: .xls Excel spreadsheet, .csv, .txt)
- storage locations (eg: online, on your desktop, built into RStudio itself).
Our data is stored as a .csv file online. Within a new R chunk, import and store this data as survey. (Take note of the function name and the argument it takes.)
# Import the data
survey <- read.csv("https://mac-stat.github.io/data/112_fall_2024_survey.csv")
Note that nothing new appears in your document after you import the data. This is because you stored, but didn’t print, the data. And we don’t want to print the data in our .qmd – it would be too messy. There are 2 quick ways to check out the entire data table to get a sense of its structure and contents. Try them:
- Type View(survey) in the console.
- In the Environment tab (upper right pane), click on survey.
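The difference between storing and printing can be seen on a tiny example. This is only a sketch: toy_csv and toy are made-up names, and the data is supplied as text through the text argument of read.csv() instead of a URL so that it runs without an internet connection.

```r
# A tiny made-up csv, written out as text (illustration only)
toy_csv <- "name,age
Ana,20
Ben,22"

# Storing: the data is saved as toy, but nothing is displayed
toy <- read.csv(text = toy_csv)

# Printing: typing the name of the object displays it
toy
```

With only two rows, printing is harmless here. For a full dataset like the survey, storing without printing keeps the rendered document tidy.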
Exercise 3: Tidy data properties
Write your answers to these questions/prompts in a bulleted list (-), in the order they’re presented.
- What are the units of observation? That is, what does each row represent?
- Name one quantitative variable (column) in the dataset.
- Name one categorical variable (column) in the dataset.
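If these terms are new, a toy example on different data may help. The pets data below is entirely made up for illustration. Each row is one pet (the unit of observation), weight_kg is a quantitative variable, and species is a categorical variable.

```r
# Made-up tidy data: one row per pet, one column per variable
pets <- data.frame(
  name = c("Milo", "Luna", "Otis"),
  species = c("cat", "dog", "cat"),
  weight_kg = c(4.2, 18.5, 5.1)
)

# Print the small table to see the rows and columns
pets
```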
Exercise 4: Get to know the data
Before we can learn anything from our data, we must understand its structure. One at a time for each function below: try it out and write a short comment/note about what the function does. To make for easier recall later, try to connect your comment on what the function does to how it’s named.
# Replace this with a comment on what dim() does
dim(survey)
# Replace this with a comment on what nrow() does
nrow(survey)
# Replace this with a comment on what head() does
head(survey)
# Replace this with a comment on what head(___, 3) does
head(survey, 3)
# Replace this with a comment on what tail() does
tail(survey)
# Replace this with a comment on what names() does
names(survey)
Exercise 5: Data structure
It’s important that we understand the different types or structures of the objects we store. It will inform what types of analyses are appropriate, and the appropriate R code for these analyses. The class() function is important here. Check out an example:
x <- 3
class(x)
[1] "numeric"
y <- "pizza"
class(y)
[1] "character"
There are various object classes, including: num or numeric, int or integer, chr or character, factor, data.frame. Complete the chunk below to explore the classes / structure of our survey data and the variables within the survey data:
# Obtain the overall class of the survey object
# Examine the structure of each variable within survey (including class)
# Just take note of what information we gain here (no need to write more)
str(survey)
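Two of the classes listed above, integer and factor, did not appear in the earlier class() example. Here is a brief sketch using made-up objects (count and size are hypothetical names, not part of the survey data):

```r
# The L suffix tells R to store a whole number as an integer
count <- 3L
class(count)

# factor() stores text as a set of categories, called levels
size <- factor(c("small", "large", "small"))
class(size)

# List the distinct categories in the factor
levels(size)
```

Try running these lines and compare the results to the numeric and character examples above.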
Exercise 6: Your turn
Let’s practice these same ideas using data on World Cup football/soccer:
https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-29/worldcups.csv
Data is only useful if we know what it’s measuring! You can find a codebook, i.e. a document that describes the data, here. Address each prompt below using R functions. Include both the # prompt and your code in the chunk.
# Import and name the dataset (you pick a name!)
# Print the first 6 rows of the dataset
# How many world cup competitions do we have data on? And how many measurements do we have on each competition?
# Get a list of all variable names in the dataset
# Display the class and other information for each variable in the dataset
Exercise 7: Brainstorm
We’ve just scratched the surface. In a bulleted list (-), write out 3 questions about the World Cup that we might answer using these data. Be creative – these don’t have to be questions we’ve learned how to answer yet.
Finalize your homework
Render your qmd one more time. Confirm that it appears as you expect it and that it’s correctly formatted. If the formatting is amiss, we can’t grade it :/
If you’re working on Mac’s RStudio server, you have one more step that you should take at the end of each activity / assignment: export your files to your computer. To do so:
- Go to the Files tab in the lower right pane.
- Click the boxes next to the two homework files: homework_1.qmd and homework_1.html.
- Still within the Files tab, click on the “More” button that has a gear symbol next to it.
- Click “Export” then “Download”.
- The files were likely exported from the RStudio server to the Downloads folder on your computer. It’s important to now move them to the “DS 112 > Homework” folder that you created at the beginning of class. They are now there for safe keeping :)
Submit your html file to the Homework 1 assignment on Moodle. Do NOT submit a .qmd or pdf or any other file type – we will not be able to grade them.
You’re done with Homework 1. Congrats!!
If you finish early: dig deeper
Here are some things to think about if you finish early. In deciding what to think about, reflect upon the ways you’re hoping to grow this semester. This will vary from student to student, based on current R experience, post-graduate goals, interests, etc.
- Thinking about communication (including aesthetics!)
- Check out other features of Quarto, shown in the Quarto Start up Guide linked at the top of this activity.
- Check out the different themes or ways we might style a Quarto document.
- In the Data Science and adjacent professions, having an online portfolio that includes some sample projects (small or big) is expected. Check out this gallery of Quarto websites (and other documents) and learn about how to build Quarto websites right from RStudio. No need to build anything right now, just plant the seed!
- Thinking about data
The World Cup data came from TidyTuesday, a “weekly social data project”. A community of R users from around the globe digs into a different dataset each week and shares the results in various channels (YouTube, social media, etc). You’ll be picking and analyzing TidyTuesday data throughout the semester.
- Check out the repository of datasets at https://github.com/rfordatascience/tidytuesday. In the DataSets section, click on the year and then scroll down to a table of datasets posted that year.
- Pick a dataset of interest, import this into R, and play around! Be creative.