1 Welcome
1.1 Introductions
Instructor
Who I am
Prof. Brianna Heggeseth (she/her)
[bree-AH-na] (Anna like in Frozen) [HEG-eh-seth]
Where I’ve been
1.2 Background
Data
The word data often brings spreadsheets to mind, like this one on penguins.
In this and every tidy data set:
- each row = a unit of observation (here, a penguin)
- each column = a measure on some variable of interest, either quantitative (numbers with units) or categorical (discrete possibilities or categories)
- each entry contains a single data value; no analysis, summaries, footnotes, comments, etc, and only one value per cell
# A tibble: 5 × 6
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Biscoe 37.8 18.3 174 3400
2 Adelie Dream 39.5 16.7 178 3250
3 Adelie Torgersen 39.1 18.7 181 3750
4 Chinstrap Dream 46.5 17.9 192 3500
5 Gentoo Biscoe 46.1 13.2 211 4500
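Once you are comfortable in RStudio, here is a minimal sketch of how the five rows above could be typed in by hand using base R's data.frame(). The object name penguins_small is made up for illustration; the values are copied from the table above.

```r
# Re-enter the five penguins shown above, one vector per column.
# "penguins_small" is a hypothetical name chosen for this sketch.
penguins_small <- data.frame(
  species = factor(c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo")),
  island = factor(c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe")),
  bill_length_mm = c(37.8, 39.5, 39.1, 46.5, 46.1),
  bill_depth_mm = c(18.3, 16.7, 18.7, 17.9, 13.2),
  flipper_length_mm = c(174L, 178L, 181L, 192L, 211L),
  body_mass_g = c(3400L, 3250L, 3750L, 3500L, 4500L)
)

nrow(penguins_small)  # number of rows = units of observation (penguins)
ncol(penguins_small)  # number of columns = variables
str(penguins_small)   # the type of each column (factor, numeric, integer)
```

Each row is one penguin, each column is one variable, and each cell holds a single value, which is what makes the table tidy.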
But the definition of data is much more expansive, like this one from Google:
THESE ARE DATA, TOO!
the emails in your inbox (text and more)
For each data example above:
- Indicate how the data can be converted into a tidy table. Specify:
- units of observation (what do the rows represent)
- at least 4 possible variables we might track for each observation (what goes into the columns)
- Indicate what people, groups, organizations, etc might use data like this.
Let’s do the first one together.
Then I’ll assign your group another one to discuss:
- Reintroduce yourselves: names, pronouns, majors, etc.
- Discuss your example.
- Once everyone is ready, I’ll have you share your example with the whole class
- At that time, introduce someone else in your group to the whole class.
Data Science
Data Science extracts knowledge from data within a particular domain of inquiry, and particular contexts. Examples from just within Macalester:
- Robin Shields-Cutler (Biology) uses big data to study microbiomes.
- Xavier Haro-Carrión (Geography) uses satellite & remote sensing data to study biodiversity conservation.
- Lisa Mueller (Political Science) uses data to analyze political protest outcomes.
- John Kim + Futures North (Media & Cultural Studies) use data-driven art to explore & illustrate migration patterns.
- Bethany Miller + Macalester’s Institutional Research team use data to better understand and shape everything from student outcomes to peer school comparisons.
- Macalester Athletics uses data to study everything from training outcomes to team performance to sleep patterns.
- Macalester Admissions uses data to better understand prospective students and their college choices.
DATA SCIENCE WORKFLOW
Though the examples above vary dramatically in their domain, context, and methodology, they share a general data science workflow. We will touch on each of these elements this semester:
Element | In 112 | Beyond 112 (or as part of your project) |
---|---|---|
data collection | the basics of getting data into RStudio | web scraping, databases, APIs |
data preparation | essential wrangling skills | advanced wrangling skills, natural language processing |
data visualization | essential univariate, multivariate, and mapping viz | interactive viz, animations |
data analysis | exploratory data analysis | prediction, statistical modeling, machine learning, AI |
data storytelling | yes! | yes! |
Software
Working with modern data, hence doing Data Science, requires statistical software – calculators, spreadsheet functionality, etc don’t cut it. We’ll exclusively use R and RStudio:
Why?
- it’s free
- it’s open source (the code is free & anybody can contribute to it)
- it has a huge online community (which is helpful for when you get stuck)
- it’s an industry standard (along with Python)
- it can be used to create reproducible and lovely documents (including this online manual!)
- Fun fact: RStudio was started by Mac alum JJ Allaire and beta-tested at Mac!
1.3 Syllabus
Let’s navigate to the syllabus.
- I’ll highlight a few items
- You are responsible for reading through it outside of class.
Pause
Think and then discuss in groups:
- What comes to mind when you think of “grades”?
- What is the difference between grades and learning?
- What is the role of grades in your learning?
1.4 Exercises
GOALS
- Familiarize yourself with the RStudio layout.
- Play around in the RStudio console to gain familiarity with the basic structure of R code.
DIRECTIONS
Be kind to yourself
We will all make so many mistakes in RStudio! That’s part of learning any new language. In fact, mistakes are important to learning any new language.
Collaboration
We are and will be sitting in groups for a reason. Collaboration improves higher-level thinking, confidence, communication, community, & more. You are expected to:
- Actively contribute to discussion. Don’t work on your own.
- Actively include all other group members in discussion.
- Create a space where others feel comfortable making mistakes & sharing ideas.
- Stay in sync while respecting that everybody has different learning strategies, work styles, note taking strategies, etc. If some people are working on exercise 10 and others on exercise 2, that’s not a good collaboration.
- Don’t rush. You won’t hand anything in and can finish up outside of class.
Growth
This 100-level course assumes you have NO R experience, but welcomes all. Growth is expected of every student.
- If you are new to R: I hope you leave class today simply feeling positive about opening RStudio.
- If you are familiar with R: I hope you think more deeply about concepts you might have taken for granted in the past, and support those new to R in your group. Explaining ideas to others deepens your own understanding and retention of these ideas.
Ask questions
We will not discuss these exercises as a class. Your group should ask me questions as I walk around the room.
Exercise 1: Open RStudio
- If you already downloaded and installed a desktop version of RStudio on your laptop, open that.
- Otherwise, log on to Mac’s RStudio server using your usual Mac credentials: https://rstudio.macalester.edu/. For the purposes of everybody being in the same place, use this for now. You’ll be prompted to install the software later.
Notice that there are four panes, each serving a different purpose. Today, we’ll work solely within the console and will not save any work.
Exercise 2: Use R as a calculator
We can do simple calculations in RStudio! Type the following lines in the console, one by one. After each line, hit your Return/Enter button and simply take note of what you get. In some cases you might even get an error! This error is important to learning how R code does and doesn’t work.
4 + 2
4/2
4^2
4*2
4(2)
Exercise 3: Functions and arguments
PAUSE: Make sure you’re still in sync with your group.
Having a calculator is nice, but we’ll typically use built-in functions to perform common (repetitive) and specific tasks. These functions have names and require information about arguments in order to run:
function(argument)
Try out the following functions in your console. Note each function’s name, the argument or information it needs to run, and what output it produces (i.e. what the function does):
sqrt(9)
sqrt(25)
nchar("snow")
nchar("macalester")
sqrt(nchar("snow"))
Some functions require more than 1 argument, separated by commas. To keep these straight, we often specify the arguments by name:
function(argument1 = ___, argument2 = ___)
Try out the following functions in your console, one by one. Note each function’s name, the arguments it needs to run, when it’s necessary to specify these arguments by name, and what output it produces.
rep(x = 2, times = 5)
rep(times = 5, x = 2)
rep(2, 5)
rep(5, 2)
seq(from = 2, to = 10, by = 2)
seq(2, 10, 2)
seq(from = 2, to = 10, length = 3)
seq(2, 10, 3)
Finally, note that R is case sensitive. Try the following code which uses Seq() instead of seq(). Take time to read the error message. You will experience this type of error message a lot! It will happen any time you misspell a function (among other reasons we’ll experience later).
Seq(2, 10, 3)
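If you are curious why the capital letter matters, here is a small sketch using base R's exists(), which reports whether R knows about something with a given name. To R, seq and Seq are two entirely different names. This assumes nobody in your session has created their own object called Seq.

```r
# R ships with a function named "seq" (all lowercase)...
exists("seq")

# ...but nothing named "Seq", so R cannot find it
exists("Seq")

# The two spellings are simply not the same text
identical("seq", "Seq")
```

The first line returns TRUE and the other two return FALSE, which is why Seq(2, 10, 3) fails while seq(2, 10, 3) runs.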
Exercise 4: Grammar
We’ll learn lots and lots of functions this semester. Nobody has every function memorized. That said, it does help to connect function names with their purpose. Do that for each function you used above.
- sqrt() = square root
- nchar() = ???
- rep() = ???
- seq() = ???
Exercise 5: Your turn
PAUSE: Make sure you’re still in sync with your group.
Use the functions you learned above to do the following:
- Count the number of letters in “data”.
- Create the sequence 3, 6, 9, 12. You might do this 2 ways.
- Create a sequence of 4 numbers that start at 1 and end at 10. You might do this 2 ways.
- Repeat the number “5” 8 times.
- CHALLENGE: Combine 2 functions to produce the sequence 3, 6, 9, 12, 3, 6, 9, 12
Exercise 6: Save it for later
For reasons that will quickly become clear, we’ll often want to store some R output for later use. In R:
name <- output
where
- name is the name under which to store a result
- output is the result we wish to store
- <- is the assignment operator. I think of this as an arrow pointing the output into the name.
IMPORTANT: Try out each line one at a time. One of these will give you an error – why? Another does something, but won’t show any output – why?
degrees_c <- -13
degrees_c
Let’s now use what you stored! Again, do this one by one.
degrees_c * (9/5) + 32
degrees_f <- degrees_c * (9/5) + 32
degrees_f
Finally, try to print degrees_tomorrow. Take time to read the error message. You will experience this type of error message a lot! It will happen when you either haven’t yet defined the object you’re trying to use, or you’ve misspelled its name (among other reasons we’ll experience later).
degrees_tomorrow
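For the curious, here is a sketch of how that error message can be captured as text instead of stopping your code. It uses base R's tryCatch() and conditionMessage(), which go beyond what we need today; the name msg is made up for illustration, and the sketch assumes you have not defined degrees_tomorrow.

```r
# Try to use an object that was never defined, and keep the
# error message as text rather than halting
msg <- tryCatch(
  degrees_tomorrow,
  error = function(e) conditionMessage(e)
)

# The message names the object that R could not find
msg
```

Reading the message closely usually tells you which name R is missing, which is the first clue to whether you forgot to define it or misspelled it.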
Exercise 7: Practice
PAUSE: Make sure you’re still in sync with your group.
- Name and store your current age in years.
- Confirm that your age is stored correctly by typing the name and pressing Return/Enter.
- Use your stored age to calculate how old you’ll be in 17 years.
Exercise 8: Code = communication
It’s important to recognize from day 1 that code is a form of communication, both to yourself and others!!!!! Code structure and details are important to readability and clarity, just as grammar, punctuation, spelling, paragraphs, and line spacing are important in written essays. All of the code below works, but has bad structure. With your group, discuss what is unfortunate about each line.
seq(from=1,to=9,by=2)
seq(from = 1, to=9,by=2)
my_output <- -13
thisisthetemperaturetodayincelsius <- -13
this_is_the_temperature_today_in_celsius <- -13
Hot tips: Code = communication
Hot tip 1: Avoid smooshy code
# BAD: tough to read
seq(from=1,to=9,by=2)
# GOOD: spaces between "words" and punctuation helps
seq(from = 1, to = 9, by = 2)
Hot tip 2: Use good naming conventions
Good names aren’t too vague (my_result), aren’t too long, and split up multiple words using snake_case or CamelCase for readability. For example:
# BAD: too smooshy and hard to read
degreescelsius <- -13
# BETTER: (though I personally don't like camel case)
# Why is it called camel case?!
DegreesCelsius <- -13
# GOOD
degrees_celsius <- -13
It’s also impossible, not just ill-advised, to start names with numbers or symbols, or to use certain symbols in our names. Try it:
Jan/18/24/degrees <- -13
1_18_24_degrees_c <- -13
_degrees_c <- -13
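One way to check a name before using it is sketched below. It relies on base R's make.names(), which rewrites any text into a syntactically valid name; if a candidate comes back unchanged, it was already valid. The object names candidates and is_valid are made up for illustration.

```r
# Names we might want to use for today's temperature
candidates <- c(
  "degrees_celsius",
  "1_18_24_degrees_c",
  "Jan/18/24/degrees",
  "_degrees_c"
)

# make.names() repairs invalid names, so "unchanged" means "already valid"
is_valid <- make.names(candidates) == candidates

# Show each candidate next to its verdict
data.frame(candidates, is_valid)
```

Only degrees_celsius passes: the others either start with a number or underscore, or contain the division symbol.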
Exercise 9: You will make so many mistakes!
Mistakes are common when, and even important to, learning any new language. You’ll get better and better at interpreting error messages, finding help, and fixing errors. These are all important skills in computer programming in general. Consider a couple tips and tricks.
Console shortcut (aka saving time for more fun things)
With your cursor at the next prompt in the console (>), press the up arrow multiple times. What does this do?! This shortcut will be very handy when you make mistakes and want to modify your code without having to start over.
Help files
You’ll often forget how functions are used. Luckily, there’s typically built-in documentation for built-in functions. Let’s practice:
- In the console, type ?rep and press Return/Enter.
- Check out the documentation file that pops up in the Help tab (lower right).
- Quickly scroll through, noting the type of information provided.
- Stop at the “Examples” at the bottom. Perhaps the most useful section, this is where a function’s functionality is demonstrated! Try out a couple of the provided examples in your console.
Exercise 10: History and environment
Finally, let’s leave the console.
Check out the “Environment” tab in the top right pane of RStudio. What do you observe there and when might this be helpful?
Similarly, check out the “History” tab in the top right pane of RStudio. What do you observe there and when might this be helpful?
Wrapping up
If you’ve finished the above exercises:
REQUIRED: Complete Checkpoint 1 (CP1) on Moodle. This will be due before our next class.
OPTIONAL: If you’d like to hear Prof. Johnson talk through the concepts you learned today, you can watch this RStudio tour video outside of class.
1.5 Wrap-up
Finishing the activity
If you didn’t finish the activity, no problem! Be sure to complete the activity outside of class, review the solutions in the online manual, and ask any questions on Slack or in office hours.
Online course manual (linked on Moodle)
- Bookmark this!
- All in-class activities will be compiled here, making for easier review.
- There are solutions at the bottom of each activity. Consult them!
Moodle
Where you can access the calendar, daily schedule, and all course materials (free!). Also where you will submit work.
Syllabus (linked on Moodle)
You’re expected to carefully review the syllabus outside of class.
Upcoming due dates
- Due Thursday: CP1 in Moodle
- Due Tuesday: CP2 in Moodle and Homework 1 (HW1)
HW1 will be posted Thursday and you’ll finish most of it in class.
1.6 Solutions
Exercise 2: Use R as a calculator
4 + 2
[1] 6
4/2
[1] 2
4^2
[1] 16
4*2
[1] 8
# This code gives an error! Multiplication requires *
4(2)
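Why the error? A rough sketch of the reasoning: R reads anything of the form name(...) as a request to run a function, so 4(2) asks R to run a function called 4, and a number is not a function. Base R's is.function() shows the difference.

```r
# sqrt is a function, so sqrt(9) makes sense to R
is.function(sqrt)

# 4 is a number, not a function, so 4(2) does not
is.function(4)

# Spelling out the multiplication works
4 * 2
```

The first check returns TRUE, the second returns FALSE, and the last line gives 8.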
Exercise 3: Functions and arguments
# sqrt calculates square root
sqrt(9)
[1] 3
sqrt(25)
[1] 5
# nchar counts up the number of characters
nchar("snow")
[1] 4
nchar("macalester")
[1] 10
# rep repeats the value "x" the number of "times" indicated
# Order doesn't matter
rep(x = 2, times = 5)
[1] 2 2 2 2 2
rep(times = 5, x = 2)
[1] 2 2 2 2 2
# We don't need to label the arguments
# But the order matters! It assumes an order of "x" then "times"
rep(2, 5)
[1] 2 2 2 2 2
rep(5, 2)
[1] 5 5
# Create a sequence of numbers
# Removing the argument labels gives the same result
seq(from = 2, to = 10, by = 2)
[1] 2 4 6 8 10
seq(2, 10, 2)
[1] 2 4 6 8 10
# We can also define a sequence by its length, not increments
# But can't remove the argument labels (R assumes the 3rd argument is by, not length)
seq(from = 2, to = 10, length = 3)
[1] 2 6 10
seq(2, 10, 3)
[1] 2 5 8
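The sketch below spells out how R matched the arguments in the seq() examples above. Unlabeled values are matched by position (from, then to, then by), while labeled values go wherever the label says. The object names are made up for illustration. Note that the full name of the length argument is length.out; writing length works because R accepts unambiguous abbreviations of argument names.

```r
# Unlabeled: R fills in from, to, by in that order
by_position <- seq(2, 10, 3)
by_name <- seq(from = 2, to = 10, by = 3)
identical(by_position, by_name)  # TRUE

# Mixing styles: the first two by position, the third by name
mixed <- seq(2, 10, length.out = 3)
mixed  # [1]  2  6 10
```

Labeling only the argument that would otherwise be ambiguous is a common middle ground between typing everything and typing nothing.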
Exercise 4: Grammar
- sqrt() = square root
- nchar() = number of characters
- rep() = repeat / repetition
- seq() = sequence
Exercise 5: Your turn
# Count the number of letters in "data"
nchar("data")
[1] 4
# Create the sequence 3, 6, 9, 12
seq(from = 3, to = 12, by = 3)
[1] 3 6 9 12
seq(from = 3, to = 12, length = 4)
[1] 3 6 9 12
# Create a sequence of 4 numbers that start at 1 and end at 10
seq(from = 1, to = 10, length = 4)
[1] 1 4 7 10
seq(from = 1, to = 10, by = 3)
[1] 1 4 7 10
# Repeat the number "5" 8 times
rep(x = 5, times = 8)
[1] 5 5 5 5 5 5 5 5
rep(5, 8)
[1] 5 5 5 5 5 5 5 5
# Combine 2 functions to produce the sequence 3, 6, 9, 12, 3, 6, 9, 12
rep(x = seq(from = 3, to = 12, by = 3), times = 2)
[1] 3 6 9 12 3 6 9 12
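As an optional extension of the challenge, rep() also has an each argument, which repeats every element in place rather than repeating the whole sequence. A short sketch comparing the two (the name threes is made up for illustration):

```r
# Store the sequence once so we can reuse it
threes <- seq(from = 3, to = 12, by = 3)

# times: repeat the whole sequence, one copy after another
rep(x = threes, times = 2)  # [1]  3  6  9 12  3  6  9 12

# each: repeat every element before moving to the next
rep(x = threes, each = 2)   # [1]  3  3  6  6  9  9 12 12
```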
Exercise 6: Save it for later
degrees_c <- -13
degrees_c
[1] -13
degrees_c * (9/5) + 32
[1] 8.6
degrees_f <- degrees_c * (9/5) + 32
degrees_f
[1] 8.6
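Looking ahead: if we needed this conversion often, we might wrap it in our own function. Writing functions is beyond today's exercises, so treat this as a sketch only. The name celsius_to_fahrenheit is made up for illustration and is not a built-in R function.

```r
# A hypothetical helper: takes degrees Celsius, returns degrees Fahrenheit
celsius_to_fahrenheit <- function(degrees_c) {
  degrees_c * (9/5) + 32
}

# Same answer as the line-by-line version above
celsius_to_fahrenheit(-13)             # [1] 8.6

# It also converts several temperatures at once
celsius_to_fahrenheit(c(-40, 0, 100))  # [1] -40  32 212
```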
Exercise 7: Practice
my_age <- 20
my_age
[1] 20
my_age + 17
[1] 37
Exercise 8: Code = communication
# This is too smooshy and hard to read
seq(from=1,to=9,by=2)
# The use of spacing is inconsistent, hence hard to read
seq(from = 1, to=9,by=2)
# Too vague
my_output <- -13
# Too smooshy
thisisthetemperaturetodayincelsius <- -13
# Easier to read, but too long
this_is_the_temperature_today_in_celsius <- -13
Exercise 10: History and environment
- Environment: shows what objects you’ve stored (eg: degrees_c)
- History: shows what R code you’ve typed
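The Environment tab has console equivalents, sketched below with base R's ls(), rm(), and exists(). The stored values are simply the ones used in the exercises above.

```r
# Store two objects, as in the exercises
degrees_c <- -13
my_age <- 20

# List the names of everything stored so far
# (the same information the Environment tab displays)
ls()

# Remove one object, then confirm that it is gone
rm(my_age)
exists("my_age")     # FALSE
exists("degrees_c")  # TRUE
```

Checking what is stored is often the quickest way to track down an "object not found" error.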