1  Welcome

SETTLING IN

Welcome! As you settle in:

  • Sit in groups of 3-4. Your group should try to include:
    • at least 2 people you don’t know well
    • at least 1 person who has taken STAT 155
  • Meet each other! Be ready to introduce each other to the class. Here are some suggestions:
    • Share your names and pronunciation tips.
    • Share aspects of who you are and have been (e.g. pronouns, geographical identity, cultural identity, hobbies/passions)
    • Share aspects of who you’d like to be (e.g. personal/professional/academic goals)
    • Share how you are feeling about the new semester (!?!)
    • Share a high point of the break.
  • You don’t need to take notes today, but writing notes to yourself can help your retention.

1.1 Introductions

Instructor

Who I am

Prof. Brianna Heggeseth (she/her)

[bree-AH-na] (Anna like in Frozen) [HEG-eh-seth]

https://bcheggeseth.github.io

Where I’ve been







1.2 Background

Data

The word data often brings spreadsheets to mind, like this one on penguins.

In this and every tidy data set:

  • each row = a unit of observation (here, a penguin)
  • each column = a measure on some variable of interest, either quantitative (numbers with units) or categorical (discrete possibilities or categories)
  • each entry contains a single data value; no analysis, summaries, footnotes, comments, etc, and only one value per cell
# A tibble: 5 × 6
  species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie    Biscoe              37.8          18.3               174        3400
2 Adelie    Dream               39.5          16.7               178        3250
3 Adelie    Torgersen           39.1          18.7               181        3750
4 Chinstrap Dream               46.5          17.9               192        3500
5 Gentoo    Biscoe              46.1          13.2               211        4500
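
To make the row/column/entry structure concrete, here is a minimal sketch that rebuilds the five rows above by hand in base R. The object name penguins_sample is made up for illustration; in practice we would import data rather than type it in.

```r
# Each argument is one column (variable); each position within it is one row (penguin)
penguins_sample <- data.frame(
  species           = factor(c("Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo")),
  island            = factor(c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe")),
  bill_length_mm    = c(37.8, 39.5, 39.1, 46.5, 46.1),
  bill_depth_mm     = c(18.3, 16.7, 18.7, 17.9, 13.2),
  flipper_length_mm = c(174L, 178L, 181L, 192L, 211L),
  body_mass_g       = c(3400L, 3250L, 3750L, 3500L, 4500L)
)

nrow(penguins_sample)  # number of units of observation (penguins)
ncol(penguins_sample)  # number of variables
str(penguins_sample)   # categorical columns are factors, quantitative columns are numbers
```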





But the definition of data is much more expansive, like this one from Google:





THESE ARE DATA, TOO!

  1. the emails in your inbox (text and more)

  2. social media posts (text and more)

  3. images

  4. videos

  5. audio files

For each data example above:

  • Indicate how the data can be converted into a tidy table. Specify:
    • units of observation (what do the rows represent)
    • at least 4 possible variables we might track for each observation (what goes into the columns)

  • Indicate what people, groups, organizations, etc might use data like this.



Let’s do the first one together.

Then I’ll assign your group another one to discuss:

  • Reintroduce yourselves: names, pronouns, majors, etc.
  • Discuss your example.
  • Once everyone is ready, I’ll have you share your example with the whole class
  • At that time, introduce someone else in your group to the whole class.







Data Science

Data Science extracts knowledge from data within a particular domain of inquiry and a particular context. Examples from just within Macalester:

(a) Shields-Cutler’s website

(b) Haro-Carrión’s website

(c) the Futures North website

Figure 1.1: Images from faculty at Macalester







DATA SCIENCE WORKFLOW

Though the examples above vary dramatically in their domain, context, and methodology, they share a general data science workflow. We will touch on each of these elements this semester:



Data Science workflow, as told through Legos. Source



Element             In 112                                               Beyond 112 (or as part of your project)
data collection     the basics of getting data into RStudio              web scraping, databases, APIs
data preparation    essential wrangling skills                           advanced wrangling skills, natural language processing
data visualization  essential univariate, multivariate, and mapping viz  interactive viz, animations
data analysis       exploratory data analysis                            prediction, statistical modeling, machine learning, AI
data storytelling   yes!                                                 yes!







Software

Working with modern data, hence doing Data Science, requires statistical software – calculators, spreadsheet functionality, etc don’t cut it. We’ll exclusively use R and RStudio:

Why?

  • it’s free
  • it’s open source (the code is free & anybody can contribute to it)
  • it has a huge online community (which is helpful for when you get stuck)
  • it’s an industry standard (along with Python)
  • it can be used to create reproducible and lovely documents (including this online manual!)
  • Fun fact: RStudio was started by Mac alum JJ Allaire and beta-tested at Mac!







1.3 Syllabus

Let’s navigate to the syllabus.

  • I’ll highlight a few items
  • You are responsible for reading through it outside of class.

Pause

Think and then discuss in groups:

  • What comes to mind when you think of “grades”?
  • What is the difference between grades and learning?
  • What is the role of grades in your learning?







1.4 Exercises

GOALS

  • Familiarize yourself with the RStudio layout.
  • Play around in the RStudio console to gain familiarity with the basic structure of R code.



DIRECTIONS

  • Be kind to yourself
    We will all make so many mistakes in RStudio! That’s part of learning any new language. In fact, mistakes are important to learning any new language.

  • Collaboration
    We are and will be sitting in groups for a reason. Collaboration improves higher-level thinking, confidence, communication, community, & more. You are expected to:

    • Actively contribute to discussion. Don’t work on your own.
    • Actively include all other group members in discussion.
    • Create a space where others feel comfortable making mistakes & sharing ideas.
    • Stay in sync while respecting that everybody has different learning strategies, work styles, note taking strategies, etc. If some people are working on exercise 10 and others on exercise 2, that’s not a good collaboration.
    • Don’t rush. You won’t hand anything in and can finish up outside of class.
  • Growth
    This 100-level course assumes you have NO R experience, but welcomes all. Growth is expected of every student.

    • If you are new to R: I hope you leave class today simply feeling positive about opening RStudio.
    • If you are familiar with R: I hope you think more deeply about concepts you might have taken for granted in the past, and support those new to R in your group. Explaining ideas to others deepens your own understanding and retention of these ideas.
  • Ask questions
    We will not discuss these exercises as a class. Your group should ask me questions as I walk around the room.



Exercise 1: Open RStudio

  • If you already downloaded and installed a desktop version of RStudio on your laptop, open that.
  • Otherwise, log on to Mac’s RStudio server using your usual Mac credentials: https://rstudio.macalester.edu/. For the purposes of everybody being in the same place, use this for now. You’ll be prompted to install the software later.

Notice that there are four panes, each serving a different purpose. Today, we’ll work solely within the console and will not save any work.

Figure 1.2: RStudio Interface





Exercise 2: Use R as a calculator

We can do simple calculations in RStudio! Type the following lines in the console, one by one. After each line, hit your Return/Enter button and simply take note of what you get. In some cases you might even get an error! This error is important to learning how R code does and doesn’t work.

4 + 2
4/2
4^2
4*2
4(2)





Exercise 3: Functions and arguments

PAUSE: Make sure you’re still in sync with your group.


Having a calculator is nice, but we’ll typically use built-in functions to perform common (repetitive) and specific tasks. These functions have names and require information about arguments in order to run:

function(argument)

Try out the following functions in your console. Note each function’s name, the argument or information it needs to run, and what output it produces (i.e. what the function does):

sqrt(9)
sqrt(25)
nchar("snow")
nchar("macalester")
sqrt(nchar("snow"))
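
The last line above nests one function inside another. A minimal sketch of how R reads it, from the inside out:

```r
# Step 1: R runs the inner function first, counting the characters in "snow"
nchar("snow")

# Step 2: the result of the inner function becomes the argument of the outer function
sqrt(nchar("snow"))

# Since "snow" has 4 characters, the nested call is the same as sqrt(4)
sqrt(nchar("snow")) == sqrt(4)
```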



Some functions require more than 1 argument, separated by commas. To keep these straight, we often specify the arguments by name:

function(argument1 = ___, argument2 = ___)

Try out the following functions in your console, one by one. Note each function’s name, the arguments it needs to run, when it’s necessary to specify these arguments by name, and what output it produces.

rep(x = 2, times = 5)
rep(times = 5, x = 2)
rep(2, 5)
rep(5, 2)
seq(from = 2, to = 10, by = 2)
seq(2, 10, 2)
seq(from = 2, to = 10, length = 3)
seq(2, 10, 3)

Finally, note that R is case sensitive. Try the following code which uses Seq() instead of seq(). Take time to read the error message. You will experience this type of error message a lot! It will happen any time you misspell a function (among other reasons we’ll experience later).

Seq(2, 10, 3)
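
Going a bit beyond today’s material, here is a sketch of two base R tools for investigating this kind of error. It assumes that nothing you have loaded defines a function named Seq.

```r
# exists() reports whether R knows a name; the name goes in quotes
exists("seq")
exists("Seq")

# tryCatch() lets us capture the text of the error message instead of stopping
tryCatch(Seq(2, 10, 3), error = function(e) conditionMessage(e))
```

The captured message names the function R could not find, which is usually the fastest clue to a misspelling.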





Exercise 4: Grammar

We’ll learn lots and lots of functions this semester. Nobody has every function memorized. That said, it does help to connect function names with their purpose. Do that for each function you used above.

  • sqrt() = square root
  • nchar() = ???
  • rep() = ???
  • seq() = ???





Exercise 5: Your turn

PAUSE: Make sure you’re still in sync with your group.


Use the functions you learned above to do the following:

  • Count the number of letters in “data”.
  • Create the sequence 3, 6, 9, 12. You might do this 2 ways.
  • Create a sequence of 4 numbers that start at 1 and end at 10. You might do this 2 ways.
  • Repeat the number “5” 8 times.
  • CHALLENGE: Combine 2 functions to produce the sequence 3, 6, 9, 12, 3, 6, 9, 12





Exercise 6: Save it for later

For reasons that will quickly become clear, we’ll often want to store some R output for later use. In R:

name <- output

where

  • name is the name under which to store a result
  • output is the result we wish to store
  • <- is the assignment operator. I think of this as an arrow pointing the output into the name.

IMPORTANT: Try out each line one at a time. The first line does something, but won’t show any output – why?

degrees_c <- -13
degrees_c

Let’s now use what you stored! Again, do this one by one.

degrees_c * (9/5) + 32
degrees_f <- degrees_c * (9/5) + 32
degrees_f
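
The calculation above is the usual Celsius-to-Fahrenheit formula, F = C × 9/5 + 32. As a sketch, we can break it into stored steps and then reverse it as a sanity check. The name scaled is made up for illustration.

```r
degrees_c <- -13

# Step 1: rescale the Celsius value (hypothetical intermediate name)
scaled <- degrees_c * (9/5)

# Step 2: shift by 32 to land on the Fahrenheit scale
degrees_f <- scaled + 32
degrees_f

# Reverse the conversion: this should bring us back to the Celsius value we started with
(degrees_f - 32) * (5/9)
```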

Finally, try to print degrees_tomorrow. Take time to read the error message. You will experience this type of error message a lot! It will happen when you either haven’t yet defined the object you’re trying to use, or you’ve misspelled its name (among other reasons we’ll experience later).

degrees_tomorrow
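
One way to check whether a name has been defined, without triggering an error, is the base R function exists(). A minimal sketch, assuming degrees_tomorrow has not been stored anywhere in your session:

```r
degrees_c <- -13

# exists() takes the name as a quoted string
exists("degrees_c")         # TRUE: we just stored it
exists("degrees_tomorrow")  # FALSE, unless it was defined elsewhere
```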





Exercise 7: Practice

PAUSE: Make sure you’re still in sync with your group.


  • Name and store your current age in years.
  • Confirm that your age is stored correctly by typing the name and pressing Return/Enter.
  • Use your stored age to calculate how old you’ll be in 17 years.





Exercise 8: Code = communication

It’s important to recognize from day 1 that code is a form of communication, both to yourself and others!!!!! Code structure and details are important to readability and clarity, just as grammar, punctuation, spelling, paragraphs, and line spacing are important in written essays. All of the code below works, but has bad structure. With your group, discuss what is unfortunate about each line.

seq(from=1,to=9,by=2)
seq(from = 1, to=9,by=2)
my_output <- -13
thisisthetemperaturetodayincelsius <- -13
this_is_the_temperature_today_in_celsius <- -13





Hot tips: Code = communication

Hot tip 1: Avoid smooshy code

# BAD: tough to read
seq(from=1,to=9,by=2)

# GOOD: spaces between "words" and punctuation help
seq(from = 1, to = 9, by = 2)



Hot tip 2: Use good naming conventions

Good names aren’t too vague (my_result), aren’t too long, and split up multiple words using snake_case or CamelCase for readability. For example:

# BAD: too smooshy and hard to read
degreescelsius <- -13

# BETTER: (though I personally don't like camel case)
# Why is it called camel case?!
DegreesCelsius <- -13

# GOOD
degrees_celsius <- -13

It’s also impossible, not just ill-advised, to start names with numbers or symbols, or to use certain symbols in our names. Try it:

Jan/18/24/degrees <- -13
1_18_24_degrees_c <- -13
_degrees_c <- -13
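
After trying the lines above, you can also ask R whether a name is valid without causing an error. This sketch relies on the base R function make.names(), which rewrites a name only if it is not already valid. The vector name candidates is made up for illustration.

```r
# The three names from above, plus one valid name for comparison
candidates <- c("Jan/18/24/degrees", "1_18_24_degrees_c", "_degrees_c", "degrees_c")

# What R would turn each name into
make.names(candidates)

# A name is valid exactly when make.names() leaves it unchanged
candidates == make.names(candidates)
```

Only the last name, degrees_c, should come back TRUE.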





Exercise 9: You will make so many mistakes!

Mistakes are common when learning any new language, and are even an important part of it. You’ll get better and better at interpreting error messages, finding help, and fixing errors. These are all important skills in computer programming in general. Consider a couple of tips and tricks.


Console shortcut (aka saving time for more fun things)

With your cursor at the next prompt in the console (>), press the up arrow multiple times. What does this do?! This shortcut will be very handy when you make mistakes and want to modify your code without having to start over.


Help files

You’ll often forget how functions are used. Luckily, there’s typically built-in documentation for built-in functions. Let’s practice:

  • In the console, type ?rep and press Return/Enter.
  • Check out the documentation file that pops up in the Help tab (lower right).
  • Quickly scroll through, noting the type of information provided.
  • Stop at the “Examples” at the bottom. Perhaps the most useful section, this is where a function’s functionality is demonstrated! Try out a couple of the provided examples in your console.
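
The steps above also have function equivalents in base R. A short sketch:

```r
# Same as typing ?rep: opens the documentation for rep()
help("rep")

# Runs the code from the "Examples" section of that help file in the console
example("rep")
```

Running example() prints each example line followed by its output, which can be a quick way to see a function in action.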





Exercise 10: History and environment

Finally, let’s leave the console.

  • Check out the “Environment” tab in the top right pane of RStudio. What do you observe there and when might this be helpful?

  • Similarly, check out the “History” tab in the top right pane of RStudio. What do you observe there and when might this be helpful?
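
The Environment tab also has a console counterpart. As a sketch, the base R function ls() lists the names of the objects you’ve stored, and rm() removes one. The value of my_age here is made up.

```r
degrees_c <- -13
my_age <- 20   # hypothetical value

# List the names of all stored objects
ls()

# Remove one object; it disappears from the Environment tab too
rm(my_age)
ls()
```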





Wrapping up

If you’ve finished the above exercises:

  • REQUIRED: Complete Checkpoint 1 (CP1) on Moodle. This will be due before our next class.

  • OPTIONAL: If you’d like to hear Prof. Johnson talk through the concepts you learned today, you can watch this RStudio tour video outside of class.





1.5 Wrap-up

Finishing the activity

If you didn’t finish the activity, no problem! Be sure to complete the activity outside of class, review the solutions in the online manual, and ask any questions on Slack or in office hours.



Online course manual (linked on Moodle)

  • Bookmark this!
  • All in-class activities will be compiled here, making for easier review.
  • There are solutions at the bottom of each activity. Consult them!



Moodle

Where you can access the calendar, daily schedule, and all course materials (free!). Also where you will submit work.



Syllabus (linked on Moodle)

You’re expected to carefully review the syllabus outside of class.



Upcoming due dates

  • Due Thursday: CP1 in Moodle
  • Due Tuesday: CP2 in Moodle and Homework 1 (HW1)
    HW1 will be posted Thursday and you’ll finish most of it in class.







1.6 Solutions


Exercise 2: Use R as a calculator

4 + 2
[1] 6
4/2
[1] 2
4^2
[1] 16
4*2
[1] 8
# This code gives an error! Multiplication requires *
4(2)

Exercise 3: Functions and arguments

# sqrt calculates square root
sqrt(9)
[1] 3
sqrt(25)
[1] 5
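# nchar counts up the number of characters
nchar("snow")
[1] 4
nchar("macalester")
[1] 10
# Functions can be nested: the square root of the number of characters in "snow"
sqrt(nchar("snow"))
[1] 2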
# rep repeats the value "x" the number of "times" indicated
# Order doesn't matter
rep(x = 2, times = 5)
[1] 2 2 2 2 2
rep(times = 5, x = 2)
[1] 2 2 2 2 2
# We don't need to label the arguments
# But the order matters! It assumes an order of "x" then "times"
rep(2, 5)
[1] 2 2 2 2 2
rep(5, 2)
[1] 5 5
# Create a sequence of numbers
# Removing the argument labels gives the same result 
seq(from = 2, to = 10, by = 2)
[1]  2  4  6  8 10
seq(2, 10, 2)
[1]  2  4  6  8 10
# We can also define a sequence by its length, not increments
# But we can't remove the argument labels (R assumes the 3rd unnamed argument is "by")
seq(from = 2, to = 10, length = 3)
[1]  2  6 10
seq(2, 10, 3)
[1] 2 5 8





Exercise 4: Grammar

  • sqrt() = square root
  • nchar() = number of characters
  • rep() = repeat / repetition
  • seq() = sequence





Exercise 5: Your turn

# Count the number of letters in "data"
nchar("data")
[1] 4
# Create the sequence 3, 6, 9, 12
seq(from = 3, to = 12, by = 3)
[1]  3  6  9 12
seq(from = 3, to = 12, length = 4)
[1]  3  6  9 12
# Create a sequence of 4 numbers that start at 1 and end at 10
seq(from = 1, to = 10, length = 4)
[1]  1  4  7 10
seq(from = 1, to = 10, by = 3)
[1]  1  4  7 10
# Repeat the number "5" 8 times
rep(x = 5, times = 8)
[1] 5 5 5 5 5 5 5 5
rep(5, 8)
[1] 5 5 5 5 5 5 5 5
# Combine 2 functions to produce the sequence 3, 6, 9, 12, 3, 6, 9, 12
rep(x = seq(from = 3, to = 12, by = 3), times = 2)
[1]  3  6  9 12  3  6  9 12





Exercise 6: Save it for later

degrees_c <- -13
degrees_c
[1] -13
degrees_c * (9/5) + 32
[1] 8.6
degrees_f <- degrees_c * (9/5) + 32
degrees_f
[1] 8.6





Exercise 7: Practice

my_age <- 20
my_age
[1] 20
my_age + 17
[1] 37





Exercise 8: Code = communication

# This is too smooshy and hard to read
seq(from=1,to=9,by=2)

# The use of spacing is inconsistent, hence hard to read
seq(from = 1, to=9,by=2)

# Too vague
my_output <- -13

# Too smooshy
thisisthetemperaturetodayincelsius <- -13

# Easier to read, but too long
this_is_the_temperature_today_in_celsius <- -13





Exercise 10: History and environment

  • Environment: shows what objects you’ve stored (e.g. degrees_c)
  • History: shows what R code you’ve typed