Joining Data

Brianna Heggeseth

Announcements

This week in MSCS

MSCS BIPOC Affinity group meeting Feb 22 7-8:30pm in Smail Gallery

Due this Week

Assignment 5 (Six Main Verbs + Reshaping) on Wednesday [via Moodle]

If you haven’t turned in a Tidy Tuesday, this is your week!

Bob Ross Paintings

If your spreadsheet says “No submission for assignment” for Assignment 1-4,

submit your assignment to Moodle
email Brianna indicating that you’ve submitted
Brianna will be going through past assignments this week

A Exercise from Six Main Verbs

Calculate the most popular name for each year. Print out the answer for the years 2006-2015.

Let’s break this down. We only care about 2006 - 2015, so what should we do first?

filter(year >= 2006, year <= 2015)

Do we care about sex assigned at birth? No. Let’s combine the counts for sex assigned at birth for each name and year.

group_by(year,name) %>% summarize(TotalBirths = sum(n))

“Most popular name for each year” - What is the grouping structure implied?

A Year

If you only had data for 2006, how would you approach this? What is the main verb implied?

Filter

Pull out the row (name) that have the maximum counts

Now, perform that action within the groups defined by year.

babynames %>%
  filter(year >= 2006, year <= 2015) %>%
  group_by(name, year) %>% #to combine counts across sexes
  summarize(n = sum(n)) %>% #to combine counts across sexes
  group_by(year) %>%
  filter(n == max(n)) %>%
  arrange(year)

Learning Goals

Understand the concept of keys and variables that uniquely identify rows or cases
Understand the different types of joins, different ways of combining two data frames together
Develop comfort in using mutating joins: left_join, inner_join and full_join in the dplyr package
Develop comfort in using filtering joins: semi_join, anti_join in the dplyr package

Template File

Download a template .Rmd of this activity. Put the file in a Assignment_06 folder within your COMP_STAT_112 folder.

This .Rmd contains examples that we’ll work on in class and exercises you’ll finish for Assignment 6.

Joins

Definition

A join is a verb that means to combine two data tables.

These tables are often called the left and the right tables.

Left and Right Tables with intersecting light grey lines indicating potential matches and colored points indicating a match.

Kinds of Joins

All joins involve establishing a correspondence — a match — between each case in the left table and zero or more cases in the right table.
The various joins differ in how they handle multiple matches or missing matches.

Matches

Definition of Match

A match between a case in the left data table and a case in the right data table is made based on the values in keys, variables that uniquely define observations in a data table.

Match in Practice

In order to establish a match between two data tables,

You specify which variables (or keys) to use.
Each key is specified as a pair, where one variable from the left table corresponds to one variable from the right table.
Cases must have exactly equal values in the left variable and right variable for a match to be made.

Example - Student Grades and Courses

Student grades.
sid	sessionID	grade
S31842	session2207	B+
S32436	session3172	S
S31671	session3435	A-
S31929	session3512	NC

Information about each course section.
sessionID	dept	level	sem	enroll	iid
session2780	O	300	SP2003	21	inst298
session3520	k	300	FA2004	16	inst463
session1965	d	100	FA2001	25	inst414
session3257	o	200	SP2004	16	inst312

Example - Student Grades and Courses

What variables uniquely identifies a case for Grades?

sid (student ID) and sessionID (class ID) uniquely identify each case for Grades.

What variables uniquely identifies a case for Courses?

sessionID (class ID) and dept unique identify each case for Courses. You may have thought that sessionID alone was sufficient; however, if a course is cross-listed, then it may have multiple departments listed.

Example - Student Grades and Courses

What key might we use to join Grades and Courses?

sessionID (class ID) could match information from Grades for Courses.

Mutating Joins

A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.

left_join(): the output has all cases from the left, regardless if there is a match in the right, but discards any cases in the right that do not have a match in the left.

inner_join(): the output has only the cases from the left with a match in the right.

full_join(): the output has all cases from the left and the right. This is less common than the first two join operators.

For all of these mutating joins, if a row in the left matches multiple rows in the right, all the rows in right will be returned once for each matching row in the left.

Example - Student Grades and Courses

Determine the average class size from the viewpoint of the Provost / Admissions Office

How would you approach this?

CourseSizes <- Courses %>%
  group_by(sessionID) %>% # needed to deal with cross-listed courses
  summarise(total_enroll = sum(enroll))

head(CourseSizes)

# A tibble: 6 × 2
  sessionID   total_enroll
  <chr>              <dbl>
1 session1784           22
2 session1785           52
3 session1791           22
4 session1792           20
5 session1794           22
6 session1795           26

mean(CourseSizes$total_enroll)

[1] 21.45251

Example - Student Grades and Courses

Determine the average class size from the viewpoint of a student (average class size for each student experiences across their courses)

How would you approach this?

EnrollmentsWithClassSize <- Grades %>% #Student level info
  left_join(CourseSizes,
    by = c("sessionID" = "sessionID")
  ) %>%
  select(sid, sessionID, total_enroll)

head(EnrollmentsWithClassSize)

# A tibble: 6 × 3
  sid    sessionID   total_enroll
  <chr>  <chr>              <dbl>
1 S31185 session1784           22
2 S31185 session1785           52
3 S31185 session1791           22
4 S31185 session1792           20
5 S31185 session1794           22
6 S31185 session1795           26

AveClassEachStudent <- EnrollmentsWithClassSize %>%
  group_by(sid) %>%
  summarise(avg_enroll = mean(total_enroll, na.rm = TRUE))

head(AveClassEachStudent)

# A tibble: 6 × 2
  sid    avg_enroll
  <chr>       <dbl>
1 S31185       29  
2 S31188       27.6
3 S31191       29.1
4 S31194       19.5
5 S31197       26.3
6 S31200       26.5

mean(AveClassEachStudent$avg_enroll)

[1] 24.41885

Filtering Joins

Filtering joins affect the observations, not the variables.

semi_join(): discards any cases in the left table that do not have a match in the right table. If there are multiple matches of right cases to a left case, it keeps just one copy of the left case.

anti_join(): discards any cases in the left table that have a match in the right table.

Example - Student Grades and Courses

Find a subset of the Grades data that only contains data on the four largest sections in the Courses data set.

LargeSections <- Courses %>%
  group_by(sessionID) %>%
  summarise(total_enroll = sum(enroll)) %>%
  arrange(desc(total_enroll)) %>% head(4)

LargeSections

# A tibble: 4 × 2
  sessionID   total_enroll
  <chr>              <dbl>
1 session2956          120
2 session2171          105
3 session2465           93
4 session2170           85

GradesFromLargeSections <- Grades %>%
  semi_join(LargeSections)

head(GradesFromLargeSections)

# A tibble: 6 × 3
  sid    sessionID   grade
  <chr>  <chr>       <chr>
1 S31188 session2956 S    
2 S31344 session2465 S    
3 S31365 session2956 S    
4 S31389 session2170 A-   
5 S31389 session2465 S    
6 S31482 session2170 B+

Example - Student Grades and Courses

Use semi_join() to create a table with a subset of the rows of Grades corresponding to all classes taken in department J.

JCourses <- Courses %>%
  filter(dept == "J")

JGrades <- Grades %>%
  semi_join(JCourses)

After Class

Work on the 6 exercises to turn in for Assignment 5.

These exercises help you deepen your understanding of

Data Wrangling (6 Main Verbs)
Data Visualization (including Spatial Viz)

We’ll plan to work on a subset of these exercises in class on Thursday.