sid | sessionID | grade |
---|---|---|
S31842 | session2207 | B+ |
S32436 | session3172 | S |
S31671 | session3435 | A- |
S31929 | session3512 | NC |
This week in MSCS
If you haven’t turned in a Tidy Tuesday, this is your week!
If your spreadsheet says “No submission for assignment” for Assignment 1-4,
Calculate the most popular name for each year. Print out the answer for the years 2006-2015.
Let’s break this down. We only care about 2006 - 2015, so what should we do first?
filter(year >= 2006, year <= 2015)
Do we care about sex assigned at birth? No. Let’s combine the counts for sex assigned at birth for each name and year.
group_by(year,name) %>% summarize(TotalBirths = sum(n))
“Most popular name for each year” - What is the grouping structure implied?
A Year
If you only had data for 2006, how would you approach this? What is the main verb implied?
Filter
Pull out the row (name) that have the maximum counts
Now, perform that action within the groups defined by year.
babynames %>%
filter(year >= 2006, year <= 2015) %>%
group_by(name, year) %>% #to combine counts across sexes
summarize(n = sum(n)) %>% #to combine counts across sexes
group_by(year) %>%
filter(n == max(n)) %>%
arrange(year)
left_join
, inner_join
and full_join
in the dplyr
packagesemi_join
, anti_join
in the dplyr
packageDownload a template .Rmd of this activity. Put the file in a Assignment_06
folder within your COMP_STAT_112
folder.
A join is a verb that means to combine two data tables.
All joins involve establishing a correspondence — a match — between each case in the left table and zero or more cases in the right table.
The various joins differ in how they handle multiple matches or missing matches.
A match between a case in the left data table and a case in the right data table is made based on the values in keys, variables that uniquely define observations in a data table.
In order to establish a match between two data tables,
sid | sessionID | grade |
---|---|---|
S31842 | session2207 | B+ |
S32436 | session3172 | S |
S31671 | session3435 | A- |
S31929 | session3512 | NC |
sessionID | dept | level | sem | enroll | iid |
---|---|---|---|---|---|
session2780 | O | 300 | SP2003 | 21 | inst298 |
session3520 | k | 300 | FA2004 | 16 | inst463 |
session1965 | d | 100 | FA2001 | 25 | inst414 |
session3257 | o | 200 | SP2004 | 16 | inst312 |
What variables uniquely identifies a case for Grades
?
sid
(student ID) andsessionID
(class ID) uniquely identify each case forGrades
.
What variables uniquely identifies a case for Courses
?
sessionID
(class ID) anddept
unique identify each case forCourses
. You may have thought thatsessionID
alone was sufficient; however, if a course is cross-listed, then it may have multiple departments listed.
What key might we use to join Grades
and Courses
?
sessionID
(class ID) could match information fromGrades
forCourses
.
A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.
left_join()
: the output has all cases from the left, regardless if there is a match in the right, but discards any cases in the right that do not have a match in the left.
inner_join()
: the output has only the cases from the left with a match in the right.
full_join()
: the output has all cases from the left and the right. This is less common than the first two join operators.
For all of these mutating joins, if a row in the left matches multiple rows in the right, all the rows in right will be returned once for each matching row in the left.
Determine the average class size from the viewpoint of the Provost / Admissions Office
CourseSizes <- Courses %>%
group_by(sessionID) %>% # needed to deal with cross-listed courses
summarise(total_enroll = sum(enroll))
head(CourseSizes)
# A tibble: 6 × 2
sessionID total_enroll
<chr> <dbl>
1 session1784 22
2 session1785 52
3 session1791 22
4 session1792 20
5 session1794 22
6 session1795 26
[1] 21.45251
Determine the average class size from the viewpoint of a student (average class size for each student experiences across their courses)
EnrollmentsWithClassSize <- Grades %>% #Student level info
left_join(CourseSizes,
by = c("sessionID" = "sessionID")
) %>%
select(sid, sessionID, total_enroll)
head(EnrollmentsWithClassSize)
# A tibble: 6 × 3
sid sessionID total_enroll
<chr> <chr> <dbl>
1 S31185 session1784 22
2 S31185 session1785 52
3 S31185 session1791 22
4 S31185 session1792 20
5 S31185 session1794 22
6 S31185 session1795 26
AveClassEachStudent <- EnrollmentsWithClassSize %>%
group_by(sid) %>%
summarise(avg_enroll = mean(total_enroll, na.rm = TRUE))
head(AveClassEachStudent)
# A tibble: 6 × 2
sid avg_enroll
<chr> <dbl>
1 S31185 29
2 S31188 27.6
3 S31191 29.1
4 S31194 19.5
5 S31197 26.3
6 S31200 26.5
[1] 24.41885
Filtering joins affect the observations, not the variables.
semi_join()
: discards any cases in the left table that do not have a match in the right table. If there are multiple matches of right cases to a left case, it keeps just one copy of the left case.
anti_join()
: discards any cases in the left table that have a match in the right table.
Find a subset of the Grades
data that only contains data on the four largest sections in the Courses
data set.
LargeSections <- Courses %>%
group_by(sessionID) %>%
summarise(total_enroll = sum(enroll)) %>%
arrange(desc(total_enroll)) %>% head(4)
LargeSections
# A tibble: 4 × 2
sessionID total_enroll
<chr> <dbl>
1 session2956 120
2 session2171 105
3 session2465 93
4 session2170 85
# A tibble: 6 × 3
sid sessionID grade
<chr> <chr> <chr>
1 S31188 session2956 S
2 S31344 session2465 S
3 S31365 session2956 S
4 S31389 session2170 A-
5 S31389 session2465 S
6 S31482 session2170 B+
Use semi_join()
to create a table with a subset of the rows of Grades
corresponding to all classes taken in department J
.
Work on the 6 exercises to turn in for Assignment 5.
These exercises help you deepen your understanding of
We’ll plan to work on a subset of these exercises in class on Thursday.