Multivariate Visualizations + Idea Lab

Brianna Heggeseth

Switch it Up

Sit with someone new today!

  • Introduce yourself
  • Share best or worst part of the weekend

Announcements

This week in MSCS

  • Thursday 11:15am Coffee Break

This week in class

  • TT3 posted today (check it out if you haven’t already completed TT1 or TT2)

Today we’ll practice discussing the “insights” we gain from our visualizations. Then we’ll create some visuals by hand!

Learning Goals

  • Understand how we can use additional aesthetics such as color and size to incorporate a third variable (or more) into a bivariate plot
  • Develop comfort with interpreting heat maps and star plots, which allow you to look for patterns of variation across many variables

Template File

Download a template .Rmd of this activity. Put the file in an Assignment_04 folder within your COMP_STAT_112 folder.

  • This .Rmd contains examples that we’ll walk through in class and exercises you’ll finish for Assignment 4.
  • Choice
    • If you’d prefer to write code, open the Rmd up now.

    • If you’d prefer to see code (write later), open Slides for Today!

More Aesthetic Attributes

To go beyond 2 variables, we need to add aesthetics for each new variable!

Data: Exploring SAT Scores

Though far from a perfect assessment of academic preparedness, SAT scores have historically been used as one measurement of a state’s education system.


library(tidyverse)
education <- read.csv("https://bcheggeseth.github.io/112_spring_2023/data/sat.csv")


The first few rows of the SAT data.
State       expend  ratio  salary  frac  verbal  math   sat  fracCat
Alabama      4.405   17.2  31.144     8     491   538  1029  (0,15]
Alaska       8.963   17.6  47.951    47     445   489   934  (45,100]
Arizona      4.778   19.3  32.175    27     448   496   944  (15,45]
Arkansas     4.459   17.1  28.934     6     482   523  1005  (0,15]
California   4.992   24.0  41.078    45     417   485   902  (15,45]
Colorado     5.443   18.4  34.571    29     462   518   980  (15,45]

Data: Codebook

Codebook for SAT data. Source: https://www.macalester.edu/~kaplan/ISM/datasets/data-documentation.pdf

Univariate Density

Variability in average SAT scores from state to state:

ggplot(education, aes(x = sat)) +
  geom_density(fill = "blue", alpha = .5) + theme_classic()

Density plot of average SAT scores across U.S. states in the mid-1990s. There are two groups of states: those averaging about 900 and those averaging around 1050.

Bivariate Scatterplot

To what degree do per-pupil spending (expend) and teacher salary (salary) explain this variability?


ggplot(education, aes(y = sat, x = salary)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") + theme_classic()
ggplot(education, aes(y = sat, x = expend)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") + theme_classic()

Is there anything that surprises you in the above plots? What trends do you see in the relationships? Discuss as a group and write a one-sentence summary of your thoughts in the Rmd.

Example: Three Variables

Make a single scatterplot visualization that demonstrates the relationship between sat, salary, and expend.

Hints:

1. Try using the color or size aesthetics to incorporate the expenditure data.

2. Include some model smooths with geom_smooth() to help highlight the trends.

ggplot(education, aes(y = sat, x = salary, color = expend)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") + theme_classic()

Scatterplot of average SAT scores against teacher salary, with points colored by expenditure, across U.S. states in the mid-1990s. There seems to be a high correlation between expenditure and salary, and both seem to be negatively correlated with SAT scores.

ggplot(education, aes(y = sat, x = salary)) +
  geom_point(aes(size = expend)) +
  geom_smooth(se = FALSE, method = "lm") + theme_classic()

Scatterplot of average SAT scores against teacher salary, with point size showing expenditure, across U.S. states in the mid-1990s. There seems to be a high correlation between expenditure and salary, and both seem to be negatively correlated with SAT scores.
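The correlations described in the captions above can also be checked numerically. The sketch below uses only the six rows shown in the table earlier (sat_head is a hypothetical name for that subset), so the values it prints are illustrative and are not the correlations for all states.

```r
# First six rows of the SAT data, copied from the table shown earlier.
# sat_head is a hypothetical name; it is NOT the full data set.
sat_head <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  expend = c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443),
  salary = c(31.144, 47.951, 32.175, 28.934, 41.078, 34.571),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)

# Pairwise correlations among the three quantitative variables
cor_mat <- cor(sat_head[, c("expend", "salary", "sat")])
round(cor_mat, 2)
```

With the full data loaded, replacing sat_head with education should give the correlations for all states.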

Example: Three Variables

Another option!

Categorize your 3rd Quantitative Variable!

education %>%
  mutate(expendCat = cut(expend, 3)) %>%
  ggplot(aes(y = sat, x = salary, color = expendCat)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") + theme_classic()

Scatterplot of average SAT scores against teacher salary, with points and trend lines colored by expenditure category (three equal-width bins), across U.S. states in the mid-1990s. There seems to be a high correlation between expenditure and salary, and both seem to be negatively correlated with SAT scores.
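How the third variable is binned affects how many points each color gets. cut(expend, 3) makes three equal-width intervals. Breaking at the tertiles instead is one possible alternative, shown here as a sketch and not as the approach used in class. It uses only the six expend values from the table earlier.

```r
# Per-pupil spending for the six states shown in the table earlier
expend <- c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443)

# cut(x, 3) splits the range of x into three equal-width intervals
equal_width <- cut(expend, 3)

# Alternative (an assumption, not from the slides): break at the tertiles
# so that each bin holds a similar number of states
tertiles <- quantile(expend, probs = c(0, 1/3, 2/3, 1))
equal_count <- cut(expend, breaks = tertiles, include.lowest = TRUE)

table(equal_width)
table(equal_count)
```

In these six rows, Alaska's high spending stretches the range, so the middle equal-width bin is empty while the tertile bins are balanced. ggplot2 also provides cut_interval() and cut_number() for the same two strategies.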

Example: Fraction who take SAT

The fracCat variable in the education data categorizes the fraction of the state’s students that take the SAT into low (15% or less), medium (more than 15% and up to 45%), and high (more than 45%).
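A categorical variable like fracCat can be built from frac with cut(). The sketch below assumes break points of 0, 15, 45, and 100, inferred from the interval labels in the table earlier. It uses the six frac values shown there.

```r
# Percent of students taking the SAT for the six states shown earlier
frac <- c(Alabama = 8, Alaska = 47, Arizona = 27,
          Arkansas = 6, California = 45, Colorado = 29)

# Assumed break points, inferred from the interval labels in fracCat.
# cut() makes right-closed intervals by default, so 45 falls in (15,45].
frac_cat <- cut(frac, breaks = c(0, 15, 45, 100))

data.frame(frac, frac_cat)
```

The resulting labels match the fracCat column for these six states, including California at exactly 45%.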

  1. Make a univariate visualization of the fracCat variable to better understand how many states fall into each category.

ggplot(education, aes(x = fracCat)) +
  geom_bar() + theme_classic()

Barplot of the fraction of the state's students that take the SAT, categorized into low, medium, and high. In most states, either more than 45% or at most 15% of students take the SAT; few states have participation in the middle.

Example: Fraction who take SAT

  1. Make a bivariate visualization that demonstrates the relationship between fracCat and sat. What story does your graphic tell?

ggplot(education, aes(x = fracCat, y = sat)) +
  geom_boxplot() + theme_classic()

Boxplot of average SAT scores by state participation in the SAT, categorized as low, medium, or high. Average SAT scores are higher among states with lower participation, because there is self-selection in who takes the SAT.

Example: Fraction who take SAT

  1. Make a trivariate visualization that demonstrates the relationship between fracCat, sat, and expend. Incorporate fracCat as the color of each point, and use a single call to geom_smooth to add three trendlines (one for each fracCat). What story does your graphic tell?

ggplot(education, aes(color = fracCat, y = sat, x = expend)) +
  geom_point() + geom_smooth(se = FALSE, method = 'lm') + theme_classic()

Scatterplot of expenditure and average SAT scores by student participation within the state. There is a slight positive relationship between expenditure and SAT scores once you account for the student participation.

Example: Fraction who take SAT

  1. Putting all of this together, explain this example of Simpson’s Paradox. That is, why does it appear that SAT scores decrease as spending increases even though the opposite is true?

Discuss!
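To support the discussion with numbers, one option is to compare the fitted slope of sat on expend when fracCat is ignored with the slopes fitted within each fracCat group. This is a minimal base R sketch; slope_of and compare_slopes are hypothetical helper names, and the functions assume a data frame with columns sat, expend, and fracCat.

```r
# Slope of the least-squares line of sat on expend for one data frame
slope_of <- function(df) {
  unname(coef(lm(sat ~ expend, data = df))["expend"])
}

# Overall slope (ignoring fracCat) next to the slope within each fracCat group
compare_slopes <- function(df) {
  list(
    overall = slope_of(df),
    within  = sapply(split(df, df$fracCat), slope_of)
  )
}

# Usage, once the class data has been read in as `education`:
# compare_slopes(education)
```

If the overall slope and the within-group slopes have opposite signs, that is the pattern Simpson's Paradox describes.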

Other Multivariate Visualization Techniques

After class, I want you to look through the heat maps and star plots. I have a few exercises in which I want you to reflect on the insight you gain.

Handmade Visualizations

Let’s go to the Google Doc for the instructions.

Your task: Create a visualization based on the data provided, using any materials available.

Handmade Visualizations - Data

Name               Area (acres)  Max_depth (feet)  Watershed_area (acres)  Chain_of_lakes  Town
Bde Maka Ska                401                87                    2992  Yes             Minneapolis
Lake Harriet                335                85                    1139  Yes             Minneapolis
Lake Nokomis                204                33                     869  No              Minneapolis
Cedar Lake                  170                51                    1956  Yes             Minneapolis
Lake of the Isles           109                31                     735  Yes             Minneapolis
Lake Como                    71                15                    1783  No              St Paul
Lake Phalen                 198                91                   14720  No              St Paul