Multivariate Visualizations + Idea Lab

Brianna Heggeseth

Switch it Up

Sit with someone new today!

  • Introduce yourself
  • Share best or worst part of the week so far

Announcements

Every week in MSCS

  • Thursday 11:15am Coffee Break in Smail Gallery

This week in class

  • TT2 due tomorrow

Today we’ll practice discussing “insights” we gain from our visualizations. Then we’ll create some visuals by hand!

Learning Goals

  • Understand how we can use additional aesthetics such as color and size to incorporate a third variable (or more) into a bivariate plot
  • Develop comfort with interpreting heat maps and star plots, which allow you to look for patterns of variation across many variables

More Aesthetic Attributes

To go beyond 2 variables, we need to add aesthetics for each new variable!

Data: Exploring SAT Scores

Though far from a perfect assessment of academic preparedness, SAT scores have historically been used as one measurement of a state’s education system.


library(tidyverse)
education <- read.csv("https://bcheggeseth.github.io/112_fall_2023/data/sat.csv")


The first few rows of the SAT data.
State expend ratio salary frac verbal math sat fracCat
Alabama 4.405 17.2 31.144 8 491 538 1029 (0,15]
Alaska 8.963 17.6 47.951 47 445 489 934 (45,100]
Arizona 4.778 19.3 32.175 27 448 496 944 (15,45]
Arkansas 4.459 17.1 28.934 6 482 523 1005 (0,15]
California 4.992 24.0 41.078 45 417 485 902 (15,45]
Colorado 5.443 18.4 34.571 29 462 518 980 (15,45]

Data: Codebook

Codebook for SAT data. Source: https://www.macalester.edu/~kaplan/ISM/datasets/data-documentation.pdf

Univariate Density

Variability in average SAT scores from state to state:

ggplot(education, aes(x = sat)) +
  geom_density(fill = "blue", alpha = .5) + theme_classic()

Density plot of average SAT scores across U.S. states in the mid-1990s. There are two groups of states: those with average scores around 900 and those around 1050.

Bivariate Scatterplot

To what degree do per pupil spending (expend) and teacher salary (salary) explain this variability?


# SAT vs. teacher salary
ggplot(education, aes(y = sat, x = salary)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  theme_classic() +
  theme(text = element_text(size = 20))

# SAT vs. per pupil spending
ggplot(education, aes(y = sat, x = expend)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  theme_classic() +
  theme(text = element_text(size = 20))

Is there anything that surprises you in the above plots? What are the relationship trends? Discuss as a group.

Example: Three Variables

Let’s make a single scatterplot visualization that demonstrates the relationship between sat, salary, and expend.

Thoughts:

1. We could use the color or size aesthetics to incorporate the expenditure data.

2. Include some model smooths with geom_smooth() to help highlight the trends.

ggplot(education, aes(y = sat, x = salary, color = expend)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") + theme_classic()

Scatterplot of average SAT scores against teacher salary, with points colored by expenditure, across U.S. states in the mid-1990s. There seems to be a high correlation between expenditure and salary, and both seem to be negatively correlated with SAT scores.

ggplot(education, aes(y = sat, x = salary)) +
  geom_point(aes(size = expend)) +
  geom_smooth(se = FALSE, method = "lm") + theme_classic()

Scatterplot of average SAT scores against teacher salary, with point size showing expenditure, across U.S. states in the mid-1990s. There seems to be a high correlation between expenditure and salary, and both seem to be negatively correlated with SAT scores.
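The correlations described in these captions can also be checked numerically. The sketch below is only an illustration: it uses the six rows printed in the data preview above (not all states), and the name sat_head is made up for this example. Swap in the full education data frame to check the claim for every state.

```r
# The six rows shown in "The first few rows of the SAT data".
# This is a preview only, so treat the numbers as an illustration.
sat_head <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  expend = c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443),
  salary = c(31.144, 47.951, 32.175, 28.934, 41.078, 34.571),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)

# Pairwise Pearson correlations among the three quantitative variables
cor_mat <- cor(sat_head[, c("expend", "salary", "sat")])
round(cor_mat, 2)
```

In these six rows, expend and salary are positively correlated and both are negatively correlated with sat, which matches what the plots suggest.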

Example: Three Variables

Another option!

Categorize your 3rd Quantitative Variable!

education %>%
  mutate(expendCat = cut(expend, 3)) %>%  # split expend into 3 equal-width bins
  ggplot(aes(y = sat, x = salary, color = expendCat)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  theme_classic()

Scatterplot of average SAT scores against teacher salary, with points colored by expenditure category, across U.S. states in the mid-1990s. There seems to be a high correlation between expenditure and salary, and both seem to be negatively correlated with SAT scores.

Example: Fraction who take SAT

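The fracCat variable in the education data categorizes the percentage of the state’s students that take the SAT into low (15% or less), medium (more than 15% and up to 45%), and high (more than 45%). These correspond to the labels (0,15], (15,45], and (45,100] in the data.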
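A categorical variable like fracCat can be built from a quantitative one with cut(). The sketch below is an assumption about how such a variable could be created, not necessarily how fracCat was made. It again uses only the six previewed rows, and the names sat_head and fracCat2 are made up for this example.

```r
# The six rows shown in "The first few rows of the SAT data"
sat_head <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  expend = c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443),
  frac   = c(8, 47, 27, 6, 45, 29)
)

# Explicit breaks: intervals are open on the left and closed on the right,
# so a state with frac = 45 lands in (15,45], not (45,100].
sat_head$fracCat2 <- cut(sat_head$frac, breaks = c(0, 15, 45, 100))

# A single number instead of breaks: cut(x, 3) splits the observed
# range of x into 3 equal-width intervals.
sat_head$expendCat <- cut(sat_head$expend, 3)

sat_head
table(sat_head$fracCat2)
```

For these six states, fracCat2 reproduces the fracCat values shown in the data preview.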

  1. Make a univariate visualization of the fracCat variable to better understand how many states fall into each category.
ggplot(education, aes(x = fracCat)) +
  geom_bar() + theme_classic()

Barplot of the fraction of the state's students that take the SAT, categorized into low, medium, and high. In most states, either more than 45% or at most 15% of students take the SAT; there are not many states with participation in the middle.

Example: Fraction who take SAT

  2. Make a bivariate visualization that demonstrates the relationship between fracCat and sat. What story does your graphic tell?
ggplot(education, aes(x = fracCat, y = sat)) +
  geom_boxplot() + theme_classic()

Boxplot of average SAT scores by state participation in the SAT, categorized as low, medium, or high. Average SAT is higher among states with lower participation because of self-selection in who takes the SAT.

Example: Fraction who take SAT

  3. Make a trivariate visualization that demonstrates the relationship between fracCat, sat, and expend. Incorporate fracCat as the color of each point, and use a single call to geom_smooth to add three trendlines (one for each fracCat). What story does your graphic tell?
ggplot(education, aes(color = fracCat, y = sat, x = expend)) +
  geom_point() + geom_smooth(se = FALSE, method = 'lm') + theme_classic()

Scatterplot of expenditure and average SAT scores by student participation within the state. There is a slight positive relationship between expenditure and SAT scores once you account for the student participation.

Example: Fraction who take SAT

  4. Putting all of this together, explain this example of Simpson’s Paradox. That is, why does it appear that SAT scores decrease as spending increases even though the opposite is true?

Discuss!
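To see how the paradox can arise mechanically, here is a toy sketch with made-up numbers (not the SAT data). Think of x as spending, y as the average score, and group as the participation category. All names and values are hypothetical.

```r
# Made-up data: within every group y rises by exactly 1 per unit of x,
# but the groups with larger x values start from a lower baseline.
toy <- data.frame(
  group = rep(c("low", "medium", "high"), each = 5),
  x     = c(1:5, 4:8, 7:11),
  y     = c(100 + 1:5, 80 + 4:8, 60 + 7:11)
)

# Slope of the trend line when all groups are pooled together
pooled_slope <- coef(lm(y ~ x, data = toy))[["x"]]

# Slope of the trend line fit separately within each group
within_slopes <- sapply(
  split(toy, toy$group),
  function(d) coef(lm(y ~ x, data = d))[["x"]]
)

pooled_slope   # negative
within_slopes  # positive in every group
```

The pooled trend is negative only because group membership is tied to both x and y. Once we compare like with like, the trend within each group is positive.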

Other Multivariate Visualization Techniques

After class, I want you to look through the heat maps and star plots. I want you to reflect on the insight you gain from the different plots.

Handmade Visualizations

Let’s go to the Google Doc for the instructions.

Your task: Create a visualization based on the data provided, using any materials available.

Handmade Visualizations - Data

Name               Area (acres)  Max depth (feet)  Watershed area (acres)  Chain of lakes  Longitude   Latitude   City
Bde Maka Ska       401           87                2992                    Yes             -93.311883  44.941966  Minneapolis
Lake Harriet       335           85                1139                    Yes             -93.304514  44.921725  Minneapolis
Lake Nokomis       204           33                869                     No              -93.241582  44.908678  Minneapolis
Cedar Lake         170           51                1956                    Yes             -93.321751  44.959361  Minneapolis
Lake of the Isles  109           31                735                     Yes             -93.306507  44.955482  Minneapolis
Lake Hiawatha      54            33                1734                    No              -93.236044  44.920906  Minneapolis
Lake Como          71            15                1783                    No              -93.140153  44.979637  St Paul
Lake Phalen        198           91                14720                   No              -93.053102  44.986744  St Paul