Multivariate Visualizations + Idea Lab

Brianna Heggeseth

Switch it Up

Sit with someone new today!

Introduce yourself
Share best or worst part of the week so far

Announcements

Every week in MSCS

Thursday 11:15am Coffee Break in Smail Gallery

This week in class

TT2 due tomorrow

Today we’ll practice discussing the “insights” we gain from our visualizations. Then we’ll create some visuals by hand!

Learning Goals

Understand how we can use additional aesthetics, such as color and size, to incorporate a third variable (or more) into a bivariate plot.
Develop comfort with interpreting heat maps and star plots, which allow you to look for patterns of variation across many variables.

More Aesthetic Attributes

To go beyond 2 variables, we need to add aesthetics for each new variable!

Data: Exploring SAT Scores

Though far from a perfect assessment of academic preparedness, SAT scores have historically been used as one measurement of a state’s education system.

library(tidyverse)
education <- read.csv("https://bcheggeseth.github.io/112_fall_2023/data/sat.csv")

The first few rows of the SAT data.
State	expend	ratio	salary	frac	verbal	math	sat	fracCat
Alabama	4.405	17.2	31.144	8	491	538	1029	(0,15]
Alaska	8.963	17.6	47.951	47	445	489	934	(45,100]
Arizona	4.778	19.3	32.175	27	448	496	944	(15,45]
Arkansas	4.459	17.1	28.934	6	482	523	1005	(0,15]
California	4.992	24.0	41.078	45	417	485	902	(15,45]
Colorado	5.443	18.4	34.571	29	462	518	980	(15,45]

Data: Codebook

Codebook for SAT data. Source: https://www.macalester.edu/~kaplan/ISM/datasets/data-documentation.pdf

Univariate Density

Variability in average SAT scores from state to state:

ggplot(education, aes(x = sat)) +
  geom_density(fill = "blue", alpha = .5) + theme_classic()

Density plot of average SAT scores across U.S. states in the mid-1990s. There are two groups of states: those with averages around 900 and those with averages around 1050.

Bivariate Scatterplot

To what degree do per-pupil spending (expend) and teacher salary (salary) explain this variability?

ggplot(education, aes(y = sat, x = salary)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") + theme_classic() +
  theme(text = element_text(size = 20))

ggplot(education, aes(y = sat, x = expend)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") + theme_classic() +
  theme(text = element_text(size = 20))

Is there anything that surprises you in the above plots? What are the relationship trends? Discuss as a group.

Example: Three Variables

Let’s make a single scatterplot visualization that demonstrates the relationship between sat, salary, and expend.

Thoughts:

1. We could use the color or size aesthetics to incorporate the expenditure data.

2. Include some model smooths with geom_smooth() to help highlight the trends.

ggplot(education, aes(y = sat, x = salary, color = expend)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") + theme_classic()

Scatterplot of average SAT scores against teacher salary, colored by expenditure, across U.S. states in the mid-1990s. Expenditure and salary appear to be highly correlated, and both appear to be negatively correlated with SAT scores.

ggplot(education, aes(y = sat, x = salary)) +
  geom_point(aes(size = expend)) +
  geom_smooth(se = FALSE, method = "lm") + theme_classic()

Example: Three Variables

Another option!

Categorize your 3rd Quantitative Variable!

education %>%
  mutate(expendCat = cut(expend, 3)) %>%
  ggplot(aes(y = sat, x = salary, color = expendCat)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") + theme_classic()
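A note on what cut(expend, 3) does: it splits the range of the variable into three equal-width intervals, so the groups can end up very unequal in size. The sketch below uses only the six expend values from the table shown earlier (not the full data) and compares equal-width bins with quantile-based bins. The quantile-based version is an alternative offered here for comparison, not something the slides use.

```r
# expend for the first six states, copied from the table shown earlier
expend <- c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443)

# Equal-width bins: what cut(x, 3) produces
equal_width <- cut(expend, 3)

# Equal-count bins: cut at the 1/3 and 2/3 quantiles instead
# (an alternative for illustration, not the approach used in the slides)
equal_count <- cut(expend,
                   breaks = quantile(expend, probs = (0:3) / 3),
                   include.lowest = TRUE)

table(equal_width)  # counts per bin: 5, 0, 1
table(equal_count)  # counts per bin: 2, 2, 2
```

With equal-width bins, a single high-spending state (Alaska) gets a bin to itself, which is worth keeping in mind when reading the trend lines for each category.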

Example: Fraction who take SAT

The fracCat variable in the education data categorizes the fraction of the state’s students that take the SAT into low (at most 15%), medium (more than 15% and at most 45%), and high (more than 45%).
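The fracCat labels in the table, such as (0,15], look like the output of cut(). As a minimal sketch, assuming the cut points are 0, 15, 45, and 100 (inferred from the labels, not confirmed by the source), the categories for the first six states can be reproduced like this:

```r
# frac values for the first six states, copied from the table shown earlier
frac <- c(Alabama = 8, Alaska = 47, Arizona = 27,
          Arkansas = 6, California = 45, Colorado = 29)

# Assumed cut points (inferred from the fracCat labels)
frac_cat <- cut(frac, breaks = c(0, 15, 45, 100))

# Pair each state with its category
data.frame(frac, fracCat = frac_cat)
```

Because cut() closes intervals on the right by default, California (45%) falls in (15,45] rather than (45,100], which matches the table.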

Make a univariate visualization of the fracCat variable to better understand how many states fall into each category.

ggplot(education, aes(x = fracCat)) +
  geom_bar() + theme_classic()

Barplot of the fraction of the state's students that take the SAT, categorized into low, medium, and high. Most states have either more than 45% of students taking the SAT or at most 15%; there are not many states with participation in the middle.

Example: Fraction who take SAT

Make a bivariate visualization that demonstrates the relationship between fracCat and sat. What story does your graphic tell?

ggplot(education, aes(x = fracCat, y = sat)) +
  geom_boxplot() + theme_classic()

Boxplot of average SAT scores by state participation in the SAT, categorized as low, medium, or high. Average SAT scores are higher among states with lower participation because there is self-selection in who takes the SAT.

Example: Fraction who take SAT

Make a trivariate visualization that demonstrates the relationship between fracCat, sat, and expend. Incorporate fracCat as the color of each point, and use a single call to geom_smooth to add three trendlines (one for each fracCat). What story does your graphic tell?

ggplot(education, aes(color = fracCat, y = sat, x = expend)) +
  geom_point() + geom_smooth(se = FALSE, method = 'lm') + theme_classic()

Scatterplot of expenditure and average SAT scores by student participation within the state. There is a slight positive relationship between expenditure and SAT scores once you account for the student participation.

Example: Fraction who take SAT

Putting all of this together, explain this example of Simpson’s Paradox. That is, why does it appear that SAT scores decrease as spending increases even though the opposite is true?

Discuss!
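To make the mechanics concrete, here is a small sketch with made-up toy numbers (not the SAT data; the group names and values are hypothetical). Within each group the slope is positive, but the groups that spend more also score lower overall, so pooling everything flips the sign.

```r
# Made-up toy numbers (NOT the SAT data), built so the pattern is easy to see
toy <- data.frame(
  group  = rep(c("low", "medium", "high"), each = 3),
  expend = c(4, 5, 6,  5, 6, 7,  6, 7, 8),
  sat    = c(1050, 1060, 1070,  960, 970, 980,  890, 900, 910)
)

# Slope of sat on expend, ignoring the groups
overall_slope <- coef(lm(sat ~ expend, data = toy))[["expend"]]

# Slope of sat on expend within each group
group_slopes <- sapply(split(toy, toy$group), function(d) {
  coef(lm(sat ~ expend, data = d))[["expend"]]
})

overall_slope  # negative
group_slopes   # positive in every group
```

In the SAT example, fracCat plays the role of the group: it is related to both spending and scores, so leaving it out changes the apparent direction of the relationship.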

Other Multivariate Visualization Techniques

After class, I want you to look through the heat maps and star plots. I want you to reflect on the insight you gain from the different plots.
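If you want to try these two plot types yourself, here is a minimal sketch using base R's heatmap() and stars() on the first six rows of the SAT data. The object names are hypothetical, and the course materials may build these plots differently.

```r
# First six rows of the SAT data, copied from the table shown earlier
sat_head <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  expend = c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443),
  ratio  = c(17.2, 17.6, 19.3, 17.1, 24.0, 18.4),
  salary = c(31.144, 47.951, 32.175, 28.934, 41.078, 34.571),
  frac   = c(8, 47, 27, 6, 45, 29),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)

# Convert each variable to z-scores so colors are comparable across columns
scaled <- scale(as.matrix(sat_head[, -1]))
rownames(scaled) <- sat_head$State

# Heat map: one row per state, one column per variable
# (no reordering of rows or columns, no additional scaling)
heatmap(scaled, Rowv = NA, Colv = NA, scale = "none")

# Star plot: one star per state, one spoke per variable
# (stars() rescales each variable to the range 0 to 1 by default)
stars(sat_head[, -1], labels = as.character(sat_head$State))
```

In both plots, look for states whose pattern across variables is similar, and for variables that tend to move together.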

Handmade Visualizations

Let’s go to the Google Doc for the instructions.

Your task: Create a visualization based on the data provided, using any materials available.

Handmade Visualizations - Data

Name	Area (acres)	Max depth (feet)	Watershed area (acres)	Chain of lakes	Longitude	Latitude	City
Bde Maka Ska	401	87	2992	Yes	-93.311883	44.941966	Minneapolis
Lake Harriet	335	85	1139	Yes	-93.304514	44.921725	Minneapolis
Lake Nokomis	204	33	869	No	-93.241582	44.908678	Minneapolis
Cedar Lake	170	51	1956	Yes	-93.321751	44.959361	Minneapolis
Lake of the Isles	109	31	735	Yes	-93.306507	44.955482	Minneapolis
Lake Hiawatha	54	33	1734	No	-93.236044	44.920906	Minneapolis
Lake Como	71	15	1783	No	-93.140153	44.979637	St Paul
Lake Phalen	198	91	14720	No	-93.053102	44.986744	St Paul