Topic 5 Multivariate Visualizations

Learning Goals

Understand how we can use additional aesthetics such as color and size to incorporate a third (or more variables) to a bivariate plot
Develop comfort with interpreting heat maps and star plots, which allow you to look for patterns in variation in many variables.

You can download a template .Rmd of this activity here. Put this in a new folder called Assignment_04 in your folder for COMP_STAT_112.

Adding More Aesthetic Attributes

Exploring SAT Scores

Though far from a perfect assessment of academic preparedness, SAT scores have historically been used as one measurement of a state’s education system. The education data contain various education variables for each state:

education <- read.csv("https://bcheggeseth.github.io/112_spring_2023/data/sat.csv")

Table 5.1: The first few rows of the SAT data.
State	expend	ratio	salary	frac	verbal	math	sat	fracCat
Alabama	4.405	17.2	31.144	8	491	538	1029	(0,15]
Alaska	8.963	17.6	47.951	47	445	489	934	(45,100]
Arizona	4.778	19.3	32.175	27	448	496	944	(15,45]
Arkansas	4.459	17.1	28.934	6	482	523	1005	(0,15]
California	4.992	24.0	41.078	45	417	485	902	(15,45]
Colorado	5.443	18.4	34.571	29	462	518	980	(15,45]

A codebook is provided by Danny Kaplan who also made these data accessible:

Figure 5.1: Codebook for SAT data. Source: Danny Kaplan

To examine the variability in average SAT scores from state to state, let’s start with a univariate density plot:

ggplot(education, aes(x = sat)) +
  geom_density(fill = "blue", alpha = .5) + 
  theme_classic()

Density plot of average SAT scores across U.S. states in mid-1990s. There are two groups of states, those with about 900 and those around 1050.

The first question we’d like to answer is to what degree do per pupil spending (expend) and teacher salary explain this variability? We can start by plotting each against sat, along with a best fit linear regression model:

ggplot(education, aes(y = sat, x = salary)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") + 
  theme_classic()

ggplot(education, aes(y = sat, x = expend)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") + 
  theme_classic()

Scatter plot of average SAT scores against teacher salary across U.S. states in mid-1990s. There is a weak negative relationship.

Example 5.1 Is there anything that surprises you in the above plots? What are the relationship trends?

Solution

These seem to suggest that spending more money on students or teacher salaries correlates with lower SAT scores. Say it ain’t so!

Example 5.2 Make a single scatterplot visualization that demonstrates the relationship between sat, salary, and expend. Summarize the trivariate relationship between sat, salary, and expend.

Hints: 1. Try using the color or size aesthetics to incorporate the expenditure data. 2. Include some model smooths with geom_smooth() to help highlight the trends.

Solution

Below are four different plots that one could make. There seems to be a high correlation between expend and salary, and both seem to be negatively correlated with sat.

#plot 1
g1 <- ggplot(education, aes(y=sat, x=salary, color=expend)) + 
    geom_point() + 
    geom_smooth(se=FALSE, method="lm") + theme_classic()
    
#plot 2
g2 <- ggplot(education, aes(y=sat, x=salary, size=expend)) + 
    geom_point() + 
    geom_smooth(se=FALSE, method="lm") + theme_classic()

#plot 3
g3 <- ggplot(education, aes(y=sat, x=salary, color=cut(expend,2))) + 
    geom_point() + 
    geom_smooth(se=FALSE, method="lm") + theme_classic()

#plot 4
g4 <- ggplot(education, aes(y=sat, x=salary, color=cut(expend,3))) + 
    geom_point() + 
    geom_smooth(se=FALSE, method="lm") + theme_classic()

library(gridExtra)
grid.arrange(g1, g2, g3, g4, ncol=2)

## Warning: The following aesthetics were dropped during statistical
## transformation: colour
## ℹ This can happen when ggplot fails to infer the correct
##   grouping structure in the data.
## ℹ Did you forget to specify a `group` aesthetic or to
##   convert a numerical variable into a factor?

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2
## 3.4.0.
## ℹ Please use `linewidth` instead.

## Warning: The following aesthetics were dropped during statistical
## transformation: size
## ℹ This can happen when ggplot fails to infer the correct
##   grouping structure in the data.
## ℹ Did you forget to specify a `group` aesthetic or to
##   convert a numerical variable into a factor?

Four scatterplots of average SAT scores against teacher salary by expenditure across U.S. states in mid-1990s. There seems to be a high correlation between expenditure and salary, and both seem to be negatively correlated with SAT scores.

Example 5.3 The fracCat variable in the education data categorizes the fraction of the state’s students that take the SAT into low (below 15%), medium (15-45%), and high (at least 45%).

Make a univariate visualization of the fracCat variable to better understand how many states fall into each category.
Make a bivariate visualization that demonstrates the relationship between fracCat and sat. What story does your graphic tell?
Make a trivariate visualization that demonstrates the relationship between fracCat, sat, and expend. Incorporate fracCat as the color of each point, and use a single call to geom_smooth to add three trendlines (one for each fracCat). What story does your graphic tell?
Putting all of this together, explain this example of Simpson’s Paradox. That is, why does it appear that SAT scores decrease as spending increases even though the opposite is true?

Solution

ggplot(education, aes(x = fracCat)) +
  geom_bar() + theme_classic()

$Barplot of the fraction of the state's students that take the SAT, categorized into low, medium, and high. Most states have either at least 45% of the students take the SAT or less than 15%; there are not many states with participation in the middle.$

ggplot(education, aes(x = fracCat, y = sat)) +
  geom_boxplot() + theme_classic()

Boxplot of average SAT scores by state participation in the SAT, categorized as low, medium, or high. Average SAT is higher among states with lower participation as there is self selection in who takes the SAT.

ggplot(education, aes(color = fracCat, y = sat, x = expend)) +
  geom_point() + geom_smooth(se = FALSE, method = 'lm') + theme_classic()

Scatterplot of expenditure and average SAT scores by student participation within the state. There is a slight positive relationship between expenditure and SAT scores once you account for the student participation.

Among student participation tends to be lower among states with lower expenditures. Those same states tend to have higher SAT scores because of the self-selection of who participates; only those most prepared take the SAT in those states.

Other Multivariate Visualization Techniques

Heat maps

Note that each variable (column) is scaled to indicate states (rows) with high values (yellow) to low values (purple/blue). With this in mind you can scan across rows & across columns to visually assess which states & variables are related, respectively. You can also play with the color scheme. Type ?cm.colors in the console to see various options.

ed <- as.data.frame(education) # convert from tibble to data frame

# convert to a matrix with State names as the row names
row.names(ed) <- ed$State  #added state names as the row names rather than a variable
ed <- ed %>% select(2:8) #select the 2nd through 8th columns
ed_mat <- data.matrix(ed) #convert to a matrix format

heatmap.2(ed_mat,
  Rowv = NA, Colv = NA, scale = "column",
  keysize = 0.7, density.info = "none",
  col = hcl.colors(256), margins = c(10, 20),
  colsep = c(1:7), rowsep = (1:50), sepwidth = c(0.05, 0.05),
  sepcolor = "white", cexRow = 2, cexCol = 2, trace = "none",
  dendrogram = "none"
)

Exercise 5.1 What do you notice? What insight do you gain about the variation across U.S. states?

Heat map with row clusters

It can be tough to identify interesting patterns by visually comparing across rows and columns. Including dendrograms helps to identify interesting clusters.

heatmap.2(ed_mat,
  Rowv = TRUE, #this argument changed
  Colv = NA, scale = "column", keysize = .7,
  density.info = "none", col = hcl.colors(256),
  margins = c(10, 20),
  colsep = c(1:7), rowsep = (1:50), sepwidth = c(0.05, 0.05),
  sepcolor = "white", cexRow = 2, cexCol = 2, trace = "none",
  dendrogram = "row" #this argument changed
)

Exercise 5.2 What new insight do you gain about the variation across U.S. states, now that states are grouped and ordered by similarity?

Heat map with column clusters

We can also construct a heat map which identifies interesting clusters of columns (variables).

heatmap.2(ed_mat,
  Colv = TRUE, #this argument changed
  Rowv = NA, scale = "column", keysize = .7,
  density.info = "none", col = hcl.colors(256),
  margins = c(10, 20),
  colsep = c(1:7), rowsep = (1:50), sepwidth = c(0.05, 0.05),
  sepcolor = "white", cexRow = 2, cexCol = 2, trace = "none",
  dendrogram = "column" #this argument changed
)

Exercise 5.3 What new insight do you gain about the variation across U.S. states, now that variables are grouped and ordered by similarity?

Star plots

There’s more than one way to visualize multivariate patterns. Like heat maps, these star plot visualizations indicate the relative scale of each variable for each state. With this in mind, you can use the star maps to identify which state is the most “unusual.” You can also do a quick scan of the second image to try to cluster states. How does that clustering compare to the one generated in the heat map with row clusters above?

stars(ed_mat,
  flip.labels = FALSE,
  locations = data.matrix(as.data.frame(state.center)), #added external data to arrange by geo location
  key.loc = c(-70, 30), cex = 1
)

stars(ed_mat,
  flip.labels = FALSE,
  locations = data.matrix(as.data.frame(state.center)), #added external data to arrange by geo location
  key.loc = c(-70, 30), cex = 1, 
  draw.segments = TRUE #changed argument
)

Exercise 5.4 What new insight do you gain about the variation across U.S. states with the star plots (arranged geographically) as compared to heat plots?