Introduction to Data Visualization

Brianna Heggeseth

Announcements

  • Check out the feedback template
    • You should have gotten an individual version (shared with you + me)
    • Any immediate questions?
  • Great job on Scavenger Hunt!

Learning Goals

  • Understand the Grammar of Graphics
  • Use ggplot2 to create basic layers of graphics
  • Understand the different basic univariate visualizations for categorical and quantitative variables

Benefits of Visualizations

Visualizations help us understand what data we’re working with:

  • What are the scales of our variables?
  • Are there any outliers, i.e. unusual cases?
  • What are the patterns among our variables?

This understanding will inform our next steps:

  • What method of analysis / model is appropriate?


Once our analysis is complete, visualizations are a powerful way to communicate our findings and tell a story.

Glyphs

In its original sense, in archaeology, a glyph is a carved symbol.

[Images: hieroglyph; Mayan glyph]

Data Glyph

A data glyph is also a mark, e.g.

The features of a data glyph encode the values of variables.

  • Some are very simple, e.g. a dot:
  • Some combine different elements, e.g. a pointrange:
  • Some are complicated, e.g. a dotplot:

Components of Graphics

  • frame: The position scale describing how data are mapped to x and y

  • glyph: The basic graphical unit that represents one case (also known as a mark or symbol).

Blood pressure readings from a random subset of the NHANES data set.

Components of Graphics

  • aesthetic: a visual property of a glyph such as position, size, shape, color, etc.

    • may be mapped based on data values: smoker -> color
    • may be set to particular non-data related values: color is black

Components of Graphics

  • facet: a subplot that shows one subset of the data

    • rather than represent sex by shape, we could split into two subplots

Components of Graphics

  • scale: A mapping that translates data values into aesthetics.

    • example: never-> pink; former-> aqua; current-> green
  • guide: An indication for the human viewer of the scale. This allows the viewer to translate aesthetics back into data values.

Eye Training for the Layered Grammar of Graphics

Each group will be assigned one NY Times graphic to look at; the list is on the course website.


If you haven’t already, follow the instructions to gain access (paid for by Macalester MCGS) to content at NYTimes.com.

Exercise: Basic questions to ask of a data graphic

For your assigned graphic, discuss the following seven questions with your partner(s):

  1. What variables constitute the frame?
  2. What glyphs are used?
  3. What are the aesthetics for those glyphs?
  4. Which variable is mapped to each aesthetic?
  5. Which variable, if any, is used for faceting?
  6. Which scales are displayed with a guide?
  7. What raw data would be required for this plot, and what form should it be in?

Glyph-Ready Data

Glyph-ready data has this form:

  • There is one row for each glyph to be drawn.
  • The variables in that row are mapped to aesthetics of the glyph (including position).

  sbp  dbp  sex     smoker
  112   55  male    former
  144   84  male    never
  143   84  female  never
  110   62  female  never
  121   72  female  never
  129   60  female  never
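
As a minimal sketch, the six rows shown above can be typed in directly as a glyph-ready data frame. The name Tmp matches the data set name used in the ggplot code on later slides; how the original Tmp was actually sampled from NHANES is not shown here, so this hand-built version is only an assumption for illustration.

```r
# Hand-built stand-in for Tmp: one row per glyph (here, one person per point).
# Values are copied from the six rows shown on the slide.
Tmp <- data.frame(
  sbp    = c(112, 144, 143, 110, 121, 129),   # systolic blood pressure
  dbp    = c(55, 84, 84, 62, 72, 60),         # diastolic blood pressure
  sex    = c("male", "male", "female", "female", "female", "female"),
  smoker = c("former", "never", "never", "never", "never", "never")
)

# Each column is a candidate for an aesthetic (position, shape, color, ...).
str(Tmp)
```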

Data Visualization Workflow + ggplot

Layers – Building up Complex Plots

Using the ggplot2 package, we can create graphics by building up layers, each of which may have its own data, glyphs, aesthetic mapping, etc.

Base Layer

The first layer just identifies the data set. It sets up a blank canvas, but does not actually plot anything:

ggplot(data = Tmp)
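
One way to see that the base layer draws nothing is to store the plot and count its layers. This is a sketch, assuming Tmp is a hand-built stand-in made from the six rows in the Glyph-Ready Data table; the object name base_plot is hypothetical.

```r
library(ggplot2)

# Hand-built stand-in for Tmp (assumption: the six rows from the slide).
Tmp <- data.frame(
  sbp    = c(112, 144, 143, 110, 121, 129),
  dbp    = c(55, 84, 84, 62, 72, 60),
  sex    = c("male", "male", "female", "female", "female", "female"),
  smoker = c("former", "never", "never", "never", "never", "never")
)

# The base layer only attaches the data; no geometry has been added yet.
base_plot <- ggplot(data = Tmp)

# Number of layers added so far.
n_layers <- length(base_plot$layers)
n_layers
```

Printing base_plot would show an empty canvas, since there are no glyphs to draw until a geom is added.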

Geometry Layer

Next, we add a geometry layer to identify the mapping of data to aesthetics for each of the glyphs:

ggplot(data = Tmp) +
  geom_point(mapping = aes(x = sbp, y = dbp, shape = sex, color = smoker), size = 5, alpha = .8)
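
The difference between mapping an aesthetic to a variable and setting it to a fixed value can be sketched as follows. Aesthetics inside aes() are mapped from data; arguments outside aes() (like size and alpha above) are set to constants. Tmp here is a hand-built stand-in using the six rows from the Glyph-Ready Data table, and the object names p_mapped and p_set are hypothetical.

```r
library(ggplot2)

# Hand-built stand-in for Tmp (assumption: the six rows from the slide).
Tmp <- data.frame(
  sbp    = c(112, 144, 143, 110, 121, 129),
  dbp    = c(55, 84, 84, 62, 72, 60),
  sex    = c("male", "male", "female", "female", "female", "female"),
  smoker = c("former", "never", "never", "never", "never", "never")
)

# Mapped: color depends on the data (smoker -> color), so a legend is drawn.
p_mapped <- ggplot(data = Tmp) +
  geom_point(mapping = aes(x = sbp, y = dbp, color = smoker), size = 5)

# Set: color is a constant unrelated to the data, so no legend is needed.
p_set <- ggplot(data = Tmp) +
  geom_point(mapping = aes(x = sbp, y = dbp), color = "black", size = 5)
```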

Guide Layer

Next, we can add some axis labels as guides:

ggplot(data = Tmp) +
  geom_point(mapping = aes(x = sbp, y = dbp, shape = sex, color = smoker), size = 5, alpha = .8) +
  xlab("Systolic BP") + ylab("Diastolic BP")
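
As an alternative sketch, labs() can set the axis labels and the legend titles in one call. The legend titles "Smoking status" and "Sex" are made-up wording for illustration, and Tmp is a hand-built stand-in using the six rows from the Glyph-Ready Data table.

```r
library(ggplot2)

# Hand-built stand-in for Tmp (assumption: the six rows from the slide).
Tmp <- data.frame(
  sbp    = c(112, 144, 143, 110, 121, 129),
  dbp    = c(55, 84, 84, 62, 72, 60),
  sex    = c("male", "male", "female", "female", "female", "female"),
  smoker = c("former", "never", "never", "never", "never", "never")
)

# labs() names each guide by the aesthetic it describes.
p_labeled <- ggplot(data = Tmp) +
  geom_point(mapping = aes(x = sbp, y = dbp, shape = sex, color = smoker),
             size = 5, alpha = 0.8) +
  labs(x = "Systolic BP", y = "Diastolic BP",
       color = "Smoking status", shape = "Sex")
```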

Scale Layer

We can change the scale of the color used for smoker status:

ggplot(data = Tmp) +
  geom_point(mapping = aes(x = sbp, y = dbp, shape = sex, color = smoker), size = 5, alpha = .8) +
  xlab("Systolic BP") + ylab("Diastolic BP") +
  scale_color_manual(values = c("#F8766D", "#00BFC4", "#00BA38"))
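
With an unnamed vector, scale_color_manual() assigns colors in the order of the levels of smoker, which is alphabetical for a character column. A sketch of one way to pin down the never / former / current mapping from the scale slide is to name the values. Tmp is a hand-built stand-in using the six rows from the Glyph-Ready Data table (it happens to contain no current smokers), and smoker_colors is a hypothetical helper name.

```r
library(ggplot2)

# Hand-built stand-in for Tmp (assumption: the six rows from the slide).
Tmp <- data.frame(
  sbp    = c(112, 144, 143, 110, 121, 129),
  dbp    = c(55, 84, 84, 62, 72, 60),
  sex    = c("male", "male", "female", "female", "female", "female"),
  smoker = c("former", "never", "never", "never", "never", "never")
)

# Named values: each data value is matched to its color by name, not by order.
smoker_colors <- c(never = "#F8766D", former = "#00BFC4", current = "#00BA38")

p_scaled <- ggplot(data = Tmp) +
  geom_point(mapping = aes(x = sbp, y = dbp, shape = sex, color = smoker),
             size = 5, alpha = 0.8) +
  xlab("Systolic BP") + ylab("Diastolic BP") +
  scale_color_manual(values = smoker_colors)
```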

Facet Layer

If instead we wanted to facet into columns based on smoker status, we could add another layer for that:

ggplot(data = Tmp) +
  geom_point(mapping = aes(x = sbp, y = dbp, shape = sex, color = smoker), size = 5, alpha = .8) +
  xlab("Systolic BP") + ylab("Diastolic BP") +
  scale_color_manual(values = c("#F8766D", "#00BFC4", "#00BA38")) +
  facet_grid(. ~ smoker)
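
The facet slide suggested splitting by sex rather than representing it by shape. A minimal sketch of that variant drops shape from aes() and facets on sex instead. Tmp is a hand-built stand-in using the six rows from the Glyph-Ready Data table, and p_by_sex is a hypothetical name.

```r
library(ggplot2)

# Hand-built stand-in for Tmp (assumption: the six rows from the slide).
Tmp <- data.frame(
  sbp    = c(112, 144, 143, 110, 121, 129),
  dbp    = c(55, 84, 84, 62, 72, 60),
  sex    = c("male", "male", "female", "female", "female", "female"),
  smoker = c("former", "never", "never", "never", "never", "never")
)

# One column of subplots per value of sex; shape is no longer mapped.
p_by_sex <- ggplot(data = Tmp) +
  geom_point(mapping = aes(x = sbp, y = dbp, color = smoker),
             size = 5, alpha = 0.8) +
  xlab("Systolic BP") + ylab("Diastolic BP") +
  facet_grid(. ~ sex)
```

In facet_grid(), the formula is rows ~ columns, so . ~ sex places the panels side by side and sex ~ . would stack them.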

Getting Started

There’s no end to the number and type of visualizations you could make.

https://datavizproject.com/

  • Ask the data questions. Think about what insight each visual gives you.
  • Start with the basics and work incrementally.
  • Focus. Pick out a focused yet comprehensive set of visualizations.

Work Time until 4:10

Continue working through the activity in the template Rmd.

Work with each other and support one another.

Get immediate feedback with the Solutions online.

What is Tidy Tuesday?

A weekly data visualization project put on by some folks from the R Data Science community.

Tidy Tuesday Assignment

Tidy Tuesday Assignment on Moodle

Instructions for Tidy Tuesday

You must complete 3 Tidy Tuesday assignments, at a minimum.

  • You must complete at least one by Oct 13 (TT5).
  • They are posted Tuesday evening and due Friday night.
  • You should spend only about 1-2 hours on each one.

TT1

Let’s open up

Rest of Class

You may choose to continue working on the Data Viz activity or try TT1!

  • Assignment 2 is due next Wednesday on Moodle.
  • If you choose to do TT1, it is due Friday on Moodle.

After Class

Continue to work through the activity in the Rmd file.


You’ll submit a knitted html file of that template Rmd file for Assignment 2 (due next Wednesday).

  • We’ll give feedback on Exercises 2.1-2.5.