2.4 One Categorical Variable

First, we consider survey data drawn from the electoral register in Whickham in the UK (Source: Appleton et al 1996). The survey was conducted in 1972-1974 to study heart disease and thyroid disease, and a few baseline characteristics were collected: age and smoking status. Twenty years later, a follow-up was done to record mortality status (alive/dead).

Let’s first consider the age distribution of this sample. Age, depending on how it is measured, could act as a quantitative variable or a categorical variable. Here, age is quantitative because it is recorded to the nearest year. But, for illustrative purposes, let’s create a categorical variable by separating age into intervals.

Distribution: the way something is spread out (the way in which values vary).

# Note: anything to the right of a hashtag is a comment and is not evaluated as R code

library(dplyr) # Load the dplyr package
library(ggplot2) # Load the ggplot2 package
library(mosaicData) # Load the mosaicData package, which contains the Whickham data
data(Whickham) # Load the Whickham data set

# Create a new categorical variable with 4 categories based on age
Whickham <- Whickham %>%
    mutate(ageCat = cut(age, 4)) 

head(Whickham)
##   outcome smoker age      ageCat
## 1   Alive    Yes  23 (17.9,34.5]
## 2   Alive    Yes  18 (17.9,34.5]
## 3    Dead    Yes  71 (67.5,84.1]
## 4   Alive     No  67   (51,67.5]
## 5   Alive     No  64   (51,67.5]
## 6   Alive    Yes  38   (34.5,51]

What do you lose when you convert a quantitative variable to a categorical variable? What do you gain?
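
To see how interval boundaries turn ages into categories, here is a minimal sketch that applies cut() to the six ages printed by head() above, using explicit break points instead of a number of intervals. The break points and labels are assumptions chosen for illustration; they are not the cut points R picked for ageCat.

```r
# Ages of the first six people shown in the head() output above
ages <- c(23, 18, 71, 67, 64, 38)

# Hypothetical break points chosen for illustration.
# By default each interval is open on the left and closed on the right,
# so an age of 67 falls in (51,67], not (67,84].
age_cat <- cut(ages,
               breaks = c(17, 34, 51, 67, 84),
               labels = c("18-34", "35-51", "52-67", "68-84"))

# Each age next to its category
data.frame(age = ages, ageCat = age_cat)

# Count of people in each category
table(age_cat)
```

Passing a single number, as in cut(age, 4), instead asks R to split the observed range of ages into that many equal-width intervals, which is why the cut points in ageCat are not round numbers.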

2.4.1 Bar Plot

One of the best ways to show the distribution of one categorical variable is with a bar plot. For a bar plot,

  • The height of the bars is the only part that encodes the data (width is meaningless).
  • The height can either represent the frequency (count of units) or the relative frequency (proportion of units).

# Numerical summary (frequency and relative frequency)
Whickham %>%
    count(ageCat) %>%
    mutate(relfreq = n / sum(n)) 
##        ageCat   n   relfreq
## 1 (17.9,34.5] 408 0.3105023
## 2   (34.5,51] 367 0.2792998
## 3   (51,67.5] 347 0.2640791
## 4 (67.5,84.1] 192 0.1461187

# Graphical summary (bar plot)
Whickham %>%
    ggplot(aes(x = ageCat)) + 
    geom_bar(fill="steelblue") + 
    labs(x = 'Age Categories in Years', y = 'Counts') + 
    theme_classic()

What do you notice? What do you wonder?
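
The bar plot above uses counts for the bar heights. As a sketch of the other option, relative frequencies, the code below rebuilds the summary table from the counts printed in the numerical summary and plots the proportions. The object names age_labels, age_summary, and relfreq_plot are made up for this illustration.

```r
library(ggplot2) # Load the ggplot2 package

# Age categories and counts, copied from the numerical summary above
age_labels <- c("(17.9,34.5]", "(34.5,51]", "(51,67.5]", "(67.5,84.1]")
age_summary <- data.frame(
    ageCat = factor(age_labels, levels = age_labels), # keep the age order
    n = c(408, 367, 347, 192)
)

# Relative frequency = count in the category / total count
age_summary$relfreq <- age_summary$n / sum(age_summary$n)

# geom_col() draws bars whose heights are the y values we supply,
# whereas geom_bar() counts the rows in each category itself
relfreq_plot <- ggplot(age_summary, aes(x = ageCat, y = relfreq)) +
    geom_col(fill = "steelblue") +
    labs(x = 'Age Categories in Years', y = 'Proportion') +
    theme_classic()

relfreq_plot # Display the plot
```

The bars keep the same heights relative to one another as in the count version; only the scale on the y-axis changes.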

2.4.2 Pie Chart

Pie charts are only useful if you have 2 to 3 possible categories and you want to show relative group sizes.

This is the best use for a pie chart:

We are intentionally not showing you how to make a pie chart because a bar plot is a better choice.

Here is a good summary of why many people strongly dislike pie charts: http://www.businessinsider.com/pie-charts-are-the-worst-2013-6. Keep in mind Visualization Principle #4: Facilitate Comparisons. We are much better at comparing heights of bars than areas of slices of a pie chart.