Topic 10 Categorical Variables as Factors

Learning Goals

  • Understand the difference between a variable stored as a character vs. a factor
  • Be able to convert a character variable to a factor
  • Develop comfort in manipulating the order and values of a factor with the forcats package to improve summaries and visualizations.

You can download a template .Rmd of this activity here. Put this in a folder Day_10 in your COMP_STAT_112 folder.

Example: Grade Distribution

Grades <- read_csv("https://bcheggeseth.github.io/112_fall_2022/data/grades.csv")
Grades <- Grades %>%
  select(sid, sessionID, grade) %>%
  distinct(sid, sessionID, .keep_all = TRUE)

We will continue with the grades data from the previous activity. Here is a sample to remember what it looks like:

Table 10.1: Student grades.
sid sessionID grade
S31842 session2207 B+
S32436 session3172 S
S31671 session3435 A-
S31929 session3512 NC

Here is a bar chart of the grade distribution:

ggplot(Grades, aes(x = grade)) +
  geom_bar()

We can also wrangle a table that just has each grade and the number of times it appears:

GradeDistribution <- Grades %>%
  group_by(grade) %>%
  summarize(count = n())
# Alternatively, we can use the count() function the creates a variable called n
Grades %>%
  count(grade) 
Table 10.2: Grade distribution.
grade count
A 1506
A- 1381
AU 27
B 804
B- 330
B+ 1003
C 137
C- 52
C+ 167
D 18
D- 6
D+ 8
NC 17
S 388

What could be improved about this graphic and table?

The grades are listed alphabetically, which isn’t particularly meaningful. Why are they listed that way? Because the variable grade is a character string type:

class(Grades$grade)
## [1] "character"

When dealing with categorical variables that take a finite number of values (levels, formally), it is often useful to store the variable as a factor, and specify a meaningful order for the levels.

For example, when the entries are stored as character strings, we cannot use the levels command to see the full list of values:

levels(Grades$grade)
## NULL

Converting to factor

Let’s first convert the grade variable to a factor:

Grades <- Grades %>%
  mutate(grade = factor(grade))

Now we can see the levels:

levels(Grades$grade)
##  [1] "A"  "A-" "AU" "B"  "B-" "B+" "C"  "C-" "C+" "D"  "D-" "D+" "NC" "S"

Moreover, the forcats package (part of tidyverse) allows us to manipulate these factors. Its commands include the following.

Changing the order of levels

  • fct_relevel(): manually reorder levels
  • fct_infreq(): order levels from highest to lowest frequency
  • fct_reorder(): reorder levels by values of another variable
  • fct_rev(): reverse the current order

Changing the value of levels

  • fct_recode(): manually change levels
  • fct_lump(): group together least common levels

More details on these and other commands can be found on the forcats cheat sheet or in Wickham & Grolemund’s chapter on factors.


Example 10.1 (Reorder factors) Let’s reorder the grades so that they are in a more meaningful order for the bar chart above. Here are three options:


Option 1: From high grade to low grade, with “S” and “AU” at the end:

Grades %>%
  mutate(grade = fct_relevel(grade, c("A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "NC", "S", "AU"))) %>%
  ggplot(aes(x = grade)) +
  geom_bar()

Option 2: In terms of ascending frequency:

ggplot(GradeDistribution) +
  geom_col(aes(x = fct_reorder(grade, count), y = count)) +
  labs(x = "grade")

Option 3: In terms of descending frequency:

ggplot(GradeDistribution) +
  geom_col(aes(x = fct_reorder(grade, count, .desc = TRUE), y = count)) +
  labs(x = "grade")

Example 10.2 (Recode factors) Because it may not be clear what “AU” and “S” stand for, let’s rename them to “Audit” and “Satisfactory”.

Grades %>%
  mutate(grade = fct_relevel(grade, c("A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "NC", "S", "AU"))) %>%
  mutate(grade = fct_recode(grade, "Satisfactory" = "S", "Audit" = "AU")) %>%
  ggplot(aes(x = grade)) +
  geom_bar()

Exercise 10.1 Now that you’ve developed your data visualization and wrangling skills,

  1. develop a research question to address with the grades and courses data,
  2. create a high quality visualization that addresses your research question,
  3. write a brief description of the visualization and include the insight you gain about the research question.
Courses <- read_csv("https://bcheggeseth.github.io/112_fall_2022/data/courses.csv")

Appendix: R Functions

Changing the order of levels

Function/Operator Action Example
fct_relevel() manually reorder levels of a factor Grades %>% mutate(grade = fct_relevel(grade, c("A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "NC", "S", "AU")))
fct_infreq() order levels from highest to lowest frequency ggplot(Grades) + geom_bar(aes(x = fct_infreq(grade)))
fct_reorder() reorder levels by values of another variable ggplot(GradeDistribution) + geom_col(aes(x = fct_reorder(grade, count), y = count))
fct_rev() reverse the current order ggplot(Grades) + geom_bar(aes(x = fct_rev(fct_infreq(grade))))

Changing the value of levels

Function/Operator Action Example
fct_recode() manually change levels Grades %>% mutate(grade = fct_recode(grade, "Satisfactory" = "S", "Audit" = "AU"))
fct_lump() group together least common levels Grades %>% mutate(grade = fct_lump(grade, n = 5))