---
title: "Univariate visualization and summaries (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')
```


::: {.callout-note title="Organize your files"}

This **qmd file** is where you'll type notes, code, etc.
Directions:

- Save this file in the `inclass_activities` sub-folder of the `stat155` folder you created before today's class. Use a file name related to the activity number and/or today's date (eg: "activity 2" or "2 univariate analysis").
- "Render" the qmd into an **HTML file** using the button in the menu bar at the top of this file. Scroll through and check out how the qmd and HTML correspond. Neat!
- For Practice sets and other assignments, you'll need to submit an HTML *file*. Let's practice finding & checking that file.
    - Find this HTML in your laptop files (not in RStudio). To find it, navigate to the `inclass_activities` sub-folder within your `stat155` folder (or wherever you saved the qmd).
    - Open the HTML. It will pop up in your browser instead of within RStudio.
    - Make sure the HTML is correctly formatted.
:::


## Warm-up  {-}

**Guiding question** 

What anxieties have been on Americans' minds over the decades?


\


**Context** 

*Dear Abby* is America's longest-running advice column.
Started in 1956 by Pauline Phillips under the pseudonym Abigail Van Buren, the column continues to this day under the stewardship of her daughter Jeanne.
We'll explore the contents of these letters!


\
\

### EXAMPLE 1: Background

-   [The Pudding](https://pudding.cool/), a data journalism site, published a visual article called [30 Years of American Anxieties](https://pudding.cool/2018/11/dearabby/) exploring themes in Dear Abby letters from 1985 to 2017. Check out what they explored.

-   Go to the "Data and Method" section at the very end of the article. In thinking about the 5 W's + H (who, what, when, where, why, and how) of data context, what concerns/limitations surface with regard to using this data to learn about Americans' concerns over the decades?

\
\
\
\


::: {.callout-note title="REMINDER: RStudio = a hammer, You = a carpenter"}

During the first few weeks of the semester, you will learn most of the code we'll need for this class.
It will be a lot, and it may feel distracting at times!
You'll also make plenty of mistakes (which is a necessary part of learning). Throughout, remember:

-   RStudio = a hammer (simply a tool needed for statistical modeling that you’ll learn through lots of practice, trial, and error)

-   You = a carpenter (somebody with knowledge about designing statistical models that are useful and correct)
:::


\
\
\
\


### EXAMPLE 2: Import the data

Let's import the Dear Abby data collected by The Pudding.
The *codebook*, i.e. description of the variables, is available [here](https://github.com/Mac-STAT/data/blob/main/dear_abby_codebook.md).

```{r}
# Load the tidyverse package
# We'll need this to plot and summarize the data!
library(tidyverse)

# Read in the Dear Abby data
abby <- read_csv("https://mac-stat.github.io/data/dear_abby.csv")
```

Throughout this activity, **we'll work only with the most recent year of data, from 2017**:

```{r}
# Wrangle the Dear Abby data
# Ignore this code for now!
abby <- abby %>% 
  filter(year == 2017) %>% 
  mutate(month = month(month, label = TRUE)) %>%
  mutate(
    parents = str_detect(question_only, "mother|mama|mom|father|papa|dad"),
    marriage = str_detect(question_only, "marriage|marry|married"),
    money = str_detect(question_only, "money|finance")
  ) %>%
  rowwise() %>%
  mutate(
    themes = c(
      if (parents) "parents",
      if (marriage) "marriage",
      if (money) "money"
    ) %>% paste(collapse = ", "),
    themes = ifelse(themes == "", "other", themes)
  ) %>%
  ungroup() %>% 
  select(year, month, day, question_only, bing_pos, themes)
```

- Pull up the `abby` dataset from the Environment tab. The variables are:    
    - `year`, `month`, `day`: date
    - `question_only`: contents of the letter to Dear Abby
    - `bing_pos`: sentiment of the letter as measured by the proportion of the words that are *positive* (0-1)
    - `themes`: general themes of the letter (among the list that we defined)

- In this **tidy** dataset:   
    - *rows*    
        The *cases* or *units of observation* are single questions or letters.
    - *columns*   
        We have multiple *quantitative variables* (eg: `bing_pos`, `year`) and multiple *categorical* variables (eg: `month`, `themes`)


\
\
\
\

### EXAMPLE 3: Get to know the data using R code

NOTE: Code = communication.
We'll use `#` to *comment* our code.
This provides critical signposts for our future selves and others.

```{r}
# How many cases & variables are there?


```

```{r}
# Print out the first 6 rows


```

```{r}
# Print out the first 10 rows


```

```{r}
# Print out the variable / column labels


```


\
\
\
\


### EXAMPLE 4: Streamlining code with pipes

Compare the code & output in the following 2 chunks:

```{r}
# Apply the head() function to the abby data
head(abby)
```

```{r}
# Pipe the abby data into the head() function
abby %>% 
  head()
```


Compare the code & output in the following 2 chunks:


```{r}
# Apply the head() function to the abby data, then apply dim() to the result
dim(head(abby))
```

```{r}
# Pipe the abby data into the head() function
# Then pipe that into the dim() function
abby %>% 
  head() %>% 
  dim()
```

The second chunk in each pair uses the **pipe operator** `%>%`:

-   `object %>% function()` is the same as `function(object)`

-   Pipes allow us to build and communicate our code in a sequential, logical order.
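
To see the equivalence outside of the `abby` data, here is a minimal sketch using a small made-up vector `x` (an assumption for illustration, not part of the Dear Abby data). It computes the same quantity twice: once by nesting functions, and once by piping.

```{r}
# dplyr provides the %>% pipe (it's also loaded as part of the tidyverse)
library(dplyr)

# Toy vector, made up for illustration
x <- c(4, 9, 16)

# Nested version: read from the inside out
nested <- sum(sqrt(x))

# Piped version: read from top to bottom
piped <- x %>% 
  sqrt() %>% 
  sum()

# Do the two approaches agree?
nested == piped
```

The piped version lists the steps in the order they happen (take `x`, then take square roots, then sum), which is why we'll tend to prefer it as our code gets longer.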


\
\
\
\

## What's next?!?

This data seems interesting!
For example, it brings to mind the following questions:

- What `themes` are most common? Least?
- How do the `bing_pos` sentiment scores vary from letter to letter? What's a typical score? The spread in scores? The shape of the distribution in scores (e.g. are there clusters of negative vs positive letters, are the scores uniformly distributed across the range, etc)?

These are pretty straightforward *univariate* questions (having to do with 1 variable at a time, not the relationships among variables).

But it's impossible to answer these questions by just scanning the dataset.
We need to turn this data into information!!

```{r}
head(abby)
```

\
\
\
\

## Exercises {-}

**GOALS**

- Now that we have a sense for the structure of the data, you'll explore the trends, variability, and patterns in the `themes` and sentiments (`bing_pos`) of the Dear Abby questions using *numerical* and *visual* summaries.
- Practice "tidy" coding practices.

\
\


**DIRECTIONS**

You'll work on these exercises in your groups. 
Collaboration is a key learning goal in this course.

-   Why? Discussion & collaboration deepen your understanding of a topic while also improving confidence, communication, community, & more. (eg: Deeply learning a new language requires more than working through Duolingo alone. You need to talk with and listen to others!)

-   How? You are expected to:

    -   Use your group members' names & pronouns. It's ok to ask if you don’t remember!
    -   Actively contribute to discussion. Don't work on your own.
    -   Actively include all other group members in discussion.
    -   Create a space where others feel comfortable sharing ideas & questions.

-   We won't discuss these exercises as a class. With that, when you get stuck:    
    -   Carefully re-read the problem. Make sure you didn't miss any directions -- it can be tempting to skip words and go straight to an R chunk, but don't :).
    -   Discuss any questions you have with your group.
    -   If the question is unresolved by the group, ask the instructor!
    -   Remember that there are solutions in the online manual, at the bottom of the activity.


\
\
\
\


### Exercise 1: Data wrangling with `select` and `summarize` 

We'll typically need to **wrangle** our data throughout an analysis.
We'll use two wrangling "verbs" or functions from the `tidyverse` today: `select()` and `summarize()`.
Don't worry about memorizing anything -- just take note of how each function *works* & what output it *produces*.

```{r}
# ???
# NOTE: We're including head() just to check things out.
# Otherwise, all of the rows would be printed in your rendered HTML!
abby %>% 
  select(themes, bing_pos) %>% 
  head()
```

```{r}
# ???
abby %>% 
  summarize(mean(bing_pos))
```

```{r}
# ???
abby %>% 
  summarize(mean = mean(bing_pos, na.rm = TRUE))
```


\
\
\
\


### Exercise 2: Categorical variable summaries 

Let's explore the `themes` in the Dear Abby letters.
Since `themes` is a *categorical* variable, a simple **numerical summary** is provided by a *table of counts*:

```{r}
# Construct a table of counts
abby %>% 
  count(themes)
```

Since `themes` is a *categorical* variable, a simple **visual summary** is provided by a *bar plot*.
*Before making the plot*, share what you expect the plot might look like.
(Clearly defining your expectations first is good scientific practice to avoid confirmation bias.)

Now check your intuition!
Separately run each chunk below and add a comment (`#`) about what you observe.
The goal isn’t to memorize the code, but to start observing patterns in how the code works.

```{r}
# ???
abby %>% 
  ggplot(aes(x = themes))
```

```{r}
# ???
abby %>% 
  ggplot(aes(x = themes)) +
  geom_bar()
```

```{r}
# ???
abby %>% 
  ggplot(aes(x = themes)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```


**Reflection**

Describe what you learned about the `themes`.
For example, after "other", what are the 2 most common themes (or combination of themes)?
The least common?


\
\
\
\


### Exercise 3: Numerical summaries for quantitative variables 

Next let's explore the *quantitative* `bing_pos` variable which measures the sentiment of each question to Dear Abby.
Fill in the code to calculate some numerical summaries:

```{r}
# What's a typical sentiment?
# Calculate the mean & median (measures of central tendency)
___ %>% 
  ___(mean = mean(___, na.rm = TRUE), 
      median = median(___, na.rm = TRUE))
```

```{r}
# How varied are the sentiments?
# Calculate the min, max, & sd (measures of spread)
___ %>% 
  ___(minimum = min(___, na.rm = TRUE),
      maximum = max(___, na.rm = TRUE),
      stdev = sd(___, na.rm = TRUE))
```


```{r}
# What's the range of the middle 50% of sentiments, i.e. the interquartile range (IQR)?
# Calculate the 25th and 75th percentiles (i.e. 1st and 3rd quartiles)
abby %>% 
  summarize(first_q = quantile(bing_pos, 0.25, na.rm = TRUE),
            third_q = quantile(bing_pos, 0.75, na.rm = TRUE))
```
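
The IQR itself is the *difference* between the 75th and 25th percentiles. As a minimal sketch, the chunk below uses a small made-up vector `scores` (an assumption for illustration, not the real `bing_pos` values) to compare that difference with base R's built-in `IQR()` function.

```{r}
# Toy data, made up for illustration (not from the Dear Abby letters)
scores <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)

# 25th and 75th percentiles
q1 <- quantile(scores, 0.25)
q3 <- quantile(scores, 0.75)

# IQR "by hand": 75th percentile minus 25th percentile
q3 - q1

# IQR using the built-in function
IQR(scores)
```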


\
\
\
\


### Pause: Theory box

Below is an example of a "theory box".
You are *not* required to memorize, nor will you be assessed on, any formulas presented in this or any future theory box.
They serve 3 purposes:

1. To emphasize that there's "theory" / a formal structure behind what we're doing.
2. To give students who plan to *continue* studying Statistics a glimpse into the formal statistical theory they'll explore in later courses.
3. To satisfy students who are simply interested in the theory!


\
\


::: {.callout-note title="THEORY BOX: Univariate numerical summaries"}

Let $(y_1, y_2, ..., y_n)$ be a sample of $n$ data points.

mean: $$\overline{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum_{i=1}^n y_i}{n}$$

variance: $$\text{var}(y) = \frac{(y_1 - \overline{y})^2 + (y_2 - \overline{y})^2 + \cdots + (y_n - \overline{y})^2}{n - 1} = \frac{\sum_{i=1}^n (y_i - \overline{y})^2}{n - 1}$$

standard deviation: $$\text{sd}(y) = \sqrt{\text{var}(y)}$$
:::
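
For those curious, the formulas in the theory box can be checked against R's built-in functions. The sketch below uses a small made-up sample `y` (an assumption for illustration) and computes the mean, variance, and standard deviation "by hand", then compares each to `mean()`, `var()`, and `sd()`.

```{r}
# Toy sample, made up for illustration
y <- c(2, 4, 4, 6)
n <- length(y)

# Apply the theory box formulas directly
mean_by_hand <- sum(y) / n
var_by_hand  <- sum((y - mean_by_hand)^2) / (n - 1)
sd_by_hand   <- sqrt(var_by_hand)

# Compare to the built-in functions
all.equal(mean_by_hand, mean(y))
all.equal(var_by_hand, var(y))
all.equal(sd_by_hand, sd(y))
```

Note the `n - 1` in the denominator of the variance: this matches what `var()` and `sd()` compute.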


\
\
\
\


### Exercise 4: Visualizing a quantitative variable 

Let's complement the numerical summaries with visual summaries of `bing_pos`.
Before class, you learned about 3 possible visualizations for *quantitative* variables: boxplots, histograms, and density plots.

#### Part a (boxplots) 

Let's start with a boxplot.
Separately run each chunk below and add a comment (`#`) about what you observe.

```{r}
# ???
abby %>% 
  ggplot(aes(x = bing_pos))
```

```{r}
# ???
abby %>% 
  ggplot(aes(x = bing_pos)) +
  geom_boxplot()
```

**Reflection**

Make sure that you can connect the boxplot to 5 of your numerical summaries from the previous exercise:

- minimum = 0
- 25th percentile = 0.167
- median = 0.333
- 75th percentile = 0.5
- maximum = 1

\
\


#### Part b (histograms) 


Build a **histogram** of the `bing_pos` variable:

```{r}
abby %>% 
  ggplot(aes(x = bing_pos)) +
  geom_histogram()
```

**FOLLOW-UP QUESTION**

Roughly, what was the most common range of `bing_pos` scores (i.e. the range with the highest bar)?
Roughly how many letters scored in this range?

\
\

#### Part c (density plots) 

Using the code for other plots as a guide, try to make a **density** plot of the `bing_pos` variable:

```{r}
# Density plot


```

**FOLLOW-UP QUESTION**

This density plot is somewhat *tri-modal*, having 3-ish peaks.
These 3 peaks signal that there might be 3 common types of Dear Abby letters.
How would you "classify" or "label" these 3 types of letters?


\
\
\
\


### Exercise 5: What did we learn about sentiment? 

#### Part a 

You've built various numerical and visual summaries of `bing_pos`.
In words, summarize what these tell us about the sentiment of questions to Dear Abby.
Be sure to weave in information about:

-   central tendency (typical sentiment)
-   spread (variability in sentiments)
-   shape of the distribution
-   any outliers you observe


#### Part b 

Each of the 3 visualization approaches has pros and cons.

-   What is one pro about the boxplot in comparison to the histogram and density plots?

-   What is one con about the boxplot in comparison to the histogram and density plots?

-   In this example, which plot do you prefer and why?


\
\
\
\


### Exercise 6: Customizing plots 

Color is one of many ways to customize a plot.
It can be an effective tool (e.g. to highlight key features or to align a plot with the aesthetics of a report).
It can also be simply gratuitous and distracting.
Play around with the following chunks!

```{r}
# Recall our histogram
abby %>% 
  ggplot(aes(x = bing_pos)) +
  geom_histogram()
```

```{r}
# What does "color" do?
# Is this useful or gratuitous?
abby %>% 
  ggplot(aes(x = bing_pos)) +
  geom_histogram(color = "white")
```

```{r}
# What's the difference between "color" and "fill"?
# Is this useful or gratuitous?
abby %>% 
  ggplot(aes(x = bing_pos)) +
  geom_histogram(color = "blue", fill = "orange")
```

```{r}
# How do you think "color" and "fill" will work on the density plot?
# Try it! Modify the code below.
abby %>% 
  ggplot(aes(x = bing_pos)) +
  geom_density()
```

```{r}
# Check out the full set of colors: https://r-charts.com/colors/
# Pick 2 of them for your color and fill
abby %>% 
  ggplot(aes(x = bing_pos)) +
  geom_density(color = "___", fill = "___")
```


\
\
\
\

### Exercise 7: Looking ahead 

In this activity, you explored Dear Abby data, focusing on one variable at a time: `themes` then sentiments (`bing_pos`).
What further curiosities do you have about the data? 
Specifically, what *relationships* between the variables in the `abby` dataset might be interesting to explore?

\
\
\
\


### Reflection 

**Learning goals**

Go to the top of this file and review the learning goals for today's activity.

- Which do you have a good handle on?
- Which are you struggling with? What feels challenging right now?
- What are some wins from the day?

**Code**

In addition to exploring the learning goals, you learned some new code.

- If you haven't already, you're highly encouraged to start tracking and organizing new code in a cheat sheet (eg: a Google doc). This will be a handy reference for you, and the act of making it will help deepen your understanding and retention.

- Reflect upon the `ggplot()` function.   
    -  First, think about the first set of lines: `___ %>% ggplot(aes(x = ___))`
        - What does this first set of lines produce?
        - What information goes into the first argument (set of blanks)?
        - What information goes into the second argument, `x`?
        - What do you think `aes` is short for?
    - Next, think about the next set of lines: `___ %>% ggplot(aes(x = ___)) + ___`   
        - What's the purpose of putting a `+` at the end of the `ggplot()` line?
        - Why do you think we're using `+` instead of `%>%`?
        - What's the purpose of the next line (the one that comes after the `ggplot()` line)?


\
\
\
\


## Extra practice {-}

You're highly encouraged to work on some extra practice problems, either during or after class.

\
\


### Exercise 8: Import and get to know the weather data 

Daily weather data are available for 3 locations in Perth, Australia.
A codebook is [here](https://github.com/Mac-STAT/data/blob/main/weather_3_locations_codebook.md).

```{r}
# Import the data and name it "weather"
___ <- ___("https://mac-stat.github.io/data/weather_3_locations.csv")
```

Now check out the basic features of the `weather` data:

```{r}
# Examine the first six cases

# Find the dimensions of the data

```

What's the unit of observation in this data, i.e. what does a "case" represent?


\
\
\
\


### Exercise 9: Exploring rainfall 

The `raintoday` variable contains information about rainfall.

-   Is this variable quantitative or categorical?
-   Create an appropriate visualization of `raintoday`.
-   Compute appropriate numerical summaries of `raintoday`.
-   Reflect on the correspondence between the visual and numerical summaries.
-   What do you learn about rainfall in Perth?

```{r}
# Visualization

# Numerical summaries

```


\
\
\
\


### Exercise 10: Exploring temperature 

The `maxtemp` variable contains information on the daily high temperature.

-   Is this variable quantitative or categorical?
-   Create an appropriate visualization of `maxtemp`.
-   Compute appropriate numerical summaries of the central tendency in `maxtemp` and the variability in `maxtemp` from day to day.
-   Reflect on the correspondence between the visual and numerical summaries.
-   What do you learn about high temperatures in Perth?

```{r}
# Visualization

# Numerical summaries

```


\
\
\
\


### Exercise 11: Customizing! (CHALLENGE) 

Though you will naturally absorb some R code throughout the semester, being an effective statistical thinker and "programmer" does not require that we memorize *all* code.
That would be impossible!
Instead, using the *foundation* you built today, do some digging online to learn how to customize your visualizations.

#### Part a 

For the histogram below, add a *title* and more meaningful *axis labels*.
Specifically:

- title the plot "Distribution of max temperatures in Perth"
- change the x-axis label to "Maximum temperature"
- change the y-axis label to "Number of days"

HINT: Do a Google search for something like "add axis labels ggplot".

```{r}
# Add a title and axis labels
weather %>% 
  ggplot(aes(x = maxtemp)) + 
  geom_histogram()
```


#### Part b 

Check out the `ggplot2` [cheat sheet](https://rstudio.github.io/cheatsheets/data-visualization.pdf).
Try making some of the other kinds of univariate plots outlined there.

```{r}

```


#### Part c 

What else would you like to change about your plot? Try it!


\
\
\
\


### Exercise 12: Optional challenge 

At the top of this activity, we searched for words related to some topics of interest (`parents`, `marriage`, `money`) and combined them into a single `themes` variable.
It looked something like this:

```{r}
abby_new <- abby %>% 
  mutate(
    parents = str_detect(question_only, "mother|mama|mom|father|papa|dad"),
    marriage = str_detect(question_only, "marriage|marry|married"),
    money = str_detect(question_only, "money|finance")
  ) %>%
  rowwise() %>%
  mutate(
    themes = c(
      if (parents) "parents",
      if (marriage) "marriage",
      if (money) "money"
    ) %>% paste(collapse = ", "),
    themes = ifelse(themes == "", "other", themes)
  ) %>%
  ungroup()
```

Check it out:

```{r}
head(abby_new)
```


#### Part a 

Understand the code!

-   Inside `mutate()` the line `parents = str_detect(question_only, "mother|mama|mom|father|papa|dad")` created a new variable called `parents`. This variable takes on `TRUE` or `FALSE`. Explain what `TRUE` and `FALSE` mean here.

-   The `themes` variable combines the information from the `parents`, `marriage`, and `money` variables. Check out the `themes` for the first 3 rows / data points. Convince yourself that you understand how it corresponds to the `parents`, `marriage`, and `money` variables.
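
If it helps, here is a minimal sketch of how `str_detect()` behaves on a few made-up sentences (the `letters_toy` vector and the shortened pattern `"mom|dad"` are assumptions for illustration, not the real letters or the full pattern). In the pattern, `|` means "or".

```{r}
# stringr provides str_detect() (it's also loaded as part of the tidyverse)
library(stringr)

# Toy "letters", made up for illustration
letters_toy <- c(
  "My mom called yesterday",
  "Wait a moment",
  "Dad is visiting",
  "I need advice about my job"
)

# Does each sentence contain "mom" OR "dad"?
str_detect(letters_toy, "mom|dad")
```

Two things to look for in the output: matching is based on pieces of text rather than whole words (so "moment" matches "mom"), and it is case sensitive (so "Dad" does not match "dad"). Keep both in mind when interpreting the `themes` counts.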

#### Part b 

Beyond `parents`, `marriage`, and `money`, what are some other topics that might pop up in the Dear Abby letters (and that you're interested in exploring)?
Modify the code below to explore those topics!
Update the `themes` variable accordingly.

```{r}
abby_new <- abby %>% 
  mutate(
    parents = str_detect(question_only, "mother|mama|mom|father|papa|dad"),
    marriage = str_detect(question_only, "marriage|marry|married"),
    money = str_detect(question_only, "money|finance")
  ) %>%
  rowwise() %>%
  mutate(
    themes = c(
      if (parents) "parents",
      if (marriage) "marriage",
      if (money) "money"
    ) %>% paste(collapse = ", "),
    themes = ifelse(themes == "", "other", themes)
  ) %>%
  ungroup()

# Check out the raw data
head(abby_new)

# Check out the number of letters belonging to each theme
abby_new %>% 
  count(themes)
```


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!