---
title: "Logistic wrap-up + project time (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')

# Use a color blind friendly color palette throughout doc
library(tidyverse)
cb_palette <- c("#0072B2", "#D55E00","black", "#E69F00", "#56B4E9", "#009E73", "#F0E442",  "#CC79A7")
scale_colour_discrete <- function(...) scale_colour_manual(values = cb_palette, ...)
scale_fill_discrete   <- function(...) scale_fill_manual(values = cb_palette, ...)
theme_set(theme_bw())
```


\
\
\
\


## Warm-up {-}


Fake news is a problem.
Using data obtained from [https://www.kaggle.com/mdepak/fakenewsnet](https://www.kaggle.com/mdepak/fakenewsnet), and cleaned using code provided at [https://www.kaggle.com/kumudchauhan/fake-news-analysis-and-classification](https://www.kaggle.com/kumudchauhan/fake-news-analysis-and-classification), we'll try to predict if an article is fake news based on various features.
NOTE: These data are from real articles posted online.

```{r eval = TRUE}
# Load packages and data
library(tidyverse)
fake_news <- read_csv("https://ajohns24.github.io/data/fake_news.csv") %>% 
  mutate(title_has_excl = title_excl > 0,
         is_fake = (type == "fake")) %>% 
  select(is_fake, title_has_excl, negative)
```


\
\
\
\


### Example 1: Intuition

We'll explore the relationship of whether or not an article `is_fake` (TRUE or FALSE) with `title_has_excl` (whether the title has an exclamation point) and `negative` (the percent of words that have negative sentiment):

```{r eval = TRUE}
head(fake_news)
```

The model formula for this relationship is:

log(odds the article is fake) = $\beta_0$ + $\beta_1$ `negative` + $\beta_2$ `title_has_exclTRUE`

Intuitively:

- Will $\beta_1$ be positive or negative?

- Will $\beta_2$ be positive or negative?


\
\
\
\


### Example 2: Visualize the relationships of interest

#### Part a

Briefly summarize what you observe.

```{r eval = TRUE, echo = FALSE}
# Plot relationship of is_fake with title_has_excl
# Remember: x = X, fill = Y
fake_news %>% 
  ggplot(aes(x = title_has_excl, fill = is_fake)) + 
  geom_bar(position = "fill")
```

```{r eval = TRUE, echo = FALSE}
# Plot relationship of is_fake with negative sentiment
fake_news %>% 
  mutate(negative_brackets = cut(negative, 15)) %>% 
  group_by(negative_brackets) %>% 
  summarize(fake_rate = mean(is_fake)) %>% 
  ggplot(aes(x = negative_brackets, y = fake_rate)) + 
  geom_point() + 
  theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
```

#### Part b

IMPORTANT: Why is the above plot of `fake_news` vs `negative` sentiment so much better than those below? (Why are the plots below wrong?)

```{r eval = TRUE, echo = FALSE}
# BAD plot of is_fake vs negative sentiment
fake_news %>% 
  ggplot(aes(x = negative, color = is_fake)) + 
  geom_density()

# BAD plot of is_fake vs negative sentiment
fake_news %>% 
  ggplot(aes(x = negative, y = is_fake)) + 
  geom_boxplot()
```


\
\
\
\


### Example 3: Build & interpret the logistic regression model

The estimated model is

log(odds the article is fake | `title_has_exclTRUE`, `negative`) = -1.8791 + 2.5265 `title_has_exclTRUE` + 0.3829 `negative`

```{r eval = TRUE, echo = FALSE}
# Model the relationship of is_fake with negative sentiment & title_has_excl
news_model <- glm(is_fake ~ title_has_excl + negative, fake_news, family = "binomial")
```

```{r eval = TRUE, echo = FALSE}
# Plot the model on the probability scale
fake_news %>% 
  ggplot(aes(y = as.numeric(is_fake), x = negative, color = title_has_excl)) + 
  geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial")) +
  labs(y = "probability of being fake") + 
  lims(y = c(0,1))
```

```{r eval = TRUE}
# Original coefficients
coef(news_model)

# Exponentiated coefficients
exp(coef(news_model))
```


Interpret the coefficients:


`title_has_exclTRUE` coefficient:


`negative` coefficient:


\
\
\
\


### Example 4: How good is the model at distinguishing between real and fake news?

We can use our model to calculate the probability that an article is fake based on its negative sentiment score and whether or not its title includes an exclamation mark.
Below are the probability calculations for the articles in our sample, split up by those that were & weren't actually fake.

Discuss thresholds that could be used to predict a fake new article.

```{r eval = TRUE, echo = FALSE}
fake_news %>% 
  mutate(.fitted = predict(news_model, newdata = ., type = "response")) %>% 
  ggplot(aes(x = is_fake, y = .fitted)) +
  geom_boxplot()
```


\
\
\
\


## Exercises {-}

In these exercises you'll practice some more logistic regression concepts *and* review interaction terms.

\
\


### Exercise 1: Evaluating the model

Let's explore using a probability threshold of 0.5 to make a binary prediction of an article's status:

- If the article's probability of being fake is greater than or equal to 50%, we'll predict that it's fake.
- Otherwise, we'll predict that it's real.

We can visualize this threshold on our predicted probability boxplots:

```{r}
fake_news %>% 
  mutate(.fitted = predict(news_model, newdata = ., type = "response")) %>% 
  ggplot(aes(x = is_fake, y = .fitted)) +
  geom_boxplot() + 
  geom_hline(yintercept = 0.5, color = "red")
```


#### Part a 

Using the boxplots alone...

- approximately, in what range is the sensitivity (true positive rate): 0-25%, 25-50%, 50-75%, or 75-100%
- approximately, in what range is the specificity (true negative rate): 0-25%, 25-50%, 50-75%, or 75-100%

#### Part b 

The corresponding confusion matrix is below:

```{r}
# Rows = actual / observed to be fake (FALSE or TRUE)
# Columns = predicted to be fake (FALSE or TRUE)
library(janitor)
threshold <- 0.5
fake_news %>% 
  mutate(.fitted = predict(news_model, newdata = ., type = "response")) %>% 
  mutate(predictFake = .fitted >= threshold) %>% 
  tabyl(is_fake, predictFake) %>%
  adorn_totals(c("row", "col"))
```

```{r}
# Calculate the overall accuracy

```


```{r}
# Calculate the sensitivity (true positive rate)
# Was your answer in Part a correct?

```


```{r}
# Calculate the specificity (true negative rate)
# Was your answer in Part b correct?

```

#### Part c 

In general, sensitivity & specificity are inversely related -- increasing one leads to a decrease in the other.
In this context, would you rather have high sensitivity or high specificity?
Why?


\
\
\
\


### Exercise 2: Calculations

Suppose you come across a news article that has an exclamation point in the title, and 4% of its words have a negative sentiment.
Calculate the *probability* that this article is fake.
You should make this calculation by hand, but can check it using `predict()`:

```{r}
# Calculate the prediction by hand

```

```{r}
# Check your work
predict(news_model, newdata = data.frame(title_has_excl = TRUE, negative = 4), type = "response")
```


\
\
\
\

Choose your own adventure:

- Finish PS 5 (due tomorrow)
- If you already finished PS 5, start the project milestone 1.
- Come back to the extra practice to review Interaction terms

\
\
\
\

## Extra practice {-}


### Exercise 3: Interaction

Let's revisit the Adelie and Gentoo penguins as well as the concept of interaction:

```{r}
data(penguins)
penguins_sub <- penguins %>% 
  filter(species %in% c("Adelie", "Gentoo"))

penguins_sub %>% 
  ggplot(aes(y = body_mass, x = flipper_len, color = species)) + 
  geom_smooth(method = "lm", se = FALSE)
```

#### Part a

In their relationship with `body_mass`, `flipper_len` and `species` *interact*. What does that *mean*?


#### Part b

The model formula can be expressed as:   

E[`body_mass` | `flipper_len`, `species`] = $\beta_0$ + $\beta_1$ `flipper_len` + $\beta_2$ `speciesGentoo` + $\beta_3$ `flipper_len * speciesGentoo`
    
Indicate whether the coefficients below are positive or negative and what your reasoning is:

- $\beta_1$

- $\beta_2$

- $\beta_3$


\
\
\
\


### Exercise 4: More interaction

```{r}
summary(lm(body_mass ~ flipper_len*species, penguins_sub))
```

#### Part a

Write out *separate sub-model formulas* of `body_mass` vs `flipper_len` for Adelies and Gentoos.

#### Part b

Interpret the `flipper_len` and `flipper_len:speciesGentoo` coefficients.

#### Part c

Predict the body mass of a Gentoo penguin with 220mm long flippers.


\
\
\
\


## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!