---
title: "Nested Models & F-Tests (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')

# Use a color blind friendly color palette throughout doc
library(tidyverse)
cb_palette <- c("#0072B2", "#D55E00", "black", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#CC79A7")
scale_colour_discrete <- function(...) scale_colour_manual(values = cb_palette, ...)
scale_fill_discrete   <- function(...) scale_fill_manual(values = cb_palette, ...)
theme_set(theme_bw())
```


## Warm-up {-}



### Example 1: Nested Models

a. Which of the following models are nested in the model $E[A \mid B, C, D] = \beta_0 + \beta_1 D + \beta_2 B + \beta_3 C + \beta_4 B * C$?

- Model 1: $E[A \mid B] = \beta_0 + \beta_1 B$
- Model 2: $E[A \mid B, D] = \beta_0 + \beta_1 B + \beta_2 D$
- Model 3: $E[B \mid C] = \beta_0 + \beta_1 C$
- Model 4: $E[A \mid B, C, D] = \beta_0 + \beta_1 B + \beta_2 C + \beta_3 D$
- Model 5: $E[A \mid B, C, D] = \beta_0 + \beta_1 C + \beta_2 B + \beta_3 D + \beta_4 B * D$
- Model 6: $E[A \mid D] = \beta_0 + \beta_1 D$

\
\

b. Consider the following models involving variables A, B, C, and D:

- Model 1: $E[A \mid B] = \beta_0 + \beta_1 B$
- Model 2: $E[A \mid B, C] = \beta_0 + \beta_1 B + \beta_2 C$
- Model 3: $E[A \mid B, C] = \beta_0 + \beta_1 B + \beta_2 C + \beta_3 BC$
- Model 4: $E[A \mid C, D] = \beta_0 + \beta_1 C + \beta_2 D$
- Model 5: $E[B \mid A] = \beta_0 + \beta_1 A$
- Model 6: $E[B \mid A, C] = \beta_0 + \beta_1 A + \beta_2 C + \beta_3 AC$

Determine for each of the following statements whether that statement is True or False.

- Model 1 is nested in Model 2
- Model 1 is nested in Model 3
- Model 1 is nested in Model 4
- Model 2 is nested in Model 3
- Model 3 is nested in Model 2
- Model 2 is nested in Model 6

c. What is one (numeric) way to compare the quality of nested models? Explain how you would determine which model is "better" based on this metric.



### Example 2: t-tests & "overall" F-tests

To explore the relationship of casual bikeshare ridership with the day of week (weekend or weekday), feels-like temperature (°F), and actual temperature (°F), consider the following population model:

E[`rides` | `weekend`, `temp_feel`, `temp_actual`] = $\beta_0$ + $\beta_1$ `weekendTRUE` + $\beta_2$ `temp_feel` + $\beta_3$ `temp_actual`

We estimate this below using `bike_model_1`:

```{r eval=TRUE}
bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv") %>% 
  rename(rides = riders_casual)

bike_model_1 <- lm(rides ~ weekend + temp_feel + temp_actual, bikes)
summary(bike_model_1)
```

a. The "overall" **F-test** is reported in the bottom line of the `summary()`. What do you *conclude* from this test? 
NOTE: The F-test test statistic is *not* calculated in the same way as the t-test test statistic (it does not give the number of SE's away from 0).

b. What do you *conclude* from the **t-tests** in the `temp_feel` and `temp_actual` rows of the `summary()` table?

c. Putting these together, what do you think? Which variables might be "good" predictors of ridership?
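The note in part (a) says the F statistic is not computed like a t statistic. The chunk below is a minimal sketch of that difference, not part of the bikeshare analysis: it uses R's built-in `mtcars` data, and the names `demo_model`, `demo_summary`, `coefs`, `t_by_hand`, `f_info`, and `f_p_value` are made up for illustration. It shows where `summary()` stores each statistic and how the overall F-test p-value comes from the F distribution.

```{r}
# Illustration only: built-in mtcars data, NOT the bikeshare data
demo_model <- lm(mpg ~ wt + hp, data = mtcars)
demo_summary <- summary(demo_model)

# A t statistic is one estimate divided by its standard error
coefs <- coef(demo_summary)
t_by_hand <- coefs["wt", "Estimate"] / coefs["wt", "Std. Error"]
all.equal(t_by_hand, coefs["wt", "t value"])

# The overall F statistic is stored with its numerator & denominator df
f_info <- demo_summary$fstatistic
f_info

# Its p-value is an upper-tail area of the F distribution
f_p_value <- pf(f_info[["value"]], f_info[["numdf"]], f_info[["dendf"]],
                lower.tail = FALSE)
f_p_value
```

The t statistic concerns a single coefficient, whereas the F statistic summarizes all non-intercept coefficients at once, which is why it comes with two degrees-of-freedom values.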



### Example 3: F-tests for nested models

Let's take the temperature variables out of the model:

```{r eval=TRUE}
# Refit bike_model_1 (just for ease of comparison)
bike_model_1 <- lm(rides ~ weekend + temp_feel + temp_actual, bikes)

# Fit a new model
bike_model_2 <- lm(rides ~ weekend, bikes)
```

a. Note that `bike_model_2` is *nested* in `bike_model_1`. Thus we can compare them with the following **F-test** for nested models. State the hypotheses, p-value, and your conclusion.


```{r eval=TRUE}
# Put the smaller model first!!!
anova(bike_model_2, bike_model_1)
```

b. A *series* of models and tests can provide more insight than one model or test alone! What did we learn about the relationship of `rides` with `temp_feel` and `temp_actual` from the above examples *combined*? Why do you think this happened? What would you do next?
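To see what `anova()` does with two nested models, the sketch below reproduces a nested F-test by hand. It is an illustration under stated assumptions, not part of the bikeshare analysis: it uses the built-in `mtcars` data, and `small_model`, `large_model`, and the other object names are hypothetical. The F statistic compares the drop in residual sum of squares (per extra coefficient) to the residual variance of the larger model.

```{r}
# Illustration only: built-in mtcars data, NOT the bikeshare data
small_model <- lm(mpg ~ wt, data = mtcars)
large_model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Residual sums of squares and residual degrees of freedom
rss_small <- sum(resid(small_model)^2)
rss_large <- sum(resid(large_model)^2)
df_small <- df.residual(small_model)
df_large <- df.residual(large_model)

# F = (drop in RSS per added coefficient) / (residual variance of larger model)
f_by_hand <- ((rss_small - rss_large) / (df_small - df_large)) /
  (rss_large / df_large)
p_by_hand <- pf(f_by_hand, df_small - df_large, df_large, lower.tail = FALSE)

# Compare with anova() (smaller model first)
nested_test <- anova(small_model, large_model)
c(by_hand = f_by_hand, anova = nested_test$F[2])
c(by_hand = p_by_hand, anova = nested_test$`Pr(>F)`[2])
```

The two rows of output should agree, since `anova()` carries out the same calculation for nested linear models.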



## Exercises {-}

**DIRECTIONS**

- Throughout this activity, test hypotheses at the 0.05 significance level.
- Make all conclusions and interpretations in context.


**CONTEXT**

The `MacGrades.csv` dataset contains a sub-sample (to help preserve anonymity) of every grade assigned to a former Macalester graduating class. For each of the 6414 rows of data, the following information is provided (with a few missing values):

- `sessionID`: A section ID number
- `sid`: A student ID number
- `grade`: The grade obtained, as a numerical value (i.e. an A is a 4, an A- is a 3.67, etc.)
- `dept`: A department identifier (these have been made ambiguous to maintain anonymity)
- `level`: The course level (e.g. 100-, 200-, 300-, and 600-)
- `sem`: A semester identifier
- `enroll`: The section enrollment
- `iid`: An instructor identifier (these have been made ambiguous to maintain anonymity)

```{r}
# Load packages & data
library(tidyverse)
MacGrades <- read_csv("https://mac-stat.github.io/data/MacGrades.csv") %>%
  mutate(level = factor(level)) # make level a factor variable
head(MacGrades)
```

### Exercise 1: Explore

**NOTE:** This exercise, since it's exploratory in nature, can suck up a lot of time if you let it! For the sake of getting to the rest of the activity, please spend no more than ~5 minutes on this.

a. Hypothesize *two* relationships between the variables in the dataset (pick any two relationships you want!). Your response should be written in a paragraph form.

> **Response** Put your response here

b. Explore the relationship between course grades and other variables in the data. Make *two* visualizations, *and* describe any patterns you observe. 


### Exercise 2: F-tests for grade vs level

Suppose we are interested in the relationship of student `grade` with the course `level` (categorical).

a. Using `grade` as your outcome variable, fit a linear regression model to investigate this question. Comment on the nature of the relationship between course level and student grades (this should not be a coefficient interpretation, but instead a description of a general trend, or lack thereof).

b. State the null and alternative hypotheses associated with the research question in part a.

$$
H_0: 
$$

$$
H_a:
$$

c. What type of test do we need here: a t-test for a single model coefficient, the overall F-test, or a nested F-test?


d. What is the p-value associated with this hypothesis test? Do we have enough evidence to reject the null hypothesis, using a significance threshold of 0.05?





### Exercise 3: F-tests for grade vs enrollment

Suppose we are interested in the relationship between course enrollment and student grades. 

a. Again, use grade as your outcome variable, and fit a linear regression model to investigate this question.


b. State the null and alternative hypotheses associated with the research question in part a.

$$
H_0: 
$$

$$
H_a:
$$

c. What type of test do we need here: a t-test for a single model coefficient, the overall F-test, or a nested F-test?

d. What is the p-value associated with this hypothesis test? Do we have enough evidence to reject the null hypothesis, using a significance threshold of 0.05?





### Exercise 4: More F-tests

Suppose we are now interested in the association between course grade and enrollment for classes *of the same level*, i.e. when controlling for class level. 

a. Write a model statement in the form $E[Y | X] = ...$ that will produce a statistical model that will allow us to answer our scientific question. Replace Y and X, where appropriate, with response and predictor variables. 

$$
E[Y \mid X] = \_\_\_
$$

Which coefficient(s) in your model are relevant to your research question?


b. What are the relevant null and alternative hypotheses that address the scientific question in part (a)?


c. Fit the model you wrote in part (a), calculate a p-value, and report the results of the hypothesis test in part (b). 




### Reflection

F-tests are useful when the null hypothesis you wish to test is that *more than one* coefficient is simultaneously equal to a specific number (typically zero). What scenarios, outside of those shown in this activity, can you think of where a relevant scientific hypothesis involves more than one coefficient being simultaneously equal to zero?



## Extra Practice {-}

### Exercise 5: Repeat

Repeat Exercise 4, supposing we are instead interested in the association between course grade and course level for classes of the same enrollment.

### Exercise 6: Guess the p-value

Consider 2 models of grades, 1 of which you've used before (but maybe named something else):

```{r}
grades_model_A <- lm(grade ~ enroll + level, MacGrades)
summary(grades_model_A)

grades_model_B <- lm(grade ~ level, MacGrades)
summary(grades_model_B)
```

a. Using your notation from Exercise 4, state the hypotheses that would be tested by running `anova(grades_model_B, grades_model_A)`. **DO NOT YET RUN THIS CODE!**


b. **WITHOUT RUNNING THE anova() CODE:** What will be the p-value reported by `anova(grades_model_B, grades_model_A)`? What's your reasoning?

c. Check your intuition:

```{r}
anova(grades_model_B, grades_model_A)
```







### Exercise 7: But is it a "good" model?

In the above exercises, you should have concluded that, when controlling for the other, both course enrollments and course level are significantly associated with grades. But are they "good" predictors? Let's explore `grades_model_A` in more depth.


a. The relationships of `grade` with `enroll` and `level` are *statistically* significant, but are they *practically significant* / contextually meaningful?

b. Grades vary from student to student within and across courses.
What percentage of this variation is explained by course enrollments and level?

c. Putting this all together, what's your overall conclusion about the relationship here?



### Exercise 8: Reference categories

Our final research question pertains to whether or not there is a relationship between course grade and department. Again, use course grade as the outcome variable in your linear regression model.

a. State the null and alternative hypotheses *in colloquial language* associated with the relevant hypothesis test.

$$H_0:$$

$$H_a:$$

b. Fit a linear regression model, and conduct your hypothesis testing procedure to answer the research question posed in this Exercise. State your conclusions accordingly (you do not need to interpret any regression coefficients, just state and interpret the results of your hypothesis test!).

c. Are any of the individual department p-values significant? What do these p-values tell us, and why is this *not* contradictory to your answer in part (b)?





## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions on the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!


