7.4 Model Selection
For this class, we will use hypothesis testing primarily as a way to help us decide which variables should be included in a model. This is referred to as model selection or variable selection. We have many tools that can help us make these decisions. We’ll first discuss two types of hypothesis tests that are useful for model selection, including one that is automatically completed when you fit a model. Finally, we summarize all of the model selection tools we’ve covered in this class and point to other tools that you may learn in future courses.
7.4.1 Testing Individual Slopes
A major emphasis of our course is regression models. It turns out that many scientific questions of interest can be framed using regression models.
In the tidy() output, R performs the following hypothesis tests by default (for every slope coefficient \(\beta_j\)):
\[H_0: \beta_j = 0 \qquad \text{vs} \qquad H_A: \beta_j \neq 0\] The test statistic is equal to the slope estimate divided by the standard error.
\[t_{obs} = \text{Test statistic} = \frac{\text{estimate} - \text{null value}}{\text{std. error of estimate}} = \frac{\text{estimate} - 0}{\text{std. error of estimate}} \]
Linear Regression Example: Below we fit a linear regression model of house price on living area:
homes <- read.delim("http://sites.williams.edu/rdeveaux/files/2014/09/Saratoga.txt")
mod_homes <- homes %>%
with(lm(Price ~ Living.Area))
mod_homes %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13439. 4992. 2.69 7.17e- 3
## 2 Living.Area 113. 2.68 42.2 9.49e-268
The statistic column is the test statistic, equal to estimate/std.error, and the p.value column is the p-value.
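We can verify the test statistic and p-value by hand with the formula above. A minimal sketch, using the (rounded) values read off the tidy() table and the residual degrees of freedom \(n - 2 = 1726\) from this data set:

```r
estimate <- 113    # slope estimate for Living.Area, rounded as in the table
std_error <- 2.68  # its standard error
t_obs <- (estimate - 0) / std_error
t_obs                           # roughly 42.2, matching the statistic column
2 * pt(-abs(t_obs), df = 1726)  # two-sided p-value from the t distribution; essentially 0
```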
What are \(H_0\) and \(H_A\)?
We write the model \(E[\text{Price}|\text{Living Area} ] = \beta_0 + \beta_1\text{Living Area}\).
\(H_0: \beta_1 = 0\) (There is no relationship between price and living area.)
\(H_A: \beta_1 \neq 0\) (There is a relationship between price and living area.)
What is the test statistic?
We see that the estimated slope for Living Area is 42 standard errors away from the null value of 0, which makes us start to doubt the null hypothesis of no relationship.
For a threshold \(\alpha = 0.05\), what is the decision regarding \(H_0\)?
Note that when you see e in R output, this means “10 to the power”. So 9.486240e-268 means \(9.49 \times 10^{-268}\), which is practically 0. This p-value is far less than our threshold \(\alpha = 0.05\), so we reject \(H_0\) and say that we have strong evidence to support a relationship between price and living area.
Is this consistent with conclusions from a confidence interval? The output of confint(mod_homes) is below.
## 2.5 % 97.5 %
## (Intercept) 3647.6958 23231.0922
## Living.Area 107.8616 118.3835
The interval does not include 0, so we conclude that 0 is not a plausible value for the average increase in price for every 1 square foot increase in living area, based on the observed sample data.
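This interval can also be reconstructed by hand from the estimate, its standard error, and the appropriate t quantile; a sketch using the rounded values from the tidy() table (small differences from the printed interval come from rounding the estimate):

```r
estimate <- 113                # slope estimate for Living.Area
std_error <- 2.68              # its standard error
t_star <- qt(0.975, df = 1726) # multiplier for a 95% interval
estimate - t_star * std_error  # lower bound
estimate + t_star * std_error  # upper bound
```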
Logistic Regression Example: Below we fit a logistic regression model of whether a movie made a profit (response) on whether it is a history film:
movies <- read.csv("https://www.dropbox.com/s/73ad25v1epe0vpd/tmdb_movies.csv?dl=1")
mod_movies <- movies %>%
with(glm( profit ~ History, family = "binomial"))
mod_movies %>%
tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.152 0.0296 5.15 0.000000258
## 2 HistoryTRUE 0.0207 0.146 0.142 0.887
The statistic column is the test statistic, equal to estimate/std.error, and the p.value column is the p-value. Note that a test of the slope also lets us make conclusions about the odds ratio: because \(e^0 = 1\), a slope of 0 corresponds to an odds ratio of 1.
Try yourself!
- What are \(H_0\) and \(H_A\)?
- What is the test statistic?
- For a threshold \(\alpha = 0.05\), what is the decision regarding \(H_0\)?
- Is this consistent with the confidence interval (from confint(mod_movies)) below?
## 2.5 % 97.5 %
## (Intercept) 0.09438169 0.2102419
## HistoryTRUE -0.26484980 0.3085225
ANSWER:
What are \(H_0\) and \(H_A\)?
We write the model \(\log(Odds(\text{Profit}|\text{History})) = \beta_0 + \beta_1\text{HistoryTRUE}\).
\(H_0: \beta_1 = 0\) (There is no relationship between odds of profit and whether a movie is a history film.)
\(H_A: \beta_1 \neq 0\) (There is a relationship between odds of profit and whether a movie is a history film.)
What is the test statistic?
We see that the estimated slope for HistoryTRUE is 0.14 standard errors away from 0. This is not that far, so we don’t have much evidence against the null hypothesis that history films are not more or less profitable than other films.
For a threshold \(\alpha = 0.05\), what is the decision regarding \(H_0\)?
The p-value is 0.88, which is quite large and greater than our threshold \(\alpha = 0.05\), so we fail to reject \(H_0\); an observed difference in estimated log odds of this size is quite likely even if history films were not more or less likely to make a profit.
Is this consistent with the confidence interval?
Yes, the 95% confidence interval includes (in fact, straddles) 0, which suggests we are quite uncertain as to the true relationship between categorizing a film as a history film and the likelihood of it being profitable.
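Because the model is on the log odds scale, exponentiating the slope and its interval endpoints translates the same conclusion to the odds ratio scale; a sketch using the values printed above:

```r
exp(0.0207)                    # estimated odds ratio for history films, about 1.02
exp(c(-0.26484980, 0.3085225)) # 95% CI for the odds ratio, about 0.77 to 1.36
```

The interval on the odds ratio scale straddles 1 (the null value, since \(e^0 = 1\)), matching our failure to reject \(H_0\).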
7.4.1.1 Suggested Actions
You might be wondering how to use the information gained by testing individual slopes to see if they are far from zero, taking into account the uncertainty in our estimates.
If the value of a slope coefficient in the population is truly 0, after accounting for the other variables in the model, then it plays no role in the model. If we get a small test statistic and thus a large p-value, it suggests that we try to remove that slope and corresponding variable from the model and compare models because we do not have enough evidence to say that the slope is far from 0, after accounting for the other variables in the model.
Notice, we say try to remove the variable. We do not say eliminate the variable from all models going forward because there are many reasons why a variable might have a slope of 0.
- A variable may not have any bearing or relationship with the outcome and thus the slope is practically 0. In these circumstances, you generally should remove it from the model because we want a simpler model with fewer variables.
- A variable may have a weak relationship with the outcome, even after accounting for the other variables in the model, but if the sample size is small, there is so much uncertainty that we can’t ascertain that it is different from 0. You’ll need to decide whether or not it is important to include based on the context and causal inference arguments.
- If two explanatory variables are highly correlated but together can explain a lot of the variability in the outcome, they may both have slopes near 0 because, after accounting for the other variable in the model, each one can’t add much. Try removing one of the variables and see how the slope estimates change.
- Two category levels in a categorical variable may not be that different (the slope, which measures the difference between them, may be 0). You can try to combine categories that are similar, if it makes sense in the data context, or acknowledge the information gained about how those groups are not different, after accounting for the other variables in the model.
7.4.2 Nested Tests
If you’ve removed one or more variables from a model, the result is a smaller nested model. We can directly compare the larger model and the smaller nested model (from removing 1 or more variables from the larger model) using a hypothesis test.
For this test,
- Null Hypothesis (\(H_0\)): Smaller nested model is correct
- Alternative Hypothesis (\(H_A\)): Smaller nested model is not correct and at least one of the removed variables has a non-zero slope in the population.
In notation, we can write these hypotheses in terms of the larger model,
\[H_0: \beta_j = 0 \text{ for the variables removed} \qquad \text{vs} \qquad H_A: \beta_j \neq 0 \text{ for the variables removed}\] The test statistic for this nested test compares the smaller model to the larger full model; it is a measure of how much better the full model is than the smaller model. For a linear regression model, this test is called a nested F test, and the test statistic is a ratio comparing the sums of squared residuals (F should be close to 1 if \(H_0\) is true). For logistic regression, this nested test is called a likelihood ratio test, and the test statistic compares the likelihoods or “goodness” of the two models (it follows a chi-squared distribution if \(H_0\) is true). The larger the test statistic is, the bigger the difference between the smaller model (\(H_0\)) and the larger model, suggesting that the larger model might be better.
We can compare these models in R using the anova() function.
Linear Regression Example: Below we fit a larger linear regression model of house price using living area, the number of bedrooms, and the number of bathrooms:
mod_homes_full <- homes %>%
with(lm(Price ~ Living.Area + Bedrooms + Bathrooms))
anova(mod_homes,mod_homes_full)
## Analysis of Variance Table
##
## Model 1: Price ~ Living.Area
## Model 2: Price ~ Living.Area + Bedrooms + Bathrooms
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1726 8.2424e+12
## 2 1724 7.8671e+12 2 3.7534e+11 41.126 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
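The F statistic in this table can be reconstructed from the residual sums of squares reported for the two models; a sketch using the values printed above:

```r
rss_small <- 8.2424e12  # RSS, model with Living.Area only
rss_full  <- 7.8671e12  # RSS, model with Living.Area + Bedrooms + Bathrooms
df_diff   <- 2          # number of slopes removed
df_full   <- 1724       # residual df of the larger model
f_obs <- ((rss_small - rss_full) / df_diff) / (rss_full / df_full)
f_obs                                           # about 41.1, matching the table
pf(f_obs, df_diff, df_full, lower.tail = FALSE) # p-value, essentially 0
```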
Notice that we get the smaller nested model by removing Bedrooms and Bathrooms from the larger model. This means that the models are nested and can be compared using the nested test.
What are \(H_0\) and \(H_A\)? \(H_0\): The smaller model with just Living.Area is correct; the number of Bathrooms and Bedrooms do not have an impact on average price after accounting for living area. \(H_A\): The smaller model with just Living.Area is not correct; the number of Bathrooms or Bedrooms do have an impact on average price after accounting for living area.
What is the test statistic? The F = 41.126 is the test statistic. Unlike the test for slopes, this test statistic does not have a nice interpretation. It seems quite far from 1 (which is what F should be close to if \(H_0\) were true), but there are no rules for “how far is far”.
For a threshold \(\alpha = 0.05\), what is the decision regarding \(H_0\)? The p-value of \(< 2.2*10^{-16}\) is well below the threshold suggesting that it is very unlikely to have observed a difference between these two models as big as we did if the smaller model were true. Thus, we reject \(H_0\) that the model with only living area is correct in favor of the larger model.
Logistic Regression Example: Below we fit a larger logistic regression model of whether a movie made a profit (response) based on whether it is a history, drama, comedy, or a family film:
mod_movies_full <- movies %>%
with(glm( profit ~ History + Drama + Comedy + Family, family = "binomial"))
anova(mod_movies,mod_movies_full,test='LRT') #test='LRT' is needed for logistic models
## Analysis of Deviance Table
##
## Model 1: profit ~ History
## Model 2: profit ~ History + Drama + Comedy + Family
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 4801 6630.3
## 2 4798 6555.3 3 74.941 3.729e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
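Similarly, the likelihood ratio test statistic is just the drop in residual deviance between the two models, compared to a chi-squared distribution with degrees of freedom equal to the number of removed slopes; a sketch using the values printed above:

```r
dev_small <- 6630.3  # residual deviance, model with History only
dev_full  <- 6555.3  # residual deviance, larger model
chi_obs <- dev_small - dev_full             # about 75, matching the Deviance column
pchisq(chi_obs, df = 3, lower.tail = FALSE) # p-value, essentially 0
```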
Notice that we get the smaller nested model by removing Drama, Comedy, and Family from the larger model. This means that the models are nested and can be compared using the nested test.
What are \(H_0\) and \(H_A\)? \(H_0\): The smaller model with just History is correct; indicators of Drama, Comedy, and Family do not have an impact on whether a movie makes a profit. \(H_A\): The smaller model with just History is not correct; indicators of Drama, Comedy, or Family do have an impact on whether a movie makes a profit.
What is the test statistic? The Deviance = 74.941 is the test statistic. Unlike the test for slopes, this test statistic does not have a nice interpretation. It seems quite large (if \(H_0\) were true, we’d expect it to be near the number of removed slopes, here 3), but there are no rules for “how far is far”.
For a threshold \(\alpha = 0.05\), what is the decision regarding \(H_0\)? The p-value of \(3.729 * 10^{-16}\) is well below the threshold suggesting that it is very unlikely to have observed a difference between these two models as big as we did if the smaller model were true. Thus, we reject \(H_0\) that the model with only History indicator is correct in favor of the larger model.
7.4.2.1 Suggested Actions
You might be wondering how to use the information gained by comparing a smaller nested model to a larger model.
- If the p-value is large, then you don’t have evidence to support the larger model.
- If the p-value is small, then you need to look back at the variables you removed; at least one of them likely belongs in the model.
This tool is particularly helpful if you have a categorical variable with three or more categories as you can test all of the slopes for that one variable at once.
This tool can also be helpful to stop you from removing too many variables at once.
In practice, you can iterate between testing individual slopes to determine which variables to consider removing and then comparing nested models. Also, you’ll incorporate other model selection tools in your process, not just these two types of hypothesis tests.
7.4.3 Summary of Model Selection Methods
In general, we want a simple model that works well. We want to follow Occam’s Razor, a philosophy that suggests that it is a good principle to explain the phenomena by the simplest hypothesis possible. In our case, that means using the fewest variables possible in our models.
7.4.3.1 Linear Regression Tools
- Exploratory visualizations give us an indication for which variables have the strongest relationship with our response of interest (also if it is not linear)
- \(R^2\) can tell us the percent of variation in our response that is explained by the model – We want HIGH \(R^2\)
- The standard deviation of the residuals, \(s_e\) (residual standard error), tells us the average magnitude of our residuals (prediction errors for our data set) – We want LOW \(s_e\)
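Both of these summaries can be read off one row of glance() output (from the broom package); a sketch using the house price model fit earlier in the section:

```r
library(broom)
glance(mod_homes)$r.squared  # R^2
glance(mod_homes)$sigma      # residual standard error, s_e
```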
With the addition of statistical inference in our tool set, we now have many other ways to help guide our decision making process.
- Confidence intervals or tests for individual coefficients or slopes (\(H_0: \beta_k = 0\), population slope for kth variable is 0 meaning no relationship)
  - See tidy(lm1)
- Nested tests for a subset of coefficients or slopes (\(H_0: \beta_k = 0\) for all of the removed variables)
  - See anova(lm1, lm2) for comparing two linear regression models (one larger model and one model with some variables removed)
There are also model selection criteria that penalize a model for having too many variables. Here are some below.
- Choose a model with a higher adjusted \(R^2\), calculated as,
\[R^2_{adj} = 1 - \frac{SSE/(n-k-1)}{SSTO/(n-1)}\]
where \(k\) is the number of explanatory variables (estimated slopes) in the model.
  - Find the adjusted R squared in the output of glance(lm1)
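The adjusted \(R^2\) formula above can also be computed by hand from the residuals; a sketch using the larger house price model from earlier, with \(k = 3\) explanatory variables:

```r
n <- nrow(homes)
k <- 3                                           # Living.Area, Bedrooms, Bathrooms
SSE <- sum(residuals(mod_homes_full)^2)          # sum of squared residuals
SSTO <- sum((homes$Price - mean(homes$Price))^2) # total sum of squares
1 - (SSE / (n - k - 1)) / (SSTO / (n - 1))       # matches glance(mod_homes_full)$adj.r.squared
```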
7.4.3.2 Logistic Regression Tools
- Exploratory visualizations give us an indication for which variables have the strongest relationship with our response of interest (also if it is not linear)
- We want HIGH accuracy and LOW false positive and false negative rates that can separate predicted probabilities fairly well between the true outcome groups.
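Accuracy can be estimated from predicted probabilities once we choose a threshold; a sketch using the movies model fit earlier (the 0.5 threshold is an assumption, and profit is assumed to be TRUE/FALSE as suggested by the output above):

```r
pred_prob <- predict(mod_movies, type = "response") # predicted probabilities of profit
pred_profit <- pred_prob > 0.5                      # classify using a 0.5 threshold
table(pred_profit, movies$profit)                   # confusion matrix
mean(pred_profit == movies$profit)                  # accuracy
```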
With the addition of statistical inference in our tool set, we now have many other ways to help guide our decision making process.
- Confidence intervals or tests for individual coefficients or slopes (\(H_0: \beta_k = 0\), population slope for kth variable is 0 meaning no relationship)
  - See tidy(glm1)
- Nested tests for a subset of coefficients or slopes (\(H_0: \beta_k = 0\) for all of the removed variables)
  - See anova(glm1, glm2, test = 'LRT') for comparing two logistic regression models (one larger model and one model with some variables removed)
There are also model selection criteria that penalize a model for having too many variables. Here are some below that you’ll see in future classes.
- Choose a model with a LOWER Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) calculated as
\[AIC = -2\log(L) + 2(k+1)\]
\[BIC= -2\log(L) + (k+1)\log(n)\]
  - Calculated using AIC(glm1) and BIC(glm1)
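The formulas above can be checked against R’s built-in functions via the model’s log likelihood; a sketch using the larger movies model (\(k = 4\) slopes plus an intercept, and assuming no missing values so that nrow(movies) matches the number of observations used in the fit):

```r
logL <- as.numeric(logLik(mod_movies_full)) # log likelihood of the fitted model
k <- 4                                      # History, Drama, Comedy, Family
n <- nrow(movies)
-2 * logL + 2 * (k + 1)      # matches AIC(mod_movies_full)
-2 * logL + (k + 1) * log(n) # matches BIC(mod_movies_full)
```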
7.4.3.3 Prediction Tools
Here is a taste of Statistical Machine Learning:
If our goal is prediction, then we will want to choose a model that has the lowest prediction error. BUT, if we fit our model to our data and then calculate the prediction error from that SAME data, we aren’t getting an accurate estimate of the prediction error because we are cheating. We aren’t doing predictions for new data values.
Training and Testing
In order to be able to predict for new data, we can randomly split our observed data into two groups, a training set and a testing set (also known as a validation or hold-out set).
- Fit the model to the training set and do prediction on the observations in the testing set.
- The prediction Mean Squared Error, \(\frac{1}{n}\sum(y_i - \hat{y}_i)^2\), can be calculated based on the predictions from the testing set.
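The split-fit-predict steps above can be sketched in a few lines, using the homes data from earlier in the section (the 80/20 split proportion and the seed are assumptions):

```r
set.seed(155)                                    # for a reproducible split
n <- nrow(homes)
train_rows <- sample(1:n, size = round(0.8 * n)) # random 80% of rows for training
train <- homes[train_rows, ]
test  <- homes[-train_rows, ]
mod_train <- lm(Price ~ Living.Area, data = train) # fit on the training set only
preds <- predict(mod_train, newdata = test)        # predict on the testing set
mean((test$Price - preds)^2)                       # prediction MSE on the testing set
```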
Drawbacks of Validation Testing
- The MSE can be highly variable as it depends on how you randomly split the data.
- We aren’t fully utilizing all of our data to fit the model; therefore, we will tend to overestimate the prediction error.
K-Fold Cross-Validation
If we have a small data set and we want to fully use all of the data in our training, we can do K-Fold Cross-Validation. The steps are as follows:
1. Randomly split the set of observations into \(k\) groups, or folds, of about equal size.
2. Treat the first group as the test set: fit the method or model on the remaining \(k-1\) groups, then calculate the MSE on the observations in the test set.
3. Repeat \(k\) times; each group is treated as the test set once.
The k-fold CV estimate of MSE is an average of these values, \[CV_{(k)} = \frac{1}{k}\sum^k_{i=1}MSE_i \] where \(MSE_i\) is the MSE based on the \(i\)th group as the test group. In practice, one performs k-fold CV with \(k=5\) or \(k=10\) because it reduces the computational time and also tends to be more accurate.
require(boot) # provides cv.glm()
cv.err <- cv.glm(data, glm1, K = 5) # data = your data set, glm1 = a model fit with glm()
sqrt(cv.err$delta[1]) # out of sample average prediction error
Chapter 7 Major Takeaways