R Cheatsheet
This cheatsheet is a work in progress. It will be updated and improved throughout the semester.
Stat 155 Review
Standard Linear Regression Models
To fit a linear model, start with a formula to specify the variable roles: the outcome variable goes on the left-hand side (before ~) and the predictor variables go on the right-hand side. The lm function formats variables for you by creating indicator variables for factor/categorical variables.
# predictor variables have pluses between them
formula <- y ~ x1 + x2

# specifies least squares estimation of the linear model specified by the formula using the data provided
fit_model <- lm(formula, data)

# provides an overall summary of the fit model
fit_model %>%
  summary()

# provides a data frame of the estimated coefficients, standard errors, and information for individual coefficient hypothesis tests (tidy() is from the broom package)
fit_model %>%
  tidy()

# provides many model summaries including rsquared and sigma (the estimated sd of residuals)
fit_model %>%
  glance()

# provides a data set that includes the original variables as well as the residuals and fitted values; used to plot residuals
fit_model %>%
  augment()

# predict outcome values based on the fit model for newdata
fit_model %>%
  predict(newdata)
Standard Logistic Regression Models
To fit a logistic regression model, start with a formula to specify the variable roles: the outcome variable goes on the left-hand side (before ~) and the predictor variables go on the right-hand side. The glm function formats variables for you by creating indicator variables for factor/categorical variables. Make sure that your outcome variable y is an indicator variable (0/1 or FALSE/TRUE).
# predictor variables have pluses between them
formula <- y ~ x1 + x2

# specifies maximum likelihood estimation of the generalized linear model specified by the formula using the data provided; when family = 'binomial', it fits a logistic regression model rather than a linear model
fit_model <- glm(formula, data, family = 'binomial')

# provides an overall summary of the fit model
fit_model %>%
  summary()

# provides a data frame of the estimated coefficients, standard errors, and information for individual coefficient hypothesis tests
fit_model %>%
  tidy()

# predict probability of outcome values based on the fit model for newdata
fit_model %>%
  predict(newdata, type = 'response')
# calculate specificity, false positive rate, false negative rate, and sensitivity
threshold <- 0.25 # you choose the threshold

augment(fit_model, type.predict = 'response') %>%
  mutate(predOutcome = .fitted >= threshold) %>%
  count(y, predOutcome) %>%
  mutate(correct = (y == predOutcome)) %>%
  group_by(y) %>% # condition on the true y outcome value
  mutate(prop = n/sum(n))
Data Structures and Purrr Package
Vectors and Lists
In R, there are many data structures. Read http://adv-r.had.co.nz/Data-structures.html for more information.
The two most basic types of data structures are vectors and lists.
Vectors
Atomic Vector (numeric, character, integer, logical)
- All elements of an atomic vector must have the same type (numeric/double, character, integer, logical/boolean).
x <- 1:10
typeof(x) # atomic type
class(x) # object class name
is.integer(x)
x[1:2] # access elements in a vector with []

z <- (x > 5)
typeof(z)
class(z)
is.logical(z)
z[-c(1:2)]

y <- if_else(x > 5, 'greater than 5', 'less than or equal to 5')
typeof(y)
class(y)
is.character(y)
y[-c(1,5)]

v <- x + .2
typeof(v)
class(v)
is.numeric(v)
is.double(v)
Factors are vectors with special attributes.
yfactor <- factor(y)
yfactor %>% attributes()
as.numeric(yfactor) # underlying a factor is an integer vector, indexing the levels/categories
levels(yfactor) # first level is reference level
relevel(yfactor, ref = 'less than or equal to 5') # can change reference level
Lists
- Lists can have elements of differing types. You can even have lists of lists or lists of vectors.
l1 <- list(x, y, z, v, 2) # unnamed list
l1[[1]] # access elements in a list with [[]]

l2 <- list(x = x, y = y, z = z, v = v, b = 2) # named list
l2[['x']] # access elements in a list with [[]]
l2$x # access elements in a named list with $
Data frames and tibbles are special types of named lists. Each element must be a vector, the vectors must have the same length, but they can be different types. Each of these vectors is a variable in the data set.
You could have a list of data frames/tibbles.
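For example, a data frame is itself a named list of equal-length column vectors, and a list can hold several data frames. A quick sketch using the built-in mtcars and iris data sets:

typeof(mtcars) # "list": a data frame is a named list of column vectors
length(mtcars) # number of columns (list elements)
df_list <- list(cars = mtcars, flowers = iris) # a list of two data frames
df_list$cars # access one data frame by name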
Map
When you want to apply a function to each element in a list or vector of values, you can use map
which is a function in the purrr package.
# Silly example: add 2 to every value in a
a <- 1:10
map(a, ~ .x + 2) # shorthand for the code below
# map(a, function(x) x + 2)
# Real example: fit the same model to 5 different data sets
# simulate data
x <- runif(500, 0, 10) # 500 x values randomly from uniform distribution [0,10]
df <- data.frame(x = x,
                 y = 2*x + rnorm(500), # 500 y values equal to 2*x + random value from Normal(0,1)
                 s = sample(size = 500, 1:5, replace = TRUE)) %>% # 500 random values from 1 to 5 (strata/groups)
  group_by(s) %>%
  tidyr::nest() # data frame with a variable called data that contains 5 data sets, one for each group

map(df$data, ~ lm(y ~ x, data = .x)) # shorthand for the code below
# map(df$data, function(data) lm(y ~ x, data = data))
Tidymodels Functions
library(tidymodels) # Includes the workflows package
tidymodels_prefer() # Resolves conflicts, prefers tidymodels functions
Model Specification
For more information, see the documentation for the parsnip package (part of the tidymodels package): https://www.tidymodels.org/find/parsnip/
Types of Models
For this course, we will only use the following models: linear_reg(), gen_additive_mod(), logistic_reg(), nearest_neighbor(), decision_tree(), rand_forest()
Engines (Estimation Methods)
For this course, we will primarily use the following estimation methods/engines: lm, glm, glmnet, mgcv, kknn, rpart, ranger
Usage
Regression Models
# Linear Regression Model
lm_spec <-
  linear_reg() %>%
  set_engine(engine = 'lm') %>%
  set_mode('regression')

# Lasso Regression Model
lm_lasso_spec <-
  linear_reg() %>%
  set_args(mixture = 1, penalty = lambda) %>% # we could let penalty = tune()
  set_engine(engine = 'glmnet') %>%
  set_mode('regression')

# Ridge Regression Model
lm_ridge_spec <-
  linear_reg() %>%
  set_args(mixture = 0, penalty = lambda) %>% # we could let penalty = tune()
  set_engine(engine = 'glmnet') %>%
  set_mode('regression')

# K Nearest Neighbors Regression Model
knn_spec <-
  nearest_neighbor() %>%
  set_args(neighbors = k) %>% # we could let neighbors = tune()
  set_engine(engine = 'kknn') %>%
  set_mode('regression')

# Generalized Additive Models (with smoothing splines)
gam_spec <-
  gen_additive_mod() %>%
  set_engine(engine = 'mgcv') %>%
  set_mode('regression')

# Regression Tree Model
rt_spec <-
  decision_tree() %>%
  set_args(tree_depth, min_n, cost_complexity) %>% # we could let any of these = tune()
  set_engine(engine = 'rpart') %>%
  set_mode('regression')

# Random Forest Model for Regression
rf_spec <-
  rand_forest() %>%
  set_args(mtry, min_n, trees) %>% # we could let any of these = tune()
  set_engine(engine = 'ranger') %>%
  set_mode('regression')
Classification Models
# Logistic Regression Model
logistic_spec <-
  logistic_reg() %>%
  set_engine(engine = 'glm') %>%
  set_mode('classification')

# Lasso Logistic Regression Model
logistic_lasso_spec <-
  logistic_reg() %>%
  set_args(mixture = 1, penalty = lambda) %>% # we could let penalty = tune()
  set_engine(engine = 'glmnet') %>%
  set_mode('classification')

# K Nearest Neighbors Classification Model
knn_spec <-
  nearest_neighbor() %>%
  set_args(neighbors = k) %>% # we could let neighbors = tune()
  set_engine(engine = 'kknn') %>%
  set_mode('classification')

# Classification Tree Model
ct_spec <-
  decision_tree() %>%
  set_args(tree_depth, min_n, cost_complexity) %>% # we could let any of these = tune()
  set_engine(engine = 'rpart') %>%
  set_mode('classification')

# Random Forest Model for Classification
rf_class_spec <-
  rand_forest() %>%
  set_args(mtry, min_n, trees) %>% # we could let any of these = tune()
  set_engine(engine = 'ranger') %>%
  set_mode('classification')
Preprocessing Data
For more information and many examples, see the documentation for the recipes package (part of the tidymodels package): https://recipes.tidymodels.org/
Recipes
To facilitate transparency, organization, and consistency of process, we create a recipe of the steps taken for preprocessing the data prior to modeling.
rec <- recipe(formula, data) %>%
  step_{FILLIN}() %>% # continue to add preprocessing steps; note that the order matters!
  step_{FILLIN}()
Make sure to include the steps in the same order as they are listed below; a short example recipe follows the list.
- Row Operations
- Imputation
- Factor Levels
- Univariate Quant. Transformations
- Discretization
- Dummy Variables
- Interactions
- Normalization
- Multivariate Transformations
- Down/Up Sampling (we won’t talk about this very much)
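For example, a sketch of a recipe that follows this ordering (the variable names y, x1, x2 and the data set train_data are hypothetical):

rec <- recipe(y ~ ., data = train_data) %>%
  step_filter(!is.na(y)) %>% # row operation
  step_impute_mode(all_nominal_predictors()) %>% # imputation
  step_log(x1) %>% # univariate quant. transformation
  step_other(x2, threshold = 0.1) %>% # collapse rare categories before making dummies
  step_dummy(all_nominal_predictors()) %>% # dummy variables
  step_normalize(all_numeric_predictors()) # normalization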
Row Operations
The step functions available for row operations are below.
step_filter(condition) # use rows that satisfy a condition
step_sample(size) # use a randomly selected sample of rows
step_shuffle(var) # randomly change the order of rows for selected variables (not needed for most circumstances)
step_slice(index) # use rows with the index (row numbers) provided
step_arrange(var) # sort rows according to the values of selected variables
step_corr() # removes predictors with high correlation with other predictors
step_lincomb() # removes predictors that are linear combinations of others
step_zv() # removes predictors with zero variability
step_nzv() # removes predictors with near zero variability
Imputation
The step functions available for imputing values for missing values are below.
step_impute_bag(impute_with) # imputation via bagged trees
step_impute_knn(impute_with) # imputation via K-nearest neighbors
step_impute_mode() # imputation of most common value
step_unknown() # assign missing to "unknown" category
Factor Levels
The step functions available for changing the levels for factors are below.
step_relevel(ref_level) # change reference level of factor
Univariate Quant. Transformations
The step functions available for transforming quantitative variables are below.
step_BoxCox() # box cox transformation
step_log() # log transformation
step_bs() # basis spline
step_ns() # natural spline
step_poly() # polynomial transformation
step_mutate() # mutate a variable in the way you specify
Discretization
The step functions available for transforming quantitative variables into categorical variables are below.
step_discretize(num_breaks)
step_cut(breaks)
Dummy Variables
The step functions available for transforming categorical variables into quantitative variables (dummy/indicator variables) and dealing with categorical levels are below.
step_novel() # create category for new levels in test set
step_other(threshold) # collapse small categories into an "other" category
step_date(features = c('dow','month','year')) # create features based on date variables
step_dummy() # create indicators based on nominal variables
step_regex() # create indicators using regular expressions
Interactions
The step functions available for creating interaction terms are below.
step_interact(~ A:B)
step_interact(~ starts_with("CatA"):starts_with("CatB"))
Normalization
The step functions available for centering, scaling, and standardizing/normalizing are below.
step_center() # subtracts mean
step_scale() # divides by sd
step_normalize() # both center and scale
Multivariate Transformations
The step functions available for combining/transforming many quantitative variables are below.
step_pca()
Variable Roles
In a recipe, each variable has a role. Typical roles are:
- Outcome
- Predictor
# Use summary to see the roles of the variables
recipe() %>%
summary()
# in the steps of the recipe, you can refer to variables based on their roles
all_predictors()
all_outcomes()
# in the recipe, you can refer to variables based on their type (quantitative: numeric, categorical: nominal)
all_numeric()
all_nominal()
# in the recipe, you can refer to variables based on their type and role
all_numeric_predictors()
all_nominal_predictors()
# roles can help you select groups of variables
has_role()
has_type()
Roles to consider adding if you want to keep variables in the data set but do not want to use them in the model:
- ID (to exclude from modeling)
- Locations
- Date/Times
- General Categories of Predictors
rec <- recipe() %>%
  update_role()

rec <- recipe() %>%
  add_role() %>% # variables can have multiple roles
  remove_role()
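For example, a sketch of keeping a hypothetical subject_id variable in the data while excluding it from the model:

rec <- recipe(y ~ ., data = train_data) %>%
  update_role(subject_id, new_role = 'ID') # subject_id stays in the data but is not used as a predictor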
Prep Data
If you aren’t ready to fit the model but want to see the pre-processed data, you can use these functions.
Note: The prep and bake steps are done for you when you fit models.
recipe() %>%
prep(training = train_data) %>%
juice() # returns preprocessed training data
recipe() %>%
prep(training = train_data) %>%
bake(new_data = test_data) # returns preprocessed test data using values from train data
Data Subsets
For more information and many examples, see the documentation for the rsample package (part of the tidymodels package): https://rsample.tidymodels.org/
Random Splits
To randomly split your data set into a training and test set, use initial_split(). You can change prop to 0.8 for a training set equal to 80% of the rows. You can specify a variable for the strata argument so that the random sampling is conducted within the levels of that stratification variable; this helps ensure that the distribution of the stratification variable in the training data matches that of the original data set.
set.seed(123) # to reproduce your random split, set a seed - pick an integer

data_split <- initial_split(data, prop = 3/4) # saves minimal information about the split (which rows are in the training set and which are in the testing set)

data_train <- training(data_split)
data_test <- testing(data_split)
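For example, an 80/20 split stratified on a hypothetical outcome variable y:

data_split <- initial_split(data, prop = 0.8, strata = y)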
Cross-Validation/Resamples
To complete cross-validation, use vfold_cv()
. This function saves information about the folds (which rows are in which fold) and which rows should be used for training and testing for each iteration.
data_cv <- vfold_cv(data, v = 10)

# In practice, you don't need to use the following functions but it is useful to know what is going on behind the scenes
training(data_cv$splits[[1]]) # pulls training data for the 1st split (1st fold is testing set)
testing(data_cv$splits[[1]]) # pulls testing data for the 1st split (1st fold is testing set)
Workflows
A machine learning workflow (the “black box”) = model specification + preprocessing recipe/formula
model_wf <- workflow() %>%
  add_recipe(rec) %>% # add_formula(formula) if no recipe
  add_model(mod_spec)

model_wf %>%
  update_recipe(rec_new)

model_wf %>%
  update_formula(formula_new)

model_wf %>%
  update_model(model_spec_new)
Fit Model
# Fit model to given training data
fit_model <- fit(model_spec, formula, data_train) # fit model using a formula (no recipe) and model_spec

fit_model <- fit(model_wf, data_train) # fit model using a workflow

# Fit model using CV
fit_model_cv <- fit_resamples(model_wf, # fit and evaluate model with CV; fit_resamples() is from the tune package
                              resamples = data_cv,
                              metrics = metric_set())

# Fit model on one training set and evaluate on one test set
fit_model_traintest <- last_fit(model_wf, # fit model on training and evaluate model on test data; last_fit() is from the tune package
                                split = data_split,
                                metrics = metric_set())
Evaluate Model: Metrics
For documentation and examples about metrics, see the yardstick package, https://yardstick.tidymodels.org/index.html.
metric_set(rmse, rsq, mae) # typical regression metrics
metric_set(sens, spec, accuracy, roc_auc) # typical classification metrics
For documentation and examples about collecting metrics from train/testing or cross validation, see the workflowsets package, https://workflowsets.tidymodels.org/
# Using one training and testing split
last_fit(model_wf,
         split = data_split,
         metrics = metric_set(rmse, rsq, mae)
) %>%
  collect_metrics()

# Using CV
fit_resamples(model_wf,
              resamples = data_cv,
              metrics = metric_set(rmse, rsq, mae)
) %>%
  collect_metrics(summarize = TRUE)
Tune Model: Choosing tuning parameters
For documentation and examples about tuning a model by selecting tuning parameters, see the tune package, https://tune.tidymodels.org/
Use tune() as a placeholder for a tuning parameter in a model or recipe, and use tune_grid() to fit models across an array of values for the tuning parameters.
# Regularized Regression (LASSO, Ridge)
lm_lasso_spec_tune <- lm_lasso_spec %>%
  set_args(mixture = 1, penalty = tune('lambda'))

lm_ridge_spec_tune <- lm_ridge_spec %>%
  set_args(mixture = 0, penalty = tune('lambda'))

# Number of neighbors in KNN
knn_spec_tune <- knn_spec %>%
  set_args(neighbors = tune('k'))

# Decision Trees
rt_spec_tune <- rt_spec %>%
  set_args(cost_complexity = tune('cost'), min_n = tune('n'))

ct_spec_tune <- ct_spec %>%
  set_args(cost_complexity = tune('cost'), min_n = tune('n'))

# Number of Components in PCA
recipe() %>%
  step_pca(num_comp = tune('n'))
Once you’ve included tune() as the input for the parameter in the model or recipe, you need to fit the model for a variety of values. You may provide a grid of values or you can let the function create a grid of a specified number of values.
tune_results <- tune_grid(model_wf,
                          resamples = data_cv,
                          grid = 10, # creates a grid of 10 values
                          metrics = metric_set())

tune_results <- tune_grid(model_wf,
                          resamples = data_cv,
                          grid = expand_grid(k = seq(1, 15, by = 1)), # you provide the grid; use the name given in tune() as the variable name
                          metrics = metric_set())

tune_results %>% show_best()

tune_results %>% autoplot()

final_model <- select_best(tune_results, metric) # or select_by_one_std_err(tune_results, metric, desc(penalty))

final_model_wf <- finalize_workflow(model_wf, final_model)
Use Final Model: Predictions
# Fit final model on training data and then predict for any new data set
final_model_wf %>%
  fit(data = data_train) %>%
  predict(new_data = data_test) # get .pred only

# Fit final model on training data and then predict for test set
final_model_wf %>%
  last_fit(
    split = data_split,
    metrics = metric_set(rmse, rsq, mae)
  ) %>%
  collect_predictions() # get .pred and true values in test set
Extraction
For some models (LASSO, decision trees), you’ll want to pull out the original fit object to make visualizations.
fit_model %>% extract_fit_engine() # gives you the original fit object from the engine (useful for specific purposes)

pluck(list, 'fit') # gives you the element in the list associated with the name 'fit'
Variable Importance
Model Agnostic Importance - Permutation Approach
Permutation technique:
- Train model
- Evaluate model

For each variable,
- Randomly shuffle the variable and reevaluate the model
- Compare the original model evaluation metric to the value when the variable is reshuffled

Reshuffling the values of a variable breaks or removes any predictive relationship that may exist between that variable and the outcome variable. So if a variable is highly predictive of the outcome, there should be a large change in the model evaluation metric when it is "removed" through reshuffling.
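To make the idea concrete, here is a minimal from-scratch sketch for a linear model; the data frame train_prep, the outcome y, and the use of RMSE are assumptions for illustration. In practice, use the vip package as shown below.

mod <- lm(y ~ ., data = train_prep) # train model
rmse0 <- sqrt(mean((train_prep$y - predict(mod, train_prep))^2)) # evaluate model

sapply(setdiff(names(train_prep), 'y'), function(v) {
  shuffled <- train_prep
  shuffled[[v]] <- sample(shuffled[[v]]) # reshuffle one variable, breaking its relationship with y
  sqrt(mean((shuffled$y - predict(mod, shuffled))^2)) - rmse0 # importance = increase in RMSE
})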
library(vip)
conflicted::conflict_prefer("vi", "vip")

train_prep <- rec %>% prep() %>% juice() # Pre-process the training data (rec is the recipe)

mod <- fit_model %>% extract_fit_engine() # Pull the fit model

# Calculate Variable Importance using Permutation approach
vi(mod, method = 'permute', target = 'outcomename', metric = 'metricname', train = train_prep, pred_wrapper = predict)
# use vip() instead of vi() for a plot
Model Specific Importance
## LASSO Var Importance
glmnet_output <- final_fit_se %>% extract_fit_engine()

# Create a boolean matrix (predictors x lambdas) of variable exclusion
bool_predictor_exclude <- glmnet_output$beta == 0

# Loop over each variable
var_imp <- sapply(seq_len(nrow(bool_predictor_exclude)), function(row) {
  this_coeff_path <- bool_predictor_exclude[row, ]
  if(sum(this_coeff_path) == ncol(bool_predictor_exclude)) { return(0) } else {
    return(ncol(bool_predictor_exclude) - which.min(this_coeff_path) + 1) }
})

# Create a dataset of this information and sort
var_imp_data <- tibble(
  var_name = rownames(bool_predictor_exclude),
  var_imp = var_imp
) %>% arrange(desc(var_imp))

var_imp_data
## Tree and Forest Var Importance
library(vip)
conflicted::conflict_prefer("vi", "vip")

tree <- fit_tree_model %>% extract_fit_engine() # fit_tree_model would come from fitting a decision_tree() model
vi(tree)

rf <- fit_rf_model %>% extract_fit_engine() # fit_rf_model would come from fitting a rand_forest() model
vi(rf)

# use vip() instead of vi() for a plot
Comparing Models: Workflow Sets
For documentation and examples about comparing models, see the workflowsets package, https://workflowsets.tidymodels.org/
wf_set <- workflow_set(
  preproc = list(rec1 = recipe1, rec2 = recipe2, rec3 = recipe3),
  models = list(mod1 = model_spec1, mod2 = model_spec2, mod3 = model_spec3),
  cross = TRUE) %>%
  workflow_map(
    "fit_resamples",
    resamples,
    metrics)

wf_set %>% autoplot()

wf_set %>% rank_results(rank_metric)
Visualizing Fit Models
Visualize Coefficients
# For Linear Regression
fit_model %>%
  tidy() %>%
  slice(-1) %>%
  mutate(lower = estimate - 1.96*std.error, upper = estimate + 1.96*std.error) %>%
  ggplot() +
  geom_vline(xintercept = 0, linetype = 4) +
  geom_point(aes(x = estimate, y = term)) +
  geom_segment(aes(y = term, yend = term, x = lower, xend = upper), arrow = arrow(angle = 90, ends = 'both', length = unit(0.1, 'cm'))) +
  labs(x = 'Coefficient estimate (95% CI)', y = 'Feature') +
  theme_classic()
# For Logistic Regression
fit_model %>%
  tidy() %>%
  slice(-1) %>%
  mutate(OR.conf.low = exp(estimate - 1.96*std.error), OR.conf.high = exp(estimate + 1.96*std.error)) %>% # do this first
  mutate(OR = exp(estimate)) %>%
  ggplot() +
  geom_vline(xintercept = 1, linetype = 4) +
  geom_point(aes(x = OR, y = term)) +
  geom_segment(aes(y = term, yend = term, x = OR.conf.low, xend = OR.conf.high), arrow = arrow(angle = 90, ends = 'both', length = unit(0.1, 'cm'))) +
  labs(x = 'Odds ratio (95% CI)', y = 'Feature') +
  theme_classic()
Visualize Effects
# Linear Regression
data %>%
  mutate(id = row_number()) %>%
  pivot_longer(-c(outcome, id), names_to = 'key', values_to = 'value') %>%
  right_join(
    (lm_fit_model %>%
       tidy() %>%
       slice(-1) %>%
       select(term, estimate)),
    by = c('key' = 'term')
  ) %>%
  mutate(effect = value * estimate) %>%
  ggplot(aes(x = effect, y = key)) +
  geom_boxplot() +
  geom_vline(xintercept = 0, color = 'grey') +
  labs(x = 'Effect', y = 'Feature') +
  theme_classic()

# Logistic Regression
data %>%
  mutate(id = row_number()) %>%
  pivot_longer(-c(outcome, id), names_to = 'key', values_to = 'value') %>%
  right_join(
    (glm_fit_model %>%
       tidy() %>%
       slice(-1) %>%
       select(term, estimate)),
    by = c('key' = 'term')
  ) %>%
  mutate(effect = value * estimate) %>%
  ggplot(aes(x = effect, y = key)) +
  geom_boxplot() +
  geom_vline(xintercept = 0, color = 'grey') +
  labs(x = 'Effect (Log Odds units)', y = 'Feature') +
  theme_classic()
Visualize Residuals
# For Any Regression Model
# Predicted vs. Residual
fit_model %>%
  predict(new_data = data_train) %>%
  bind_cols(data_train) %>%
  mutate(resid = outcome - .pred) %>%
  ggplot(aes(x = .pred, y = resid)) +
  geom_point() +
  geom_smooth() +
  theme_classic()

# Variable vs. Residual
fit_model %>%
  predict(new_data = data_train) %>%
  bind_cols(data_train) %>%
  mutate(resid = outcome - .pred) %>%
  ggplot(aes(x = var, y = resid)) +
  geom_point() +
  geom_smooth() +
  theme_classic()
Visualize Predictions
# For Any Regression Model
# Create a data_grid (fixing all other variables at some value, sequence of values for the variable of interest)
data_grid <- data.frame(x1 = value1, x2 = value2, x3 = value3, x4 = seq(0, 100, by = 1)) # example where x4 is the variable of interest

# Get Predicted Values for test values in data_grid
predicted_lines <- data_grid %>%
  bind_cols(
    predict(fit_model, new_data = data_grid),
    predict(fit_model, new_data = data_grid, type = 'conf_int')
  )

train_data %>%
  ggplot(aes(x = x4, y = outcome)) +
  geom_point() +
  geom_line(aes(y = .pred, x = x4), data = predicted_lines) +
  geom_line(aes(y = .pred_lower, x = x4), data = predicted_lines) +
  geom_line(aes(y = .pred_upper, x = x4), data = predicted_lines) +
  theme_classic()
Visualize LASSO Coefficients
glmnet_output <- final_fit_se %>% extract_fit_engine() # get the original glmnet output

lambdas <- glmnet_output$lambda
coefs_lambdas <-
  coefficients(glmnet_output, s = lambdas) %>%
  as.matrix() %>%
  t() %>%
  as.data.frame() %>%
  mutate(lambda = lambdas) %>%
  select(lambda, everything(), -`(Intercept)`) %>%
  pivot_longer(cols = -lambda,
               names_to = "term",
               values_to = "coef") %>%
  mutate(var = purrr::map_chr(stringr::str_split(term, "_"), ~.[1]))

coefs_lambdas %>%
  ggplot(aes(x = lambda, y = coef, group = term, color = var)) +
  geom_line() +
  geom_vline(xintercept = best_se_penalty %>% pull(penalty), linetype = 'dashed') +
  theme_classic() +
  theme(legend.position = "bottom", legend.text = element_text(size = 8))
Real Data Example: Regression
library(tidymodels)
tidymodels_prefer()
data(ames)
set.seed(123)
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

# If you do one train/test split
data_split <- initial_split(ames, strata = "Sale_Price", prop = 0.75) # Create Train/Test set
ames_train <- training(data_split) # Fit model to this
ames_test <- testing(data_split) # Don't use until evaluating final model
Linear Regression
lm_spec <-
  linear_reg() %>% # Specify Model and Engine
  set_engine(engine = 'lm') %>%
  set_mode('regression')

lm_rec <- recipe(Sale_Price ~ Lot_Area + Year_Built + House_Style + Gr_Liv_Area + Fireplaces, data = ames_train) %>%
  step_lincomb(all_numeric_predictors()) %>% # Specify Formula and Preprocessing Recipe
  step_zv(all_numeric_predictors()) %>%
  step_mutate(Gr_Liv_Area = Gr_Liv_Area/100, Lot_Area = Lot_Area/100) %>%
  step_mutate(Fireplaces = Fireplaces > 0) %>%
  step_cut(Year_Built, breaks = c(0, 1950, 1990, 2020)) %>%
  step_other(House_Style, threshold = .1) %>%
  step_dummy(all_nominal_predictors())

train_prep <- lm_rec %>%
  prep() %>%
  juice() # Pre-process Training Data

ames_wf <- workflow() %>% # Create Workflow (Recipe + Model Spec)
  add_recipe(lm_rec) %>%
  add_model(lm_spec)

lm_fit_train <- ames_wf %>%
  fit(data = ames_train) # Fit Model to Training Data

train_prep %>%
  select(Sale_Price) %>%
  bind_cols( predict(lm_fit_train, ames_train) ) %>%
  metrics(estimate = .pred, truth = Sale_Price) # Calculate Training metrics

lm_fit_train %>%
  tidy() # Model Coefficients from Trained Model

library(dotwhisker)
tidy(lm_fit_train) %>% # Viz of Trained Model Coef
  dwplot(dot_args = list(size = 2, color = "black"),
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 0, color = "grey50", linetype = 2))

ames_cv <- vfold_cv(ames_train, v = 10, strata = Sale_Price) # Create 10 Folds of Training Data for CV

lm_fit_cv <- fit_resamples(ames_wf, # Fit Model to 10 Folds of Training Data
                           resamples = ames_cv,
                           metrics = metric_set(rmse, mae, rsq))

lm_fit_cv %>% collect_metrics() # Evaluate Trained Model using CV

lm_fit_test <- last_fit(ames_wf,
                        split = data_split)

lm_fit_test %>%
  collect_metrics() # Evaluation on Test Data

library(vip)
conflicted::conflict_prefer("vi", "vip")

mod <- lm_fit_train %>% extract_fit_engine()
vi(mod, method = 'permute', target = 'Sale_Price', metric = 'rmse', train = train_prep, pred_wrapper = predict)
vip(mod, method = 'permute', target = 'Sale_Price', metric = 'rmse', train = train_prep, pred_wrapper = predict) + theme_classic()
Regularized Regression
lm_lasso_spec <-
  linear_reg() %>%
  set_args(mixture = 1, penalty = tune()) %>%
  set_engine(engine = 'glmnet') %>%
  set_mode('regression')

# Update Workflow (Recipe + Model Spec)
ames_wf <- ames_wf %>%
  update_model(lm_lasso_spec)

# Tune Model by Fitting Model to 10 Folds of Training Data
lasso_fit_cv <- tune_grid(ames_wf,
                          resamples = ames_cv,
                          grid = 10,
                          metrics = metric_set(rmse, mae, rsq))

lasso_fit_cv %>% autoplot() + theme_classic() # Evaluate Trained Model using CV

# Select penalty value
lasso_fit_cv %>% show_best(metric = 'rmse')

best_penalty <- lasso_fit_cv %>%
  select_by_one_std_err(metric = 'rmse', desc(penalty))

tuned_ames_wf <- finalize_workflow(ames_wf, best_penalty)

lm_lasso_spec <- tuned_ames_wf %>% pull_workflow_spec() # Save final tuned model spec

# CV metrics
lasso_fit_cv %>%
  collect_metrics() %>%
  filter(penalty == (best_penalty %>% pull(penalty)))

# Fit Tuned Lasso Model to Training Data
lasso_fit_train <- tuned_ames_wf %>%
  fit(data = ames_train)

# Training metrics
train_prep %>%
  select(Sale_Price) %>%
  bind_cols( predict(lasso_fit_train, ames_train) ) %>%
  metrics(estimate = .pred, truth = Sale_Price)

# Model Coefficients from Trained Model
lasso_fit_train %>%
  tidy()

# Var Importance
glmnet_output <- lasso_fit_train %>% extract_fit_engine()

# Create a boolean matrix (predictors x lambdas) of variable exclusion
bool_predictor_exclude <- glmnet_output$beta == 0

# Loop over each variable
var_imp <- sapply(seq_len(nrow(bool_predictor_exclude)), function(row) {
  this_coeff_path <- bool_predictor_exclude[row, ]
  if(sum(this_coeff_path) == ncol(bool_predictor_exclude)) { return(0) } else {
    return(ncol(bool_predictor_exclude) - which.min(this_coeff_path) + 1) }
})

# Create a dataset of this information and sort
var_imp_data <- tibble(
  var_name = rownames(bool_predictor_exclude),
  var_imp = var_imp
) %>% arrange(desc(var_imp))

var_imp_data
Nearest Neighbors
knn_spec <-
  nearest_neighbor() %>%
  set_args(neighbors = tune()) %>%
  set_engine(engine = 'kknn') %>%
  set_mode('regression')

# Update Workflow (Recipe + Model Spec)
ames_wf <- ames_wf %>%
  update_model(knn_spec)

# Tune Model by Fitting Model to 10 Folds of Training Data
knn_fit_cv <- tune_grid(ames_wf,
                        resamples = ames_cv,
                        grid = 20,
                        metrics = metric_set(rmse, mae, rsq))

knn_fit_cv %>% autoplot() # Evaluate Trained Model using CV

# Select number of neighbors
knn_fit_cv %>% show_best(metric = 'rmse')

best_neighbor <- knn_fit_cv %>%
  select_by_one_std_err(metric = 'rmse', neighbors)

tuned_ames_wf <- finalize_workflow(ames_wf, best_neighbor)

knn_spec <- tuned_ames_wf %>% pull_workflow_spec() # Save final tuned model spec

# CV Metrics
knn_fit_cv %>%
  collect_metrics() %>%
  filter(neighbors == (best_neighbor %>% pull(neighbors)))

# Fit KNN Model to Training Data
knn_fit_train <- tuned_ames_wf %>%
  fit(data = ames_train)

train_prep %>%
  select(Sale_Price) %>%
  bind_cols( predict(knn_fit_train, ames_train) ) %>%
  metrics(estimate = .pred, truth = Sale_Price) # Training metrics

# Visualize Predictions
data_grid <- data.frame(Lot_Area = median(ames_train$Lot_Area), Year_Built = median(ames_train$Year_Built), House_Style = 'One_Story', Fireplaces = 1, Gr_Liv_Area = seq(300, 5000, by = 100))

# Get Predicted Values for test values in data_grid
predicted_lines <- data_grid %>%
  bind_cols(
    predict(knn_fit_train, new_data = data_grid)
  )

ames_train %>%
  ggplot(aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point() +
  geom_line(aes(y = .pred, x = Gr_Liv_Area), data = predicted_lines, color = 'red') +
  theme_classic()
GAMs - Splines and Local Regression
spline_rec <- lm_rec %>%
  step_ns(Gr_Liv_Area, deg_free = 4)

ames_wf <- ames_wf %>% # Update Workflow (Recipe + Model Spec)
  update_recipe(spline_rec) %>%
  update_model(lm_spec)

spline_fit_train <- ames_wf %>%
  fit(data = ames_train) # Fit Spline Model to Training Data

train_prep %>%
  select(Sale_Price) %>%
  bind_cols( predict(spline_fit_train, ames_train) ) %>%
  metrics(estimate = .pred, truth = Sale_Price) # Training metrics

spline_fit_cv <- ames_wf %>%
  fit_resamples(resamples = ames_cv,
                metrics = metric_set(rmse, mae, rsq))

# CV Metrics
spline_fit_cv %>%
  collect_metrics()

# Visualize Predictions (Non-Linear Functions)
data_grid <- data.frame(Lot_Area = median(ames_train$Lot_Area), Year_Built = median(ames_train$Year_Built), House_Style = 'One_Story', Fireplaces = 1, Gr_Liv_Area = seq(300, 5000, by = 100))

# Get Predicted Values for test values in data_grid
predicted_lines <- data_grid %>%
  bind_cols(
    predict(spline_fit_train, new_data = data_grid)
  )

ames_train %>%
  ggplot(aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point() +
  geom_line(aes(y = .pred, x = Gr_Liv_Area), data = predicted_lines, color = 'red') +
  theme_classic()
GAM - Smoothing Splines
gam_spec <-
  gen_additive_mod() %>%
  set_engine(engine = 'mgcv') %>%
  set_mode('regression')

fit_gam_model <- gam_spec %>%
  fit(Sale_Price ~ Lot_Area + Year_Built + House_Style + s(Gr_Liv_Area, k = 15) + Fireplaces, data = ames_train)

# Summary: Parameter (linear) estimates and then Smooth Terms (H0: no relationship)
fit_gam_model %>% pluck('fit') %>% summary()

# Diagnostics: Check to see if the number of knots is large enough (if p-value is low, increase number of knots)
par(mfrow = c(2,2))
fit_gam_model %>% pluck('fit') %>% mgcv::gam.check()

# Visualize Non-Linear Functions
fit_gam_model %>% pluck('fit') %>% plot()
Regression Trees
tree_rec <- recipe(Sale_Price ~ Lot_Area + Year_Built + House_Style + Gr_Liv_Area + Fireplaces, data = ames_train) %>%
  step_zv(all_numeric_predictors())

tree_spec <-
  decision_tree() %>%
  set_engine(engine = 'rpart') %>%
  set_mode('regression')

ames_wf <- ames_wf %>% # Update Workflow (Recipe + Model Spec)
  update_recipe(tree_rec) %>%
  update_model(tree_spec)

tree_fit_train <- ames_wf %>%
  fit(data = ames_train) # Fit Regression Tree Model to Training Data

train_prep %>%
  select(Sale_Price) %>%
  bind_cols( predict(tree_fit_train, ames_train) ) %>%
  metrics(estimate = .pred, truth = Sale_Price) # Training metrics

tree_fit_train %>%
  extract_fit_engine() %>%
  rpart.plot::rpart.plot(roundint = FALSE)

tree_fit_cv <- ames_wf %>%
  fit_resamples(resamples = ames_cv,
                metrics = metric_set(rmse, mae, rsq))

tree_fit_cv %>%
  collect_metrics()
Comparison of Methods
wf_set <- workflow_set(
  preproc = list(lm = lm_rec, spline = spline_rec, tree = tree_rec),
  models = list(lm = lm_spec, lasso = lm_lasso_spec, knn = knn_spec, tree = tree_spec),
  cross = TRUE) %>%
  anti_join(tibble(wflow_id = c("lm_tree", "spline_lasso", "spline_knn", "spline_tree", "tree_lm", "tree_lasso", "tree_knn")),
            by = "wflow_id")

ames_wf_set <- wf_set %>%
  workflow_map(
    "fit_resamples",
    resamples = ames_cv, # Compare Methods via CV
    metrics = metric_set(rmse, rsq, mae))

ames_wf_set %>% autoplot()

ames_wf_set %>% rank_results(rank_metric = 'rmse')

list(lm = lm_fit_test, lasso = lasso_fit_test, knn = knn_fit_test, spline = spline_fit_test, tree = tree_fit_test) %>% # Comparison of Methods fit to Test Data
  map_df(collect_metrics, .id = 'Model') %>%
  ggplot(aes(x = Model, y = .estimate)) +
  geom_point() +
  facet_grid(.metric ~ ., scales = 'free') +
  theme_classic()

new_ames <- crossing(Fireplaces = 0:1, Lot_Area = seq(1500, 15000, length = 10), Gr_Liv_Area = seq(500, 5000, length = 10), Year_Built = c(1900, 1950, 1975, 2000), House_Style = c('One_Story', 'Two_Story', 'SLvl')) %>%
  filter(!(Gr_Liv_Area > Lot_Area))

all_preds <- new_ames %>%
  bind_cols(predict(lm_fit_train, new_ames)) %>%
  rename(.lm.pred = .pred) %>%
  bind_cols(predict(lasso_fit_train, new_ames)) %>%
  rename(.lasso.pred = .pred) %>%
  bind_cols(predict(knn_fit_train, new_ames)) %>%
  rename(.knn.pred = .pred) %>%
  bind_cols(predict(spline_fit_train, new_ames)) %>%
  rename(.spline.pred = .pred) %>%
  bind_cols(predict(tree_fit_train, new_ames)) %>%
  rename(.tree.pred = .pred)

# Visualize Predictions from these Models
all_preds %>%
  filter(Year_Built == 1975 & Fireplaces == 0) %>%
  ggplot(aes(x = Gr_Liv_Area, y = .spline.pred, color = House_Style)) +
  geom_point() +
  geom_line() +
  facet_wrap(~Lot_Area) +
  theme_classic()
Real Data Example: Classification
library(tidymodels)
tidymodels_prefer()
data(ames)
set.seed(123)
ames <- ames %>% mutate(High_Sale_Price = factor(Sale_Price > quantile(Sale_Price, .75)))

data_split <- initial_split(ames, strata = "High_Sale_Price", prop = 0.75) # Create Train/Test set

ames_train <- training(data_split) # Fit model to this
ames_test <- testing(data_split) # Don't use until evaluating final model
Logistic Regression
logistic_spec <-
  logistic_reg() %>% # Specify Model and Engine
  set_engine(engine = 'glm') %>%
  set_mode('classification')

glm_rec <- recipe(High_Sale_Price ~ Lot_Area + Year_Built + House_Style + Gr_Liv_Area + Fireplaces, data = ames_train) %>%
  step_lincomb(all_numeric_predictors()) %>% # Specify Formula and Preprocessing Recipe
  step_zv(all_numeric_predictors()) %>%
  step_mutate(Gr_Liv_Area = Gr_Liv_Area/100, Lot_Area = Lot_Area/100) %>%
  step_mutate(Fireplaces = Fireplaces > 0) %>%
  step_cut(Year_Built, breaks = c(0, 1950, 1990, 2020)) %>%
  step_other(House_Style, threshold = .1) %>%
  step_dummy(all_nominal_predictors())

train_prep <- glm_rec %>%
  prep() %>%
  juice() # Create Prepped Training Data

ames_wf <- workflow() %>% # Create Workflow (Recipe + Model Spec)
  add_recipe(glm_rec) %>%
  add_model(logistic_spec)

logistic_fit_train <- ames_wf %>%
  fit(data = ames_train) # Fit Model to Training Data

predict(logistic_fit_train, ames_train) %>%
  bind_cols(train_prep %>% select(High_Sale_Price)) %>%
  metrics(estimate = .pred_class, truth = High_Sale_Price) # Training metrics

logistic_fit_train %>%
  tidy() # Model Coefficients from Trained Model

library(dotwhisker)
tidy(logistic_fit_train) %>% # Viz of Trained Model Odds Ratios
  mutate(conf.low = exp(estimate - 1.96*std.error), conf.high = exp(estimate + 1.96*std.error)) %>% # do this first
  mutate(estimate = exp(estimate)) %>% # this second
  dwplot(dot_args = list(size = 2, color = "black"),
         whisker_args = list(color = "black"),
         vline = geom_vline(xintercept = 1, color = "grey50", linetype = 2)) + labs(x = 'Odds Ratio') + theme_classic()

ames_cv <- vfold_cv(ames_train, v = 10, strata = High_Sale_Price) # Create 10 Folds of Training Data for CV

logistic_fit_cv <- fit_resamples(ames_wf, # Fit Model to 10 Folds of Training Data
                                 resamples = ames_cv,
                                 metrics = metric_set(sens, spec, accuracy, roc_auc))

logistic_fit_cv %>% collect_metrics() # Evaluate Model using CV

logistic_fit_test <- last_fit(ames_wf,
                              split = data_split,
                              metrics = metric_set(sens, spec, accuracy, roc_auc))

logistic_fit_test %>%
  collect_metrics() # Evaluation on Test Data
Regularized Logistic Regression
logistic_lasso_spec <-
  logistic_reg() %>%
  set_args(mixture = 1, penalty = tune()) %>%
  set_engine(engine = 'glmnet') %>%
  set_mode('classification')

ames_wf <- ames_wf %>% # Update Workflow (Recipe + Model Spec)
  update_model(logistic_lasso_spec)

logistic_lasso_fit_cv <- tune_grid(ames_wf, # Fit Model to 10 Folds of Training Data
                                   resamples = ames_cv,
                                   grid = 10,
                                   metrics = metric_set(sens, spec, accuracy, roc_auc))

logistic_lasso_fit_cv %>% autoplot() # Evaluate Trained Model using CV

logistic_lasso_fit_cv %>% show_best(metric = 'roc_auc')

tuned_ames_wf <- logistic_lasso_fit_cv %>%
  select_best(metric = 'roc_auc') %>%
  finalize_workflow(ames_wf, .) # set penalty based on best fit

logistic_lasso_spec <- tuned_ames_wf %>% pull_workflow_spec() # Save final tuned model spec

logistic_lasso_fit_train <- tuned_ames_wf %>%
  fit(data = ames_train) # Fit Lasso Model to Training Data (penalty is now set)

logistic_lasso_fit_train %>%
  predict(new_data = ames_train) %>%
  bind_cols(train_prep %>% select(High_Sale_Price)) %>%
  metrics(estimate = .pred_class, truth = High_Sale_Price) # Training metrics

logistic_lasso_fit_train %>%
  tidy() # Model Coefficients from Trained Model

lasso_fit_test <- last_fit(tuned_ames_wf,
                           split = data_split,
                           metrics = metric_set(sens, spec, accuracy, roc_auc))

lasso_fit_test %>%
  collect_metrics() # Evaluation on Test Data
Classification Trees
ctree_spec <-
  decision_tree() %>%
  set_engine(engine = 'rpart') %>%
  set_mode('classification')

ames_wf <- ames_wf %>% # Update Workflow (Recipe + Model Spec)
  update_model(ctree_spec)

ctree_fit_train <- ames_wf %>%
  fit(data = ames_train) # Fit Classification Tree Model to Training Data

ctree_fit_train %>%
  predict(ames_train) %>%
  bind_cols(train_prep %>% select(High_Sale_Price)) %>%
  metrics(estimate = .pred_class, truth = High_Sale_Price) # Training metrics

ctree_fit_train %>%
  pull_workflow_fit() %>%
  pluck('fit') %>%
  rpart.plot::rpart.plot(roundint = FALSE)

ctree_fit_cv <- ames_wf %>%
  fit_resamples(resamples = ames_cv,
                metrics = metric_set(sens, spec, accuracy, roc_auc))

ctree_fit_cv %>%
  collect_metrics()

last_fit(ames_wf,
         split = data_split,
         metrics = metric_set(sens, spec, accuracy, roc_auc)
) %>%
  collect_metrics() # Evaluation on Test Data
Random Forest
rf_spec <-
  rand_forest() %>%
  set_engine(engine = 'ranger') %>%
  set_mode('classification')

ames_wf <- ames_wf %>% # Update Workflow (Recipe + Model Spec)
  update_model(rf_spec)

rf_fit_train <- ames_wf %>%
  fit(data = ames_train) # Fit Random Forest Model to Training Data

rf_fit_train %>%
  predict(ames_train) %>%
  bind_cols(train_prep %>% select(High_Sale_Price)) %>%
  metrics(estimate = .pred_class, truth = High_Sale_Price) # Training metrics

rf_fit_cv <- ames_wf %>%
  fit_resamples(resamples = ames_cv,
                metrics = metric_set(sens, spec, accuracy, roc_auc))

rf_fit_cv %>%
  collect_metrics()

last_fit(ames_wf,
         split = data_split,
         metrics = metric_set(sens, spec, accuracy, roc_auc)
) %>%
  collect_metrics() # Evaluation on Test Data
Comparison of Methods
wf_set <- workflow_set(
  preproc = list(glm_rec),
  models = list(logistic = logistic_spec, lasso = logistic_lasso_spec, tree = ctree_spec, rf = rf_spec),
  cross = TRUE)

ames_wf_set <- wf_set %>%
  workflow_map(
    "fit_resamples",
    resamples = ames_cv, # Compare Methods via CV
    metrics = metric_set(accuracy, sens, spec, roc_auc))

ames_wf_set %>% autoplot()

ames_wf_set %>% rank_results(rank_metric = 'roc_auc')