Notes - LASSO


  • world = supervised learning
    We want to model some output variable \(y\) using a set of potential predictors \((x_1, x_2, ..., x_p)\).

  • task = regression
    \(y\) is quantitative

  • model = linear regression
    We’ll assume that the relationship between \(y\) and \((x_1, x_2, ..., x_p)\) can be represented by

    \[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \varepsilon\]

  • estimation algorithm = LASSO (instead of Least Squares)


Least Absolute Shrinkage and Selection Operator

  • Proposed in 1996 by Robert Tibshirani (one of the authors of ISLR)

Use the LASSO algorithm to regularize the coefficients and select the “best” predictors \(x\) to use in a predictive linear regression model of \(y\):

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p\]


  • Penalize a predictor for adding complexity to the model (by penalizing its coefficient).
  • Track whether the predictor’s contribution to the model (lowering RSS) is enough to offset this penalty.

Algorithm Criterion

Identify the model coefficients \(\hat{\beta}_1, \hat{\beta}_2, ..., \hat{\beta}_p\) that minimize the penalized residual sum of squares:

\[RSS + \lambda \sum_{j=1}^p \vert \hat{\beta}_j\vert = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \vert \hat{\beta}_j\vert\]


  • residual sum of squares (RSS) measures the overall model prediction error
  • the penalty term measures the overall size of the model coefficients
  • \(\lambda \ge 0\) (“lambda”) is a tuning parameter
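To make the criterion concrete, here is a minimal Python sketch that evaluates \(RSS + \lambda \sum_j \vert \beta_j \vert\) for two candidate coefficient sets. The function name `penalized_rss`, the toy data, and the candidate coefficients are all made up for illustration, and the sketch assumes the intercept is left out of the penalty, matching the sum from \(j = 1\) in the criterion above.

```python
def penalized_rss(X, y, beta0, beta, lam):
    """Return RSS + lam * sum(|beta_j|). The intercept beta0 is not penalized."""
    rss = 0.0
    for row, y_i in zip(X, y):
        # Prediction for one observation: beta0 + beta_1 x_1 + ... + beta_p x_p
        y_hat = beta0 + sum(b * x for b, x in zip(beta, row))
        rss += (y_i - y_hat) ** 2
    penalty = lam * sum(abs(b) for b in beta)
    return rss + penalty


# Toy data (made up): 4 observations, 2 predictors.
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.2], [4.0, -0.1]]
y = [2.1, 3.9, 6.2, 7.8]

# Candidate 1 uses both predictors; candidate 2 drops the second one.
full = penalized_rss(X, y, 0.0, [2.0, 0.5], lam=1.0)
sparse = penalized_rss(X, y, 0.0, [2.0, 0.0], lam=1.0)

print(f"both predictors:  {full:.4f}")
print(f"first predictor:  {sparse:.4f}")
```

In this toy setup, with \(\lambda = 1\), the candidate that drops the second predictor scores lower (about 2.10 versus 2.56): the second predictor reduces RSS only slightly, which is not enough to offset the penalty on its coefficient. With \(\lambda = 0\) the comparison reduces to plain RSS and the two-predictor candidate scores lower.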

Small Group Discussion

Discuss your basic understanding of the video and help each other clear up any confusing concepts.

EXAMPLE 1: LASSO vs other algorithms for building linear regression models

  1. LASSO vs least squares
    • What’s one advantage of LASSO vs least squares?
    • Which algorithm(s) require us (or R) to scale the predictors?

  2. What is one advantage of LASSO vs backward stepwise selection?

Small Group Discussion


We have to pick a \(\lambda\) penalty tuning parameter for our LASSO model. What’s the impact of \(\lambda\)?

  1. When \(\lambda\) is 0, …

  2. As \(\lambda\) increases, the predictor coefficients ….

  3. Goldilocks problem:

    • If \(\lambda\) is too big, ….
    • If \(\lambda\) is too small, …
  4. To decide between a LASSO that uses \(\lambda = 0.01\) vs \(\lambda = 0.1\) (for example), we can ….

Picking \(\lambda\)

We cannot know the “best” value for \(\lambda\) in advance. This varies from analysis to analysis.

We must try a reasonable range of possible values for \(\lambda\). This also varies from analysis to analysis.

In general, we have to use trial-and-error to identify a range that is…

  • wide enough that it doesn’t miss the best values for \(\lambda\)
  • narrow enough that it focuses on reasonable values for \(\lambda\)
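One possible starting point for the trial-and-error, sketched in Python below, is a grid of \(\lambda\) values spaced evenly on the log scale. This is a heuristic under stated assumptions, not a rule from these notes: the helper names `lambda_max` and `log_grid`, the toy data, the grid size, and the ratio between the largest and smallest value are all hypothetical choices. The upper end is the smallest \(\lambda\) at which every penalized coefficient is exactly zero under the criterion \(RSS + \lambda \sum_j \vert \beta_j \vert\) with an unpenalized intercept, so larger values would add nothing to the search.

```python
import math


def lambda_max(X, y):
    """Smallest lambda at which all penalized coefficients are exactly zero,
    for the criterion RSS + lambda * sum(|beta_j|) with an unpenalized intercept."""
    n = len(y)
    p = len(X[0])
    y_bar = sum(y) / n
    # Inner product of each predictor with the centered response.
    inner = [sum(X[i][j] * (y[i] - y_bar) for i in range(n)) for j in range(p)]
    return 2.0 * max(abs(v) for v in inner)


def log_grid(lam_hi, ratio=1e-3, k=10):
    """k values, evenly spaced on the log scale, from lam_hi down to lam_hi * ratio."""
    log_hi = math.log(lam_hi)
    log_lo = math.log(lam_hi * ratio)
    step = (log_lo - log_hi) / (k - 1)
    return [math.exp(log_hi + i * step) for i in range(k)]


# Toy data (made up): 4 observations, 2 predictors.
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.2], [4.0, -0.1]]
y = [2.1, 3.9, 6.2, 7.8]

lam_hi = lambda_max(X, y)
grid = log_grid(lam_hi)

for lam in grid:
    print(f"{lam:.4f}")
```

If the best-performing \(\lambda\) lands at either end of the grid, that suggests the range is too narrow and should be extended in that direction; if it lands well inside, a finer grid around it may be worth a second pass. Software that scales the criterion differently (for example, dividing RSS by the sample size) will put \(\lambda\) on a different scale, so these values are specific to the criterion as written in these notes.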

