8.18 (Ordinary) Least Squares

8.18.1 Calculus Approach

Let \(y_i\) be the outcome for individual \(i=1,...,n\) and \(\mathbf{x}_i\) be a vector of explanatory variables for individual \(i\) (with a leading 1 for the intercept). When we fit a linear regression model, we want to find the coefficients, \(\boldsymbol\beta = (\beta_0,...,\beta_p)\), that minimize the sum of squared errors,

\[\sum_i (y_i - \mathbf{x}_i^T\boldsymbol\beta)^2\]

If we stack the \(y_i\) on top of each other into an \(n\times 1\) vector \(\mathbf{y}\) and stack the row vectors \(\mathbf{x}_i^T\) on top of each other into an \(n \times (p+1)\) matrix \(\mathbf{X}\), we can write the sum of squared errors as an inner product,

\[ (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T(\mathbf{y} - \mathbf{X}\boldsymbol\beta)\]

Expanding this product using matrix multiplication and the properties of transposes, we get

\[ \mathbf{y}^T\mathbf{y} - \boldsymbol\beta^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol\beta + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \] Because \(\mathbf{y}^T\mathbf{X}\boldsymbol\beta\) is a scalar, it equals its own transpose \(\boldsymbol\beta^T\mathbf{X}^T\mathbf{y}\), so the two middle terms combine and the expression simplifies to

\[ \mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \]
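As a quick numerical sanity check, here is a minimal sketch (assuming NumPy; the simulated data, seed, and dimensions are arbitrary choices, not from the text) that evaluates the sum of squared errors in three ways (the elementwise sum, the inner-product form, and the expanded form) and confirms they agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary simulated data: n observations, p explanatory variables,
# plus a leading column of ones for the intercept.
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)  # any candidate coefficient vector

# Elementwise definition: sum_i (y_i - x_i' beta)^2
sse_sum = np.sum((y - X @ beta) ** 2)

# Inner-product form: (y - X beta)'(y - X beta)
resid = y - X @ beta
sse_inner = resid @ resid

# Expanded form: y'y - 2 beta'X'y + beta'X'X beta
sse_expanded = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta

print(np.allclose([sse_sum, sse_inner], sse_expanded))  # True
```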

To find the value of \(\boldsymbol\beta\) that minimizes the sum of squared errors, we can take the derivative with respect to \(\boldsymbol\beta\) and set the resulting equations equal to zero.

\[ \frac{\partial}{\partial \boldsymbol \beta} [ \mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta ] = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol\beta \]
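The derivative formula itself can be checked numerically. Below is a minimal sketch (again assuming NumPy and arbitrary simulated data; the step size 1e-6 is a hypothetical choice) that compares the analytic gradient \(-2\mathbf{X}^T\mathbf{y}+ 2\mathbf{X}^T\mathbf{X}\boldsymbol\beta\) to a central finite-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

def sse(b):
    """Sum of squared errors (y - Xb)'(y - Xb)."""
    r = y - X @ b
    return r @ r

# Analytic gradient from the derivation above
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (sse(beta + eps * e) - sse(beta - eps * e)) / (2 * eps)
    for e in np.eye(p + 1)
])

print(np.allclose(grad_analytic, grad_numeric))  # True
```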

If we set this derivative equal to \(\mathbf{0}\) and solve for \(\boldsymbol\beta\) (assuming \(\mathbf{X}^T\mathbf{X}\) is invertible, which holds when \(\mathbf{X}\) has full column rank), we get

\[\hat{\boldsymbol\beta}_{OLS}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
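In code, the closed-form estimator amounts to solving the normal equations \(\mathbf{X}^T\mathbf{X}\boldsymbol\beta = \mathbf{X}^T\mathbf{y}\). The sketch below (NumPy, simulated data with a known coefficient vector chosen only for illustration) solves them directly and compares the result to np.linalg.lstsq; in practice a least-squares solver or QR decomposition is preferred over forming \((\mathbf{X}^T\mathbf{X})^{-1}\) explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
true_beta = np.array([1.0, -2.0, 0.5, 3.0])   # illustrative "true" coefficients
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Closed-form OLS: (X'X)^{-1} X'y, written as a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solution for comparison
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))  # True
print(beta_hat)                           # close to true_beta
```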

8.18.2 Projection Approach

Another way to approach least squares is to find the unique \(\hat{\mathbf{y}}\) in \(Col(\mathbf{X})\), the column space of \(\mathbf{X}\), that minimizes \(||\mathbf{y} - \hat{\mathbf{y}}||\). In order for \(\hat{\mathbf{y}}\) to be in the column space of \(\mathbf{X}\), we must have \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) for some vector \(\hat{\boldsymbol\beta}\).

The orthogonal projection of a vector \(\mathbf{y}\) onto the column space of the matrix \(\mathbf{X}\) is \[\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]

and therefore,

\[\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]
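To tie the two derivations together, the sketch below (NumPy, arbitrary simulated data) forms the projection matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\), checks that \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\) equals \(\mathbf{X}\hat{\boldsymbol\beta}\) from the calculus approach, and verifies that the residual \(\mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to every column of \(\mathbf{X}\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

# Projection ("hat") matrix onto the column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Fitted values two ways: projecting y, and X times the OLS estimate
y_hat_proj = H @ y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat_ols = X @ beta_hat

print(np.allclose(y_hat_proj, y_hat_ols))        # True
print(np.allclose(X.T @ (y - y_hat_proj), 0.0))  # residual orthogonal to Col(X)
```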