8.18 (Ordinary) Least Squares
8.18.1 Calculus Approach
Let \(y_i\) be the outcome for individual \(i=1,...,n\) and \(\mathbf{x}_i\) be a vector of explanatory variables for individual \(i\) (including a 1 for the intercept). When we fit a linear regression model, we want to find the coefficients, \(\boldsymbol\beta = (\beta_0,...,\beta_p)\), that minimize the sum of squared errors,
\[\sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol\beta)^2\]
If we stack the \(y_i\) on top of each other into an \(n\times 1\) vector \(\mathbf{y}\) and stack the row vectors \(\mathbf{x}_i^T\) on top of each other into an \(n \times (p+1)\) matrix \(\mathbf{X}\), we can write the sum of squared errors as an inner product,
\[ (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T(\mathbf{y} - \mathbf{X}\boldsymbol\beta)\]
Expanding this product and using properties of the transpose, we get
\[ \mathbf{y}^T\mathbf{y} - \boldsymbol\beta^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol\beta + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \] Because \(\mathbf{y}^T\mathbf{X}\boldsymbol\beta\) is a scalar, it equals its own transpose \(\boldsymbol\beta^T\mathbf{X}^T\mathbf{y}\), so the two middle terms combine and the expression simplifies to
\[ \mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \]
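As a quick numerical check, the minimal sketch below (Python with numpy, using randomly generated toy data that is purely illustrative) confirms that the elementwise sum of squared errors, the stacked inner-product form, and the expanded form above all give the same value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2

# Toy design matrix with an intercept column, plus an arbitrary candidate coefficient vector
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
b = rng.normal(size=p + 1)

sse_sum = np.sum((y - X @ b) ** 2)                       # sum_i (y_i - x_i^T b)^2
resid = y - X @ b
sse_inner = resid @ resid                                # (y - Xb)^T (y - Xb)
sse_expanded = y @ y - 2 * b @ X.T @ y + b @ X.T @ X @ b # expanded quadratic form

print(np.allclose(sse_sum, sse_inner), np.allclose(sse_inner, sse_expanded))  # True True
```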
To find the value of \(\boldsymbol\beta\) that minimizes this quantity, we take the derivative with respect to \(\boldsymbol\beta\) and set it equal to zero.
\[ \frac{\partial}{\partial \boldsymbol \beta} [ \mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta ] = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol\beta \]
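To verify this gradient formula numerically, the following sketch (same assumed toy-data setup as above) compares \(-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol\beta\) against a central finite-difference approximation of the derivative of the sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

def sse(b):
    r = y - X @ b
    return r @ r

b = rng.normal(size=p + 1)
grad_formula = -2 * X.T @ y + 2 * X.T @ X @ b

# Central finite differences, one coordinate of beta at a time
eps = 1e-6
grad_fd = np.array([
    (sse(b + eps * e) - sse(b - eps * e)) / (2 * eps)
    for e in np.eye(p + 1)
])

print(np.allclose(grad_formula, grad_fd, atol=1e-4))  # True
```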
If we set this equal to \(\mathbf{0}\) and solve for \(\boldsymbol\beta\) (assuming \(\mathbf{X}^T\mathbf{X}\) is invertible), we get
\[\hat{\boldsymbol\beta}_{OLS}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
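A quick check of this closed form (again with assumed toy data): solving the normal equations \(\mathbf{X}^T\mathbf{X}\boldsymbol\beta = \mathbf{X}^T\mathbf{y}\) gives the same coefficients as numpy's built-in least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Closed-form OLS estimate; solve() is used rather than forming an explicit inverse
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_ols, beta_lstsq))  # True
```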
8.18.2 Projection Approach
Another way to approach Least Squares is to say that we want to find the unique \(\hat{\mathbf{y}}\) in \(Col(\mathbf{X})\), the column space of \(\mathbf{X}\), that minimizes \(||\mathbf{y} - \hat{\mathbf{y}}||\). For \(\hat{\mathbf{y}}\) to be in the column space of \(\mathbf{X}\), it must be of the form \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) for some \(\hat{\boldsymbol\beta}\).
The orthogonal projection of a vector \(\mathbf{y}\) onto the column space of the matrix \(\mathbf{X}\) is \[\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]
and therefore,
\[\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]
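The sketch below (same assumed toy data) forms the projection \(\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) and checks two defining properties of an orthogonal projection onto \(Col(\mathbf{X})\): the residual \(\mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to every column of \(\mathbf{X}\), and the projection matrix is idempotent. It also confirms that \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta}\) with \(\hat{\boldsymbol\beta}\) from the calculus approach.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Hat (projection) matrix onto the column space of X
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(np.allclose(X.T @ (y - y_hat), 0, atol=1e-8))  # residual is orthogonal to Col(X)
print(np.allclose(H @ H, H))                          # projection matrix is idempotent

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(y_hat, X @ beta_hat))               # y_hat = X beta_hat
```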