8.18 (Ordinary) Least Squares

8.18.1 Calculus Approach

Let \(y_i\) be the outcome for individual \(i=1,...,n\) and \(\mathbf{x}_i\) be a vector of explanatory variables for individual \(i\) (with a leading 1 for the intercept). When we fit a linear regression model, we want to find the coefficients, \(\boldsymbol\beta = (\beta_0,...,\beta_p)\), that minimize the sum of squared errors,

\[\sum_i (y_i - \mathbf{x}_i^T\boldsymbol\beta)^2\]

If we stack the \(y_i\) on top of each other into an \(n\times 1\) vector \(\mathbf{y}\) and stack the row vectors \(\mathbf{x}_i^T\) on top of each other into an \(n \times (p+1)\) matrix \(\mathbf{X}\), we can write the sum of squared errors as an inner product,

\[ (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T(\mathbf{y} - \mathbf{X}\boldsymbol\beta)\]

Expanding this product using matrix multiplication and the properties of transposes, we get

\[ \mathbf{y}^T\mathbf{y} - \boldsymbol\beta^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol\beta + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \] Because \(\mathbf{y}^T\mathbf{X}\boldsymbol\beta\) is a scalar, it equals its own transpose \(\boldsymbol\beta^T\mathbf{X}^T\mathbf{y}\), so the two middle terms combine and the expression simplifies to

\[ \mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \]
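As a quick numerical sanity check, here is a minimal sketch (assuming NumPy; the simulated data, seed, and dimensions are arbitrary choices, not from the text) that evaluates the sum of squared errors in three ways (the elementwise sum, the inner-product form, and the expanded form) and confirms they agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary simulated data: n observations, p explanatory variables,
# plus a leading column of ones for the intercept.
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)  # any candidate coefficient vector

# Elementwise definition: sum_i (y_i - x_i' beta)^2
sse_sum = np.sum((y - X @ beta) ** 2)

# Inner-product form: (y - X beta)'(y - X beta)
resid = y - X @ beta
sse_inner = resid @ resid

# Expanded form: y'y - 2 beta'X'y + beta'X'X beta
sse_expanded = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta

print(np.allclose([sse_sum, sse_inner], sse_expanded))  # True
```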

To find the value of \(\boldsymbol\beta\) that minimizes the sum of squared errors, we can take the derivative with respect to \(\boldsymbol\beta\) and set the resulting equations equal to zero.

\[ \frac{\partial}{\partial \boldsymbol \beta} [ \mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta ] = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol\beta \]
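The derivative formula itself can be checked numerically. Below is a minimal sketch (again assuming NumPy and arbitrary simulated data; the step size 1e-6 is a hypothetical choice) that compares the analytic gradient \(-2\mathbf{X}^T\mathbf{y}+ 2\mathbf{X}^T\mathbf{X}\boldsymbol\beta\) to a central finite-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

def sse(b):
    """Sum of squared errors (y - Xb)'(y - Xb)."""
    r = y - X @ b
    return r @ r

# Analytic gradient from the derivation above
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (sse(beta + eps * e) - sse(beta - eps * e)) / (2 * eps)
    for e in np.eye(p + 1)
])

print(np.allclose(grad_analytic, grad_numeric))  # True
```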

If we set this derivative equal to \(\mathbf{0}\) and solve for \(\boldsymbol\beta\) (assuming \(\mathbf{X}^T\mathbf{X}\) is invertible, which holds when \(\mathbf{X}\) has full column rank), we get

\[\hat{\boldsymbol\beta}_{OLS}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
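In code, the closed-form estimator amounts to solving the normal equations \(\mathbf{X}^T\mathbf{X}\boldsymbol\beta = \mathbf{X}^T\mathbf{y}\). The sketch below (NumPy, simulated data with a known coefficient vector chosen only for illustration) solves them directly and compares the result to np.linalg.lstsq; in practice a least-squares solver or QR decomposition is preferred over forming \((\mathbf{X}^T\mathbf{X})^{-1}\) explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
true_beta = np.array([1.0, -2.0, 0.5, 3.0])   # illustrative "true" coefficients
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Closed-form OLS: (X'X)^{-1} X'y, written as a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solution for comparison
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))  # True
print(beta_hat)                           # close to true_beta
```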

8.18.2 Projection Approach

Another way to approach least squares is to find the unique \(\hat{\mathbf{y}}\) in \(Col(\mathbf{X})\), the column space of \(\mathbf{X}\), that minimizes \(||\mathbf{y} - \hat{\mathbf{y}}||\). In order for \(\hat{\mathbf{y}}\) to be in the column space of \(\mathbf{X}\), we must have \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) for some vector \(\hat{\boldsymbol\beta}\).

The orthogonal projection of a vector \(\mathbf{y}\) onto the column space of the matrix \(\mathbf{X}\) is \[\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]

and therefore,

\[\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \]
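To tie the two derivations together, the sketch below (NumPy, arbitrary simulated data) forms the projection matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\), checks that \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\) equals \(\mathbf{X}\hat{\boldsymbol\beta}\) from the calculus approach, and verifies that the residual \(\mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to every column of \(\mathbf{X}\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

# Projection ("hat") matrix onto the column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Fitted values two ways: projecting y, and X times the OLS estimate
y_hat_proj = H @ y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat_ols = X @ beta_hat

print(np.allclose(y_hat_proj, y_hat_ols))        # True
print(np.allclose(X.T @ (y - y_hat_proj), 0.0))  # residual orthogonal to Col(X)
```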