6.5 Failure of Standard Estimation Methods

6.5.1 Ordinary Least Squares

Imagine we have \(n\) units/subjects, and each unit has \(m\) observations over time. Conditional on \(p\) explanatory variables, we typically assume a linear model for the relationship between the explanatory variables and the outcome,

\[Y_{ij} = \beta_0 + \beta_1x_{ij1}+ \cdots+ \beta_px_{ijp} + \epsilon_{ij}\] If we are using ordinary least squares to estimate the slope parameters, we assume the errors represent independent measurement error, \(\epsilon_{ij} \stackrel{iid}{\sim} \mathcal{N}(0,\sigma^2)\).

This can be written using vector and matrix notation with each unit having its outcome vector, \(\mathbf{Y}_i\),

\[\mathbf{Y}_i = \mathbf{X}_i\boldsymbol{\beta} + \boldsymbol\epsilon_i\text{ for each unit } i\] where \(\boldsymbol\epsilon_i \sim \mathcal{N}(0,\sigma^2I)\) and \(\mathbf{X}_i\) is the matrix of explanatory variable observations for subject \(i\), with \(m\) rows and \(p+1\) columns (a column of 1’s for the intercept plus \(p\) columns corresponding to the \(p\) variables).

If we stack all of our outcome vectors \(\mathbf{Y}_i\) (and covariate matrices \(\mathbf{X}_i\)) on top of each other (think Long Format in R),

\[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\]

where \(\boldsymbol{\epsilon} \sim \mathcal{N}(0,\sigma^2I)\) so we are assuming that our errors and thus our data are independent within and between each unit and have constant variance.
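
To make this concrete, here is a small sketch in R with hypothetical data (the names `id`, `time`, and `y` are made up): \(n = 30\) units, each measured \(m = 4\) times, stacked in long format, with a unit-specific shift in the errors so that measurements within a unit are correlated. We then fit the model by OLS, which treats every row as an independent observation.

```r
# A minimal sketch with hypothetical data: n = 30 units, m = 4 repeated
# measures each, stacked in long format (one row per observation)
set.seed(6)
n <- 30; m <- 4
dat <- data.frame(
  id   = rep(1:n, each = m),          # unit identifier
  time = rep(0:(m - 1), times = n)    # one explanatory variable
)
# a unit-specific shift induces correlation among a unit's errors
b_i <- rnorm(n, sd = 2)
dat$y <- 5 + 1.5 * dat$time + b_i[dat$id] + rnorm(n * m, sd = 1)

# OLS ignores the grouping entirely: every row is treated as independent
fit_ols <- lm(y ~ time, data = dat)
summary(fit_ols)$coefficients
```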

6.5.1.1 Estimation

To find the Ordinary Least Squares (OLS) estimate \(\widehat{\boldsymbol{\beta}}_{OLS}\), we need to minimize the sum of squared residuals, which can be written with vector and matrix notation,

\[(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}).\]

Taking the derivative with respect to \((\beta_0,\beta_1,...,\beta_p)\), we get a set of simultaneous equations expressed in matrix notation as

\[ -2\mathbf{X}^T\mathbf{Y} +2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}\]

As long as \(\mathbf{X}^T\mathbf{X}\) has an inverse, then the unique solution and our estimated parameters can be found using

\[\widehat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\ \]
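
As a quick sketch with the hypothetical data above, the matrix formula reproduces the coefficients that `lm()` computes.

```r
# OLS via the matrix formula, compared to lm() (hypothetical data from above)
X <- model.matrix(~ time, data = dat)       # n*m rows, p + 1 columns
Y <- dat$y
beta_ols <- solve(t(X) %*% X, t(X) %*% Y)   # (X'X)^{-1} X'Y
cbind(beta_ols, coef(fit_ols))              # same estimates
```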

This estimator is unbiased in that it estimates the true value on average, \(E(\widehat{\boldsymbol{\beta}}_{OLS}) = \boldsymbol{\beta}\).

Derivation: \[E(\widehat{\boldsymbol{\beta}}_{OLS}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^TE(\mathbf{Y}) \] \[= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\ \] \[= \boldsymbol{\beta}\ \]

It also has the minimum variance among all linear unbiased estimators IF the errors are independent and have constant variance (this is the Gauss-Markov theorem).

Derivation: \[Cov(\widehat{\boldsymbol{\beta}}_{OLS}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T Cov(\mathbf{Y})\{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\}^T \] \[= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \] \[= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} \]

using \(Cov(\mathbf{A}\mathbf{Y}) = \mathbf{A}Cov(\mathbf{Y})\mathbf{A}^T\), which you can verify.
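
Here is a quick numerical check of that identity (a sketch, not a proof), using an arbitrary small \(\mathbf{A}\) and simulated \(\mathbf{Y}\) with a known covariance:

```r
# Empirical check that Cov(AY) = A Cov(Y) A'
set.seed(1)
A  <- matrix(c(1, 0, 2, -1, 1, 1), nrow = 2)        # arbitrary 2 x 3 matrix
S  <- diag(3) + 0.5                                  # a known 3 x 3 covariance
Ys <- MASS::mvrnorm(1e5, mu = rep(0, 3), Sigma = S)  # many draws of Y
cov(Ys %*% t(A))     # empirical Cov(AY)
A %*% S %*% t(A)     # theoretical A Cov(Y) A'
```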

The OLS estimates of our coefficients are still unbiased, but the OLS standard errors are wrong unless our data are independent. Since we have correlated repeated measures, we don’t have as much “information” as we would from the same number of independent observations.
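
A small simulation with the same hypothetical setup as above illustrates this: the standard errors that OLS reports do not match the actual sampling variability of the coefficients when the within-unit correlation is ignored.

```r
# Repeatedly simulate correlated data, fit OLS, and save estimates and SEs
sim <- replicate(2000, {
  b_i <- rnorm(n, sd = 2)
  y   <- 5 + 1.5 * dat$time + b_i[dat$id] + rnorm(n * m, sd = 1)
  s   <- summary(lm(y ~ time, data = dat))$coefficients
  c(s[, "Estimate"], s[, "Std. Error"])
})
apply(sim[1:2, ], 1, sd)   # actual sampling SDs of (Intercept) and time slope
rowMeans(sim[3:4, ])       # average SEs reported by OLS; they do not match
```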

6.5.2 Generalized Least Squares

If you relax the assumption of constant variance, then the covariance matrix of \(\boldsymbol\epsilon_i\) and thus \(\mathbf{Y}_i\) is

\[\boldsymbol{\Sigma}_i = \left(\begin{array}{cccc} \sigma^2_1&0&\cdots&0\\ 0&\sigma^2_2&\cdots &0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\sigma^2_m\end{array}\right) \]

For longitudinal data, this would allow the variability to change over time.

If you relax both the assumptions of constant variance and independence, then the covariance matrix of \(\boldsymbol\epsilon_i\) and thus \(\mathbf{Y}_i\) is

\[\boldsymbol{\Sigma}_i = \left(\begin{array}{cccc} \sigma^2_1&\sigma_{12}&\cdots&\sigma_{1m}\\ \sigma_{21}&\sigma^2_2&\cdots &\sigma_{2m}\\ \vdots&\vdots&\ddots&\vdots\\ \sigma_{m1}&\sigma_{m2}&\cdots&\sigma^2_m\end{array}\right) \]

For longitudinal data, this would allow the variability to change over time and would account for dependence between repeated measures.

If we assume individual units are independent of each other (which is typically a fair assumption), the covariance matrix for the entire vector of outcomes (for all units and their observations over time) is written as a block diagonal matrix,

\[\boldsymbol\Sigma = \left(\begin{array}{cccc} \boldsymbol\Sigma_1&\mathbf{0}&\cdots&\mathbf{0}\\ \mathbf{0}&\boldsymbol\Sigma_2&\cdots &\mathbf{0}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{0}&\mathbf{0}&\cdots&\boldsymbol\Sigma_n\end{array}\right)\]
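
As a sketch, this block-diagonal \(\boldsymbol\Sigma\) can be assembled in R with `Matrix::bdiag()`. Here each \(\boldsymbol\Sigma_i\) is taken to be the compound symmetry covariance implied by the hypothetical simulation above (variance \(4 + 1\) on the diagonal, covariance \(4\) off the diagonal).

```r
# Assemble the block-diagonal covariance for the hypothetical data above
library(Matrix)
Sigma_i <- matrix(4, m, m) + diag(1, m)   # one unit's m x m covariance block
Sigma   <- as.matrix(bdiag(replicate(n, Sigma_i, simplify = FALSE)))
dim(Sigma)                                # (n*m) x (n*m)
```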

6.5.2.1 Estimation

To find the Generalized Least Squares (GLS) estimator \(\widehat{\boldsymbol{\beta}}\), we need to minimize the standardized sum of squared residuals,

\[(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T\boldsymbol{\Sigma}^{-1}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})\]

If \(\boldsymbol{\Sigma}\) is known, taking the derivative with respect to \(\beta_0,\beta_1,...,\beta_p\), we get a set of simultaneous equations expressed in matrix notation as

\[ -2\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{Y} +2\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X}\boldsymbol{\beta} = \mathbf{0}\]

As long as \(\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X}\) has an inverse, then the generalized least squares (GLS) estimator is \[\widehat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{Y}\ \]
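
As a sketch, we can apply this formula to the hypothetical data, treating the \(\boldsymbol\Sigma\) assembled above as known (in practice it must be estimated, as discussed below).

```r
# GLS via the matrix formula with a known Sigma (hypothetical data from above)
Sigma_inv <- solve(Sigma)
XtSiX     <- t(X) %*% Sigma_inv %*% X
beta_gls  <- solve(XtSiX, t(X) %*% Sigma_inv %*% Y)
beta_gls            # GLS estimates
solve(XtSiX)        # covariance of the GLS estimator when this Sigma is correct
```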

6.5.2.1.1 Alternative Derivation of GLS

An alternative approach is to transform \(\mathbf{Y}\) and \(\mathbf{X}\) by pre-multiplying by \(\mathbf{L}^{-1}\), the inverse of the lower triangular matrix \(\mathbf{L}\) from the Cholesky Decomposition of the covariance matrix, \(\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^T\), and then find the OLS estimator using the transformed data.

We start by transforming our data so that \(\mathbf{Y}^* = \mathbf{L}^{-1}\mathbf{Y}\), \(\mathbf{X}^* = \mathbf{L}^{-1}\mathbf{X}\), and \(\boldsymbol\epsilon^* = \mathbf{L}^{-1}\boldsymbol\epsilon\). This works because \(Cov(\boldsymbol\epsilon^*) = \mathbf{L}^{-1}\boldsymbol{\Sigma}(\mathbf{L}^{-1})^T = \mathbf{L}^{-1}\mathbf{L}\mathbf{L}^T(\mathbf{L}^T)^{-1} = \mathbf{I}\), so the transformed errors are independent with constant variance and the OLS assumptions hold for the transformed data.

We can rewrite the model as,

\[\mathbf{L}^{-1}\mathbf{Y} = \mathbf{L}^{-1}\mathbf{X}\boldsymbol{\beta} + \mathbf{L}^{-1}\boldsymbol\epsilon\] \[\implies \mathbf{Y}^* = \mathbf{X}^*\boldsymbol{\beta} + \boldsymbol\epsilon^*\] Then, using OLS on these transformed vectors, we get the generalized least squares estimator,

\[\implies \widehat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}^{*T}\mathbf{X}^*)^{-1}\mathbf{X}^{*T}\mathbf{Y}^*\] \[= ((\mathbf{L}^{-1}\mathbf{X})^T\mathbf{L}^{-1}\mathbf{X})^{-1}(\mathbf{L}^{-1}\mathbf{X})^T\mathbf{L}^{-1}\mathbf{Y} \] \[= (\mathbf{X}^T(\mathbf{L}^{-1})^T\mathbf{L}^{-1}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{L}^{-1})^T\mathbf{L}^{-1}\mathbf{Y} \] \[= (\mathbf{X}^T(\mathbf{L}^{T})^{-1}\mathbf{L}^{-1}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{L}^{T})^{-1}\mathbf{L}^{-1}\mathbf{Y} \] \[= (\mathbf{X}^T(\mathbf{L}\mathbf{L}^T)^{-1}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{L}\mathbf{L}^T)^{-1}\mathbf{Y} \] \[= (\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{Y} \]
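
A numerical check of this route: note that R’s `chol()` returns the upper triangular factor \(\mathbf{R}\) with \(\boldsymbol\Sigma = \mathbf{R}^T\mathbf{R}\), so the lower triangular \(\mathbf{L}\) used above is `t(chol(Sigma))`.

```r
# The Cholesky transformation followed by OLS matches the direct GLS formula
L     <- t(chol(Sigma))      # lower triangular, Sigma = L L'
Ystar <- solve(L, Y)         # L^{-1} Y
Xstar <- solve(L, X)         # L^{-1} X
beta_chol <- solve(t(Xstar) %*% Xstar, t(Xstar) %*% Ystar)
cbind(beta_chol, beta_gls)   # same estimates as the direct GLS formula
```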

6.5.2.2 Estimation Issues

Main Issue: \(\boldsymbol{\Sigma}\) is unknown!

The solution to this problem is to iterate: estimate \(\boldsymbol{\beta}\) (starting with OLS, for example), use the residuals to estimate \(\boldsymbol{\Sigma}\), re-estimate \(\boldsymbol{\beta}\) with GLS, and repeat until the estimates stabilize.
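
A sketch of this iteration for the hypothetical balanced data above, using an unstructured estimate of a common \(\boldsymbol\Sigma_i\) at each step:

```r
# Iterate: estimate beta, estimate Sigma from residuals, re-estimate beta
beta_hat <- beta_ols
for (iter in 1:5) {
  r           <- Y - X %*% beta_hat                          # current residuals
  resid_mat   <- matrix(r, nrow = n, ncol = m, byrow = TRUE) # one row per unit
  Sigma_i_hat <- cov(resid_mat)                              # unstructured m x m estimate
  Sig_inv     <- kronecker(diag(n), solve(Sigma_i_hat))      # block-diagonal inverse
  beta_hat    <- solve(t(X) %*% Sig_inv %*% X, t(X) %*% Sig_inv %*% Y)
}
beta_hat
```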

Other Issues:

  • If the longitudinal data are balanced, there are \(m(m+1)/2\) parameters in \(\boldsymbol\Sigma_i\) to estimate. That is a lot of parameters (!)
  • If the longitudinal data are unbalanced, there is no common \(\boldsymbol{\Sigma}_i\) for every unit

The solution to these problems involves simplifying assumptions such as

  • assuming the errors are independent (not a great solution)
  • assuming the errors have constant variance (maybe not too bad, depending on the data)
  • assuming there is some restriction/structure on the covariance function (best option)

We approximate the covariance matrix to get better estimates of \(\boldsymbol\beta\).

If we know the correlation structure of our data, we could use generalized least squares to get better (better than OLS) estimates of our coefficients, and the estimated standard errors would be more accurate.
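
One implementation of this idea in R is `nlme::gls()`. As a sketch with the hypothetical data above, assuming a compound symmetry correlation structure within each unit:

```r
# GLS with an estimated compound symmetry correlation within each id
library(nlme)
fit_gls <- gls(y ~ time, data = dat,
               correlation = corCompSymm(form = ~ 1 | id))
summary(fit_gls)$tTable   # coefficients with correlation-aware standard errors
```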

The two main methods for longitudinal data that we’ll learn are similar to generalized least squares in flavor.

However, we need models that are not only applicable to continuous outcomes but also binary or count outcomes. Let’s first learn about the standard cross-sectional approaches to allow for binary and count outcomes (generalized linear models). Then, we’ll be able to talk about marginal models and mixed effects models.