6.4 Notation

In order to

  • elucidate the assumptions made under different models and methods, and
  • describe the models and methods more easily,

it is convenient to think of all of the outcomes collected on the same unit (over time or over another set of conditions) together, so that the complex relationships among them may be summarized.

Consider the random variable,

\[ Y_{ij} = \text{the }j\text{th measurement taken on unit }i,\quad\quad i= 1,...,n, j =1,...,m\]

Consider the dental study data (Example 1) to make this more concrete. Each child was measured 4 times, at ages 8, 10, 12, and 14 years. Thus, we let \(j = 1,...,4\) index the measurement order on a child. To summarize the information on when these times occur, we might further define

\[ t_{ij} = \text{the time at which the }j\text{th measurement on unit }i\text{ was taken.}\]

Here, \(t_{i1} = 8\), \(t_{i2} = 10\), and so on, for all children in the study (\(i=1,...,27\)). Thus, \(t_{ij} = t_j\) for all \(i=1,...,n\). If we ignore the gender of the children for the moment, the outcomes for the \(i\)th child, where \(i\) ranges from 1 to 27, are \(Y_{i1},..., Y_{i4}\), taken at times \(t_{i1},...,t_{i4}\). We may summarize the measurements for the \(i\)th child even more succinctly: define the (4 × 1) random vector,

\[ \mathbf{Y}_i = \left(\begin{array}{c}Y_{i1}\\ Y_{i2}\\ Y_{i3}\\ Y_{i4} \end{array}\right).\]

The vector elements are random variables representing the outcomes that might be observed for child \(i\) at each time point. For this data set, the data are balanced because the observation times are the same for each unit, and the data are regular because the observation times are equally spaced. Most observational data are neither balanced nor regular. We can generalize our notation to allow for this type of data by changing \(m\) to \(m_i\), the total number of measurements on the \(i\)th unit,

\[ Y_{ij} = \text{the }j\text{th measurement taken on unit }i, \quad\quad i= 1,...,n, j =1,...,m_i\]

The important message is that it is possible to represent the outcomes for the \(i\)th child in a very streamlined and convenient way. Each child \(i\) has its own outcome vector, \(\mathbf{Y}_i\). It often makes sense to think of the data not just as individual outcomes \(Y_{ij}\), some from one child and some from another according to the indices, but rather as vectors corresponding to the children, the units: each unit has an entire outcome vector associated with it.
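To make the vector notation concrete, here is a minimal sketch in Python of how long-format data like the dental study might be collected into one outcome vector \(\mathbf{Y}_i\) per child. The data frame, its column names (`child`, `age`, `distance`), and the measurement values are hypothetical stand-ins for illustration, not the actual study data.

```python
import pandas as pd

# Hypothetical long-format data: one row per (child, age) pair.
# Column names and values are assumed for illustration only.
dental = pd.DataFrame({
    "child":    [1, 1, 1, 1, 2, 2, 2, 2],
    "age":      [8, 10, 12, 14, 8, 10, 12, 14],
    "distance": [21.0, 20.0, 21.5, 23.0, 21.0, 21.5, 24.0, 25.5],
})

# Group by child and collect the four measurements, ordered by age,
# into one outcome vector Y_i per child.
Y = {
    child: grp.sort_values("age")["distance"].to_numpy()
    for child, grp in dental.groupby("child")
}

print(Y[1])  # the (4 x 1) outcome vector for child i = 1
```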

We can also consider explanatory variables. If we have \(p\) explanatory variables, we let \(x_{ijk}\) denote the value of the \(k\)th variable for the \(i\)th unit at the \(j\)th time. That means that for the \(i\)th unit, we can collect all of the values of the explanatory variables (across time) in an \((m_i \times p)\) matrix,

\[\mathbf{X}_i = \left(\begin{array}{cccc} x_{i11}&x_{i12}&\cdots&x_{i1p}\\ x_{i21}&x_{i22}&\cdots&x_{i2p}\\ \vdots&\vdots&\ddots&\vdots\\ x_{i m_i 1}&x_{i m_i 2}&\cdots&x_{i m_i p}\\ \end{array}\right)\]

These explanatory variables might be time-invariant, like treatment group, where the value is the same at every time point. Alternatively, we might have explanatory variables that are time-varying, such as age, where the values observed at each time point may change.
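As a small illustration, the sketch below (Python; the variable names and values are assumptions, not part of the dental data) builds the \((m_i \times p)\) matrix \(\mathbf{X}_i\) for a single child with \(p = 2\) explanatory variables: a time-invariant gender indicator and the time-varying age.

```python
import numpy as np

# Hypothetical covariates for one child (i fixed), measured at m_i = 4 times.
ages = np.array([8, 10, 12, 14])   # time-varying: the value changes with j
is_female = 1                      # time-invariant: the same value at every j

# Stack the p = 2 columns side by side to form the (m_i x p) matrix X_i.
X_i = np.column_stack([
    np.repeat(is_female, len(ages)),  # column 1: gender indicator, repeated
    ages,                             # column 2: age at each measurement
])

print(X_i)
# [[ 1  8]
#  [ 1 10]
#  [ 1 12]
#  [ 1 14]]
```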

6.4.1 Multivariate Normal Probability Model

We first discussed this in Chapter 2, so feel free to return to Random Vectors and Matrices.

When we represent the outcomes for the \(i\)th unit as a random vector, \(\mathbf{Y}_i\), it is useful to consider a multivariate model such as the multivariate normal probability model.

The joint probability density function that extends the univariate normal to an (\(m \times 1\)) random vector \(\mathbf{Y}\), each of whose components is normally distributed, is given by \[ f(\mathbf{y}) = \frac{1}{(2\pi)^{m/2}}|\Sigma|^{-1/2}\exp\{ -(\mathbf{y} - \mu)^T\Sigma^{-1}(\mathbf{y} - \mu)/2\}\]

  • This probability density function describes the probabilities with which the random vector \(\mathbf{Y}\) takes on values jointly in its \(m\) elements.

  • The form is determined by the mean vector \(\mu\) and covariance matrix \(\Sigma\).

The form of \(f(\mathbf{y})\) depends critically on the term \[(\mathbf{y} -{\mu})^T\Sigma^{-1} (\mathbf{y} -{\mu})\]

This quadratic form pops up in many common methods: you’ll see it in generalized least squares (GLS) and in the Mahalanobis distance, where it acts as a standardized sum of squares. Read the next section if you’d like to think more deeply about this quadratic form.
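As a sanity check on the density formula, the following sketch (Python, with arbitrary illustrative values for \(\mu\) and \(\Sigma\)) evaluates the formula directly and compares it with `scipy.stats.multivariate_normal`; the quadratic form it computes is exactly the squared Mahalanobis distance mentioned above.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary illustrative mean vector and covariance matrix (m = 3).
mu = np.array([22.0, 23.5, 25.0])
Sigma = np.array([[4.0, 2.0, 1.0],
                  [2.0, 5.0, 2.5],
                  [1.0, 2.5, 6.0]])
y = np.array([21.0, 25.0, 24.0])
m = len(mu)

# Quadratic form (y - mu)^T Sigma^{-1} (y - mu): the squared Mahalanobis distance.
dev = y - mu
quad = dev @ np.linalg.solve(Sigma, dev)

# Density evaluated from the formula in the text.
dens_formula = np.exp(-quad / 2) / ((2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma)))

# Same density from scipy; the two agree to numerical precision.
dens_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(y)

print(quad, dens_formula, dens_scipy)
```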

6.4.1.1 Quadratic Form (Optional)

The multivariate normal pdf depends critically on the term \[(\mathbf{y} -{\mu})^T\Sigma^{-1} (\mathbf{y} -{\mu})\]

This is a quadratic form (in the linear algebra sense): a scalar-valued function of the elements of \((\mathbf{y} -\mu)\) and \(\Sigma^{-1}\).

If we refer to the elements of \(\Sigma^{-1}\) as \(\sigma^{jk}\), i.e. \[\Sigma^{-1}=\left(\begin{array}{ccc} \sigma^{11}&\cdots&\sigma^{1m}\\ \vdots&\ddots&\vdots\\ \sigma^{m1}&\cdots&\sigma^{mm}\end{array} \right)\] then we may write \[(\mathbf{y} -{\mu})^T\Sigma^{-1} (\mathbf{y} -{\mu}) = \sum^m_{j=1}\sum^m_{k=1}\sigma^{jk}(y_j-\mu_j)(y_k-\mu_k).\]

Of course, the elements \(\sigma^{jk}\) will be complicated functions of the elements \(\sigma^2_j\), \(\sigma_{jk}\) of \(\Sigma\), i.e. the variances of the \(Y_j\) and the covariances among them.

  • This term thus depends not only on the squared deviations \((y_j - \mu_j)^2\) for each element of \(\mathbf{y}\) (which arise in the double sum when \(j = k\)), but also on the cross-products \((y_j - \mu_j)(y_k - \mu_k)\). Each of these squared and cross-product contributions is standardized by a value \(\sigma^{jk}\) that involves the variances and covariances.

  • Thus, although it is quite complicated, one gets the suspicion that the quadratic form has an interpretation, albeit more complex, as a distance measure, just as in the univariate case.
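A quick numerical check of the double-sum expression above (Python; all values are arbitrary illustrative choices) confirms that the elementwise sum over \(\sigma^{jk}(y_j - \mu_j)(y_k - \mu_k)\) reproduces the matrix form of the quadratic form.

```python
import numpy as np

# Arbitrary illustrative values for m = 3.
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
y = np.array([1.5, 1.0, 3.5])

Sigma_inv = np.linalg.inv(Sigma)   # its elements are the sigma^{jk}
dev = y - mu

# Matrix form of the quadratic form.
quad_matrix = dev @ Sigma_inv @ dev

# Double sum over j and k of sigma^{jk} (y_j - mu_j)(y_k - mu_k).
quad_sum = sum(Sigma_inv[j, k] * dev[j] * dev[k]
               for j in range(3) for k in range(3))

print(quad_matrix, quad_sum)  # the two agree
```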

To better understand the multivariate distribution, it is instructive to consider the special case \(m = 2\), the simplest example of a multivariate normal distribution (hence the name bivariate).

Here, \[\mathbf{Y} = \left(\begin{array}{c} Y_1\\ Y_2\end{array} \right), {\mu} = \left(\begin{array}{c} \mu_1\\\mu_2\end{array} \right), \Sigma = \left(\begin{array}{cc} \sigma^2_1&\sigma_{12}\\\sigma_{12}&\sigma^2_2\end{array} \right) \]

Using the inversion formula for a (\(2 \times 2\)) matrix,

\[\Sigma^{-1} = \frac{1}{\sigma^2_1\sigma^2_2 - \sigma^2_{12}}\left(\begin{array}{cc} \sigma^2_2&-\sigma_{12}\\-\sigma_{12}&\sigma^2_1 \end{array}\right) \]

We also have that the correlation between \(Y_1\) and \(Y_2\) is given by \[\rho_{12} = \frac{\sigma_{12}}{\sigma_1\sigma_2}.\]

Using these results, it is an algebraic exercise to show that (try it!) \[ (\mathbf{y} - \mu)^T\Sigma^{-1}(\mathbf{y} - \mu) = \frac{1}{1-\rho^2_{12}}\left\{ \frac{(y_1-\mu_1)^2}{\sigma^2_1}+\frac{(y_2-\mu_2)^2}{\sigma^2_2}-2\rho_{12}\frac{(y_1-\mu_1)}{\sigma_1}\frac{(y_2-\mu_2)}{\sigma_2}\right\}\]
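If you would rather not grind through the algebra by hand, here is a sketch using sympy that verifies the identity symbolically (the symbol names are arbitrary).

```python
import sympy as sp

# Symbols for the bivariate case; positive=True keeps signs and square roots clean.
y1, y2, mu1, mu2, s12 = sp.symbols("y1 y2 mu1 mu2 sigma12", real=True)
s1, s2 = sp.symbols("sigma1 sigma2", positive=True)

dev = sp.Matrix([y1 - mu1, y2 - mu2])
Sigma = sp.Matrix([[s1**2, s12], [s12, s2**2]])

# Left-hand side: the quadratic form (y - mu)^T Sigma^{-1} (y - mu).
lhs = (dev.T * Sigma.inv() * dev)[0, 0]

# Right-hand side: the expression in the text, with rho12 = sigma12 / (sigma1 * sigma2).
rho = s12 / (s1 * s2)
rhs = ((y1 - mu1)**2 / s1**2 + (y2 - mu2)**2 / s2**2
       - 2 * rho * (y1 - mu1) / s1 * (y2 - mu2) / s2) / (1 - rho**2)

# The difference simplifies to 0, confirming the identity.
print(sp.simplify(lhs - rhs))
```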

  • One component is the sum of squared standardized values (z-scores) \[ \frac{(y_1-\mu_1)^2}{\sigma^2_1}+\frac{(y_2-\mu_2)^2}{\sigma^2_2}\]

This sum is similar to the sum of squared deviations in least squares, with the difference that each deviation is now weighted inversely by its variance. This makes sense: because the variances of \(Y_1\) and \(Y_2\) differ, information on the population of \(Y_1\) values is of a different quality than that on the population of \(Y_2\) values. If the variance is large, the quality of information is poorer; thus, the larger the variance, the smaller the weight, so that information of higher quality receives more weight in the overall measure. Indeed, this is like a distance measure, where each contribution receives an appropriate weight.

  • In addition, there is an extra term that distinguishes this from a simple sum of weighted squared deviations: \[-2\rho_{12}\frac{(y_1-\mu_1)}{\sigma_1}\frac{(y_2-\mu_2)}{\sigma_2}\]

This term depends on the cross-product of the deviations, each now standardized by its standard deviation. It modifies the distance measure in a way connected with the association between \(Y_1\) and \(Y_2\) through their cross-product and correlation \(\rho_{12}\). Note that the larger this correlation in magnitude (either positive or negative), the more we modify the usual sum of squared deviations.

  • Note that the entire quadratic form also involves the multiplicative factor \(1/(1 -\rho^2_{12})\), which is greater than 1 if \(|\rho_{12}| > 0\). This factor scales the overall distance measure by the magnitude of the association.
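To see this modification at work, the short sketch below (plain Python; the numbers are illustrative only) evaluates the bivariate quadratic form for standardized deviations with \(\sigma_1 = \sigma_2 = 1\) at several values of \(\rho_{12}\). With positive correlation, deviations of the same sign yield a smaller distance while deviations of opposite sign yield a larger one, which is one way to see how the correlation modifies the usual sum of squares.

```python
# Standardized deviations (z-scores), with sigma1 = sigma2 = 1; values are illustrative.
same_direction = (1.0, 1.0)       # deviations that agree in sign
opposite_direction = (1.0, -1.0)  # deviations that disagree in sign

for rho in [0.0, 0.3, 0.6, 0.9]:
    for label, (z1, z2) in [("same sign", same_direction),
                            ("opposite sign", opposite_direction)]:
        # Bivariate quadratic form from the expression above.
        quad = (z1**2 + z2**2 - 2 * rho * z1 * z2) / (1 - rho**2)
        print(f"rho = {rho:.1f}, {label:13s}: {quad:6.3f}")
```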