6.5 Multivariate Normal Probability Model
The joint probability distribution that extends the univariate version to an (\(m \times 1\)) random vector \(\mathbf{Y}\), each of whose components is normally distributed (but possibly associated with the others), is given by \[ f(\mathbf{y}) = \frac{1}{(2\pi)^{m/2}}|\Sigma|^{-1/2}\exp\{ -(\mathbf{y} - \mu)^T\Sigma^{-1}(\mathbf{y} - \mu)/2\}\]
This probability density describes the probabilities with which the random vector \(\mathbf{Y}\) takes on values jointly in its \(m\) elements.
The form is determined by \(\mu\) and \(\Sigma\). Thus, as in the univariate case, if we know the mean vector and covariance matrix of a random vector \(\mathbf{Y}\), and we know each of its elements are normally distributed, then we know everything about the joint probabilities associated with values \(\mathbf{y}\) of \(\mathbf{Y}\).
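As a quick check of this formula, the following sketch evaluates the density directly from the expression above and compares it with scipy's built-in implementation. The mean vector, covariance matrix, and evaluation point are illustrative values chosen here, not from the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative (hypothetical) mean vector, covariance matrix, and point y
mu = np.array([1.0, 2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
y = np.array([1.2, 1.8, 0.7])
m = len(mu)

# Density evaluated directly from the formula above
dev = y - mu
quad = dev @ np.linalg.inv(Sigma) @ dev
f_direct = np.exp(-quad / 2) / ((2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma)))

# Same density via scipy's implementation
f_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(y)

print(f_direct, f_scipy)  # the two values agree
```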
By analogy with the univariate case, the form of \(f(\mathbf{y})\) depends critically on the term \[(\mathbf{y} -{\mu})^T\Sigma^{-1} (\mathbf{y} -{\mu})\]
Note that this is a quadratic form, so it is a scalar function of the elements of \((\mathbf{y} -\mu)\) and \(\Sigma^{-1}\).
Specifically, if we refer to the elements of \(\Sigma^{-1}\) as \(\sigma^{jk}\), i.e. \[\Sigma^{-1}=\left(\begin{array}{ccc} \sigma^{11}&\cdots&\sigma^{1m}\\ \vdots&\ddots&\vdots\\ \sigma^{m1}&\cdots&\sigma^{mm}\end{array} \right)\] then we may write \[(\mathbf{y} -{\mu})^T\Sigma^{-1} (\mathbf{y} -{\mu}) = \sum^m_{j=1}\sum^m_{k=1}\sigma^{jk}(y_j-\mu_j)(y_k-\mu_k).\]
Of course, the elements \(\sigma^{jk}\) will be complicated functions of the elements \(\sigma^2_j\), \(\sigma_{jk}\) of \(\Sigma\), i.e. the variances of the \(Y_j\) and the covariances among them.
This term thus depends not only on the squared deviations \((y_j - \mu_j)^2\) for each element of \(\mathbf{y}\) (which arise in the double sum when \(j = k\)), but also on the crossproducts \((y_j - \mu_j)(y_k - \mu_k)\). Each of these squared-deviation and crossproduct contributions is standardized by a value \(\sigma^{jk}\) that involves the variances and covariances in \(\Sigma\).
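The equivalence between the matrix form of the quadratic form and the double sum over the elements \(\sigma^{jk}\) can be checked numerically, as in the sketch below; the mean, covariance matrix, and evaluation point are arbitrary illustrative choices.

```python
import numpy as np

# Hypothetical values, just to check the double-sum identity numerically
mu = np.array([1.0, 2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
y = np.array([0.4, 2.5, 1.1])

Sigma_inv = np.linalg.inv(Sigma)   # elements sigma^{jk}
dev = y - mu

# Matrix version of the quadratic form
quad_matrix = dev @ Sigma_inv @ dev

# Element-wise double sum over squares (j == k) and crossproducts (j != k)
quad_sum = sum(Sigma_inv[j, k] * dev[j] * dev[k]
               for j in range(len(mu)) for k in range(len(mu)))

print(np.isclose(quad_matrix, quad_sum))  # True
```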
Thus, although it is quite complicated, one gets the suspicion that the quadratic form has an interpretation, albeit more complex, as a distance measure, just as in the univariate case.
To gain insight into this suspicion, and to get a better understanding of the multivariate distribution, it is instructive to consider the special case \(m = 2\), the simplest example of a multivariate normal distribution (hence the name bivariate).
Here, \[\mathbf{Y} = \left(\begin{array}{c} Y_1\\ Y_2\end{array} \right), {\mu} = \left(\begin{array}{c} \mu_1\\\mu_2\end{array} \right), \Sigma = \left(\begin{array}{cc} \sigma^2_1&\sigma_{12}\\\sigma_{12}&\sigma^2_2\end{array} \right) \]
Using the inversion formula for a (\(2 \times 2\)) matrix,
\[\Sigma^{-1} = \frac{1}{\sigma^2_1\sigma^2_2 - \sigma^2_{12}}\left(\begin{array}{cc} \sigma^2_2&-\sigma_{12}\\-\sigma_{12}&\sigma^2_1 \end{array}\right) \]
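A minimal sketch confirming the \((2 \times 2)\) inversion formula against a general-purpose matrix inverse, using an arbitrary illustrative covariance matrix:

```python
import numpy as np

# Hypothetical bivariate covariance matrix
s1_sq, s2_sq, s12 = 2.0, 1.5, 0.6
Sigma = np.array([[s1_sq, s12],
                  [s12, s2_sq]])

# Inverse from the (2 x 2) formula above
Sigma_inv_formula = np.array([[s2_sq, -s12],
                              [-s12, s1_sq]]) / (s1_sq * s2_sq - s12 ** 2)

print(np.allclose(Sigma_inv_formula, np.linalg.inv(Sigma)))  # True
```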
We also have that the **correlation** between \(Y_1\) and \(Y_2\) is given by \[\rho_{12} = \frac{\sigma_{12}}{\sigma_1\sigma_2}.\]
Using these results, it is an algebraic exercise to show that (try it!) \[ (\mathbf{y} - \mu)^T\Sigma^{-1}(\mathbf{y} - \mu) = \frac{1}{1-\rho^2_{12}}\left\{ \frac{(y_1-\mu_1)^2}{\sigma^2_1}+\frac{(y_2-\mu_2)^2}{\sigma^2_2}-2\rho_{12}\frac{(y_1-\mu_1)}{\sigma_1}\frac{(y_2-\mu_2)}{\sigma_2}\right\}\]
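Rather than working through the algebra, one can also check this identity numerically. The sketch below uses arbitrary illustrative values for the means, standard deviations, covariance, and the point \((y_1, y_2)\).

```python
import numpy as np

# Hypothetical bivariate example (all values are illustrative only)
mu1, mu2 = 1.0, -0.5
s1, s2, s12 = 1.5, 2.0, 1.2          # standard deviations and covariance
rho = s12 / (s1 * s2)
y1, y2 = 2.0, 0.3

# Left-hand side: the quadratic form computed with Sigma^{-1}
Sigma = np.array([[s1 ** 2, s12], [s12, s2 ** 2]])
dev = np.array([y1 - mu1, y2 - mu2])
lhs = dev @ np.linalg.inv(Sigma) @ dev

# Right-hand side: the expression in terms of standardized deviations and rho
z1, z2 = (y1 - mu1) / s1, (y2 - mu2) / s2
rhs = (z1 ** 2 + z2 ** 2 - 2 * rho * z1 * z2) / (1 - rho ** 2)

print(np.isclose(lhs, rhs))  # True
```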
- One component is the sum of standardized squared deviations \[ \frac{(y_1-\mu_1)^2}{\sigma^2_1}+\frac{(y_2-\mu_2)^2}{\sigma^2_2}\]
This sum alone is in the spirit of the sum of squared deviations in least squares, with the difference that each deviation is now weighted in accordance with its variance. This makes sense: because the variances of \(Y_1\) and \(Y_2\) differ, information on the population of \(Y_1\) values is of different quality than information on the population of \(Y_2\) values. If the variance is large, the quality of information is poorer; thus, the larger the variance, the smaller the weight, so that information of higher quality receives more weight in the overall measure. This is indeed like a distance measure, where each contribution receives an appropriate weight.
- In addition, there is an extra term that makes it have a different form than just a sum of weighted squared deviations: \[-2\rho_{12}\frac{(y_1-\mu_1)}{\sigma_1}\frac{(y_2-\mu_2)}{\sigma_2}\]
This term depends on the crossproduct, where each deviation is again weighted in accordance with its variance. This term modifies the distance measure in a way that is connected with the association between \(Y_1\) and \(Y_2\) through their crossproduct and their correlation \(\rho_{12}\). Note that the larger this correlation in magnitude (either positive or negative), the more we modify the usual sum of squared deviations.
- Note that the entire quadratic form also involves the multiplicative factor \(1/(1 -\rho^2_{12})\), which is greater than 1 if \(|\rho_{12}| > 0\). This factor scales the overall distance measure in accordance with the magnitude of the association, as illustrated in the sketch following this list.
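The sketch below (using arbitrary, illustrative standardized deviations) shows how the crossproduct term and the factor \(1/(1-\rho^2_{12})\) together modify the distance as \(\rho_{12}\) varies.

```python
import numpy as np

# Hypothetical illustration: fix the standardized deviations z1, z2 and vary
# rho to see how the correlation modifies the distance measure
z1, z2 = 1.0, 1.0
for rho in [0.0, 0.3, 0.6, 0.9, -0.9]:
    quad = (z1 ** 2 + z2 ** 2 - 2 * rho * z1 * z2) / (1 - rho ** 2)
    print(f"rho = {rho:5.2f}  quadratic form = {quad:7.3f}")

# For this point, where both deviations are in the same direction, positive
# correlation shrinks the distance while negative correlation inflates it.
```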