3.4 Estimating with Data

In this chapter, we have discussed the theory of autocovariance, in both function and matrix form, along with the models, simplifications, and constraints that can be imposed or assumed.

In this last section, we’ll introduce the idea of estimating the numerical values of covariance and correlation from data, once we have assumed a model and any simplifications or constraints.

The three estimators below can be used without fitting a larger statistical model, so we’ll start with them.

3.4.1 Sample Covariance Matrix

Suppose we observe \(n\) realizations of \(m\) random variables, sampled from a larger population (imagine \(n\) individual people, each with \(m\) observations over time). To estimate the covariance matrix, \(\boldsymbol\Sigma\), of these \(m\) random variables, we can calculate the sample covariance matrix,

\[\mathbf{S}_Y = \frac{1}{n-1}\sum^n_{i=1} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T \] where \(\mathbf{y}_i = (y_{i1},y_{i2},...,y_{im})^T\) is the column vector of the \(m\) observations for case \(i\) and \(\bar{\mathbf{y}} = n^{-1} \sum^n_{i=1} \mathbf{y}_i\) is the sample mean vector.

Therefore, the sample variances (on the diagonal of \(\mathbf{S}_Y\)) are the familiar estimates we saw in our introductory course,

\[ s^2_j = \frac{1}{n-1}\sum^n_{i=1}(y_{ij} - \bar{y}_j)^2\]
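As a minimal sketch of the formula above, the sample covariance matrix can be computed directly with NumPy. The function name `sample_covariance` and the simulated data are illustrative assumptions, not part of the text; the data are stored as an array with one row per case and one column per variable.

```python
import numpy as np

def sample_covariance(Y):
    """Sample covariance matrix of an (n, m) array: n cases (rows), m variables (columns)."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    centered = Y - Y.mean(axis=0)            # subtract the sample mean vector from each row
    return centered.T @ centered / (n - 1)   # sum of outer products, divided by n - 1

# Hypothetical simulated data: n = 50 cases, each with m = 4 repeated measurements
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 4))
S = sample_covariance(Y)

# The diagonal of S holds the sample variances s_j^2 of the m columns
sample_variances = np.diag(S)
```

Under these assumptions, the result should agree with `np.cov(Y, rowvar=False)`, which uses the same \(n-1\) divisor by default.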

Note: For longitudinal data, this estimation is only possible if we observe all \(m\) repeated measurements for every case at the same set of times (balanced data). For time series and spatial data, we typically have only one realization of the process (\(n = 1\)), so \(\mathbf{S}_Y\) cannot be computed. If we have unbalanced longitudinal data collected at irregular times, we’ll need to use one of the common model structures to estimate the covariance. We’ll return to this in the Longitudinal Data chapter.

3.4.2 Sample Autocovariance Function

If we assume that observed data come from a stationary process, we can estimate the autocovariance function with the sample autocovariance function (ACVF),

\[c_k = \frac{1}{n}\sum^{n-|k|}_{t=1}(y_t - \bar{y})(y_{t+|k|} - \bar{y})\] where \(k\) is the lag and \(\bar{y}\) is the sample mean, averaged across time/space, because we are assuming the mean is constant.

For any sequence of observations of the random process, \(y_1,...,y_n\), the estimated ACVF has the following properties:

  1. \(c_k = c_{-k}\)
  2. \(c_0\geq 0\) and \(|c_{k}| \leq c_{0}\)

In the Time Series Data chapter, we’ll return to this.
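The sample ACVF defined in this subsection can be sketched in a few lines of NumPy. The function name `sample_acvf` and the simulated series are illustrative assumptions; the code uses the divisor \(n\) (not \(n-|k|\)), matching the formula for \(c_k\), and computes non-negative lags only, since \(c_k = c_{-k}\).

```python
import numpy as np

def sample_acvf(y, max_lag):
    """Sample autocovariances c_0, ..., c_max_lag of a series, using the divisor n."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()                      # deviations from the overall sample mean
    # For lag k, sum the products of deviations that are k steps apart
    return np.array([np.sum(d[:n - k] * d[k:]) / n for k in range(max_lag + 1)])

# Hypothetical simulated series: a simple moving average of white noise
rng = np.random.default_rng(1)
noise = rng.normal(size=201)
y = noise[1:] + 0.5 * noise[:-1]

c = sample_acvf(y, max_lag=10)
# c[0] is the sample variance with divisor n, and every |c[k]| is at most c[0]
```

Dividing each \(c_k\) by \(c_0\) would give the corresponding sample autocorrelations.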

3.4.3 Sample Semivariogram

If we assume that observed data come from a stationary process (or at least an intrinsically stationary process), we can estimate the semivariogram,

\[\gamma(h) = \frac{1}{2}Var(Y_{s+h} - Y_s)\]

To estimate the semivariogram, let \(H_1,...,H_k\) be a partition of the space of possible distances or lags \(h\), with \(h_u\) being a representative spatial lag/distance in \(H_u\). Then use the observed data \(y_i\) to compute the sample (empirical) semivariogram,

\[\hat{\gamma}(h_u) = \frac{1}{2\cdot |\{i-j \in H_u\}|}\sum_{\{i-j\in H_u\}}(y_i - y_j)^2\] where the sum runs over all pairs of observations \(y_i, y_j\) whose lag \(i-j\) falls in \(H_u\), and \(|\{i-j \in H_u\}|\) is the number of such pairs.
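A minimal sketch of this binned estimator follows, under several simplifying assumptions that are not in the text: locations are one-dimensional, the sets \(H_u\) are intervals of absolute distance (so direction is ignored), and each unordered pair is counted once. The function name `sample_semivariogram` and the toy data are hypothetical.

```python
import numpy as np

def sample_semivariogram(s, y, bin_edges):
    """Empirical semivariogram for 1-D locations s and values y.

    Bin u collects pairs whose distance lies in (bin_edges[u], bin_edges[u + 1]].
    Returns the estimate and the number of pairs for each bin.
    """
    s = np.asarray(s, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(len(y), k=1)       # every unordered pair (i, j) with i < j
    dist = np.abs(s[i] - s[j])                # distance between the two locations
    sqdiff = (y[i] - y[j]) ** 2               # squared difference of the observed values
    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)           # NaN marks bins that contain no pairs
    counts = np.zeros(n_bins, dtype=int)
    for u in range(n_bins):
        in_bin = (dist > bin_edges[u]) & (dist <= bin_edges[u + 1])
        counts[u] = in_bin.sum()
        if counts[u] > 0:
            gamma[u] = sqdiff[in_bin].sum() / (2 * counts[u])
    return gamma, counts

# Toy data: five equally spaced locations, two distance bins
locations = np.arange(5)
values = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
gamma_hat, n_pairs = sample_semivariogram(locations, values, bin_edges=[0, 1, 2])
```

Counting each unordered pair once instead of both orderings doubles neither the estimate nor halves it, because the numerator and the pair count change by the same factor. Bins with few pairs give noisy estimates, so it is worth inspecting `n_pairs` alongside `gamma_hat`.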