5.2 ACF: Autocorrelation Function
As we discussed earlier, the key feature of correlated data is the covariance and correlation between observations. We have decomposed a time series into trend, seasonality, and noise or error. We will now work to model the dependence in the errors.
Remember: For a stationary random process, \(Y_t\) (constant mean \(\mu\), constant variance), the autocovariance function depends only on the difference in time, which we will refer to as the lag \(h\), so
\[\Sigma(h) = Cov(Y_t, Y_{t+|h|}) = E[(Y_t - \mu)(Y_{t+|h|} - \mu)] \] for any time \(t\).
Most time series are not stationary because they do not have a constant mean across time. By removing the trend and seasonality, we attempt to get errors (also called residuals) with a constant mean around 0. We’ll discuss the constant variance assumption later.
If we assume that the residuals, \(y_1,...,y_n\), are generated from a stationary process, we can estimate the autocovariance function with the sample autocovariance function (ACVF),
\[c_h = \frac{1}{n}\sum^{n-|h|}_{t=1}(y_t - \bar{y})(y_{t+|h|} - \bar{y})\] where \(\bar{y}\) is the sample mean, averaged across time because we are assuming the mean is constant.
There are a few useful properties of the ACVF. For any sequence of observations, \(y_1,...,y_n\),
- \(c_h = c_{-h}\)
- \(c_0\geq 0\) and \(|c_{h}| \leq c_{0}\)
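To make the formula concrete, here is a minimal sketch of how \(c_h\) could be computed directly from the definition above; the series y and the helper acvf() are illustrative, not from the original notes.

set.seed(1)
y <- rnorm(100)      # stand-in for any observed series y_1, ..., y_n
n <- length(y)
ybar <- mean(y)      # sample mean, averaged across time
acvf <- function(h) {
  h <- abs(h)        # uses c_h = c_{-h}
  sum((y[1:(n - h)] - ybar) * (y[(1 + h):n] - ybar)) / n
}
acvf(0)              # c_0: the sample variance with divisor n
acvf(3)              # c_3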
The sample autocorrelation function (ACF) is the sample autocovariance at lag \(h\) divided by the sample variance, which is the autocovariance at lag 0,
\[r_h = \frac{c_h}{c_0}\text{ for } c_0 > 0\]
There are a few useful properties of the ACF. For any sequence of observations, \(y_1,...,y_n\),
- \(r_h = r_{-h}\)
- \(r_0 = 1\) and \(|r_{h}| \leq 1\)
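Continuing the sketch above (still with the illustrative y and acvf()), the sample ACF is just each \(c_h\) scaled by \(c_0\), and it agrees with what R's acf() computes:

r <- sapply(0:20, function(h) acvf(h) / acvf(0))   # r_0 = 1 by construction
r
acf(y, lag.max = 20, plot = FALSE)$acf[, 1, 1]     # same values, straight from acf()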
For a non-stationary series with trend and seasonality, we expect fairly high correlation between observations even at large lags. This pattern of persistently high correlation typically indicates that you still need to deal with the trend and seasonality.
acf(birth) # acf works well on ts objects: lag 1.0 is one year because the ts object knows that one cycle (year) is 12 observations
Sample ACF for Trend: Very slow decay to zero
Sample ACF for Trend + Seasonality: Very slow decay to zero + Periodic
We would like to see the autocorrelation of the random errors after removing the trend and the seasonality. Let’s look at the sample autocorrelation of the residuals from the model with a moving average filter estimated trend and monthly averages to account for seasonality.
birthTS %>%
  dplyr::select(ResidualTS) %>%
  dplyr::filter(complete.cases(.)) %>%
  acf() # if the data is not a ts object, lags will be in terms of the index
The autocorrelation has to be 1 for lag 0 because \(r_0 = c_0/c_0 = 1\).
Note that the lags are in terms of months here because we did not specify a ts() object. We see the autocorrelation slowly decreasing to zero.
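If you prefer the lag axis in years (as with the birth series itself), one option is to wrap the residuals in a ts() object with frequency 12 before calling acf(). This is a sketch, assuming ResidualTS is the residual column used above; the object name resid_ts is just illustrative.

resid_ts <- birthTS %>%
  dplyr::filter(!is.na(ResidualTS)) %>%   # drop the NAs left by the moving average filter
  dplyr::pull(ResidualTS) %>%
  ts(frequency = 12)                      # monthly data: 12 observations per cycle
acf(resid_ts)                             # lag 1.0 on the plot now corresponds to h = 12 months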
What about the ACF of birth data after differencing?
Sample ACF for Seasonality: Periodic
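The differenced-series ACF plots discussed next could be produced with something like the following sketch; diff() is base R, and the object names are illustrative.

birth_d1 <- diff(birth)                 # first difference: removes the trend
acf(birth_d1)                           # first plot (lags in years since birth is a ts object)
birth_d12 <- diff(birth_d1, lag = 12)   # additional seasonal difference at lag 12
acf(birth_d12)                          # second plot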
Note that the lags here are in terms of years (Lag = 1 on the plot refers to \(h\) = 12) because the data is saved as a ts object. In the first plot (after only first differencing), we see an autocorrelation of about 0.5 for observations that are one year apart. This suggests that there may still be some seasonality to be accounted for. In the second plot, after we also did seasonal differencing for lag = 12, that autocorrelation decreases a bit (and becomes slightly negative).
Now, do you notice the blue, dashed horizontal lines?
These blue horizontal lines are guide lines for us. If the random process is white noise, meaning the observations are independent and identically distributed with a constant mean (of 0) and constant variance \(\sigma^2\), we'd expect the sample autocorrelation to be within these dashed horizontal lines (\(\pm 1.96/\sqrt{n}\)) roughly 95% of the time. So if the random process were actually independent white noise, we'd expect about 95% of the ACF estimates for \(h\not=0\) to fall within the blue lines, with no systematic pattern.
See an example of Gaussian white noise below and its sample autocorrelation function.
Sample ACF for White Noise: Zero except at lag 0
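A minimal sketch to reproduce such a figure: simulate Gaussian white noise and plot its sample ACF. The dashed bands that acf() draws by default are exactly the \(\pm 1.96/\sqrt{n}\) guide lines described above; the sample size of 200 is arbitrary.

set.seed(123)
w <- rnorm(200)   # Gaussian white noise: iid, mean 0, constant variance
plot.ts(w)        # the white noise series itself
acf(w)            # sample ACF: 1 at lag 0, roughly within +/- 1.96/sqrt(200) elsewhere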
Many stationary time series have recognizable ACF patterns. We’ll get familiar with those in a moment.
However, most time series we encounter in practice are not stationary. Thus, our first step will always be to deal with the trend and seasonality. Then we can model the (hopefully stationary) residuals with a stationary model.