4.2 Seasonality

Once we’ve estimated and removed the trend, we may notice cyclic patterns left in the data. Depending on the cyclic pattern’s form, we could use a few different techniques to model the seasonal pattern.

First, let’s visualize the variability in the seasonality between different years for the birth data set to see if there is a consistent cycle. To do so, we’ll work with the residuals after removing the trend (for now, let’s use the moving average filter). Since we are interested in the cycles, we want to know how the number of births varies across the months within a year.

Using the ggplot2 package, we can visualize this with geom_line() by plotting the residuals by the month and having it color each line according to the year. To do this, it creates one line per year. Interestingly, August and September have the highest residuals (approximately nine months after winter break). In this case, we see that every year follows the same pattern.

birthTS %>%
  ggplot(aes(x = Month, y = Residual, color = factor(Year))) + 
  geom_line() + 
  theme_classic()

4.2.1 Parametric Approaches

Similar to the trend, we can use some of our parametric tools to model the cyclic patterns in the data using our linear regression structures.

4.2.1.1 Indicator Variables

To estimate the seasonality that does not have a recognizable cyclic functional pattern (sine or cosine curve), we can model each month (or appropriate time unit within a cycle) to have its own average by including indicator variables for each month (or appropriate unit within a cycle) in a linear regression model.

Like when we estimate the trend, we want to ensure that we have fully estimated the seasonality but check the residuals to ensure that there is no seasonality in what is leftover.

birthTS <- birthTS %>%
  mutate(Month = factor(Month))

SeasonModel <- lm(Residual ~  Month, data = birthTS)
SeasonModel

birthTS <- birthTS %>%
  mutate(Season = predict(SeasonModel, newdata = data.frame(Month = birthTS$Month))) %>%
  mutate(ResidualTS = Residual - Season)

# Estimated Trend + Seasonality
birthTS %>%
  ggplot(aes(x = DecimalTime, y = Value)) +
  geom_point() +
  geom_line() +
  geom_line(aes(x = DecimalTime, y = Trend + Season), color = 'blue', size = 1.5) + 
  theme_classic()

# Residuals
birthTS %>%
  ggplot(aes(x = Month, y = ResidualTS)) + 
  geom_line(aes(group = factor(Year))) + 
  geom_smooth(se = FALSE) + 
  theme_classic()

4.2.1.2 Sine and Cosine Curves

While it isn’t the case for this birth data set, if the cyclic pattern is like a sine curve, we could use a sine or cosine curve to model the relationship.

time <- seq(0,2*pi,length = 100)
tibble(
  t = time,
  sin = sin(time),
  cos = cos(time)
) %>%
  ggplot(aes(x = t, y = sin)) + 
  geom_line() + 
  geom_line(aes(x = t, y = cos),color = 'blue') + 
  labs(x = 'x',y = 'f(x)') +
  theme_classic()

Some properties of sine and cosine curves:

\(\cos(0) = 1\), \(\cos(\pi/2) = 0\), \(\cos(\pi) = -1\), \(\cos(3\pi/2) = 0\) \(\cos(2\pi) = 1\)
\(\sin(0) = 0\), \(\sin(\pi/2) = 1\), \(\sin(\pi) = 0\), \(\sin(3\pi/2) = -1\) \(\sin(2\pi) = 0\)
The period (length of 1 full cycle) of sine or cosine function is \(2\pi\)
We can write a sine curve model as \(A\sin(2\pi ft + s)\) where \(A\) is the amplitude (peak deviation from 0 instead of 1), \(f\) is the frequency (number of cycles that occur each unit of time), \(t\) is the time, and \(s\) is the shift in the x-axis.

Now, let’s try to apply it to the birth data.

SineSeason <- lm(Residual ~ sin(2*pi*(DecimalTime) ), data = birthTS) # frequency is one because a unit in DecimalTime is one year
SineSeason

# Estimate Trend + Seasonality
birthTS %>%
  mutate(SeasonSine  = predict(SineSeason, newdata = data.frame(DecimalTime = birthTS$DecimalTime))) %>%
  ggplot(aes(x = DecimalTime, y = Value)) + 
  geom_point() + 
  geom_line() +
  geom_line(aes(x = DecimalTime, y = Trend + SeasonSine), color = 'blue') + 
  theme_classic()

# Residuals
 birthTS %>%
   mutate(ResidualTS2  = Residual - predict(SineSeason, newdata = data.frame(DecimalTime = birthTS$DecimalTime))) %>%
   ggplot(aes(x = Month, y = ResidualTS2)) + 
   geom_line(aes(group = factor(Year))) +
   geom_smooth(se = FALSE) + 
   theme_classic()

Since we still see a pattern in these residuals, using a sine/cosine curve is not an accurate way to estimate the cyclic pattern in this particular data set.

Data Example 3

Here is the daily temperature data in San Francisco in 2011. We only have one year, but we can see the cycle shape is closer to a cosine curve. The period, the length of one cycle, of a cosine or sine curve is \(2*\pi\), so we need to adjust the period when we fit a cosine or sine curve to our data. In the case below, the period is 365 since our time variable is DayinYear where 1 unit is one day in the year (Jan 1 is one and Dec 31 is 365).

load('./data/sfoWeather2011.RData')
sfoWeather <- sfoWeather %>%
  mutate(DayinYear = 1:365) 
sfoWeather %>%
  ggplot(aes(x = DayinYear, y = High)) +
  geom_line() + 
  theme_classic()


CosSeason <- lm(High ~ cos(2*pi*(DayinYear-60)/365), data = sfoWeather) # frequency is 1/365 because a unit in DayinYear is 1 day
CosSeason

sfoWeather %>%
  mutate(Season = predict(CosSeason)) %>% 
  ggplot(aes(x = DayinYear, y = High)) +
  geom_point() +
  geom_line(aes(x = DayinYear, y = Season), color = 'blue', size = 1.5) + 
  theme_classic()

To recap:

For a cosine function,

\[A\cos(Bx + C) + D\]

the period is \(\frac{2*\pi}{|B|}\), the amplitude is \(A\) (the coefficient in our linear model), the phase shift is \(C\) (included in our model above) and the vertical shift is \(D\) (intercept in our linear model).

4.2.2 Nonparametric Approaches

Similar to the trend, we could use local regression or moving averages to estimate the trend and seasonality as long as we defined the local area small enough to pick up the cycles.

The disadvantage of this approach is that we will not be able to use the estimate to make predictions in the future because we do not learn general patterns of trend and seasonality.

4.2.2.1 Differencing to Remove Seasonality

Like removing a trend, we can use differencing to remove seasonality if we don’t need to estimate it. Rather than a difference between neighboring observations, we will take differences between observations that are approximately one period apart. The period, length of one cycle, is \(k\) observations such that \(y_t\) is similar to \(y_{t-k}\) and \(y_{t-2k}\), then we’ll take differences between observations that are \(k\) observations apart.

For example, if you have monthly data and an annual cycle, you’d want to take differences between observations that are 12 lags apart (12 months apart). See the simulated monthly data below.

t <- 1:48
y <- ts(2 + 20*sin(2*pi*t/12) + rnorm(48,sd = 5), frequency=12)
plot(y)

plot(diff(y, lag = 12, differences = 1)) #first order difference of lag 12

Let’s try this approach with the birth data since we saw annual seasonality by month. First, we take first differences of neighboring values to remove the trend. Then we’ll take the first differences of lag 12 to remove the seasonality. No obvious trends or seasonality remain in the data, and it is centered around 0.

plot(diff(diff(birth,differences = 1), lag = 12, differences = 1))

4.2.3 Decomposition of Series

We have discussed a few options to model and remove two model components: the trend and the seasonality.

One way to visualize this decomposition of the series is with decompose(), which uses a moving average to estimate the trend and then finds the average for each unit of the cycle (in our case, months) to estimate the season.

plot(decompose(birth)) #you can decompose a ts object

What is left over, the random errors, are what we need to worry about now. We’ll first talk about how we might model the errors from time series data and then talk about translating these ideas to other types of correlated data.