5 Model Components and Detrending

Learning Goals

Explain the differences between model components such as trend, seasonality, serial error, and measurement error.
Understand and implement the tools available to model and remove trends (polynomials, splines, local regression, moving average filters, differencing) and seasonality (indicators, sine and cosine curves, seasonality differencing)

Group Activity

Download a template RMarkdown file to start from here.

Data Context

The data come from Mauna Loa Observatory (MLO) in Hawaii. According to the website, “the undisturbed air, remote location, and minimal influences of vegetation and human activity at MLO are ideal for monitoring constituents in the atmosphere that can cause climate change.”

You’ll analyze the CO2 concentration measurements available on the MLO’s website. CO2 measurements have been collected since 1958 until present day and they are presented as monthly averages. Click the website below to see the data in the text file. The first 50 lines include information about the data and provide variable definitions. Read the data into R with the code below. I have provided you some code to add Variable Names to a data frame.

co2 <- read.table('ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt',skip=50,header=FALSE)
names(co2) <- c('Year','Month','Date','CO2','CO2Trend','Days','sdDays','seCO2')
co2[co2 < 0] <- NA
head(co2)

Using ggplot(), make a plot of the CO2 concentration (CO2) as a function of the Date. Comment on what you observe in the data context.

require(ggplot2)
require(dplyr)

Using ggplot(), make a plot of the CO2 concentration (CO2) as a function of the Month, grouped by Year. Comment on what you observe in the data context.

Estimating Trend

Estimate the trend of the data using a moving average filter. Create a vector of weights (save it as fltr) for points before and after the time that you are estimating the trend, such that you are adequately averaging over one years of data.

co2data <- ts(co2$CO2, start = c(1958,3), frequency = 12) # ts (time series) objects keeps track of time and the values.

fltr <- ??  #vector of weights for moving average filter

co2 <- co2 %>%
  mutate( co2Trend = as.numeric(stats::filter(co2data, filter = fltr, method = 'convo', sides = 2)))

co2  %>% 
  ggplot(aes(x = Date, y = CO2)) +
  geom_line() +
  geom_line(aes(y = co2Trend), color = 'red') +
  theme_classic()

Estimate the trend of the data using local regression, loess(). Adjust span= so as to only capture the trend and not the seasonality. Compare the estimated trend with that from the moving average.

co2 <- co2 %>%
  mutate(co2Trend2 = predict(loess(CO2 ~ Date, data = co2, span = ??)))

co2  %>% 
  ggplot(aes(x = Date, y = CO2)) +
  geom_line() +
  geom_line(aes(y = co2Trend), color = 'red') +
  geom_line(aes(y = co2Trend2), color = 'blue') +
  theme_classic()

What other ways might we estimate this trend? Think about other classes you’ve taken (Stat 155, Stat 253). Try implementing those approaches to estimating the trend, save the predicted values as co2Trend3 and co2Trend4, and compare them to the other two estimated trends by plotting it on top.

# Add new variables, co2Trend3 and co2Trend4 to the co2 data set

Choose a method (of the four above) to estimate trend that you find works best. Demonstrate its utility on this data set by visualizing the residuals (also known as the detrended data). Make sure the plot includes a horizontal line at 0 and a smooth estimated mean of the residuals across time.

co2 <- co2 %>%
  mutate(Detrend = CO2 - ??)

Estimating Seasonality

Visualize the seasonality by first removing the estimated trend (the trend you prefer) and plotting those residuals (Detrend) as a function of Month, grouped by Year.
Estimate the seasonality using indicators for Month to estimate the monthly average deviations due to seasonality using linear regression. Plot the estimated average monthly deviations (estimate of the seasonality). Comment on the plot. Does this make sense based on your understanding of the planet and CO2 in the atmosphere?

lm.season <- lm(Detrend ~ ??, data = co2)
  
# Add new variable, Season, as the predictions from lm.season to the co2 data set
co2 <- co2 %>%
  mutate(??)  

# Plot estimated average monthly deviations

Visualize the leftover residuals after removing the estimated seasonality (Errors). Do the residuals look like independent random noise?

co2 <- co2 %>%
  mutate(Errors = Detrend - Season)

co2 %>%
  ggplot(aes(x = Date, y = Errors)) + 
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0) + 
  theme_classic()

Do you think it is safe to assume the resulting random process is stationary?
Assuming stationarity, estimate the correlation function. Comment and what you notice. What does it mean in the context of the problem? Is it independent random noise?

acf(na.omit(co2$Errors))

We will consider models for these error processes, assuming they are stationary.

CHALLENGE: Let’s say you want to predict the CO2 concentrations in the near future. How could you use what we’ve done so far to do that prediction? How might you adjust the processes we’ve used to make forecasts in the future?

Concept: Differencing

Instead of estimating the trend and seasonality, we could directly remove the trend and seasonality with differencing.

A first-order difference is \(y_t - y_{t-1}\). A first-order seasonal difference with lag 12 is \(y_t - y_{t-12}\).

First-order Differences

If you take a first-order difference between observations 12 months apart (lag 12), you get this. Describe what is left over.

plot(diff(co2data,lag = 12))

If you take a first-order difference between observations 1 month apart (lag 1), you get this. Describe what is left over.

plot(diff(co2data,lag = 1))

If you take the 1 month differences and then take 12 month differences, we get this. Describe what is left over.

\[(y_t - y_{t-1}) - (y_{t-12} - y_{t-13})\]

plot(diff(diff(co2data,lag = 1),lag = 12))

Try a seasonal lag of a different value other than 12, what happens? Describe what is left over.

plot(diff(diff(co2data),lag = ?))

Imagine process with a linear trend line (\(y_t = t\)). What will a first-order difference look like? Prove it to yourself algebraically (and then look at the graph).

\[(y_t - y_{t-1})\]

times = 1:10
y = times

plot(times, y)
plot(diff(y))

Imagine a process with no trend but a seasonal component where \(y_t = y_{t-k}\). What would a first-order difference of seasonal lag \(k\) do? Prove it to yourself algebraically (and then look at the graph).

\[(y_t - y_{t-k})\]

times = 1:100
y = rep(c(1:5,5:1),10)

plot(times,y,type='l')
plot(diff(y, lag=10))

Second-order Differences

Imagine a random process with quadratic trend (\(y_t = t^2\)). What will a first-order difference look like? Prove it to yourself algebraically (and then look at the graph).

times = 1:10
y = times^2

plot(times,y)
plot(diff(y))

For the process with quadratic trend, what will a second-order difference (difference the difference) look like? Prove it to yourself algebraically (and then look at the graph).

\[(y_t - y_{t-1}) - (y_{t-1} - y_{t-2}) = y_t - 2y_{t-1} + y_{t-2}\]

times = 1:10
y = times^2

plot(times,y)
plot(diff(y, differences=2))

Imagine a random walk where \(y_t = y_{t-1} + \epsilon_t\). What would a first-order difference of lag 1 do?

\[(y_t - y_{t-1})\]

e = rnorm(100)
y = cumsum(e)

plot(y,type='l')
plot(diff(y, lag=1),type='l')
acf(diff(y, lag=1))