5 Model Components and Detrending
Learning Goals
- Explain the differences between model components such as trend, seasonality, serial error, and measurement error.
- Understand and implement the tools available to model and remove trends (polynomials, splines, local regression, moving average filters, differencing) and seasonality (indicators, sine and cosine curves, seasonal differencing).
Slides from today are available here.
Group Activity
Download a template RMarkdown file to start from here.
Data Context
The data come from Mauna Loa Observatory (MLO) in Hawaii. According to the website, “the undisturbed air, remote location, and minimal influences of vegetation and human activity at MLO are ideal for monitoring constituents in the atmosphere that can cause climate change.”
You’ll analyze the CO2 concentration measurements available on the MLO’s website. CO2 measurements have been collected from 1958 to the present day and are presented as monthly averages. Open the URL in the code below to see the raw text file; the first 50 lines contain information about the data and variable definitions, which is why they are skipped when reading the data into R. I have provided code that adds variable names to the data frame.
co2 <- read.table('ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt', skip = 50, header = FALSE)
names(co2) <- c('Year','Month','Date','CO2','CO2Trend','Days','sdDays','seCO2')
co2[co2 < 0] <- NA # negative values flag missing data; recode them as NA
head(co2)
- Using ggplot(), make a plot of the CO2 concentration (CO2) as a function of the Date. Comment on what you observe in the data context.
require(ggplot2)
require(dplyr)
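A minimal sketch of one possible version of this plot, assuming the co2 data frame created above:

co2 %>%
  ggplot(aes(x = Date, y = CO2)) +
  geom_line() + # monthly average CO2 (ppm) over time
  theme_classic()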
- Using ggplot(), make a plot of the CO2 concentration (CO2) as a function of the Month, grouped by Year. Comment on what you observe in the data context.
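A minimal sketch of one way to group the monthly series by year (mapping Year to color is an optional extra, not required by the prompt):

co2 %>%
  ggplot(aes(x = Month, y = CO2, group = Year, color = Year)) +
  geom_line() + # one line per year of monthly averages
  theme_classic()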
Estimating Trend
- Estimate the trend of the data using a moving average filter. Create a vector of weights (save it as fltr) for points before and after the time at which you are estimating the trend, such that you are averaging over one full year of data.
co2data <- ts(co2$CO2, start = c(1958, 3), frequency = 12) # ts (time series) objects keep track of both the time index and the values

fltr <- ?? # vector of weights for the moving average filter

co2 <- co2 %>%
  mutate(co2Trend = as.numeric(stats::filter(co2data, filter = fltr, method = 'convo', sides = 2)))

co2 %>%
  ggplot(aes(x = Date, y = CO2)) +
  geom_line() +
  geom_line(aes(y = co2Trend), color = 'red') +
  theme_classic()
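As a hedged example, one common choice for monthly data is a 2x12 moving average: 13 weights spanning 13 months, with half weight on the two end months so the window covers exactly one year. Treat the specific weights below as an illustration rather than the only valid answer.

fltr <- c(0.5, rep(1, 11), 0.5) / 12 # weights sum to 1; with sides = 2 the window is centered on each month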
- Estimate the trend of the data using local regression, loess(). Adjust span= so as to only capture the trend and not the seasonality. Compare the estimated trend with that from the moving average.
co2 <- co2 %>%
  mutate(co2Trend2 = predict(loess(CO2 ~ Date, data = co2, span = ??)))

co2 %>%
  ggplot(aes(x = Date, y = CO2)) +
  geom_line() +
  geom_line(aes(y = co2Trend), color = 'red') +
  geom_line(aes(y = co2Trend2), color = 'blue') +
  theme_classic()
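A hedged illustration: with over 60 years of monthly data, a fairly small span (a small fraction of the observations) is usually enough to follow the long-run trend while smoothing over the within-year cycle. The span value below is an assumed starting point to tweak, not a recommendation; predicting at newdata = co2 keeps the fitted values aligned with every row even if some CO2 values are missing.

co2 <- co2 %>%
  mutate(co2Trend2 = predict(loess(CO2 ~ Date, data = co2, span = 0.1), newdata = co2))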
- What other ways might we estimate this trend? Think about other classes you’ve taken (Stat 155, Stat 253). Try implementing those approaches to estimating the trend, save the predicted values as co2Trend3 and co2Trend4, and compare them to the other two estimated trends by plotting them on top.
# Add new variables, co2Trend3 and co2Trend4 to the co2 data set
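As a hedged sketch, two options you may have seen in regression courses are a polynomial trend and a spline trend; the degree and degrees of freedom below are arbitrary choices for illustration.

library(splines)

co2 <- co2 %>%
  mutate(co2Trend3 = predict(lm(CO2 ~ poly(Date, 2), data = co2), newdata = co2), # quadratic polynomial trend
         co2Trend4 = predict(lm(CO2 ~ ns(Date, df = 6), data = co2), newdata = co2)) # natural cubic spline trend

co2 %>%
  ggplot(aes(x = Date, y = CO2)) +
  geom_line() +
  geom_line(aes(y = co2Trend), color = 'red') +
  geom_line(aes(y = co2Trend2), color = 'blue') +
  geom_line(aes(y = co2Trend3), color = 'green') +
  geom_line(aes(y = co2Trend4), color = 'purple') +
  theme_classic()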
- Choose the method (of the four above) for estimating the trend that you find works best. Demonstrate its utility on this data set by visualizing the residuals (also known as the detrended data). Make sure the plot includes a horizontal line at 0 and a smooth estimate of the mean of the residuals across time.
co2 <- co2 %>%
  mutate(Detrend = CO2 - ??)
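A hedged sketch, assuming you prefer the loess trend (co2Trend2); swap in whichever trend column you chose.

co2 <- co2 %>%
  mutate(Detrend = CO2 - co2Trend2)

co2 %>%
  ggplot(aes(x = Date, y = Detrend)) +
  geom_line() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) + # smooth estimate of the mean of the residuals over time
  theme_classic()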
Estimating Seasonality
- Visualize the seasonality by first removing the estimated trend (the trend you prefer) and plotting those residuals (Detrend) as a function of Month, grouped by Year.
- Estimate the seasonality with linear regression, using indicators for Month to estimate the average monthly deviations due to seasonality. Plot the estimated average monthly deviations (the estimate of the seasonality). Comment on the plot. Does this make sense based on your understanding of the planet and CO2 in the atmosphere?
lm.season <- lm(Detrend ~ ??, data = co2)

# Add new variable, Season, as the predictions from lm.season to the co2 data set
co2 <- co2 %>%
  mutate(??)

# Plot estimated average monthly deviations
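A hedged sketch of one way to fill in these blanks, assuming Month is treated as a categorical indicator via factor():

lm.season <- lm(Detrend ~ factor(Month), data = co2)

# Season holds the fitted monthly deviation for every row (predicting at newdata = co2 keeps the lengths aligned)
co2 <- co2 %>%
  mutate(Season = predict(lm.season, newdata = co2))

# One estimated average deviation per month
co2 %>%
  distinct(Month, Season) %>%
  ggplot(aes(x = Month, y = Season)) +
  geom_point() +
  geom_line() +
  theme_classic()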
- Visualize the leftover residuals after removing the estimated seasonality (Errors). Do the residuals look like independent random noise?
co2 <- co2 %>%
  mutate(Errors = Detrend - Season)

co2 %>%
  ggplot(aes(x = Date, y = Errors)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0) +
  theme_classic()
- Do you think it is safe to assume the resulting random process is stationary?
- Assuming stationarity, estimate the correlation function. Comment on what you notice. What does it mean in the context of the problem? Is it independent random noise?
acf(na.omit(co2$Errors))
We will consider models for these error processes, assuming they are stationary.
- CHALLENGE: Let’s say you want to predict the CO2 concentrations in the near future. How could you use what we’ve done so far to do that prediction? How might you adjust the processes we’ve used to make forecasts in the future?
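One possible approach (a sketch under assumed modeling choices, not the only answer): fit the trend and seasonality with a parametric regression so they can be extrapolated, then predict at future time points. The quadratic trend, the two-year horizon, and the approximate decimal-date formula below are all illustrative assumptions.

# Trend (quadratic in Date) plus monthly indicators, fit jointly so the fit can be extrapolated
fit <- lm(CO2 ~ poly(Date, 2) + factor(Month), data = co2)

# Hypothetical future months: the two calendar years after the last year in the data
future <- data.frame(Year = rep(max(co2$Year) + 1:2, each = 12), Month = rep(1:12, times = 2))
future$Date <- future$Year + (future$Month - 0.5) / 12 # approximate mid-month decimal date

future$CO2forecast <- predict(fit, newdata = future)
head(future)

This sketch ignores the correlated errors; a fuller forecast would also model the error process we just examined.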
Concept: Differencing
Instead of estimating the trend and seasonality, we could directly remove the trend and seasonality with differencing.
A first-order difference is \(y_t - y_{t-1}\). A first-order seasonal difference with lag 12 is \(y_t - y_{t-12}\).
First-order Differences
- If you take a first-order difference between observations 12 months apart (lag 12), you get this. Describe what is left over.
plot(diff(co2data,lag = 12))
- If you take a first-order difference between observations 1 month apart (lag 1), you get this. Describe what is left over.
plot(diff(co2data,lag = 1))
- If you take the 1-month differences and then take 12-month differences of those, you get this. Describe what is left over.
\[(y_t - y_{t-1}) - (y_{t-12} - y_{t-13})\]
plot(diff(diff(co2data,lag = 1),lag = 12))
- Try a seasonal lag with a value other than 12. What happens? Describe what is left over.
plot(diff(diff(co2data),lag = ?))
- Imagine a process with a linear trend (\(y_t = t\)). What will a first-order difference look like? Prove it to yourself algebraically (and then look at the graph).
\[(y_t - y_{t-1})\]
times = 1:10
y = times
plot(times, y)
plot(diff(y))
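For reference, with \(y_t = t\) the first difference is constant:
\[y_t - y_{t-1} = t - (t-1) = 1\]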
- Imagine a process with no trend but a seasonal component where \(y_t = y_{t-k}\). What would a first-order seasonal difference with lag \(k\) do? Prove it to yourself algebraically (and then look at the graph).
\[(y_t - y_{t-k})\]
times = 1:100
y = rep(c(1:5, 5:1), 10)
plot(times, y, type = 'l')
plot(diff(y, lag = 10))
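For reference, because the pattern repeats exactly with period \(k\) (here \(k = 10\)), the seasonal difference is identically zero:
\[y_t - y_{t-k} = 0\]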
Second-order Differences
- Imagine a random process with quadratic trend (\(y_t = t^2\)). What will a first-order difference look like? Prove it to yourself algebraically (and then look at the graph).
times = 1:10
y = times^2
plot(times, y)
plot(diff(y))
- For the process with quadratic trend, what will a second-order difference (difference the difference) look like? Prove it to yourself algebraically (and then look at the graph).
\[(y_t - y_{t-1}) - (y_{t-1} - y_{t-2}) = y_t - 2y_{t-1} + y_{t-2}\]
times = 1:10
y = times^2
plot(times, y)
plot(diff(y, differences = 2))
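For reference, with \(y_t = t^2\) the first difference is linear in \(t\) and the second difference is constant:
\[y_t - y_{t-1} = t^2 - (t-1)^2 = 2t - 1, \qquad (y_t - y_{t-1}) - (y_{t-1} - y_{t-2}) = 2\]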
- Imagine a random walk where \(y_t = y_{t-1} + \epsilon_t\). What would a first-order difference of lag 1 do?
\[(y_t - y_{t-1})\]
e = rnorm(100) # independent standard normal errors
y = cumsum(e) # random walk: y_t = y_{t-1} + e_t
plot(y, type = 'l')
plot(diff(y, lag = 1), type = 'l')
acf(diff(y, lag = 1))
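For reference, substituting the definition of the random walk shows that the first difference recovers the independent errors, which is what the ACF (no notable correlation beyond lag 0) reflects:
\[y_t - y_{t-1} = \epsilon_t\]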