1.2 Motivation for Methods

To motivate why this class is important, let’s consider applying linear regression (least squares) models that you learned in your introductory course, Stat 155. We will call the method of least squares regression you used ordinary least squares (or OLS, for short).

1.2.1 Simulated Data

Suppose we have 50 observations over time where our outcome of interest is observed yearly, and it increases linearly over time, but there is random variation from year to year. If the random fluctuation is large and positive one year, it is more likely to be large and positive the next year. In particular, let’s assume that our outcome \(y_t\) is roughly twice the year index \(x_t\) (increasing by about 2 each year) such that

\[y_t = 2*x_t + \epsilon_t\]

where \(x_1 = 1, x_2 = 2,...,x_{50} = 50\) represents the number of years since 1950, and the random fluctuations, or noise, \(\epsilon_t\) are positively correlated from time \(t\) to \(t+1\) and are generated from an autoregressive process – we’ll talk about this process later.

Let’s see a plot of one possible realization (for one subject or unit) generated from this random process. The overall trend is a line with an intercept at 0 and a slope of 2, but we can see that the observations don’t just randomly bounce around that line but rather stick close to where the past value was.
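The code that generated the plot isn’t shown here, but below is a minimal sketch of how one such realization could be simulated and plotted. The AR(1) coefficient of 0.9 matches the correlation mentioned later in this section; the innovation standard deviation (sd = 2) and the seed are arbitrary choices for illustration, so a realization from this sketch won’t exactly match the one described above.

library(dplyr)
library(ggplot2)

set.seed(155)                                            # for reproducibility
x <- 1:50                                                # years since 1950
e <- arima.sim(model = list(ar = 0.9), n = 50, sd = 2)   # AR(1) noise, lag-1 correlation 0.9
y <- 2 * x + e                                           # true trend: intercept 0, slope 2

data.frame(x = x, y = as.numeric(y)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_line() +
  geom_abline(intercept = 0, slope = 2, color = 'red')   # true trend line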

1.2.2 OLS Estimation

If we were to ignore the correlation in the random fluctuation across time, we could fit a simple linear regression model,

\[y_t = \beta_0 + \beta_1 x_t + \epsilon_t,\quad \epsilon_t \stackrel{iid}{\sim} N(0,\sigma^2)\]

using OLS to estimate the general or overall relationship with year; we’ll call that big-picture relationship the trend.

lm(y ~ x) %>% summary() # fit the OLS model of y on x and view the estimated coefficients and standard errors

Even though we’ve ignored the correlation, OLS does well estimating the slope (\(\hat{\beta_1} = 2.19\), when the true slope used to generate the data is \(\beta_1 = 2\)). But let’s consider the standard error of that slope, which estimates the variability of the slope estimate. The lm() function gives a standard error of 0.08074.

WARNING: This is not a valid estimate of the variability in the slope with correlated data. Let’s generate more data to see why!

Let’s simulate this same process 500 times (each time with different random fluctuations) so we can get a sense of how much the estimated slope (which was a “good” estimate) might change.
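The simulation code isn’t shown here; a sketch of how it might look is below, reusing the hypothetical data-generating settings from the earlier sketch (AR(1) coefficient 0.9, innovation sd of 2) and storing the slope estimate and its reported standard error from each of the 500 fits in sim_cor_data_results. The exact values will differ somewhat from those quoted in the text.

library(dplyr)
library(purrr)

set.seed(155)
sim_cor_data_results <- map_dfr(1:500, function(i) {
  x <- 1:50
  e <- arima.sim(model = list(ar = 0.9), n = 50, sd = 2)   # correlated (AR) noise
  y <- 2 * x + e                                           # same true trend as before
  fit <- summary(lm(y ~ x))
  data.frame(
    beta = fit$coefficients['x', 'Estimate'],      # estimated slope
    se   = fit$coefficients['x', 'Std. Error']     # standard error reported by lm()
  )
})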

We can look at all of the estimated slopes with a histogram of the values in beta.

sim_cor_data_results %>%
  ggplot(aes(x = beta)) + 
  geom_histogram() + 
  geom_vline(xintercept = 2, color = 'red') #true value used to generate the data

The histogram is centered at the true slope of 2 (great!). This indicates that our estimate is unbiased: on average, OLS using the lm() function recovers the true value of 2. Formally, the estimate is unbiased if the expected value of the estimated slope equals the true slope we used to generate the data,

\[E(\hat{\beta_1}) = \beta_1\]

Now, let’s look at the standard errors (SE) that OLS using lm() gave us.

sim_cor_data_results %>%
  ggplot(aes(x = se)) + 
  geom_histogram() + 
  geom_vline(xintercept = sd(sim_cor_data_results$beta), color = 'red') # true variability of the slope estimates

The true variability of the slopes is estimated by the standard deviation of beta from the simulations (around 0.33), but the standard errors reported by lm() are all between 0.03 and 0.1.

  • The function lm(), which assumes the observations are independent, underestimates the true variation of the slope estimate.
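As a quick numeric check (a sketch, assuming the simulation results are stored in sim_cor_data_results as above), we can compare the standard deviation of the simulated slopes to the range of the standard errors reported by lm():

sim_cor_data_results %>%
  summarize(
    sd_of_slopes = sd(beta),   # true variability of the slope estimates across simulations
    smallest_se  = min(se),    # range of standard errors reported by lm()
    largest_se   = max(se)
  )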

Why does this happen?

If the data were truly independent, then each random data point would give you unique, important information. If the data are correlated, the observations contain overlapping information (e.g., knowing today’s interest in cupcakes tells you something about tomorrow’s interest in cupcakes). Thus, the effective sample size for positively correlated data is going to be less than the sample size of an independent sample. You could get the same amount of information about the phenomenon with fewer data points, for example by spacing them further apart in time so they are nearly independent.

The variability of our estimates depends on the information available. Typically, the sample size of independent observations captures the amount of information, but if we have correlated data, the effective sample size (which is usually smaller than the actual sample size) gives a more accurate measure of the information available.

In this specific simulation of \(n=50\) observations from an autoregressive process of order 1, the effective sample size is calculated as \(50/(1 + 2*0.9) = 17.9\), where \(0.9\) was the correlation used to generate the autoregressive noise. The 50 correlated observations contain about the same amount of information as 18 independent observations.
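As a minimal sketch of that arithmetic (using only the lag-1 correlation of 0.9, as in the calculation above):

n <- 50
rho1 <- 0.9                   # lag-1 correlation used to generate the AR noise
n_eff <- n / (1 + 2 * rho1)   # effective sample size
n_eff                         # about 17.9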

If you are interested in the theory behind calculating the effective sample size, check out these two resources: the Andy Jones Blog and the Stan handbook.

What is the big takeaway?

When we use lm(), we are using the ordinary least squares (OLS) method of estimating a regression model. In this method, we assume that the observations are independent of each other.

If our data are actually correlated (not independent), the OLS slope estimates are good (unbiased), but the inference based on the standard errors (including the test statistics, p-values, and confidence intervals) is wrong.

In this case, the true variability of the slope estimates is much higher than the standard errors reported by lm() because the effective sample size is much smaller than the actual sample size.