1.3 General Themes
In each of the data examples above, and in the examples we discuss in class, observations taken closer together in time or space are typically more similar than observations taken farther apart.
We may be able to explain why data closer together are more similar using predictors or explanatory variables, but there may be unmeasured characteristics or inherent dependence that we can’t explain with our collected data.
To be more precise, we will assume that the observed outcome at time \(t\) (we will generalize this notation to spatial data) can be modeled as
\[y_t = \underbrace{f(x_t)}_\text{trend} + \underbrace{\epsilon_t}_\text{noise}\]
where the trend is modeled as a deterministic function of the predictors \(x_t\), and \(\epsilon_t\) is the leftover random noise. This noise may include both serial autocorrelation, which arises because observations are taken close together in time, and random variability or measurement error from the data collection instrument.
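To make the decomposition concrete, here is a minimal simulation sketch in Python. Everything specific in it is an assumption made for illustration, not part of the model above: the trend is a hypothetical linear increase plus a 12-step cycle, the noise follows a first-order autoregressive (AR(1)) process with coefficient `phi`, and the function names `simulate_series` and `lag1_correlation` are made up for this example.

```python
import math
import random


def simulate_series(n=2000, phi=0.8, sigma=1.0, seed=1):
    """Simulate y_t = f(x_t) + eps_t with AR(1) noise (an illustrative assumption)."""
    rng = random.Random(seed)
    trend, noise, y = [], [], []
    eps = 0.0
    for t in range(n):
        # Deterministic trend: slow linear growth plus a 12-step cycle (hypothetical).
        f_t = 0.01 * t + 2.0 * math.sin(2 * math.pi * t / 12)
        # AR(1) noise: each value carries over a fraction phi of the previous one.
        eps = phi * eps + rng.gauss(0.0, sigma)
        trend.append(f_t)
        noise.append(eps)
        y.append(f_t + eps)
    return trend, noise, y


def lag1_correlation(values):
    """Sample correlation between consecutive values (v_t, v_{t+1})."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[t] - mean) * (values[t + 1] - mean) for t in range(n - 1))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


trend, noise, y = simulate_series()
# Noise terms that are close in time are strongly correlated when phi is large.
print(round(lag1_correlation(noise), 2))

# With phi = 0 the noise terms are independent, so this should be near zero.
_, independent_noise, _ = simulate_series(phi=0.0)
print(round(lag1_correlation(independent_noise), 2))
```

Comparing the two printed values shows the distinction the model draws: the trend is the same in both runs, but only the first has noise in which neighboring values resemble each other.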
Time Series Example
Consider the data example of Google search frequency for the term “cupcake.”
The number of people who are searching for the term “cupcake” should be a function of general interest in cupcakes. This interest could change throughout the year by season, or it may be reflected in the number of cupcake shops in business or the number of mentions of cupcakes on network television. We could use these measured predictors in modeling the overall trend in search frequency.
What else may explain differences in the interest in cupcakes over time? Even if we can collect and account for these other cultural characteristics, the number of searches for cupcakes will be similar from one day to the next because culture and general interest typically do not change overnight (unless an extreme event happens).
Spatial Example
Consider the home sale prices from Zillow. Price is determined by a combination of home characteristics (e.g., number of bedrooms, number of bathrooms, size, home quality) and neighborhood characteristics (e.g., walking distance to amenities, perception of school reputation). These characteristics could be used to model the general trends in sale prices. Even after controlling for these measurable qualities, homes that are next to each other or on the same block tend to have similar prices.
For each data type, we will discuss these two components: the trend and the noise. In particular, we can’t assume that the noise terms are independent of one another, so we need to model their covariance and correlation, treating the noise as a collection of random variables.
1.3.1 Questions of Interest
- Dependence of the random variables in the process: ‘How do future values depend on past values? How do values depend on neighboring values?’
  - Think: Covariance and correlation of random variables
- Long-Term Averages: ‘What is the average value at a particular point in time (or space)?’
  - Think: Expected value of random variables (we’ll call the overall long-term average the trend)
- Cycles: ‘Are there recurring patterns in the average values?’
  - Think: Cycles in the expected value of random variables (we’ll call the local cycles seasonality)
We’ll come back to the questions about long-term averages and cycles for each sub-field. For now, let’s spend some time thinking about covariance and correlation in the context of a random process.
With all three of the correlated data types, we explicitly or implicitly model the covariance between observations, so we need to be quite familiar with the probability theory of covariance.
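As a brief reminder, the standard definitions for two random variables \(Y_t\) and \(Y_s\) are sketched below. The notation \(\text{Cov}\), \(\text{Cor}\), and \(\text{Var}\) is introduced here for illustration.
\[\text{Cov}(Y_t, Y_s) = E\big[(Y_t - E[Y_t])(Y_s - E[Y_s])\big] = E[Y_t Y_s] - E[Y_t]E[Y_s]\]
\[\text{Cor}(Y_t, Y_s) = \frac{\text{Cov}(Y_t, Y_s)}{\sqrt{\text{Var}(Y_t)\,\text{Var}(Y_s)}}\]
Applied to the model \(y_t = f(x_t) + \epsilon_t\), and assuming (as an added simplification) that the noise has mean zero, the deterministic trend carries the expected value while the noise carries all of the dependence:
\[E[y_t] = f(x_t) + E[\epsilon_t] = f(x_t), \qquad \text{Cov}(y_t, y_s) = \text{Cov}(\epsilon_t, \epsilon_s)\]
The second equality holds because adding a constant to a random variable does not change its covariance with anything else.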