Chapter 7 Spatial Data
Unlike time series and longitudinal data, which are indexed by time, spatial data is indexed by location in space (in 2 or 3 dimensions).
Typically, we have point-referenced or geostatistical data, where the outcome is \(Y(s)\), \(s\in \mathbb{R}^d\), and the location \(s\) varies continuously. \(s\) may be a point on the globe referenced by its longitude and latitude or a point in another coordinate system. We are typically interested in the relationships between the outcome and explanatory variables and in making predictions at locations where we do not have data. Throughout, we will use the fact that points closer to each other in space are more likely to have similar values.
Below, we have mapped the zinc concentration (coded by color) at 155 spatial locations in a flood plain of the Meuse River in the Netherlands. We might be interested in explaining the variation in zinc concentrations in terms of the distance to the river, flooding frequency, soil type, land use, etc. After building a model to predict the mean zinc concentration, we could use that model to help us understand the current landscape and to make predictions. Remember that to make predictions at new spatial locations, we must also observe these explanatory characteristics at those locations.
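library(sp)     # provides the meuse data set
library(ggmap)  # attaches ggplot2, which supplies ggplot() and friends
data(meuse)
# Plot each sampling location, colored by its zinc concentration
ggplot(meuse, aes(x = x, y = y, color = zinc)) +
  geom_point() +
  scale_color_viridis_c() +
  coord_equal() +  # one unit on x equals one unit on y, so distances are not distorted
  theme_minimal()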
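The idea that nearby locations tend to have similar values can be checked roughly on these data. The following is a minimal sketch, not the chapter's method: it compares the average absolute difference in log zinc between pairs of locations that are close together and pairs that are far apart. The 200 m and 1000 m cutoffs and the log transformation are illustrative assumptions.

```r
library(sp)   # provides the meuse data set
data(meuse)

# Pairwise Euclidean distances between locations (coordinates are in meters)
pair_dist <- as.vector(dist(meuse[, c("x", "y")]))

# Pairwise absolute differences in log zinc, in the same pair order as above
pair_diff <- as.vector(dist(log(meuse$zinc)))

# Assumed, illustrative cutoffs for "near" and "far" pairs
near_diff <- mean(pair_diff[pair_dist < 200])
far_diff  <- mean(pair_diff[pair_dist > 1000])

c(near = near_diff, far = far_diff)
```

If spatial correlation is present, the average difference for near pairs should be smaller than the average difference for far pairs.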
We may not be able to collect data at that fine a spatial granularity, either because location-level data are unavailable or because exact locations must be withheld to protect the confidentiality of individuals. Instead, we may have areal (also called lattice or discrete) data: aggregate data that summarize observations within a spatial boundary such as a county, state, country, or square grid cell. In this circumstance, we think of spatial areas as similar if they are close (they share a boundary, their centers are close to each other, etc.). We must therefore consider correlation based on factors other than longitude and latitude alone.
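Two of the notions of closeness mentioned above, sharing a boundary and having nearby centers, can be sketched with the sf package. The example uses a hypothetical 3 by 3 grid of square areas rather than real administrative boundaries; the grid and the object names are assumptions made for illustration.

```r
library(sf)

# Hypothetical study region: a 3 x 3 grid of unit squares (no CRS assigned)
region <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3, ymax = 3)))
grid   <- st_make_grid(region, n = c(3, 3))

# Neighbors by shared boundary: st_touches() counts shared edges and corners
touching    <- st_touches(grid)
n_neighbors <- lengths(touching)

# Closeness by centers: distances between area centroids
center_dist <- st_distance(st_centroid(grid))

n_neighbors        # neighbor count per cell
center_dist[1, ]   # distances from the first cell's center to all centers
```

Either object could serve as the basis for a definition of which areas count as "close"; which definition is appropriate depends on the application.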
Below, we have mapped the number of sudden infant death syndrome (SIDS) deaths recorded in each county of North Carolina over 1974 to 1978 (the SID74 variable). We might be interested in explaining the variation in county SIDS rates (deaths relative to births) in terms of population size, birth rate, and other factors that might explain county-level differences. After building a model to predict the mean SIDS rate, we could use that model to help us understand the current public health landscape, and we could use it to make predictions in the future.
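library(sf)
# Read the North Carolina county polygons shipped with the spData package
nc <- st_read(system.file("shapes/sids.shp", package = "spData")[1], quiet = TRUE)
st_crs(nc) <- "+proj=longlat +datum=NAD27"  # the shapefile carries no CRS, so set it
row.names(nc) <- as.character(nc$FIPSNO)    # label rows by county FIPS code
# Fill each county polygon by SID74, the SIDS death count for 1974 to 1978
ggplot(nc, aes(fill = SID74)) +
  geom_sf() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "SIDS (sudden infant death syndrome) in North Carolina") +
  theme_classic()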
In some disciplines, we may be most interested in the locations themselves and study the point patterns or point processes to try to determine any structure in the locations.
The data below record the locations of 126 pine saplings in a Finnish forest, along with their heights and diameters. We might be interested in whether there are any patterns in the locations of the pines. Are they spread uniformly? Is there clustering? Are there minimum distances between pines, such that they will only grow if they are far enough away from each other (repelling each other)?
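One simple way to quantify these questions is to compare the observed mean nearest-neighbor distance with the value expected if the points were scattered completely at random, roughly \(1/(2\sqrt{n/\text{area}})\) (the Clark-Evans ratio, here without any edge correction). The sketch below uses simulated points in a unit square, not the pine sapling data; the function name `nn_ratio` and the simulation settings are illustrative assumptions.

```r
set.seed(1)
n <- 126   # same number of points as the sapling data, but simulated locations

# Ratio of the observed mean nearest-neighbor distance to its expected value
# under complete spatial randomness (no edge correction; area assumed known)
nn_ratio <- function(xy, area = 1) {
  d <- as.matrix(dist(xy))
  diag(d) <- Inf                      # ignore each point's distance to itself
  observed <- mean(apply(d, 1, min))  # mean nearest-neighbor distance
  expected <- 1 / (2 * sqrt(nrow(xy) / area))
  observed / expected
}

# Points scattered uniformly at random in the unit square
random_xy <- cbind(runif(n), runif(n))

# Clustered points: offspring scattered tightly around 6 random parents
parents      <- cbind(runif(6), runif(6))
idx          <- sample(6, n, replace = TRUE)
clustered_xy <- parents[idx, ] + matrix(rnorm(2 * n, sd = 0.03), ncol = 2)

c(random = nn_ratio(random_xy), clustered = nn_ratio(clustered_xy))
```

A ratio near 1 is consistent with random scatter, a ratio well below 1 suggests clustering, and a ratio above 1 suggests regular spacing, as would happen if points repel each other.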