3.1 Modeling Goals

Broadly, a model is a simplified representation of the world. When we build models, we may have different goals.

One goal when building models is prediction. Given data on a response or outcome variable, \(Y\), and one or more predictor or explanatory variables, \(X\), the goal is to find a mathematical function, \(f\), of \(X\) that gives good predictions of \(Y\). For example, we might want to predict a customer’s chest size from their neck size. This \(X\) may be a single variable, but it is most often a set of variables. We’ll be building up to multivariate modeling over the course of this chapter.

Can you think of some other concrete examples in which we’d want a model to do prediction? Consider what predictions might be made about you every day.

What are the qualities of a good model and function, \(f\)? We want an \(f(X)\) such that when we plug in a particular value, \(X=x\), the model prediction \(\hat{y}=f(x)\) (read, “y hat”) is close to the observed outcome value \(y\). In other words, we want the difference \(y-\hat{y}\) to be small. This difference between the observed value and the prediction, \(y-\hat{y}\), is called a residual. We’ll discuss residuals more later.
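To make this concrete, below is a minimal sketch in Python; the prediction function `f` and the single observation \((x, y)\) are made-up values for illustration, not from real data:

```python
# A made-up prediction function f and one observation (x, y),
# purely for illustration -- not fit to any real data.
def f(x):
    return 10 + 2 * x  # hypothetical model: f(x) = 10 + 2x

x = 15.0  # observed neck size (hypothetical)
y = 38.5  # observed chest size (hypothetical)

y_hat = f(x)          # model prediction, y-hat
residual = y - y_hat  # observed minus predicted
print(y_hat, residual)  # 40.0 -1.5
```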

Another goal when building models is description. We want a model to “explain” the relationship between the \(X\) and \(Y\) variables. Note that an overly complicated model may not be useful here, because it does little to help us understand the relationship. A more complex model may, however, produce better predictions. George Box is often quoted as saying, “All models are wrong, but some are useful.” Depending on our goal, one model may be more useful than another.

Can you think of some concrete examples in which we’d want a model to explain a phenomenon? Consider how policy decisions get made.

To begin, we will consider a simple but powerful model in which we limit this function, \(f(X)\), to be a straight line with a y-intercept, \(\beta_0\), and slope, \(\beta_1\). (\(\beta\) is the Greek letter beta.) The \(E[Y | X]\) below stands for the expected value of the response variable \(Y\) for a given value of \(X\).

\[E[Y | X] = \beta_0 + \beta_1\,X\]

This is a simple linear regression model. We model the expected value (the average value) of the response variable for a given value of the explanatory variable \(X\) as a line. It is the foundation of many of the models used in modern statistics and is more flexible than you may think.

To capture the general relationship between \(X\) and \(Y\), we’ll need to find values of the slope and intercept that give a line passing close to the average value of \(Y\) at each value of \(X\). We’ll get to this soon.
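As a preview, here is a minimal sketch of one way to find such a line in Python, using NumPy’s least-squares `polyfit`; the neck and chest measurements are made up for illustration:

```python
import numpy as np

# Hypothetical data: neck sizes (x) and chest sizes (y).
x = np.array([13.0, 14.0, 15.0, 16.0, 17.0])
y = np.array([35.1, 37.0, 38.5, 41.2, 42.8])

# Least-squares fit of a degree-1 polynomial (a line).
# polyfit returns coefficients from the highest degree down,
# so we get (slope, intercept).
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
print(beta0_hat, beta1_hat)  # estimated intercept and slope
```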

Once we have values for the intercept and slope, we are going to call them estimates and put a hat on them, so \(\hat{\beta}_0\) is our estimated intercept and \(\hat{\beta}_1\) is our estimated slope. We can use those to make predictions by plugging in a value of \(x\):

\[\hat{y} = \hat{\beta}_0 +\hat{\beta}_1 x\]

The little hat on top of \(\hat{y}\) means that we’re talking about a predicted or estimated value of \(y\), so our model says that the predicted or estimated value of \(y\) is equal to an estimated intercept (\(\hat{\beta}_0\)), plus an estimated slope (\(\hat{\beta}_1\)), times the value \(x\).
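For instance, here is a minimal sketch of that calculation in Python; the estimates are hypothetical stand-ins for the output of a real fit:

```python
# Hypothetical estimates, as if produced by a fit like the one above.
beta0_hat = 10.2  # estimated intercept (made up)
beta1_hat = 1.9   # estimated slope (made up)

x_new = 15.5  # a new x value to predict at
y_hat = beta0_hat + beta1_hat * x_new  # y-hat = b0-hat + b1-hat * x
print(y_hat)  # 39.65
```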

In the past, you may have seen the equation of a line as

\[y = mx + b\]

where \(m\) is the slope and \(b\) is the y-intercept. We will use different notation so that it generalizes to multiple linear regression.

The y-intercept is the value of \(y\) when \(x=0\), and the slope is the change in \(y\) for each 1-unit increase in \(x\) (“rise over run”).
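A quick numeric check of those two interpretations, using a made-up line with intercept 10 and slope 2:

```python
# Made-up line y = 10 + 2x, to verify the interpretations above.
def line(x):
    return 10 + 2 * x  # intercept 10, slope 2 (hypothetical)

print(line(0))            # 10: the y-intercept is the value at x = 0
print(line(6) - line(5))  # 2: each 1-unit increase in x changes y by the slope
```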