6.6 Generalized Linear Models

When the outcome is not continuous, least squares is no longer appropriate, because it is designed for continuous outcomes. However, we can generalize the idea of a linear model to allow for binary or count outcomes. The result is called a generalized linear model (GLM). GLMs extend regression beyond continuous outcomes with Normal errors (Nelder and Wedderburn 1972); they are a broad class of models for outcomes that are continuous, discrete, binary, etc.

A GLM requires a three-part specification:

  1. Distributional assumption for \(Y\)
  2. Systematic component with \(\mathbf{X}\)
  3. Link function to relate \(E(Y)\) to the systematic component
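To make the three parts concrete, here is a minimal sketch for a binary outcome. It assumes a Bernoulli distribution with a logit link, which is a common pairing but only one possible choice. The names `GLMSpec`, `logit`, and `expit`, and the coefficient values, are hypothetical and chosen purely for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable


def logit(mu: float) -> float:
    """Link: maps a mean in (0, 1) to the whole real line."""
    return math.log(mu / (1.0 - mu))


def expit(eta: float) -> float:
    """Inverse link: maps a linear predictor back to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


@dataclass
class GLMSpec:
    distribution: str                       # 1. distributional assumption for Y
    link: Callable[[float], float]          # 3. link function g, with g(E(Y)) = eta
    inverse_link: Callable[[float], float]  #    inverse of g, so E(Y) = g^{-1}(eta)


# Assumed specification: Bernoulli outcome with a logit link.
spec = GLMSpec(distribution="bernoulli", link=logit, inverse_link=expit)

# 2. systematic component: eta = beta_0 + beta_1 * x (made-up coefficients).
beta_0, beta_1 = -1.0, 0.5
x = 2.0
eta = beta_0 + beta_1 * x

mu = spec.inverse_link(eta)  # E(Y) implied by the linear predictor
print(eta, mu)  # 0.0 0.5
```

Keeping the three pieces separate reflects the structure of the model: the distribution, the explanatory variables, and the link can each be changed without touching the others.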

6.6.1 Distributional Assumption

The first step in fitting a GLM is to assume a distribution for the outcome \(Y\).

Many distributions you have learned in probability (normal, Bernoulli, binomial, Poisson) belong to the exponential family of distributions, whose members share a general form and statistical properties. GLMs are limited to this family of distributions.

One important statistical property of the exponential family is that the variance can be written as a scaled function of the mean,

\[Var(Y) = \phi v(\mu)\quad \text{ where } E(Y) = \mu\]

where \(\phi>0\) is a dispersion or scale parameter and \(v(\mu)\) is a variance function of the mean.
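As a quick check of this property, the sketch below computes the mean and variance directly from the probability mass function for two familiar cases: the Bernoulli, where \(v(\mu) = \mu(1-\mu)\) and \(\phi = 1\), and the Poisson, where \(v(\mu) = \mu\) and \(\phi = 1\). For the normal distribution, \(v(\mu) = 1\) and \(\phi = \sigma^2\), so the variance does not depend on the mean. The helper names are illustrative, and the Poisson sum is truncated at an arbitrary cutoff that is more than adequate for moderate means.

```python
import math


def bernoulli_pmf(mu):
    """P(Y = y) for a Bernoulli outcome with mean mu."""
    return {0: 1.0 - mu, 1: mu}


def poisson_pmf(mu, max_k=200):
    """P(Y = k) for k = 0..max_k (truncated; the tail is negligible for moderate mu)."""
    pmf = {}
    p = math.exp(-mu)  # P(Y = 0)
    for k in range(max_k + 1):
        pmf[k] = p
        p *= mu / (k + 1)  # recurrence: P(Y = k + 1) = P(Y = k) * mu / (k + 1)
    return pmf


def mean_and_variance(pmf):
    """Mean and variance computed directly from a probability mass function."""
    mean = sum(y * p for y, p in pmf.items())
    var = sum((y - mean) ** 2 * p for y, p in pmf.items())
    return mean, var


# Variance functions v(mu); the dispersion phi is 1 for both distributions.
VARIANCE_FUNCTIONS = {
    "bernoulli": lambda mu: mu * (1.0 - mu),
    "poisson": lambda mu: mu,
}
PHI = 1.0

for name, pmf_fn, mu in [("bernoulli", bernoulli_pmf, 0.2), ("poisson", poisson_pmf, 3.5)]:
    mean, var = mean_and_variance(pmf_fn(mu))
    implied = PHI * VARIANCE_FUNCTIONS[name](mu)
    print(f"{name}: mean={mean:.4f} var={var:.4f} phi*v(mu)={implied:.4f}")
```

In both cases the variance computed from the distribution agrees with \(\phi v(\mu)\), so once the mean is modeled, the variance is determined up to the dispersion parameter.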

6.6.2 Systematic Component

For a GLM, the mean or a transformed mean can be expressed as a linear combination of explanatory variables, which we’ll denote by \(\eta\):

\[\eta = \beta_0 + \beta_1X_{1}+\beta_2X_{2}+\cdots+\beta_pX_{p} \]

You’ll need to decide which explanatory variables should be used to model the mean. This may include a time variable (e.g., age, time since baseline, etc.) and other unit characteristics that are time-varying or time-invariant. We’ll refer to this as the mean model.
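The linear predictor \(\eta\) can be computed for several observations at once, as in the rough sketch below. The coefficients and the two explanatory variables (a time variable and a time-invariant indicator) are made up for illustration, and `linear_predictor` is a hypothetical helper rather than part of any library.

```python
def linear_predictor(beta, x):
    """Return eta = beta_0 + beta_1 * x_1 + ... + beta_p * x_p for one observation."""
    if len(beta) != len(x) + 1:
        raise ValueError("beta needs one more entry than x (the intercept beta_0)")
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))


# Hypothetical coefficients: intercept, time effect, group effect.
beta = [-1.0, 0.5, 2.0]

# Each row is one observation: [time since baseline, group indicator].
rows = [
    [4.0, 1.0],
    [4.0, 0.0],
    [6.0, 1.0],
]

etas = [linear_predictor(beta, x) for x in rows]
print(etas)  # [3.0, 1.0, 4.0]
```

Adding or dropping an explanatory variable changes only the length of `beta` and of each row, which is what choosing a mean model amounts to in practice.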