6.6 Generalized Linear Models
When the outcome is not continuous, we can’t use a least squares approach, as it is only appropriate for continuous outcomes. However, we can generalize the idea of a linear model to allow for binary or count outcomes. This is called a generalized linear model (GLM). GLMs extend regression beyond continuous outcomes with Normal errors (Nelder and Wedderburn 1972); they are, in fact, a broad class of models for outcomes that are continuous, discrete, binary, etc.
A GLM requires a three-part specification:
- Distributional assumption for \(Y\)
- Systematic component with \(\mathbf{X}\)
- Link function to relate \(E(Y)\) with systematic component
6.6.1 Distributional Assumption
The first step in fitting a GLM is to assume a distribution for the outcome \(Y\).
Many distributions you have learned in probability (Normal, Bernoulli, binomial, Poisson) belong to the exponential family of distributions, which share a general form and statistical properties. GLMs are limited to this family of distributions.
One important statistical property of the exponential family is that the variance can be written as a scaled function of the mean,
\[Var(Y) = \phi v(\mu)\quad \text{ where } E(Y) = \mu\]
Here, \(\phi>0\) is a dispersion or scale parameter and \(v(\mu)\) is the variance function, which depends on the distribution only through its mean.
6.6.2 Systematic Component
For a GLM, the mean or a transformed mean can be expressed as a linear combination of explanatory variables, which we’ll denote as \(\eta\):
\[\eta = \beta_0 + \beta_1X_{1}+\beta_2X_{2}+\cdots+\beta_pX_{p} \]
You’ll need to decide which explanatory variables should be used to model the mean. This may include a time variable (e.g., age, time since baseline, etc.) and other unit characteristics that are time-varying or time-invariant. We’ll refer to this as the mean model.
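To make the linear predictor concrete, here is a minimal sketch in R. The toy data frame `dat` and the coefficient values in `beta` are made up for illustration and do not come from any fitted model.

```r
# Hypothetical toy data: three children with age and maternal smoking status
dat <- data.frame(age = c(7, 8, 9), smoke = c(0, 1, 1))

# Design matrix X: a column of 1s for the intercept, then one column per variable
X <- model.matrix(~ age + smoke, data = dat)

# Made-up coefficients (beta_0, beta_1, beta_2), for illustration only
beta <- c(-1.5, -0.1, 0.4)

# Linear predictor eta = beta_0 + beta_1 * age + beta_2 * smoke, one value per row
eta <- as.vector(X %*% beta)
eta
```

Each element of `eta` is the linear combination for one row of the data, so adding or removing a variable in the formula changes the columns of `X` and therefore the mean model.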
6.6.3 Link Function
Lastly, the chosen link function transforms the mean and links the explanatory variables to that transformed mean.
\[g(\mu) = \eta = \beta_0 + \beta_1X_{1}+\beta_2X_{2}+\cdots+\beta_pX_{p} \]
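This link function, \(g(\cdot)\), maps a mean that is restricted to a limited range (positive for counts, between 0 and 1 for binary outcomes) onto the whole real line, so that we can model it with a linear function of the explanatory variables.
Each distribution in the exponential family has a canonical link function: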
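As a small illustration of why the link is needed, the sketch below uses base R's `plogis()` and `qlogis()` (the inverse logit and the logit) together with `exp()` and `log()`. The values in `eta` are arbitrary and chosen only to show the mapping.

```r
# A linear predictor can be any real number
eta <- c(-3, 0, 3)

# Inverse logit maps eta to a probability strictly between 0 and 1
mu_binary <- plogis(eta)   # exp(eta) / (1 + exp(eta))

# Inverse log link maps eta to a strictly positive mean count
mu_count <- exp(eta)

# Applying the link function g() recovers the linear predictor
qlogis(mu_binary)   # logit: log(mu / (1 - mu))
log(mu_count)       # log link
```

Whatever value the linear predictor takes, the implied mean stays in the valid range for the outcome.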
Normal (linear regression)
- \(v(\mu) = 1\)
- \(g(\mu) = \mu\) (identity)
Bernoulli/Binomial (m=1) (logistic regression)
- \(v(\mu)=\mu(1-\mu)\)
- \(g(\mu) = \log(\mu/(1-\mu))\) (logit)
Binomial
- \(v(\mu)=m\mu(1-\mu)\)
- \(g(\mu) = \log(\mu/(1-\mu))\) (logit)
Poisson (Poisson regression)
- \(v(\mu)=\mu\)
- \(g(\mu) = \log(\mu)\) (log)
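The variance functions and links listed above can be checked against R's built-in family objects. This is only a sketch: the values in `mu` are arbitrary, and `binomial()` is evaluated here for the Bernoulli case (m = 1).

```r
# Arbitrary mean values at which to evaluate the variance functions
mu <- c(0.2, 0.5, 0.8)

# Family objects from the stats package
fams <- list(normal = gaussian(), bernoulli = binomial(), poisson = poisson())

# Name of the default link stored in each family object
links <- sapply(fams, function(f) f$link)
links

# Variance function v(mu) evaluated at mu, one column per family
v <- sapply(fams, function(f) f$variance(mu))
v
```

For these three families, the default link in R is the canonical one.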
For the Six City Study, we can fit a model in R to predict whether or not a child has respiratory issues as a function of age and maternal smoking, ignoring the repeated measures on each child. Notice that we need to specify the mean model using the formula notation `resp ~ age + smoke`, the family of the distribution we assume for our outcome, `family = binomial`, and the link function, `link = 'logit'`, that we use to connect the linear model to the mean. With this set of assumptions, we are fitting a logistic regression model.
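A call with these three pieces might look like the sketch below. The Six City data are not reproduced here, so the example simulates a stand-in data frame: the name `sim`, the sample size, and the coefficients used to generate `resp` are all assumptions made for illustration, and the estimates say nothing about the real study.

```r
set.seed(1)

# Simulated stand-in data (hypothetical, NOT the Six City Study)
n <- 500
sim <- data.frame(
  age   = sample(7:10, n, replace = TRUE),   # age in years
  smoke = rbinom(n, 1, 0.3)                  # 1 = mother smokes
)

# Made-up coefficients used only to generate a binary outcome
p <- plogis(-1 - 0.1 * sim$age + 0.5 * sim$smoke)
sim$resp <- rbinom(n, 1, p)

# Mean model, distribution family, and link function in one call
fit <- glm(resp ~ age + smoke,
           family = binomial(link = "logit"),
           data = sim)

summary(fit)$coefficients   # estimates on the log-odds scale
exp(coef(fit))              # exponentiated coefficients (odds ratios)
```

Because this ignores the repeated measures on each child, the standard errors from such a fit treat all observations as independent.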
Please see my Introduction to Statistical Models Notes to refresh your memory of interpreting logistic regression models.