6.6 Generalized Linear Models

When the outcome is not continuous, we can’t use a least squares approach, as it is only appropriate for continuous outcomes. However, we can generalize the idea of a linear model to allow for binary or count outcomes. This is called a generalized linear model (GLM). GLMs extend regression beyond continuous outcomes with Normal errors (Nelder and Wedderburn 1972); they form a broad class of models for outcomes that are continuous, discrete, binary, etc.

A GLM requires a three-part specification:

Distributional assumption for \(Y\)
Systematic component with \(\mathbf{X}\)
Link function to relate \(E(Y)\) to the systematic component

6.6.1 Distributional Assumption

The first step in fitting a GLM is to assume a distribution for the outcome \(Y\).

Many distributions you have learned in probability (Normal, Bernoulli, binomial, Poisson) belong to the exponential family of distributions, whose members share a general form and statistical properties. GLMs are limited to this family of distributions.

One important statistical property of the exponential family is that the variance can be written as a scaled function of the mean,

\[Var(Y) = \phi v(\mu)\quad \text{ where } E(Y) = \mu\]

where \(\phi>0\) is a dispersion or scale parameter and \(v(\mu)\) is a variance function of the mean.
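As a small illustration of the variance function, the family objects in base R store \(v(\mu)\) as a function called variance. The sketch below evaluates it at a few arbitrary mean values chosen only for illustration; it does not fit any model.

```r
# Arbitrary mean values, chosen only for illustration
mu <- c(0.2, 0.5, 0.8)

# Each family object carries its variance function v(mu)
v_normal   <- gaussian()$variance(mu)  # v(mu) = 1 for every mu
v_binomial <- binomial()$variance(mu)  # v(mu) = mu * (1 - mu)
v_poisson  <- poisson()$variance(mu)   # v(mu) = mu

print(cbind(mu, v_normal, v_binomial, v_poisson))
```

The dispersion parameter \(\phi\) is not part of these functions; it scales \(v(\mu)\) separately.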

6.6.2 Systematic Component

For a GLM, the mean or a transformed mean can be expressed as a linear combination of explanatory variables, which we’ll denote by \(\eta\):

\[\eta = \beta_0 + \beta_1X_{1}+\beta_2X_{2}+\cdots+\beta_pX_{p} \]

You’ll need to decide which explanatory variables should be used to model the mean. This may include a time variable (e.g., age, time since baseline, etc.) and other unit characteristics that are time-varying or time-invariant. We’ll refer to this as the mean model.
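To make the linear predictor concrete, here is a minimal sketch that computes \(\eta\) by hand. The data frame dat and the coefficient values in beta are hypothetical, made up for illustration; they are not estimates from any data set.

```r
# Hypothetical data: three observations with an age variable and a 0/1 smoking indicator
dat <- data.frame(age = c(-2, 0, 1), smoke = c(0, 1, 1))

# Hypothetical coefficients (beta_0, beta_1, beta_2), chosen for illustration only
beta <- c(-1.5, -0.1, 0.3)

# Design matrix: a column of 1s for the intercept, then age and smoke
X <- model.matrix(~ age + smoke, data = dat)

# Linear predictor: eta = beta_0 + beta_1 * age + beta_2 * smoke
eta <- drop(X %*% beta)
print(eta)
```

The formula ~ age + smoke plays the same role here as the right-hand side of a model formula passed to glm().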

6.6.3 Link Function

Lastly, the chosen link function transforms the mean and links the explanatory variables to that transformed mean.

\[g(\mu) = \eta = \beta_0 + \beta_1X_{1}+\beta_2X_{2}+\cdots+\beta_pX_{p} \]

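The link function, \(g(\cdot)\), lets us use a linear function, which can take any real value, to model a mean that is restricted, such as the non-negative mean of a count or the probability of a binary outcome.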

There is a canonical link function for each distribution in the exponential family:

Normal (linear regression)

\(v(\mu) = 1\)
\(g(\mu) = \mu\) (identity)

Bernoulli/Binomial (m=1) (logistic regression)

\(v(\mu)=\mu(1-\mu)\)
\(g(\mu) = \log(\mu/(1-\mu))\) (logit)

Binomial

\(v(\mu)=m\mu(1-\mu)\)
\(g(\mu) = \log(\mu/(1-\mu))\) (logit)

Poisson (Poisson regression)

\(v(\mu)=\mu\)
\(g(\mu) = \log(\mu)\) (log)
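The links listed above can be explored directly, since base R family objects store the link as linkfun and its inverse as linkinv. This is a minimal sketch with arbitrary values of \(\eta\), intended only to show that the inverse link maps any real number back to a valid mean.

```r
# Arbitrary values of the linear predictor; any real numbers are allowed
eta <- c(-2, 0, 2)

# Inverse logit maps eta into the interval (0, 1), a valid probability
p <- binomial(link = "logit")$linkinv(eta)

# Inverse log (the exponential) maps eta to a positive number, a valid count mean
lambda <- poisson(link = "log")$linkinv(eta)

# Applying the link function to the mean recovers the linear predictor
eta_from_p      <- binomial(link = "logit")$linkfun(p)
eta_from_lambda <- poisson(link = "log")$linkfun(lambda)

print(cbind(eta, p, lambda, eta_from_p, eta_from_lambda))
```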

For the Six Cities Study, we can fit a model to predict whether or not a child has respiratory issues as a function of age and maternal smoking, ignoring the repeated measures on each child, with the following R code. Notice that we need to specify the mean model using the formula notation resp ~ age + smoke, the distribution we assume for our outcome with family = binomial, and the link function with link = 'logit', which connects the linear model to the mean. With this set of assumptions, we are fitting a logistic regression model.

summary(glm(resp ~ age + smoke, data = ohio, family = binomial(link = 'logit')))
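If the ohio data are not at hand, the same call can be tried on simulated data. The sketch below is only an illustration: the data frame sim, the sample size, and the coefficients used to generate the outcome are all hypothetical and are not taken from the study. It also shows how predictions on the link scale and the response scale relate through the inverse logit.

```r
set.seed(1)

# Hypothetical simulated data with the same variable names as the model above
n   <- 500
sim <- data.frame(
  age   = sample(-2:1, n, replace = TRUE),  # hypothetical age values
  smoke = rbinom(n, 1, 0.4)                 # hypothetical smoking indicator
)

# Hypothetical coefficients used only to generate a binary outcome
eta_true <- -1.5 - 0.1 * sim$age + 0.3 * sim$smoke
sim$resp <- rbinom(n, 1, plogis(eta_true))

# Same three-part specification: mean model, distribution, link
fit <- glm(resp ~ age + smoke, data = sim, family = binomial(link = "logit"))

coef(fit)       # estimates on the log-odds (link) scale
exp(coef(fit))  # exponentiated estimates: odds for the intercept, odds ratios for slopes

# Predictions on the link scale (eta) and the response scale (probability)
eta_hat <- predict(fit, type = "link")
p_hat   <- predict(fit, type = "response")
```

Because the data are simulated independently, this sketch has no repeated measures to ignore, unlike the model fit to the study data.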

Please see my Introduction to Statistical Models Notes to refresh your memory of interpreting logistic regression models.