6.7 Marginal Models

A marginal model for longitudinal data has a four-part specification; the first three parts are similar to those of a GLM.

6.7.1 Model Specification

  1. A distributional assumption about \(Y_{ij}\). We assume a distribution from the exponential family. We will only use this assumption to determine the mean-variance relationship.

  2. A systematic component, which is the mean model, \(\eta_{ij} = \mathbf{x}_{ij}^T\boldsymbol\beta\), which gets connected to the mean through

  3. a link function: The conditional expectation \(E(Y_{ij}|x_{ij}) = \mu_{ij}\) is assumed to depend on explanatory variables through a given link function (similar to GLM),

\[g(\mu_{ij}) = \eta_{ij} = \mathbf{x}_{ij}^T\boldsymbol\beta\]

Note: This implies that given \(\mathbf{x}_{ij}\), there is no dependence of \(Y_{ij}\) on \(\mathbf{x}_{ik}\) for \(j\not=k\) (warning: this might not hold if \(Y_{ij}\) predicts \(\mathbf{x}_{i,j+1}\)).

The conditional variance of each \(Y_{ij}\), given explanatory variables, depends on the mean according to

\[Var(Y_{ij}|\mathbf{x}_{ij}) = \phi v(\mu_{ij}) \]

Note: \(\phi\) is a dispersion parameter that scales the variance up or down.

  4. Covariance Structure: The conditional within-subject association among repeated measures is assumed to be a function of a set of association parameters \(\boldsymbol\alpha\) (and the means \(\mu_{ij}\)). We choose a working correlation structure for \(Cor(\mathbf{Y}_i)\) from our familiar structures: Compound Symmetry/Exchangeable, Exponential/AR1, Gaussian, Spherical, etc. See Autocorrelation Function in Chapter 3 for examples.

Note: We may use the word “association” rather than correlation so that it applies in cases where the outcome is not continuous.

Based on the model specification, the covariance matrix for \(\mathbf{Y}_i\) is

\[\mathbf{V}_i = \mathbf{A}_i^{1/2}Cor(\mathbf{Y}_i)\mathbf{A}_i^{1/2} \]

where \(\mathbf{A}_i\) is a diagonal matrix with \(Var(Y_{ij}|X_{ij}) = \phi v(\mu_{ij})\) along the diagonal. We use the term working correlation structure to acknowledge our uncertainty about the assumed model for the within-subject associations.
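To make this concrete, the R sketch below builds \(\mathbf{V}_i\) for one subject with four observations, using a Poisson-type variance function \(v(\mu) = \mu\) and an exchangeable working correlation. All numerical values (the means, \(\phi\), and \(\alpha\)) are made up for illustration.

# Build V_i = A^{1/2} Cor(Y_i) A^{1/2} for one subject (illustrative values)
mu_i <- c(2.0, 2.5, 3.1, 3.8)           # fitted means for subject i (hypothetical)
phi <- 1.2                              # dispersion parameter (hypothetical)
alpha <- 0.4                            # exchangeable correlation (hypothetical)
m <- length(mu_i)
A <- diag(phi * mu_i)                   # Var(Y_ij|x_ij) = phi * v(mu_ij), with v(mu) = mu
R <- matrix(alpha, m, m); diag(R) <- 1  # exchangeable working correlation
(V_i <- sqrt(A) %*% R %*% sqrt(A))      # working covariance matrix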

6.7.2 Interpretation

We typically discuss how a 1-unit increase in an explanatory variable impacts the mean, keeping all other variables constant.

In a marginal model, we explicitly model the overall or marginal mean, \(E(Y_{ij}|X_{ij})\).

For a continuous outcome (with identity link function), the expected response when \(X_{ij} = x\) is

\[E(Y_{ij}|X_{ij}=x) = \beta_0 +\beta_1x \]

The expected response when \(X_{ij} = x+1\) is

\[E(Y_{ij}|X_{ij}=x+1) = \beta_0 +\beta_1(x+1) \]

The difference is \[E(Y_{ij}|X_{ij}=x+1) - E(Y_{ij}|X_{ij}=x) = \beta_0 +\beta_1(x+1) - (\beta_0 +\beta_1x) = \beta_1 \]

Therefore, in the context of a marginal model, \(\beta_1\) has the interpretation of a population mean change.

A 1-unit increase in \(X\) leads to a \(\beta_1\) increase in the overall mean response, keeping all other variables constant.
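As a quick numerical check, the R snippet below plugs made-up coefficient values into the two expected responses and confirms that their difference is exactly \(\beta_1\).

# Difference in marginal means at x + 1 vs. x (hypothetical coefficients)
beta0 <- 1.5; beta1 <- 0.8
x <- 3
(beta0 + beta1 * (x + 1)) - (beta0 + beta1 * x)  # returns 0.8 = beta1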

6.7.3 Estimation

To estimate the parameters, we will use generalized estimating equations (GEE). We want to minimize the following objective function (a standardized sum of squared residuals),

\[\sum^n_{i=1}(\mathbf{Y}_i - \boldsymbol\mu_i)^T\mathbf{V}_i^{-1}(\mathbf{Y}_i - \boldsymbol\mu_i) \]

treating \(\mathbf{V}_i\) as known and where \(\boldsymbol\mu_i\) is a vector of means with elements \(\mu_{ij} = g^{-1}(\mathbf{x}_{ij}^T\boldsymbol\beta)\).

Using calculus, it can be shown that if a minimum exists, it must solve the following generalized estimating equations,

\[\sum^n_{i=1}\mathbf{D}^T_i\mathbf{V}_i^{-1}(\mathbf{Y}_i - \boldsymbol\mu_i) = 0\]

where \(\mathbf{D}_i = \partial\boldsymbol\mu_i/\partial\boldsymbol\beta\) is the gradient matrix.

Depending on the link function, there may or may not be a closed-form solution. If not, the solution requires an iterative algorithm.

Because the estimating equations depend on \(\boldsymbol\beta\), \(\boldsymbol\alpha\) (parameters for the working correlation matrix), and \(\phi\) (variance scalar), the following iterative two-stage estimation procedure is required:

Step 1. Given current estimates of \(\boldsymbol\alpha\) and \(\phi\), \(V_i\) is estimated, and an updated estimate of \(\boldsymbol\beta\) is obtained as the solution to the generalized estimating equations.

Step 2. Given the current estimate of \(\boldsymbol\beta\), updated estimates of \(\boldsymbol\alpha\) and \(\phi\) are obtained from the standardized residuals

\[e_{ij} = \frac{Y_{ij} - \hat{\mu}_{ij}}{\sqrt{v(\hat{\mu}_{ij})}} \]

Step 3. We iterate between Steps 1 and 2 until convergence has been achieved (estimates for \(\boldsymbol\beta\), \(\boldsymbol\alpha\), and \(\phi\) don’t change).

Starting or initial estimates of \(\boldsymbol\beta\) can be obtained assuming independence.
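To make the two-stage procedure concrete, here is a minimal R sketch for the special case of an identity link, constant variance function \(v(\mu) = 1\), and an exchangeable working correlation. The function name and the simple moment estimators for \(\phi\) and \(\alpha\) are illustrative choices, not the internals of any particular package; in practice, use the packages described below.

# Minimal two-stage GEE sketch: identity link, v(mu) = 1, exchangeable
# working correlation (illustration only)
gee_identity_exch <- function(y, X, id, tol = 1e-8, maxit = 50) {
  ids <- unique(id)
  p <- ncol(X)
  beta <- solve(crossprod(X), crossprod(X, y))  # starting estimate: independence (OLS)
  for (it in 1:maxit) {
    beta_old <- beta
    # Step 2: update phi and alpha from standardized residuals
    e <- y - X %*% beta
    phi <- sum(e^2) / (length(y) - p)
    num <- 0; npairs <- 0
    for (i in ids) {
      ei <- e[id == i] / sqrt(phi)
      m <- length(ei)
      num <- num + (sum(ei)^2 - sum(ei^2)) / 2  # sum of e_ij * e_ik over j < k
      npairs <- npairs + m * (m - 1) / 2
    }
    alpha <- num / npairs
    # Step 1: update beta by solving sum_i D_i' V_i^{-1} (Y_i - mu_i) = 0;
    # with the identity link, D_i = X_i, so this is a weighted least squares solve
    A <- matrix(0, p, p); b <- rep(0, p)
    for (i in ids) {
      Xi <- X[id == i, , drop = FALSE]
      yi <- y[id == i]
      m  <- nrow(Xi)
      Ri <- matrix(alpha, m, m); diag(Ri) <- 1  # exchangeable working correlation
      Vinv <- solve(phi * Ri)                   # V_i = phi * R_i since v(mu) = 1
      A <- A + t(Xi) %*% Vinv %*% Xi
      b <- b + t(Xi) %*% Vinv %*% yi
    }
    beta <- solve(A, b)
    # Step 3: stop when the estimates no longer change
    if (max(abs(beta - beta_old)) < tol) break
  }
  list(beta = drop(beta), alpha = alpha, phi = phi)
}

For example, gee_identity_exch(y, cbind(1, x), id) would fit an intercept-and-slope marginal model to a response y with cluster labels id (all hypothetical objects).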

Note: In estimating the model, we only use our assumption about the mean model,

\[g(\mu_{ij}) = \eta_{ij} = \mathbf{x}_{ij}^T\boldsymbol\beta\]

and the covariance model of the observations,

\[\mathbf{V}_i = \mathbf{A}_i^{1/2}Cor(\mathbf{Y}_i)\mathbf{A}_i^{1/2} \]

where \(\mathbf{A}_i\) is a diagonal matrix with \(Var(Y_{ij}|X_{ij}) = \phi v(\mu_{ij})\) along the diagonal.

6.7.3.1 R: Three GEE R Packages

There are three R packages that implement GEE: gee, geepack, and geeM.

We will use geeM when we have large data sets because it is designed to be efficient on large data. The geepack package is nice because it has an anova method to help compare models. The gee package prints both naive and robust standard errors in its summary output.

The syntax for the function geem() in geeM package is

geem(formula, id, waves = NULL, data, family = gaussian, corstr = "independence", Mv = 1)
  • formula: Symbolic description of the model to be fitted
  • id: Vector that identifies the clusters
  • data: Optional dataframe
  • family: Description of the error distribution and link function
  • corstr: A character string specifying the correlation structure. Allowed structures are: “independence”, “exchangeable” (equal correlation), “ar1” (exponential decay), “m-dependent” (m-diagonal), “unstructured”, “fixed”, and “userdefined”.
  • Mv: for “m-dependent”, the value for m.
require(geeM)
summary(geem(resp ~ age + smoke, id = id, data = ohio, family = binomial(link = 'logit'), corstr = 'ar1'))

The syntax for the function geeglm() in geepack package is

geeglm(formula, family = gaussian, data, id, zcor = NULL, corstr, std.err = 'san.se')
  • formula: Symbolic description of the model to be fitted
  • family: Description of the error distribution and link function
  • data: Optional dataframe
  • id: Vector that identifies the clusters
  • corstr: A character string specifying the correlation structure. The following are permitted: “independence”, “exchangeable”, “ar1”, “unstructured” and “userdefined”
  • zcor: Enter a user defined correlation structure
require(geepack)
summary(geeglm(resp ~ age + smoke, id = id, data = ohio, family = binomial(link = 'logit'), corstr = 'ar1'))

The syntax for the function gee() in gee package is

gee(formula, id, data, family = gaussian, corstr = 'independence', Mv)
  • formula: Symbolic description of the model to be fitted
  • family: Description of the error distribution and link function
  • data: Optional dataframe
  • id: Vector that identifies the clusters
  • corstr: Working correlation structure: “independence”, “exchangeable”, “AR-M”, “unstructured”
  • Mv: order of AR correlation (AR1: Mv = 1)
require(gee)
summary(gee(resp ~ age + smoke, id = id, data = ohio, family = binomial(link = 'logit'), corstr = 'AR-M', Mv = 1))

6.7.3.2 Properties of Estimators

We acknowledge that our working covariance matrix \(\mathbf{V}_i\) only approximates the true underlying covariance matrix for \(\mathbf{Y}_i\); it is almost certainly wrong to some degree.

Despite incorrectly specifying the correlation structure,

  • \(\hat{\boldsymbol\beta}\) is consistent (meaning that with high probability \(\hat{\boldsymbol\beta}\) is close to the true \(\boldsymbol\beta\) for very large samples) whether or not the within-subject associations are correctly modeled.

  • In large samples, the sampling distribution of \(\hat{\boldsymbol\beta}\) is multivariate normal with mean \(\boldsymbol\beta\) and the covariance is a sandwich, \(Cov(\hat{\boldsymbol\beta}) = B^{-1}MB^{-1}\) where the bread \(B\) is defined as \[B = \sum^n_{i=1} \mathbf{D}_i^T\mathbf{V}_i^{-1}\mathbf{D}_i\] and the middle (meat or cheese or filling) \(M\) is defined as \[M = \sum^n_{i=1} \mathbf{D}_i^T\mathbf{V}_i^{-1}Cov(\mathbf{Y}_i)\mathbf{V}_i^{-1}\mathbf{D}_i\]

We can plug in estimates of the quantities and get

\[\widehat{Cov}(\hat{\boldsymbol\beta}) = \hat{B}^{-1}\hat{M}\hat{B}^{-1}\] where \[\hat{B} = \sum^n_{i=1} \hat{\mathbf{D}}_i^T\hat{\mathbf{V}}_i^{-1}\hat{\mathbf{D}}_i\] and \[\hat{M} = \sum^n_{i=1} \hat{\mathbf{D}}_i^T\hat{\mathbf{V}}_i^{-1}\widehat{Cov}(Y_i)\hat{\mathbf{V}}_i^{-1}\hat{\mathbf{D}}_i\] and \(\widehat{Cov}(\mathbf{Y}_i) = (\mathbf{Y}_i - \hat{\boldsymbol\mu}_i)(\mathbf{Y}_i-\hat{\boldsymbol\mu}_i)^T\)

This is the empirical or sandwich robust estimator of the standard errors. It is valid even if we are wrong about the correlation structure.

Fun fact: If we happen to choose the right correlation/covariance structure and \(\mathbf{V}_i = \boldsymbol\Sigma_i\), where \(\boldsymbol\Sigma_i = Cov(\mathbf{Y}_i)\), then \(Cov(\hat{\boldsymbol\beta}) = B^{-1}\).
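In geeM, you can compare the two: the fitted object stores a model-based (naive) covariance and the sandwich covariance (as the components naiv.var and var per the geeM documentation; check names(geemod) if needed). A quick sketch, reusing a fitted geem object such as the geemod fit in the next section:

# Model-based vs. robust (sandwich) standard errors from a geem fit
sqrt(diag(geemod$naiv.var))  # model-based SEs: trustworthy only if Cor(Y_i) is right
sqrt(diag(geemod$var))       # robust sandwich SEs: valid even if it is not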

Since we model the conditional expectation \(E(Y_{ij}|\mathbf{x}_{ij}) = \mu_{ij}\) with

\[g(\mu_{ij}) = \mathbf{x}_{ij}^T\boldsymbol\beta,\]

we can only make overall population interpretations regarding mean response at a given value of \(\mathbf{x}_{ij}\).

We can compare two sets of covariate values, \(\mathbf{x}^*\) and \(\mathbf{x}\), and see how they impact the population mean via the link function \(g(\mu)\),

\[g(\mu^*) - g(\mu) = \mathbf{x}^{*T}\boldsymbol\beta - \mathbf{x}^{T}\boldsymbol\beta\]
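For example, with the logit link this difference is a log odds ratio, so exponentiating converts it to an odds ratio. A tiny sketch with a hypothetical coefficient value:

# With a logit link, if x* and x differ by 1 unit in the kth variable,
# g(mu*) - g(mu) = beta_k is a log odds ratio (beta_k value hypothetical)
beta_k <- 0.25
exp(beta_k)  # odds ratio of about 1.28 for a 1-unit increase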

6.7.4 Model Selection Tools and Diagnostics

The model selection tools we use will be similar to those used in Stat 155. To refresh your memory of how hypothesis testing can be used for model selection, please see my Introduction to Statistical Models Notes.

6.7.4.1 Hypothesis Testing for Coefficients (one at a time)

This section is similar to the t-tests in Stat 155.

To decide which explanatory variables should be in the model, we can test whether \(\beta_k=0\) for some \(k\). If a slope coefficient equals zero, the associated variable is essentially not contributing to the model.

In marginal models (GEE), the estimated covariance matrix for \(\hat{\boldsymbol\beta}\) is given by the robust sandwich estimator. The standard error for \(\hat{\beta}_k\) is the square root of the diagonal element of this matrix corresponding to \(\hat{\beta}_k\); it is reported as Robust SE in the summary output.

To test the hypothesis \(H_0: \beta_k=0\) vs. \(H_A: \beta_k\not=0\), we can calculate a z-statistic (referred to as a Wald statistic),

\[z = \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)} \]

With GEE, the geem() function provides Wald statistics and associated p-values for testing \(H_0: \beta_k= 0\). These p-values come from a Normal distribution because that is the asymptotic distribution of these estimators.

geemod <- geem(resp ~ age + smoke, id = id, data = ohio, family = binomial(link = 'logit'), corstr = 'exchangeable')
summary(geemod) # look for the Robust SE and wald columns
(z <- geemod$beta/sqrt(diag(geemod$var))) # Wald z-statistics by hand
2 * (1 - pnorm(abs(z))) # two-sided p-values from the Normal approximation

6.7.4.2 Hypothesis Testing for Coefficients (many at one time)

This section is similar to the nested F-tests in Stat 155.

To test a more involved hypothesis \(H_0: \mathbf{L}\boldsymbol\beta=0\) (to test whether multiple slopes are zero or a linear combination of slopes is zero), we can calculate a Wald statistic (a squared z-statistic),

\[W^2 = (\mathbf{L}\hat{\boldsymbol\beta})^T(\mathbf{L}\widehat{Cov}(\hat{\boldsymbol\beta})\mathbf{L}^T)^{-1}(\mathbf{L}\hat{\boldsymbol\beta}) \]

We assume the sampling distribution is approximately \(\chi^2\) with df = # of rows of \(\mathbf{L}\) to calculate p-values (as long as \(n\) is large).

This model selection tool is useful when you have a categorical variable with more than two categories. You may want to check whether the entire variable should be excluded or whether a particular category has an impact on the model.

With the data example, let's look at smoking. It is only a binary categorical variable, but let's run the hypothesis test with the example code from above. In this case, we let the matrix \(L\) be a \(1\times 3\) matrix with 0's everywhere except in the 3rd column, so that we test \(H_0: \beta_2 = 0\). Even though we use a different test statistic, we end up with the same p-value as the approach above.

summary(geemod)

b = geemod$beta
W = geemod$var

(L = matrix(c(0,0,1),nrow=1))

L%*%b # Lb estimate
(se = sqrt(diag(L%*%W%*%t(L)))) # Robust SE for Lb

# 95% Confidence Interval (using Asymptotic Normality)
L%*%b - 1.96*se
L%*%b + 1.96*se

# Hypothesis Testing

w2 <- as.numeric( t(L%*%b) %*% solve(L %*% W %*% t(L))%*% (L%*%b)) ## should be approximately chi squared
1 - pchisq(w2, df = nrow(L)) # p-value

So the p-value, the probability of getting a test statistic as extreme or more extreme assuming the null hypothesis is true, is around 13%. We do not have enough evidence to reject the null hypothesis, which leads us to believe that smoking could be removed from the model.

To test whether \(H_0: \beta_1 = \beta_2 = 0\), we let the matrix \(L\) be a \(2\times 3\) matrix with 0's everywhere except in the 2nd column of row 1 and the 3rd column of row 2.

summary(geemod)

b = geemod$beta
W = geemod$var

(L = matrix(c(0,1,0,0,0,1),nrow=2,byrow=TRUE))

L%*%b #Lb estimate
(se = sqrt(diag(L%*%W%*%t(L)))) # Robust SE for Lb

# 95% Confidence Interval (using Asymptotic Normality)
L%*%b - 1.96*se
L%*%b + 1.96*se

# Hypothesis Testing

w2 <- as.numeric( t(L%*%b) %*% solve(L %*% W %*% t(L))%*% (L%*%b)) ## should be approximately chi squared
1 - pchisq(w2, df = nrow(L)) #p-value

We have evidence to suggest that at least one of the variables, age or smoking, has a statistically significant effect on the outcome. As we saw above, we know it must be age, but the model with age and smoking is better than a model with no explanatory variables.

6.7.4.3 Diagnostics

For any linear model with a quantitative outcome and at least one quantitative explanatory variable, we should check to see if the residuals have a pattern.

Ideally, a plot of the residuals (y-axis) against the fitted values or a quantitative explanatory variable (x-axis) should show no obvious pattern. See some example code below.

# mod is a fitted marginal model, e.g., a geem() fit with an identity link
resid = mod$y - predict(mod) # observed values minus fitted values
plot(predict(mod), resid) # look for systematic patterns

If there is a pattern, then we are systematically over or under-predicting, and we can use that information to incorporate non-linearity to improve the model.