Checkpoint 5

There is a template RMarkdown file called checkpoint-5.Rmd to start from in the Github repo created through Github Classroom.

For this checkpoint, I want you to work together with a partner.

Github Setup

To create a shared repository (for you, your partner, and me), go to https://classroom.github.com/a/Y3dqd2Ti. One partner should create the team; name it long-Name1-Name2, replacing Name1 and Name2 with your names.

Models

You will work on this in-class activity for the checkpoint. It is due next Tuesday night; push whatever your team has to Github by then. It will also help you make progress on the longitudinal project before the break.

Let’s do a quick review:

Marginal Models

For models fit with GEE, we need to specify a working correlation structure, but the robust standard errors are valid even if we get that structure wrong.
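
For instance, here is a minimal sketch of that idea: fit the same mean model under two working correlation structures and compare the robust standard errors. The data frame dat_long and the variables outcome, time, treat, and id are placeholders for whatever your cleaned long-format data uses.

library(geeM)
# Sketch only: swap in the names from your own cleaned long-format data
gee_indep <- geem(outcome ~ time + treat, id = id, data = dat_long, corstr = "independence")
gee_exch  <- geem(outcome ~ time + treat, id = id, data = dat_long, corstr = "exchangeable")
summary(gee_indep) # the Robust SE column is valid under either working structure
summary(gee_exch)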

Mixed Effects Models

We can write a linear mixed effects model in terms of our outcome vector,

\[\mathbf{Y}_{i} = \mathbf{X}_{i}\boldsymbol\beta + \mathbf{Z}_i\mathbf{b}_i + \boldsymbol\epsilon_i\]

\[ = \text{Fixed effects} + \text{Random effects} + \text{Error}\]

where \(\mathbf{X}_i\) is the design matrix for the fixed effects, \(\mathbf{Z}_i\) is the design matrix for the random effects (a subset of columns of \(\mathbf{X}_i\)), \(\boldsymbol\epsilon_i\sim N(0,\boldsymbol\Sigma)\), \(\mathbf{b}_i\sim N(0,\mathbf{G})\), and \(\mathbf{b}_i\) and \(\boldsymbol\epsilon_i\) are independent.

Thus,

\[\mathbf{Y}_{i}\sim \mathcal{N}(\mathbf{X}_{i}\boldsymbol\beta,\mathbf{Z}_i\mathbf{G}\mathbf{Z}_i^T + \boldsymbol\Sigma) \]
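
To make this concrete: for a random intercept model, \(\mathbf{Z}_i\) is a single column of ones and \(\mathbf{G}\) is the scalar \(\sigma^2_b\), so with \(\boldsymbol\Sigma = \sigma^2\mathbf{I}\) every pair of observations on the same subject has covariance \(\sigma^2_b\) and each observation has variance \(\sigma^2_b + \sigma^2\). Here is a minimal sketch of fitting such a model and reading off those variance components (dat_long, outcome, time, treat, and id are placeholder names, as above).

library(lme4)
# Sketch only: swap in the names from your own cleaned long-format data
lmer_ri <- lmer(outcome ~ time + treat + (1 | id), data = dat_long)
VarCorr(lmer_ri) # random intercept SD (sigma_b) and residual SD (sigma)
sigma_b2 <- as.numeric(VarCorr(lmer_ri)$id[1, 1]) # sigma_b^2, the 1x1 G matrix
sigma2 <- sigma(lmer_ri)^2 # sigma^2, the diagonal of Sigma
sigma_b2 / (sigma_b2 + sigma2) # implied correlation between observations on the same subject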

BEFORE YOU BEGIN: Decide whether you and your partner want to use Mixed Effects or GEE models.

  1. Based on your previous work with this data set, including your literature search, come up with at least 3 candidate models that you want to fit.
  • GEE: choose x’s and/or working correlation structures, describe them in words, and then fit them with geeM::geem().
  • Mixed Effects: choose x’s and start with random intercept models, describe them in words, and then fit them with lme4::lmer().
source('Cleaning.R') # Open Cleaning.R to see the data cleaning steps to create a long format dataset
library(geeM)
library(lme4)

ANSWER:

  • Model 1:

  • Model 2:

  • Model 3:

  2. Write bullet points about what you learn about your outcome and the modeling process from fitting these three models.

ANSWER:

Model Selection

Hypothesis Testing for Fixed Coefficients (one at a time)

We can use this model selection tool for either GEE or Mixed Effects Models.

In order to decide which fixed effects should be in the model, we can test whether \(\beta_k=0\) for some \(k\).

In marginal models (GEE), the estimated covariance matrix for \(\hat{\boldsymbol\beta}\) is given by the robust sandwich estimator. The robust SE for \(\hat{\beta}_k\) is the square root of the diagonal entry of this matrix corresponding to \(\hat{\beta}_k\); this is what is reported in the summary output as Robust SE.

In mixed effects models, the estimated covariance matrix for \(\hat{\boldsymbol\beta}\) is based on the assumption that the random effects structure is correctly specified.

To test the hypothesis \(H_0: \beta_k=0\) vs. \(H_A: \beta_k\not=0\), we can calculate a z-statistic,

\[z = \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)} \]

With GEE, the geem() function provides z statistics (labeled as wald) and associated p-values for testing \(H_0: \beta_k= 0\). These p-values come from a Normal distribution because we know that is the asymptotic (as the sample size, \(n\rightarrow \infty\)) distribution for these estimates.

With mixed effects, the lmer() function provides z statistics (labeled as t value) for the fixed effects, but not p-values, for testing \(H_0: \beta_k= 0\). We can compare the t values to a cutoff of +/- 2 because we know that the asymptotic (as the sample size, \(n\rightarrow \infty\)) distribution for these estimates is Normal if the model is correct.

#gee_mod1 %>% summary()
#lmer_mod1 %>% summary()
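
As a sketch, you can also compute these statistics directly from the fitted objects (assuming gee_mod1 is a geem fit and lmer_mod1 is an lmer fit, as in the commented code above); the p-values below use the Normal approximation.

# GEE: coefficient estimates and robust (sandwich) SEs
b_gee <- gee_mod1$beta
se_gee <- sqrt(diag(gee_mod1$var))
z_gee <- b_gee / se_gee
2 * (1 - pnorm(abs(z_gee))) # two-sided p-values

# Mixed effects: fixed effect estimates and model-based SEs
b_lmer <- lme4::fixef(lmer_mod1)
se_lmer <- sqrt(diag(as.matrix(vcov(lmer_mod1))))
b_lmer / se_lmer # t values; compare to the +/- 2 cutoff
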
  3. Looking back on the models you fit, which model would you choose using this model selection tool and why?

ANSWER:

Hypothesis Testing for Fixed Coefficients (many at one time)

To test a more involved hypothesis \(H_0: \mathbf{L}\boldsymbol\beta=0\) (to test whether multiple slopes are zero or a linear combo of slopes is zero), we can calculate a Wald statistic,

\[W^2 = (\mathbf{L}\hat{\boldsymbol\beta})^T(\mathbf{L}\widehat{Cov}(\hat{\boldsymbol\beta})\mathbf{L}^T)^{-1}(\mathbf{L}\hat{\boldsymbol\beta}) \]

Then we assume the sampling distribution is approximately \(\chi^2\) with df = # of rows of \(\mathbf{L}\) to calculate p-values (as long as \(n\) is large).

For both types of models, here is some example code where we have four beta parameters, \(\beta = (\beta_0, \beta_1,\beta_2,\beta_3)\), and we want to test whether \(\beta_2 = \beta_3 = 0\) because they are indicators for one categorical variable. We do that by defining

\[L = \left(\begin{array}{cccc} 0&0&1&0\\ 0&0&0&1\end{array}\right)\]

because

\[L\beta = \left(\begin{array}{c} \beta_2\\\beta_3\end{array}\right)\]

# For a GEE model (geem object):
b = gee_mod1$beta # estimated coefficients
W = gee_mod1$var # robust (sandwich) covariance matrix

# For a mixed effects model (lmer object):
b = lme4::fixef(lmer_mod1) # estimated fixed effects
W = vcov(lmer_mod1) # model-based covariance matrix

(L = matrix(c(0,0,1,0,
              0,0,0,1),nrow=2,byrow = TRUE)) #L for Lb

## Hypothesis Testing
w2 <- as.numeric( t(L%*%b) %*% solve(L %*% W %*% t(L))%*% (L%*%b)) ## should be approximately chi squared
1 - pchisq(w2, df = nrow(L)) #p-value

#If you use lmer(), you can use anova() to compare nested models rather than having to do the manual code above.
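# For example (hypothetical names), if lmer_mod_small is nested inside lmer_mod1:
# anova(lmer_mod_small, lmer_mod1) # refits both by ML and reports a likelihood ratio test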
  4. Compare two nested models among your candidate models (if you don’t have any nested models, add another candidate model that is nested within one of your larger models). Discuss whether you should favor the larger model or the nested model.

ANSWER:

Diagnostics

For any model with a quantitative outcome and a quantitative explanatory variable, we should check that the residuals show no systematic pattern.

  5. Create a residual plot for your best model so far, plotting the residuals on the y-axis against a quantitative x-variable. Comment on the residual plot. See some example code below.

ANSWER:

# GEE residuals (geem stores the observed outcome in gee_mod1$y):
resid = gee_mod1$y - predict(gee_mod1)
# Mixed effects residuals (merMod objects are S4, so use residuals() rather than $y):
resid = residuals(lmer_mod1)

plot(predict(gee_mod1), resid) # swap in lmer_mod1 as needed, or put a quantitative x-variable on the horizontal axis
  6. Considering all of the choices you need to make, make an argument for one final model using a variety of model selection tools.

ANSWER:

  7. Using that final model, what do you conclude about your original research question about the elderly and brain training?

ANSWER: