6.2 Data Examples

We can consider several real datasets from various applications to put things into a firmer perspective. These will not only provide us with concrete examples of longitudinal data situations but also illustrate the range of ways that data may be collected and the types of measurements that may be of interest.

6.2.1 Example 1: The orthodontic study data of Potthoff and Roy (1964)

6.2.1.1 Data Context

A study was conducted involving 27 children, 16 boys and 11 girls. On each child, the distance (mm) from the center of the pituitary to the pterygomaxillary fissure was measured at ages 8, 10, 12, and 14 years. In the plot below, the distance measurements are plotted against age. The measurements for each child are connected by a thin solid line so that individual patterns may be seen, and the color of the lines denotes girls (0) and boys (1). The two thick lines are least-squares fits, one per gender.

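library(dplyr)    # provides the %>% pipe
library(ggplot2)  # plotting

# One row per measurement: child id, age (years), distance (mm), gender (0 = girl, 1 = boy)
dat <- read.table("./data/dental.dat", col.names=c("obsno", "id", "age", "distance", "gender"))

dat %>%
  ggplot(aes(x = age, y = distance, col = factor(gender))) +
  geom_line(aes(group = id)) +                 # one thin line per child
  geom_smooth(method='lm',se=FALSE,lwd=3) +    # thick least-squares line per gender
  theme_minimal() + 
  scale_color_discrete('Gender')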

Plots like this are often called spaghetti plots, for obvious reasons! Each set of connected points represents one child and that child's outcome measurements over time.

6.2.1.2 Research Questions

The objectives of the study were to

  • Determine whether distances over time are larger for boys than for girls
  • Determine whether the rate of change (i.e., slope) for distance is similar between boys and girls.

Several features are notable from the spaghetti plot of the data:

  • Each child appears to have his/her own trajectory of distance as a function of age. For any given child, the trajectory looks roughly like a straight line, with some random fluctuations. But from child to child, the features of the trajectory (e.g., its steepness) vary. Thus, the trajectories are all similar in form but vary in their specific characteristics among children. Also, note any unusual trajectories. In particular, there is one boy whose pattern fluctuates more profoundly than those of the other children and one girl whose distance is much “lower” than the others at all time points.

  • The overall trend is for the distance measurement to increase with age. The trajectories for some children exhibit strict increases with age. In contrast, others show some intermittent decreases (biologically, is this possible? or just due to measurement error?), but still with an overall increasing trend across the entire six-year period.

  • The distance trajectories for boys seem, for the most part, to be “higher” than those for girls – most of the boy profiles involve larger distance measurements than those for girls. However, this is not uniformly true: some girls have larger distance measurements than boys at some ages.

  • Although boys seem to have larger distance measurements, the rate of change of the measurements with increasing age seems similar. More precisely, the slope of the increasing (approximate straight-line) relationship with age seems roughly similar for boys and girls. However, for any individual boy or girl, the rate of change (slope) may be steeper or shallower than the evident “typical” rate of change.

To address the questions of interest, we need a formal way of representing the fact that each child has an individual-specific trajectory.
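One informal way to make the idea of an individual-specific trajectory concrete is to fit a separate least-squares line to each child and look at how the fitted intercepts and slopes vary. The sketch below assumes the Orthodont data frame from the nlme package, which holds the same Potthoff and Roy measurements, as a stand-in for dental.dat. The names orth and child_lines are illustrative only, and this is a descriptive summary rather than a formal model.

```r
library(nlme)  # assumption: Orthodont (distance, age, Subject, Sex) stands in for dental.dat

orth <- as.data.frame(Orthodont)

# Fit a separate straight line to each child's four measurements
fits <- lapply(split(orth, orth$Subject, drop = TRUE), function(d) {
  cf <- coef(lm(distance ~ age, data = d))
  data.frame(Subject   = as.character(d$Subject[1]),
             Sex       = as.character(d$Sex[1]),
             intercept = unname(cf[1]),
             slope     = unname(cf[2]))
})
child_lines <- do.call(rbind, fits)  # one row per child

# Typical slope (mm per year) and child-to-child spread within each sex
tapply(child_lines$slope, child_lines$Sex, mean)
tapply(child_lines$slope, child_lines$Sex, sd)
```

Comparing the two means speaks informally to the second research question, while the standard deviations show how much individual children depart from the typical rate of change.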

6.2.2 Example 2: Vitamin E diet supplement and growth of guinea pigs

6.2.2.1 Data Context

The following data are reported by Crowder and Hand (1990, p. 27). The study concerned the effect of a vitamin E diet supplement on the growth of guinea pigs. Fifteen guinea pigs were given a growth-inhibiting substance at the beginning of week 1 of the study (time 0, before the first measurement), and body weight was measured at the ends of weeks 1, 3, and 4. At the beginning of week 5, the pigs were randomized into three groups of 5, and vitamin E therapy was started. One group received a zero dose of vitamin E, another received a low dose, and the third received a high dose. Each guinea pig’s body weight (g) was measured again at the ends of weeks 5, 6, and 7. The data for the three dose groups are plotted below, with each thin line representing the body weight of one pig.

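library(dplyr)    # provides the %>% pipe
library(tidyr)    # gather(), separate()
library(ggplot2)  # plotting

dat <- read.table("./data/diet.dat", col.names=c("id", paste("bw.",c(1,3,4,5,6,7),sep=""), "dose" ))

# Reshape from wide (one column per week) to long (one row per pig and week)
dat <- dat %>%
  gather(Tmp,bw, bw.1:bw.7) %>%
  separate(Tmp,into=c('Var','time'), remove=TRUE) 

dat$time <- as.numeric(dat$time)
dat$dose <- factor(dat$dose)
levels(dat$dose) <- c("Zero dose", "Low dose", "High dose")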

dat %>%
  ggplot(aes(x = time, y = bw, col = dose)) +
  geom_line(aes( group = id)) + 
  xlab('Weeks') +
  ylab('Body Weight (g)') +
  geom_smooth(lwd=2, se=FALSE) +
  theme_minimal()

6.2.2.2 Research Questions

The primary objective of the study was to

  • Determine whether the growth patterns differed among the three groups.

As with the dental data, several features of the spaghetti plot are evident:

  • For the most part, the trajectories for individual guinea pigs seem to increase overall over the study period (although note pig 1 in the zero dose group). Guinea pigs in the same dose group have different trajectories, some of which look like a straight line and others of which seem to have a “dip” at the beginning of week 5, the time at which vitamin E was added in the low and high-dose groups.

  • The body weight for the zero dose group seems somewhat “lower” than those in the other dose groups.

  • It is unclear whether the rate of change in body weight on average is similar or different across dose groups. It is also unclear whether the pattern for individual pigs or “on average” is a straight line, so the rate of change may not be constant. Because vitamin E therapy was not administered until the beginning of week 5, we might expect two “phases” in the growth pattern, before and after vitamin E, making it possibly non-linear.
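The two-phase idea in the last point can be sketched with a piecewise linear ("broken-stick") term: a straight line in time plus an extra slope that switches on once vitamin E starts. The knot at week 4 (the end of week 4, just before therapy begins) and the names weeks, knot, and X are assumptions made for illustration; nothing here is fitted to the guinea pig data.

```r
weeks <- c(1, 3, 4, 5, 6, 7)  # measurement times from the study design
knot  <- 4                    # assumption: growth may change slope after week 4

# Columns: intercept, slope before the knot, change in slope after the knot
X <- cbind(intercept = 1,
           time      = weeks,
           after     = pmax(weeks - knot, 0))
X

# With the long-format data, a term such as pmax(time - 4, 0) could be added
# to a model formula, e.g. bw ~ time + pmax(time - 4, 0), possibly by dose.
```

The after column is zero through week 4 and then counts weeks since the knot, so its coefficient measures how much the growth rate changes in the second phase.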

6.2.3 Example 3: Epileptic seizures and chemotherapy

A common situation is where the measurements are in the form of counts. A response in the form of a count is by nature discrete: counts (usually) take only nonnegative integer values (0, 1, 2, 3, …).

6.2.3.1 Data Context

The following data were first reported by Thall and Vail (1990). A clinical trial was conducted in which 59 people with epilepsy suffering from simple or partial seizures were assigned at random to receive either the anti-epileptic drug progabide (subjects 29-59) or an inert substance (a placebo, subjects 1-28) in addition to a standard chemotherapy regimen all were taking. Because each individual might be prone to different rates of experiencing seizures, the investigators first tried to get a sense of this by recording the number of seizures suffered by each subject over the 8 weeks before administering the assigned treatment. It is common in such studies to record such baseline measurements so that the effect of treatment for each subject may be measured relative to how that subject behaved before treatment.

Following the commencement of treatment, each subject’s seizures were counted for four consecutive two-week periods. The age of each subject at the start of the study was also recorded, as it was suspected that the subject’s age might be associated with the effect of the treatment.

The first 10 rows of the long format data for the study are shown below.

library(MASS)  # provides the epil data frame
head(epil,10)
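The design described above can be checked directly against the data. The sketch below assumes the epil data frame from MASS, with columns subject, trt, base, and age; subj is an illustrative name.

```r
library(MASS)

# One row per subject: the baseline count and age do not change across periods
subj <- unique(epil[, c("subject", "trt", "base", "age")])

nrow(subj)                  # number of subjects
table(subj$trt)             # subjects per treatment arm
table(table(epil$subject))  # number of post-treatment rows per subject
```

A quick check like this confirms that the long format repeats each subject's baseline information on every one of that subject's rows.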

6.2.3.2 Research Questions

The primary objective of the study was to

  • Determine whether progabide reduces the rate of seizures in subjects like those in the trial.

Here, we have repeated measurements (counts) on each subject over four consecutive observation periods. We would like to compare the baseline seizure counts to post-treatment counts, where the latter are observed repeatedly over time following the initiation of treatment. An appropriate analysis would make the best use of this feature of the data in addressing the main objective. Below is a boxplot of the change from baseline, separated by treatment group; note that it ignores the repeated measures.

epil %>%
  # base counts seizures over 8 weeks, y over 2 weeks: rescale base to a 2-week equivalent
  mutate(change = y - base / 4) %>%
  ggplot(aes(x = trt, y = change)) +
  geom_boxplot() + 
  theme_classic()

Moreover, some counts are quite small; for some subjects, 0 seizures (none) were experienced in some periods. For example, subject 31 in the treatment group experienced only 0, 3, or 4 seizures over the four observation periods. Pretending that the response is continuous would be a lousy approximation of the true nature of the data! Thus, methods suitable for handling continuous data problems like the first two examples would not be appropriate for data like these.

A common approach to handling data in the form of counts is to transform them to some other scale. The motivation is to make them seem more “normally distributed” with constant variance, and the logarithm transformation is used to (hopefully) accomplish this. The desired result is that methods usually used to analyze continuous measurements may be applied.

However, the drawback of this approach is that one is no longer working with the data on the original scale of measurement, the number of seizures in this case. The statistical model then describes the “log number of seizures,” which is not particularly interesting or intuitive. Statistical methods have since been developed to analyze discrete repeated measurements like counts on the original measurement scale.
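A short check illustrates two practical obstacles to the transformation approach: zero counts have no finite logarithm, and the spread of count data tends to grow with its mean. The sketch assumes the epil data frame from MASS; n_zero, log_y, and mv are illustrative names.

```r
library(MASS)

# Zero counts cannot be log-transformed without an ad hoc adjustment
n_zero <- sum(epil$y == 0)
log_y  <- log(epil$y)    # log(0) evaluates to -Inf
sum(is.infinite(log_y))  # equals the number of zero counts

# Mean and variance of the two-week counts within each treatment arm
mv <- sapply(split(epil$y, epil$trt),
             function(v) c(mean = mean(v), var = var(v)))
mv
```

In practice a constant is often added before taking logs to get around the zeros, which moves the analysis one step further from the original scale.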

6.2.4 Example 4: Maternal smoking and child respiratory health

Another common discrete data situation is where the response is binary; that is, the response may take on only two possible values, which usually correspond to things like

  • success or failure of a treatment to elicit a desired response
  • presence or absence of some condition

It would be foolish to even try to pretend such data are approximately continuous!

6.2.4.1 Data Context

The following data come from a very large public health study, the Six Cities Study, undertaken in six small American cities to investigate various public health issues. The full situation is reported in Lipsitz, Laird, and Harrington (1992). The current study focused on the association between maternal smoking and child respiratory health. Each of the 300 children was examined once a year at ages 9–12. The response of interest was “wheezing status,” a measure of the child’s respiratory health, which was coded as either “no” (0) or “yes” (1), where “yes” corresponds to respiratory problems. Also recorded at each examination was a code to indicate the mother’s current level of smoking: 0 = none, 1 = moderate, 2 = heavy. The data for the first 5 subjects are summarized below. Missing data are denoted by a “.”.

                      Smoking at age      Wheezing at age
Subject   City        9    10   11   12   9    10   11   12
1         Portage     2    2    1    1    1    0    0    0
2         Kingston    0    0    0    0    0    0    0    0
3         Portage     1    0    0    .    0    0    0    .
4         Portage     .    1    1    1    .    1    0    0
5         Kingston    1    .    1    2    0    .    0    1

A simplified subset of these data is available in R as the ohio data frame in the geepack package.

library(geepack)  # provides the ohio data frame
data(ohio)
head(ohio,10)

6.2.4.2 Research Questions

The objective of an analysis of these data was to

  • Determine how the typical “wheezing” response pattern changes with age
  • Determine whether there is an association between maternal smoking severity and child respiratory status (as measured by “wheezing”).

It would be pretty pointless to plot the responses as a function of age as we did in the continuous data cases – here, the only responses are 0 or 1! Inspection of subject data suggests something is happening here; for example, subject 5 did not exhibit positive wheezing until his/her mother’s smoking increased in severity.

This highlights that this situation is complex: over time (measured here by the child’s age), an important characteristic, maternal smoking, changes. Contrast this with the previous situations, where the main focus is to compare groups whose membership stays constant over time.

Thus, we have repeated measurements that, to further complicate matters, are binary! As with the count data, one might first think about summarizing and transforming the data to allow methods for continuous data to be used; however, this would be inappropriate. As we will see later, methods for dealing with repeated binary responses and scientific questions like those above have been developed.

Another feature of these data is that some measurements are missing for some subjects. Specifically, although the intention was to collect data for each of the four ages, this information is not available for some children and their mothers at some ages; for example, subject 3 has both the mother’s smoking status and wheezing indicator missing at age 12. This pattern would suggest that the mother may have failed to appear with the child for this intended examination.
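The missing-data pattern in the five tabulated subjects can be made explicit by entering them in wide format and reshaping to one row per child and age. This is a sketch using only the five rows shown in the table above; the column names (smoke.9, wheeze.9, and so on) and the object names wide and long are invented for illustration.

```r
# The five tabulated subjects, with "." read as a missing value
wide <- read.table(text = "
subject city smoke.9 smoke.10 smoke.11 smoke.12 wheeze.9 wheeze.10 wheeze.11 wheeze.12
1 Portage  2 2 1 1 1 0 0 0
2 Kingston 0 0 0 0 0 0 0 0
3 Portage  1 0 0 . 0 0 0 .
4 Portage  . 1 1 1 . 1 0 0
5 Kingston 1 . 1 2 0 . 0 1
", header = TRUE, na.strings = ".")

# Wide to long: one row per subject and age
long <- reshape(wide, direction = "long",
                varying = list(paste0("smoke.", 9:12), paste0("wheeze.", 9:12)),
                v.names = c("smoke", "wheeze"),
                times = 9:12, timevar = "age", idvar = "subject")
long <- long[order(long$subject, long$age), ]

# Number of missing wheezing records at each age
tapply(is.na(long$wheeze), long$age, sum)

# Smoking and wheezing are missing together, consistent with a missed visit
all(is.na(long$smoke) == is.na(long$wheeze))
```

Keeping the missing visits as explicit NA rows, rather than dropping them, makes it easy to see which examinations were missed and by whom.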

A final note: In the guinea pig and seizure examples, units (guinea pigs, patients) were assigned to treatments at random; thus, these may be regarded as controlled experiments, where the investigator has some control over how the factors of interest are “applied” to the units (through randomization). In contrast, in this study, the investigators did not decide which children would have mothers who smoke; instead, they could only observe the smoking behavior of the mothers and the wheezing status of their children. That is, this is an example of an observational study. Because it may be impossible or unethical to randomize subjects to potentially hazardous circumstances, studies of issues in public health and the social sciences are often observational.

As in many observational studies, an additional difficulty is that the exposure of interest, in this case maternal smoking, changes over time along with the response. This leads to complicated issues of interpretation in statistical modeling that are a matter of some debate.