6.3 R: Wide vs. Long Format

In R, longitudinal data can be formatted in two ways.

Wide Format: When the observation times are the same for every unit (balanced data), every unit has a vector of observations of length \(m\), and we can organize the data into one row per unit. The result is a data set with \(n\) rows and \(m\) columns of outcome values (plus additional columns for an identifier variable and any explanatory variables). See the simulated example below. The first column is an id column that distinguishes the units from each other, the next five columns correspond to the \(m=5\) observations over time, and the last five columns correspond to a time-varying explanatory variable.

n = 10
m = 5

# Example with 5 repeated observations for 10 units
(wideD = data.frame(id = 1:n,
                    y.1 = rnorm(n), y.2 = rnorm(n), y.3 = rnorm(n), y.4 = rnorm(n), y.5 = rnorm(n),
                    x.1 = rnorm(n), x.2 = rnorm(n), x.3 = rnorm(n), x.4 = rnorm(n), x.5 = rnorm(n)))

Long Format: We’ll need the data in long format for most data analysis, and we must use long format when the observation times differ across units. Imagine stacking each unit’s observed outcome vector on top of the others. Similarly, we stack the observed vectors of any other explanatory variables we have for each unit. In contrast to wide format, long format requires one variable that identifies the unit and another that gives the observation time.
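To make the stacking concrete, here is a minimal hand-built sketch of long format with unequal observation times. The name longD and all of the values are made up for illustration; they are not part of the simulated example above.

```r
# Hypothetical long-format data: unit 1 is observed three times, unit 2 twice,
# and the observation times differ between the two units.
longD <- data.frame(
  id   = c(1, 1, 1, 2, 2),            # unit identifier
  time = c(0, 1.5, 3, 0.5, 2),        # observation time
  y    = c(2.1, 2.4, 3.0, 1.7, 1.9),  # outcome
  x    = c(0.3, 0.1, 0.8, 0.5, 0.2)   # time-varying explanatory variable
)

# One row per unit per observation time
longD

# Number of observations per unit
table(longD$id)
```

Because the units have different numbers of observations at different times, there is no natural set of columns y.1, y.2, ... that would hold this data in wide format.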

See below for the R code to convert a data set in wide format to one in long format. The conversion is easiest when the column names specify both the variable name and the time order of the values, such as y.1, y.2, ..., y.5, x.1, x.2, ..., x.5.

In the tidyr package, pivot_longer() takes the wide data, the columns (cols) you want to make longer, and the names of two new variables you want to create. The first (names_to; we use Tmp here) names the variable that will hold the names of the columns being gathered. The second (values_to; called Value here) names the variable that will collect the values from those columns. See below.

library(tidyr) # also provides the %>% pipe used below

pivot_longer(wideD, cols = y.1:x.5, names_to = 'Tmp', values_to = 'Value') %>% head()

Then, we want to separate the Tmp variable into two variables because it contains information about both the characteristic and the time. We can use separate() to split Tmp into Var and Time. By default, separate() splits on any non-alphanumeric character, so it detects the . as the separator automatically.

pivot_longer(wideD, cols = y.1:x.5, names_to = 'Tmp', values_to = 'Value') %>%
  separate(Tmp,into=c('Var','Time'), remove=TRUE)  %>%
  head()
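If you would rather not rely on the default separator, the following sketch spells it out with sep and also asks separate() to convert Time to a number with convert = TRUE. The smaller data set smallWide, the result smallLong, and the seed are assumptions made here so that the example runs on its own.

```r
library(tidyr)

set.seed(1)  # arbitrary seed so the simulated values are reproducible

# Hypothetical smaller wide data set: 3 units, 2 observation times
smallWide <- data.frame(id = 1:3,
                        y.1 = rnorm(3), y.2 = rnorm(3),
                        x.1 = rnorm(3), x.2 = rnorm(3))

smallLong <- pivot_longer(smallWide, cols = y.1:x.2,
                          names_to = "Tmp", values_to = "Value") %>%
  # sep is a regular expression, so the literal dot must be escaped
  separate(Tmp, into = c("Var", "Time"), sep = "\\.", convert = TRUE)

str(smallLong)  # Time is now stored as an integer rather than a character
```

Having Time stored as a number is usually more convenient for plotting and modeling than the character values that separate() produces by default.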

Lastly, we want to have one variable called x and one variable called y, and we can get that by pivoting the variable Var wider into two columns with the values that come from Value.

pivot_longer(wideD, cols = y.1:x.5, names_to = 'Tmp', values_to = 'Value') %>%
  separate(Tmp,into=c('Var','Time'), remove=TRUE) %>%
  pivot_wider(names_from = 'Var', values_from = 'Value') %>%
  head()
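As an alternative sketch, pivot_longer() can do all three steps in one call when the column names follow the variable.time pattern. The special ".value" entry in names_to tells tidyr to use the first part of each column name (y or x) as the name of an output column, while the second part goes into Time. The data set smallWide, the result oneStep, and the seed are assumptions made here so that the example is self-contained.

```r
library(tidyr)

set.seed(1)  # arbitrary seed so the simulated values are reproducible

# Hypothetical smaller wide data set: 3 units, 2 observation times
smallWide <- data.frame(id = 1:3,
                        y.1 = rnorm(3), y.2 = rnorm(3),
                        x.1 = rnorm(3), x.2 = rnorm(3))

oneStep <- pivot_longer(smallWide, cols = -id,
                        names_to = c(".value", "Time"),
                        names_sep = "\\.")  # split column names at the literal dot

# Time comes out as character, so convert it to an integer
oneStep$Time <- as.integer(oneStep$Time)

head(oneStep)
```

This should give the same layout as the three-step pipeline, with one row per unit per time and columns id, Time, y, and x. The step-by-step version is still useful for seeing what each stage does.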