8 Data Wrangling - Missing data
Settling In
Data Storytelling Moment
Quiz 1 Revisions
On a separate piece of paper, please complete the following:
- For every question I marked with an X on Quiz 1, write:
- a more correct answer
- 1 sentence describing how your understanding has changed since the quiz
Turn in this piece of paper + the original quiz next Monday in class.
Learning goals
After this lesson, you should be able to:
- Go through a data quality checklist when data wrangling
- Explain the difference between MCAR, MAR, and MNAR missing data mechanisms
- Assess what missing data mechanisms might be at play in a given dataset
- Use visualizations to explore missing data patterns
- Explain why multiple imputation is preferred to single imputation
- Explain how a simulation study can be used to investigate properties of statistical methods
Data Quality Checklist
When wrangling or cleaning data, check the assumptions you are making about the data so that you don't degrade data quality along the way.
. . .
Data Parsing (reading data into a different data format)
- Always keep the original, raw data (don’t manually change it).
- Use Test Cases: Find rows or write test cases to double-check that the wrangling works as expected
- DATES: When using lubridate to parse dates and times, ensure the strings are consistently ordered and formatted (e.g., mm/dd/yy vs. dd/mm/yy).
- STRINGS: When using stringr to parse strings with regular expressions, check example rows to ensure that the pattern captures all of the cases you want and excludes the patterns you don't want.
- Always check for missing values to see if the missing ones are expected given the original data. (A short sketch of these checks follows this list.)
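For example, a minimal sketch of these parsing checks, where the raw_dates and raw_ids vectors are hypothetical stand-ins for columns in your data:

library(lubridate)
library(stringr)

raw_dates <- c("05/01/23", "05/02/23", "5/3/23")   # assumed to be mm/dd/yy
parsed_dates <- mdy(raw_dates)                     # mdy() matches that ordering
sum(is.na(parsed_dates))                           # any parsing failures?

raw_ids <- c("ID-001", "ID-042", "id 7")           # ID strings with inconsistent formatting
str_detect(raw_ids, "^ID-\\d{3}$")                 # which rows does the pattern capture?
str_subset(raw_ids, "^ID-\\d{3}$", negate = TRUE)  # inspect the rows it misses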
. . .
Data Joining
- Identify missing values in the key variables and decide how to handle them before the merge process (e.g., omitting rows with missing values, imputing missing values).
- Decide on the correct join type (left, right, inner, full, etc.), OR, if the datasets have the same structure, use list_rbind() to bind rows or list_cbind() to bind columns.
- If doing a join, make sure that the key variables (by) have the same meaning in both datasets and are represented in the same way (e.g., if id = 1 to 20 in the first dataset refers to different units than id = 1 to 20 in the second, the join will match rows in undesirable ways).
- Predict the number of rows that will result from the join, and use anti_join() to see which rows did not find a match (a short sketch of these checks follows this list).
- Check for duplicate records within each dataset and ensure they are handled appropriately before merging.
- Verify that the merged dataset maintains consistency with the original datasets in terms of data values, variable names, and variable types.
- Perform some preliminary analysis or validation checks on the merged dataset to ensure that it meets the requirements of your analysis.
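A minimal sketch of these join checks, using two small hypothetical tibbles (visits and demographics):

library(dplyr)

visits <- tibble(pid = c(1, 1, 2, 3), visit = c(1, 2, 1, 1))
demographics <- tibble(pid = c(1, 2, 4), age = c(34, 51, 29))

# Predict: a left join should keep all 4 rows of visits (pid is unique in demographics)
joined <- left_join(visits, demographics, by = "pid")
nrow(joined)                                   # does this match the prediction?

# Which visit rows did not find a match in demographics?
anti_join(visits, demographics, by = "pid")

# Are there duplicate keys that would multiply rows in the join?
demographics %>% count(pid) %>% filter(n > 1)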
. . .
Sanity Check: Visualize your data!!!
- Do the right number of points appear?
- Do the values seem reasonable?
Explicit v. Implicit missing data
Explicit missing data is data that is explicitly marked as missing. In R, this is done with NA.
. . .
Implicit missing data is data that is missing but not explicitly marked as such.
- This can happen when an entire row is missing from a dataset.
- For example, if a study participant doesn't attend a follow-up visit, that visit may not even be recorded in the data.
. . .
We need to make implicit missingness explicit before we can work with the data.
- Consider the combinations of variables that you'd expect to be fully present in a complete dataset.
- If a combination is missing, you can create a new row with explicit missing values.
- For example, if you expect every participant (each has a unique pid) to have an observation for each visit, you could use the function complete() to create that new row and plug in values of NA for the missing data.
study_data_full <- study_data %>%
complete(pid, visit)
Dealing with missing data
If you have explicit missing data, there are 3 main ways of proceeding:
- Drop the cases (rows) with any missing data from the analysis: a complete case analysis.
- Pro: Easy to implement
- Con: Reduces sample size and introduces bias if the missing data are not "missing completely at random"
- Create a category for missing values.
- Example: The survey question “What is your gender?” might only provide two possible responses: “male” and “female”. Missing values for this could indicate that the respondent uses a non-binary designation. Instead of dropping these cases, treating the missing data as its own category (“Does not wish to answer”) would be more appropriate.
- Pros: Maintains sample size, may help us if data is “missing not at random”
- Cons: Not directly applicable to continuous outcomes (could include interactions with a categorical version in models to account for it)
- Impute (or fill in values for) the missing data using imputation algorithms.
- Imputation algorithms can be as simple as replacing missing values with the mean of the non-missing values (very simplistic).
- Regression imputation algorithms use models to predict missing values as a function of other variables in the data.
- Pros: Maintains sample size, multiple regression imputation minimizes bias if “missing at random”
- Cons: Computationally intensive
. . .
Deciding between these options and proceeding with choosing finer details within an option requires an understanding of the mechanism by which data become missing.
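Before turning to those mechanisms, here is a minimal sketch of what each of the three options can look like in code; the survey tibble and its variables below are simulated and hypothetical:

library(dplyr)
library(tidyr)
library(mice)

set.seed(1)
survey <- tibble(
  age    = runif(100, min = 20, max = 70),
  income = 20 + 0.8 * age + rnorm(100, sd = 10),
  gender = sample(c("male", "female", NA), size = 100, replace = TRUE)
)
survey$income[sample(100, size = 20)] <- NA        # make some incomes missing

# Option 1: complete case analysis (drop rows with any missing value)
survey %>% drop_na()

# Option 2: treat missing values of a categorical variable as their own category
survey %>% mutate(gender = replace_na(gender, "Does not wish to answer"))

# Option 3: regression imputation with mice (income predicted from age)
imputed <- mice(survey %>% select(age, income), m = 5, method = "norm", printFlag = FALSE)
mice::complete(imputed, 1)                         # inspect the first imputed dataset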
Missing data mechanisms
The reasons for which a variable might have missing data are divided into 3 mechanisms: MCAR, MAR, and MNAR.
Within a dataset, multiple mechanisms may be present, so we need to consider the missingness mechanism for each variable individually. (A small simulation sketch after the list below shows how each mechanism can be generated.)
- Missing completely at random (MCAR):
- The probability of missing data for a variable is the same for all cases. Implies that causes of the missing data are unrelated to the data. (https://stefvanbuuren.name/fimd/sec-MCAR.html)
- Examples:
- Measurement device that runs out of batteries causes the remainder of observations for the day to be missing.
- Data entry software requires a particular field to be typo-free, and missing values are introduced when there are typos.
- Implications for downstream work:
- If a variable has MCAR missingness, a complete case analysis will be unbiased (still valid).
- However, with a lot of missing observations, a complete case analysis will suffer from loss of statistical power (ability to detect a real difference), and imputation will be useful to retain the original sample size.
- Missing at random (MAR):
- The probability of missing data is related to observed variables but unrelated to unobserved information.
- Examples:
- Blood pressure measurements tend to be missing in patients in worse health. (Those in worse health are more likely to miss clinic visits.) Better and worse health can be measured by a variety of indicators in their health record.
- In a survey, older people are more likely to report their income than younger people. Missingness is related to the observed age variable, but not to unobserved information.
- Implications for downstream work:
- Try to use imputation methods that predict the value of the missing variables from other observed variables. Assessing whether this can be done accurately takes some exploration–we’ll explore this shortly.
- Missing not at random (MNAR):
- The probability of missing data is related to unobserved variables (and probably observed variables too).
- Examples:
- Blood pressure measurements are more likely to be missing for those with the highest blood pressure. This is MNAR rather than MAR because the missing data on blood pressure is related to the unobserved values themselves.
- High-income individuals may be less likely to report their income.
- Implications for downstream work:
- Ideally, we would learn more about the causes for the missingness. This could allow us to use more informed imputation models.
- Example: Biological measurements that tend to be missing because of concentrations that are too low (a phenomenon known as left-censoring). Imputation methods specifically suited to left-censoring are useful here.
- We can use imputation methods with different assumptions about the missing data and try out a variety of assumptions. This lets us see how sensitive the results are under various scenarios.
- Example: If higher incomes are more likely to be missing, we can make different assumptions about what "high" could be to fill in the missing values and see how our results change under these different assumptions.
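To make these mechanisms concrete, here is a minimal simulation sketch (the variables x and y are hypothetical): the probability that y is missing depends on nothing (MCAR), on the observed variable x (MAR), or on the value of y itself (MNAR).

library(dplyr)

set.seed(2)
n <- 1000
sim <- tibble(
  x = runif(n),
  y = x + rnorm(n, sd = 0.5),
  # MCAR: the same missingness probability for every case
  y_mcar = if_else(rbinom(n, size = 1, prob = 0.3) == 1, NA, y),
  # MAR: missingness probability depends on the observed x
  y_mar = if_else(rbinom(n, size = 1, prob = x) == 1, NA, y),
  # MNAR: missingness probability depends on the (unobserved) value of y itself
  y_mnar = if_else(rbinom(n, size = 1, prob = plogis(2 * y)) == 1, NA, y)
)

# Complete case means of y under each mechanism: only MCAR leaves the mean roughly unbiased
sim %>% summarize(across(c(y, y_mcar, y_mar, y_mnar), ~ mean(.x, na.rm = TRUE)))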
Exercise
- Missing data mechanism: For each of the following situations, propose what missing data mechanism you think is most likely at play.
- In a clinical trial, some patients dropped out before the end of the study. Their reasons for dropping out were not recorded.
- A weather station records temperature, humidity, and wind speed every hour. Some of the recorded values are missing.
- A social media platform collects data on user interactions, such as likes, comments, and shares. Some interactions are not recorded due to bugs in the code.
Exploring missing data
Guiding question: How can we use visualizations and tabulations to explore what missing data mechanisms may be at play?
We'll look at the airquality dataset available in base R, which gives daily air quality measurements in New York from May to September 1973. You can pull up the codebook with ?airquality in the Console.
Missingness by variable
We can explore how much missingness there is for each variable with the following functions:
summary(airquality) # Summary statistics in addition to number of NA's
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
NA's :37 NA's :7
Month Day
Min. :5.000 Min. : 1.0
1st Qu.:6.000 1st Qu.: 8.0
Median :7.000 Median :16.0
Mean :6.993 Mean :15.8
3rd Qu.:8.000 3rd Qu.:23.0
Max. :9.000 Max. :31.0
naniar::vis_miss(airquality) # Where are NA's located?
naniar::miss_var_summary(airquality) # Information from vis_miss() in table form
# A tibble: 6 × 3
variable n_miss pct_miss
<chr> <int> <num>
1 Ozone 37 24.2
2 Solar.R 7 4.58
3 Wind 0 0
4 Temp 0 0
5 Month 0 0
6 Day 0 0
Missingness by case
We can explore how much missingness there is for each case with naniar::miss_case_summary(). For each case, this function calculates the number and percentage of variables with a missing value.
. . .
Impact of Information: If the pct_miss column is large for a case, we likely won't be able to impute any of its missing values because there just isn't enough known information; this case will have to be dropped from the analysis.
naniar::miss_case_summary(airquality)
# A tibble: 153 × 3
case n_miss pct_miss
<int> <int> <dbl>
1 5 2 33.3
2 27 2 33.3
3 6 1 16.7
4 10 1 16.7
5 11 1 16.7
6 25 1 16.7
7 26 1 16.7
8 32 1 16.7
9 33 1 16.7
10 34 1 16.7
# ℹ 143 more rows
Exploring missingness mechanisms
Assessing missingness mechanisms involves checking if missingness in a variable is related to other variables.
Note: Through our available data, we are really only able to explore the potential for MCAR or MAR mechanisms.
. . .
Impact of Information: There is always the chance that unobserved information (unobserved other variables or unobserved values of the variables we do have) is related to missingness for our variables, so to think through the potential for MNAR, more contextual information is necessary.
. . .
To explore these relationships, we can create TRUE/FALSE indicators of whether a variable is missing. In the plots below, we use is.na(Ozone) to explore whether cases with missing ozone values are noticeably different from cases with observed ozone values in terms of Solar.R.
ggplot(airquality, aes(x = is.na(Ozone), y = Solar.R)) +
geom_boxplot()
ggplot(airquality, aes(x = Solar.R, color = is.na(Ozone))) +
geom_density()
The above boxplots and density plots suggest that missing ozone is not strongly related to solar radiation levels.
. . .
We still should check if ozone missingness is related to the Wind, Temp, Month, and Day variables (to be done in Exercises).
In addition to checking if the chance of ozone missingness is related to Solar.R, we should check if the values of ozone could be predicted by Solar.R.
. . .
In the scatterplot below, we look at the relationship between Ozone and Solar.R and use vertical lines to indicate the Solar.R values for cases that are missing Ozone.
- Impact of Information: We see that missing Ozone cases are within the observed span of Solar.R, so we would be OK with predicting Ozone from Solar.R because there would be no extrapolation.
ggplot(airquality, aes(x = Solar.R, y = Ozone)) +
geom_point() +
geom_smooth() +
geom_vline(data = airquality %>% filter(is.na(Ozone)), mapping = aes(xintercept = Solar.R))
Exercises
- Mechanism detection practice: Look at the boxplot + scatterplot pairs for Alternate Situations 1 and 2 below. How do these situations compare to our actual situation and to each other? What concerns might arise from using a model to impute Ozone?
- Ozone mechanism detection: Continue the investigation of missingness for Ozone.
We want to see how Month, Wind, and Temp relate to the chance of missingness for Ozone and to the value of Ozone.
Does it look like a linear regression model (perhaps with variable transformations) could be effective in imputing the missing ozone data?
Regression imputation
When a model is built and used to generate a single set of predictions for missing values, this is known as single imputation.
. . .
- When using singly imputed data in subsequent modeling, the uncertainty in estimates tends to be underestimated. This means that:
- Standard errors are lower than they should be.
- Confidence intervals won’t contain the true parameter value the “advertised” percentage of times
- e.g., 95% confidence intervals will not contain the truth in 95% of samples–the coverage probability will be less than 95%
. . .
In multiple imputation, multiple imputed datasets are generated with different values for the filled-in data.
. . .
- Subsequent models are fit on each of these datasets, and both estimates and uncertainty measures are pooled across all of these fits.
- Multiple imputation more accurately estimates uncertainty measures. (A minimal sketch of the multiple imputation workflow with the mice package follows below.)
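To make the workflow concrete, here is a minimal sketch of multiple imputation with the mice package on the airquality data; the model formula and settings here are chosen for illustration only:

library(mice)

imp <- mice(airquality, m = 10, method = "norm", seed = 224, printFlag = FALSE)  # 10 imputed datasets
fits <- with(imp, lm(Ozone ~ Solar.R + Wind + Temp))  # fit the model on each imputed dataset
pooled <- pool(fits)                                  # pool estimates and uncertainty across the fits
summary(pooled, conf.int = TRUE)                      # pooled coefficients with confidence intervals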
Simulation Studies
We can use a simulation study to investigate the statistical properties described above.
1. Generate (simulate) data where we are in control of missing data mechanisms and the true relationship between an outcome and the predictors.
2. On that simulated data, use single imputation to fill in the missing values. Fit the desired model, and obtain a confidence interval for a coefficient of interest.
3. On that simulated data, use multiple imputation to fill in the missing values. Fit the desired models on all imputed datasets, pool results, and obtain a confidence interval for a coefficient of interest.
4. Repeat Steps 1 - 3 a large number of times (num_simulations <- 1000) to see how things work out in lots of different samples.
5. Summarize the performance of single and multiple imputation across the num_simulations simulations.
We will slowly step through the simulation study code below. We will pause frequently for you to add comments documenting what is happening.
set.seed(224)
num_simulations <- 1000
ci_list <- vector("list", length = num_simulations)

system.time({
for (i in 1:num_simulations) {
  # Simulate data
  n <- 1000
  sim_data <- tibble(
    x1 = runif(n, min = 0, max = 1),
    x2 = x1 + rnorm(n, mean = 0, sd = 1),
    x2_miss_bool = rbinom(n, size = 1, prob = x1/2),
    x2_NA = if_else(x2_miss_bool == 1, NA, x2),
    y = x1 + x2 + rnorm(n, mean = 0, sd = 1)
  )

  # Single imputation ---------------
  mice_obj <- mice(sim_data %>% select(x1, x2_NA, y), m = 1, method = "norm", printFlag = FALSE)
  si_mod <- with(mice_obj, lm(y ~ x1 + x2_NA))
  ci_single <- si_mod$analyses[[1]] %>% confint(level = 0.95)
  ci_single <- ci_single["x2_NA",]

  # Multiple imputation -------------
  mice_obj <- mice(sim_data %>% select(x1, x2_NA, y), m = 10, method = "norm", printFlag = FALSE)
  mi_mods <- with(mice_obj, lm(y ~ x1 + x2_NA))
  pooled_res <- pool(mi_mods)
  summ_pooled_res <- summary(pooled_res, conf.int = TRUE, conf.level = 0.95)
  ci_multiple_lower <- summ_pooled_res %>% filter(term=="x2_NA") %>% pull(`2.5 %`)
  ci_multiple_upper <- summ_pooled_res %>% filter(term=="x2_NA") %>% pull(`97.5 %`)

  # Store CI information
  ci_list[[i]] <- tibble(
    ci_lower = c(ci_single[1], ci_multiple_lower),
    ci_upper = c(ci_single[2], ci_multiple_upper),
    which_imp = c("single", "multiple")
  )
}
})
Below we compute the confidence interval (CI) coverage probability (the fraction of times the CI contains the true value of 1) for the CIs generated from single and multiple imputation:
ci_data <- bind_rows(ci_list)
ci_data %>%
  mutate(contains_truth = ci_lower < 1 & ci_upper > 1) %>%
  group_by(which_imp) %>%
  summarize(frac_contains_truth = mean(contains_truth))
Reflection
What were the challenges and successes of today? What resources do you need to take the next steps in your understanding of the ideas today?
Solutions
- Missing data mechanism
Solution
A variety of responses would be reasonable here:
- Clinical trial dropout: Probably MNAR. The chance of missingness is likely due to factors that aren't recorded (like discomfort).
- Weather station: Could be MCAR if the missing values are from technical glitches. Could be MAR if the missing values are related to other measured weather variables. Could be MNAR if the values tend to be missing when the values themselves are high (e.g., high temperature, humidity, and wind speed causing measurement devices to malfunction).
- Social media platform: Could be MCAR if the bugs affected the platform uniformly. Could be MAR if the bugs affected groups of users differently, but where the groupings are known (e.g., bugs affect mobile users more, and mobile vs. desktop usage is measurable). Could be MNAR if the bugs lead to performance issues that affect users with slower internet more (where internet speed is likely not measured).
- Mechanism detection practice
Solution
In both alternate situations, Ozone is missing when Solar.R is at higher values.
In Alternate Situation 1, we have some observed data in the range of Solar.R where Ozone is missing, so we could predict it fairly well.
In Alternate Situation 2, Ozone is missing when Solar.R is beyond the range of our observed data, so we don't have a good sense of what the relationship between Solar.R and Ozone would be there.
- Ozone mechanism detection
Solution
Missingness (MAR or potentially MNAR) of Ozone is related to:
- Month: More likely in June
Value of Ozone is related to:
- Month: Higher in July and August
- Wind: Lower when wind is higher
- Temp: Higher when temp is higher
ggplot(airquality, aes(fill = is.na(Ozone), x = factor(Month))) +
geom_bar(position = "fill")
ggplot(airquality, aes(x = is.na(Ozone), y = Wind)) +
geom_boxplot()
ggplot(airquality, aes(x = is.na(Ozone), y = Temp)) +
geom_boxplot()
ggplot(airquality, aes(x = factor(Month), y = Ozone)) +
geom_boxplot()
ggplot(airquality, aes(x = Wind, y = Ozone)) +
geom_point() +
geom_smooth() +
geom_vline(data = airquality %>% filter(is.na(Ozone)), mapping = aes(xintercept = Wind))
ggplot(airquality, aes(x = Temp, y = Ozone)) +
geom_point() +
geom_smooth() +
geom_vline(data = airquality %>% filter(is.na(Ozone)), mapping = aes(xintercept = Temp))
After Class
- Take a look at the Schedule page to see how to prepare for the next class.
- Quiz 1 Revisions due next Monday in class.
- Finish Homework 4
- I’ll post Homework 5 later today. It will be due next Tuesday.
References
- The naniar package provides a variety of functions for exploring missing data through visualizations and tabulations.
- Flexible Imputation of Missing Data: A free online book about missing data and imputation
- This section digs into missing data mechanisms.
- A chapter on missing data from Andrew Gelman’s online textbook