8 Data Wrangling - Missing data
Settling In
Data Storytelling Moment
Quiz 1 Revisions
On a separate piece of paper, please complete the following:
- For every question I marked with an X on Quiz 1, write:
- a more correct answer
- 1 sentence describing how your understanding has changed since the quiz
Turn in this piece of paper + the original quiz next Monday in class.
Learning goals
After this lesson, you should be able to:
- Go through a data quality checklist when data wrangling
- Explain the difference between MCAR, MAR, and MNAR missing data mechanisms
- Assess what missing data mechanisms might be at play in a given dataset
- Use visualizations to explore missing data patterns
- Explain why multiple imputation is preferred to single imputation
- Explain how a simulation study can be used to investigate properties of statistical methods
Data Quality Checklist
When wrangling or cleaning data, check the assumptions you are making about the data so that you don't degrade data quality along the way.
. . .
Data Parsing (reading data into a different data format)
- Always keep the original, raw data (don’t manually change it).
- Use Test Cases: Find rows or write test cases to double-check that the wrangling works as expected
- DATES: When using lubridate to parse dates and times, ensure the strings are consistently ordered and formatted (e.g., mm/dd/yy vs. dd/mm/yy).
- STRINGS: When using stringr to parse strings with regular expressions, check example rows to ensure that the pattern captures all of the cases you want and excludes the patterns you don't want.
- Always check for missing values to see if the missing ones are expected given the original data. (A short sketch of these checks follows this list.)
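For example, a minimal sketch of these parsing checks, where the raw_dates and raw_ids vectors are hypothetical stand-ins for columns in your data:

library(lubridate)
library(stringr)

raw_dates <- c("05/01/23", "05/02/23", "5/3/23")   # assumed to be mm/dd/yy
parsed_dates <- mdy(raw_dates)                     # mdy() matches that ordering
sum(is.na(parsed_dates))                           # any parsing failures?

raw_ids <- c("ID-001", "ID-042", "id 7")           # ID strings with inconsistent formatting
str_detect(raw_ids, "^ID-\\d{3}$")                 # which rows does the pattern capture?
str_subset(raw_ids, "^ID-\\d{3}$", negate = TRUE)  # inspect the rows it misses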
. . .
Data Joining
- Identify missing values in the key variables and decide how to handle them before the merge process (e.g., omitting rows with missing values, imputing missing values).
- Decide on the correct join type (left, right, inner, full, etc.), OR, if the datasets have the same structure, use list_rbind() to bind rows or list_cbind() to bind columns.
- If doing a join, make sure that the key variables (by) have the same meaning in both datasets and are represented in the same way (e.g., if id = 1 to 20 in the first dataset refers to different units than id = 1 to 20 in the second, the join will match rows in undesirable ways).
- Predict the number of rows that will result from the join, and use anti_join() to see which rows did not find a match (a short sketch of these checks follows this list).
- Check for duplicate records within each dataset and ensure they are handled appropriately before merging.
- Verify that the merged dataset maintains consistency with the original datasets in terms of data values, variable names, and variable types.
- Perform some preliminary analysis or validation checks on the merged dataset to ensure that it meets the requirements of your analysis.
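A minimal sketch of these join checks, using two small hypothetical tibbles (visits and demographics):

library(dplyr)

visits <- tibble(pid = c(1, 1, 2, 3), visit = c(1, 2, 1, 1))
demographics <- tibble(pid = c(1, 2, 4), age = c(34, 51, 29))

# Predict: a left join should keep all 4 rows of visits (pid is unique in demographics)
joined <- left_join(visits, demographics, by = "pid")
nrow(joined)                                   # does this match the prediction?

# Which visit rows did not find a match in demographics?
anti_join(visits, demographics, by = "pid")

# Are there duplicate keys that would multiply rows in the join?
demographics %>% count(pid) %>% filter(n > 1)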
. . .
Sanity Check: Visualize your data!!!
- Do the right number of points appear?
- Do the values seem reasonable?
Explicit v. Implicit missing data
Explicit missing data is data that is explicitly marked as missing. In R, this is done with NA.
. . .
Implicit missing data is data that is missing but not explicitly marked as such.
- This can happen when an entire row is missing from a dataset.
- For example, if a study participant doesn't attend a follow-up visit, that visit may not even be recorded in the data.
. . .
We need to make implicit missingness explicit before we can work with the data.
- Consider the combinations of variables that you'd expect to be fully present in a complete dataset.
- If a combination is missing, you can create a new row with explicit missing values.
- For example, if you expect every participant (each has a unique pid) to have an observation for each visit, you could use the function complete() to create that new row and plug in values of NA for the missing data.
study_data_full <- study_data %>%
complete(pid, visit)
Dealing with missing data
If you have explicit missing data, there are 3 main ways of proceeding:
- Drop the cases (rows) with any missing data from the analysis: a complete case analysis.
- Pro: Easy to implement
- Con: Reduces sample size and introduces bias if the missing data are not "missing completely at random"
- Create a category for missing values.
- Example: The survey question “What is your gender?” might only provide two possible responses: “male” and “female”. Missing values for this could indicate that the respondent uses a non-binary designation. Instead of dropping these cases, treating the missing data as its own category (“Does not wish to answer”) would be more appropriate.
- Pros: Maintains sample size, may help us if data is “missing not at random”
- Cons: Not directly applicable to continuous outcomes (could include interactions with a categorical version in models to account for it)
- Impute (or fill in values for) the missing data using imputation algorithms.
- Imputation algorithms can be as simple as replacing missing values with the mean of the non-missing values (very simplistic).
- Regression imputation algorithms use models to predict missing values as a function of other variables in the data.
- Pros: Maintains sample size, multiple regression imputation minimizes bias if “missing at random”
- Cons: Computationally intensive
. . .
Deciding between these options and proceeding with choosing finer details within an option requires an understanding of the mechanism by which data become missing.
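Before turning to those mechanisms, here is a minimal sketch of what each of the three options can look like in code; the survey tibble and its variables below are simulated and hypothetical:

library(dplyr)
library(tidyr)
library(mice)

set.seed(1)
survey <- tibble(
  age    = runif(100, min = 20, max = 70),
  income = 20 + 0.8 * age + rnorm(100, sd = 10),
  gender = sample(c("male", "female", NA), size = 100, replace = TRUE)
)
survey$income[sample(100, size = 20)] <- NA        # make some incomes missing

# Option 1: complete case analysis (drop rows with any missing value)
survey %>% drop_na()

# Option 2: treat missing values of a categorical variable as their own category
survey %>% mutate(gender = replace_na(gender, "Does not wish to answer"))

# Option 3: regression imputation with mice (income predicted from age)
imputed <- mice(survey %>% select(age, income), m = 5, method = "norm", printFlag = FALSE)
mice::complete(imputed, 1)                         # inspect the first imputed dataset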
Missing data mechanisms
The reasons for which a variable might have missing data are divided into 3 mechanisms: MCAR, MAR, and MNAR.
Within a dataset, multiple mechanisms may be present, so we need to consider the missingness mechanism for each variable individually. (A small simulation sketch after the list below shows how each mechanism can be generated.)
- Missing completely at random (MCAR):
- The probability of missing data for a variable is the same for all cases. Implies that causes of the missing data are unrelated to the data. (https://stefvanbuuren.name/fimd/sec-MCAR.html)
- Examples:
- Measurement device that runs out of batteries causes the remainder of observations for the day to be missing.
- Data entry software requires a particular field to be typo-free, and missing values are introduced when there are typos.
- Implications for downstream work:
- If a variable has MCAR missingness, a complete case analysis will be unbiased (still valid).
- However, with a lot of missing observations, a complete case analysis will suffer from loss of statistical power (ability to detect a real difference), and imputation will be useful to retain the original sample size.
- Missing at random (MAR):
- The probability of missing data is related to observed variables but unrelated to unobserved information.
- Examples:
- Blood pressure measurements tend to be missing in patients in worse health. (Those in worse health are more likely to miss clinic visits.) Better and worse health can be measured by a variety of indicators in their health record.
- In a survey, older people are more likely to report their income than younger people. Missingness is related to the observed age variable, but not to unobserved information.
- Implications for downstream work:
- Try to use imputation methods that predict the value of the missing variables from other observed variables. Assessing whether this can be done accurately takes some exploration–we’ll explore this shortly.
- Missing not at random (MNAR):
- The probability of missing data is related to unobserved variables (and probably observed variables too).
- Examples:
- Blood pressure measurements are more likely to be missing for those with the highest blood pressure. This is MNAR rather than MAR because the missing data on blood pressure is related to the unobserved values themselves.
- High-income individuals may be less likely to report their income.
- Implications for downstream work:
- Ideally, we would learn more about the causes for the missingness. This could allow us to use more informed imputation models.
- Example: Biological measurements that tend to be missing because of concentrations that are too low (a phenomenon known as left-censoring). Imputation methods specifically suited to left-censoring are useful here.
- We can use imputation methods with different assumptions about the missing data and try out a variety of assumptions. This lets us see how sensitive the results are under various scenarios.
- Example: If higher incomes are more likely to be missing, we can make different assumptions about what "high" could be to fill in the missing values and see how our results change under these different assumptions.
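To make these mechanisms concrete, here is a minimal simulation sketch (the variables x and y are hypothetical): the probability that y is missing depends on nothing (MCAR), on the observed variable x (MAR), or on the value of y itself (MNAR).

library(dplyr)

set.seed(2)
n <- 1000
sim <- tibble(
  x = runif(n),
  y = x + rnorm(n, sd = 0.5),
  # MCAR: the same missingness probability for every case
  y_mcar = if_else(rbinom(n, size = 1, prob = 0.3) == 1, NA, y),
  # MAR: missingness probability depends on the observed x
  y_mar = if_else(rbinom(n, size = 1, prob = x) == 1, NA, y),
  # MNAR: missingness probability depends on the (unobserved) value of y itself
  y_mnar = if_else(rbinom(n, size = 1, prob = plogis(2 * y)) == 1, NA, y)
)

# Complete case means of y under each mechanism: only MCAR leaves the mean roughly unbiased
sim %>% summarize(across(c(y, y_mcar, y_mar, y_mnar), ~ mean(.x, na.rm = TRUE)))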
Exercise
- Missing data mechanism: For each of the following situations, propose what missing data mechanism you think is most likely at play.
- In a clinical trial, some patients dropped out before the end of the study. Their reasons for dropping out were not recorded.
- A weather station records temperature, humidity, and wind speed every hour. Some of the recorded values are missing.
- A social media platform collects data on user interactions, such as likes, comments, and shares. Some interactions are not recorded due to bugs in the code.
Exploring missing data
Guiding question: How can we use visualizations and tabulations to explore what missing data mechanisms may be at play?
We'll look at the airquality dataset available in base R, which gives daily air quality measurements in New York from May to September 1973. You can pull up the codebook with ?airquality in the Console.
Missingness by variable
We can explore how much missingness there is for each variable with the following functions:
summary(airquality) # Summary statistics in addition to number of NA's
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
NA's :37 NA's :7
Month Day
Min. :5.000 Min. : 1.0
1st Qu.:6.000 1st Qu.: 8.0
Median :7.000 Median :16.0
Mean :6.993 Mean :15.8
3rd Qu.:8.000 3rd Qu.:23.0
Max. :9.000 Max. :31.0
naniar::vis_miss(airquality) # Where are NA's located?
naniar::miss_var_summary(airquality) # Information from vis_miss() in table form
# A tibble: 6 × 3
variable n_miss pct_miss
<chr> <int> <num>
1 Ozone 37 24.2
2 Solar.R 7 4.58
3 Wind 0 0
4 Temp 0 0
5 Month 0 0
6 Day 0 0
Missingness by case
We can explore how much missingness there is for each case with naniar::miss_case_summary(). For each case, this function calculates the number and percentage of variables with a missing value.
. . .
Impact of Information: If the pct_miss column is large for a case, we likely won't be able to impute any of its missing values because there just isn't enough known information; this case will have to be dropped from the analysis.
naniar::miss_case_summary(airquality)
# A tibble: 153 × 3
case n_miss pct_miss
<int> <int> <dbl>
1 5 2 33.3
2 27 2 33.3
3 6 1 16.7
4 10 1 16.7
5 11 1 16.7
6 25 1 16.7
7 26 1 16.7
8 32 1 16.7
9 33 1 16.7
10 34 1 16.7
# ℹ 143 more rows
Exploring missingness mechanisms
Assessing missingness mechanisms involves checking if missingness in a variable is related to other variables.
Note: Through our available data, we are really only able to explore the potential for MCAR or MAR mechanisms.
. . .
Impact of Information: There is always the chance that unobserved information (unobserved other variables or unobserved values of the variables we do have) is related to missingness for our variables, so to think through the potential for MNAR, more contextual information is necessary.
. . .
To explore these relationships, we can create TRUE/FALSE indicators of whether a variable is missing. In the plots below, we use is.na(Ozone) to explore whether cases with missing ozone values are noticeably different from cases with observed ozone values in terms of Solar.R.
ggplot(airquality, aes(x = is.na(Ozone), y = Solar.R)) +
geom_boxplot()
ggplot(airquality, aes(x = Solar.R, color = is.na(Ozone))) +
geom_density()
The above boxplots and density plots suggest that missing ozone is not strongly related to solar radiation levels.
. . .
We still should check if ozone missingness is related to the Wind, Temp, Month, and Day variables (to be done in Exercises).
In addition to checking if the chance of ozone missingness is related to Solar.R, we should check if the values of ozone could be predicted by Solar.R.
. . .
In the scatterplot below, we look at the relationship between Ozone and Solar.R and use vertical lines to indicate the Solar.R values for cases that are missing Ozone.
- Impact of Information: We see that missing Ozone cases are within the observed span of Solar.R, so we would be OK with predicting Ozone from Solar.R because there would be no extrapolation.
ggplot(airquality, aes(x = Solar.R, y = Ozone)) +
geom_point() +
geom_smooth() +
geom_vline(data = airquality %>% filter(is.na(Ozone)), mapping = aes(xintercept = Solar.R))
Exercises
- Mechanism detection practice: Look at the boxplot + scatterplot pairs for Alternate Situations 1 and 2 below. How do these situations compare to our actual situation and to each other? What concerns might arise from using a model to impute Ozone?
- Ozone mechanism detection: Continue the investigation of missingness for Ozone.
We want to see how Month, Wind, and Temp relate to the chance of missingness for Ozone and to the value of Ozone.
Does it look like a linear regression model (perhaps with variable transformations) could be effective in imputing the missing ozone data?
Regression imputation
When a model is built and used to generate a single set of predictions for missing values, this is known as single imputation.
. . .
- When using singly imputed data in subsequent modeling, the uncertainty in estimates tends to be underestimated. This means that:
- Standard errors are lower than they should be.
- Confidence intervals won’t contain the true parameter value the “advertised” percentage of times
- e.g., 95% confidence intervals will not contain the truth in 95% of samples–the coverage probability will be less than 95%
. . .
In multiple imputation, multiple imputed datasets are generated with different values for the filled-in data.
. . .
- Subsequent models are fit on each of these datasets, and both estimates and uncertainty measures are pooled across all of these fits.
- Multiple imputation more accurately estimates uncertainty measures. (A minimal sketch of the multiple imputation workflow with the mice package follows below.)
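To make the workflow concrete, here is a minimal sketch of multiple imputation with the mice package on the airquality data; the model formula and settings here are chosen for illustration only:

library(mice)

imp <- mice(airquality, m = 10, method = "norm", seed = 224, printFlag = FALSE)  # 10 imputed datasets
fits <- with(imp, lm(Ozone ~ Solar.R + Wind + Temp))  # fit the model on each imputed dataset
pooled <- pool(fits)                                  # pool estimates and uncertainty across the fits
summary(pooled, conf.int = TRUE)                      # pooled coefficients with confidence intervals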
Simulation Studies
We can use a simulation study to investigate the statistical properties described above.
1. Generate (simulate) data where we are in control of missing data mechanisms and the true relationship between an outcome and the predictors.
2. On that simulated data, use single imputation to fill in the missing values. Fit the desired model, and obtain a confidence interval for a coefficient of interest.
3. On that simulated data, use multiple imputation to fill in the missing values. Fit the desired models on all imputed datasets, pool results, and obtain a confidence interval for a coefficient of interest.
4. Repeat Steps 1 - 3 a large number of times (num_simulations <- 1000) to see how things work out in lots of different samples.
5. Summarize the performance of single and multiple imputation across the num_simulations simulations.
We will slowly step through the simulation study code below. We will pause frequently for you to add comments documenting what is happening.
set.seed(224)
num_simulations <- 1000
ci_list <- vector("list", length = num_simulations)

system.time({
for (i in 1:num_simulations) {
  # Simulate data
  n <- 1000
  sim_data <- tibble(
    x1 = runif(n, min = 0, max = 1),
    x2 = x1 + rnorm(n, mean = 0, sd = 1),
    x2_miss_bool = rbinom(n, size = 1, prob = x1/2),
    x2_NA = if_else(x2_miss_bool == 1, NA, x2),
    y = x1 + x2 + rnorm(n, mean = 0, sd = 1)
  )

  # Single imputation ---------------
  mice_obj <- mice(sim_data %>% select(x1, x2_NA, y), m = 1, method = "norm", printFlag = FALSE)
  si_mod <- with(mice_obj, lm(y ~ x1 + x2_NA))
  ci_single <- si_mod$analyses[[1]] %>% confint(level = 0.95)
  ci_single <- ci_single["x2_NA",]

  # Multiple imputation -------------
  mice_obj <- mice(sim_data %>% select(x1, x2_NA, y), m = 10, method = "norm", printFlag = FALSE)
  mi_mods <- with(mice_obj, lm(y ~ x1 + x2_NA))
  pooled_res <- pool(mi_mods)
  summ_pooled_res <- summary(pooled_res, conf.int = TRUE, conf.level = 0.95)
  ci_multiple_lower <- summ_pooled_res %>% filter(term=="x2_NA") %>% pull(`2.5 %`)
  ci_multiple_upper <- summ_pooled_res %>% filter(term=="x2_NA") %>% pull(`97.5 %`)

  # Store CI information
  ci_list[[i]] <- tibble(
    ci_lower = c(ci_single[1], ci_multiple_lower),
    ci_upper = c(ci_single[2], ci_multiple_upper),
    which_imp = c("single", "multiple")
  )
}
})
Below we compute the confidence interval (CI) coverage probability (the fraction of times the CI contains the true value of 1) for the CIs generated from single and multiple imputation:
ci_data <- bind_rows(ci_list)
ci_data %>%
  mutate(contains_truth = ci_lower < 1 & ci_upper > 1) %>%
  group_by(which_imp) %>%
  summarize(frac_contains_truth = mean(contains_truth))
Reflection
What were the challenges and successes of today? What resources do you need to take the next steps in your understanding of the ideas today?
Solutions
- Missing data mechanism
Solution
A variety of responses would be reasonable here:
- Clinical trial dropout: Probably MNAR. The chance of missingness is likely due to factors that aren't recorded (like discomfort).
- Weather station: Could be MCAR if the missing values are from technical glitches. Could be MAR if the missing values are related to other measured weather variables. Could be MNAR if the values tend to be missing when the values themselves are high (e.g., high temperature, humidity, and wind speed causing measurement devices to malfunction).
- Social media platform: Could be MCAR if the bugs affected the platform uniformly. Could be MAR if the bugs affected groups of users differently, but where the groupings are known (e.g., bugs affect mobile users more, and mobile vs. desktop usage is measurable). Could be MNAR if the bugs lead to performance issues that affect users with slower internet more (where internet speed is likely not measured).
- Mechanism detection practice
Solution
In both alternate situations, Ozone is missing when Solar.R is at higher values.
In Alternate Situation 1, we have some observed data in the range of Solar.R where Ozone is missing, so we could predict it fairly well.
In Alternate Situation 2, Ozone is missing when Solar.R is beyond the range of our observed data, so we don't have a good sense of what the relationship between Solar.R and Ozone would be there.
- Ozone mechanism detection
Solution
Missingness (MAR or potentially MNAR) of Ozone is related to:
- Month: More likely in June
Value of Ozone is related to:
- Month: Higher in July and August
- Wind: Lower when wind is higher
- Temp: Higher when temp is higher
ggplot(airquality, aes(fill = is.na(Ozone), x = factor(Month))) +
geom_bar(position = "fill")
ggplot(airquality, aes(x = is.na(Ozone), y = Wind)) +
geom_boxplot()
ggplot(airquality, aes(x = is.na(Ozone), y = Temp)) +
geom_boxplot()
ggplot(airquality, aes(x = factor(Month), y = Ozone)) +
geom_boxplot()
ggplot(airquality, aes(x = Wind, y = Ozone)) +
geom_point() +
geom_smooth() +
geom_vline(data = airquality %>% filter(is.na(Ozone)), mapping = aes(xintercept = Wind))
ggplot(airquality, aes(x = Temp, y = Ozone)) +
geom_point() +
geom_smooth() +
geom_vline(data = airquality %>% filter(is.na(Ozone)), mapping = aes(xintercept = Temp))
After Class
- Take a look at the Schedule page to see how to prepare for the next class.
- Quiz 1 Revisions due next Monday in class.
- Finish Homework 4
- I’ll post Homework 5 later today. It will be due next Tuesday.
References
- The naniar package provides a variety of functions for exploring missing data through visualizations and tabulations.
- Flexible Imputation of Missing Data: A free online book about missing data and imputation
- This section digs into missing data mechanisms.
- A chapter on missing data from Andrew Gelman’s online textbook