19 Code & Data Quality
Data Storytelling Moment
Learning goals
After this lesson, you should be able to:
- Understand the core principles of writing high-quality code
- Understand the core principles of cleaning data in a way that preserves data quality
Code Quality Checklist
When writing code for complex tasks (acquiring data, advanced wrangling, data simulation, data modeling, interactive web applications, etc.), check that it follows these principles.
- DRY (Don’t Repeat Yourself)
- If you find yourself copying and pasting code, write a function that can complete that simple task
- Readability
- Code is written for both the computer that runs it and the humans who read it
- Ensure the code is human readable and add comments to support that readability
- Meaningful Names
- Names should reflect the purpose and functionality of the element (e.g. data, function, etc.)
- Names should follow a naming convention (meaningful structure and consistent patterns) to make it easier to automate/iterate over names if necessary
- Testing / Test Cases
- For any reusable function, you should specify the inputs and the desired outputs.
- Write tests to cover use cases (including extreme, unusual cases) to ensure that the function provides the desired output as expected.
- Gen AI may be a good tool to help you write example test cases; you provide the description of the desired inputs and outputs
- Efficiency
- Quality code runs faster and uses fewer computing resources.
- Use the bench::mark() function to measure the time and memory usage of a function.
- To speed up code, find the slowest / most computationally intense part and consider whether you are unnecessarily repeating a task.
- Reproducible
- Quality code provides output that can be reproduced on different computer systems.
- To start, make sure you are using relative file paths.
- Simple Tasks in Functions
- Each function should complete only one small, simple task
- Avoid writing a loop inside a function; instead, have the loop call the function to run that task over and over
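The DRY, Meaningful Names, Testing, and Simple Tasks items above can be sketched together in a few lines of base R. This is only an illustration: `rescale_to_unit_range()` and the `measurements` data are hypothetical names invented for the example, not part of the course materials.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1].
# Writing it once replaces copy-pasted lines such as
#   (x - min(x)) / (max(x) - min(x))
rescale_to_unit_range <- function(values) {
  stopifnot(is.numeric(values))
  value_range <- range(values, na.rm = TRUE)
  # A constant vector has zero spread; return 0s (NA stays NA)
  # instead of dividing by zero
  if (value_range[1] == value_range[2]) {
    return(ifelse(is.na(values), NA_real_, 0))
  }
  (values - value_range[1]) / (value_range[2] - value_range[1])
}

# Test cases: a typical input, missing values, and an unusual (constant) input
stopifnot(isTRUE(all.equal(rescale_to_unit_range(c(0, 5, 10)), c(0, 0.5, 1))))
stopifnot(identical(is.na(rescale_to_unit_range(c(1, NA, 3))), c(FALSE, TRUE, FALSE)))
stopifnot(isTRUE(all.equal(rescale_to_unit_range(c(4, 4, 4)), c(0, 0, 0))))

# The iteration lives outside the function: lapply() calls it once per column
measurements <- data.frame(height_cm = c(150, 160, 170), weight_kg = c(50, 65, 80))
rescaled <- as.data.frame(lapply(measurements, rescale_to_unit_range))
```

Keeping the function to one small task is what makes the three test cases easy to write; the column-by-column iteration is handled by `lapply()` rather than by a loop buried inside the function.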
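For the Efficiency item, a minimal sketch of comparing two implementations with bench::mark() might look like the following. The two functions are hypothetical examples, and the timing step assumes the bench package is installed (it is skipped otherwise).

```r
# Two hypothetical ways to square every element of a vector
square_with_growing_loop <- function(values) {
  result <- c()
  for (value in values) {
    result <- c(result, value^2)  # copies `result` again on every iteration
  }
  result
}

square_vectorized <- function(values) {
  values^2
}

input_values <- as.numeric(1:1000)

# Both versions must agree before their speed is worth comparing
stopifnot(identical(
  square_with_growing_loop(input_values),
  square_vectorized(input_values)
))

# Assumption: the bench package is available; otherwise this step is skipped
if (requireNamespace("bench", quietly = TRUE)) {
  timings <- bench::mark(
    growing_loop = square_with_growing_loop(input_values),
    vectorized   = square_vectorized(input_values)
  )
  print(timings)  # compare the median time and mem_alloc columns
}
```

The growing-loop version repeats the same copying work on each pass, which is the kind of unnecessary repetition the checklist suggests looking for in the slowest part of the code.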
Data Quality Checklist
When wrangling / cleaning data, make sure to check the assumptions you make about the data to ensure you don’t lose data quality.
- Data Parsing (reading data into a different data format)
- Always keep the original, raw data (don’t modify it).
- Always check for missing values to see if the missing ones are expected given the original data.
- Use Test Cases: Find rows or write test cases to double check the wrangling works as expected
- DATES: When using lubridate to parse dates and times, make sure the parsing function matches the order and format of the components in the strings (e.g. mm/dd/yy vs. dd/mm/yy).
- STRINGS: When using stringr to parse strings with regular expressions, check example rows to ensure that the pattern captured all of the examples you want and excluded the patterns you don’t want.
- Data Joining
- Decide on the correct join type (left, right, inner, full, etc.) OR, if the datasets share the same structure, use list_rbind() to bind rows or list_cbind() to bind columns.
- If doing a join, make sure that the key variables (by) have the same meaning in both datasets and are represented in the same way (e.g., id = 1 to 20 in the first dataset will match id = 01 to 20 in the second in undesirable ways).
- Predict the number of rows that will result from the join, then use anti_join() to see which rows did not find a match.
- Check for duplicate records within each dataset and ensure they are handled appropriately before merging.
- Identify missing values in the key variables and decide how to handle them during the merge process (e.g., omitting rows with missing values, imputing missing values).
- Verify that the merged dataset maintains consistency with the original datasets in terms of data values, variable names, and variable types.
- Perform some preliminary analysis or validation checks on the merged dataset to ensure that it meets the requirements of your analysis.
- Sanity Check: Visualize your data!!!
- Do the right number of points appear?
- Do the values seem reasonable?
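The Data Parsing checks above can be sketched as follows, assuming the lubridate and stringr packages are installed. The raw strings and the ID pattern are made up for illustration.

```r
library(lubridate)
library(stringr)

# Hypothetical raw strings; the raw vector is kept and never overwritten
raw_dates <- c("03/04/21", "12/25/21", "13/01/21")

# The same strings parse differently depending on the assumed component order
parsed_as_mdy <- mdy(raw_dates, quiet = TRUE)
parsed_as_dmy <- dmy(raw_dates, quiet = TRUE)

# Missing values after parsing flag strings that do not fit the assumed order
failed_as_mdy <- raw_dates[is.na(parsed_as_mdy)]
failed_as_dmy <- raw_dates[is.na(parsed_as_dmy)]

# Test case: "03/04/21" is March 4 under mdy() but April 3 under dmy()
stopifnot(parsed_as_mdy[1] == as.Date("2021-03-04"))
stopifnot(parsed_as_dmy[1] == as.Date("2021-04-03"))

# STRINGS: check a regular expression against examples it should and should not match
# (hypothetical ID format: one capital letter, a hyphen, then digits)
example_ids <- c("A-101", "B-22", "101-A")
matches_id_pattern <- str_detect(example_ids, "^[A-Z]-\\d+$")
stopifnot(identical(matches_id_pattern, c(TRUE, TRUE, FALSE)))
```

Here neither order parses all three strings, so an unexpected `NA` is a signal to go back to the raw data and find out which format it actually uses, rather than a value to drop silently.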
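The Data Joining checks above can be walked through on a small example, assuming the dplyr package is installed. The `patients` and `visits` tables are hypothetical.

```r
library(dplyr)

# Hypothetical datasets: ids are integers in one table
# and zero-padded strings in the other
patients <- tibble(id = 1:3, age = c(34, 51, 29))
visits <- tibble(
  id = c("01", "02", "02", "04"),
  visit_cost = c(100, 80, 120, 60)
)

# Represent the key the same way in both datasets before joining
visits_clean <- visits |> mutate(id = as.integer(id))

# Check for duplicate keys, which multiply rows in a join
duplicate_keys <- visits_clean |> count(id) |> filter(n > 1)

# Predict the row count first: patient 1 has one visit, patient 2 has two,
# and patient 3 has none (kept as one row of NA), so 4 rows are expected
joined <- left_join(patients, visits_clean, by = "id")
stopifnot(nrow(joined) == 4)

# anti_join() shows which rows did not find a match, in each direction
unmatched_patients <- anti_join(patients, visits_clean, by = "id")
unmatched_visits <- anti_join(visits_clean, patients, by = "id")
```

In this sketch `unmatched_patients` holds the patient with no visits and `unmatched_visits` holds the visit whose id has no patient record; deciding what to do with each is part of the analysis, not something the join decides for you.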
After Class
- Take a look at the Schedule page to see how to prepare for the next class