16 Exploratory Data Analysis (EDA)
- Understand the first steps that should be taken when you encounter a new data set
- Develop comfort in knowing how to explore data to understand it
- Develop comfort in formulating research questions
Read:
- Exploratory Data Analysis (Wickham, Çetinkaya-Rundel, & Grolemund)
- Exploratory Data Analysis Checklist (Peng)
WHERE ARE WE?!? Starting a data project
This final, short unit will help prepare us as we launch into course projects. In order to even start these projects, we need some sense of the following:
- data import: how to find data, store data, load data into RStudio, and do some preliminary data checks & cleaning
- exploratory data analysis (EDA)
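As a rough illustration of what "preliminary data checks" can look like right after import, here is a minimal sketch. It is written in Python using only the standard library, which is an assumption of this example; the same checks carry over to R in RStudio. The inline data, the column names, and the helpers `load_rows` and `missing_counts` are all hypothetical.

```python
import csv
import io

# Hypothetical data standing in for a real file. In practice you would
# open a CSV file from disk instead of parsing this inline string.
RAW_CSV = """name,group,score
ana,a,3.5
ben,a,
cam,b,4.0
dee,b,2.5
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_counts(rows):
    """Count the empty cells in each column."""
    counts = {name: 0 for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts


rows = load_rows(RAW_CSV)

# Three quick checks: how many rows, which columns, and where data is missing.
print("rows:", len(rows))
print("columns:", list(rows[0].keys()))
print("missing:", missing_counts(rows))
```

Checks like these are cheap to run and tend to surface problems (an unexpected row count, a misnamed column, a mostly empty variable) before any analysis depends on them.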
16.1 Warm-up
What is EDA?!
EDA is a preliminary, exploratory, and iterative analysis of our data relative to our general research questions of interest.
How is this different from what we’ve been doing?
We’ve been focusing on various tools needed for various steps within an EDA. Now we’ll bring them all together in a more cohesive process.
EDA essentials
Start small.
We often start with lots of data, some of it useful and some of it not. To start:
- Focus on just a small set of variables of interest.
- Break down your research question into smaller pieces.
- Obtain the simplest numerical & visual summaries that are relevant to your research questions.
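The "start small" steps above could be sketched as follows: keep only two variables of interest and compute the simplest numerical summaries for each group. This is a hedged illustration in Python with the standard library only; the records, the variable names, and the `summarize` helper are hypothetical and not taken from any course data set.

```python
import statistics

# Hypothetical records. Only "group" and "score" matter for our question;
# "id" and "notes" are extra variables we deliberately set aside.
records = [
    {"id": 1, "group": "a", "score": 3.0, "notes": "first visit"},
    {"id": 2, "group": "a", "score": 5.0, "notes": ""},
    {"id": 3, "group": "b", "score": 4.0, "notes": "late"},
    {"id": 4, "group": "b", "score": 8.0, "notes": ""},
]


def summarize(values):
    """Return a few basic numerical summaries of a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


# Step 1: focus on a small set of variables.
small = [(r["group"], r["score"]) for r in records]

# Step 2: split the scores by group.
by_group = {}
for group, score in small:
    by_group.setdefault(group, []).append(score)

# Step 3: compute the simplest summaries for each group.
summaries = {group: summarize(scores) for group, scores in by_group.items()}
for group, summary in summaries.items():
    print(group, summary)
```

Even a summary this small invites follow-up questions, for example why the two groups differ and whether the gap would hold up with more data.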
Ask questions.
We typically start a data analysis with at least some general research questions in mind. In obtaining numerical and graphical summaries that provide insight into these questions, we must ask:
- what questions do these summaries answer?
- what questions don’t these summaries answer?
- what’s surprising or interesting here?
- what follow-up questions do these summaries provoke?
Play!
Be creative. Don’t lock yourself into a rigid idea of what should happen.
Repeat.
Repeat this iterative questioning and analysis process as necessary, letting our reflections on the previous questions inspire our next steps.
16.2 Exercises
Do the Homework 7 exercises.
16.3 Wrap-up
- Upcoming due dates
- Wednesday by 5pm: Project Milestone 1 - pre-project survey (linked on Moodle). This is required, and is the first “milestone” for the project.
- Thursday: Quiz 2 revisions
- Thursday by 11:59pm: Homework 7 (on Moodle)
- Thursday’s class
- We’ll do project brainstorming and start thinking about project groups. Attendance is important!
- Roughly half the class will be work time for Homework 7
- Registration info
- MSCS Registration Ice Cream Social: Thursday November 7, 11:30am-12:30pm in OLRI Smail Gallery
- Information on waitlists for all MSCS courses will be listed here.
- Courses to consider
- COMP/STAT 212 (Intermediate Data Science)
- Prereqs = 112, STAT 155, COMP 123
- Recommended = STAT 253
- Topics: similar themes to 112 but more advanced / in depth approaches
- STAT 155 (Intro to Statistical Modeling)
- Prereqs = none
- Postreqs = 155 is required for all STAT courses beyond the 100-level
- Topics: As in 112, you’ll use data to explore relationships of interest. But unlike 112, in which this exploration is observational and restricted to lower dimensions, 155 explores how to model relationships and use these models to make inferences and predictions regarding the population outside our dataset.
- Overlap: 155 uses similar wrangling and viz tools as 112, but these are not the emphasis.
- STAT 253 (Statistical Machine Learning)
- Prereqs = STAT 155
- Topics: Like 112 and 155, 253 focuses on data analysis. It surveys a wide variety of algorithms / models beyond those in 155, and thus greatly expands the types of relationships we can study.
- COMP 123 (Core concepts in computer science)
- Prereqs = none
- Postreqs = 123 is a core requirement for all COMP courses
- Topics: This course isn’t focused on data. As the title suggests, it focuses on core concepts in computer science.