Exploratory Data Analysis (EDA)

Brianna Heggeseth

Switch it Up

  • Sit with someone you don’t know well
  • Introduce yourself
  • Ask them about their Fall Break!

Announcements

MSCS Happenings

  • 3:30pm Wednesday Problem solving
  • 11:15am Thursday Coffee Break
  • 12-1pm Thursday Iowa State Grad School Info Session

From Last Wednesday

Brainstorming Activity

In the back of your brain, start thinking about project ideas.

Each of you will generate 2-3 ideas.

By Friday night (updated!), you’ll submit those ideas to Moodle.

Midterm Revisions Part 1

For each problem I marked with an X,

  • write a more correct answer and a one-sentence explanation of why it is more correct (and perhaps why the original answer wasn’t).

Talk with others in the class; help each other understand the WHY.

Turn it in to me by next class.

Midterm Revisions Part 2

  • UPDATE: You should have been notified of a shared PDF with feedback.

  • Talk through some of the stumbling blocks with your classmates. Take notes for yourself.

  • By the end of THIS week, submit an updated version of the Midterm Part 2 to Moodle and write a reflection about the midterm in your spreadsheet.

My Deal: You may talk to others in the class (not preceptors, not people who have previously taken it) but you may not directly share code with each other. Instead, talk about the actions more conceptually and point each other to resources.

Learning Goals

  • Understand the first steps that should be taken when you encounter a new data set
  • Develop comfort in knowing how to explore data to understand it
  • Develop comfort in formulating research questions

First Steps of a Data Analysis

Exploratory Data Analysis (EDA) is the name given to the process of

  1. “getting to know” a dataset, and
  2. trying to identify any meaningful insights within it.

Exploratory Data Analysis

The process of EDA, as described by Grolemund and Wickham.

Another way to describe EDA:

  1. Understand the basic data that is available to you.
  2. Visualize and describe the variables that seem most interesting or relevant.
  3. Formulate a research question.
  4. Analyze the data related to the research question, starting from simple analyses to more complex ones.
  5. Interpret your findings, refine your research question, and return to step 4.

Understand the Basic Data

  1. Start by understanding the data that is available to you.
  • Where does my data come from? How was it collected?
    • WHO (is it a sample of a larger data set and, if so, how was the sampling done? Randomly? All cases during a specific time frame? All data for a selected set of users?),
    • WHEN (is this current data or historical? what events may have had an impact?),
    • WHAT (what variables were measured? how were they measured: self-reported through a questionnaire or measured directly?),
    • WHY (who funded the data collection? for what purposes was the data collected? to whose benefit was the data collected?)
  • Is there a codebook? If not, how can I learn about it?
    • Are there people I can reach out to who have experience with this data?

Understand the Basic Data

  2. Next, you need to load the data and clean it. Once the data is loaded, ask yourself about each table:
  • What is an observation?
  • How many observations are there?
  • What is the meaning of each variable?
  • What is the type or class of each variable (date, location, string, factor, number, boolean, etc.)?

Useful R functions:

  • str() to learn about the number of variables and observations as well as the classes of the variables
  • head() to view the top of the data table (you can specify the number of rows with n = )
  • tail() to view the bottom of the data table
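
As a minimal sketch of these functions in use, the block below runs them on `USArrests`, a data set that ships with base R. That data set and the name `crime` are stand-ins chosen for illustration; they are not the data shown in the example output in these slides.

```r
# Assumption: USArrests (built into R) stands in for your own data table.
crime <- USArrests

str(crime)            # number of observations and variables, plus each variable's class
head(crime, n = 3)    # first 3 rows
tail(crime, n = 3)    # last 3 rows

dim(crime)            # number of rows (observations), then columns (variables)
sapply(crime, class)  # class of each variable
```

Here each row is one observation (a state), so `dim()` answers "how many observations are there?" directly.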

Example

spec_tbl_df [52 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state              : chr [1:52] "United States" "Alabama" "Alaska" "Arizona" ...
 $ murder             : num [1:52] 5.6 8.2 4.8 7.5 6.7 6.9 3.7 2.9 4.4 35.4 ...
 $ forcible_rape      : num [1:52] 31.7 34.3 81.1 33.8 42.9 26 43.4 20 44.7 30.2 ...
 $ robbery            : num [1:52] 140.7 141.4 80.9 144.4 91.1 ...
 $ aggravated_assault : num [1:52] 291 248 465 327 387 ...
 $ burglary           : num [1:52] 727 954 622 948 1085 ...
 $ larceny_theft      : num [1:52] 2286 2650 2599 2965 2711 ...
 $ motor_vehicle_theft: num [1:52] 417 288 391 924 262 ...
 $ population         : num [1:52] 2.96e+08 4.55e+06 6.69e+05 5.97e+06 2.78e+06 ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   murder = col_double(),
  ..   forcible_rape = col_double(),
  ..   robbery = col_double(),
  ..   aggravated_assault = col_double(),
  ..   burglary = col_double(),
  ..   larceny_theft = col_double(),
  ..   motor_vehicle_theft = col_double(),
  ..   population = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
# A tibble: 6 × 9
  state         murder forcibl…¹ robbery aggra…² burgl…³ larce…⁴ motor…⁵ popul…⁶
  <chr>          <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 United States    5.6      31.7   141.     291.    727.   2286.    417.  2.96e8
2 Alabama          8.2      34.3   141.     248.    954.   2650     288.  4.55e6
3 Alaska           4.8      81.1    80.9    465.    622.   2599.    391   6.69e5
4 Arizona          7.5      33.8   144.     327.    948.   2965.    924.  5.97e6
5 Arkansas         6.7      42.9    91.1    387.   1085.   2711.    262.  2.78e6
6 California       6.9      26     176.     317.    693.   1916.    713.  3.58e7
# … with abbreviated variable names ¹​forcible_rape, ²​aggravated_assault,
#   ³​burglary, ⁴​larceny_theft, ⁵​motor_vehicle_theft, ⁶​population
# A tibble: 6 × 9
  state         murder forcibl…¹ robbery aggra…² burgl…³ larce…⁴ motor…⁵ popul…⁶
  <chr>          <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 Vermont          1.3      23.3    11.7    83.5    492.   1686.    103.  618814
2 Virginia         6.1      22.7    99.2   155.     392.   2035     211. 7563887
3 Washington       3.3      44.7    92.1   206.     960.   3150.    784. 6261282
4 West Virginia    4.4      17.7    44.6   206.     621.   1794     210  1803920
5 Wisconsin        3.5      20.6    82.2   135.     441.   1993.    227. 5541443
6 Wyoming          2.7      24      15.3   188.     476.   2534.    145.  506242
# … with abbreviated variable names ¹​forcible_rape, ²​aggravated_assault,
#   ³​burglary, ⁴​larceny_theft, ⁵​motor_vehicle_theft, ⁶​population

Understand the Basic Data

  3. Finally, ask yourself about the relationships between tables:
  • What variables are keys and link the tables (i.e., which variables can you use in join commands)?
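
One way to check for keys is sketched below with two small, hypothetical tables (`crime_rates` and `state_regions` are made up for illustration) that share a `state` variable. The sketch uses base R's `merge()`; if you work in the tidyverse, `dplyr::left_join(crime_rates, state_regions, by = "state")` plays the same role.

```r
# Hypothetical toy tables; `state` is the variable they have in common.
crime_rates <- data.frame(
  state  = c("Alabama", "Alaska", "Arizona"),
  murder = c(8.2, 4.8, 7.5)
)
state_regions <- data.frame(
  state  = c("Alabama", "Alaska", "Arizona"),
  region = c("South", "West", "West")
)

# Variable names that appear in both tables are candidate keys.
shared_vars <- intersect(names(crime_rates), names(state_regions))
shared_vars

# A key should identify each row uniquely: 0 means no duplicated values.
anyDuplicated(crime_rates$state)
anyDuplicated(state_regions$state)

# Join on the key, keeping every row of crime_rates (a left join).
crime_with_region <- merge(crime_rates, state_regions, by = "state", all.x = TRUE)
crime_with_region
```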

Visualize and Describe the Data

  1. Do some univariate visualizations; e.g., plotting histograms, densities, and box plots of different variables.
  • What do you see that is interesting?
  • Which values are most common or unusual (outliers)?
  • Is there a lot of missing data?
  • What type of variation occurs within the individual variables?
  • What might be causing the interesting findings?
  • How could you figure out whether your ideas are correct?
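
A minimal sketch of these univariate checks, using base R graphics and the built-in `USArrests` data as a stand-in for your own table (the variable choices are illustrative only):

```r
# Assumption: USArrests (built into R) stands in for your own data table.
crime <- USArrests

summary(crime$Murder)   # min, quartiles, mean, max for one variable
colSums(is.na(crime))   # count of missing values in each variable

# Three views of the distribution of a single variable
hist(crime$Murder, main = "Murder arrests per 100,000", xlab = "Rate")
plot(density(crime$Murder), main = "Density of murder arrest rate")
boxplot(crime$Rape, horizontal = TRUE, main = "Rape arrests per 100,000")

# Values the box plot rule (more than 1.5 x IQR beyond the hinges) flags as unusual
rape_outliers <- boxplot.stats(crime$Rape)$out
rownames(crime)[crime$Rape %in% rape_outliers]
```

The last line only names which observations are flagged; deciding whether they are errors or genuinely interesting cases is part of the exploration.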

Visualize and Describe the Data

  2. Then examine the covariation between different variables.

One convenient way to do this is with a pairs plot.

Examples

The main point of such plots is not necessarily to draw any conclusions, but to help generate more specific research questions and hypotheses.
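
A pairs plot can be sketched with base R's `pairs()`, again using the built-in `USArrests` data as a stand-in for your own table. `GGally::ggpairs()` is a ggplot2-based alternative if that package is installed. The correlation matrix is an optional numeric companion to the picture, not a substitute for looking at it.

```r
# Assumption: USArrests (built into R) stands in for your own data table.
crime <- USArrests

# One scatterplot for every pair of numeric variables
pairs(crime, main = "USArrests: pairwise scatterplots")

# Pairwise correlations as a rough numeric summary of the same relationships
crime_cor <- cor(crime)
round(crime_cor, 2)
```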

Formulate a Research Question

You will often end up with a lot of data, and it can be easy to be overwhelmed.

How should you get started?

  1. One easy idea is to brainstorm ideas for research questions, and pick one that seems promising. This process is much easier with more than one brain!
  2. You will often be working off of a broad question posed by a business, organization, or supervisor, and be thinking about how to narrow it down.

To do so, you can again revisit questions like “What patterns do you see?” or “Why might they be occurring?”

EDA Examples

Practice: Flight Data

Let’s practice these steps using data about flight delays from Kaggle. Download the template Rmd file from the course website.

After Class

  • Finish this activity for Assignment 11 (EDA)

  • Brainstorm Activity due Wednesday

  • Midterm Revisions Part 2 due Friday

  • IV1 due next week

    • If you turned in IV0 last week, feedback will come in the next 24 hours.