Exploratory Data Analysis (EDA)

Brianna Heggeseth

Switch it Up

  • Sit with someone you don’t know well
  • Introduce yourself

Announcements

MSCS Happenings

From Tuesday

Midterm Revisions

For each problem I marked with an X,

  • write a corrected answer and a one-sentence explanation of why it is more correct (and perhaps why the original answer wasn’t).

Talk with others in the class; help each other understand the WHY.

Turn it in to me today or Friday.

Learning Goals

  • Understand the first steps that should be taken when you encounter a new data set
  • Develop comfort in knowing how to explore data to understand it
  • Develop comfort in formulating research questions

First Steps of a Data Analysis

Exploratory Data Analysis (EDA) is the name given to the process of

  1. “getting to know” a dataset, and
  2. trying to identify any meaningful insights within it.

Exploratory Data Analysis

The process of EDA, as described by Grolemund and Wickham.

Another way to describe EDA:

  1. Understand the basic data that is available to you.
  2. Visualize and describe the variables that seem most interesting or relevant.
  3. Formulate a research question.
  4. Analyze the data related to the research question, starting from simple analyses to more complex ones.
  5. Interpret your findings, refine your research question, and return to step 4.

See paper handout & online course website for more details.
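
A minimal sketch of what steps 1 and 2 can look like in R. It uses the built-in airquality data rather than the class data, and the choice of Ozone as the "interesting" variable is only an assumption for illustration.

```r
library(dplyr)
library(ggplot2)

# Step 1: understand the basic data that is available
data_dim <- dim(airquality)      # number of rows (days) and columns (variables)
data_dim
glimpse(airquality)              # variable names, types, and first few values

# How much is missing in each variable?
missing_counts <- colSums(is.na(airquality))
missing_counts

# Step 2: describe and visualize one variable that seems interesting
summary(airquality$Ozone)        # min, quartiles, mean, max, and NA count

ozone_hist <- ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(bins = 20, na.rm = TRUE) +   # na.rm drops missing days quietly
  labs(x = "Mean ozone (ppb)", y = "Number of days")
ozone_hist
```

Asking "what does one row represent?" and "what is missing?" before plotting anything tends to make the later steps much easier to interpret.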

EDA Examples

Practice: Flight Data

Open 13-EDA on the course website for exercises.

Working Together

I want you to work in pairs (3 if needed). List your partner on your Rmd file.

  • Exercises 1-3 are about Understanding Data
    • Work together to ensure you both have understanding.
  • Exercise 4 is Visualizing and Describing
    • Share what you learn with your partner.
  • Exercise 5 is Formulating a Research Question
    • Come up with one specific question together.
  • Exercise 6 is creating a visualization to address that question.
    • Each individual can create their own visualization.
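
As a rough illustration of the pattern in Exercises 5 and 6 (one specific question, then one visualization that addresses it), here is a sketch on the built-in airquality data, chosen so that it does not give away the flight exercises. The research question and the object names are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical research question: do ozone levels differ by month?

# Keep only days with an ozone measurement and give months readable labels
ozone_days <- airquality %>%
  filter(!is.na(Ozone)) %>%
  mutate(Month = factor(Month, levels = 5:9, labels = month.abb[5:9]))

# Start simple: a numeric summary for each month
ozone_summary <- ozone_days %>%
  group_by(Month) %>%
  summarize(n_days = n(), median_ozone = median(Ozone))
ozone_summary

# Then a visualization that speaks directly to the question
ozone_plot <- ggplot(ozone_days, aes(x = Month, y = Ozone)) +
  geom_boxplot() +
  labs(x = "Month", y = "Mean ozone (ppb)",
       title = "Do ozone levels differ by month?")
ozone_plot
```

A question that names specific variables ("ozone" and "month") usually maps directly onto the aesthetics of a plot, which is one way to check that a question is specific enough.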

Practice: Flight Data

Let’s practice these steps using data about flight delays from Kaggle.

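library(readr)  # provides read_csv(); also loaded by library(tidyverse)

# Three related tables: an airline lookup, an airport lookup, and a sample of flights
airlines <- read_csv("https://bcheggeseth.github.io/112_fall_2023/data/airlines.csv")
airports <- read_csv("https://bcheggeseth.github.io/112_fall_2023/data/airports.csv")
flights <- read_csv("https://bcheggeseth.github.io/112_fall_2023/data/flights_jan_jul_sample2.csv")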

head(airlines)
# A tibble: 6 × 2
  IATA_CODE AIRLINE               
  <chr>     <chr>                 
1 UA        United Air Lines Inc. 
2 AA        American Airlines Inc.
3 US        US Airways Inc.       
4 F9        Frontier Airlines Inc.
5 B6        JetBlue Airways       
6 OO        Skywest Airlines Inc. 
head(airports)
# A tibble: 6 × 7
  IATA_CODE AIRPORT                       CITY  STATE COUNTRY LATITUDE LONGITUDE
  <chr>     <chr>                         <chr> <chr> <chr>      <dbl>     <dbl>
1 ABE       Lehigh Valley International … Alle… PA    USA         40.7     -75.4
2 ABI       Abilene Regional Airport      Abil… TX    USA         32.4     -99.7
3 ABQ       Albuquerque International Su… Albu… NM    USA         35.0    -107. 
4 ABR       Aberdeen Regional Airport     Aber… SD    USA         45.4     -98.4
5 ABY       Southwest Georgia Regional A… Alba… GA    USA         31.5     -84.2
6 ACK       Nantucket Memorial Airport    Nant… MA    USA         41.3     -70.1
head(flights)
# A tibble: 6 × 31
   YEAR MONTH   DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
  <dbl> <dbl> <dbl>       <dbl> <chr>           <dbl> <chr>       <chr>         
1  2015     1     1           4 AS                 98 N407AS      ANC           
2  2015     1     1           4 AA               2336 N3KUAA      LAX           
3  2015     1     1           4 US                840 N171US      SFO           
4  2015     1     1           4 AA                258 N3HYAA      LAX           
5  2015     1     1           4 AS                135 N527AS      SEA           
6  2015     1     1           4 DL                806 N3730B      SFO           
# ℹ 23 more variables: DESTINATION_AIRPORT <chr>, SCHEDULED_DEPARTURE <chr>,
#   DEPARTURE_TIME <chr>, DEPARTURE_DELAY <dbl>, TAXI_OUT <dbl>,
#   WHEELS_OFF <chr>, SCHEDULED_TIME <dbl>, ELAPSED_TIME <dbl>, AIR_TIME <dbl>,
#   DISTANCE <dbl>, WHEELS_ON <chr>, TAXI_IN <dbl>, SCHEDULED_ARRIVAL <chr>,
#   ARRIVAL_TIME <chr>, ARRIVAL_DELAY <dbl>, DIVERTED <dbl>, CANCELLED <dbl>,
#   CANCELLATION_REASON <chr>, AIR_SYSTEM_DELAY <dbl>, SECURITY_DELAY <dbl>,
#   AIRLINE_DELAY <dbl>, LATE_AIRCRAFT_DELAY <dbl>, WEATHER_DELAY <dbl>
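
One possible first step in understanding how these tables relate is to check whether the airline codes in the flights table match the lookup table. The sketch below is not part of the exercises. To run without a network connection, it rebuilds only the first six rows printed above, so the names airlines_head, flights_head, and AIRLINE_NAME are assumptions made for illustration. In class you would use the full airlines and flights tables instead.

```r
library(tibble)
library(dplyr)

# First six rows of each table, copied from the head() output above
airlines_head <- tibble(
  IATA_CODE = c("UA", "AA", "US", "F9", "B6", "OO"),
  AIRLINE   = c("United Air Lines Inc.", "American Airlines Inc.",
                "US Airways Inc.", "Frontier Airlines Inc.",
                "JetBlue Airways", "Skywest Airlines Inc.")
)
flights_head <- tibble(
  AIRLINE        = c("AS", "AA", "US", "AA", "AS", "DL"),
  FLIGHT_NUMBER  = c(98, 2336, 840, 258, 135, 806),
  ORIGIN_AIRPORT = c("ANC", "LAX", "SFO", "LAX", "SEA", "SFO")
)

# How many flights does each airline code have?
flights_head %>% count(AIRLINE, sort = TRUE)

# Which flights have an airline code with no match in the lookup table?
unmatched <- flights_head %>%
  anti_join(airlines_head, by = c("AIRLINE" = "IATA_CODE"))
unmatched

# AIRLINE is a code in flights but a full name in airlines,
# so rename the lookup column before joining to avoid a name clash
flights_named <- flights_head %>%
  left_join(
    airlines_head %>% rename(AIRLINE_NAME = AIRLINE),
    by = c("AIRLINE" = "IATA_CODE")
  )
flights_named
```

Here the AS and DL flights come back unmatched only because the sketch uses six rows of the lookup table. Running the same anti_join() on the full tables shows whether any codes are truly missing.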

After Class

  • Complete the one exercise for Assignment 8 (Data Import): find a new dataset, import it, and create a visual

  • Finish these exercises for Assignment 8 (EDA)

  • Make sure you come up with a specific research question with your partner during class today.

  • Midterm Revisions due Friday

  • IV1 due next week