Exploratory Data Analysis (EDA)

Brianna Heggeseth

Switch it Up

  • Sit with someone you don’t know well
  • Introduce yourself

Announcements

MSCS Happenings

From Tuesday

Midterm Revisions Part 1

For each problem I marked with an X,

  • write a more correct answer and a one-sentence description of why it is more correct (or why the original answer wasn’t correct).

Talk with others in the class; help each other understand the WHY.

Turn in to me by next class.

Midterm Revisions Part 2

  • UPDATE: You should have been notified of a shared pdf with feedback

  • Talk through some of the stumbling blocks with your classmates. Take notes for yourself.

  • By the end of THIS week, submit an updated version of the Midterm Part 2 to Moodle and write a reflection about the midterm in your reflection Google Doc for March.

My Deal: You may talk to others in the class (not preceptors, not people who have previously taken it) but you may not directly share code with each other. Instead, talk about the actions more conceptually and point each other to resources.

Learning Goals

  • Understand the first steps that should be taken when you encounter a new data set
  • Develop comfort in knowing how to explore data to understand it
  • Develop comfort in formulating research questions

First Steps of a Data Analysis

Exploratory Data Analysis (EDA) is a name given to the process of

  1. “getting to know” a dataset, and
  2. trying to identify any meaningful insights within it.

Exploratory Data Analysis

The process of EDA, as described by Grolemund and Wickham.

Another way to describe EDA:

  1. Understand the basic data that is available to you.
  2. Visualize and describe the variables that seem most interesting or relevant.
  3. Formulate a research question.
  4. Analyze the data related to the research question, starting from simple analyses to more complex ones.
  5. Interpret your findings, refine your research question, and return to step 4.

See paper handout & online course website for more details.
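
As a rough illustration of steps 1 through 4, here is a minimal sketch. It assumes the dplyr package is installed and uses R's built-in mtcars data, which is not part of these slides. The research question in step 3 is a hypothetical one chosen only for the example.

```r
library(dplyr)

# Step 1: understand the basic data (size, variable names, variable types)
dim(mtcars)       # number of rows and columns
glimpse(mtcars)   # one line per variable, with its type and first values

# Step 2: describe a variable that seems interesting
summary(mtcars$mpg)

# Step 3 (hypothetical question, for illustration only):
# do cars with more cylinders tend to have lower gas mileage?

# Step 4: start with a simple analysis related to the question
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_mpg = mean(mpg))

mpg_by_cyl
```

Step 5 would then be to interpret the table, refine the question (for example, by also accounting for car weight), and return to step 4.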

EDA Examples

Practice: Flight Data

Open 13-EDA on the course website for exercises.

Working Together

I want you to work in pairs (3 if needed). List your partner on your Rmd file.

  • Exercises 1-3 are about Understanding Data
    • Work together to ensure you both have understanding.
  • Exercise 4 is Visualizing and Describing
    • Share what you learn with your partner.
  • Exercise 5 is Formulating a Research Question
    • Come up with one specific question together.
  • Exercise 6 is creating a visualization to address that question.
    • Each individual can create their own visualization.

Practice: Flight Data

Let’s practice these steps using data about flight delays from Kaggle.

library(tidyverse)  # loads readr, which provides read_csv()

airlines <- read_csv("https://bcheggeseth.github.io/112_spring_2023/data/airlines.csv")
airports <- read_csv("https://bcheggeseth.github.io/112_spring_2023/data/airports.csv")
flights <- read_csv("https://bcheggeseth.github.io/112_spring_2023/data/flights_jan_jul_sample2.csv")

head(airlines)
# A tibble: 6 × 2
  IATA_CODE AIRLINE               
  <chr>     <chr>                 
1 UA        United Air Lines Inc. 
2 AA        American Airlines Inc.
3 US        US Airways Inc.       
4 F9        Frontier Airlines Inc.
5 B6        JetBlue Airways       
6 OO        Skywest Airlines Inc. 
head(airports)
# A tibble: 6 × 7
  IATA_CODE AIRPORT                          CITY  STATE COUNTRY LATIT…¹ LONGI…²
  <chr>     <chr>                            <chr> <chr> <chr>     <dbl>   <dbl>
1 ABE       Lehigh Valley International Air… Alle… PA    USA        40.7   -75.4
2 ABI       Abilene Regional Airport         Abil… TX    USA        32.4   -99.7
3 ABQ       Albuquerque International Sunpo… Albu… NM    USA        35.0  -107. 
4 ABR       Aberdeen Regional Airport        Aber… SD    USA        45.4   -98.4
5 ABY       Southwest Georgia Regional Airp… Alba… GA    USA        31.5   -84.2
6 ACK       Nantucket Memorial Airport       Nant… MA    USA        41.3   -70.1
# … with abbreviated variable names ¹​LATITUDE, ²​LONGITUDE
head(flights)
# A tibble: 6 × 31
   YEAR MONTH   DAY DAY_OF_WEEK AIRLINE FLIGHT…¹ TAIL_…² ORIGI…³ DESTI…⁴ SCHED…⁵
  <dbl> <dbl> <dbl>       <dbl> <chr>      <dbl> <chr>   <chr>   <chr>   <chr>  
1  2015     1     1           4 AS            98 N407AS  ANC     SEA     0005   
2  2015     1     1           4 AA          2336 N3KUAA  LAX     PBI     0010   
3  2015     1     1           4 US           840 N171US  SFO     CLT     0020   
4  2015     1     1           4 AA           258 N3HYAA  LAX     MIA     0020   
5  2015     1     1           4 AS           135 N527AS  SEA     ANC     0025   
6  2015     1     1           4 DL           806 N3730B  SFO     MSP     0025   
# … with 21 more variables: DEPARTURE_TIME <chr>, DEPARTURE_DELAY <dbl>,
#   TAXI_OUT <dbl>, WHEELS_OFF <chr>, SCHEDULED_TIME <dbl>, ELAPSED_TIME <dbl>,
#   AIR_TIME <dbl>, DISTANCE <dbl>, WHEELS_ON <chr>, TAXI_IN <dbl>,
#   SCHEDULED_ARRIVAL <chr>, ARRIVAL_TIME <chr>, ARRIVAL_DELAY <dbl>,
#   DIVERTED <dbl>, CANCELLED <dbl>, CANCELLATION_REASON <chr>,
#   AIR_SYSTEM_DELAY <dbl>, SECURITY_DELAY <dbl>, AIRLINE_DELAY <dbl>,
#   LATE_AIRCRAFT_DELAY <dbl>, WEATHER_DELAY <dbl>, and abbreviated variable …
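
One possible way to start "understanding the data" is sketched below. To keep it runnable without downloading anything, it rebuilds tiny tibbles by hand from the head() output above (only the first six rows, and only five of the 31 flights columns). The names airlines_head, flights_head, airline_counts, flights_named, and AIRLINE_NAME are made up for this example. With the full data you would use airlines and flights directly.

```r
library(dplyr)
library(tibble)

# Hand-copied from the head(airlines) output above (first six rows only)
airlines_head <- tribble(
  ~IATA_CODE, ~AIRLINE,
  "UA", "United Air Lines Inc.",
  "AA", "American Airlines Inc.",
  "US", "US Airways Inc.",
  "F9", "Frontier Airlines Inc.",
  "B6", "JetBlue Airways",
  "OO", "Skywest Airlines Inc."
)

# Hand-copied from the head(flights) output above (five of the 31 columns)
flights_head <- tribble(
  ~YEAR, ~MONTH, ~DAY, ~DAY_OF_WEEK, ~AIRLINE,
  2015, 1, 1, 4, "AS",
  2015, 1, 1, 4, "AA",
  2015, 1, 1, 4, "US",
  2015, 1, 1, 4, "AA",
  2015, 1, 1, 4, "AS",
  2015, 1, 1, 4, "DL"
)

# How big is the data, and what type is each variable?
dim(flights_head)
glimpse(flights_head)

# How many flights does each airline code have?
airline_counts <- flights_head %>%
  count(AIRLINE, sort = TRUE)

# Attach the full airline name to each flight.
# The name column is renamed first so it does not clash with the code column.
flights_named <- flights_head %>%
  left_join(
    airlines_head %>% rename(AIRLINE_NAME = AIRLINE),
    by = c("AIRLINE" = "IATA_CODE")
  )

flights_named
```

In this tiny sample the AS and DL flights get a missing AIRLINE_NAME, but only because the six hand-copied airline rows do not include those codes. Checking for unmatched keys after a join is still a useful habit when getting to know related tables.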

After Class

  • Complete the one exercise for Assignment 8 (Data Import): find a new dataset, import it, and create a visual

  • Finish these exercises for Assignment 8 (EDA)

  • Make sure you come up with a specific research question with your partner during class today.

  • Midterm Revisions Part 2 due Friday

  • IV1 due next week