5.5 Simulating Randomization into Groups

So far we have been estimating the difference in mean arrival delays between morning and afternoon flights. What we really want to know is whether that difference is real, because the answer should influence how we book flights. If the true difference in means is 0, then on average it makes no difference whether a flight is scheduled in the morning or in the afternoon.

Let’s use a random sample of 500 flights from the population to investigate this question using a different approach.

# Draw a simple random sample of 500 flights (the rows drawn vary from run to run)
flights_samp500 <- flights %>%
    sample_n(size = 500)

Let’s summarize and visualize the relationship between hour of the day (morning or afternoon) and the arrival delay.

flights_samp500 %>%
    group_by(day_hour) %>%
    summarize(median = median(arr_delay), mean = mean(arr_delay))
## # A tibble: 2 x 3
##   day_hour  median  mean
##   <chr>      <dbl> <dbl>
## 1 afternoon     -1 12.6 
## 2 morning       -7 -2.39

flights_samp500 %>%
    ggplot(aes(x = day_hour, y = arr_delay)) +
    geom_boxplot() +
    theme_minimal()

Based solely on the visual and numerical summaries, do arrival delays appear to be shorter in the morning than in the afternoon?

We don’t know exactly why some flights were scheduled in the morning and others in the afternoon, or why any particular flight was delayed; delays probably result from a complex combination of factors. Let’s imagine instead that a randomization process decided when each flight was scheduled: a coin flip assigning it to the morning or the afternoon.

We want to compare the mean arrival delays in morning flights and in afternoon flights.
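One way to make this comparison concrete is to reduce it to a single number: the mean arrival delay of afternoon flights minus the mean arrival delay of morning flights. The sketch below assumes a small made-up data frame, toy_flights, standing in for flights_samp500; its values are invented for illustration and are not taken from the flights data.

```r
library(dplyr)

# Hypothetical toy data standing in for flights_samp500 (invented values)
toy_flights <- tibble(
    day_hour  = c("morning", "morning", "morning",
                  "afternoon", "afternoon", "afternoon"),
    arr_delay = c(-7, -3, 4, 10, -1, 21)
)

# Mean arrival delay within each group
group_means <- toy_flights %>%
    group_by(day_hour) %>%
    summarize(mean = mean(arr_delay))

# Observed difference in means: afternoon minus morning
obs_diff <- group_means$mean[group_means$day_hour == "afternoon"] -
    group_means$mean[group_means$day_hour == "morning"]
obs_diff
```

Applied to the summary table shown earlier, the same subtraction gives roughly 12.6 - (-2.39), or about 15 minutes, although a different random sample of 500 flights would give a different value.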

If there were no difference in arrival delays between morning and afternoon flights, then it wouldn’t matter whether a flight left in the morning or the afternoon. That is, the day_hour variable would be unrelated to the arrival delay, arr_delay. If that were true, we could reshuffle the values of day_hour among the flights and the difference in group means would change only by chance, so our conclusions would stay the same.
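A single reshuffle might look like the following sketch. It assumes the same kind of made-up toy_flights data frame (invented values, not the real flights data) and uses sample() to permute the day_hour labels while leaving every arr_delay where it is. The seed is arbitrary and only makes the shuffle repeatable.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, used only so the shuffle is repeatable

# Hypothetical toy data standing in for flights_samp500 (invented values)
toy_flights <- tibble(
    day_hour  = c("morning", "morning", "morning",
                  "afternoon", "afternoon", "afternoon"),
    arr_delay = c(-7, -3, 4, 10, -1, 21)
)

# Permute the group labels; the delays themselves are not changed
shuffled <- toy_flights %>%
    mutate(day_hour = sample(day_hour))

# Group means after one random reshuffle
shuffled %>%
    group_by(day_hour) %>%
    summarize(mean = mean(arr_delay))
```

Because sample() with no other arguments only reorders its input, each group keeps its original size; only the assignment of flights to groups changes.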

Wouldn’t it be great if we could see how the mean arrival delays might change if we shuffled the flights between the “morning” group and “afternoon” group, randomly?

In fact, wouldn’t it be great if we could look at every permutation of flights between two groups?
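For a very small data set, every such regrouping can be listed directly. The sketch below uses six invented delays (not taken from the flights data), with three flights assigned to the morning group, and relies on base R’s combn() to enumerate every way of choosing which flights are labeled “morning”. The final line assumes, purely for illustration, that a sample of 500 flights is split evenly into two groups of 250.

```r
# Hypothetical toy delays (invented values), three of which are "morning"
arr_delay <- c(-7, -3, 4, 10, -1, 21)
n_morning <- 3

# Each column holds the row indices of one possible "morning" group
morning_sets <- combn(length(arr_delay), n_morning)

# Difference in means (afternoon minus morning) for every possible grouping
diffs <- apply(morning_sets, 2, function(idx) {
    mean(arr_delay[-idx]) - mean(arr_delay[idx])
})

length(diffs)      # number of distinct groupings of 6 flights into 3 and 3
choose(6, 3)       # the same count, computed directly

# With 500 flights the count is far too large to enumerate
# (assuming an even 250/250 split, which is an illustration only)
choose(500, 250)
```

The count grows so quickly with sample size that listing every grouping is practical only for toy examples like this one.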