Chapter 7 Statistical Inference

Let’s remember our goal of “turning data into information.” Based on a sample data set, we want to be able to say something about the larger population of interest or the general phenomenon, not just the data we have collected. If our data is representative of the general population or phenomenon, then our sample estimates provide some information, but we also need to account for the uncertainty in those estimates that comes from randomness in the data collection process. Putting our estimates in the context of uncertainty and random variability is called statistical inference. In statistical inference, we use sample data to make statements about “truths” in the larger population.

  • To make causal inferences in the sample, we need to account for all possible confounding variables, or we need to randomize the “treatment” and ensure there are no other possible explanations for an observed effect.
  • To generalize to a larger population, we need the sample to be representative of that population. Ideally, the sample would be randomly drawn from the population. If we actually have a census, in that we have country-, state-, or county-level data on the entire population, then we can consider the observed data as a “snapshot in time”. There are random processes that govern how things behave over time, and we have observed just one period in time.

Let’s do some statistical inference based on a simple random sample (SRS) of 100 flights leaving NYC in 2013.

What is our population of interest? What population could we generalize to?

Based on this sample of 100 flights, we can estimate the difference in mean arrival delay between winter flights and summer flights. The fitted linear regression model suggests that winter flights are, on average, about 0.66 minutes (roughly 40 seconds) less delayed than summer flights. Do you think that is true for all flights?

If we had a different sample of 100 flights, how much different might that estimate be?

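library(dplyr)  # provides the %>% pipe
library(broom)  # provides tidy()

# Fit a linear model of arrival delay on season using the sample of 100 flights
lm.delay <- flights_samp %>%
  with(lm(arr_delay ~ season))

# Summarize the coefficient estimates as a tibble
lm.delay %>% 
  tidy()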
## # A tibble: 2 x 5
##   term         estimate std.error statistic p.value
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)     6.69       4.87    1.37     0.172
## 2 seasonwinter   -0.655      6.82   -0.0960   0.924
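
It may help to see how the columns of this table relate to one another. The sketch below copies the rounded estimates and standard errors from the output above and recomputes the other two columns by hand. It assumes that all 100 sampled flights were used in the fit, so that 100 - 2 = 98 residual degrees of freedom remain; that count is an assumption, not something the output states.

```r
# Values copied (already rounded) from the tidy() output above.
estimate  <- c("(Intercept)" = 6.69, seasonwinter = -0.655)
std_error <- c("(Intercept)" = 4.87, seasonwinter = 6.82)

# The statistic column is each estimate divided by its standard error.
statistic <- estimate / std_error
round(statistic, 3)

# Assumption: all 100 flights were used, leaving 100 - 2 = 98
# residual degrees of freedom for the two-sided p-value.
p_value <- 2 * pt(-abs(statistic), df = 98)
round(p_value, 3)
```

Because the inputs are rounded, the recomputed values should agree with the table only up to rounding. The point to notice is that the winter estimate is small relative to its standard error.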

You had a taste of random variation in Chapter 5 and of the role it plays in the conclusions we can draw. In this chapter, we will formalize two techniques that we use to perform statistical inference: confidence intervals and hypothesis tests. First, we need to formalize the idea of random variation.
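
As a preview, the earlier question of how much the estimate might change with a different sample of 100 flights can be explored by simulation. The following is a minimal sketch in base R, not the analysis used in this chapter. The population is made up (it is not the real NYC flights data), and the mean of 7 minutes, the standard deviation of 35 minutes, and the names `population` and `one_estimate` are all assumptions chosen only for illustration.

```r
set.seed(155)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 10,000 flights (NOT the real NYC data):
# arrival delays in minutes, generated with the same mean in both seasons.
N <- 10000
population <- data.frame(
  season    = factor(rep(c("summer", "winter"), each = N / 2)),
  arr_delay = rnorm(N, mean = 7, sd = 35)
)

# Draw one simple random sample of 100 flights and return the
# estimated winter-minus-summer difference in mean arrival delay.
one_estimate <- function() {
  samp <- population[sample(nrow(population), size = 100), ]
  unname(coef(lm(arr_delay ~ season, data = samp))["seasonwinter"])
}

# Repeat the sampling 1,000 times to see how much the estimate moves.
estimates <- replicate(1000, one_estimate())

sd(estimates)                         # typical sample-to-sample variation
quantile(estimates, c(0.025, 0.975))  # range covering the middle 95%
```

Even though the two seasons were generated with the same mean delay, individual samples of 100 flights can produce estimated differences of several minutes in either direction. That spread across repeated samples is the random variation the rest of the chapter formalizes.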