7.2 Confidence Interval Examples
7.2.1 Proportion Outcome
Let’s return to the flight data and estimate the proportion of afternoon flights based on a sample of 100 flights from NYC.
First, the classical 95% confidence interval can be constructed using the theory of the Binomial Model (Do the 3 conditions hold? Is n large enough for it to look Normal?)
flights_samp %>%
summarize(prop = count(day_hour)/100) %>%
mutate(SE = sqrt(prop*(1-prop)/100)) %>%
mutate(lb = prop - 2*SE, ub = prop + 2*SE)
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 1 x 4
## prop SE lb ub
## <dbl> <dbl> <dbl> <dbl>
## 1 0.53 0.0499 0.430 0.630
Or we could bootstrap and get our confidence interval that way.
alpha <- 0.05
boot_data <- mosaic::do(1000)*(
flights_samp %>% # Start with the SAMPLE (not the FULL POPULATION)
sample_frac(replace = TRUE) %>% # Generate by resampling with replacement
summarize(prop = count(day_hour)/100) # Calculate statistics
)
boot_data %>%
summarize(lower = quantile(prop, alpha/2),
upper = quantile(prop, 1-alpha/2))
## lower upper
## 1 0.44 0.63
Our confidence interval gives a sense of the true proportion of flights departed NYC in the afternoon, keeping in mind that this sample could be one of the unlucky samples (the 5%) that have intervals that don’t contain the true value.
7.2.2 Mean and then Median
Perhaps you really care about the arrival delay time because you have somewhere important you need to be when you take flights out of NYC. Let’s estimate the mean arrival delay based on a sample of 100 flights from NYC.
First off, let’s create a classical confidence interval. Since our sample size is relatively large, we can use the Normal model (instead of William Gosset’s work). We use the sample standard deviation and plug into the SE formula.
flights_samp %>%
summarize(mean = mean(arr_delay), s = sd(arr_delay)) %>%
mutate(SE = s/sqrt(100)) %>%
mutate(lb = mean - 2*SE, ub = mean + 2*SE)
## # A tibble: 1 x 5
## mean s SE lb ub
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.36 33.9 3.39 -0.422 13.1
alpha <- 0.05
boot_data <- mosaic::do(1000)*(
flights_samp %>% # Start with the SAMPLE (not the FULL POPULATION)
sample_frac(replace = TRUE) %>% # Generate by resampling with replacement
summarize(means = mean(arr_delay),medians = median(arr_delay)) # Calculate statistics
)
boot_data %>%
summarize(lower = quantile(means, alpha/2),
upper = quantile(means, 1-alpha/2))
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 0.230 13.4
Our confidence interval gives potential values for of the true mean arrival delay for flights that departed NYC, keeping in mind that this sample could be one of the unlucky samples (the 5%) that have intervals that don’t contain the true value. Also remember that the mean is sensitive to outliers…Let’s consider the median.
We get a slightly different story if we are interested in the middle number versus the average. But notice, we aren’t using a classical CI here because the sampling distribution of a median is not necessarily Normal.
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 -8.5 5.5
7.2.3 Logistic Regression Model
Are the same relative numbers of morning flights in the winter and the summer? Let’s fit a logistic regression model and see what our sample says.
flights_samp$afternoon = flights_samp$day_hour == 'afternoon'
glm.afternoon <- glm(afternoon ~ season, data = flights_samp, family = 'binomial')
summary(glm.afternoon)
##
## Call:
## glm(formula = afternoon ~ season, family = "binomial", data = flights_samp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.332 -1.126 1.030 1.030 1.230
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1226 0.2863 -0.428 0.668
## seasonwinter 0.4793 0.4036 1.188 0.235
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.27 on 99 degrees of freedom
## Residual deviance: 136.85 on 98 degrees of freedom
## AIC: 140.85
##
## Number of Fisher Scoring iterations: 4
The output for the model gives standard errors for the slopes, so we can create the classical confidence intervals directly from the output,
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -0.6905940 0.4389059
## seasonwinter -0.3081988 1.2791414
Or with bootstrapping,
boot_data <- mosaic::do(1000)*(
flights_samp %>% # Start with the SAMPLE (not the FULL POPULATION)
sample_frac(replace = TRUE) %>% # Generate by resampling with replacement
glm(afternoon ~ season, data = ., family = 'binomial') # Calculate statistics
)
boot_data %>%
summarize(lower = quantile(seasonwinter, alpha/2),
upper = quantile(seasonwinter, 1-alpha/2))
## lower upper
## 1 -0.3503645 1.367213
Knowing that the sample is random, the interval estimate for the logistic regression slope is given by the confidence interval. But for logistic regression, we exponentiate the slopes to get an more interpretable value, the odds ratio. Here, we are comparing the odds of having a flight in the afternoon between winter months (numerator) and summer months (denominator). Is 1 in the interval? If so, what does that tell you?
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.5012782 1.551009
## seasonwinter 0.7347692 3.593553
boot_data %>%
summarize(lower = exp(quantile(seasonwinter, alpha/2)),
upper = exp(quantile(seasonwinter, 1-alpha/2)))
## lower upper
## 1 0.7044313 3.924396
7.2.4 Linear Regression Model Slope (Categorical Variable)
Everything says that there are longer delays in winter. Is that actually true? Let’s fit a linear regression model to test it
##
## Call:
## lm(formula = arr_delay ~ season, data = flights_samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.694 -21.694 -8.694 11.547 166.306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6939 4.8687 1.375 0.172
## seasonwinter -0.6547 6.8176 -0.096 0.924
##
## Residual standard error: 34.08 on 98 degrees of freedom
## Multiple R-squared: 9.408e-05, Adjusted R-squared: -0.01011
## F-statistic: 0.009221 on 1 and 98 DF, p-value: 0.9237
The classical CI for the slope is given with by
## 2.5 % 97.5 %
## (Intercept) -2.967923 16.35568
## seasonwinter -14.183889 12.87457
or with bootstrapping,
boot_data <- mosaic::do(1000)*(
flights_samp %>% # Start with the SAMPLE (not the FULL POPULATION)
sample_frac(replace = TRUE) %>% # Generate by resampling with replacement
lm(arr_delay ~ season, data = .) # Calculate statistics
)
boot_data %>%
summarize(lower = quantile(seasonwinter, alpha/2),
upper = quantile(seasonwinter, 1-alpha/2))
## lower upper
## 1 -15.0511 11.794
The 95% confidence interval gives a sense of the difference of mean arrival delays of flights between winter and summer is given by the confidence interval. Is zero in the interval? If so, what does that tell you?
7.2.5 Linear Regression Model Slope (Quantitative Var)
How well can the departure delay predict the arrival delay? What is the effect of departing 1 more minute later? Does that correspond to 1 minute later in arrival on average? Let’s look at the estimated slope between departure and arrival delays for the sample of 100 flights from NYC.
##
## Call:
## lm(formula = arr_delay ~ dep_delay, data = flights_samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.396 -12.397 -2.904 9.588 60.529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.58743 1.85522 -1.395 0.166
## dep_delay 1.00420 0.06173 16.267 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.72 on 98 degrees of freedom
## Multiple R-squared: 0.7297, Adjusted R-squared: 0.727
## F-statistic: 264.6 on 1 and 98 DF, p-value: < 2.2e-16
The classical CI for the slope is given with by
## 2.5 % 97.5 %
## (Intercept) -6.2690537 1.094198
## dep_delay 0.8816967 1.126705
or with bootstrapping,
boot_data <- mosaic::do(1000)*(
flights_samp %>% # Start with the SAMPLE (not the FULL POPULATION)
sample_frac(replace = TRUE) %>% # Generate by resampling with replacement
lm(arr_delay ~ dep_delay, data = .) # Calculate statistics
)
boot_data %>%
summarize(lower = quantile(dep_delay, alpha/2),
upper = quantile(dep_delay, 1-alpha/2))
## lower upper
## 1 0.9173677 1.181704
If the flight leaves an additional minute later, then we’d expect the arrival delay to be increased by a value in the interval above, on average.
7.2.6 Confidence Intervals for Prediction
Imagine we are on a plane, we left 15 minutes late, how late will arrive? Since we only have a sample of 100 flights, we are a bit unsure of our prediction.
A classical CI can give us an interval estimate of what the prediction should be (if we had data on all flights).
## fit lwr upr
## 1 12.47558 8.881202 16.06996
This is taking into account how uncertain we are about our model prediction because our model is based on sample data rather than population data.
7.2.7 Prediction Intervals
We also know that every flight is different (different length, different weather conditions, etc), so the true arrival delay won’t be exactly what we predict.
So to get a better prediction for our arrival delay, we can account for the size of errors or residuals by creating a prediction interval. This interval will be much wider than the confidence interval because it takes into account how far the true values are from the prediction line.
## fit lwr upr
## 1 12.47558 -22.86868 47.81985
7.2.8 Probability Theory vs. Bootstrapping
In the modern age, computing power allows us to perform boostrapping easily to create confidence intervals. Before computing was as powerful as it is today, scientists needed mathematical theory to provide simple formulas for confidence intervals.
If certain assumptions hold, the mathematical theory proves to be just as accurate and less computationally-intensive than bootstrapping. Many scientists using statistics right now learned the theory because when they learned statistics, computers were not powerful enough to handle techniques such as bootstrapping.
Why do we teach both the mathematical theory and bootstrapping? You will encounter both types of techniques in your fields, and you’ll need to have an understanding of what these techniques are to bridge the gap until statistical inference uses modern computational techniques more widely.