AE 05: Battle of the Budget Airlines

Suggested answers

Application exercise

NYC Flights

  • To demonstrate data wrangling we will use flights, a tibble in the nycflights23 R package. Note: this is an updated version of the package we used last time!

  • The data set includes characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2023.

Caution

If you have already loaded nycflights13, nycflights23 might not load properly, since they are two packages that contain data sets with the same names. If this is causing problems, click Session -> Restart R and Clear Outputs.
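If restarting is inconvenient, one possible workaround is to name the package explicitly when you grab the data. This is a minimal sketch that assumes nycflights23 is installed; the guard simply skips the example if it is not.

```r
# Assumption: nycflights23 is installed. If not, nothing below runs.
if (requireNamespace("nycflights23", quietly = TRUE)) {
  # pkg::name always takes the object from that package,
  # even when nycflights13 is attached as well
  flights <- nycflights23::flights

  # Quick sanity check on which year of data we are holding
  print(unique(flights$year))
}
```

With the 2023 data in hand, the check should show only 2023; seeing 2013 would mean the older package's tibble is still the one being used.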

EDA: Battle of the Budget Airlines

There’s a good chance you’ve either experienced or heard about the joys of flying with Spirit and Frontier airlines. They get you where you want to go - no more, no less (well, maybe less) - at a very low cost. But which is worse? Specifically, which of the two airlines has a higher proportion of flights with a delayed arrival time?

Create a data visualization and compute corresponding summary statistics that investigate this question.

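library(tidyverse)
library(nycflights23)

# Carrier codes: F9 = Frontier, NK = Spirit
flights |> 
  filter(carrier %in% c("F9", "NK")) |> 
  filter(!is.na(arr_delay)) |>   # drop flights with no recorded arrival delay
  mutate(is_arr_delay = ifelse(arr_delay > 0, "Delay", "No Delay")) |>
  ggplot(aes(x = carrier, fill = is_arr_delay)) +
  geom_bar(position = "fill")    # bar heights are within-carrier proportions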

flights |> 
  filter(carrier %in% c("F9", "NK")) |> 
  filter(!is.na(arr_delay)) |> 
  mutate(is_arr_delay = ifelse(arr_delay > 0, "Delay", "No Delay")) |>
  count(carrier, is_arr_delay) |>
  group_by(carrier) |>
  mutate(total = sum(n)) |>   # flights per carrier with a recorded arrival delay
  mutate(n = n/total)         # overwrite the count n with the within-carrier proportion
# A tibble: 4 × 4
# Groups:   carrier [2]
  carrier is_arr_delay     n total
  <chr>   <chr>        <dbl> <int>
1 F9      Delay        0.517  1218
2 F9      No Delay     0.483  1218
3 NK      Delay        0.384 14769
4 NK      No Delay     0.616 14769
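Since a proportion is just the mean of a TRUE/FALSE variable, the same kind of summary can also be written more compactly. Here is a rough sketch on a small made-up tibble; the carriers "AA" and "BB", the delay values, and the names toy and toy_props are all hypothetical and have nothing to do with the nycflights23 results above.

```r
library(dplyr)

# Hypothetical toy data, NOT taken from nycflights23
toy <- tibble(
  carrier   = c("AA", "AA", "AA", "AA", "BB", "BB", "BB", "BB"),
  arr_delay = c(12, -3, NA, 40, -5, 0, 7, -10)
)

toy_props <- toy |>
  filter(!is.na(arr_delay)) |>        # drop flights with no recorded arrival delay
  group_by(carrier) |>
  summarise(
    n_flights  = n(),                 # flights with a recorded arrival delay
    prop_delay = mean(arr_delay > 0)  # share of those flights that arrived late
  )

toy_props
```

This gives one row per carrier rather than one row per carrier and delay status, which can be easier to read when you only care about the delayed share.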

Questions:

Can you use this data to make comments about all Spirit and Frontier flights, or just those leaving New York? Is there anything that might cause this comparison to be unfair?
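One way to start probing the fairness question is to check whether the two airlines fly out of the same airports in similar proportions. The sketch below uses a small made-up tibble; the carriers, the rows, and the names toy_routes and toy_origin are hypothetical, and the same pipeline could be pointed at flights instead.

```r
library(dplyr)

# Hypothetical toy data, NOT taken from nycflights23
toy_routes <- tibble(
  carrier = c("AA", "AA", "AA", "AA", "BB", "BB", "BB", "BB"),
  origin  = c("JFK", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "EWR")
)

toy_origin <- toy_routes |>
  count(carrier, origin) |>       # flights per carrier and departure airport
  group_by(carrier) |>
  mutate(share = n / sum(n)) |>   # share of each carrier's flights by airport
  ungroup()

toy_origin
```

If the shares look very different between carriers, differences in delays may partly reflect the airports rather than the airlines.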

More EDA

Maybe you care more about how long the delays are than whether or not there is a delay. If one airline has a lot of delays, but they are all only 5 minutes long, does it really matter?

Now, create a data visualization and compute summary statistics that compare delay times for delayed flights. Which airline is worse by this measure? Has this changed your answer?

flights |> 
  filter(carrier %in% c("F9", "NK")) |> 
  filter(arr_delay > 0) |>   # keep only late arrivals (this also drops NA delays)
  ggplot(aes(x = arr_delay, y = carrier)) +
  geom_boxplot()

flights |> 
  filter(carrier %in% c("F9", "NK")) |> 
  filter(arr_delay > 0) |>   # keep only late arrivals (this also drops NA delays)
  group_by(carrier) |>
  summarise(mean = mean(arr_delay), 
            median = median(arr_delay),
            Q1 = quantile(arr_delay, 0.25),
            Q3 = quantile(arr_delay, 0.75))
# A tibble: 2 × 5
  carrier  mean median    Q1    Q3
  <chr>   <dbl>  <dbl> <dbl> <dbl>
1 F9       67.5     37  12.2    75
2 NK       54.7     27  10      70
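For both airlines the mean delay is well above the median, which is what you would expect if a handful of very long delays pull the mean upward. To see why it helps to report both, here is a small sketch on made-up numbers; the carriers "AA" and "BB", the delay values, and the names toy_delays and toy_summary are hypothetical.

```r
library(dplyr)

# Hypothetical arrival delays (minutes) for flights that arrived late
toy_delays <- tibble(
  carrier   = rep(c("AA", "BB"), each = 5),
  arr_delay = c(5, 10, 15, 20, 300,   # AA: mostly short delays, one extreme one
                40, 45, 50, 55, 60)   # BB: consistently moderate delays
)

toy_summary <- toy_delays |>
  group_by(carrier) |>
  summarise(
    mean   = mean(arr_delay),    # sensitive to the single 300-minute delay
    median = median(arr_delay)   # reflects the typical delayed flight
  )

toy_summary
```

In this toy example "AA" looks worse by the mean but better by the median, so which airline is "worse" can depend on which summary you choose.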