Machine Learning Pipeline

Published

March 13, 2026

Purpose of This Page

This page provides the code and setup needed to run the machine learning workflow in a repeatable way from start to finish. Its purpose is to help anyone on the team reproduce the analysis, rerun experiments, and understand how the project is organized so results can be generated consistently across different machines. The shared-notebooks are more for exploratory work and prototyping, while this page is meant to be a more polished and reproducible version of the code that can be easily rerun by anyone on the team.

Setups and Pulling Code

Project Setup and Environnement

To run the code on this site, you will need both the project environment and Quarto set up correctly. The Conda environment manages the Python and R dependencies used throughout the project, while Quarto is used to render the notebooks and project pages into a reproducible website format. Some assumptions are made here such as you already have R, Python, git and Conda installed on your machine. If you do not have these installed, please refer to the official documentation for each tool to get them set up before proceeding.

Clone the project

Start by cloning the repository to your local machine and moving into the project directory.

git clone <repo-url>
cd <repo-folder>

Create the Conda environment

This project uses a Conda environment to keep package versions consistent across the team. Build the environment from the provided environment file.

conda env create -f environment.yml

Then activate it:

conda activate or568_ml_project

Run the ML Pipeline

Option 1: Run this notebook in RStudio or Jupyter with R kernel. Make sure to have the Conda environment activated before launching your IDE.

Option 2: Use quarto render path/to/notebook.qmd to render the notebook into HTML. This will execute the code and generate the output in a reproducible way.

Option 3 (not recommended): Render the entire site using Quarto. This will execute all notebooks and generate the HTML pages.

quarto render
quarto preview

Data Pipeline

The pipeline integrates three independent data streams — Bureau of Transportation Statistics (BTS) on-time performance records, NOAA Global Surface Hourly weather observations, and airport reference dimensions — into a single analytical dataset suitable for delay prediction modelling. Figure 1 provides a schematic overview of each stage. The sections below describe each stage in turn.

A flowchart showing seven stages of the flight delay pipeline: data sources, ingestion, cleaning and normalisation, timestamp alignment and join, feature engineering, postprocessing, and output. — Figure 1: Overview of the flight delay data pipeline, from raw data sources through ingestion, cleaning, temporal alignment, feature engineering, and final output.

Running the Data Pipeline

Big Data Warning

The flight dataset is quite large (over 7 million records for a single year). Running the pipeline_main.py to build it on a laptop with 32 GB of RAM crashed. In order to overcome this I had to use my tower desktop machine with 96 GB of RAM to run the data pipeline. If you want to run the data pipeline yourself, I would recommend using a machine with at least 64 GB of RAM, but ideally 96-128 GB or more. Alternatively, you can use cloud computing resources such as AWS EC2 instances with high memory capacity to run the data pipeline without crashing.

If you are interested in generating the original data you can run the data pipeline with the following command:

python3 run_canonical_years.py --year 2019

This assumes you followed the project setups above and are in the data_pipeline directory.

Data Loading and Cleaning

The following setup pulls the data from S3 using the load_flight_data() function defined in shared-notebooks/common_utils/r/utils.r. This function abstracts away the details of connecting to S3 and loading the Parquet file, allowing us to focus on the data processing and modeling steps. Currently a policy is setup in the S3 bucket to allow public read access, so no credentials are needed to load the data. Unfortunately, this will not be the case in the future and will only remain here for the duration of the project. In the Medallion nomenclature this would be considered our bronze level data, which is the raw data that we will be using on our feature engineering pipeline.

If you want to see what is there you can reach the URL provided in the load_flight_data function and enter that into a browser. That will output some XML the name of the files are nested under Contents/Key.

However, if you would like to regenerate the exact data we used you can run our data pipeline code in the data_pipeline directory. See the pipeline_main.py and `run_canonical_years`` as the entry points for the data pipeline for more details.

Big Data Warning

This project uses lazy-loaded data via Apache Arrow. Lazy loading means that when you open a dataset, no data is actually read into memory — instead you get a query object that represents the data. Operations like filter() and select() are pushed down to the data source and only execute when you explicitly call collect(), at which point only the rows and columns you actually need are pulled into memory. This makes working with large datasets fast and memory efficient.

The setup below loads all required packages via load_project_packages(), defined in utils.r, which installs any missing packages and attaches them to the session. Flight data is loaded from the project’s data source and cached locally as a parquet file at data/flights_raw.parquet on first run — subsequent renders skip the 30-40 second load and read directly from the local cache instead. To refresh the data, delete data/flights_raw.parquet and re-render.

For anyone extending this project, I strongly recommend sticking to Arrow in R or switching to Polars in Python rather than pandas. Pandas loads everything into memory eagerly, which becomes a bottleneck quickly on datasets of this size. Arrow and Polars are both built around the same columnar memory format, are significantly faster, and support the same lazy evaluation pattern used here — filter and select on the lazy frame first, then collect only what you need.

source("shared-notebooks/common_utils/r/utils.r") # common utils
load_project_packages()
set.seed(42)

if (!file.exists("data/flights_raw.parquet")) {
  message("No cache found, loading and writing parquet...")
  load_flight_data() %>% 
    collect() %>% 
    janitor::clean_names() %>%
    arrow::write_parquet("data/flights_raw.parquet")
  message("Cache written to data/flights_raw.parquet")
}

flights_raw <- arrow::open_dataset("data/flights_raw.parquet")

Holiday Analysis

# Define 2019 Federal Holidays
us_holidays_2019 <- as.Date(c(
  "2019-01-01", "2019-01-21", "2019-02-18", "2019-05-27",
  "2019-07-04", "2019-09-02", "2019-10-14", "2019-11-11",
  "2019-11-28", "2019-12-25"
))

# Generate travel windows (+/- 3 days around each holiday)
holiday_windows <- map(us_holidays_2019, ~ seq(.x - 3, .x + 3, by = "day")) %>%
  flatten_dbl() %>%
  as.Date(origin = "1970-01-01")

df_final <- flights_raw %>%
  filter(origin == "BWI", 
          dest %in% c("JFK", "EWR"),
          !is.na(dep_delay),
          !is.na(flight_date),
          !is.na(origin),
          !is.na(dest)) %>%
  select(
    flight_date, arr_delay, dep_delay, carrier_delay, weather_delay,
    dep_hour_local, day_of_week, month,
    dep_temp_c, dep_wind_speed_m_s, dep_ceiling_height_m,
    reporting_airline, taxi_out,
    prev_arr_delay, turnaround_minutes, aircraft_leg_number_day,
    crs_elapsed_time, actual_elapsed_time, distance
  ) %>%
  collect() %>%                        # collect before any mutates
  mutate(
    across(c(prev_arr_delay, weather_delay, carrier_delay), ~replace_na(., 0)),
    flight_date_clean = as.Date(as.character(flight_date), format = "%Y-%m-%d")
  ) %>%
  filter(!is.na(arr_delay), !is.na(dep_delay)) %>%
  mutate(
    is_delayed        = as.factor(ifelse(arr_delay >= 15, "Yes", "No")),
    is_weekend        = as.factor(ifelse(day_of_week %in% c(6, 7), "Yes", "No")),
    is_rush_hour      = as.factor(ifelse(dep_hour_local %in% c(7, 8, 16, 17, 18), "Yes", "No")),
    reporting_airline = as.factor(reporting_airline),
    buffer            = crs_elapsed_time - actual_elapsed_time,
    is_holiday        = as.factor(ifelse(as.Date(flight_date_clean) %in% holiday_windows, "Holiday", "Non-Holiday")),
    plane_size_score  = case_when(
      distance > 1500 ~ "Large",
      distance > 500  ~ "Medium",
      TRUE            ~ "Small"
    )
  )

# Output Holiday Frequency Table
holiday_freq <- as.data.frame(table(df_final$is_holiday))
names(holiday_freq) <- c("Holiday Status", "Count")

holiday_freq %>%
  knitr::kable(caption = "Flight Count by Holiday Status") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left"
  )

Flight Count by Holiday Status
Holiday Status	Count
Holiday	290
Non-Holiday	1497

Part 3 — Delays by Holiday Status

Arrival delays are compared inside and outside a ±3-day window around every 2019 US public holiday. Outliers above 100 minutes are clipped to focus on the central distribution. The boxplot below reveals that holiday periods produce a heavier right tail of delays, not merely a higher median — a small share of flights incur very large delays during peak travel windows.

ggplot(df_final, aes(x = is_holiday, y = arr_delay, fill = is_holiday)) +
  geom_boxplot(outlier.shape = NA, width = 0.5, colour = "grey30") +
  coord_cartesian(ylim = c(-20, 100)) +
  scale_fill_manual(values = c("Holiday"     = pal_2[2],
                               "Non-Holiday" = pal_2[1])) +
  labs(
    title    = "The Holiday Tax: BWI–NYC Arrival Delays",
    subtitle = "±3-day windows around all 2019 US public holidays",
    x        = NULL,
    y        = "Arrival delay (minutes)"
  ) +
  theme_paper +
  theme(legend.position = "none")

Figure 2: Distribution of arrival delays for holiday versus non-holiday travel days (BWI–NYC, 2019). Whiskers extend to 1.5× IQR; outliers above 100 min are clipped.

Part 4 — Principal Component Analysis

Before fitting non-linear models, 11 numeric predictors are standardised (zero mean, unit variance) and reduced via PCA. Standardisation is essential because the inputs span different units and scales — mixing minutes, degrees Celsius, and metres per second without rescaling would give disproportionate weight to high-magnitude variables such as distance.

Variable	Description
`dep_delay`	Departure delay (minutes)
`dep_hour_local`	Scheduled departure hour (local time)
`dep_wind_speed_m_s`	Surface wind speed at origin (m/s)
`dep_temp_c`	Temperature at origin (°C)
`dep_ceiling_height_m`	Cloud ceiling at origin (m)
`prev_arr_delay`	Arrival delay of the same aircraft’s prior leg
`turnaround_minutes`	Ground time between inbound arrival and outbound departure
`aircraft_leg_number_day`	Leg sequence number for the aircraft that day
`buffer`	Scheduled elapsed time minus actual elapsed time
`distance`	Great-circle route distance (km)
`taxi_out`	Taxi-out time at origin (minutes)

The scree plot below guides the choice to retain the first five components for modelling, which together capture the majority of variance while discarding noise dimensions.

df_pca_input <- df_final %>%
  select(dep_delay, dep_hour_local, dep_wind_speed_m_s, dep_temp_c,
         dep_ceiling_height_m, prev_arr_delay, turnaround_minutes,
         aircraft_leg_number_day, buffer, distance, taxi_out) %>%
  drop_na() %>%
  scale() %>%
  as.data.frame()

pca_result <- FactoMineR::PCA(df_pca_input, graph = FALSE)

factoextra::fviz_eig(pca_result, addlabels = TRUE,
                     barfill = pal_2[1], barcolor = "white") +
  labs(title = "PCA Scree Plot — Variance Explained per Component",
       y     = "Variance explained (%)") +
  theme_paper

pca_scores   <- as.data.frame(pca_result$ind$coord[, 1:5])
df_pca_model <- bind_cols(
  is_delayed = df_final$is_delayed[1:nrow(pca_scores)],
  pca_scores
)

Figure 3: PCA scree plot. The elbow after component 3–4 supports retaining five components for downstream modelling.

Part 5 — Tree-Based Modelling

Two tree-based classifiers are trained on the original feature space using a 75/25 stratified train–test split. The target variable is_delayed is a binary indicator for arrivals delayed by 15 or more minutes.

set.seed(42)

df_model <- df_final %>%
  select(is_delayed, dep_delay, dep_hour_local, dep_wind_speed_m_s,
         aircraft_leg_number_day, plane_size_score,
         is_holiday, is_weekend) %>%
  drop_na()

train_idx <- createDataPartition(df_model$is_delayed, p = 0.75, list = FALSE)
train_set <- df_model[ train_idx, ]
test_set  <- df_model[-train_idx, ]

Decision tree

The single decision tree provides an interpretable decision path. Each leaf node shows the predicted class and the share of training observations reaching that node.

tree_fit <- rpart(is_delayed ~ ., data = train_set, method = "class")
rpart.plot(tree_fit, extra = 101,
           main        = "Decision Path for BWI–NYC Delays",
           box.palette = c(pal_2[1], pal_2[2]))

Figure 4: CART decision tree. Node labels show predicted class (0 = on-time, 1 = delayed ≥15 min) and the percentage of observations at each split.

Random forest

The random forest aggregates 100 trees and provides a more robust variable importance estimate. MeanDecreaseGini measures how much each predictor reduces node impurity across all splits, averaged over the ensemble.

rf_fit <- randomForest(is_delayed ~ ., data = train_set,
                       ntree = 100, importance = TRUE)
varImpPlot(rf_fit,
           main = "What Truly Drives Your Delays?",
           col  = pal_2[1], pch = 19, cex = 0.9)

Figure 5: Random forest variable importance (Mean Decrease in Gini). Higher values indicate predictors that most consistently improve split purity.

Part 6 — Non-Linear Modelling with PCA Features

Two non-linear classifiers are trained on the five PCA components derived above. Reducing inputs to orthogonal components removes multicollinearity and improves the stability of both SVM and neural network fits.

SVM (RBF kernel): Maps PCA scores into a higher-dimensional space and finds a maximum-margin hyperplane. Cost parameter C = 1 balances margin width against training misclassification.
Neural network (5 hidden neurons): Single-hidden-layer network with L2 weight decay (decay = 0.01) to limit overfitting on the low-dimensional PCA input.

set.seed(42)
pca_idx   <- createDataPartition(df_pca_model$is_delayed, p = 0.75, list = FALSE)
train_pca <- df_pca_model[ pca_idx, ]
test_pca  <- df_pca_model[-pca_idx, ]

svm_fit   <- kernlab::ksvm(is_delayed ~ ., data = train_pca,
                           kernel = "rbfdot", C = 1)
svm_preds <- predict(svm_fit, test_pca)

nnet_fit   <- nnet::nnet(is_delayed ~ ., data = train_pca,
                         size = 5, decay = 0.01, trace = FALSE)
nnet_preds <- as.factor(predict(nnet_fit, test_pca, type = "class"))

svm_acc  <- confusionMatrix(svm_preds,  test_pca$is_delayed)$overall["Accuracy"]
nnet_acc <- confusionMatrix(nnet_preds, test_pca$is_delayed)$overall["Accuracy"]

tibble(
  Model    = c("SVM (RBF kernel)", "Neural network (5 hidden neurons)"),
  Accuracy = round(c(svm_acc, nnet_acc), 4)
) %>%
  kable(align = c("l", "r")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 1: Classification accuracy of SVM and neural network models on the held-out 25% test set.

Model	Accuracy
SVM (RBF kernel)	0.8213
Neural network (5 hidden neurons)	0.8144

Part 7 — Holiday Category Tree Model

Holiday status is refined from a binary flag into five named categories to give tree-based models finer-grained temporal context.

df_final <- df_final %>%
  mutate(
    is_holiday = factor(case_when(
      flight_date_clean >= (as.Date("2019-11-28") - 3) &
        flight_date_clean <= (as.Date("2019-11-28") + 3) ~ "Thanksgiving",
      flight_date_clean >= (as.Date("2019-12-25") - 3) &
        flight_date_clean <= (as.Date("2019-12-25") + 3) ~ "Christmas",
      flight_date_clean >= (as.Date("2019-07-04") - 3) &
        flight_date_clean <= (as.Date("2019-07-04") + 3) ~ "July_4th",
      flight_date_clean %in% holiday_windows             ~ "Other_Holiday",
      .default                                           = "Non-Holiday"
    ), levels = c("Non-Holiday", "Thanksgiving", "Christmas",
                  "July_4th", "Other_Holiday"))
  )

table(df_final$is_holiday) %>%
  as.data.frame() %>%
  rename(`Holiday period` = Var1, `Flight count` = Freq) %>%
  arrange(desc(`Flight count`)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 2: Distribution of flights across named holiday categories after re-coding.

Holiday period	Flight count
Non-Holiday	1497
Other_Holiday	214
Thanksgiving	30
July_4th	25
Christmas	21

Part 8 — Hub and Airline Analysis

This section examines scheduling behaviour across the full 2019 dataset, exploring how airlines build buffer time into published block hours and whether that strategy differs between weekdays and weekends.

Top destination hubs

hub_counts <- flights_raw %>%
  select(dest, reporting_airline, taxi_in) %>%
  collect() %>%
  group_by(dest) %>%
  summarise(
    total_flights   = n(),
    unique_airlines = n_distinct(reporting_airline),
    avg_taxi_in     = mean(taxi_in, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_flights))

hub_counts %>%
  head(10) %>%
  kable(col.names = c("Destination", "Total flights",
                      "Unique airlines", "Avg taxi-in (min)"),
        digits = 1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 3: Top 10 destination airports by total inbound flight volume (2019).

Destination	Total flights	Unique airlines	Avg taxi-in (min)
ATL	395026	15	8.9
ORD	339569	13	14.1
DFW	304346	14	11.8
DEN	252064	12	9.1
CLT	235490	14	12.5
LAX	219996	11	11.0
IAH	179682	14	8.8
PHX	175343	11	7.5
LGA	171665	14	9.0
SFO	170966	9	8.6

ggplot(head(hub_counts, 10),
       aes(x = reorder(dest, total_flights), y = total_flights)) +
  geom_col(fill = pal_2[1]) +
  coord_flip() +
  labs(title    = "Top 10 Destination Hubs — 2019",
       subtitle = "Total inbound flight arrivals",
       x = "Airport", y = "Total flights") +
  theme_paper

Figure 6: Top 10 destination airports by total inbound flight volume (2019).

Top destinations from BWI

hub_data <- flights_raw %>%
  filter(origin == "BWI") %>%
  select(dest) %>%
  collect() %>%
  count(dest, sort = TRUE) %>%
  slice_max(n, n = 10)

ggplot(hub_data, aes(x = reorder(dest, n), y = n)) +
  geom_col(fill = pal_2[1]) +
  coord_flip() +
  labs(title    = "Top 10 Destinations from BWI (2019)",
       subtitle = "Based on total departure frequency",
       x = "Destination", y = "Total flights") +
  theme_paper

Figure 7: Top 10 destination airports from BWI by departure frequency (2019).

Airline scheduling honesty

The scheduling buffer — the difference between published block time (crs_elapsed_time) and actual flight time (actual_elapsed_time) — measures how much padding carriers build into their schedules. A large buffer can absorb operational variability but may also mask chronic performance problems.

honesty_check <- df_final %>%
  mutate(buffer = crs_elapsed_time - actual_elapsed_time) %>%
  group_by(reporting_airline) %>%
  summarise(
    avg_padding_mins  = mean(buffer,    na.rm = TRUE),
    avg_arrival_delay = mean(arr_delay, na.rm = TRUE),
    flight_count      = n()
  ) %>%
  arrange(desc(avg_padding_mins))

honesty_check %>%
  kable(col.names = c("Airline", "Avg padding (min)",
                      "Avg arrival delay (min)", "Flights"),
        digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 4: Airline scheduling buffer versus average arrival delay. High padding alongside high average delay suggests the buffer is insufficient to absorb disruption.

Airline	Avg padding (min)	Avg arrival delay (min)	Flights
9E	6.49	2.70	1363
WN	5.24	22.97	59
MQ	4.46	-0.21	365

all_airline_buffers <- flights_raw %>%
  filter(!is.na(crs_elapsed_time), !is.na(actual_elapsed_time)) %>%
  select(reporting_airline, crs_elapsed_time, actual_elapsed_time, arr_delay) %>%
  collect() %>%
  mutate(buffer_mins = crs_elapsed_time - actual_elapsed_time) %>%
  group_by(reporting_airline) %>%
  summarise(
    avg_buffer_mins      = mean(buffer_mins, na.rm = TRUE),
    median_buffer        = median(buffer_mins, na.rm = TRUE),
    flight_count         = n(),
    on_time_arrival_rate = mean(arr_delay <= 0, na.rm = TRUE) * 100
  ) %>%
  filter(flight_count > 100) %>%
  arrange(desc(avg_buffer_mins))

ggplot(all_airline_buffers,
       aes(x = reorder(reporting_airline, avg_buffer_mins),
           y = avg_buffer_mins)) +
  geom_segment(aes(xend = reporting_airline, yend = 0), colour = "grey70") +
  geom_point(size = 4, colour = pal_2[1]) +
  coord_flip() +
  labs(title    = "Airlines: Scheduled vs Actual Block Time",
       subtitle = "Average buffer (minutes) — positive = padded schedule",
       x = "Reporting airline", y = "Average buffer (minutes)") +
  theme_paper

Figure 8: Average scheduling buffer per airline. Points to the right of zero indicate carriers that routinely pad schedules beyond actual flight time.

Weekday vs weekend scheduling strategy

weekend_comparison <- df_final %>%
  mutate(
    buffer_mins = crs_elapsed_time - actual_elapsed_time,
    day_type    = ifelse(day_of_week %in% c(6, 7), "Weekend", "Weekday")
  ) %>%
  group_by(reporting_airline, day_type) %>%
  summarise(
    avg_buffer    = mean(buffer_mins, na.rm = TRUE),
    avg_arr_delay = mean(arr_delay,   na.rm = TRUE),
    flight_count  = n(),
    .groups = "drop"
  )

ggplot(weekend_comparison,
       aes(x = reporting_airline, y = avg_buffer, fill = day_type)) +
  geom_col(position = "dodge", colour = "white") +
  scale_fill_manual(values = c("Weekday" = pal_2[1], "Weekend" = pal_2[2])) +
  labs(title    = "Scheduling Strategy: Weekdays vs Weekends",
       subtitle = "Average buffer (minutes) by reporting carrier",
       x = "Airline", y = "Avg buffer (minutes)", fill = NULL) +
  theme_paper

Figure 9: Average scheduling buffer on weekdays versus weekends by airline. A larger weekend buffer may reflect anticipated congestion.

Part 9 — Holiday Impact and Delay Propagation

The analysis is expanded to the full flights_raw dataset (all routes, all airlines) with named holiday categories, providing a nationally representative picture of how each holiday window affects delay rates and where that delay originates.

flights_holiday <- flights_raw %>%
  filter(!is.na(arr_delay), !is.na(dep_delay)) %>%
  collect() %>%
  mutate(
    flight_date_clean = as.Date(flight_date),
    is_holiday = factor(case_when(
      flight_date_clean >= (as.Date("2019-11-28") - 3) &
        flight_date_clean <= (as.Date("2019-11-28") + 3) ~ "Thanksgiving",
      flight_date_clean >= (as.Date("2019-12-25") - 3) &
        flight_date_clean <= (as.Date("2019-12-25") + 3) ~ "Christmas",
      flight_date_clean >= (as.Date("2019-07-04") - 3) &
        flight_date_clean <= (as.Date("2019-07-04") + 3) ~ "July_4th",
      flight_date_clean %in% holiday_windows             ~ "Other_Holiday",
      .default                                           = "Non-Holiday"
    ), levels = c("Non-Holiday", "Thanksgiving", "Christmas",
                  "July_4th", "Other_Holiday"))
  )

Delay summary by airline and holiday

holiday_airline_summary <- flights_holiday %>%
  group_by(reporting_airline, is_holiday) %>%
  summarise(
    avg_arr_delay = mean(arr_delay,       na.rm = TRUE),
    avg_dep_delay = mean(dep_delay,       na.rm = TRUE),
    pct_delayed   = mean(arr_delay >= 15, na.rm = TRUE) * 100,
    flight_count  = n(),
    .groups = "drop"
  )

holiday_airline_summary %>%
  kable(digits = 2,
        col.names = c("Airline", "Holiday period", "Avg arr delay (min)",
                      "Avg dep delay (min)", "% Delayed ≥15 min", "Flights")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(5, bold = TRUE)

Table 5: Average arrival and departure delays and percentage of flights delayed ≥15 minutes, by airline and holiday period (all routes, 2019).

Airline	Holiday period	Avg arr delay (min)	Avg dep delay (min)	% Delayed ≥15 min	Flights
9E	Non-Holiday	3.04	10.07	17.29	213815
9E	Thanksgiving	-1.31	6.31	14.77	4481
9E	Christmas	-6.64	5.66	10.81	5015
9E	July_4th	3.78	11.84	14.82	3704
9E	Other_Holiday	4.92	11.27	18.37	30117
AA	Non-Holiday	6.80	11.82	20.58	777018
AA	Thanksgiving	4.33	9.18	19.38	17159
AA	Christmas	3.29	10.60	19.15	17504
AA	July_4th	12.75	18.09	24.77	16899
AA	Other_Holiday	6.79	11.85	20.60	118196
AS	Non-Holiday	1.57	5.10	18.49	216795
AS	Thanksgiving	3.32	7.69	21.27	5008
AS	Christmas	9.00	13.90	27.20	5117
AS	July_4th	-0.79	3.50	14.95	5510
AS	Other_Holiday	-1.45	2.61	15.44	32386
B6	Non-Holiday	10.54	17.12	24.84	242951
B6	Thanksgiving	9.57	13.96	24.39	5929
B6	Christmas	5.56	14.84	19.67	5703
B6	July_4th	16.00	23.64	26.32	5596
B6	Other_Holiday	14.69	20.26	26.15	37232
DL	Non-Holiday	1.02	7.99	14.28	823562
DL	Thanksgiving	2.88	8.07	17.06	17495
DL	Christmas	-0.48	8.83	16.12	16648
DL	July_4th	1.79	7.91	13.71	17276
DL	Other_Holiday	2.20	9.12	14.84	117005
EV	Non-Holiday	14.90	16.51	24.63	110361
EV	Thanksgiving	9.88	13.04	19.74	2665
EV	Christmas	-2.84	4.90	11.84	2315
EV	July_4th	22.83	23.81	27.28	2203
EV	Other_Holiday	15.97	17.20	26.03	17139
F9	Non-Holiday	9.11	14.83	25.46	110374
F9	Thanksgiving	6.69	12.03	22.53	2819
F9	Christmas	2.35	11.38	22.61	2840
F9	July_4th	11.19	15.61	26.97	2788
F9	Other_Holiday	5.86	11.83	22.66	16722
G4	Non-Holiday	7.33	9.56	19.66	85777
G4	Thanksgiving	17.00	16.92	29.22	2365
G4	Christmas	13.42	15.23	26.82	2509
G4	July_4th	13.93	15.99	27.02	2694
G4	Other_Holiday	8.06	9.92	21.61	11960
HA	Non-Holiday	0.41	0.99	11.11	68453
HA	Thanksgiving	-3.49	-1.13	7.71	1635
HA	Christmas	1.76	2.71	11.88	1666
HA	July_4th	-0.33	0.94	9.20	1631
HA	Other_Holiday	2.78	3.50	14.23	10506
MQ	Non-Holiday	6.08	8.43	19.63	269440
MQ	Thanksgiving	8.07	9.21	23.36	5647
MQ	Christmas	8.63	15.05	21.55	6032
MQ	July_4th	11.69	14.39	24.57	5837
MQ	Other_Holiday	9.02	10.82	22.23	40051
NK	Non-Holiday	4.84	10.61	18.58	166784
NK	Thanksgiving	-1.03	4.42	14.82	4259
NK	Christmas	7.85	15.46	22.20	4234
NK	July_4th	8.15	14.34	21.24	4260
NK	Other_Holiday	6.09	11.42	17.90	25308
OH	Non-Holiday	6.55	10.44	19.14	237613
OH	Thanksgiving	4.78	8.73	18.97	5223
OH	Christmas	8.39	12.99	21.26	5316
OH	July_4th	14.02	17.08	23.98	5358
OH	Other_Holiday	5.24	9.55	19.09	35794
OO	Non-Holiday	6.73	11.96	18.12	689786
OO	Thanksgiving	13.36	17.45	23.51	14829
OO	Christmas	8.55	18.55	20.35	15768
OO	July_4th	6.47	10.70	17.80	14858
OO	Other_Holiday	7.78	13.22	18.57	101204
UA	Non-Holiday	8.39	13.05	21.32	513934
UA	Thanksgiving	9.22	12.36	23.41	11605
UA	Christmas	0.61	8.16	16.65	11487
UA	July_4th	12.06	16.64	24.63	12123
UA	Other_Holiday	7.26	12.11	19.48	76761
WN	Non-Holiday	2.38	9.77	16.85	1120396
WN	Thanksgiving	5.04	12.42	21.27	24525
WN	Christmas	8.57	16.86	25.83	25585
WN	July_4th	4.61	11.67	18.72	25735
WN	Other_Holiday	1.75	9.32	16.47	167705
YV	Non-Holiday	10.28	13.50	20.49	187292
YV	Thanksgiving	4.91	8.64	16.92	4090
YV	Christmas	13.89	20.24	22.45	4044
YV	July_4th	12.57	15.92	22.08	4203
YV	Other_Holiday	9.00	12.18	20.49	28259
YX	Non-Holiday	4.62	8.33	18.86	271738
YX	Thanksgiving	0.62	5.63	15.31	5808
YX	Christmas	-3.38	5.75	12.14	5998
YX	July_4th	5.87	10.28	17.13	5453
YX	Other_Holiday	5.66	9.03	19.66	40152

Delay heatmap: airline × holiday

ggplot(holiday_airline_summary,
       aes(x = is_holiday, y = reporting_airline, fill = avg_arr_delay)) +
  geom_tile(colour = "white", linewidth = 0.4) +
  geom_text(aes(label = round(avg_arr_delay, 1)),
            size = 3, colour = "white", fontface = "bold") +
  scale_fill_gradient2(low = pal_2[1], mid = "#f7f7f7", high = pal_2[2],
                       midpoint = 0, name = "Avg delay\n(min)") +
  labs(title    = "Holiday Delay Heatmap — Airline × Holiday Period",
       subtitle = "Average arrival delay (min); red = worse, blue = better",
       x = "Holiday period", y = "Airline") +
  theme_paper +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

Figure 10: Average arrival delay (minutes) by airline and holiday period. Red cells indicate worse-than-average performance; blue cells indicate better performance.

Share of flights delayed ≥15 minutes

holiday_summary_overall <- flights_holiday %>%
  group_by(is_holiday) %>%
  summarise(
    avg_arr_delay = mean(arr_delay,       na.rm = TRUE),
    pct_delayed   = mean(arr_delay >= 15, na.rm = TRUE) * 100,
    flight_count  = n(),
    .groups = "drop"
  )

ggplot(holiday_summary_overall,
       aes(x = is_holiday, y = pct_delayed, fill = is_holiday)) +
  geom_col(show.legend = FALSE, width = 0.65) +
  geom_text(aes(label = paste0(round(pct_delayed, 1), "%")),
            vjust = -0.5, fontface = "bold", size = 3.5) +
  scale_fill_manual(values = pal_hol) +
  labs(title    = "Share of Flights Delayed ≥15 Minutes by Holiday Window",
       subtitle = "All airlines, all routes — 2019",
       x = NULL, y = "% of flights delayed") +
  theme_paper

Figure 11: Percentage of all flights delayed ≥15 minutes during each holiday window versus non-holiday days (all airlines, all routes, 2019).

Decomposing delay: inherited vs newly generated

A key analytical distinction is whether a flight’s departure delay was inherited from the previous leg flown by the same aircraft — propagated via tail number through prev_arr_delay — or freshly generated on the current leg. Understanding this split is important for the predictive model: inherited delay is knowable before departure and is captured by prev_arr_late_15 and rotation_continuity_flag, whereas newly generated delay must be modelled from operational and weather conditions.

propagation_df <- flights_holiday %>%
  filter(!is.na(prev_arr_delay)) %>%
  mutate(
    inherited_delay = pmax(prev_arr_delay, 0),
    new_delay       = pmax(dep_delay - prev_arr_delay, 0),
    total_dep_delay = inherited_delay + new_delay
  )

propagation_summary <- propagation_df %>%
  group_by(is_holiday) %>%
  summarise(
    avg_inherited = mean(inherited_delay,  na.rm = TRUE),
    avg_new       = mean(new_delay,        na.rm = TRUE),
    avg_total     = mean(total_dep_delay,  na.rm = TRUE),
    pct_inherited = avg_inherited / avg_total * 100,
    flight_count  = n(),
    .groups = "drop"
  )

propagation_summary %>%
  kable(digits = 2,
        col.names = c("Holiday period", "Avg inherited (min)",
                      "Avg new delay (min)", "Avg total dep delay (min)",
                      "% Inherited", "Flights")) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  column_spec(5, bold = TRUE, color = "white",
            background = spec_color(propagation_summary$pct_inherited,
                                    option = "C", direction = -1))

Table 6: Decomposition of average departure delay into inherited (propagated from the prior leg via tail rotation) and newly generated components, by holiday period.

Holiday period	Avg inherited (min)	Avg new delay (min)	Avg total dep delay (min)	% Inherited	Flights
Non-Holiday	13.80	14.04	27.84	49.57	6102339
Thanksgiving	13.23	14.35	27.57	47.96	135491
Christmas	14.07	16.16	30.23	46.56	137727
July_4th	16.44	16.29	32.73	50.23	136053
Other_Holiday	14.06	14.28	28.34	49.63	904535

propagation_summary %>%
  select(is_holiday, avg_inherited, avg_new) %>%
  pivot_longer(cols = c(avg_inherited, avg_new),
               names_to  = "delay_type",
               values_to = "minutes") %>%
  mutate(delay_type = case_match(delay_type,
    "avg_inherited" ~ "Inherited from prev. flight",
    "avg_new"       ~ "Newly generated this leg"
  )) %>%
  ggplot(aes(x = is_holiday, y = minutes, fill = delay_type)) +
  geom_col(position = "stack", colour = "white", linewidth = 0.3) +
  geom_text(aes(label = round(minutes, 1)),
            position = position_stack(vjust = 0.5),
            colour = "white", fontface = "bold", size = 3.5) +
  scale_fill_manual(values = c(
    "Inherited from prev. flight" = pal_2[2],
    "Newly generated this leg"    = pal_2[1]
  )) +
  labs(title    = "Where Does Holiday Delay Come From?",
       subtitle = "Decomposing departure delay: inherited vs newly generated — all airlines 2019",
       x = NULL, y = "Average delay (minutes)", fill = NULL) +
  theme_paper

Figure 12: Average departure delay decomposed into the inherited component (coral; carried from prior leg via tail rotation) and the newly generated component (blue; created on the current leg).

Airlines as absorbers vs amplifiers of delay

Each airline–holiday combination is classified by whether the carrier absorbs inherited disruption (points below the diagonal: newly generated delay < inherited delay) or amplifies it (points above: newly generated delay > inherited delay).

propagation_df %>%
  group_by(reporting_airline, is_holiday) %>%
  summarise(
    avg_inherited = mean(inherited_delay, na.rm = TRUE),
    avg_new       = mean(new_delay,       na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ggplot(aes(x = avg_inherited, y = avg_new,
             colour = is_holiday, label = reporting_airline)) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "grey50") +
  geom_point(size = 3, alpha = 0.85) +
  geom_text_repel(size = 2.8, max.overlaps = 15) +
  scale_colour_manual(values = pal_hol) +
  labs(title    = "Airlines: Absorbers vs Amplifiers of Delay",
       subtitle = "Points above the diagonal → airline adds more delay than it inherited",
       x      = "Avg inherited delay (min)",
       y      = "Avg newly generated delay (min)",
       colour = "Holiday period") +
  theme_paper

Figure 13: Each point represents one airline during one holiday period. Points above the dashed diagonal indicate the airline generated more new delay than it inherited, effectively amplifying disruption.

Training Deep Learning Models

Setups

Hardware optimizations

The pipeline was optimized across three layers of the hardware stack. First, CPU threading was configured so that all available cores are used across NumPy, scikit-learn, and TensorFlow operations rather than defaulting to single-threaded execution. These settings have to be applied before any libraries are loaded, which is why they appear at the very top of the script. Second, XGBoost was configured to run on the GPU in addition to using all CPU cores, which speeds up tree construction particularly for the larger feature sets. Third, TensorFlow’s LSTM training was directed to the GPU, where matrix operations map efficiently to CUDA cores and result in significantly faster epoch times compared to CPU. GPU memory allocation was also set to grow on demand rather than reserving all available VRAM upfront, which keeps the setup stable in a Jupyter environment where the kernel persists between runs.

CPU with 24 cores and ~96 GB RAM was more performant than 8 GB VRAM on GPU. Ideally for training like this we would like to see 40-80 VRAM acording to….This is were the cost-benefit of training larger models comes into play. We could got out to AWS and purchase 10 min or so of GPU. If we assume usage cost is $10 per hour this could add up if we have to train many models over long periods of time. Purchasing a 40-80 GB VRAM GPU is way out of the question where cost range in the 10s of thousands of dollars. Alas, we will stick with using CPU and RAM and get over the fact that training takes 30 min to 1-2 hours depending on if we are running CV or hyper parameter tuning.

Fortuantly running the model only takes mili seconds.

Doing this significantly improved training time. For XGBoost went from 162.1 to 44.78 That’s a reduction of 117.32 seconds — roughly 3.6x faster.

Note on GPU vs CPU usage

GPU - 23 min CPU - 32 min

GPU config and hard hardware

CPU config and hard hardware

full train-and-evaluate pipeline

builds the LSTM input tensors scales the inputs trains a multi-output LSTM predicts on the test set evaluates both classification and regression performance returns everything needed for later comparison

def train_lstm_variant(train_df, test_df, variant_name, current_variant="delay_only", units1=32, units2=16, epochs=10):
    t_variant = time.perf_counter()

    X_train, y_train_cls, y_train_reg, train_pdf = build_lstm_step_matrix(train_df, current_variant=current_variant)
    X_test, y_test_cls, y_test_reg, test_pdf = build_lstm_step_matrix(test_df, current_variant=current_variant)

    X_train_scaled, X_test_scaled, scaler = scale_lstm(X_train, X_test)

    y_train_cls = np.asarray(y_train_cls, dtype=np.float32)
    y_train_reg = np.asarray(y_train_reg, dtype=np.float32)
    y_test_cls = np.asarray(y_test_cls, dtype=np.float32)
    y_test_reg = np.asarray(y_test_reg, dtype=np.float32)

    model = build_lstm_model(
        input_shape=(X_train_scaled.shape[1], X_train_scaled.shape[2]),
        units1=units1,
        units2=units2,
        dropout=0.2,
    )

    early_stop = EarlyStopping(
        monitor="val_loss",
        patience=3,
        restore_best_weights=True
    )

    history = model.fit(
        X_train_scaled,
        {"cls": y_train_cls, "reg": y_train_reg},
        validation_split=0.15,
        epochs=epochs,
        batch_size=256,         # increase if you have RAM headroom (e.g. 512 or 1024)
        callbacks=[early_stop],
        verbose=1,
    )

    pred_cls, pred_reg = model.predict(X_test_scaled, verbose=0)
    pred_cls = pred_cls.ravel()
    pred_reg = pred_reg.ravel()

    cls_metrics = classification_metrics(y_test_cls.astype(int), pred_cls)
    reg_metrics = regression_metrics(y_test_reg, pred_reg)

    timer_log(f"LSTM {variant_name}", t_variant)

    return {
        "variant_name": variant_name,
        "model": model,
        "history": history,
        "X_test": X_test_scaled,
        "test_pdf": test_pdf,
        "y_test_cls": y_test_cls,
        "y_test_reg": y_test_reg,
        "pred_cls": pred_cls,
        "pred_reg": pred_reg,
        "cls_metrics": cls_metrics,
        "reg_metrics": reg_metrics,
        "scaler": scaler,
    }