Workflow Introduction & Setups

Published

March 13, 2026

Purpose of This Page

This page is a walkthrough of the machine learning workflow used in this project. It presents a reproducible, curated pipeline from data ingestion → feature engineering → model training → evaluation.

Designed for reproducibility across machines
Distinct from shared-notebooks, which are used for exploration and prototyping
Subpages show how the workflow is applied to specific analyses

Project Phases

Phase 1 — Baseline (2019 Study)

Built the data pipeline and feature engineering framework
Developed baseline models (linear + tree-based)
Validated that delay prediction is feasible

Focus: Establish a working ML pipeline

Phase 2 — Multi-Year & System Modeling (2015–2025)

Scaled to ~70M+ flights
Introduced propagation-aware features (2-hop aircraft rotation)
Applied time-aware validation and production-style evaluation

Focus: Model delay propagation and system behavior at scale
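
As a rough illustration of what a propagation-aware feature can look like, the sketch below attaches to each flight the arrival delays of the previous two legs flown by the same aircraft. It is a minimal sketch under assumptions: the field names (tail_number, dep_time, arr_delay) and the function two_hop_delays are hypothetical, and the project's actual feature definitions may differ.

```python
from collections import defaultdict


def two_hop_delays(flights):
    """Attach arrival delays of the previous two legs of the same aircraft.

    `flights` is a list of dicts with hypothetical keys:
    tail_number, dep_time (any sortable value), arr_delay (minutes).
    Returns new dicts with prev1_arr_delay and prev2_arr_delay added;
    a missing upstream leg is recorded as None.
    """
    # Group legs by aircraft so each rotation can be ordered in time.
    by_tail = defaultdict(list)
    for flight in flights:
        by_tail[flight["tail_number"]].append(flight)

    out = []
    for legs in by_tail.values():
        legs.sort(key=lambda f: f["dep_time"])
        for i, leg in enumerate(legs):
            # Only earlier legs are consulted, never later ones.
            prev1 = legs[i - 1]["arr_delay"] if i >= 1 else None
            prev2 = legs[i - 2]["arr_delay"] if i >= 2 else None
            out.append({**leg, "prev1_arr_delay": prev1, "prev2_arr_delay": prev2})
    return out


# Illustrative records only, not project data.
flights = [
    {"tail_number": "N100", "dep_time": "2019-01-01T08:00", "arr_delay": 25},
    {"tail_number": "N100", "dep_time": "2019-01-01T11:00", "arr_delay": 40},
    {"tail_number": "N100", "dep_time": "2019-01-01T15:00", "arr_delay": 10},
    {"tail_number": "N200", "dep_time": "2019-01-01T09:00", "arr_delay": 0},
]
features = two_hop_delays(flights)
```

Because each flight only looks backward along its own aircraft's rotation, a feature built this way is compatible with a time-aware train/test split.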

Note on Compute

Full 10-year pipeline requires significant resources
This walkthrough uses a 2019-only version for reproducibility and to demonstrate the process
Same methods, smaller scale

See the README for running full pipelines.

Main Pages

Workflow Overview — Setup, pipeline, execution
Phase 1: 2019 Study — Baseline modeling
Phase 2: Multi-Year Study — Scaled modeling
Shared Notebooks — Exploratory work (codebase only)

Toward an Online Model and Digital Twin

This framework extends beyond prediction into a real-time, system-level model.

Next steps:

- Real-time data ingestion
- Continuous inference
- Forward simulation
- Full digital twin system

Setups and Pulling Code

Project Setup and Environment

To run the code on this site, you need both the project environment and Quarto set up correctly; see the README in the repo for details. The Conda environment manages the Python and R dependencies used throughout the project, while Quarto renders the notebooks and project pages into a reproducible website. This page assumes R, Python, Git, and Conda are already installed on your machine. If they are not, refer to the official documentation for each tool before proceeding.

Step 1: Clone the project

Start by cloning the repository to your local machine and moving into the project directory.

git clone <repo-url>
cd <repo-folder>

Step 2: Create the Conda environment

This project uses a Conda environment to keep package versions consistent across the team. Build the environment from the provided environment file.

conda env create -f environment.yml
conda activate or568_ml_project

Step 3: Run the ML Pipeline

You have three options for running the pipeline:

Option 1 (Recommended): Run the target notebook in RStudio or Jupyter with an R kernel. Activate the Conda environment before launching your IDE.
Option 2: Use quarto render path/to/notebook.qmd to render a specific notebook into HTML. This will execute the code and generate the output in a reproducible way.
Option 3 (Not recommended): Render the entire site using Quarto. This will execute all notebooks and generate all HTML pages.

quarto render
quarto preview

Data Pipeline

The pipeline integrates three independent data streams into a single analytical dataset suitable for delay prediction modeling: Bureau of Transportation Statistics (BTS) on-time performance records, NOAA Global Surface Hourly weather observations, and airport reference dimensions. Figure 1 provides a schematic overview of each stage. The sections below describe each stage in turn.

Figure 1: Overview of the flight delay data pipeline, from raw data sources through ingestion, cleaning, temporal alignment, feature engineering, and final output. The flowchart shows seven stages: data sources, ingestion, cleaning and normalisation, timestamp alignment and join, feature engineering, postprocessing, and output.
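
To make the timestamp alignment and join stage more concrete, here is a minimal sketch of one way to match a flight to an hourly weather observation: floor both timestamps to the hour and look up by airport and hour. The field names (origin, scheduled_dep, airport, observed_at, wind_speed) and helper names are assumptions for illustration; the project's pipeline may align records differently, and a real join would also need to handle time zones and the mapping from weather stations to airports, which are omitted here.

```python
from datetime import datetime


def floor_to_hour(ts):
    """Drop minutes, seconds, and microseconds from a datetime."""
    return ts.replace(minute=0, second=0, microsecond=0)


def join_weather(flights, weather):
    """Left-join flights to weather on (airport, hour of observation)."""
    # Index observations by airport and floored hour for constant-time lookup.
    index = {(w["airport"], floor_to_hour(w["observed_at"])): w for w in weather}

    joined = []
    for flight in flights:
        key = (flight["origin"], floor_to_hour(flight["scheduled_dep"]))
        obs = index.get(key)
        # Flights without a matching observation keep a None placeholder.
        joined.append({**flight, "wind_speed": obs["wind_speed"] if obs else None})
    return joined


# Illustrative records only, not project data.
weather = [
    {"airport": "JFK", "observed_at": datetime(2019, 1, 1, 8, 51), "wind_speed": 12.0},
]
flights = [
    {"flight": "AA100", "origin": "JFK", "scheduled_dep": datetime(2019, 1, 1, 8, 15)},
    {"flight": "DL200", "origin": "ATL", "scheduled_dep": datetime(2019, 1, 1, 8, 15)},
]
joined = join_weather(flights, weather)
```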

Running the Data Pipeline

Big Data Warning

The flight dataset is large (over 7 million records for a single year). Building it with pipeline_main.py crashed on a laptop with 32 GB of RAM, so the data pipeline was run on a tower desktop with 96 GB of RAM instead. If you want to run the data pipeline yourself, use a machine with at least 64 GB of RAM, and ideally 96–128 GB or more. Alternatively, cloud computing resources such as AWS EC2 high-memory instances can be used to run the pipeline.

To generate the original data yourself, run the data pipeline with the following command. It assumes you have completed the project setup above and are in the data_pipeline directory:

python3 run_canonical_years.py --year 2019

Data Loading and Cleaning

The following setup pulls the data from S3 using the load_flight_data() function defined in shared-notebooks/common_utils/r/utils.r. This function abstracts away the details of connecting to S3 and loading the Parquet file, allowing us to focus on the data processing and modeling steps.

S3 Access Note

The S3 bucket currently has a policy that allows public read access, so no credentials are needed to load the data. This access is temporary and will be available only for the duration of the project. To browse available files, open the URL referenced in load_flight_data() in a browser; the file names are listed in the XML output under Contents/Key.

If you want to regenerate the exact data used in this project, see pipeline_main.py and run_canonical_years.py in the data_pipeline directory as the entry points for the data pipeline.
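
The bucket listing described above can also be read programmatically. The sketch below parses an S3 listing document and extracts the Contents/Key entries using only the Python standard library. The sample XML, bucket name, and file names are made up for illustration; the real listing comes from the URL referenced in load_flight_data(), which is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Namespace used by S3 bucket listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_keys(listing_xml):
    """Return the object keys found under Contents/Key in a listing document."""
    root = ET.fromstring(listing_xml)
    return [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]


# Hypothetical listing; real bucket contents will differ.
sample = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents><Key>flights_2019.parquet</Key><Size>123</Size></Contents>
  <Contents><Key>airports.parquet</Key><Size>45</Size></Contents>
</ListBucketResult>"""

keys = list_keys(sample)
```

To use it against the live bucket, fetch the listing text first (for example with urllib.request.urlopen) and pass it to list_keys. Note that S3 returns at most 1,000 keys per listing response, so a large bucket may need several requests.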

Lazy Loading and Memory Efficiency

This project uses lazy-loaded data via Apache Arrow. Lazy loading means that when you open a dataset, no data is actually read into memory — instead you get a query object that represents the data. Operations like filter() and select() are pushed down to the data source and only execute when you explicitly call collect(), at which point only the rows and columns you actually need are pulled into memory. This makes working with large datasets fast and memory efficient.
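
The deferred-execution pattern described above can be illustrated with a toy model in plain Python. This is not how Arrow is implemented: the class LazyQuery, the scan_rows source, and the column names are all hypothetical, and unlike Arrow the toy still walks every row of its source when collected. It only shows that filter() and select() record intent, and that nothing is read until collect() is called.

```python
class LazyQuery:
    """Toy deferred query: records steps and runs them only on collect()."""

    def __init__(self, source, predicates=(), columns=None):
        self._source = source              # callable returning an iterator of row dicts
        self._predicates = tuple(predicates)
        self._columns = columns

    def filter(self, predicate):
        # Return a new query with one more recorded predicate; no data is read.
        return LazyQuery(self._source, self._predicates + (predicate,), self._columns)

    def select(self, *columns):
        # Record which columns to keep; no data is read.
        return LazyQuery(self._source, self._predicates, columns)

    def collect(self):
        # Only now is the source iterated and the recorded steps applied.
        rows = []
        for row in self._source():
            if all(pred(row) for pred in self._predicates):
                cols = self._columns or tuple(row)
                rows.append({c: row[c] for c in cols})
        return rows


reads = []  # records each time the source is actually scanned


def scan_rows():
    reads.append("scan")
    yield {"carrier": "AA", "dep_delay": 30, "origin": "JFK"}
    yield {"carrier": "DL", "dep_delay": 5, "origin": "ATL"}
    yield {"carrier": "AA", "dep_delay": 45, "origin": "ORD"}


query = (
    LazyQuery(scan_rows)
    .filter(lambda r: r["dep_delay"] > 15)
    .select("carrier", "dep_delay")
)
scans_before_collect = len(reads)  # the source has not been touched yet
result = query.collect()
```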

The setup below loads all required packages via load_project_packages(), defined in utils.r, which installs any missing packages and attaches them to the session. Flight data is loaded from the project’s data source and cached locally as a Parquet file at data/flights_raw.parquet on first run — subsequent renders skip the 30–40 second load and read directly from the local cache instead. To refresh the data, delete data/flights_raw.parquet and re-render.
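
The cache-on-first-run behaviour described above follows a common pattern, sketched here in Python. This is an illustration, not the project's R implementation: load_with_cache and slow_fetch are hypothetical names, and the sketch writes JSON rather than Parquet so that it needs nothing beyond the standard library.

```python
import json
import tempfile
from pathlib import Path


def load_with_cache(cache_path, fetch):
    """Return cached data if the cache file exists; otherwise fetch and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Later runs read the local copy and skip the slow fetch.
        return json.loads(cache_path.read_text())
    data = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(data))
    return data


calls = []  # counts how often the slow source is hit


def slow_fetch():
    calls.append(1)
    return [{"flight": "AA100", "dep_delay": 12}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "data" / "flights_raw.json"
    first = load_with_cache(cache, slow_fetch)    # fetches and writes the cache
    second = load_with_cache(cache, slow_fetch)   # served from the cache
    cache.unlink()                                # deleting the file forces a refresh
    third = load_with_cache(cache, slow_fetch)    # fetches again
```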

For anyone extending this project, it is strongly recommended to stick with Arrow in R or switch to Polars in Python rather than pandas. By default, pandas loads an entire dataset into memory eagerly, which quickly becomes a bottleneck on datasets of this size. Arrow and Polars are built around the same columnar memory format, are typically much faster on workloads like this, and support the same lazy evaluation pattern used here: filter and select on the lazy frame first, then collect only what you need.

In the Medallion architecture nomenclature, this raw S3 data is considered our bronze layer — the unprocessed data that feeds into the feature engineering pipeline downstream.
