Workflow Introduction & Setup
Purpose of This Page
This page is a walkthrough of the machine learning workflow used in this project. It presents a curated, reproducible pipeline from data ingestion → feature engineering → model training → evaluation.
- Designed for reproducibility across machines
- Distinct from shared-notebooks, which are used for exploration and prototyping
- Subpages show how the workflow is applied to specific analyses
Project Phases
Phase 1 — Baseline (2019 Study)
- Built the data pipeline and feature engineering framework
- Developed baseline models (linear + tree-based)
- Validated that delay prediction is feasible
Focus: Establish a working ML pipeline
Phase 2 — Multi-Year & System Modeling (2015–2025)
- Scaled to ~70M+ flights
- Introduced propagation-aware features (2-hop aircraft rotation)
- Applied time-aware validation and production-style evaluation
Focus: Model delay propagation and system behavior at scale
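Time-aware validation generally means that a model is evaluated only on flights that occur later than the flights it was trained on, so no future information leaks into training. The following is a minimal sketch in plain Python under stated assumptions: the record layout (a flight_date field), the cutoff date, and the function name time_aware_split are all hypothetical, and the project's actual split logic may differ.

```python
from datetime import date

def time_aware_split(records, cutoff):
    """Split records so every validation row is later than every training row.

    `records` is a list of dicts with a `flight_date` key (hypothetical layout).
    """
    train = [r for r in records if r["flight_date"] < cutoff]
    valid = [r for r in records if r["flight_date"] >= cutoff]
    return train, valid

# Hypothetical toy data, for illustration only.
flights = [
    {"flight_date": date(2019, 3, 1), "dep_delay": 12},
    {"flight_date": date(2019, 11, 20), "dep_delay": 45},
    {"flight_date": date(2019, 7, 4), "dep_delay": 0},
]

train, valid = time_aware_split(flights, cutoff=date(2019, 10, 1))
```

A random shuffle would instead mix early and late flights across both sets, which tends to overstate performance on data with temporal structure.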
Note on Compute
- Full 10-year pipeline requires significant resources
- This walkthrough uses a 2019-only version for reproducibility and to demonstrate the process
- Same methods, smaller scale
See the README for running full pipelines.
Main Pages
- Workflow Overview — Setup, pipeline, execution
- Phase 1: 2019 Study — Baseline modeling
- Phase 2: Multi-Year Study — Scaled modeling
- Shared Notebooks — Exploratory work (codebase only)
Toward an Online Model and Digital Twin
This framework extends beyond prediction into a real-time, system-level model.
Next steps:
- Real-time data ingestion
- Continuous inference
- Forward simulation
- Full digital twin system
Setup and Pulling Code
Project Setup and Environment
To run the code on this site, you need both the project environment and Quarto set up correctly; see the README in the repo for details. The Conda environment manages the Python and R dependencies used throughout the project, while Quarto renders the notebooks and project pages into a reproducible website. This guide assumes that R, Python, git, and Conda are already installed on your machine. If they are not, refer to the official documentation for each tool before proceeding.
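One quick way to confirm the prerequisites before starting is to check that each tool is on your PATH. The sketch below is not part of the project; the helper name find_missing_tools and the exact command names are assumptions (for example, the Python executable may be python rather than python3 on some systems).

```python
import shutil

# Command names are assumptions; adjust them for your platform.
REQUIRED_TOOLS = ["git", "conda", "python3", "Rscript", "quarto"]

def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

This only checks that the commands exist, not that their versions match what the project expects.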
Step 1: Clone the project
Start by cloning the repository to your local machine and moving into the project directory.
git clone <repo-url>
cd <repo-folder>
Step 2: Create the Conda environment
This project uses a Conda environment to keep package versions consistent across the team. Build the environment from the provided environment file.
conda env create -f environment.yml
conda activate or568_ml_project
Step 3: Run the ML Pipeline
You have three options for running the pipeline:
- Option 1 (Recommended): Run the target notebook in RStudio or Jupyter with an R kernel. Make sure to have the Conda environment activated before launching your IDE.
- Option 2: Use quarto render path/to/notebook.qmd to render a specific notebook into HTML. This will execute the code and generate the output in a reproducible way.
- Option 3 (Not recommended): Render the entire site using Quarto. This will execute all notebooks and generate all HTML pages.
quarto render
quarto preview
Data Pipeline
The pipeline integrates three independent data streams (Bureau of Transportation Statistics (BTS) on-time performance records, NOAA Global Surface Hourly weather observations, and airport reference dimensions) into a single analytical dataset suitable for delay prediction modeling. Figure 1 provides a schematic overview of each stage. The sections below describe each stage in turn.
Running the Data Pipeline
The flight dataset is large: over 7 million records for a single year. Building it with pipeline_main.py crashed on a laptop with 32 GB of RAM, so a tower desktop with 96 GB of RAM was used to run the data pipeline instead. If you want to run the data pipeline yourself, use a machine with at least 64 GB of RAM, and ideally 96–128 GB or more. Alternatively, you can use cloud computing resources such as AWS EC2 high-memory instances.
If you are interested in generating the original data, you can run the data pipeline with the following command (assumes you followed the project setup above and are in the data_pipeline directory):
python3 run_canonical_years.py --year 2019
Data Loading and Cleaning
The following setup pulls the data from S3 using the load_flight_data() function defined in shared-notebooks/common_utils/r/utils.r. This function abstracts away the details of connecting to S3 and loading the Parquet file, allowing us to focus on the data processing and modeling steps.
A policy is currently set up in the S3 bucket to allow public read access, so no credentials are needed to load the data. This will not remain the case indefinitely and will only be available for the duration of the project. To browse available files, navigate to the URL referenced in load_flight_data() in a browser — the file names are listed in the XML output under Contents/Key.
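The XML listing mentioned above can also be read programmatically. The sketch below parses an S3 bucket listing and extracts each Contents/Key entry using only the standard library. The sample XML, bucket name, and file names are made up for illustration and are not the project's actual files; list_keys is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# XML namespace used by S3 bucket listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

def list_keys(listing_xml):
    """Return the object keys found under Contents/Key in an S3 listing."""
    root = ET.fromstring(listing_xml)
    return [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]

# Made-up sample listing; a real one comes from the bucket URL.
SAMPLE_LISTING = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents><Key>example/flights_2019.parquet</Key></Contents>
  <Contents><Key>example/airports.parquet</Key></Contents>
</ListBucketResult>"""

keys = list_keys(SAMPLE_LISTING)
print(keys)
```

For a live bucket, the response body fetched from the bucket URL (for example with urllib.request) can be passed to list_keys directly, since ET.fromstring accepts bytes as well as text.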
If you want to regenerate the exact data used in this project, see pipeline_main.py and run_canonical_years.py in the data_pipeline directory as the entry points for the data pipeline.
This project uses lazy-loaded data via Apache Arrow. Lazy loading means that when you open a dataset, no data is actually read into memory — instead you get a query object that represents the data. Operations like filter() and select() are pushed down to the data source and only execute when you explicitly call collect(), at which point only the rows and columns you actually need are pulled into memory. This makes working with large datasets fast and memory efficient.
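To make the deferred-execution idea concrete, here is a toy sketch in plain Python. It is not Arrow's API or implementation: LazyQuery and read_rows are hypothetical names, and the data is invented. The point is only that filter() and select() record what to do, while the data source is not touched until collect() is called.

```python
class LazyQuery:
    """Toy lazy query: records operations and runs them only on collect()."""

    def __init__(self, read_rows, predicates=(), columns=None):
        self._read_rows = read_rows      # zero-argument callable yielding dict rows
        self._predicates = tuple(predicates)
        self._columns = columns

    def filter(self, predicate):
        # Record the predicate; no data is read here.
        return LazyQuery(self._read_rows, self._predicates + (predicate,), self._columns)

    def select(self, *columns):
        # Record the projection; no data is read here.
        return LazyQuery(self._read_rows, self._predicates, columns)

    def collect(self):
        # Only now is the source read, keeping matching rows and chosen columns.
        rows = []
        for row in self._read_rows():
            if all(pred(row) for pred in self._predicates):
                if self._columns is not None:
                    row = {c: row[c] for c in self._columns}
                rows.append(row)
        return rows

stats = {"reads": 0}  # counts how many times the source is actually read

def read_rows():
    stats["reads"] += 1
    yield {"carrier": "AA", "dep_delay": 35, "origin": "DCA"}
    yield {"carrier": "DL", "dep_delay": 5, "origin": "ATL"}
    yield {"carrier": "AA", "dep_delay": 0, "origin": "ORD"}

query = LazyQuery(read_rows).filter(lambda r: r["dep_delay"] > 15).select("carrier", "dep_delay")
result = query.collect()  # the source is read once, here
```

Unlike this toy, which still iterates over every source row at collect time, Arrow can push filters and column selections down into the Parquet reader so that data that is not needed is never loaded.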
The setup below loads all required packages via load_project_packages(), defined in utils.r, which installs any missing packages and attaches them to the session. Flight data is loaded from the project’s data source and cached locally as a Parquet file at data/flights_raw.parquet on first run — subsequent renders skip the 30–40 second load and read directly from the local cache instead. To refresh the data, delete data/flights_raw.parquet and re-render.
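The cache-on-first-run behavior described above follows a common pattern: check for the local file, fetch only if it is missing, and otherwise read from disk. The project implements this in its R setup; the Python sketch below only illustrates the pattern. load_with_cache and fake_fetch are hypothetical, and the bytes written are a placeholder, not real Parquet data.

```python
import tempfile
from pathlib import Path

def load_with_cache(cache_path, fetch_bytes):
    """Return the cached bytes, calling fetch_bytes() only when the cache is missing."""
    cache_path = Path(cache_path)
    if not cache_path.exists():
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_bytes(fetch_bytes())  # slow path: first run only
    return cache_path.read_bytes()             # fast path: local read

fetch_calls = []

def fake_fetch():
    # Stand-in for the slow remote load; records each call.
    fetch_calls.append(1)
    return b"placeholder bytes"

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "data" / "flights_raw.parquet"
    first = load_with_cache(cache_file, fake_fetch)   # fetches and writes the cache
    second = load_with_cache(cache_file, fake_fetch)  # served from the cache
    cache_file.unlink()                               # deleting the file forces a refresh
    third = load_with_cache(cache_file, fake_fetch)   # fetches again
```

The trade-off is that a stale cache is never detected automatically, which is why deleting the file is the documented way to refresh.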
For anyone extending this project, it is strongly recommended to stick with Arrow in R or switch to Polars in Python rather than pandas. Pandas loads everything into memory eagerly, which becomes a bottleneck quickly on datasets of this size. Arrow and Polars are both built around the same columnar memory format, are significantly faster, and support the same lazy evaluation pattern used here — filter and select on the lazy frame first, then collect only what you need.
In the Medallion architecture nomenclature, this raw S3 data is considered our bronze layer — the unprocessed data that feeds into the feature engineering pipeline downstream.