3.2 An Elastic Multiarchitecture Cloud-based High Performance Computing Environment for the Global Forecast System

Monday, 29 January 2024: 2:00 PM
324 (The Baltimore Convention Center)
Stefan F. Cecelski, PhD, Maxar, Westminster, CO; and C. Cassidy, N. Lucas, B. Summa, R. Haas, and T. Hartman

Handout (1.1 MB)

Since June 2019, Maxar has run a production version of the Finite-Volume Cubed-Sphere Global Forecast System (FV3GFS) numerical weather prediction (NWP) model using cloud-based high performance computing (HPC) on Amazon Web Services (AWS). As the FV3GFS model has evolved and AWS HPC offerings have expanded, Maxar has iteratively improved its production workflows to mitigate traditional HPC bottlenecks while increasing the elasticity and resiliency of the environment. This presentation discusses the evolution of this production cloud-based HPC environment, focusing on the advancements that helped Maxar create a fully automated, elastic, and self-healing cloud HPC environment for the FV3GFS model.

Specifically, Maxar runs both the deterministic and ensemble versions of the FV3GFS model, emulating the model configurations used by the National Centers for Environmental Prediction (NCEP) as closely as possible. These model configurations have changed since 2019, and Maxar has accommodated the changes by adapting the technological and programmatic strategies it uses to allocate and configure cloud-based HPC resources. Most notably, Maxar now leverages over 40,000 cores to run these production workflows, but allocates the related cloud-based cluster resources only for the length of each forecast simulation. Further, for its deterministic FV3GFS cluster alone, Maxar has increased its core count more than threefold since 2019 while also seeing a ~50% reduction in the total wall clock time to run the model. With the latest available technologies, this wall clock time reduction comes not with increased costs but with significant cost savings. Maxar will highlight the cloud-based strategies that help minimize the risk and cost of large, “bursty” NWP HPC workloads within cloud providers, while also demonstrating that cloud-based production HPC environments are resilient and self-healing.
