Enabling Scalable, Serverless Weather Model Analyses by "Kerchunking" Data in the Cloud

Rothenberg, Daniel

Over the past decade, several initiatives across the weather enterprise have sought to migrate both archives of historical weather forecast model outputs and mirrors of real-time operational products to the commercial cloud. A primary goal of this migration was to simplify and streamline the economics of downstream applications built on these data: access to tera- or petascale model archives, combined with the ability to elastically spin up compute as needed to process them, promised to eliminate the large upfront capital and resources historically needed to develop value-added products or applications. However, many of these efforts have been stymied by challenges both technical (including binary file formats poorly suited to the access patterns that object storage in most commercial clouds typically provides) and practical (limited workforce or capital investment available to "cloud-optimize" datasets).

The "holy grail" of pushing operational weather model outputs onto the cloud remains the distribution of data in ARCO (analysis-ready, cloud-optimized) formats such as Zarr, together with complementary catalogs that allow programmatic discovery and ingestion in downstream applications. Some projects, such as Pangeo Forge, have made significant headway in building turn-key software and infrastructure that can in many cases automate the re-processing of weather and climate data into ARCO formats. But they still require software engineers to customize and run processing workflows, and organizations to cover the cost of running them. Despite strong collaborations between the weather enterprise and large tech companies, there has been limited progress in tapping into such resources, and therefore limited distribution of ARCO datasets.

Recently, novel approaches have emerged that could eliminate the need for re-processing data into ARCO formats yet still empower scalable and cheap analyses using ephemeral resources on the cloud. One particular library, kerchunk, allows users to scan HDF or GRIB files archived in cloud/object storage and create a reference mapping that emulates an ARCO dataset (typically by translating the data into a Zarr-based representation that maps each chunk to the byte-range read needed to extract it from the original dataset). Scanning requires a finite number of eager reads through the target dataset; from then on, users can rely wholly on lazy access to the data using standard cloud retrieval operations, and never need to download the full file. End users can ultimately slice through remote datasets hosted on the cloud much as they would through a THREDDS or OPeNDAP server, but with all the compute and data access conducted client-side instead of on a remote server.
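The reference-mapping idea can be illustrated without any cloud access. The sketch below is not kerchunk's API; it is a minimal standard-library stand-in that assumes a local file of concatenated binary "messages" in place of a GRIB2 object, and hypothetical chunk names such as `u10/0.0`. It scans the file once to record `[path, offset, length]` for each chunk, stores that mapping as JSON, and then reads a single chunk by byte range.

```python
import json
import os
import tempfile


def build_references(path, names, sizes):
    """Eager scan: record [path, offset, length] for every named chunk."""
    refs = {}
    offset = 0
    for name, size in zip(names, sizes):
        refs[name] = [path, offset, size]
        offset += size
    # Layout loosely modeled on a versioned JSON reference file (assumption).
    return {"version": 1, "refs": refs}


def read_chunk(mapping, key):
    """Lazy access: read only the bytes that the reference points at."""
    path, offset, length = mapping["refs"][key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for an archived model file: three concatenated "messages".
messages = {"t2m/0.0": b"TEMPERATURE", "u10/0.0": b"UWIND", "v10/0.0": b"VWIND"}
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"".join(messages.values()))
    archive_path = tmp.name

mapping = build_references(
    archive_path, list(messages), [len(v) for v in messages.values()]
)
reference_json = json.dumps(mapping)  # small text file; the data are not copied

# A client needs only the JSON to fetch one chunk from the original file.
u10_bytes = read_chunk(json.loads(reference_json), "u10/0.0")
os.remove(archive_path)
```

Against object storage, the `seek`/`read` pair would be replaced by a ranged GET request on the original object, which is what lets the reference file stay small while the data remain in place.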

In this presentation, we demonstrate a set of real-world workflows which employ the kerchunk library to pre-emptively and eagerly create reference mappings for real-time operational NOAA forecast datasets (namely the GFS and HRRR) published on Google Cloud Storage (GCS) as part of the Google Cloud Public Datasets program and a collaboration between Google and NOAA. These datasets are published in their original format (mainly GRIB2); no ARCO optimizations are applied. Our workflows lean on an event-based, serverless architecture to automatically prepare the kerchunk'd mappings and derivative products (such as consolidated mappings for an entire forecast epoch) in a highly reliable, cheap fashion. For example, the GFS workflow we present produces a mapping to a large subset of model output fields for less than $1 per day of operational model cycles on Google Cloud Platform, with near-zero overhead storage cost or data duplication (we map to the "official" archive of NOAA data on GCS; the reference files themselves are small JSON files, which we store without further optimization in a public bucket on GCS).
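As a rough sketch of what the event-driven step might look like, the handler below turns a "new object" notification into a small work order, and a second function merges per-file references into one mapping for a model cycle. Everything here is an assumption made for illustration: the object-name pattern, the shape of the `event` dictionary, the `refs/...` output keys, and the prefix-based merge. A real consolidation would also need to combine array metadata along the forecast-time dimension, which this sketch omits.

```python
import re

# Assumed object-name layout for 0.25-degree GFS output (illustrative only).
GFS_PATTERN = re.compile(
    r"gfs\.(?P<date>\d{8})/(?P<cycle>\d{2})/atmos/"
    r"gfs\.t(?P=cycle)z\.pgrb2\.0p25\.f(?P<fhour>\d{3})$"
)


def handle_new_object(event):
    """Return a work order for a newly published object, or None to skip it."""
    match = GFS_PATTERN.search(event["name"])
    if match is None:
        return None  # e.g. index files or other products we do not map
    return {
        "source": f"gs://{event['bucket']}/{event['name']}",
        "cycle": f"{match['date']}T{match['cycle']}Z",
        "forecast_hour": int(match["fhour"]),
        # Hypothetical location for the per-file reference JSON.
        "reference_key": (
            f"refs/{match['date']}/{match['cycle']}/f{match['fhour']}.json"
        ),
    }


def consolidate(per_file_refs):
    """Merge {forecast_hour: refs} into one mapping, prefixed by forecast hour."""
    merged = {}
    for fhour in sorted(per_file_refs):
        for key, ref in per_file_refs[fhour].items():
            merged[f"f{fhour:03d}/{key}"] = ref
    return {"version": 1, "refs": merged}


event = {
    "bucket": "example-bucket",  # placeholder, not the real archive bucket
    "name": "gfs.20230101/00/atmos/gfs.t00z.pgrb2.0p25.f006",
}
order = handle_new_object(event)
```

Because each invocation handles one object and writes one small JSON file, the handler holds no state between events, which is what makes it a natural fit for a pay-per-invocation serverless runtime.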

Furthermore, to showcase the utility of these simple kerchunk reference mappings, we demonstrate an open source model analysis/visualization workflow, "plotflow", which leverages a scalable, serverless architecture to cheaply and quickly produce very large sets of weather model imagery for display in a typical weather analysis website (such as TropicalTidbits or Weathermodels.com). The workflow requires no dedicated server or storage resources; it leans entirely on ephemeral compute and retrieval of operational weather model outputs archived on the cloud using the ARCO-like interface that the kerchunk reference mappings provide. We note, however, that users could trivially run the workflow without modification on nearly any compute infrastructure they have access to, from a personal laptop to a large, dedicated VM in the cloud or a Raspberry Pi on their bookshelf.
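The fan-out pattern behind such an imagery workflow can be sketched in a few lines. This is not plotflow's actual code: the task fields, the output naming scheme, and the placeholder `render` function are all hypothetical, and a thread pool stands in for whatever serverless runtime would execute each task.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def make_tasks(fields, forecast_hours, regions):
    """Enumerate one independent plotting task per field/hour/region."""
    return [
        {"field": f, "forecast_hour": h, "region": r}
        for f, h, r in product(fields, forecast_hours, regions)
    ]


def render(task):
    """Placeholder for one unit of work.

    A real task would open the reference mapping, slice the requested
    field lazily, draw the map, and upload the image. Here we only
    return the (hypothetical) output name.
    """
    return f"{task['region']}/{task['field']}/f{task['forecast_hour']:03d}.png"


tasks = make_tasks(["t2m", "mslp"], range(0, 13, 6), ["conus"])

# Each task is independent, so the same list can be mapped over local
# threads, a large VM, or one serverless invocation per task.
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(render, tasks))
```

Keeping each task self-describing and free of shared state is the design choice that lets the same workflow run unchanged on a laptop or be spread across ephemeral cloud workers.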

By showcasing the simplicity, low cost, and utility of kerchunk-based mappings, we hope to motivate future collaborations which could deliver on the promise of ARCO-like datasets without the need for costly reprocessing of traditional weather and climate model datasets.

J5B.3 Enabling Scalable, Serverless Weather Model Analyses by "Kerchunking" Data in the Cloud