J8B.3 A Workflow for Serving Model Data in the Cloud to a Broader Community

Tuesday, 30 January 2024: 5:00 PM
336 (The Baltimore Convention Center)
Jonathan Joyce, RPS Group, South Kingstown, RI; and B. Adams, J. Doyle, K. Fillingham, M. Iannucci, A. Kerney, K. Knee, D. Moretti, J. Quintrell, D. Snowden, T. C. Vance, and M. Wengren

Research and prototyping under the Next-Generation Data Management and Cyberinfrastructure project have demonstrated several effective techniques for improving access to gridded model forecast data in the cloud. This talk will focus on the two most important data-related solutions: Zarr and Kerchunk for storing data, and Xpublish for serving it.

Converging on a single data format is fraught with challenges, but we found a middle road with Kerchunk, which indexes the native gridded data (NetCDF/GRIB) into the Zarr specification so that specific byte ranges can be selected quickly. A common notification and queueing pattern, SNS/SQS notifications from S3 data stores, triggers the indexing as data becomes available. We also generate 30-day and model-run aggregations, each of which provides a virtual view of the data as a single dataset even though it may physically be composed of many different files.
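
As a rough illustration of the notification step, the sketch below unwraps an SQS message body that carries an SNS-delivered S3 event and lists the newly created objects that an indexer might pick up. It is not the project's code: the function names, the bucket and key in the usage note, and the file suffixes treated as indexable are assumptions made for the example, and the Kerchunk indexing itself is left out.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for objects created, given one SQS message body."""
    envelope = json.loads(body)
    # With SNS -> SQS delivery (raw message delivery disabled), the S3 event
    # arrives as a JSON string under the "Message" field of the SNS envelope.
    event = json.loads(envelope["Message"]) if "Message" in envelope else envelope
    objects = []
    # S3 test events carry no "Records" list, so default to an empty one.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue  # ignore deletions and other event types
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def is_indexable(key: str) -> bool:
    """Hypothetical filter: only NetCDF and GRIB2 files are sent to the indexer."""
    return key.endswith((".nc", ".grib2"))
```

A queue worker could call `s3_objects_from_sqs_body` on each received message and hand every key that passes `is_indexable` (for example, a made-up `forecasts/2024-01-30/run.nc` in a bucket named `example-model-bucket`) to the indexing step.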

This process makes data access from the cloud more efficient, but using the data still requires considerable domain knowledge, dependencies, and engineering. It is hard to claim that model data is truly FAIR (findable, accessible, interoperable, and reusable) given inherent data challenges such as differing projections, formats, and storage schemes, coupled with the infrastructure challenges of scaling, distribution, and storage. To address those challenges, we serve the data through a data broker layer named Xpublish. We propose that developing a common, open-source framework for serving gridded environmental data can greatly simplify and democratize data access.
