J16B.5 A New Cloud Optimized Dataloader for Modern ML Applications

Thursday, 1 February 2024: 5:15 PM
336 (The Baltimore Convention Center)
Joseph J Hamman, Earthmover PBC, New York, NY; and R. Abernathey and D. Cherian

Today, research and operational weather forecast applications are fortunate to have more data than they know what to do with. Increasingly, this data, which is sourced from a wide array of sensors, satellites, and models, is being made available in the cloud for public use. At the same time, the latest generation of machine learning models promises to unlock advances in many key forecast applications. However, building the most ambitious of these models is still a painstaking process because of the scale of data required for training. In many cases, we simply don’t have the tools and infrastructure in place to effectively feed these models the vast amounts of data they need for training.

Our recent focus has been on the data infrastructure layer of these exciting machine learning advances, with the goal of accelerating the rate at which practitioners can develop new modeling applications. Specifically, we have been working to find optimal ways to store, query, and transform data held in cloud object storage and deliver it to machine learning models for training and inference. Here we present a new software integration between Zarr (a cloud-optimized, open-source data format) and PyTorch (a leading machine learning framework) that has been developed to optimize the performance of ML training algorithms that depend on data inputs in the form of multidimensional arrays. This new integration implements asynchronous loading, processing, and training of batches, with the aim of fully utilizing advanced hardware (e.g., GPUs) when training from data stored in cloud object storage. We show how this new integration interoperates with common software libraries and how it performs under the load of real-world applications.
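To make the idea of asynchronous batch loading concrete, the following is a minimal sketch of the general prefetching pattern, not the integration presented in this talk. It uses only the Python standard library; `load_batch`, `prefetch_batches`, and `train_step` are hypothetical names, the sleep calls stand in for object-store latency and GPU compute, and in a real pipeline the loader would read Zarr chunks while the consumer ran a PyTorch training step.

```python
import queue
import threading
import time


def load_batch(index, batch_size=4):
    """Hypothetical stand-in for reading one batch from cloud object storage."""
    time.sleep(0.01)  # simulated network latency (assumption)
    start = index * batch_size
    return list(range(start, start + batch_size))


def prefetch_batches(num_batches, depth=2):
    """Yield batches in order while a background thread loads ahead.

    `depth` bounds how many loaded batches may wait in memory at once.
    """
    buffer = queue.Queue(maxsize=depth)
    sentinel = object()  # marks the end of the stream

    def producer():
        for i in range(num_batches):
            buffer.put(load_batch(i))  # blocks when the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def train_step(batch):
    """Hypothetical stand-in for one training step on the accelerator."""
    time.sleep(0.01)  # simulated compute (assumption)
    return sum(batch)


# Loading of batch i+1 overlaps with the training step on batch i.
results = [train_step(batch) for batch in prefetch_batches(num_batches=5)]
```

The bounded queue is the key design choice in this sketch: it lets loading run ahead of training so the consumer rarely waits, while capping memory use when loading is faster than training.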
