Our recent work has focused on the data infrastructure layer underlying these machine learning advances, with the goal of accelerating the rate at which practitioners can develop new modeling applications. Specifically, we have been looking for efficient ways to store, query, and transform data held in cloud object storage and to feed it to machine learning models for training and inference. Here we present a new software integration between Zarr (a cloud-optimized, open-source data format) and PyTorch (a leading machine learning framework), developed to improve the performance of ML training workloads whose inputs are multidimensional arrays. The integration loads, processes, and trains on batches asynchronously, with the aim of fully utilizing accelerator hardware (e.g. GPUs) when training from data stored in cloud object storage. We show how the integration interoperates with common software libraries and how it performs under the load of real-world applications.
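
To make the idea of asynchronous batch loading concrete, the following is a minimal sketch of the general pattern, not the integration itself. It uses only the Python standard library. `load_batch`, `prefetch`, `train`, and the `depth` parameter are hypothetical names chosen for illustration, and the `time.sleep` calls stand in for object-storage reads and training steps. A background thread fills a bounded queue with upcoming batches while the consumer works on the current one, so I/O and compute overlap.

```python
import queue
import threading
import time

_SENTINEL = object()  # marks the end of the batch stream


def load_batch(index):
    """Hypothetical stand-in for reading one array chunk from object storage."""
    time.sleep(0.01)  # simulated I/O latency
    return [index] * 4


def prefetch(num_batches, depth=2):
    """Yield batches while a background thread loads the next ones.

    `depth` bounds how many loaded batches may wait in memory at once.
    """
    buffer = queue.Queue(maxsize=depth)

    def producer():
        for i in range(num_batches):
            buffer.put(load_batch(i))  # blocks when the buffer is full
        buffer.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is _SENTINEL:
            return
        yield batch


def train(num_batches):
    """Consume prefetched batches; the sleep simulates one training step."""
    seen = []
    for batch in prefetch(num_batches):
        time.sleep(0.01)  # simulated compute on the accelerator
        seen.append(sum(batch))
    return seen
```

In this sketch the bounded queue is the key design choice: it lets loading run ahead of training by at most `depth` batches, which keeps memory use predictable while hiding storage latency behind compute.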

