Rapidly Prototyping High-Performance Meteorological Data Systems Using Xarray and Numba

Rothenberg, Daniel

One of the most difficult challenges facing ClimaCell is the pace of the weather business environment. To maintain a competitive advantage in the industry, we must be able to react rapidly to clients' needs and to research, develop, and ultimately operationalize new data products as quickly as possible. A key strategy we employ to accelerate our development pace is to unify our research and operational code bases around a consistent suite of tools, technologies, and packages, all within the scientific Python ecosystem. Some of these packages are specifically employed to ease the development of high-performance data processing systems without requiring complex external dependencies on compiled code bases (like Fortran or CUDA modules), which can be difficult to deploy in cloud environments, and sometimes more difficult to develop than comparable Python-based codes in the first place.

In this talk, I will provide an overview of how ClimaCell leverages several key scientific Python packages for building performant meteorological data analysis codes. In particular, I will highlight how we use xarray to manage the ingestion of data from NetCDF, GRIB, and GeoTIFF sources and to feed data processing pipelines, with numba-compiled universal and generalized universal functions (ufuncs and gufuncs) performing the numerical heavy lifting. Key to this workflow is the (relatively) new apply_ufunc() machinery within xarray, which allows us to quickly re-purpose code to work with different datasets, regardless of their dimensionality and without requiring invasive re-writes to the code. I will also highlight how we extend these workflows using dask, which helps us abstract away the hardware on which code runs and minimizes the work necessary to adapt prototype code running on a researcher's personal laptop to an arbitrary system in the cloud. The examples and solutions shared during this presentation should be useful for any researcher seeking to maximize their productivity using the scientific Python ecosystem, especially those researchers who must develop and run their code on different computer systems.
