Frameworks for Gaining Insight and Machine Learning on Large Climate and Weather Datasets

Jackson, Robert

Analyzing datasets that are too large to fit on a single personal computer, such as multidecadal weather radar datasets, is a complex problem that requires strategies for accessing, storing, and reducing such datasets to statistics usable by scientists. Given the large volumes of weather data available from satellites, climate models, and weather radar networks, analyses of such datasets are becoming increasingly common. In addition, with the growth of the open-source scientific Python community, tools have now been developed to analyze such datasets. For example, Dask is a Python package that represents computations as directed acyclic graphs of tasks, which allows analyses of large datasets to be scaled across multiple machines. Ophidia is a framework capable of distributed analyses of multidimensional arrays. Finally, DLHub is an Argonne effort to collect and publish machine learning models and to make them easily runnable across distributed computing clusters. In this talk, we assess the applicability of Dask, Ophidia, and DLHub for the analysis of large climate datasets using data from scanning radars and storm reports from both Darwin, Australia and the southeast United States.
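To make the directed-acyclic-graph idea concrete, the following is a minimal sketch in plain Python of how a mean over a chunked dataset can be expressed as a graph of small tasks and then executed in dependency order. It is an illustration of the concept only, not Dask's implementation or API: the graph layout and the names `chunk_sum_count`, `combine`, `build_mean_graph`, and `execute` are hypothetical, and the chunk values are made up.

```python
from graphlib import TopologicalSorter


def chunk_sum_count(values):
    """Reduce one chunk to a (sum, count) pair; chunks are independent of each other."""
    return (sum(values), len(values))


def combine(*partials):
    """Merge the per-chunk (sum, count) pairs into a single overall mean."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count


def build_mean_graph(chunks):
    """Build a hypothetical task graph: key -> (function, dependency keys, constant args)."""
    graph = {}
    partial_keys = []
    for i, chunk in enumerate(chunks):
        key = f"partial-{i}"
        graph[key] = (chunk_sum_count, (), (chunk,))
        partial_keys.append(key)
    # The final task depends on every per-chunk task.
    graph["mean"] = (combine, tuple(partial_keys), ())
    return graph


def execute(graph, target):
    """Run tasks serially in topological order; a real scheduler could run independent tasks in parallel."""
    dependencies = {key: set(deps) for key, (_, deps, _) in graph.items()}
    results = {}
    for key in TopologicalSorter(dependencies).static_order():
        func, deps, consts = graph[key]
        results[key] = func(*[results[d] for d in deps], *consts)
    return results[target]


# Made-up chunks standing in for pieces of a dataset too large to hold at once.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
overall_mean = execute(build_mean_graph(chunks), "mean")
print(overall_mean)
```

Because the per-chunk tasks share no dependencies, a scheduler is free to place them on different workers and ship only the small (sum, count) pairs to the final task, which is the property that lets graph-based tools scale beyond one machine.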

Examples will be shown of how Dask and Ophidia can compute 3-hourly mean rainfall rates, as well as spatial means of rainfall rates, over subsets of the CPOL dataset. Owing to its I/O optimizations, Ophidia typically computes time series of 3-hourly averages from a 1-degree grid box in the CPOL domain in about half the time that Dask does. Examples of how DLHub can be used to semi-automatically label convective phenomena such as derechos, supercells, and cold pools will also be shown.
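The 3-hourly averaging described above can be sketched in plain Python as a binning reduction over time-stamped rainfall rates. This is a minimal illustration under stated assumptions, not the Dask or Ophidia workflow used in the talk: the function name `three_hourly_means`, the 10-minute sampling interval, the date, and the rainfall values are all hypothetical, and the input is assumed to be a series already averaged over the grid box of interest.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def three_hourly_means(samples, window_hours=3):
    """Average (time, rain rate) samples into fixed windows aligned to midnight.

    samples: iterable of (datetime, rain_rate) pairs; units are whatever the input uses.
    Returns a dict mapping each window's start time to the mean rate in that window.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for time, rate in samples:
        # Floor the timestamp to the start of its window (00, 03, 06, ... for 3 hours).
        window_start = time.replace(
            hour=(time.hour // window_hours) * window_hours,
            minute=0,
            second=0,
            microsecond=0,
        )
        sums[window_start] += rate
        counts[window_start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


# Synthetic series: six hours of samples every 10 minutes (hypothetical values).
start = datetime(2006, 1, 20, 0, 0)
samples = [
    (start + timedelta(minutes=10 * i), 1.0 if i < 18 else 3.0)
    for i in range(36)
]
means = three_hourly_means(samples)
for window_start, mean_rate in means.items():
    print(window_start.isoformat(), mean_rate)
```

Each window's result depends only on the samples that fall inside it, so the same reduction can be split by time range or by grid box and distributed, which is the kind of workload the frameworks above are designed to parallelize.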

J2.5 Frameworks for Gaining Insight and Machine Learning on Large Climate and Weather Datasets