Wednesday, 31 January 2024: 9:15 AM
324 (The Baltimore Convention Center)
JT Thielen, Colorado State Univ., Fort Collins, CO; and R. S. Schumacher
Vertical profiles of temperature, pressure, humidity, and wind (such as observed soundings from radiosondes or simulated soundings from model output) are essential data products for the analysis and forecasting of severe convective storms and their associated hazards, including tornadoes, damaging winds, hail, and extreme precipitation. Packages in the Python ecosystem such as MetPy and SHARPpy are well-established and well-regarded for common sounding analysis and visualization tasks, such as calculating convective parameters and plotting Skew-T diagrams for single profiles or small collections of profiles. However, for large collections of vertical profiles, as one would encounter in climatological studies or in machine learning model training and testing, the standard workflows using these tools become unwieldy, as such use cases generally fall outside the original scope of these tools' designs. To move beyond mere parallelization of these baseline implementations, any sufficiently general workflow for large collections of vertical profiles must confront two central challenges: irregular array lengths and fast integration of one-dimensional differential equations. The Scientific Python ecosystem indeed provides the tools needed to address both challenges, allowing general workflows without requiring simplifications such as interpolation to regular levels or restriction to vectorizable analytical routines. Furthermore, the in-memory representations of these profile collections can be readily brought into Python libraries for machine learning, allowing sequence-based model architectures (such as the Transformer/Multi-Head Attention approach best known for its use in large language models) to be applied to atmospheric profiles at their native resolutions.
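As a minimal sketch of the ragged-array representation (the values and field names below are illustrative assumptions, not data or code from the presentation), a small collection of variable-length soundings can be held in a single Awkward Array (one of the libraries described below) and summarized per profile without padding or interpolating to common levels:

    import awkward as ak

    # Two toy soundings with different numbers of vertical levels.
    profiles = ak.Array([
        {"pressure": [1000.0, 850.0, 700.0, 500.0],                  # hPa
         "temperature": [25.0, 14.0, 5.0, -12.0]},                   # degC
        {"pressure": [1000.0, 925.0, 850.0, 700.0, 500.0, 300.0],
         "temperature": [22.0, 18.0, 12.0, 4.0, -14.0, -40.0]},
    ])

    print(ak.num(profiles.pressure))             # levels per profile: [4, 6]
    print(ak.min(profiles.temperature, axis=1))  # coldest level in each profile

Each profile keeps its native number of levels, and reductions such as ak.min operate along the ragged axis directly.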
Motivated by research into improving convective hazard probability models via ML-guided, per-profile feature engineering, this presentation describes reusable scientific workflows in Python for handling large collections of irregular-length vertical profiles, both in the exploratory phase and in their integration with machine learning libraries such as PyTorch and TensorFlow. These workflows center on collections of profiles structured as ragged arrays using Awkward Array and Apache Arrow, which are read from Parquet and processed using numerical routines compiled with Numba. Given the disparate storage formats in use for sounding data, special attention is given to the data format translation portions of the workflows. Additionally, novel implementations of convective parameters used by the motivating research project, such as entrainment CAPE, will be demonstrated. Finally, prospects for bringing aspects of these workflows into commonly used libraries, particularly as they relate to ongoing efforts by the MetPy and SHARPpy maintainers, will be discussed.
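As a hedged sketch of this kind of workflow (the file name, column names, and the per-profile reduction are assumptions for illustration; the actual convective-parameter implementations, such as entrainment CAPE, are the subject of the presentation), ragged profiles stored as Arrow list columns in Parquet can be read into an Awkward Array and reduced profile-by-profile with a Numba-compiled kernel:

    import awkward as ak
    import numba as nb
    import numpy as np

    # Assumed layout: one variable-length profile per row, stored as Arrow
    # list columns (e.g., "pressure", "temperature") in a Parquet file.
    profiles = ak.from_parquet("soundings.parquet")

    @nb.njit
    def mean_temperature(profiles, out):
        # Toy per-profile reduction over the ragged levels; a real convective
        # parameter such as entrainment CAPE would instead integrate a parcel
        # model upward through each profile here.
        for i in range(len(profiles)):
            temps = profiles[i].temperature
            total = 0.0
            for j in range(len(temps)):
                total += temps[j]
            out[i] = total / len(temps)
        return out

    per_profile = mean_temperature(profiles, np.empty(len(profiles)))

Because Awkward Arrays can be passed directly into Numba-compiled functions, the loop over irregular-length profiles stays explicit and readable while running at compiled speed.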
