We have created a unique data set of more than 60 simulations of tornadic and non-tornadic supercell thunderstorms. Creating this data set requires HPC resources both for running the simulations and for storing their output. Each simulation has a horizontal grid spacing of 100 m and a stretched vertical grid with spacing ranging from 40 m to 500 m. The supercell simulations were generated using the compressible mode of the Bryan Cloud Model 1 (CM1), a three-dimensional, time-dependent, non-hydrostatic numerical model designed primarily for research on deep precipitating convection (i.e., thunderstorms).
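To illustrate what a stretched vertical grid of this kind looks like, the sketch below builds level heights whose spacing grows geometrically from 40 m until it is capped at 500 m. The stretch ratio, model-top height, and level count are invented for illustration and are not the actual CM1 configuration used in these simulations.

```python
def stretched_grid(dz_min=40.0, dz_max=500.0, ratio=1.06, top=20000.0):
    """Return grid-level heights (m) whose spacing grows by `ratio`
    per level until it reaches dz_max, then stays constant.
    All parameter values here are illustrative, not the CM1 setup."""
    heights = [0.0]
    dz = dz_min
    while heights[-1] < top:
        heights.append(heights[-1] + dz)
        dz = min(dz * ratio, dz_max)  # cap the spacing at dz_max
    return heights

levels = stretched_grid()
spacings = [b - a for a, b in zip(levels, levels[1:])]
print(f"{len(levels)} levels, dz from {spacings[0]:.0f} m to {spacings[-1]:.0f} m")
```

With these illustrative parameters the spacing reaches its 500 m cap a few kilometers above the surface, so most of the vertical resolution is concentrated in the lowest levels, where tornadogenesis processes occur.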
The simulations provide a highly detailed view of the atmosphere, including the fine-scale processes that give rise to tornadogenesis. However, the data are so large that traditional, non-automated analysis techniques are extremely inefficient and therefore unlikely to produce new insights into tornado formation. Fortunately, machine learning and data mining techniques can handle large data sets efficiently and may provide more general insights than can be obtained from analysis of a small number of cases. We have developed two machine learning methods designed for large-scale meteorological data: the Spatiotemporal Relational Probability Tree (SRPT) and the Spatiotemporal Relational Random Forest (SRRF). In this talk, we present preliminary results from applying the SRPT and SRRF to the high-resolution supercell simulations and then discuss the HPC requirements of the simulations, post-processing steps, and machine learning/data mining techniques.
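As a rough analogy for the ensemble idea behind the SRRF (which operates on spatiotemporal relational attributes and is not reproduced here), the sketch below trains an ordinary random forest of decision stumps by bootstrap aggregation and majority vote. The two summary features (low-level vorticity, cold-pool strength) and the toy labels are invented for illustration.

```python
# Toy illustration only: a conventional random forest of decision stumps,
# NOT the SRRF. Features and data are invented for this sketch.
import random

def fit_stump(X, y):
    """Pick the (feature, threshold, sign) split minimizing misclassification."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            pred = [1 if row[f] > t else 0 for row in X]
            raw_err = sum(p != label for p, label in zip(pred, y))
            err = min(raw_err, len(y) - raw_err)  # allow the inverted stump
            if best is None or err < best[0]:
                sign = 1 if raw_err <= len(y) / 2 else -1
                best = (err, f, t, sign)
    return best[1:]

def stump_predict(stump, row):
    f, t, sign = stump
    p = 1 if row[f] > t else 0
    return p if sign == 1 else 1 - p

def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def forest_predict(forest, row):
    """Majority vote across the ensemble."""
    votes = sum(stump_predict(s, row) for s in forest)
    return 1 if votes * 2 >= len(forest) else 0

# Invented training data: [low-level vorticity, cold-pool strength],
# label 1 = tornadic, 0 = non-tornadic.
X = [[0.9, 0.2], [0.8, 0.3], [0.2, 0.7], [0.1, 0.9], [0.85, 0.25], [0.15, 0.8]]
y = [1, 1, 0, 0, 1, 0]
forest = fit_forest(X, y)
print(forest_predict(forest, [0.95, 0.1]), forest_predict(forest, [0.05, 0.95]))
```

The key design point shared with the SRRF is variance reduction through bagging: each tree sees a resampled view of the storms, and the ensemble vote is more robust than any single tree. The SRRF differs in that its trees split on spatiotemporal relational questions about storm objects rather than fixed scalar features.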