In this study, we adopt a new approach that mimics the multi-resolution feature engineering used in deep learning methods. Grid points are pooled over various spatial scales, then filtered to coarsen the fields. We apply maximum value filters to intrastorm variables, and mean value filters to environmental variables, after which ensemble statistics are calculated. Two archetypes of ML models are trained: logistic regression (LR) and histogram-based gradient boosted trees (HGBT). The models are trained to predict the probability of severe weather (hail, wind, and/or tornado) within 36 km of a point using storm reports from NOAA’s Storm Events Database as our target data.
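The pooling and filtering step described above can be illustrated with a minimal sketch. It assumes non-overlapping square blocks, grid dimensions divisible by the block size, and an ensemble mean and maximum as the ensemble statistics; none of these details are specified in the abstract. The function names, the block sizes, and the synthetic "uh" and "cape" fields are hypothetical.

```python
import numpy as np


def coarsen(field, block, reducer):
    """Reduce non-overlapping block x block windows of a (member, y, x) array.

    `reducer` is np.max for intrastorm fields or np.mean for environmental
    fields. Non-overlapping blocks are an assumption made for this sketch.
    """
    members, ny, nx = field.shape
    if ny % block or nx % block:
        raise ValueError("grid dimensions must be divisible by the block size")
    tiles = field.reshape(members, ny // block, block, nx // block, block)
    return reducer(tiles, axis=(2, 4))


def ensemble_stats(coarse):
    """Collapse the member axis into summary statistics (assumed: mean, max)."""
    return {"mean": coarse.mean(axis=0), "max": coarse.max(axis=0)}


# Synthetic 4-member ensemble on an 8 x 8 grid (illustrative values only).
rng = np.random.default_rng(0)
uh = rng.gamma(2.0, 20.0, size=(4, 8, 8))      # stand-in intrastorm field
cape = rng.uniform(0.0, 3000.0, size=(4, 8, 8))  # stand-in environmental field

features = {}
for block in (1, 2, 4):  # hypothetical pooling scales, in grid points
    features[f"uh_max_{block}"] = ensemble_stats(coarsen(uh, block, np.max))
    features[f"cape_mean_{block}"] = ensemble_stats(coarsen(cape, block, np.mean))
```

Each entry in `features` holds coarsened ensemble statistics at one scale; in a real pipeline these would be flattened into predictor columns for the LR and HGBT models.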
The ML models are compared to a set of rigorous baselines consisting of optimized neighborhood maximum ensemble probabilities (NMEP) of threshold exceedance. The baseline variables are 2-5 km updraft helicity (UH) for any-severe and tornadoes, 80-meter wind speed for severe wind, and HAILCAST for severe hail. Feature ablation experiments are also conducted to ascertain how predictors contribute to the models’ skill. Skill is evaluated using metrics such as the Critical Success Index, Area under the Performance Diagram Curve, and Brier Skill Score.
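As a rough illustration of the baseline and two of the metrics named above, the sketch below computes an NMEP field and then the Critical Success Index and Brier Skill Score. It assumes a square neighborhood measured in grid points, no spatial smoothing of the probabilities, and sample climatology as the Brier Skill Score reference; the abstract does not state these choices, and the function names and thresholds are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def nmep(fields, threshold, radius):
    """Fraction of members exceeding `threshold` anywhere within `radius`
    grid points (square neighborhood, an assumption) of each point.

    `fields` has shape (member, y, x).
    """
    exceed = fields >= threshold
    pad = ((0, 0), (radius, radius), (radius, radius))
    padded = np.pad(exceed, pad, constant_values=False)
    width = 2 * radius + 1
    windows = sliding_window_view(padded, (width, width), axis=(1, 2))
    hit = windows.any(axis=(-2, -1))  # neighborhood maximum of a binary field
    return hit.mean(axis=0)


def critical_success_index(prob, obs, prob_threshold):
    """CSI = hits / (hits + misses + false alarms) at one probability threshold."""
    fcst = np.asarray(prob) >= prob_threshold
    obs = np.asarray(obs).astype(bool)
    hits = np.sum(fcst & obs)
    misses = np.sum(~fcst & obs)
    false_alarms = np.sum(fcst & ~obs)
    denom = hits + misses + false_alarms
    return hits / denom if denom else float("nan")


def brier_skill_score(prob, obs):
    """BSS = 1 - BS / BS_ref, with sample climatology as the reference."""
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=float)
    bs = np.mean((prob - obs) ** 2)
    bs_ref = np.mean((obs.mean() - obs) ** 2)
    return 1.0 - bs / bs_ref


# Two-member toy ensemble: only member 0 exceeds the threshold, at the center.
toy = np.zeros((2, 3, 3))
toy[0, 1, 1] = 100.0
baseline = nmep(toy, threshold=75.0, radius=1)
```

In this toy case every grid point lies within one grid point of the exceedance in member 0, so `baseline` is 0.5 everywhere. Tuning `threshold` and `radius` against observations would be one way to "optimize" such a baseline, though the abstract does not say how the optimization was done.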
Of the models evaluated, HGBT achieves the highest performance, followed closely by LR. Despite the similar objective performance, the outputs of the ML models differ systematically because of differences in the algorithms. One such difference is that LR is able to output higher probabilities than HGBT. The largest improvements over the baselines occur for severe wind and severe hail, followed by any-severe and tornadoes. Intrastorm features are responsible for the majority of the skill, as models trained with only intrastorm features perform comparably to models trained with all predictors. Little benefit is gained from using multiple scales of intrastorm features, as models with only one scale of intrastorm predictors perform nearly as well as models with multiple intrastorm predictor scales. Models comprising only environmental predictors generally have low skill and perform worse than the baseline, except when predicting severe wind. Using a single scale of both intrastorm and environmental predictors generally produces more skillful models than using multiple scales of either intrastorm or environmental predictors alone. Ongoing work includes exploring deep learning as an avenue to improve the quality and skill of the guidance, as well as incorporating a higher-fidelity target dataset such as the Maximum Estimated Size of Hail (MESH). We aim to have these products evaluated in the 2024 Hazardous Weather Testbed Spring Forecasting Experiment.

