Our training dataset comprises Warn-on-Forecast System (WoFS) ensemble forecasts available every 5 min out to 150 min of lead time from the 2017-2019 NOAA Hazardous Weather Testbed Spring Forecasting Experiments (HWT-SFE; 81 dates). Using a novel ensemble storm track identification method, we extracted three sets of predictors from the WoFS forecasts: intra-storm state variables, near-storm environment variables, and morphological attributes of the ensemble storm tracks. We then trained random forest, gradient-boosted tree, and logistic regression algorithms to predict which WoFS 30-min ensemble storm tracks will overlap a tornado, severe hail, and/or severe wind report. To provide rigorous baselines against which to evaluate the skill of the ML models, we computed, for each ensemble storm track, the ensemble probability of hazard-relevant WoFS variables exceeding tuned thresholds. The ML models performed well in both retrospective and real-time settings and were more skillful than the baseline systems.
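As a rough illustration of this workflow (not the study's actual code), the sketch below trains a scikit-learn random forest on synthetic track-level predictors to predict report overlap and compares it against a threshold-exceedance baseline. The predictor values, ensemble size, and tuned threshold are all hypothetical placeholders.

```python
# Minimal sketch, assuming scikit-learn and synthetic stand-in data; this is
# not the authors' implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Stand-in for track-level predictors: intra-storm, near-storm environment,
# and morphological attributes aggregated over each ensemble storm track.
n_tracks, n_features = 5000, 30
X = rng.normal(size=(n_tracks, n_features))   # synthetic predictors
y = rng.integers(0, 2, size=n_tracks)         # 1 = track overlaps a report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=0)
model.fit(X_train, y_train)
p_ml = model.predict_proba(X_test)[:, 1]      # ML hazard probability per track

# Baseline: fraction of ensemble members whose hazard-relevant diagnostic
# exceeds a tuned threshold (diagnostic values and threshold are hypothetical).
n_members = 18
member_diag = rng.gamma(2.0, 30.0, size=(len(y_test), n_members))
tuned_threshold = 75.0
p_baseline = (member_diag > tuned_threshold).mean(axis=1)

print("ML AUC:      ", roc_auc_score(y_test, p_ml))
print("Baseline AUC:", roc_auc_score(y_test, p_baseline))
```

On real predictors, the same comparison would be repeated for each hazard (tornado, hail, wind) and evaluated with skill scores appropriate to rare events rather than the toy AUC shown here.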
To showcase the potential usefulness of the ML guidance, we conducted a data denial experiment during the 2022 HWT-SFE. To improve confidence in the ML guidance, interactive explainability graphics were integrated into the WoFS web viewer. Participants were tasked with generating 1-h outlooks for the 21-22 and 22-23 UTC periods. One group had access to the full WoFS suite, while the other had access to the experimental ML products in addition to the WoFS suite. After issuing their outlooks, participants were surveyed on several aspects of the process, including their confidence in their outlooks, the number of products they reviewed, and their opinion of the usefulness of the ML guidance. When the outlooks were objectively evaluated against observed storm reports, the ML-based outlooks outperformed the non-ML-based outlooks on multiple verification metrics. Furthermore, when subjectively evaluated against all available verification datasets (reports, warning polygons, radar-estimated hail size, etc.), outlooks generated by ML users were rated significantly higher, especially for wind. Although participants generally liked the explainability graphics, they did not always find them necessary and noted that they contributed to information overload. We also found that the ML guidance did not necessarily streamline the forecast process, perhaps because additional training and exposure are required to fully leverage it. In general, the experiment demonstrated the usefulness of ML guidance for severe weather forecasting.
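The abstract does not specify which verification metrics were used, but one common choice for scoring outlooks against storm reports is the critical success index (CSI). The hedged sketch below binarizes a gridded probabilistic outlook at a probability threshold and scores it against a report mask; the grids and the threshold are synthetic placeholders, not the experiment's data.

```python
# Minimal sketch of outlook verification with CSI, assuming gridded outlook
# probabilities and a boolean report mask; all inputs here are synthetic.
import numpy as np

def csi(forecast_prob: np.ndarray, observed: np.ndarray, threshold: float) -> float:
    """CSI = hits / (hits + misses + false alarms) for a binarized forecast."""
    fcst = forecast_prob >= threshold
    hits = np.sum(fcst & observed)
    misses = np.sum(~fcst & observed)
    false_alarms = np.sum(fcst & ~observed)
    denom = hits + misses + false_alarms
    return hits / denom if denom else np.nan

rng = np.random.default_rng(0)
outlook = rng.random((100, 100))          # synthetic 1-h outlook probabilities
reports = rng.random((100, 100)) > 0.9    # synthetic observed-report mask

print(f"CSI @ 15%: {csi(outlook, reports, 0.15):.3f}")
```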

