904 Using Machine Learning to Produce Watch-to-Warning Severe Weather Guidance

Thursday, 1 February 2024
Hall E (The Baltimore Convention Center)
Sam Varga, University of Oklahoma, Norman, OK; Cooperative Institute for Severe and High-Impact Weather Research and Operations, University of Oklahoma, Norman, OK; National Severe Storms Laboratory, Norman, OK; and M. L. Flora and C. K. Potvin


The Warn-on-Forecast System (WoFS) is a convection-allowing ensemble that operates at Watch-to-Warning lead times (0-6 hours). A suite of WoFS-based machine learning (ML) products, such as those described in Flora et al. (2021), was developed previously to leverage the WoFS and provide skillful severe weather guidance. Access to WoFS ML products has been shown to increase the skill of outlooks produced during Hazardous Weather Testbed Spring Forecasting Experiments (Clark et al., 2023). However, the current suite of WoFS ML products focuses on lead times of 0-3 hours, leaving the latter half of the Watch-to-Warning window without ML products. Our study addresses this gap by applying ML to WoFS output to produce calibrated severe weather guidance for lead times of 2-6 hours.

In this study, we adopt a new approach that mimics the multi-resolution feature engineering used in deep learning methods. Fields are coarsened by pooling grid points over various spatial scales: maximum-value filters are applied to intrastorm variables and mean-value filters to environmental variables, after which ensemble statistics are calculated. Two types of ML models are trained: logistic regression (LR) and histogram-based gradient-boosted trees (HGBT). The models predict the probability of severe weather (hail, wind, and/or tornado) within 36 km of a point, using storm reports from NOAA's Storm Events Database as the target data.
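The multi-scale pooling described above can be sketched as follows. This is an illustrative example, not the authors' code: the ensemble size, grid, spatial scales, variable proxies, and choice of ensemble statistics (mean and spread) are all assumptions made for demonstration.

```python
import numpy as np
from scipy.ndimage import maximum_filter, uniform_filter

rng = np.random.default_rng(0)
n_members, ny, nx = 18, 60, 60  # hypothetical ensemble size and grid
uh = rng.gamma(2.0, 20.0, (n_members, ny, nx))        # intrastorm proxy (e.g., UH)
cape = rng.normal(1500.0, 300.0, (n_members, ny, nx))  # environmental proxy

def pool_and_stats(field, scales, pool):
    """Coarsen each member at several spatial scales with the given
    filter, then compute ensemble statistics at each grid point."""
    feats = []
    for k in scales:  # filter footprint in grid points (assumed sizes)
        filt = pool(field, size=(1, k, k))  # filter only in space, not members
        feats.append(filt.mean(axis=0))     # ensemble mean
        feats.append(filt.std(axis=0))      # ensemble spread
    return np.stack(feats, axis=-1)         # (ny, nx, n_features)

scales = (3, 9, 15)  # e.g., roughly 9, 27, 45 km at 3-km grid spacing
X_storm = pool_and_stats(uh, scales, maximum_filter)   # max filter: intrastorm
X_env = pool_and_stats(cape, scales, uniform_filter)   # mean filter: environment
X = np.concatenate([X_storm, X_env], axis=-1).reshape(-1, 4 * len(scales))
print(X.shape)  # one feature row per grid point -> (3600, 12)
```

The resulting feature matrix, with one row per grid point, is the kind of tabular input that LR and HGBT models accept directly.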

The ML models are compared against a set of rigorous baselines consisting of optimized neighborhood maximum ensemble probabilities (NMEP) of threshold exceedance. The baseline variables are 2-5 km updraft helicity (UH) for any-severe and tornadoes, 80-m wind speed for severe wind, and HAILCAST for severe hail. Feature ablation experiments are also conducted to ascertain how predictors contribute to model skill. Skill is evaluated using metrics such as the Critical Success Index (CSI), the area under the performance diagram curve, and the Brier Skill Score (BSS).
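For readers unfamiliar with these quantities, the sketch below shows one common way to compute an NMEP baseline and two of the named metrics. The UH threshold, neighborhood size, probability threshold, and synthetic data are illustrative assumptions, not values from the study.

```python
import numpy as np
from scipy.ndimage import maximum_filter

rng = np.random.default_rng(1)
n_members, ny, nx = 18, 60, 60
uh = rng.gamma(2.0, 20.0, (n_members, ny, nx))  # 2-5 km UH per member (synthetic)

def nmep(ensemble, threshold, neighborhood):
    """Neighborhood maximum ensemble probability: the fraction of members
    whose neighborhood maximum exceeds the threshold at each grid point."""
    nmax = maximum_filter(ensemble, size=(1, neighborhood, neighborhood))
    return (nmax > threshold).mean(axis=0)

prob = nmep(uh, threshold=75.0, neighborhood=5)  # hypothetical tuning
obs = rng.random((ny, nx)) < 0.05                # placeholder observed events

def csi(prob, obs, p_thresh=0.5):
    """Critical Success Index = hits / (hits + misses + false alarms)."""
    fc = prob >= p_thresh
    hits = np.sum(fc & obs)
    return hits / max(np.sum(fc | obs), 1)

def brier_skill_score(prob, obs):
    """Brier Skill Score relative to the observed base rate (climatology)."""
    bs = np.mean((prob - obs) ** 2)
    bs_ref = np.mean((obs.mean() - obs) ** 2)
    return 1.0 - bs / bs_ref

print(csi(prob, obs), brier_skill_score(prob, obs))
```

In practice the NMEP threshold and neighborhood are tuned ("optimized") per hazard, which is what makes the baseline a rigorous point of comparison.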

Of the models evaluated, HGBT achieves the highest performance, followed closely by LR. Despite the similar objective performance, systematic differences exist in the output of the ML models owing to differences in the algorithms; for example, LR can output higher probabilities than HGBT. The largest improvements over the baselines occur for severe wind and severe hail, followed by any-severe and tornadoes. Intrastorm features are responsible for the majority of the skill: models trained with only intrastorm features perform comparably to models trained with all predictors. Little benefit is gained from using multiple scales of intrastorm features, as models with a single scale of intrastorm predictors perform nearly as well as models with multiple scales. Models using only environmental predictors generally have low skill and perform worse than the baseline, except when predicting severe wind. Using a single scale of both intrastorm and environmental predictors generally produces more skillful models than using multiple scales of either predictor type alone. Ongoing work includes exploring deep learning as an avenue to improve the quality and skill of the guidance, as well as incorporating a higher-fidelity target dataset such as the Maximum Estimated Size of Hail (MESH). We aim to have these products evaluated in the 2024 Hazardous Weather Testbed Spring Forecasting Experiment.
