In this analysis, we evaluate how well Bayesian model averaging (BMA) performs for 0-10 day forecasts of 2 m temperature across the continental United States over a six-month period (April-September 2017), using a multi-model ensemble built from a large, diverse set of regional and global deterministic and ensemble models. Verification metrics, including the probability integral transform (PIT), the continuous ranked probability score (CRPS), reliability diagrams, and sharpness, are applied to both in-sample and out-of-sample datasets. Results indicate that without calibration, 2 m temperature observations fall outside the predictive density approximately 20% of the time, whereas BMA calibration produces nearly flat, well-calibrated PIT histograms. In addition, the CRPS, the probabilistic analogue of the mean absolute error, is reduced at all lead times, most markedly at short lead times. To underscore the value of a calibrated multi-model ensemble, these results are compared against calibrated single-model ensemble systems based on the ECMWF Ensemble Forecast System and the Global Ensemble Forecast System. In light of these results, we encourage consumers and businesses considering or using probabilistic content in their data-driven decision making to ask whether a weather provider's forecast probabilities mean what they say.
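As a minimal sketch of the two verification metrics named above, the snippet below computes a PIT value and a Monte Carlo CRPS estimate for a toy BMA predictive density expressed as a Gaussian mixture over ensemble members. The member forecasts, BMA weights, kernel spread, and observation are all invented for illustration; this is not the paper's implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical values, for illustration only.
member_means = np.array([14.2, 15.0, 15.8])  # member forecasts of 2 m temperature (deg C)
weights = np.array([0.5, 0.3, 0.2])          # assumed BMA weights (sum to 1)
sigma = 1.2                                   # assumed common Gaussian kernel spread

def bma_cdf(x):
    """CDF of the BMA mixture: weighted sum of the member Gaussian CDFs."""
    return float(np.sum(weights * stats.norm.cdf(x, loc=member_means, scale=sigma)))

def pit(obs):
    """Probability integral transform F(obs): a flat histogram of these
    values over many cases indicates a calibrated forecast."""
    return bma_cdf(obs)

def crps_mc(obs, n=20000):
    """Monte Carlo CRPS via the kernel form E|X - y| - 0.5 * E|X - X'|,
    sampling X, X' independently from the mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)
    x = rng.normal(member_means[comp], sigma)
    comp2 = rng.choice(len(weights), size=n, p=weights)
    x2 = rng.normal(member_means[comp2], sigma)
    return np.abs(x - obs).mean() - 0.5 * np.abs(x - x2).mean()

obs = 15.4  # hypothetical verifying observation
print(f"PIT value: {pit(obs):.3f}")
print(f"CRPS estimate (deg C): {crps_mc(obs):.3f}")
```

Observations landing outside the predictive density correspond to PIT values piling up near 0 or 1; the roughly 20% rate quoted above for the uncalibrated ensemble would show up as heavy tails in the PIT histogram.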