Mean What You Say: Calibration and Verification of the Probability Forecast Platform (PFP) at The Weather Company

Belanger, James I.

The Weather Company, an IBM Business, has developed the Probability Forecast Platform (PFP) to generate probabilistic weather forecast content intended to yield better data-driven decisions by consumers and businesses. A critical component of producing useful probabilistic forecasts is ensuring that the resulting probabilities and percentiles mean what they say, i.e., are statistically well-calibrated and reliable, and are as sharp as possible given those constraints. A variety of methods have been proposed in the literature to produce calibrated probabilities, including extended logistic regression (Wilks 2009), nonhomogeneous Gaussian regression (Gneiting et al. 2005), direct application of Bayes’ rule (Hodyss et al. 2016), and Bayesian model averaging (BMA; Raftery et al. 2005). The benefit of using BMA is that it optimizes the weights and spread necessary to produce a calibrated probability density function based on a small sample of training data, whereas extended logistic regression and nonhomogeneous Gaussian regression require longer training sets to achieve similar results. In addition, BMA preserves the potential multimodal nature of the underlying weather forecast models used by PFP.
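To make the BMA idea concrete, the following is a minimal sketch of a BMA predictive density built as a weighted mixture of Gaussian kernels, one centered on each member forecast, in the spirit of Raftery et al. (2005). It is not the PFP implementation. The member forecasts, weights, and the single shared spread `sigma` are hypothetical values chosen for illustration; in practice the weights and spread would be fitted to training data, and member forecasts would typically be bias-corrected first.

```python
import numpy as np

def bma_pdf(y, member_forecasts, weights, sigma):
    """Weighted mixture of Gaussian kernels, one per ensemble member.

    Assumes a single shared kernel spread `sigma` (a simplification).
    """
    y = np.atleast_1d(np.asarray(y, dtype=float))[:, None]
    mu = np.asarray(member_forecasts, dtype=float)[None, :]
    w = np.asarray(weights, dtype=float)[None, :]
    kernels = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return (w * kernels).sum(axis=1)

# Hypothetical 2-m temperature forecasts (deg C) from three models:
# two agree closely, one is much warmer.
forecasts = np.array([18.0, 18.5, 24.0])
weights = np.array([0.4, 0.3, 0.3])   # illustrative only; must sum to 1
sigma = 1.0                           # illustrative kernel spread

grid = np.linspace(10.0, 32.0, 2201)
dx = grid[1] - grid[0]
density = bma_pdf(grid, forecasts, weights, sigma)

# Sanity check: the density should integrate to about 1.
area = density.sum() * dx

# Count local maxima to see whether the mixture keeps more than one mode.
interior = density[1:-1]
is_peak = (interior > density[:-2]) & (interior > density[2:])
n_modes = int(is_peak.sum())
print(f"area={area:.6f}, modes={n_modes}")
```

Because each member keeps its own kernel, disagreement between models can survive as separate modes in the predictive density, whereas fitting a single Gaussian to the same forecasts would average it away.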

In this analysis, we evaluate how well Bayesian model averaging performs for 0-10 day forecasts of 2-m temperature across the continental United States over a 6-month period from April to September 2017, using a multi-model ensemble featuring a large, diverse set of regional and global deterministic and ensemble models. Verification metrics such as the probability integral transform (PIT), continuous ranked probability score (CRPS), reliability diagram, and sharpness are applied to both in-sample and out-of-sample datasets. Results indicate that without calibration, 2-m temperature observations occur outside of the predictive density function approximately 20% of the time, whereas BMA calibration produces nearly flat, well-calibrated PIT diagrams. In addition, the CRPS is reduced at all lead times, most materially in the short term, signalling an improvement in the probabilistic equivalent of the mean absolute error. To underscore the value of using a calibrated multi-model ensemble, these results are compared against calibrated single-model ensemble systems using the ECMWF Ensemble Forecast System and the Global Ensemble Forecast System. In light of these results, we encourage consumers and businesses that are considering or using probabilistic content in their data-driven decision making to ask whether a weather provider’s forecast probabilities mean what they say.
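The two headline metrics can be illustrated with a short sketch. It assumes a Gaussian-mixture predictive distribution with hypothetical member forecasts, weights, and spread, and uses synthetic observations rather than the study's data. The PIT is the predictive CDF evaluated at the observation; the CRPS is computed here by direct numerical integration of the squared difference between the predictive CDF and the observation's step function.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def bma_cdf(y, member_forecasts, weights, sigma):
    """CDF of a Gaussian mixture with one kernel per member (shared sigma assumed)."""
    y = np.atleast_1d(np.asarray(y, dtype=float))[:, None]
    mu = np.asarray(member_forecasts, dtype=float)[None, :]
    w = np.asarray(weights, dtype=float)[None, :]
    member_cdfs = 0.5 * (1.0 + _erf((y - mu) / (sigma * math.sqrt(2.0))))
    return (w * member_cdfs).sum(axis=1)

def pit_value(obs, member_forecasts, weights, sigma):
    """PIT: the predictive CDF evaluated at the observed value."""
    return float(bma_cdf(obs, member_forecasts, weights, sigma)[0])

def crps_numeric(obs, member_forecasts, weights, sigma, half_width=12.0, n=6001):
    """CRPS as the integral of (F(x) - 1{x >= obs})^2, via a Riemann sum."""
    mu = np.asarray(member_forecasts, dtype=float)
    lo = min(mu.min(), obs) - half_width * sigma
    hi = max(mu.max(), obs) + half_width * sigma
    grid = np.linspace(lo, hi, n)
    cdf = bma_cdf(grid, mu, weights, sigma)
    step = (grid >= obs).astype(float)
    return float(np.sum((cdf - step) ** 2) * (grid[1] - grid[0]))

# Synthetic experiment (hypothetical values): observations are drawn from the
# mixture itself, so the forecast with the true spread is calibrated by design.
rng = np.random.default_rng(0)
mu = np.array([18.0, 18.5, 24.0])
w = np.array([0.4, 0.3, 0.3])
true_sigma = 1.0
component = rng.choice(len(w), size=2000, p=w)
obs = rng.normal(mu[component], true_sigma)

pit_cal = bma_cdf(obs, mu, w, true_sigma)           # calibrated spread
pit_under = bma_cdf(obs, mu, w, 0.3 * true_sigma)   # spread too small

def outer_fraction(pit):
    """Share of PIT values in the lowest and highest deciles."""
    return float(np.mean((pit < 0.1) | (pit > 0.9)))

outer_cal = outer_fraction(pit_cal)      # close to 0.2 for a flat PIT histogram
outer_under = outer_fraction(pit_under)  # inflated: a U-shaped PIT histogram
print(outer_cal, outer_under, crps_numeric(19.0, mu, w, true_sigma))
```

A flat PIT histogram places about 10% of cases in each decile; an underdispersive forecast piles observations into the outer bins, which is the signature of observations falling in or beyond the tails of the predictive density. For a single deterministic forecast the CRPS reduces to the absolute error, which is why it is read as a probabilistic analogue of the mean absolute error.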

12A.7 Mean What You Say: Calibration and Verification of the Probability Forecast Platform (PFP) at The Weather Company