The monitoring of ensemble forecast skill for high-impact weather is investigated by comparing the sensitivity of metrics such as the Brier score, the logarithmic score, and the new diagonal elementary score to choices about how the event threshold is defined, which reference forecast is used, and what role is assigned to representativeness errors. A consistent picture emerges in which the design of the event climatology plays the key role. The results also indicate to what extent different metrics are needed for monitoring forecasts of extremes, and whether the relative operating characteristic for the Extreme Forecast Index (one of ECMWF’s headline scores) is in principle sufficient for this purpose.
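As a minimal, self-contained illustration of how the event definition enters these metrics, the sketch below computes the Brier and logarithmic scores for a binary "extreme" event defined by exceedance of a climatological percentile. All data, the 95th-percentile choice, and the ensemble sizes are synthetic assumptions for illustration, not the operational configuration.

```python
import numpy as np

def brier_score(prob_forecasts, outcomes):
    """Mean squared difference between forecast probability and binary outcome."""
    p = np.asarray(prob_forecasts, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return np.mean((p - o) ** 2)

def log_score(prob_forecasts, outcomes, eps=1e-12):
    """Ignorance score: negative log-likelihood of the observed binary outcome."""
    p = np.clip(np.asarray(prob_forecasts, dtype=float), eps, 1.0 - eps)
    o = np.asarray(outcomes, dtype=float)
    return -np.mean(o * np.log(p) + (1.0 - o) * np.log(1.0 - p))

# The event definition is where the climatology enters: here the "extreme"
# event is exceedance of a percentile of a purely synthetic local climatology.
rng = np.random.default_rng(0)
climatology = rng.normal(0.0, 1.0, size=1000)        # hypothetical past observations
threshold = np.percentile(climatology, 95)           # event: exceed 95th percentile

ensemble = rng.normal(0.3, 1.0, size=(500, 50))      # 50 members for 500 cases
observations = rng.normal(0.3, 1.0, size=500)

prob = (ensemble > threshold).mean(axis=1)           # ensemble probability of event
event = (observations > threshold).astype(float)

print(f"Brier score: {brier_score(prob, event):.4f}")
print(f"Log score:   {log_score(prob, event):.4f}")
```

Changing the percentile used to define the threshold, or the sample from which the climatology is drawn, changes the base rate of the event and hence the behaviour of both scores, which is the sensitivity examined here.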
As today’s ensemble forecasts are increasingly well calibrated, it becomes more important to take observation uncertainty into account in the verification process. The continuous ranked probability score (CRPS), for example, can give misleading results if a model upgrade is associated with a change in the spread–skill relationship and observation uncertainty is not considered. Although these effects are subtle, it is shown that they can affect the conclusions drawn about the overall upper-air skill improvement of a new model cycle.
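One published way to account for observation uncertainty in the CRPS is the perturbed-ensemble approach, in which observation-error-sized noise is added to the ensemble members before scoring, so that a well-spread ensemble is not penalised for scatter that is really in the observations. The sketch below illustrates the idea on synthetic data; the observation-error standard deviation and all other numbers are assumptions for illustration only, and the text does not specify which variant ECMWF uses.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS for one ensemble forecast and one verifying value."""
    x = np.asarray(members, dtype=float)
    return np.mean(np.abs(x - obs)) - 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))

def crps_perturbed_ensemble(members, obs, obs_err_std, rng):
    """Perturbed-ensemble variant: noise with the observation-error standard
    deviation is added to the members before the CRPS is computed."""
    noisy = members + rng.normal(0.0, obs_err_std, size=len(members))
    return crps_ensemble(noisy, obs)

rng = np.random.default_rng(1)
truth = rng.normal(0.0, 1.0, size=300)
obs = truth + rng.normal(0.0, 0.5, size=300)                  # imperfect observations
ens = truth[:, None] + rng.normal(0.0, 1.0, size=(300, 50))   # ensemble per case

plain = np.mean([crps_ensemble(e, o) for e, o in zip(ens, obs)])
adjusted = np.mean([crps_perturbed_ensemble(e, o, 0.5, rng) for e, o in zip(ens, obs)])
print(f"CRPS vs raw obs: {plain:.3f}, with obs-error perturbation: {adjusted:.3f}")
```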
Systematic biases in near-surface parameters such as 2 m temperature or 10 m wind speed are difficult to reduce because they often result from a combination of partly compensating errors in the models’ parameterizations and representation of surface processes. It is shown that conditional verification methods can disentangle some of these errors and attribute them to specific processes in the model. For example, a negative bias in wintertime 2 m temperature at night at mid-latitudes can be shown to result from an underestimation of low cloudiness, while a concurrent warm bias at higher latitudes can partly be attributed to the representation of snow cover by a single-layer model. The underestimation of the diurnal cycle of 2 m temperature in summer appears to be due to excessively strong thermal coupling between the radiating surface and the underlying soil.
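The essence of conditional verification is to stratify forecast errors by an observed regime variable. The sketch below, using entirely synthetic data and a hypothetical cloud–temperature relationship, shows how a night-time 2 m temperature bias can be broken down by observed low-cloud fraction; it illustrates the stratification method only, not the actual model diagnostics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2000
low_cloud_obs = rng.uniform(0.0, 1.0, size=n)            # observed low-cloud fraction
# Hypothetical model deficiency: low cloud is underestimated, so cloudy
# nights cool too strongly at the surface in the forecast.
low_cloud_fc = np.clip(low_cloud_obs - 0.2, 0.0, 1.0)
t2m_obs = 270 + 5 * low_cloud_obs + rng.normal(0, 1, n)  # cloudy nights are warmer
t2m_fc = 270 + 5 * low_cloud_fc + rng.normal(0, 1, n)

df = pd.DataFrame({"t2m_err": t2m_fc - t2m_obs, "cloud_obs": low_cloud_obs})
bins = pd.cut(df["cloud_obs"], bins=[0, 0.25, 0.5, 0.75, 1.0])

# A cold bias that grows with observed cloudiness points at the cloud scheme
# rather than at, say, the soil or snow representation.
print(df.groupby(bins, observed=True)["t2m_err"].agg(["mean", "count"]))
```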
Upper-air headline scores have been used at ECMWF for some time in the context of high-level monitoring of forecast skill. Recently, two new user-oriented headline scores based on 2 m temperature have been adopted to put increased focus on the near-surface ensemble performance of the model. In the medium range, this is the fraction of large 2 m temperature errors, defined as cases in which the CRPS exceeds a threshold; in the extended range, it is the ranked probability skill score for weekly temperature anomalies. The effect of recent model upgrades on these scores is discussed.
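A minimal sketch of the medium-range headline score's logic is given below: the share of cases whose 2 m temperature CRPS exceeds a threshold. The 5 K threshold and the synthetic data are illustrative assumptions; the operational definition may differ in threshold, area weighting, and aggregation.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS for one ensemble forecast and one verifying value."""
    x = np.asarray(members, dtype=float)
    return np.mean(np.abs(x - obs)) - 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))

def fraction_large_errors(ens_cases, obs_cases, threshold_k=5.0):
    """Share of cases whose 2 m temperature CRPS exceeds a threshold.
    The 5 K default is illustrative, not the operational value."""
    scores = np.array([crps_ensemble(e, o) for e, o in zip(ens_cases, obs_cases)])
    return np.mean(scores > threshold_k)

rng = np.random.default_rng(3)
obs = rng.normal(285.0, 4.0, size=400)
# Ensemble with 2 K spread plus a case-dependent forecast error.
ens = obs[:, None] + rng.normal(0.0, 2.0, size=(400, 50)) \
      + rng.normal(0.0, 3.0, size=(400, 1))
print(f"Fraction of cases with CRPS > 5 K: {fraction_large_errors(ens, obs):.3f}")
```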
Intra- and inter-annual variations in atmospheric predictability affect the evolution of forecast skill, and comparison with a reference system (previously ERA-Interim, now ERA5) is used to increase the signal-to-noise ratio of the results. As part of ECMWF’s role as a WMO Lead Centre, comparison with other global models is also routinely used as a means of separating actual improvements in skill from predictability variations. In addition to the established upper-air exchange of scores, WMO has now issued guidelines for the exchange of station-based surface scores, which ECMWF will present via an interactive map tool.
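The signal-to-noise argument can be made concrete with paired score differences: because the model and the reference system verify on the same cases, subtracting the reference scores removes much of the predictability-driven variability that is common to both. The sketch below uses synthetic numbers and, for simplicity, assumes independent daily cases; in practice autocorrelation would widen the confidence interval.

```python
import numpy as np

def paired_score_difference(model_scores, reference_scores):
    """Mean per-case score difference (model minus reference) with a 95%
    normal-approximation confidence interval. Pairing the cases removes
    variability common to both systems."""
    d = np.asarray(model_scores, dtype=float) - np.asarray(reference_scores, dtype=float)
    mean = d.mean()
    half = 1.96 * d.std(ddof=1) / np.sqrt(d.size)
    return mean, (mean - half, mean + half)

# Synthetic example: both systems share the same day-to-day predictability
# signal; the model is slightly better (lower scores) on average.
rng = np.random.default_rng(4)
common = rng.normal(1.0, 0.3, size=365)                # shared predictability variation
model = common - 0.02 + rng.normal(0.0, 0.05, size=365)
reference = common + rng.normal(0.0, 0.05, size=365)
mean, ci = paired_score_difference(model, reference)
print(f"Mean difference: {mean:+.3f}, 95% CI: ({ci[0]:+.3f}, {ci[1]:+.3f})")
```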