Terminal Aerodrome Forecasts (TAFs) are probabilistic forecasts and therefore generally require statistical methods of verification. The approach taken at UKMO is analysis via reliability tables. Reliability tables provide a concise means of storing the basic data while retaining stratification by forecast probability, so that a large number of separate scores and statistics can be derived from them and supplied to the customers of the verification scheme.
Presenting the many statistics derived from these tables via the Internet satisfies one of the goals of this scheme: helping the forecaster improve forecast quality. The other main goals of the UKMO TAF verification scheme are to provide assessment figures to management and to give external customers and auditors a measure of TAF performance. For these purposes a single, final, objective score is invariably required, yet the choice of one particular score from the many available is not obvious. The principal goal of this paper is to provide objective arguments to help in choosing a single score, by comparing the performance of different scoring techniques on actual probabilistic forecasts. Data from five years of UK and continental European TAFs constitute the forecast sample of this study.
Although up to 14 different weather elements can be analysed for each TAF (including wind, precipitation and cloud), this study presents results for visibility and cloud base only. The event thresholds for verification are based on ICAO thresholds, with a reliability table compiled for each. Although tables are collated monthly, analysis is based on aggregate samples for the 12 preceding months, eliminating seasonal variation and much of the noise associated with smaller monthly samples. Verifying METAR observations are generally available half-hourly, so 18 separate forecasts are entered into the reliability table for a single 9-hour TAF. PROB30, PROB40 and TEMPO are interpreted as 30%, 40% and 30% forecasts of the event respectively, while BECMG forecasts crossing a threshold are interpreted as a linear change in probability between 0% and 100% over the duration of the BECMG group; a simplified sketch of this mapping is given below.
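As a minimal sketch (not the operational UKMO code), the following Python fragment shows how one 9-hour TAF could be expanded into 18 half-hourly event probabilities and accumulated, together with the verifying METAR outcomes, into a reliability table. The TAF group encoding, the 11 probability bins and the function names are illustrative assumptions.

```python
import numpy as np

N_BINS = 11  # assumed probability bins: 0%, 10%, ..., 100%

def taf_to_halfhourly_probs(groups, n_steps=18):
    """Expand one 9-hour TAF into half-hourly event probabilities.

    `groups` is an illustrative encoding: a list of (kind, start, end, prob)
    tuples in the order they appear in the TAF, where `kind` is 'BASE',
    'PROB30', 'PROB40', 'TEMPO' or 'BECMG', `start`/`end` index half-hour
    steps, and `prob` is the event probability implied by the group
    (0 or 1 for BASE and for the BECMG end state; 0.3, 0.4 and 0.3 for
    PROB30, PROB40 and TEMPO respectively).
    """
    probs = np.zeros(n_steps)
    for kind, start, end, prob in groups:
        if kind == 'BASE':
            probs[start:end] = prob
        elif kind == 'BECMG':
            # linear change between 0% and 100% across the BECMG period,
            # holding the end-state probability for the rest of the TAF
            probs[start:end] = np.linspace(1.0 - prob, prob, end - start)
            probs[end:] = prob
        else:
            # PROB30, PROB40 and TEMPO overlay the base conditions
            probs[start:end] = np.maximum(probs[start:end], prob)
    return probs

def accumulate(table, probs, obs):
    """Add one TAF's half-hourly (probability, observation) pairs to the table.

    `table` is an (N_BINS, 2) array of counts: column 0 = event not observed,
    column 1 = event observed.  `obs` holds the half-hourly METAR outcomes (0/1).
    """
    for p, o in zip(probs, obs):
        table[int(round(p * (N_BINS - 1))), int(o)] += 1
    return table
```

Under these assumptions, one table per element and threshold (initialised as `np.zeros((N_BINS, 2), dtype=int)` and filled month by month) holds everything needed to derive the scores discussed below.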
The scoring methods for probabilistic forecasts that can be derived from reliability tables, and that are compared here, include (a sketch of how several of them are computed from a single table follows the list):
- Brier score (and Brier score decomposition),
- Ranked probability score,
- Skill scores, based on persistence and null-event forecasts,
- Contingency table scores (e.g., ETS, Heidke) obtained from summed probabilities,
- Brier climatology skill score,
- ROC (Relative Operating Characteristic) curve area.
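As a rough illustration of how several of these scores fall out of the same table, the sketch below computes the Brier score via its standard reliability-resolution-uncertainty decomposition, the corresponding climatology skill score, and the ROC area. The use of nominal bin centres and trapezium-rule ROC integration are assumptions for this sketch rather than the operational UKMO method.

```python
import numpy as np

def scores_from_reliability_table(table):
    """Derive example scores from an (N_BINS, 2) reliability table of counts."""
    n_bins = table.shape[0]
    p = np.linspace(0.0, 1.0, n_bins)              # nominal forecast probability of each bin
    n_k = table.sum(axis=1).astype(float)          # forecasts falling in each bin
    n = n_k.sum()
    o_k = np.divide(table[:, 1], n_k, out=np.zeros(n_bins), where=n_k > 0)
    obar = table[:, 1].sum() / n                   # sample (base-rate) climatology

    # Brier score via its decomposition: BS = reliability - resolution + uncertainty
    reliability = np.sum(n_k * (p - o_k) ** 2) / n
    resolution = np.sum(n_k * (o_k - obar) ** 2) / n
    uncertainty = obar * (1.0 - obar)
    bs = reliability - resolution + uncertainty
    bss = 1.0 - bs / uncertainty                   # skill relative to sample climatology

    # ROC: hit rate vs false-alarm rate as the decision threshold is lowered
    events, non_events = table[:, 1].sum(), table[:, 0].sum()
    hr = np.cumsum(table[::-1, 1])[::-1] / events       # P(forecast >= p_k | event)
    far = np.cumsum(table[::-1, 0])[::-1] / non_events  # P(forecast >= p_k | no event)
    x = np.concatenate(([0.0], far[::-1]))              # close the curve at (0, 0)
    y = np.concatenate(([0.0], hr[::-1]))
    roc_area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0)  # trapezium rule

    return {"brier": bs, "reliability": reliability, "resolution": resolution,
            "uncertainty": uncertainty, "brier_skill": bss, "roc_area": roc_area}
```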
Despite their different formulations, if all of these scores are good measures of forecast quality they should demonstrate the same trends, showing rising and falling skill in the same way. Preliminary investigation indicates that the Brier score is often sensitive to changes in sample climatology (the decomposition given below makes the mechanism explicit), while ROC area is sometimes insensitive to changing forecast skill. Averaging scores across different weather elements can also camouflage trends in forecast performance. We conclude that although a single score can provide a useful summary for management, it is insufficient for a complete assessment of forecast quality, especially for the forecaster trying to improve his forecasts.
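The sensitivity of the Brier score to sample climatology can be seen from its standard reliability-resolution-uncertainty decomposition (a textbook result, quoted here for context rather than taken from the UKMO scheme):

$$
\mathrm{BS} \;=\; \underbrace{\frac{1}{N}\sum_{k} n_k\,(p_k-\bar{o}_k)^2}_{\text{reliability}}
\;-\; \underbrace{\frac{1}{N}\sum_{k} n_k\,(\bar{o}_k-\bar{o})^2}_{\text{resolution}}
\;+\; \underbrace{\bar{o}\,(1-\bar{o})}_{\text{uncertainty}},
$$

where $p_k$ is the forecast probability of bin $k$, $n_k$ the number of forecasts in that bin, $\bar{o}_k$ the observed event frequency within the bin, and $\bar{o}$ the overall sample frequency. Because the uncertainty term depends only on $\bar{o}$, a run of unusually foggy or unusually clear months moves the raw Brier score even when reliability and resolution are unchanged; the Brier climatology skill score listed above, $\mathrm{BSS} = 1 - \mathrm{BS}/[\bar{o}(1-\bar{o})]$, normalises out this additive dependence.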