The TAF verification schemes studied include the following:
- Interpretation of TAF change groups (PROB30, BECMG, TEMPO, etc.) as probabilistic forecasts, and verification of large samples of these using standard probabilistic techniques. At the Met Office, analysis and scoring are based on reliability tables, which are constructed for each of the ICAO cloud-base and visibility thresholds (and for other weather elements too); see the first sketch after this list.
- An alternative probabilistic approach, often followed in parts of Europe, is to sum the probabilities into contingency tables before scoring; the same sketch shows one way of collapsing a reliability table into a contingency table.
- The first Gordon scheme (1989) also assigns probabilities, but assesses them using the Probability Weighted Error (PWE).
- Comparison of the worst forecast and observed conditions within 3-hour time blocks (the second Gordon scheme, 1992).
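To make the tabulation concrete, the sketch below builds a reliability table for a single threshold from (forecast probability, observed event) pairs, and then collapses it into a 2x2 contingency table in the European style. The probability bins, data structures, and function names are illustrative assumptions, not the Met Office's operational code.

```python
# Illustrative probability bins; the operational bin choices are an assumption here.
PROB_BINS = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

def bin_index(p):
    """Map a forecast probability to the nearest bin."""
    return min(range(len(PROB_BINS)), key=lambda i: abs(PROB_BINS[i] - p))

def build_reliability_table(pairs):
    """Tally (forecast probability, event observed?) pairs for one threshold,
    e.g. visibility below an ICAO value at some point in the TAF period."""
    table = {i: {"obs": 0, "n": 0} for i in range(len(PROB_BINS))}
    for prob, observed in pairs:
        i = bin_index(prob)
        table[i]["n"] += 1
        table[i]["obs"] += int(observed)
    return table

def to_contingency(table, decision_prob=0.5):
    """Collapse a reliability table into a 2x2 contingency table by treating
    forecasts at or above decision_prob as 'event forecast'."""
    hits = misses = false_alarms = correct_rejections = 0
    for i, cell in table.items():
        if PROB_BINS[i] >= decision_prob:
            hits += cell["obs"]
            false_alarms += cell["n"] - cell["obs"]
        else:
            misses += cell["obs"]
            correct_rejections += cell["n"] - cell["obs"]
    return hits, misses, false_alarms, correct_rejections

table = build_reliability_table([(0.3, True), (0.0, False), (1.0, True)])
print(to_contingency(table))  # (1, 1, 0, 1): one hit, one miss, one correct rejection
```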
Reliability tables provide a concise means of basic data storage, yet because they retain stratification by forecast probability, a large number of separate scores and statistics can be derived from them. These range from simple measures, such as miss and false-alarm frequencies, up to more complex ones such as Brier score decompositions, ROC curves, and, for multiple thresholds, ranked probability scores.
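As one example of a derived score, the sketch below computes the Murphy (1973) decomposition of the Brier score directly from a table in the format of the previous sketch (reusing its PROB_BINS); again this is an illustration under the same assumptions, not the operational implementation.

```python
def brier_decomposition(table):
    """Murphy (1973) partition of the Brier score from a reliability table:
    BS = reliability - resolution + uncertainty (a lower reliability term and
    a higher resolution term are better)."""
    n_total = sum(cell["n"] for cell in table.values())
    obs_total = sum(cell["obs"] for cell in table.values())
    base_rate = obs_total / n_total
    reliability = resolution = 0.0
    for i, cell in table.items():
        if cell["n"] == 0:
            continue
        obs_freq = cell["obs"] / cell["n"]
        reliability += cell["n"] * (PROB_BINS[i] - obs_freq) ** 2
        resolution += cell["n"] * (obs_freq - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n_total, resolution / n_total, uncertainty
```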
The analysis of large samples of tabulated data for a chosen set of verification thresholds is particularly useful for the calculation of summary management scores. There is, though, an increasing need among users and forecasters for assessment of TAFs on an individual basis, a task for which these techniques are not well suited. Gordon's PWE scheme, however, is designed to be sensitive to the distance between actual and forecast conditions in any single TAF, and is therefore being trialled at the Met Office to assess individual forecasts; a sketch of a measure of this kind follows. Preliminary results from this trial will be presented.
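Gordon's published definition of the PWE is not reproduced here; the sketch below shows a probability-weighted error of the general kind described, in which the stated probability of each forecast condition category weights its category distance from the observed conditions. The category bands and the linear distance measure are assumptions made purely for illustration.

```python
# Ordered condition categories (worst to best); the bands and the category
# distance are illustrative assumptions, not Gordon's published form.
CATEGORIES = ["<350m", "350-800m", "800-1500m", "1500-3000m", "3000-8000m", ">8000m"]

def probability_weighted_error(forecast, observed):
    """Score one TAF period: 'forecast' maps condition categories to stated
    probabilities (summing to 1); the error is the probability-weighted
    category distance from the observed conditions. Zero is perfect."""
    obs_idx = CATEGORIES.index(observed)
    return sum(p * abs(CATEGORIES.index(cat) - obs_idx)
               for cat, p in forecast.items())

# Base conditions >8000m with a PROB30 group giving 3000-8000m:
forecast = {">8000m": 0.7, "3000-8000m": 0.3}
print(probability_weighted_error(forecast, ">8000m"))      # 0.3
print(probability_weighted_error(forecast, "1500-3000m"))  # 1.7: penalised by distance
```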
Complex measures of quality can be difficult to present to a non-expert verification customer, such as managers or aviation users. The second Gordon scheme avoids this problem, although it does not find favour among forecasters, since it does not reward the full content of the forecast, lacks sensitivity, and provides insufficient information to a forecaster trying to improve his or her TAFs.
Finally, experience shows that all verification customers wish to receive a single summary measure of quality, despite recognising that a single score can never tell the whole story. Many different scores are possible, and the choice can be difficult. Investigation indicates that the Brier score is often rather sensitive to changes in sample climatology, while the ROC area is sometimes insensitive to changing forecast skill, so neither can be recommended on its own; the sketch below illustrates the climatology effect. Ultimately, the final choice of an objective score is to some extent subjective, and will also depend on the verification scheme's customers.
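A minimal illustration of the climatology sensitivity: forecasting the base rate at every opportunity has no resolution at either of two hypothetical stations, yet the raw Brier score differs markedly between them, because the score of the climatological forecast equals p(1 - p) for base rate p. Raw Brier scores compared across samples with different climatologies therefore flatter the rarer-event sample.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error of probability forecasts against 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)

# Two hypothetical stations, each forecast with climatology alone (no skill):
# the raw scores differ purely because the event base rates differ.
for base_rate in (0.5, 0.1):
    n = 1000
    outcomes = [1] * int(base_rate * n) + [0] * (n - int(base_rate * n))
    forecasts = [base_rate] * n
    print(base_rate, round(brier_score(forecasts, outcomes), 3))  # 0.25, then 0.09
```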