Using ensemble rainfall predictions from the Eta model, produced by the NCEP Short Range Ensemble Forecast system, together with several standard precipitation verification scores (including frequency and magnitude bias, equitable threat score, and simple average absolute difference) and the operational daily raingage network, we estimate confidence intervals for the scores under several scenarios. First, to determine the effect of observation quality on verification, we compare verification results obtained under different levels of raingage quality control. Second, to gain a more general idea of observation variability and its effect on scores, we compute a range (an "ensemble", if you will) of possible observed precipitation fields by using bootstrap methods to resample the raingage observations. A byproduct of these latter computations is a set of confidence intervals for the actual verification scores and a partial answer to the question: How large must a difference in these scores be to justify claims of model superiority or model improvement? With the results of these comparisons in mind, we qualitatively address the legitimacy of verification scores computed from average or median ensemble predictions and average or median ensemble observations.
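
The bootstrap idea described above can be illustrated with a minimal sketch. It is not the procedure used in this study: it assumes forecasts have already been matched to raingage sites, resamples the forecast/observation pairs with replacement as if the sites were independent (spatial correlation is ignored), and uses a simple percentile interval. All function names, the threshold, and the default of 1000 replicates are hypothetical choices made for illustration.

```python
import random


def contingency(forecast, observed, threshold):
    """Count hits, misses and false alarms for rain >= threshold."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        f_yes, o_yes = f >= threshold, o >= threshold
        if f_yes and o_yes:
            hits += 1
        elif o_yes:
            misses += 1
        elif f_yes:
            false_alarms += 1
    return hits, misses, false_alarms, len(forecast)


def equitable_threat_score(forecast, observed, threshold):
    """ETS = (H - Hr) / (H + M + FA - Hr), with Hr the hits expected by chance."""
    hits, misses, false_alarms, n = contingency(forecast, observed, threshold)
    hits_random = (hits + misses) * (hits + false_alarms) / n
    denom = hits + misses + false_alarms - hits_random
    return (hits - hits_random) / denom if denom != 0 else None


def frequency_bias(forecast, observed, threshold):
    """Ratio of forecast events to observed events; None if no observed events."""
    hits, misses, false_alarms, _ = contingency(forecast, observed, threshold)
    observed_events = hits + misses
    return (hits + false_alarms) / observed_events if observed_events else None


def bootstrap_ci(forecast, observed, score_fn, threshold,
                 n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a score (sites assumed independent)."""
    rng = random.Random(seed)
    n = len(forecast)
    scores = []
    for _ in range(n_boot):
        # Draw n sites with replacement and rescore the resampled pairs.
        idx = [rng.randrange(n) for _ in range(n)]
        s = score_fn([forecast[i] for i in idx],
                     [observed[i] for i in idx], threshold)
        if s is not None:  # skip replicates where the score is undefined
            scores.append(s)
    if not scores:
        raise ValueError("score undefined for every bootstrap replicate")
    scores.sort()
    lower = scores[int((alpha / 2) * len(scores))]
    upper = scores[min(len(scores) - 1, int((1 - alpha / 2) * len(scores)))]
    return lower, upper
```

Under these assumptions, one rough reading of the result is that a difference between two models' scores that is small relative to the width of such an interval gives little support for a claim of superiority. A paired resampling of both models at the same sites would be a more direct test.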
