89th American Meteorological Society Annual Meeting

The effects of data issues and skill scores in creating predictive models
Tuesday, 13 January 2009: 4:45 PM
Room 125A (Phoenix Convention Center)
Matthew J. Pocernich, NCAR, Boulder, CO
Poster PDF (2.4 MB)
The 2008 AI Competition challenge is to create a classification algorithm that identifies precipitation type from polarimetric radar data. Three outcomes are possible: the precipitation may be liquid, frozen, or non-existent (none). This paper primarily uses randomForest and hierarchical clustering to classify these observations. Improvements to this basic method are sought by exploring the following aspects of the problem.
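
Before turning to those aspects, a minimal sketch of the basic random forest fit is given here, in R, with simulated stand-in data and hypothetical predictor names such as Zdr, Kdp, and RhoHV, since the actual competition fields are not listed in this abstract:

    library(randomForest)

    ## Simulated stand-in for the competition data: three polarimetric
    ## predictors (hypothetical names) and the observed precipitation type,
    ## drawn with the class frequencies reported below.
    set.seed(1)
    n <- 1000
    train <- data.frame(
      Zdr   = rnorm(n),
      Kdp   = rnorm(n),
      RhoHV = runif(n),
      ptype = factor(sample(c("frozen", "liquid", "none"), n,
                            replace = TRUE, prob = c(0.58, 0.28, 0.14)))
    )

    ## Fit the classifier; ntree = 500 is an illustrative choice, not a tuned value.
    fit <- randomForest(ptype ~ ., data = train, ntree = 500)

    ## Out-of-bag class predictions and predicted class probabilities.
    pred  <- predict(fit)
    probs <- predict(fit, type = "prob")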

Independence of the data. Ideally, data used in predictive statistical models should be independent. Several aspects of the data's dependence structure are explored. Spatially, most observations are reported near 3000 Bart Conner Dr, Norman, OK. Nothing explicitly states that the data are ordered, but plotting the data sequentially reveals runs of highly correlated records. Efforts have been made to increase the independence of the data in order to create a more robust model, as sketched below.
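
One simple way to look for, and then weaken, this serial dependence is sketched here (reusing the train data frame above and assuming its rows are in reporting order): plot a predictor against record index, check its lag autocorrelation, and thin the data by keeping every k-th record.

    ## Runs of nearly identical records show up as flat stretches in this plot.
    plot(train$Zdr, type = "l", xlab = "record index", ylab = "Zdr")

    ## Lag autocorrelation of one predictor as a rough check on dependence.
    acf(train$Zdr, lag.max = 20)

    ## Thin the data, keeping every 5th record; the interval of 5 is an
    ## arbitrary illustrative choice, not the thinning actually used.
    thin <- train[seq(1, nrow(train), by = 5), ]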

Peirce Skill Score. Of the observed values, 58% are frozen precipitation, 28% liquid, and 14% neither. The Peirce skill score (PSS) does not reward bold forecasts of low-probability events; rather, each correct forecast is rewarded equally. Following the rationale that rare events are harder to forecast, a model optimized with respect to the PSS might require stronger evidence before selecting the rarely observed categories.
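
For reference, the multicategory PSS can be computed from a square contingency table of forecast and observed categories; the helper below (peirce_ss is a hypothetical name) follows the standard definition, comparing the proportion correct with what the marginal frequencies alone would give.

    ## Peirce skill score for a square contingency table
    ## (rows = forecast category, columns = observed category).
    peirce_ss <- function(tab) {
      p  <- tab / sum(tab)   # joint relative frequencies
      pf <- rowSums(p)       # marginal forecast frequencies
      po <- colSums(p)       # marginal observed frequencies
      (sum(diag(p)) - sum(pf * po)) / (1 - sum(po^2))
    }

    ## Example: score the out-of-bag predictions from the fit above.
    peirce_ss(table(pred, train$ptype))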

Assumption of perfect observations. Statistical models are generally built on the assumption that the observations are truth. We know this is not the case, but assuming otherwise is difficult. This study explores the effect of measurement error on the stability of the model by simulating random changes in the observations at several misclassification rates. This information cannot be incorporated into the final model, because the assumed misclassification rates are admittedly guesses, but it may be informative in assessing the robustness of the model. Ideally, some misclassification error will not structurally change the model.
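
A rough sketch of that perturbation experiment (reusing the objects above; the misclassification rates tried are illustrative guesses, not estimates of the true observation error) is:

    ## Randomly relabel a fraction of the observations, refit the model, and
    ## compare the refit's out-of-bag predictions with the original labels.
    rates  <- c(0.02, 0.05, 0.10)
    scores <- sapply(rates, function(r) {
      noisy <- train
      flip  <- sample(nrow(noisy), size = round(r * nrow(noisy)))
      noisy$ptype[flip] <- sample(levels(noisy$ptype), length(flip), replace = TRUE)
      refit <- randomForest(ptype ~ ., data = noisy, ntree = 500)
      peirce_ss(table(predict(refit), train$ptype))
    })
    names(scores) <- rates

Large drops in the score as the rate increases would suggest the model is sensitive to observation error; relatively flat scores would support the hoped-for robustness.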
