The results of any technique used to estimate missing data depend, to a large extent, on the patterns of interrelation within the data and on the manner in which the data are missing. The mechanism responsible for the missing data should be assessed as random or systematic. In many cases, a few consecutive missing observations can be estimated with little error; if an entire month of data is missing, however, the error may be considerably larger. Motivated by such design questions and the ubiquity of missing data, the present analysis examines how a number of techniques used to estimate missing data perform when various types and amounts of missing data exist.
In this work, machine learning techniques, such as support vector machines (SVMs) and artificial neural networks (ANNs), are tested against standard imputation methods (e.g., multiple regression). All methods are used to predict the known values of climatological data that have been altered to produce missing data. These data sets contain on the order of 500 variables and a large range of observations. Both precipitation and air temperature data are used to cover the range of inherent spatial coherence encountered by analysts. The mean squared error (MSE) of each technique will be presented to assess its efficacy. Preliminary results indicate that SVMs are promising in reducing error compared to linear techniques.
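The evaluation procedure described above (withhold known values, estimate them, score the estimates by MSE) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the study's code: the data are synthetic stand-ins for station records, only the multiple-regression baseline is shown (no SVM or ANN), and every function name, array size, and gap length is hypothetical.

```python
import numpy as np

def make_synthetic_stations(n_obs=400, n_stations=6, seed=0):
    """Hypothetical stand-in for station data: a shared signal plus station noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_obs)
    shared = 10.0 * np.sin(2 * np.pi * t / 365.0) + rng.normal(0, 1.0, n_obs)
    noise = rng.normal(0, 0.5, (n_obs, n_stations))
    return shared[:, None] + noise

def random_mask(n_obs, frac, seed=0):
    """Mark a randomly scattered fraction of observations as missing."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(n_obs, dtype=bool)
    mask[rng.choice(n_obs, size=int(frac * n_obs), replace=False)] = True
    return mask

def block_mask(n_obs, start, length):
    """Mark one run of consecutive observations as missing (e.g. a month)."""
    mask = np.zeros(n_obs, dtype=bool)
    mask[start:start + length] = True
    return mask

def regression_impute(data, target, mask):
    """Fit least squares on observed rows, then predict the target column
    for the masked rows from the remaining columns."""
    predictors = np.delete(data, target, axis=1)
    X = np.column_stack([np.ones(len(data)), predictors])
    coef, *_ = np.linalg.lstsq(X[~mask], data[~mask, target], rcond=None)
    return X[mask] @ coef

def mse(truth, estimate):
    """Mean squared error between withheld values and their estimates."""
    return float(np.mean((truth - estimate) ** 2))

data = make_synthetic_stations()
masks = {
    "random": random_mask(len(data), 0.1),
    "block": block_mask(len(data), start=100, length=30),
}
scores = {}
for name, mask in masks.items():
    estimate = regression_impute(data, target=0, mask=mask)
    scores[name] = mse(data[mask, 0], estimate)
    print(name, scores[name])
```

Because the true values are known before masking, the same loop could in principle score any other imputation method by swapping out `regression_impute`, which is what makes a like-for-like MSE comparison across techniques possible.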