88th Annual Meeting (20-24 January 2008)

Tuesday, 22 January 2008: 9:30 AM
Imputation of missing data with nonlinear relationships
206 (Ernest N. Morial Convention Center)
Michael B. Richman, Univ. of Oklahoma, Norman, OK; and I. Adrianto and T. B. Trafalis
A problem common to meteorological and climatological datasets is missing data. The majority of analysis techniques require that all variables be represented for each observation; hence, some action is required in the presence of missing data. In cases where the individual observations are not thought to be important, deletion of every observation missing one or more pieces of data (complete case deletion) is common. As the amount of missing data increases, tacit deletion has been shown to lead to bias in the remaining data and in subsequent analyses, such as data mining. If the data are deemed important to preserve, some method of imputing the missing values may be used. The results from any technique used to estimate missing data depend, to a large extent, on the patterns of interrelated data and the manner in which the data are missing. The mechanism responsible for missing data should be assessed as random or systematic. Motivated by such design questions and the ubiquity of missing data, the present analysis seeks to examine how a number of techniques used to estimate missing data perform when various types and amounts of missing data exist for configurations where the relationships are nonlinear.
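
The distinction between random and systematic missingness, and its effect on complete case deletion, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' experimental design: the test function y = x**2 plus noise, the 30% missing fraction, and the helper names `make_data`, `delete_at_random`, `delete_systematically` and `complete_cases` are all hypothetical.

```python
import random
import statistics


def make_data(n, seed=0):
    """Generate (x, y) pairs from a hypothetical nonlinear function y = x**2 + noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        y = x * x + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


def delete_at_random(data, fraction, seed=1):
    """Random mechanism: every y has the same chance of being removed."""
    rng = random.Random(seed)
    return [(x, None if rng.random() < fraction else y) for x, y in data]


def delete_systematically(data, fraction):
    """Systematic mechanism: the largest y values are the ones removed."""
    n_missing = int(fraction * len(data))
    if n_missing == 0:
        return list(data)
    cutoff = sorted(y for _, y in data)[len(data) - n_missing]
    return [(x, None if y >= cutoff else y) for x, y in data]


def complete_cases(data):
    """Complete case deletion: keep only observations with no missing value."""
    return [(x, y) for x, y in data if y is not None]


data = make_data(2000)
true_mean = statistics.fmean(y for _, y in data)

random_mean = statistics.fmean(
    y for _, y in complete_cases(delete_at_random(data, 0.3)))
systematic_mean = statistics.fmean(
    y for _, y in complete_cases(delete_systematically(data, 0.3)))

# Absolute error in the mean of y after complete case deletion.
bias_random = abs(random_mean - true_mean)
bias_systematic = abs(systematic_mean - true_mean)
print(f"bias after random deletion:     {bias_random:.3f}")
print(f"bias after systematic deletion: {bias_systematic:.3f}")
```

In this toy setting, deleting complete cases under the random mechanism mostly costs sample size, whereas the systematic mechanism shifts the mean of the surviving data, which is why the mechanism should be assessed before a remedy is chosen.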

In this work, different types of machine learning techniques, such as support vector machines (SVMs) and artificial neural networks (ANNs), are tested against standard imputation methods (e.g., multiple regression). All methods are used to predict the known values of data generated from nonlinear functions that have been altered to produce missing data. The mean squared error (MSE) over many iterations will be presented, along with the mean absolute error (MAE) of the variance/covariance structures, to assess the efficacy of each technique. Preliminary results indicate that SVMs are promising in reducing error compared to linear techniques. The impacts of the percentage of missing data are striking and indicate that even modest amounts of missing data can distort the relationships between the data, making data mining problematic.
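
The evaluation procedure described above (hide known values, impute them, then score both the imputed values and the resulting variance/covariance structure) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' experiment: the test function y = x**2 plus noise and the 20% missing fraction are hypothetical, a single run stands in for many iterations, and a k-nearest-neighbor average is swapped in as a simple nonlinear learner in place of the SVMs and ANNs used in the study. Only the Python standard library is used.

```python
import random
import statistics


def simulate(n=1000, missing_fraction=0.2, seed=0):
    """Draw x, compute a nonlinear y, and flag a random subset of y as missing."""
    rng = random.Random(seed)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [x * x + rng.gauss(0.0, 0.1) for x in xs]  # hypothetical test function
    missing = [rng.random() < missing_fraction for _ in range(n)]
    return xs, ys, missing


def fit_linear(xs, ys):
    """Least squares line: the linear regression baseline."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def fit_knn(xs, ys, k=5):
    """Nonlinear stand-in (NOT an SVM or ANN): average y of the k nearest x."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return statistics.fmean(y for _, y in nearest)

    return predict


def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


def evaluate(fit, xs, ys, missing):
    """Fit on observed cases, impute the missing y, and score the result."""
    obs_x = [x for x, m in zip(xs, missing) if not m]
    obs_y = [y for y, m in zip(ys, missing) if not m]
    model = fit(obs_x, obs_y)
    filled = [model(x) if m else y for x, y, m in zip(xs, ys, missing)]
    # MSE between imputed values and the known values that were hidden.
    mse = statistics.fmean(
        (f - y) ** 2 for f, y, m in zip(filled, ys, missing) if m)
    # MAE over the variance/covariance entries that imputation can change.
    mae_cov = statistics.fmean([
        abs(cov(filled, filled) - cov(ys, ys)),
        abs(cov(xs, filled) - cov(xs, ys)),
    ])
    return mse, mae_cov


xs, ys, missing = simulate()
mse_linear, cov_err_linear = evaluate(fit_linear, xs, ys, missing)
mse_knn, cov_err_knn = evaluate(fit_knn, xs, ys, missing)
print(f"linear: MSE={mse_linear:.4f}  covariance MAE={cov_err_linear:.4f}")
print(f"kNN:    MSE={mse_knn:.4f}  covariance MAE={cov_err_knn:.4f}")
```

Because a straight line cannot follow a symmetric curve such as x**2, the linear baseline imputes values near the mean of y, which both raises the MSE and shrinks the variance of the completed data. Repeating the run over several seeds and missing fractions would mirror the many-iteration design more closely.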
