Abstract: Multiple imputation through machine learning algorithms (87th AMS Annual Meeting)

Tuesday, 16 January 2007: 4:30 PM

210B (Henry B. Gonzalez Convention Center)

Michael B. Richman, Univ. of Oklahoma, Norman, OK; and T. B. Trafalis and I. Adrianto

A problem common to meteorological and climatological datasets is missing data. The majority of multivariate analysis techniques require that all variables be represented for each observation; hence, some action is required in the presence of missing data. In cases where the individual observations are not thought to be important, deletion of every observation missing one or more pieces of data (complete case deletion) is common. As the amount of missing data increases, tacit deletion can lead to bias in the remaining data and in subsequent analyses. If the data are deemed important to preserve, some method of imputing the missing values may be used. Historically, the statistical mean has been used, as it was thought to minimize perturbations. However, the use of the mean injects the same value into every instance of missing data and has been shown to create artificially low variation. What is desired is a principled method that uses the available information to predict the missing values. One class of techniques that satisfies this principle is known as multiple imputation.
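The variance shrinkage caused by mean imputation can be illustrated with a minimal sketch. The synthetic series, the 30% missing rate, and the helper name `mean_impute` are assumptions made for illustration only; they are not taken from the study.

```python
import random
import statistics


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


rng = random.Random(0)

# Hypothetical synthetic series standing in for a station record.
complete = [rng.gauss(15.0, 5.0) for _ in range(1000)]

# Remove roughly 30% of the values completely at random (assumed rate).
with_gaps = [None if rng.random() < 0.3 else v for v in complete]
observed = [v for v in with_gaps if v is not None]
imputed = mean_impute(with_gaps)

var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)

print(f"variance of observed values:    {var_observed:.2f}")
print(f"variance after mean imputation: {var_imputed:.2f}")
```

Because every filled value sits exactly at the mean, the sum of squared deviations is unchanged while the sample count grows, so the variance of the filled series falls by the fraction of values that were observed.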

The results from any technique used to estimate missing data depend, to a large extent, on the patterns of interrelated data and the manner in which the data are missing. The mechanism responsible for missing data should be assessed as random or systematic. In many cases, a few consecutive missing observations can be estimated with little error; however, if an entire month of data is missing, the results can be quite different. Motivated by such design questions and the ubiquity of missing data, the present analysis seeks to examine how a number of techniques used to estimate missing data perform when various types and amounts of missing data exist.
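The contrast between a short gap and a month-long gap can be sketched as follows. Linear interpolation is used here only as a simple stand-in estimator, and the synthetic series (an annual cycle plus an assumed 20-day oscillation), the gap positions, and the helper names are all hypothetical; none of them comes from the study.

```python
import math


def linear_fill(series, start, stop):
    """Estimate series[start:stop] with a straight line drawn between
    the neighboring known values series[start - 1] and series[stop]."""
    left, right = series[start - 1], series[stop]
    span = stop - (start - 1)
    return [left + (right - left) * (k - (start - 1)) / span
            for k in range(start, stop)]


def gap_mse(series, start, stop):
    """Mean squared error of the interpolated values against the truth."""
    estimate = linear_fill(series, start, stop)
    truth = series[start:stop]
    return sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth)


# Hypothetical daily series: annual cycle plus a 20-day oscillation.
series = [10.0 * math.sin(2 * math.pi * t / 365)
          + 3.0 * math.sin(2 * math.pi * t / 20)
          for t in range(365)]

mse_short = gap_mse(series, 100, 103)  # 3 consecutive missing days
mse_long = gap_mse(series, 100, 130)   # 30 consecutive missing days

print(f"MSE, 3-day gap:  {mse_short:.3f}")
print(f"MSE, 30-day gap: {mse_long:.3f}")
```

In this toy setting the long gap spans more than one full cycle of the shorter oscillation, which a straight line cannot follow, so its error is much larger than that of the short gap.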

In this work, different types of machine learning techniques, such as support vector machines (SVMs) and artificial neural networks (ANNs), are tested against standard imputation methods (e.g., multiple regression). All methods are used to predict the known values of climatological data which have been altered to produce missing data. These datasets contain on the order of 500 variables and span a large range of observations. Both precipitation and air temperature data are used to provide the range of inherent spatial coherence encountered by analysts. The mean squared error (MSE) of each technique will be presented to assess its efficacy. Preliminary results indicate that SVMs are promising in reducing error compared to linear techniques.
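The evaluation design described above (withhold known values, impute them, score by MSE) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the two synthetic variables, the 20% withholding rate, and the helper names are hypothetical, and a single-predictor least-squares fit stands in for multiple regression. SVMs and ANNs are not implemented here.

```python
import random
import statistics


def mse(truth, estimate):
    """Mean squared error between withheld truth and imputed values."""
    return statistics.fmean([(t - e) ** 2 for t, e in zip(truth, estimate)])


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


rng = random.Random(42)
n = 500

# Hypothetical pair of correlated variables, e.g. two nearby stations.
neighbor = [rng.gauss(20.0, 6.0) for _ in range(n)]
target = [0.8 * v + 3.0 + rng.gauss(0.0, 1.5) for v in neighbor]

# Withhold 20% of the known target values to act as "missing" data.
hidden = sorted(rng.sample(range(n), n // 5))
hidden_set = set(hidden)
train_x = [neighbor[i] for i in range(n) if i not in hidden_set]
train_y = [target[i] for i in range(n) if i not in hidden_set]
truth = [target[i] for i in hidden]

# Method 1: fill every gap with the mean of the observed values.
mean_estimates = [statistics.fmean(train_y)] * len(truth)

# Method 2: regression imputation from the correlated neighbor.
slope, intercept = fit_line(train_x, train_y)
reg_estimates = [slope * neighbor[i] + intercept for i in hidden]

mse_mean = mse(truth, mean_estimates)
mse_reg = mse(truth, reg_estimates)
print(f"MSE, mean imputation:       {mse_mean:.2f}")
print(f"MSE, regression imputation: {mse_reg:.2f}")
```

Because the withheld values are known, each method can be scored directly. Any other imputation method could be added to the comparison by producing its own list of estimates for the same withheld indices.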
