87th AMS Annual Meeting

Tuesday, 16 January 2007: 3:30 PM
Nonlinear principal component analysis: A new information criterion for model selection in noisy climate datasets (Invited Speaker)
210B (Henry B. Gonzalez Convention Center)
William W. Hsieh, Univ. of British Columbia, Vancouver, BC, Canada
With very noisy data, overfitting is a serious problem in pattern recognition. For nonlinear regression, plentiful data eliminates overfitting, but for nonlinear principal component analysis (NLPCA), overfitting persists even when data are plentiful. Simply minimizing the mean square error (MSE) is therefore not a sufficient criterion for NLPCA to find good solutions in noisy data. A new holistic information criterion H is proposed which selects the curve with the right amount of flexibility, so that it neither underfits nor overfits.

First, an index is proposed which measures the disparity between the nonlinear principal components u and u′ for a data point x and its nearest neighbor x′. This index, I = 1 − C (where C is the Spearman rank correlation between u and u′), tends to increase for overfitted solutions. Among NLPCA models with various amounts of flexibility, the one which minimizes the information criterion H (= MSE times I) automatically has the right amount of flexibility. Tests are performed using autoassociative neural networks for NLPCA on synthetic and real climate data (including the tropical Pacific sea surface temperature and the North American winter surface air temperature), with very good results. This information criterion also automatically chooses between an open and a closed curve fit for a dataset.
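The criterion described above can be sketched in a few lines of NumPy. This is an illustrative reading of the abstract, not Hsieh's actual implementation: the function names, the brute-force nearest-neighbor search, and the tie-free rank computation are all assumptions made here for clarity. Given the data points x, their nonlinear principal components u from a fitted NLPCA model, and that model's MSE, it returns H = MSE × (1 − C):

```python
import numpy as np

def spearman_corr(a, b):
    """Spearman rank correlation: the Pearson correlation of the ranks.
    (Rank via double argsort; assumes no ties, adequate for a sketch.)"""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def holistic_criterion(x, u, mse):
    """H = MSE * I with I = 1 - C, where C is the Spearman correlation
    between the nonlinear PC u of each point and u' of its nearest
    neighbor in data space (hypothetical helper, names assumed here)."""
    # Brute-force pairwise squared distances; exclude each point itself.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    nn = d2.argmin(axis=1)          # index of nearest neighbor x'
    c = spearman_corr(u, u[nn])     # C between u and u'
    return mse * (1.0 - c)
```

For a smooth fit, neighboring data points receive neighboring values of u, so C is near 1, I is near 0, and H stays small; an overfitted zigzag curve scrambles the ordering of u among neighbors, inflating I and hence H, which is what lets the criterion penalize excess flexibility that the MSE alone rewards.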
