Optimization of neural net training using patterns selected by cluster analysis: a case study of ozone level prediction
In the present work, we predict ozone levels in the urban area of Rome. To optimise the selection of input patterns, we use cluster analysis techniques during the neural network (NN) training phase.
In the NN training phase, the main problems usually concern selecting the patterns that give the best generalisation performance and ensuring that the distributions of the variables are representative of all the information in the data.
It is well known that generalisation performance depends strongly on how significant the selected patterns are. In general, patterns are selected at random as some percentage of the total data. In our work, we instead use the results of cluster analysis to select the patterns for the training phase.
This approach improves the accuracy of the ozone prediction, enhances the learning capability and predictive potential of the NN and, a very interesting result, synthesises the information contained in a large data set.
Data set description
Our dataset comes from a background monitoring station of ARPAL (the Environmental Protection Agency of the Lazio Region) in Rome, the Villa Ada monitoring station, and covers the whole of calendar year 2007. The city of Rome is characterised by frequent ozone peaks, associated with hot sunny days and turbulent conditions. Other important factors derive from the main primary pollutants (NOX, CO) emitted by urban sources. The Villa Ada monitoring station represents a typical suburban situation, with high ozone concentration levels located in the NNW direction.
The dataset comprises about 7370 hourly patterns and is composed of pollutant variables (ozone, carbon monoxide, nitrogen monoxide, nitrogen oxide) and conventional meteorological variables (temperature, global solar radiation, relative humidity, ...).
Different methodologies could be employed to optimise NN performance. As is well known, one of the main weaknesses concerns how meaningful the patterns chosen for the training phase are with respect to the generalisation phase; consequently, we concentrate our attention on pattern selection techniques.
For the benchmark, the patterns were chosen by random selection, for different percentages of the input data.
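Random pattern selection at a given percentage can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `random_pattern_selection`, the fixed seed, the rounding rule and the synthetic stand-in data are all assumptions made for the example.

```python
import random

def random_pattern_selection(patterns, fraction, seed=0):
    """Return a random subset holding `fraction` of the patterns, without repeats."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Assumption: the subset size is the fraction rounded to the nearest integer.
    n_selected = max(1, round(fraction * len(patterns)))
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(patterns, n_selected)

# Hypothetical stand-in for the hourly patterns: (inputs, ozone) pairs.
patterns = [((float(i), float(i % 24)), float(i % 7)) for i in range(7370)]

subset = random_pattern_selection(patterns, 0.03)
print(len(subset))  # 221
```

Sampling without replacement keeps every selected pattern distinct, so the stated percentage is the true share of the dataset seen during training.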
In our approach, the training patterns are selected by a clustering technique (the K-means algorithm). In doing so, we intend to suggest a pattern selection method that is able to optimise NN training with a small number of input patterns, to simulate the chemical reactions that determine ozone levels in an urban area such as Rome, and to simulate outlier situations (i.e. high hourly ozone peaks).
We use cluster methods exclusively to select the best and most significant patterns, whereas these techniques are usually used to synthesise data into homogeneous groups, in which each item's average dissimilarity to all other items in the same cluster is minimal.
In particular, we train the NN on patterns constituted by the centroids produced by the K-means algorithm, with the number of centroids set to ad hoc percentages of the whole dataset (we tested from 0.5% up to 50% of the total amount of data).
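The idea of using centroids as training patterns can be sketched with a plain Lloyd-style K-means. This is a hedged illustration under several assumptions that the text does not state: clustering is done on the joint vector of inputs and ozone target, centroids are initialised from randomly chosen data points, and the variables are used unscaled (in practice they would probably need rescaling, since they have different units). All function names and the synthetic data are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_centroids(points, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means; returns the k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # initialise from data
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids

def centroid_training_patterns(patterns, fraction, seed=0):
    """Cluster joint (inputs + target) vectors and return the centroids as patterns."""
    k = max(1, round(fraction * len(patterns)))
    joint = [tuple(inputs) + (target,) for inputs, target in patterns]
    centroids = kmeans_centroids(joint, k, seed=seed)
    # Split each centroid back into (inputs, target).
    return [(tuple(c[:-1]), c[-1]) for c in centroids]

# Synthetic stand-in data: (temperature, solar radiation) -> ozone.
data_rng = random.Random(1)
patterns = [
    ((data_rng.uniform(0, 35), data_rng.uniform(0, 900)), data_rng.uniform(0, 180))
    for _ in range(400)
]

training_set = centroid_training_patterns(patterns, 0.05)
print(len(training_set))  # 20
```

Each returned pattern is a cluster mean rather than an observed hour, so the training set summarises the data instead of sampling it.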
After the neural network has been successfully trained, its performance is tested on a separate testing set constituted by the original dataset. In this way, it is possible to verify whether the network trained on patterns drawn from the centroid dataset achieves higher generalisation and prediction accuracy than the benchmark.
The results of our approach are compared with Conventional Random Pattern Selection (CRPS), our benchmark, for different percentages of input patterns.
As the NN architecture, we use the Multi Layer Perceptron (MLP) model with one hidden layer of 10 neurons. We tested other hidden layer sizes (12 and 14 hidden neurons), but the best performance was obtained with 10 neurons. Moreover, we tried different weight correction methods; the results reported here were obtained with the conjugate gradient algorithm.
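The architecture described above can be illustrated with a forward pass through a one-hidden-layer MLP. This is only a sketch: the tanh hidden activation, the linear output unit, the weight initialisation range and the choice of six input variables are assumptions, since the text does not specify them. Training by conjugate gradient is not shown.

```python
import math
import random

def init_mlp(n_inputs, n_hidden=10, seed=0):
    """Small random weights for a one-hidden-layer MLP with a single output."""
    rng = random.Random(seed)
    return {
        "w_hidden": [
            [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)
        ],
        "b_hidden": [0.0] * n_hidden,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": 0.0,
    }

def forward(params, x):
    """Predict one output from input vector x (tanh hidden units, linear output)."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(params["w_hidden"], params["b_hidden"])
    ]
    return sum(w * h for w, h in zip(params["w_out"], hidden)) + params["b_out"]

def n_parameters(params):
    """Count the trainable weights and biases."""
    n_hidden = len(params["w_hidden"])
    n_inputs = len(params["w_hidden"][0])
    return n_hidden * (n_inputs + 1) + n_hidden + 1

# Assumption: six input variables, chosen only for illustration.
params = init_mlp(n_inputs=6)
y = forward(params, [0.2, 0.5, 0.1, 0.7, 0.3, 0.9])
print(n_parameters(params))  # 81
```

With six inputs, moving from 10 to 12 hidden neurons raises the parameter count from 81 to 97, which gives a sense of how modest the tested size differences are.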
Results and discussion
We applied the NN to the patterns resulting from the selection process in order to forecast time series of ozone concentration levels, using meteorological variables as well as primary and secondary pollutants as input data.
The performance of our approach is compared with CRPS in terms of the coefficient of determination (R2), for different percentages of input patterns, during the test phase.
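For reference, the coefficient of determination can be computed as sketched below. The definition used here, one minus the ratio of the residual to the total sum of squares, is an assumption: the text does not say which definition was used, and some studies report the squared correlation coefficient instead. The ozone values are invented for the example.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R2 is undefined")
    return 1.0 - ss_res / ss_tot

# Hypothetical hourly ozone values and model predictions.
observed = [40.0, 60.0, 80.0, 100.0]
predicted = [45.0, 55.0, 85.0, 95.0]
print(round(r_squared(observed, predicted), 2))  # 0.95
```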
The simulations show several significant results.
For the NN benchmark, we consider different percentages of input patterns and observe a rapid increase in performance beyond 10% of the data. If we use more than 10% of the data as input patterns, R2 is greater than 0.8. NN performance decreases markedly for lower percentages of input patterns: R2 is 0.05 and 0.45 for 1% and 3% of the entire dataset respectively.
Using cluster analysis for pattern selection increases NN performance very significantly. NN training with clusters amounting to 0.5% and 1% of the total data (37 and 74 patterns respectively, given by the centroid coordinates) gives R2 ranging from 0.55 to 0.66.
To compare the two approaches, we determine the percentage of input patterns that the CRPS strategy needs in order to reach the same coefficient of determination.
The performance of the cluster analysis selection is always higher than that of the conventional benchmark. In particular, our results show three different behaviours, linked to the information in the data.
The first concerns selections of up to 1% of the data by clustering. In this case, the performance is equivalent to that of about 3% of CRPS (corresponding to 221 patterns out of 7370). The second concerns increasing the selection by cluster methods from 3% to 10%, for which we obtain the same performance as up to 16% of CRPS. Finally, beyond 30% of patterns selected by cluster methods, we achieve performance equivalent, in terms of R2, to that of CRPS with more than 77% of the patterns.
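The pattern counts quoted in this section follow directly from the dataset size of 7370 hourly patterns. The short check below reproduces them, assuming the counts are rounded to the nearest integer (the helper name `pattern_count` is illustrative).

```python
N_PATTERNS = 7370  # hourly patterns in the dataset

def pattern_count(percent, total=N_PATTERNS):
    """Number of patterns corresponding to a percentage of the dataset."""
    # Assumption: counts are rounded to the nearest whole pattern.
    return round(total * percent / 100)

for percent in (0.5, 1, 3):
    print(f"{percent}% -> {pattern_count(percent)} patterns")
# 0.5% -> 37 patterns
# 1% -> 74 patterns
# 3% -> 221 patterns
```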
Our results are very encouraging and show that NN model performance is improved by cluster analysis compared with the conventional random pattern choice, in which patterns are assigned at random based on a relative number or percentage of cases.
Simulations based on cluster analysis show that the NN converges more rapidly than with conventional algorithms. Moreover, these results demonstrate that our approach is feasible and effective: it substantially reduces the input data requirement and outperforms other techniques applied in this context. The combined techniques are therefore more accurate than each individual methodology and offer better performance than either method alone.