4.3 Optimization of Input Data for a Neural Network with a Mathematical Argument Diagram: A Case Study of Ozone Prediction

Wednesday, 13 January 2016: 11:30 AM
Room 354 (New Orleans Ernest N. Morial Convention Center)
Armando Pelliccioni, INAIL, Monteporzio Catone, Italy; and S. E. Haupt

1. Introduction

In scientific convention, theory explains phenomena observed in nature, and phenomena are revealed through empirical observation. Measurements and theory are intimately interconnected, and science focuses on explaining the measurements through theory. These considerations raise questions about the consistency of scientific information and, most importantly, about how to obtain theoretical results adequately. These questions are central to neural network (NN) applications, particularly when NN-based mathematical simulations are applied to environmental science. This work emphasizes the importance of organizing information to optimize the input variables used in the training phase when a NN is applied to an environmental dataset. An example application to ozone levels using different input data demonstrates the issue.

2. Hidden variables and hidden deterministic models for air pollution

A common goal is to build models that reproduce real observational data. Many models are available to simulate environmental situations. Each model assumes several well-defined conditions and can simulate the environment only under those conditions. When these simulations are compared with measurements from monitoring stations, there is often a discrepancy, especially for chemical reactions in urban areas. At first sight, it appears that “reality” is found in the measurements. But in many environmental cases, measurements are strictly linked to specific locations and conditions; thus, one must question whether a specific situation is representative of all cases when the results merely reproduce experimental data. For these reasons, an intelligent modelling approach (such as a NN or a support vector machine, SVM) can be considered a valid alternative to the deterministic approach when it is necessary to reproduce the real case. Considering the main factors that link environmental measurements, pollutant levels vary for the following reasons:
• chemical reactions during the transport time from emission to deposition on the ground;
• dispersion due to meteorological conditions;
• thermal and mechanical turbulence of the atmosphere;
• imprecision in the emission factors characteristic of each source.
While the relationship between cause and effect can be relatively easy to determine for closed systems, in real environments this connection can be very complex, and the complexity of the models is strictly linked to the complexity of the situations described (existence of hidden variables or hidden deterministic models).

3. The definition of the mathematical arguments diagram (MAD) and performance for ozone prediction.

To optimize the input data for a NN, it is fundamental to define the concept of the mathematical arguments diagram (MAD). The MAD is a graph that represents, in graphical form, the relationships between interconnected variables and their information levels. The MAD contains the following information:
• the independent input variables (X1, …, Xn);
• the parameters of the model (K1, …, Kp);
• the output relationship for the dependent variable, Y(X1, …, Xn; [K1, …, Kp]);
• the information level (number of square boxes as input information).
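One way to picture the MAD as a data structure is a dependency graph walked outward from the output variable. The sketch below is only an illustration under stated assumptions: the edges in `mad_edges` are hypothetical and are not taken from the authors' diagram, and the rule that direct inputs of the output sit at level 1, their own inputs at level 2, and so on is an assumed reading of "information level".

```python
from collections import deque

# Hypothetical MAD fragment: each key depends on the variables listed for it.
# The edges are illustrative only and are NOT taken from the authors' diagram.
mad_edges = {
    "O3": ["solar_radiation", "temperature", "wind_speed"],
    "solar_radiation": ["hour_of_day", "month_of_year"],
}


def information_levels(edges, output):
    """Breadth-first walk from the output; returns {variable: level}.

    Assumed convention: direct inputs of the output are level 1,
    inputs of those inputs are level 2, and so on.
    """
    levels = {}
    queue = deque([(output, 0)])
    while queue:
        node, level = queue.popleft()
        for parent in edges.get(node, []):
            if parent not in levels:  # keep the shortest distance to the output
                levels[parent] = level + 1
                queue.append((parent, level + 1))
    return levels


levels = information_levels(mad_edges, "O3")
max_level = max(levels.values())  # deepest information level in this toy graph
```

In this toy graph the meteorological inputs land on the first level and the time variables on the second, which mirrors the primary and secondary levels of information discussed in the text.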

The MAD shows the explicit relationships between the variables that the system needs to predict the output variable. This tool helps to classify the parameters and variables that are fundamental to the construction of the mathematical model. All data must be classified into information levels, and each part of the MAD refers to a different level of information. In environmental situations, the relationships that determine air pollutant concentrations are intricate. The complete MAD for the air pollution system involves up to seven information levels (not shown here) and is very complex. Here we consider the MAD for the photochemical production of O3; it shows that the maximum information level for this problem is three. These reactions involve the pollutants NO, O3 and NO2 and are activated by solar radiation.

We apply this scheme to reproduce ozone in the Rome area. To train a multilayer perceptron model, we use conventional input parameters (such as temperature, wind speed and direction, relative humidity, friction velocity and solar radiation) linked to the first level of information. These simulations involve eight input variables, referred to as I+M. To test the influence of the secondary level of information, a new set of input variables, called time variables, was added to the input data (for a total of 12 input variables). The time variables are hour of the day, day of the week, day of the month and month of the year. These variables are not strictly related to the chemical reactions that produce ozone, but the MAD scheme relates them indirectly to ozone levels, and they take into account seasonal effects and periodic turbulence conditions. We refer to the results obtained with them as I+M+Ex. The performance of a neural network (NN) and a multiple linear regression (MLR) model using the different input data is shown in the table.
It is clear that adding the secondary variables improves the model performance.
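As a rough illustration of how the four time variables could be derived and appended to a base input record, consider the sketch below. The function names, field names and numeric values are placeholders rather than study data, and the plain integer encoding is an assumption, since the abstract does not say how the time variables were encoded.

```python
from datetime import datetime


def time_variables(timestamp):
    """Return the four time variables named in the text for one observation."""
    return {
        "hour_of_day": timestamp.hour,          # 0 to 23
        "day_of_week": timestamp.isoweekday(),  # 1 (Monday) to 7 (Sunday)
        "day_of_month": timestamp.day,          # 1 to 31
        "month_of_year": timestamp.month,       # 1 to 12
    }


def extend_inputs(base_inputs, timestamp):
    """Append the time variables to a base (I+M) record to form an I+M+Ex record."""
    extended = dict(base_inputs)  # copy so the base record is left unchanged
    extended.update(time_variables(timestamp))
    return extended


# Hypothetical record: field names and values are placeholders, not study data.
base = {"temperature": 25.0, "wind_speed": 2.0, "solar_radiation": 600.0}
row = extend_inputs(base, datetime(2016, 1, 13, 11, 30))
```

Integer encoding treats hour 23 and hour 0 as far apart; a cyclic (sine and cosine) encoding is a common alternative, though nothing in the abstract indicates which choice was made.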

                I+M MLR   I+M NN   I+M+Ex MLR   I+M+Ex NN
R2              0.44      0.62     0.53         0.69
Bias (µg/m3)    19.43     12.05    16.55        10.10
MAE             17.10     13.52    15.58        12.42

The external (or secondary) variables (Ex) substantially improve the performance of the multiple regression model (MLR), by about 20%, while their introduction increases the performance of the neural network (NN) by about 12%.
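The abstract does not state which metric the quoted percentages refer to. As a simple check, the sketch below computes relative changes directly from the tabulated R2 and MAE values; the helper `relative_change` is a name introduced here for illustration.

```python
# Values copied from the table in the text: (I+M, I+M+Ex) for each model.
r2 = {"MLR": (0.44, 0.53), "NN": (0.62, 0.69)}
mae = {"MLR": (17.10, 15.58), "NN": (13.52, 12.42)}


def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


for model in ("MLR", "NN"):
    r2_gain = relative_change(*r2[model])     # positive means higher R2
    mae_drop = -relative_change(*mae[model])  # positive means lower MAE
    print(f"{model}: R2 +{r2_gain:.1f}%, MAE -{mae_drop:.1f}%")
```

Under this reading, the relative gains in R2 are of the same order as the percentages quoted in the text, while the MAE reductions are smaller.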

4. Conclusion

The complexity revealed by the MAD analysis demonstrates that, when evaluating the photochemical reactions of air pollution, it is helpful to classify the input information into levels before applying the NN. The MAD analysis provides a conceptual method for meaningfully assessing the value of each variable used in the training phase to reproduce a target. We have introduced time variables that are not directly related to the pollutant and meteorological conditions in the atmosphere but are useful in modelling the ozone concentration. The new variables improve the accuracy of both the NN and the MLR model for forecasting ozone in the urban area of Rome, with the NN remaining the more accurate of the two.
