This project had three main objectives. The first was to assess the performance of this type of neural network at different sites from around the globe. The second was to determine how the amount of available data and the number of input variables affected performance, with the aim of determining the minimum data or meta-data required. Finally, we used a trained neural network to predict missing N2O emissions and estimate annual fluxes.
Sites had varying numbers of input meta-data variables and sample days (7-20 variables and 70-350 days, respectively). For the site comparison, however, we limited the inputs to those common to all sites. We examined the data for each model at individual-year time steps. In addition to the environmental inputs from each site, we added a continuous numeric time-of-year variable and binary season variables to each data set in order to capture daily and seasonal patterns. We then simultaneously selected the initial weights and the number of hidden nodes for each model through a trial-and-error process to minimize root mean square error. The models were trained using the neuralnet package in R and cross-validated using k-fold cross-validation with k = 10. To measure each final model's accuracy, we plotted known values against predicted values and calculated the R² value.

To test how adding inputs affected model accuracy, inputs were added sequentially, beginning with the five used for the site comparison (rainfall, air temperature, time of year, summer, fall) and continuing in order of their predicted effect on N2O emissions. To test whether the amount of available data affected accuracy, we calibrated and trained models on progressively smaller subsets, starting with 80% of the data and decreasing by 20% each time.
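The structure-selection step described above can be sketched as follows. This is an illustrative Python/scikit-learn analogue, not the study's actual code (the study used R's neuralnet package); the input variables, network sizes, and synthetic data are hypothetical stand-ins for the real site measurements.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_days = 300  # hypothetical number of sample days at one site

# Five illustrative inputs mirroring the site-comparison set:
# rainfall, air temperature, continuous time of year, and binary season flags
X = np.column_stack([
    rng.gamma(2.0, 2.0, n_days),                            # rainfall
    15 + 10 * rng.standard_normal(n_days),                  # air temperature
    np.linspace(0.0, 1.0, n_days),                          # time of year
    (np.arange(n_days) % 365 < 90).astype(float),           # "summer" flag
    ((np.arange(n_days) % 365) // 90 == 1).astype(float),   # "fall" flag
])
y = 0.5 * X[:, 0] + 0.1 * X[:, 1] + rng.standard_normal(n_days)  # synthetic flux

# Trial-and-error over hidden-node counts, scored by 10-fold cross-validated RMSE
best = None
for n_hidden in (2, 4, 8, 16):
    rmses = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                     random_state=0).split(X):
        model = MLPRegressor(hidden_layer_sizes=(n_hidden,),
                             max_iter=2000, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        rmse = mean_squared_error(y[test_idx], model.predict(X[test_idx])) ** 0.5
        rmses.append(rmse)
    mean_rmse = float(np.mean(rmses))
    if best is None or mean_rmse < best[1]:
        best = (n_hidden, mean_rmse)

print(f"selected hidden nodes: {best[0]} (CV RMSE {best[1]:.3f})")
```

In this sketch the candidate hidden-node counts are fixed in advance; in the study the weights and node counts were explored together by trial and error.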
Preliminary results indicate that performance varies by site. We also found that when the percentage of available data was plotted against the R² values, there appeared to be no relationship between the two: there was no significant decrease in accuracy until only 20% of the data was used to train and calibrate the models. As for inputs, prediction accuracy began to decline once more than nine input variables were used, which we attribute to redundancy in the information provided by the inputs. Finally, we used the trained model to fill the gaps in the data and compared it with linear interpolation, calculating the area under the flux curve for each gap-filling method to obtain an annual emission rate. The neural-network-filled data produced both under- and over-estimates of annual emissions relative to the linear method.
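The gap-filling comparison can be illustrated as follows. This is a hedged sketch, not the study's code: the daily fluxes are synthetic, the "network prediction" is a hypothetical stand-in for a trained model's output, and annual emissions are computed with a simple trapezoidal area under the daily flux curve.

```python
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(365)
# Synthetic "true" daily N2O flux with a seasonal cycle plus noise
true_flux = 1.0 + 0.5 * np.sin(2 * np.pi * days / 365) \
    + 0.1 * rng.standard_normal(365)

# Simulate gaps: measurements missing on roughly 30% of days
observed = true_flux.copy()
gap_mask = rng.random(365) < 0.3
observed[gap_mask] = np.nan

# Method 1: linear interpolation across the gaps
lin_filled = observed.copy()
lin_filled[gap_mask] = np.interp(days[gap_mask],
                                 days[~gap_mask], observed[~gap_mask])

# Method 2: fill gaps with a (hypothetical) trained network's predictions
nn_pred = 1.0 + 0.5 * np.sin(2 * np.pi * days / 365)  # stand-in model output
nn_filled = observed.copy()
nn_filled[gap_mask] = nn_pred[gap_mask]

# Annual emission estimate: trapezoidal area under the curve (1-day spacing)
annual_lin = float(np.sum((lin_filled[:-1] + lin_filled[1:]) / 2))
annual_nn = float(np.sum((nn_filled[:-1] + nn_filled[1:]) / 2))
print(f"linear-filled annual total: {annual_lin:.1f}, "
      f"NN-filled annual total: {annual_nn:.1f}")
```

Comparing the two annual totals in this way mirrors the study's comparison, where the network-filled series both under- and over-estimated emissions relative to the linear method depending on the site and year.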
This project served as a proof of concept for using feedforward MLPs to fill gaps in N2O emission data. Future work should focus on developing a globally convergent algorithm for structure selection and calibration: error surfaces for environmental models are inherently non-convex, and achieving global convergence in a computationally efficient manner is difficult. The model is currently being tested across the suite of sites in the Global N2O Database, both to explore the possibility of a globally convergent algorithm and to evaluate the efficacy of neural networks for gap filling N2O emissions data. Other neural network architectures, particularly recurrent neural networks, should also be tested to see whether performance improves when the model is given information about past trends. Additionally, standards must be developed for evaluating these types of models. Results from this project can help researchers determine the most valuable meta-data to collect from a site. Finally, neural networks should be compared with other gap-filling methods.