2.2 Spatial Variation of Air Pollutants Using Machine Learning Models

Monday, 13 January 2020: 10:45 AM
211 (Boston Convention and Exhibition Center)
Jiajun Gu, Cornell Univ., Ithaca, NY; and G. Bang, A. Guha Roy, M. Brauer, and M. Zhang

Ambient air pollution has a major impact on human health. To control and reduce it effectively, resolving the spatial variability of individual air pollutants is critical, particularly in urban environments, where people are exposed to higher levels of outdoor air pollution than in rural areas. Because of the spatial inhomogeneity that results from multiple emission sources and complex urban morphology, models that simulate the dispersion of airborne pollutants may lose their generality in the urban environment, and their parameters are difficult to determine. In this study, we therefore took a stochastic view of the atmospheric dispersion modeling problem by using Gaussian process regression (GPR). A Gaussian process is a stochastic process in which every finite collection of the indexed random variables follows a multivariate normal distribution. The key idea behind GPR is to place a Gaussian process prior over the functions that describe the input-output relationship. We compared GPR with two other commonly used models, land-use (linear) regression (LUR) and random forest (RF), by predicting the spatial variation of nitrogen dioxide (NO2) and nitric oxide (NO) concentrations from spatial variables in Vancouver, BC, Canada. The measurement data came from two sampling campaigns, conducted from Oct 19 to Nov 2, 2009, and from Apr 19 to May 3, 2010, at 116 sites chosen to capture the range of long-term concentrations. Measurements averaged across both campaigns were used as estimates of annual average concentrations. Model predictors included 168 spatial variables describing location, distance to the nearest highway, road length, land use, and population density at different radii around each sampling site. All models were evaluated using multiple random hold-out sets. In terms of prediction accuracy, GPR outperformed LUR and RF, owing to its nonparametric flexibility and its ability to quantify the uncertainty of its predictions. In addition, comparing the lengthscales of the radial basis function (RBF) kernel used in GPR makes it possible to rank the contribution of each land-use variable to the pollutant concentrations, which is important for epidemiologic studies of the health effects of air pollution.
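The lengthscale comparison described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study: the data are synthetic, the noise level and signal variance are fixed by assumption, and a coarse grid search over per-variable lengthscales stands in for the gradient-based hyperparameter optimization a GPR library would normally perform. All function and variable names are hypothetical. The idea is that an RBF kernel with one lengthscale per input variable lets the fitted lengthscales be sorted, with a shorter lengthscale suggesting that the prediction varies more quickly along that variable.

```python
import itertools
import numpy as np

def ard_rbf_kernel(A, B, lengthscales, variance=1.0):
    """RBF kernel with a separate lengthscale for each input variable."""
    ls = np.asarray(lengthscales, dtype=float)
    A_s, B_s = A / ls, B / ls
    sq_dist = (np.sum(A_s**2, axis=1)[:, None]
               + np.sum(B_s**2, axis=1)[None, :]
               - 2.0 * A_s @ B_s.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0))

def factorize(X, y, lengthscales, noise):
    """Cholesky factor of the noisy kernel matrix and the weights K^-1 y."""
    K = ard_rbf_kernel(X, X, lengthscales) + noise**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return L, alpha

def log_marginal_likelihood(X, y, lengthscales, noise=0.1):
    """Log evidence of a zero-mean GP; higher means a better-supported kernel."""
    L, alpha = factorize(X, y, lengthscales, noise)
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2.0 * np.pi))

def gp_predict(X, y, X_new, lengthscales, noise=0.1):
    """Posterior mean and standard deviation of the latent function."""
    L, alpha = factorize(X, y, lengthscales, noise)
    K_s = ard_rbf_kernel(X, X_new, lengthscales)
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v**2, axis=0)  # prior variance is 1.0 by assumption
    return mean, np.sqrt(np.maximum(var, 0.0))

# Synthetic stand-in for site-level predictors (assumption, not study data):
# variable 0 has a strong nonlinear effect, variable 1 a weak linear effect,
# and variable 2 has no effect on the response.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(60, 3))
y = np.sin(2.0 * X[:, 0]) + 0.3 * X[:, 1] + 0.1 * rng.normal(size=60)
y = y - y.mean()  # centre the response to match the zero-mean prior

# Coarse grid search over lengthscales, maximizing the marginal likelihood.
candidates = [0.5, 1.0, 2.0, 5.0, 20.0]
best = max(itertools.product(candidates, repeat=X.shape[1]),
           key=lambda ls: log_marginal_likelihood(X, y, ls))

# Rank variables from shortest to longest selected lengthscale.
ranking = np.argsort(best, kind="stable")
print("selected lengthscales:", best)
print("variables, shortest lengthscale first:", ranking)

# Predictions with uncertainty at a few new locations.
X_new = rng.uniform(-2.0, 2.0, size=(5, 3))
mean, std = gp_predict(X, y, X_new, best)
print("predictive mean:", mean)
print("predictive std: ", std)
```

In practice the predictors would be standardized first so that lengthscales are comparable across variables with different units, and the noise level and signal variance would be estimated along with the lengthscales rather than fixed.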