Gradient Boosting Machine Learning to Improve Satellite-Derived Column Water Vapor with Implications for Estimating Ground-Level Fine Particulate Air Pollution

Just, Allan C.

Introduction:

Column water vapor (CWV) has recently been reported to serve as an indicator of the vertical mixing of aerosols, and including it can significantly improve mapping of respirable particulate matter (PM_2.5) for health studies. Remote sensing can estimate CWV under dynamic conditions and at high resolution (1 km × 1 km), for example with the Multi-Angle Implementation of Atmospheric Correction (MAIAC) retrieval algorithm, which is applied to daily overpasses of NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard the Aqua and Terra satellites. We recently demonstrated that machine learning with gradient boosting can improve the estimation of MAIAC aerosol optical depth (AOD) parameters (Just et al. Remote Sensing 2018). However, it has not yet been assessed whether machine-learning approaches can also improve the estimation of CWV, or what such a correction implies for estimating surface PM_2.5 concentrations.

Description of the data:

To assess the performance of the MAIAC CWV estimates, we built separate datasets for Terra and Aqua by collocating MAIAC data with cloud-screened (level 2.0) CWV from the AERONET network of sun photometers over the Northeast US. Observations were matched in space to the nearest 1 km × 1 km grid centroid and in time to the closest observation (no more than 60 minutes apart). The study period included 10,247 observations (from 75 AERONET stations) for Terra (2000-2015) and 8,536 observations (from 71 stations) for Aqua (2002-2015). AERONET stations in the Northeast are largely urban and coastal. Overall agreement between MAIAC CWV and AERONET CWV was quite good, with Pearson’s correlations of 0.972 and 0.978 for Terra and Aqua, respectively. However, outlying values and a positive bias, particularly in Terra-derived MAIAC CWV, indicate potential to improve MAIAC CWV relative to AERONET with a model that is appropriately trained in combination with other datasets. We defined our modeling target as the difference between MAIAC and AERONET CWV (ΔCWV = MAIAC CWV - AERONET CWV), which is approximately symmetrically distributed, with a mean of 0.035 cm and a standard deviation of 0.26 cm in the collocated Terra dataset.
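The temporal matching and the ΔCWV target described above can be sketched as follows. This is a minimal illustration and not the study's actual pipeline: the function name `collocate`, the tuple layout, and the toy values are assumptions, and the sketch assumes the spatial match to a grid cell has already been done and that each MAIAC overpass is paired with the closest AERONET reading.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=60)  # matching window stated in the text


def collocate(maiac_obs, aeronet_obs, max_gap=MAX_GAP):
    """Pair each MAIAC overpass with the AERONET reading closest in time.

    maiac_obs:   list of (datetime, cwv_cm) for one grid cell (hypothetical layout)
    aeronet_obs: list of (datetime, cwv_cm) for the station in that cell
    Returns a list of (overpass_time, maiac_cwv, aeronet_cwv, delta_cwv).
    """
    pairs = []
    if not aeronet_obs:
        return pairs
    for t_sat, cwv_sat in maiac_obs:
        # Closest ground observation in absolute time difference.
        t_gnd, cwv_gnd = min(aeronet_obs, key=lambda obs: abs(obs[0] - t_sat))
        # Keep the pair only if it falls inside the 60-minute window.
        if abs(t_gnd - t_sat) <= max_gap:
            delta_cwv = cwv_sat - cwv_gnd  # ΔCWV = MAIAC CWV - AERONET CWV
            pairs.append((t_sat, cwv_sat, cwv_gnd, delta_cwv))
    return pairs


# Toy values for illustration only (not data from the study).
maiac = [
    (datetime(2010, 7, 1, 15, 30), 2.10),
    (datetime(2010, 7, 2, 16, 10), 1.45),
]
aeronet = [
    (datetime(2010, 7, 1, 15, 10), 2.00),
    (datetime(2010, 7, 1, 17, 0), 2.20),
    (datetime(2010, 7, 2, 12, 0), 1.50),
]
matched = collocate(maiac, aeronet)
```

In this toy case the second overpass is dropped because its closest AERONET reading is more than four hours away, so only one matched pair remains.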

Statistical methods:

Before machine learning, we withheld a test set of complete days comprising ~15% of the data; splitting by whole days avoids overfitting on shared meteorologic or overpass/acquisition characteristics. The analysis included 22 predictors: MAIAC variables (including AOD and an uncertainty parameter related to blue band surface reflectance), a time trend (integer date), and several land use terms from the National Land Cover Database 2011 (aggregated to proportions within 1 km × 1 km grid cells). Feature engineering captured spatial patterns in non-missing MAIAC data, including the number of contiguous non-missing grid cells (clump size) and the number of non-missing observations in focal windows with side lengths of 30-510 km. No external meteorology or assimilated data were added. Hyper-parameter tuning of the XGBoost model used random five-fold cross-validation prior to final model evaluation in the withheld test set. XGBoost (Chen and Guestrin, KDD, 2016) is an efficient and scalable implementation of the stochastic gradient boosting framework (Friedman, Annals of Statistics, 2001), which we accessed from the R language. The best tuned model used a tree depth of 6, a learning rate of 0.01, and subsampling of one half of the training data in each tree, with 10,000 rounds.
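The day-wise holdout can be illustrated with a short sketch. The analysis itself was run in R; the Python below is only an assumed, simplified version in which the function name `split_by_day`, the `"date"` record key, and the toy records are hypothetical. The tuned values from the text are listed under the parameter names used by the xgboost library (`max_depth`, `eta`, `subsample`) purely for reference.

```python
import random
from collections import defaultdict

# Tuned values reported in the text, keyed by xgboost parameter names.
XGB_PARAMS = {"max_depth": 6, "eta": 0.01, "subsample": 0.5}
N_ROUNDS = 10_000


def split_by_day(records, test_fraction=0.15, seed=0):
    """Withhold whole days until roughly `test_fraction` of records are in the test set."""
    by_day = defaultdict(list)
    for rec in records:
        by_day[rec["date"]].append(rec)

    days = sorted(by_day)
    random.Random(seed).shuffle(days)  # reproducible random order of days

    target = test_fraction * len(records)
    train, test = [], []
    for day in days:
        # Every observation from a given day goes to the same side of the split.
        bucket = test if len(test) < target else train
        bucket.extend(by_day[day])
    return train, test


# Toy records: 100 days with 1 to 3 observations each (illustrative only).
records = [{"date": d, "obs": i} for d in range(100) for i in range(1 + d % 3)]
train, test = split_by_day(records)
train_days = {r["date"] for r in train}
test_days = {r["date"] for r in test}
```

Because assignment happens per day, no date appears on both sides of the split, so the test set cannot share a day's meteorology or overpass geometry with the training data.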

Results:

In the test set for Terra data, the predicted ΔCWV explained 75% of the variance in ΔCWV (R²) and reduced the RMSE from 0.26 cm (the root mean squared difference between MAIAC and AERONET CWV) to 0.13 cm, a 50% decrease in RMSE. Results for Aqua were similar, with the XGBoost model explaining 56% of the variance in the difference between the two parameters in test data and reducing the RMSE from 0.23 cm (the root mean squared difference between MAIAC and AERONET CWV) to 0.14 cm, a 36% decrease in RMSE. Predictor contributions, estimated with game-theory-based Shapley Additive Explanations (Lundberg, arXiv.org 2018), suggested that the top four contributors to predicting the magnitude of ΔCWV for Terra were the time trend, the magnitude of the MAIAC column water vapor itself, the blue band reflectance-based uncertainty estimate from MAIAC, and the MAIAC AOD estimate. Although all of the data in the test set were from days not included in the training data, plots of the Shapley estimates for the time trend still showed clear seasonality. For the Aqua test data, the most important predictors were the blue band reflectance-based uncertainty, followed by the MAIAC AOD estimate, the MAIAC CWV estimate, and then the time trend.
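The evaluation metrics above follow from simple arithmetic, sketched here on toy numbers. The helper names and values are assumptions for illustration. The source does not state which definition of R² was used; this sketch uses 1 - SS_res/SS_tot. It also assumes the corrected CWV is obtained by subtracting the predicted ΔCWV from MAIAC CWV, so that the corrected error equals ΔCWV minus its prediction.

```python
import math


def rmse(errors):
    """Root mean squared value of a list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def percent_decrease(before, after):
    return 100.0 * (before - after) / before


# Toy values in cm (illustrative only, not data from the study).
maiac = [2.30, 1.10, 0.80, 3.05, 1.70]
aeronet = [2.00, 1.00, 0.90, 2.70, 1.60]
delta = [m - a for m, a in zip(maiac, aeronet)]   # observed ΔCWV
delta_hat = [0.25, 0.05, -0.05, 0.30, 0.15]       # hypothetical model predictions

# Corrected CWV: remove the predicted error from the satellite retrieval.
corrected = [m - d for m, d in zip(maiac, delta_hat)]

rmse_raw = rmse(delta)                                              # MAIAC vs AERONET
rmse_corrected = rmse([c - a for c, a in zip(corrected, aeronet)])  # corrected vs AERONET
r2 = r_squared(delta, delta_hat)

# Reported Terra values: 0.26 cm before and 0.13 cm after correction.
terra_decrease = percent_decrease(0.26, 0.13)
```

With the reported Terra RMSE values, `percent_decrease(0.26, 0.13)` reproduces the 50% decrease stated in the text.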

Discussion:

Our analysis demonstrates that gradient boosting with XGBoost, using features that include satellite retrieval quality assurance, aerosol optical depth estimates, land use, and time trends, can substantially refine satellite-derived retrievals of column water vapor (CWV) at 1 km × 1 km resolution, as judged against sun photometer measurements of CWV on test days that were withheld from the training data. Ongoing analyses are examining how this improved CWV estimate, expressed as a ratio with similarly corrected AOD (AOD/CWV), can be used to improve predictions of surface 24-hour PM_2.5 concentrations at 380 EPA monitoring stations across the Northeastern US.

3B.2 Gradient Boosting Machine Learning to Improve Satellite-Derived Column Water Vapor with Implications for Estimating Ground-Level Fine Particulate Air Pollution