In this work, we explore the use of machine learning (ML) to improve the geophysical air quality forecasts (AQF) produced by chemistry-transport models, using ancillary meteorological information in addition to the chemical forecasts. As a case study, we consider an operational AQF system for Mexico City (based on the WRF and CMAQ models) that continuously produces 48-hour forecasts for the main pollutants; we focus on O3 and PM2.5. The analysis is performed with several algorithms, including the popular gradient boosting machine. The forecasts are corrected at each of the 35 surface monitoring stations available in the agglomeration. Results are compared with both the Kalman filter and the analog methods.
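To make the idea of a station-level ML correction concrete, the following is a minimal sketch, not the operational setup: a toy gradient boosting machine built from single-split regression stumps, trained to predict the raw forecast error from the raw forecast and one meteorological predictor. The data are synthetic, and every name and number (`make_sample`, the temperature-dependent bias, the number of rounds, the learning rate) is a hypothetical choice made for illustration only.

```python
import random

def fit_stump(X, r):
    """Least-squares regression stump: best single-feature threshold split for targets r."""
    best = None  # (sse, feature index, threshold, left mean, right mean)
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = 0.5 * (lo + hi)
            left = [ri for row, ri in zip(X, r) if row[j] <= thr]
            right = [ri for row, ri in zip(X, r) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return best[1:]

def stump_predict(stump, row):
    j, thr, lm, rm = stump
    return lm if row[j] <= thr else rm

def fit_gbm(X, y, n_rounds=30, lr=0.3):
    """Gradient boosting for squared loss: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient of squared loss
        stump = fit_stump(X, resid)
        stumps.append(stump)
        pred = [pi + lr * stump_predict(stump, row) for pi, row in zip(pred, X)]
    return base, lr, stumps

def gbm_predict(model, row):
    base, lr, stumps = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)

def rmse(a, b):
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

rng = random.Random(0)

def make_sample():
    """Synthetic (features, observation) pair for one station and hour (assumption)."""
    temp = round(rng.uniform(10.0, 30.0), 1)  # hypothetical meteorological predictor
    obs = rng.uniform(20.0, 120.0)            # synthetic "observed" concentration
    # Synthetic raw forecast with a temperature-dependent bias plus noise.
    raw = round(obs + 25.0 + 1.5 * (temp - 20.0) + rng.gauss(0.0, 5.0))
    return [raw, temp], obs

data = [make_sample() for _ in range(200)]
train, test = data[:140], data[140:]

# Train on the forecast error (observation minus raw forecast) at the station.
X_train = [x for x, _ in train]
err_train = [obs - x[0] for x, obs in train]
model = fit_gbm(X_train, err_train)

# Corrected forecast = raw forecast + predicted error, evaluated on held-out data.
obs_test = [obs for _, obs in test]
raw_test = [x[0] for x, _ in test]
corrected_test = [x[0] + gbm_predict(model, x) for x, _ in test]
rmse_raw = rmse(raw_test, obs_test)
rmse_corrected = rmse(corrected_test, obs_test)
print(f"RMSE raw: {rmse_raw:.1f}  RMSE corrected: {rmse_corrected:.1f}")
```

In practice one would rely on an established gradient boosting library and on real forecast and observation archives; the sketch only shows the structure of the correction, with one model fitted per station.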
Our first results indicate that the best performance is currently obtained with the analog and ML methods: most of the bias is removed, correlations are substantially improved and errors are strongly reduced. However, these average statistics hide deficiencies at the upper end of the concentration distributions. This is problematic because AQF systems aim to predict exceedances of ambient air quality standards sufficiently far in advance to warn the population and possibly to set up short-term emission reduction plans (e.g. traffic restrictions, temporary shutdown of some industrial facilities). As pollution episodes remain relatively infrequent, ML models trained on such typically unbalanced datasets miss many of these important events. To improve their ability to predict highly polluted situations, we explore various approaches, including under-/oversampling and sample weighting.
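The rebalancing strategies mentioned above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the exceedance threshold, the weight given to episodes and the synthetic concentrations are hypothetical, and the helper names (`exceedance_weights`, `oversample_exceedances`, `undersample_non_exceedances`) are introduced here and do not come from any specific library.

```python
import random

THRESHOLD = 100.0  # hypothetical exceedance threshold, arbitrary units

def exceedance_weights(y, threshold, episode_weight=5.0):
    """Sample weighting: give exceedance samples a larger weight in the training loss."""
    return [episode_weight if v > threshold else 1.0 for v in y]

def oversample_exceedances(X, y, threshold, rng):
    """Oversampling: randomly duplicate exceedance samples until both groups have equal size."""
    rare = [i for i, v in enumerate(y) if v > threshold]
    common = [i for i, v in enumerate(y) if v <= threshold]
    if not rare or len(rare) >= len(common):
        return list(X), list(y)
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    idx = common + rare + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]

def undersample_non_exceedances(X, y, threshold, rng):
    """Undersampling: keep all exceedances and an equally sized random subset of the rest."""
    rare = [i for i, v in enumerate(y) if v > threshold]
    common = [i for i, v in enumerate(y) if v <= threshold]
    kept = rng.sample(common, min(len(common), len(rare)))
    idx = rare + kept
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]

rng = random.Random(42)

# Synthetic, unbalanced training targets: 90 ordinary hours and 10 exceedances.
y = [rng.uniform(20.0, 90.0) for _ in range(90)]
y += [rng.uniform(100.1, 150.0) for _ in range(10)]
X = [[v + rng.gauss(0.0, 5.0)] for v in y]  # one synthetic predictor per sample

weights = exceedance_weights(y, THRESHOLD)
X_over, y_over = oversample_exceedances(X, y, THRESHOLD, rng)
X_under, y_under = undersample_non_exceedances(X, y, THRESHOLD, rng)

print("total weight:", sum(weights))
print("oversampled size:", len(y_over), "undersampled size:", len(y_under))
```

The weights would typically be passed to the training routine of the chosen ML algorithm, whereas the resampled sets simply replace the original training data. Both approaches trade some accuracy on ordinary conditions for better sensitivity to episodes, so the balance ratio and the weight are tuning choices rather than fixed values.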