Machine Learning Models for Climate Data

Meyer, David A.

Meteorological observations and pollution measurements are geographically local, taken at the positions of the necessary instruments. Data assimilation transforms them into estimates of initial conditions for dynamical models, often through an iterative process of estimation, dynamical evolution, re-estimation, and so on. Full numerical weather models are, of course, computationally expensive, so for any specific purpose one would prefer to use the simplest possible dynamical model. Our goal is to determine the extent to which relatively simple machine learning models, informed by physical principles, suffice for geographically and temporally local forecasts of pollution and lightning strikes. These two applications are of interest in the context of the United Nations Data for Climate Action Challenge. We are motivated, specifically, by the question of how reliable reported air quality measurements are in China, and by the challenge of localizing national scale models for lightning.

Each of these applications can be abstracted to a model for a scalar field, φ(z,t), representing some air pollutant or convective available potential energy (CAPE), respectively, which is transported downwind. Thus we also include a wind velocity field, w(z,t) ∈ R². Importantly, the points z lie on the surface of the earth, which is approximately spherical, not flat, except on small scales. Using datasets for pollution measurements, or CAPE, at specific locations and times, at each time we interpolate a continuous scalar field on the sphere using the recently developed regularized orthogonal functional matching pursuit method of Michel and Telschow (2016). This method solves the inverse problem of approximating the observed values by evaluating an optimal linear combination of basis functions at the sample points on the sphere. We use the same method to compute an interpolating wind velocity field at each observation time. We integrate the interpolated velocity field numerically to obtain a flow on the sphere.
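The numerical integration step can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes a hypothetical callable wind(lat, lon) returning eastward and northward speeds in m/s (a stand-in for the interpolated velocity field, held fixed in time over the integration interval), and it uses a midpoint rule with projection back onto the unit sphere as one possible integrator. The names flow, tangent_velocity and EARTH_RADIUS_M are illustrative.

```python
import numpy as np

EARTH_RADIUS_M = 6.371e6  # mean Earth radius; assumed spherical


def to_xyz(lat, lon):
    """Unit vector for latitude/longitude given in radians."""
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])


def to_latlon(p):
    """Latitude/longitude in radians for a unit vector p."""
    return np.arcsin(np.clip(p[2], -1.0, 1.0)), np.arctan2(p[1], p[0])


def tangent_velocity(p, wind):
    """Wind at p as a tangent vector on the unit sphere (radians per second)."""
    lat, lon = to_latlon(p)
    u, v = wind(lat, lon)  # hypothetical field: (east, north) in m/s
    east = np.array([-np.sin(lon), np.cos(lon), 0.0])
    north = np.array([-np.sin(lat) * np.cos(lon),
                      -np.sin(lat) * np.sin(lon),
                      np.cos(lat)])
    return (u * east + v * north) / EARTH_RADIUS_M


def flow(lat, lon, wind, duration_s, n_steps=100):
    """Advect a point downwind for duration_s seconds; returns (lat, lon)."""
    p = to_xyz(lat, lon)
    dt = duration_s / n_steps
    for _ in range(n_steps):
        k1 = tangent_velocity(p, wind)
        mid = p + 0.5 * dt * k1
        mid /= np.linalg.norm(mid)      # project the midpoint onto the sphere
        k2 = tangent_velocity(mid, wind)
        p = p + dt * k2
        p /= np.linalg.norm(p)          # project the new point onto the sphere
    return to_latlon(p)
```

Working with unit vectors in R³ rather than latitude and longitude directly avoids the coordinate singularity at the poles. Flowing up-wind instead of downwind amounts to passing a negative duration_s.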

Our pollution models are finitely parameterized to specify some spatio-temporal distribution of created pollution. For example, when pollution is measured daily by city, we include two parameters for pollution generation (on weekdays and weekends) per city, together with an annual periodicity (to capture seasonal dependence) and an interpolation rule. The predicted pollution at each sample location is a linear combination of the created pollution at that place and time and the pollution that is one time interval up-flow at the previous time. We choose the pollution generation parameters to minimize the sum of the squared errors between the predicted values and the observed values in a training set of the data, and then test the model against data from a disjoint, later, time interval.
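As a rough illustration of this fit for a single city, the sketch below treats the prediction as a weekday or weekend source term plus a weighted contribution from the up-flow value at the previous time, and estimates the three coefficients by ordinary least squares. It is a simplification under stated assumptions: the annual periodicity and the interpolation rule are omitted, the up-flow values are taken as already computed, and the names fit_city and predict_city are hypothetical.

```python
import numpy as np


def design_matrix(upflow_prev, is_weekend):
    """Columns: weekday source, weekend source, up-flow pollution at t-1."""
    is_weekend = np.asarray(is_weekend, dtype=bool)
    return np.column_stack([~is_weekend, is_weekend, upflow_prev]).astype(float)


def fit_city(observed, upflow_prev, is_weekend):
    """Least-squares estimate of (weekday source, weekend source, carry-over)."""
    X = design_matrix(upflow_prev, is_weekend)
    coef, *_ = np.linalg.lstsq(X, np.asarray(observed, dtype=float), rcond=None)
    return coef


def predict_city(coef, upflow_prev, is_weekend):
    """Predicted pollution for the given up-flow values and day types."""
    return design_matrix(upflow_prev, is_weekend) @ coef
```

To mirror the train/test split described above, one would call fit_city on an earlier block of days and compare predict_city with the observations on a later, disjoint block.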

Our interest in applying the same flow estimation to lightning data stems from a recent paper of Romps, Seeley, Vollaro and Molinari (2014) that projects an increase in lightning strikes due to global warming. They develop a model that estimates the number of lightning strikes to be proportional to CAPE times the precipitation rate, but they do so using only global (national) scale data. Instead, we take local estimates of CAPE, interpolate a field as above and flow it downwind according to the numerically computed flow, with parameters that specify how long to flow before comparing with lightning strikes. The latter we have locally in time and space, so again we choose the flow time parameters to minimize the sum of the squared errors between the predicted values and the observed values in a training set of the data, and then test the model against data from a disjoint, later, time period.
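The choice of flow time can be sketched as a one-dimensional search. The example below is an assumption-laden illustration, not the model used in this work: it takes the Romps et al. form (strikes proportional to CAPE times precipitation rate) as the predictor, supposes the flowed CAPE values have already been computed for a handful of candidate flow times, and fits the proportionality constant in closed form for each candidate. The function fit_flow_time and its argument cape_by_flow_time are hypothetical names.

```python
import numpy as np


def fit_flow_time(cape_by_flow_time, precip, strikes):
    """Pick the flow time whose prediction c * CAPE * precip best fits strikes.

    cape_by_flow_time: dict mapping a candidate flow time to an array of
        flowed CAPE values at the comparison locations (hypothetical input).
    Returns (best_flow_time, c, sum_squared_error), or None if no candidate
    has a nonzero predictor.
    """
    precip = np.asarray(precip, dtype=float)
    strikes = np.asarray(strikes, dtype=float)
    best = None
    for tau, cape in cape_by_flow_time.items():
        x = np.asarray(cape, dtype=float) * precip
        denom = float(x @ x)
        if denom == 0.0:
            continue  # predictor is identically zero; nothing to fit
        c = float(x @ strikes) / denom  # least-squares slope through the origin
        sse = float(np.sum((strikes - c * x) ** 2))
        if best is None or sse < best[2]:
            best = (tau, c, sse)
    return best
```

A grid search is used here only because the flow time enters the prediction nonlinearly; any scalar minimizer over the flow time would serve the same purpose.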

These machine learning methods thus estimate parameters describing pollution generation and the effective physics of lightning.

TJ7.3 Machine Learning Models for Climate Data