Simulating radiative transfer (RT) – i.e., the heating of the atmosphere and surface due to the scattering and absorption of radiation – is a key part of numerical weather prediction (NWP). Because process-based RT models are computationally expensive, they are often emulated with machine learning (ML), especially neural networks (NNs). Specifically, we emulate the Rapid Radiative Transfer Model (RRTM) with a U-net++ NN architecture. We focus on shortwave radiation (wavelengths of 0.2–12.2 microns), which is of solar origin. The target variables are shortwave radiative heating rates (one for every height in the vertical grid) and boundary fluxes (top-of-atmosphere upwelling flux and surface downwelling flux). We train the NN-based emulator on "real" atmospheric profiles, with predictors from the Global Forecast System (GFS) NWP model, but then we apply the NN to perturbed validation and testing data. In both cases – for the clean GFS predictors and the perturbed predictors – NN labels (truth values) come from the RRTM. The perturbed data contain fictitious profiles of key RT predictors, including temperature, humidity, liquid cloud water, ice cloud water, and ozone. The perturbed profiles are unlike anything seen in the training data (clean GFS profiles) and can therefore be considered far out-of-sample. We expect that the NN's mean predictions (of fluxes and heating rates) will not generalize well to these new scenarios, but we are particularly interested in what happens with the NN's uncertainty-quantification (UQ) estimates. Specifically, we look for catastrophic errors, where the observation is far away from both the NN's mean prediction and the UQ-informed confidence interval.
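To make the emulator's input/output structure concrete, below is a much-simplified sketch in PyTorch. It is not the U-net++ architecture or configuration actually used; the number of vertical levels, the predictor set, the layer sizes, and all names are illustrative assumptions. The sketch maps a batch of predictor profiles to a heating-rate profile plus the two boundary fluxes.

```python
# Minimal sketch (PyTorch) of the emulator's input/output structure.  A plain
# 1-D convolutional stand-in for the U-net++ used in the study; all sizes and
# names below are illustrative assumptions, not the authors' configuration.
import torch
import torch.nn as nn

N_LEVELS = 127        # assumed number of heights in the vertical grid
N_PREDICTORS = 5      # e.g., temperature, humidity, liquid water, ice water, ozone

class ToyRTEmulator(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # Feature extractor over the vertical dimension.
        self.encoder = nn.Sequential(
            nn.Conv1d(N_PREDICTORS, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # One heating-rate value per vertical level.
        self.heating_head = nn.Conv1d(hidden, 1, kernel_size=1)
        # Two scalar boundary fluxes: TOA upwelling and surface downwelling.
        self.flux_head = nn.Linear(hidden, 2)

    def forward(self, x):
        # x: (batch, N_PREDICTORS, N_LEVELS)
        features = self.encoder(x)                   # (batch, hidden, N_LEVELS)
        heating_rates = self.heating_head(features)  # (batch, 1, N_LEVELS)
        pooled = features.mean(dim=2)                # (batch, hidden)
        fluxes = self.flux_head(pooled)              # (batch, 2)
        return heating_rates.squeeze(1), fluxes

profiles = torch.randn(8, N_PREDICTORS, N_LEVELS)    # fake batch of GFS-like profiles
hr, fx = ToyRTEmulator()(profiles)
print(hr.shape, fx.shape)                            # torch.Size([8, 127]) torch.Size([8, 2])
```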
We train the NN with five different UQ methods to ensure that results are not specific to one method. These are the multi-model ensemble (MME), which involves training many NNs with different random seeds; the continuous ranked probability score (CRPS) approach, which involves training one NN to produce an ensemble, constrained by the CRPS loss function; the Bayesian neural network (BNN), which involves replacing traditional NN layers with Bayesian layers, where each weight is characterized by a distribution instead of a single value; and two hybrid methods, the MME/CRPS and the BNN/CRPS. The validation data (used for model selection) are moderately perturbed, and the testing data (used for final model assessment) are heavily perturbed. For example, in fictitious clouds (both liquid and ice), the maximum water content is 5 g m⁻³ in the testing data and 2.5 g m⁻³ in the validation data. In the clean training data the maximum water content is only 1.3 g m⁻³, so the testing data are farther out-of-sample than the validation data. Thus, the testing data pose a bigger challenge for the NN and whichever UQ method it is coupled with. We find that all five UQ methods, on both the validation and testing data, are extremely overconfident (produce too little spread) and often produce catastrophic errors, where the confidence interval completely misses the observation. This confirms our hypothesis that ML-based UQ (ML-UQ) fails on out-of-sample data: the uncertainty estimate fails to warn the user of the model's large error.
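The sketch below illustrates two components of this setup: a sample-based CRPS loss for an ensemble-output NN, and a simple catastrophic-error diagnostic based on a percentile confidence interval. It is a minimal sketch, assuming PyTorch and a standard ensemble CRPS estimator; the function names and the diagnostic's exact definition are illustrative assumptions rather than the study's implementation.

```python
# Sample-based CRPS loss and a confidence-interval miss-rate diagnostic.
# Both are illustrative sketches, not the study's exact code.
import torch

def crps_ensemble(ensemble, target):
    """Sample-based CRPS for an ensemble prediction.

    ensemble: (batch, n_members); target: (batch,).
    CRPS is estimated as E|X - y| - 0.5 * E|X - X'| over the ensemble members.
    """
    abs_err = (ensemble - target.unsqueeze(1)).abs().mean(dim=1)              # E|X - y|
    spread = (ensemble.unsqueeze(2) - ensemble.unsqueeze(1)).abs().mean(dim=(1, 2))  # E|X - X'|
    return (abs_err - 0.5 * spread).mean()

def catastrophic_error_rate(ensemble, target, level=0.95):
    """Fraction of cases whose central confidence interval misses the truth."""
    lo = torch.quantile(ensemble, (1 - level) / 2, dim=1)
    hi = torch.quantile(ensemble, 1 - (1 - level) / 2, dim=1)
    missed = (target < lo) | (target > hi)
    return missed.float().mean()

# Example: an overconfident ensemble (tight spread around a biased mean)
# gets a large CRPS and misses the observation with nearly every interval.
ens = 2.0 + 0.01 * torch.randn(1000, 50)
obs = torch.zeros(1000)
print(crps_ensemble(ens, obs))             # ~2.0
print(catastrophic_error_rate(ens, obs))   # ~1.0
```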
The above results motivated another science question: what happens if the models are trained with lightly perturbed data instead of clean data? In other words, what happens if the models "see" a light version of the perturbations added to the validation and testing data? Will this help ML-UQ generalize better? To this end, we repeat the first experiment, training the NNs with lightly perturbed (LP) data instead of clean data. We find that the LP-trained models perform much better than the clean-trained models on the testing data, which are still far out-of-sample with respect to both training sets. This result illustrates the power of triggering each mode of variability – even just a little bit – in the training data. Ebert-Uphoff and Deng (2017; https://doi.org/10.1016/j.cageo.2016.10.008) found a similar result for causal discovery in an environmental-science application: if a causal mechanism is not triggered in the training data, it will not be learned by the model.
However, the LP-trained models still have some concerning properties, as they do not generalize fully to the heavily perturbed testing data. We discuss these properties in detail and provide recommendations for ML-UQ in real-world scenarios where the data distribution might shift, such as climate change.
Disclaimer: The scientific results and conclusions, as well as any views or opinions expressed herein, are those of the author(s) and do not necessarily reflect those of NOAA or the Department of Commerce.

