3.2 Identifying Data Sources and Physical Strategies Used By Neural Networks to Predict TC Rapid Intensification

Monday, 29 January 2024: 2:00 PM
345/346 (The Baltimore Convention Center)
Ryan A. Lagerquist, CIRA and NOAA/ESRL/GSL, Boulder, CO; and J. Knaff, C. Slocum, K. Musgrave, and I. Ebert-Uphoff

Tropical cyclone (TC) warning centers around the world issue forecasts several times a day for active TCs, including track and intensity. Within intensity forecasting, rapid intensification (RI) is a key priority; we use the 30-kt-in-24-hours definition of RI. While RI remains a difficult problem, RI forecasts have improved considerably since the late 2000s, thanks partly to machine learning (ML). A popular ML algorithm in atmospheric science is the convolutional neural network (NN). When spatial data (e.g., satellite images) are available, a NN can ingest the images directly, whereas traditional ML relies on scalar metrics summarizing the images. This allows a NN to freely "decide" which spatial features are important for RI prediction, including features that could be missed by human-derived scalar metrics. However, two questions remain: (1) What is the influence of different data sources on RI skill? For example, does the use of full satellite images lead to a significant improvement in RI skill? (2) When NNs are applied to full image data, what strategies do they use for RI prediction? Can we learn anything new?
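
As a concrete illustration of this definition, the minimal sketch below derives RI labels from a time series of TC intensities; the function name and the assumption of 6-hourly sampling are illustrative and not taken from the abstract.

```python
import numpy as np

def label_rapid_intensification(intensity_kt, threshold_kt=30.0, window_steps=4):
    """Label each time step as RI (True) if intensity rises by >= threshold_kt
    over the next `window_steps` steps.

    Assumes 6-hourly intensities, so 24 hours = 4 steps; this sampling
    interval is an illustrative assumption, not stated in the abstract.
    """
    intensity_kt = np.asarray(intensity_kt, dtype=float)
    labels = np.zeros(intensity_kt.size, dtype=bool)
    num_labelable = intensity_kt.size - window_steps

    if num_labelable > 0:
        future_change_kt = intensity_kt[window_steps:] - intensity_kt[:-window_steps]
        labels[:num_labelable] = future_change_kt >= threshold_kt

    return labels

# Example: a TC intensifying from 50 to 85 kt within 24 hours is labeled RI.
print(label_rapid_intensification([50, 60, 70, 80, 85, 85]))
```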

To answer question #1, we perform an ablation experiment, training NNs with and without different data sources. The data sources used are TC-centered satellite imagery from the Cooperative Institute for Research in the Atmosphere (CIRA) infrared (IR) archive and scalars from the Statistical Hurricane Intensity Prediction Scheme (SHIPS) developmental data. We split the SHIPS variables into three categories: satellite-based (scalar statistics based on GOES imagery), historical (current and recent TC intensity), and environmental (describing the near-TC atmosphere and ocean). We find that NN performance is controlled mainly by the amount of SHIPS data used; denying any set of SHIPS variables (satellite-based, environmental, or historical) worsens performance. NN performance depends only weakly on the amount of IR data used: it improves slightly as the IR sequence length is increased from 0 to 1 time step but does not improve further as the sequence length is increased beyond 1 time step. In other words, a single IR image improves RI prediction slightly, but a more extensive IR-image evolution does not.
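
A minimal sketch of how such an ablation grid could be enumerated is shown below. The three SHIPS groups and the IR sequence lengths mirror the description above, but the configuration format and the commented-out training and evaluation calls are hypothetical.

```python
import itertools

SHIPS_GROUPS = ("satellite", "historical", "environmental")
IR_SEQUENCE_LENGTHS = (0, 1, 2, 3)  # number of TC-centered IR time steps

def ablation_configs():
    """Yield one configuration per combination of data sources."""
    for num_groups in range(len(SHIPS_GROUPS) + 1):
        for ships_subset in itertools.combinations(SHIPS_GROUPS, num_groups):
            for ir_length in IR_SEQUENCE_LENGTHS:
                if not ships_subset and ir_length == 0:
                    continue  # Skip the configuration with no data sources.
                yield {
                    "ships_groups": list(ships_subset),
                    "ir_sequence_length": ir_length,
                }

# for config in ablation_configs():
#     model = train_nn(config)      # hypothetical training routine
#     scores = evaluate_nn(model)   # hypothetical evaluation routine
```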

We focus on three particular NN models: the best NN overall (regardless of which data sources it uses), the best NN without IR data, and the best NN without SHIPS data. We find that the best NN overall -- which uses one IR image and all SHIPS predictors -- performs only slightly better than the NN without IR data but much better than the NN without SHIPS data. In other words, SHIPS data are more important than IR data. We also compare the three NNs to three baselines, all existing models used in operations: the SHIPS Rapid Intensification Index (RII), the SHIPS consensus, and the Deterministic-to-probabilistic Statistical Model (DTOPS). The two NNs with SHIPS data produce better yes/no forecasts (indicated by, e.g., area under the ROC curve [AUC]), but worse probabilistic forecasts (indicated by the attributes diagram and Brier skill score [BSS]), than the baseline models. Meanwhile, the NN without SHIPS data performs worse than the baseline models in all aspects. However, all NNs perform uncertainty quantification (UQ), outputting a 5000-member ensemble of RI probabilities, while the baseline models are deterministic, producing a single probability. The ensembles are well calibrated -- e.g., ensemble spread is highly correlated with error -- meaning that ensemble spread could be useful as a "go/no-go" criterion in operations, to decide when to trust vs. override a NN. Furthermore, the NN without SHIPS data performs surprisingly well, given that its predictors include only one satellite image: on independent testing data, it achieves an AUC of 0.80, a CSI of 0.16, and a BSS of 0.21. Since the NN without SHIPS data is least correlated with (most independent of) the other NNs and the baseline models, it could be a useful addition to an ensemble like the SHIPS consensus.
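
For reference, the sketch below shows one way the headline verification quantities (AUC, BSS, CSI) and a simple spread-error correlation could be computed from an ensemble of RI probabilities. The use of the ensemble mean and the fixed 0.5 yes/no threshold are illustrative assumptions, not the study's evaluation procedure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def summarize_ensemble_forecasts(prob_ensemble, labels, yes_threshold=0.5):
    """Verification scores for ensemble RI probabilities.

    prob_ensemble : (n_samples, n_members) array of RI probabilities, e.g.,
                    the 5000-member ensembles described above.
    labels        : (n_samples,) array of 0/1 observed RI outcomes.
    yes_threshold : probability cutoff for yes/no forecasts (illustrative; an
                    operational threshold would be tuned rather than fixed).
    """
    prob_ensemble = np.asarray(prob_ensemble, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mean_prob = prob_ensemble.mean(axis=1)
    spread = prob_ensemble.std(axis=1)

    auc = roc_auc_score(labels, mean_prob)

    # Brier skill score relative to climatology (the RI base rate).
    brier = np.mean((mean_prob - labels) ** 2)
    brier_climo = np.mean((labels.mean() - labels) ** 2)
    bss = 1.0 - brier / brier_climo

    # Critical success index (CSI) for deterministic yes/no forecasts.
    yes_forecast = mean_prob >= yes_threshold
    hits = np.sum(yes_forecast & (labels == 1))
    false_alarms = np.sum(yes_forecast & (labels == 0))
    misses = np.sum(~yes_forecast & (labels == 1))
    csi = hits / (hits + false_alarms + misses)

    # Spread-error correlation: checks that ensemble spread tracks error.
    spread_error_corr = np.corrcoef(spread, np.abs(mean_prob - labels))[0, 1]

    return {"AUC": auc, "BSS": bss, "CSI": csi,
            "spread_error_correlation": spread_error_corr}
```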

To address question #2 from the introduction, we use explainable artificial intelligence (XAI), namely the DeepSHAP method. DeepSHAP produces an attribution map for each sample, with one attribution value ("RI evidence") at every IR image pixel. However, DeepSHAP, like most XAI methods, produces pixel-level noise. Our solution is to use eigenanalysis to find patterns in the Shapley maps from the whole testing dataset. Specifically, we use maximum-covariance analysis (MCA) to find the leading modes of covariability between the predictor (brightness temperature) and attribution (Shapley) fields. We find that the leading mode (explaining 93.2% of the covariance) involves nearly axisymmetric coverage of deep convection around the TC center, which the NN uses as positive evidence for RI, while the second and third modes involve non-axisymmetric coverage, biased to one side of the TC center. Although certain configurations of these two modes may suggest future TC evolution (it is known that convection often evolves from non-axisymmetric to axisymmetric during RI), Shapley values in these modes indicate that the NN sees warm cloud tops -- on the side of the TC center without deep convection -- as evidence against RI. In other words, the NN looks for cold cloud tops everywhere and leverages only axisymmetric patterns of deep convection. The areal coverage of cold cloud tops around the TC center is already described in several satellite-based SHIPS predictors; this is likely a key reason that NNs without full IR images perform nearly as well as those with the images.
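
As an illustration of the MCA step, the sketch below computes paired spatial patterns from flattened brightness-temperature and Shapley maps via a singular value decomposition of their cross-covariance matrix. Preprocessing choices (normalization, area weighting, any dimensionality reduction for large image grids) are omitted and may differ from the study's implementation.

```python
import numpy as np

def maximum_covariance_analysis(brightness_temps, shapley_maps, num_modes=3):
    """Leading modes of covariability between predictor and attribution maps.

    Both inputs have shape (n_samples, n_rows, n_columns).  Fields are
    flattened and centered, and the SVD of their cross-covariance matrix
    yields paired spatial patterns (one per field) for each mode.
    """
    X = brightness_temps.reshape(brightness_temps.shape[0], -1)
    Y = shapley_maps.reshape(shapley_maps.shape[0], -1)
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Cross-covariance between brightness-temperature and Shapley pixels.
    cross_cov = X.T @ Y / (X.shape[0] - 1)

    # Singular vectors are the coupled spatial patterns; squared singular
    # values give the fraction of squared covariance explained by each mode.
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    explained_fraction = s ** 2 / np.sum(s ** 2)

    bt_patterns = u[:, :num_modes].T.reshape(
        (num_modes,) + brightness_temps.shape[1:]
    )
    shap_patterns = vt[:num_modes].reshape((num_modes,) + shapley_maps.shape[1:])
    return bt_patterns, shap_patterns, explained_fraction[:num_modes]
```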

Overall, our results suggest that exploiting full IR images with NNs yields only a small gain in the quality of RI predictions. However, the NN without SHIPS data -- using only a single IR image -- performs surprisingly well and could be a valuable addition to operational ensembles. Also, the uncertainty quantification provided by our NNs could be a valuable tool in assessing their trustworthiness on a case-by-case basis.

Disclaimer: The scientific results and conclusions, as well as any views or opinions expressed herein, are those of the author(s) and do not necessarily reflect those of NOAA or the Department of Commerce.
