A Self-Organizing Map (SOM) is an unsupervised clustering algorithm that, when applied to gridded atmospheric variables, identifies common daily weather patterns (the attached figure shows a trained SOM for 500 hPa geopotential height anomalies). We train SOMs on ECMWF Reanalysis v5 (ERA5) daily anomalies for June and July of 1959 to 2022 for the following variables: 500 hPa geopotential height, sea level pressure, 2-meter temperature, 850 hPa temperature, convective available potential energy, and total column water vapor.
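To illustrate how a trained SOM maps one day's field to one of its learned patterns, the sketch below finds the node whose weight vector is closest to the input in Euclidean distance. The 2 x 2 grid, the four-pixel "fields", and the function name `best_matching_unit` are hypothetical stand-ins chosen for illustration; the SOM dimensions and distance settings used in this study are not stated here.

```python
import math


def best_matching_unit(field, node_weights):
    """Return the (row, col) of the SOM node closest to `field`."""
    best_pos, best_dist = None, math.inf
    for row, nodes in enumerate(node_weights):
        for col, weights in enumerate(nodes):
            # Euclidean distance between the daily field and this node's weights.
            dist = math.dist(field, weights)
            if dist < best_dist:
                best_pos, best_dist = (row, col), dist
    return best_pos


# Hypothetical trained SOM: a 2 x 2 grid of nodes, each a 4-pixel pattern.
som = [
    [[1.0, 1.0, -1.0, -1.0], [1.0, -1.0, 1.0, -1.0]],
    [[-1.0, 1.0, -1.0, 1.0], [-1.0, -1.0, 1.0, 1.0]],
]

# Hypothetical daily anomaly field, flattened to one value per pixel.
day = [0.8, -0.9, 1.1, -0.7]
bmu = best_matching_unit(day, som)
print(bmu)
```

In this toy case the day is assigned to the node in row 0, column 1, whose pattern it most resembles.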
In this study, we classify days as low, medium, or high by tercile of daily lightning counts computed from the Alaska Lightning Detection Network historical lightning dataset. Each daily record is associated with a weight in the 2D SOM network. This reduces the dimensionality of the raw gridded data from about 200,000 values per day (34,001 pixels for each of 6 variables) to just 12. The reduced data are then used to train a random forest classifier, with an 80-20 train-test split and 5-fold cross-validation for hyperparameter tuning. Table 1 is the confusion matrix for the test dataset.
| Actual \ Predicted | Low | Middle | High |
|---|---|---|---|
| Low | 73 | 28 | 24 |
| Middle | 34 | 37 | 47 |
| High | 12 | 19 | 81 |
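The tercile labelling behind the three classes can be sketched as follows. This is a minimal illustration under stated assumptions: the daily counts are made up, the function name `tercile_labels` is hypothetical, and the choice of the "inclusive" quantile method and the handling of ties at a cut point are assumptions rather than the study's documented procedure.

```python
from bisect import bisect_left
from statistics import quantiles


def tercile_labels(daily_counts):
    """Label each day 'low', 'medium', or 'high' by tercile of its count."""
    # Two cut points split the counts into three roughly equal groups.
    cuts = quantiles(daily_counts, n=3, method="inclusive")
    names = ("low", "medium", "high")
    # bisect_left places a count equal to a cut point in the lower class.
    return [names[bisect_left(cuts, count)] for count in daily_counts]


# Hypothetical daily lightning counts (not taken from the real dataset).
counts = [0, 2, 5, 9, 14, 30, 75, 120, 400]
labels = tercile_labels(counts)
print(list(zip(counts, labels)))
```

Real lightning counts are likely to contain many tied values (for example, days with zero strikes), so the classes may not be exactly equal in size.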
Our model shows skill in classifying low and high tercile lightning days, with mean AUROC and F1 scores of 0.70 and 0.53, respectively (a climatological baseline yields scores of 0.5 and 0.33, respectively). Classification of middle tercile days only marginally outperforms the baseline scores. True upper tercile days are correctly predicted at the highest rate (recall of 0.723). Table 2 summarizes the classification metrics for each class.
| Class | Precision | Recall | F1-Score | AUROC |
|---|---|---|---|---|
| Low | 0.613 | 0.584 | 0.598 | 0.767 |
| Medium | 0.440 | 0.314 | 0.366 | 0.559 |
| High | 0.533 | 0.723 | 0.614 | 0.785 |
| Mean | 0.529 | 0.540 | 0.526 | 0.704 |
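The precision, recall, and F1 values in Table 2 follow directly from the Table 1 counts. The sketch below recomputes them in plain Python; the function name `per_class_metrics` is hypothetical, and the mean row is assumed to be an unweighted (macro) average across the three classes.

```python
# Test-set confusion matrix from Table 1: rows are actual classes,
# columns are predicted classes, both ordered Low, Middle, High.
CONFUSION = [
    [73, 28, 24],
    [34, 37, 47],
    [12, 19, 81],
]
CLASSES = ["Low", "Middle", "High"]


def per_class_metrics(matrix):
    """Return a (precision, recall, f1) tuple for each class."""
    metrics = []
    for k in range(len(matrix)):
        true_pos = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column total
        actual_k = sum(matrix[k])                    # row total
        precision = true_pos / predicted_k
        recall = true_pos / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        metrics.append((precision, recall, f1))
    return metrics


metrics = per_class_metrics(CONFUSION)
for name, (precision, recall, f1) in zip(CLASSES, metrics):
    print(f"{name:<7} {precision:.3f} {recall:.3f} {f1:.3f}")

# Unweighted mean of each metric across classes (assumed macro average).
macro = [sum(column) / len(metrics) for column in zip(*metrics)]
print(f"{'Mean':<7} {macro[0]:.3f} {macro[1]:.3f} {macro[2]:.3f}")
```

AUROC cannot be recovered this way, since it depends on the predicted class probabilities rather than on the final class assignments.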
Test scores improved over validation scores, suggesting the model will perform similarly on new data. Future work will conduct an in-depth model evaluation, including an examination of feature importance and the identification of sources of model error. We also plan to apply this methodology to distinguish lightning days from days without lightning. Finally, we plan to use this model with seasonal dynamical forecasts to construct a multi-model seasonal outlook.

