These classifications are typically done by hand or with simple computational schemes. For example, a threshold on the background level is defined, and all scans above the threshold are discarded. Such a method can discard good scans and pass bad ones. During a measurement period, atmospheric conditions can vary abruptly, causing scans to be flagged incorrectly.
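The threshold scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `flag_by_background`, the use of the last 50 altitude bins as the background estimate, and the threshold value are all hypothetical and are not taken from any lidar processing chain.

```python
from statistics import mean


def flag_by_background(scans, background_bins=50, threshold=20.0):
    """Return one boolean per scan: True if the scan is kept, False if discarded.

    Each scan is a list of photon counts per altitude bin. The background is
    estimated as the mean of the last `background_bins` bins, which is an
    assumption made for this sketch.
    """
    keep = []
    for counts in scans:
        background = mean(counts[-background_bins:])
        # Scans whose background exceeds the threshold are discarded.
        keep.append(background <= threshold)
    return keep


# Synthetic scans: 50 signal bins followed by 50 background bins.
clear = [1000.0] * 50 + [5.0] * 50
noisy = [1000.0] * 50 + [40.0] * 50
borderline = [1000.0] * 50 + [21.0] * 50  # only just above the threshold

flags = flag_by_background([clear, noisy, borderline])
print(flags)  # [True, False, False]
```

The borderline scan is discarded even though its background is only marginally above the cut, which is the kind of misclassification a fixed threshold produces.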
We have applied machine-learning techniques to classify scans from The University of Western Ontario’s Purple Crow Lidar (PCL), a large power-aperture Rayleigh-Raman lidar located in London, Canada. The lidar has been operational since 1992 and has two Rayleigh channels, a Raman nitrogen channel, and a Raman water vapour channel. A PCL scan is the sum of the laser shots over 60 s (1200 or 1800 shots, depending on the year). We have also applied our machine-learning techniques to scans from the RAman Lidar for Meteorological Observations (RALMO), located in Payerne, Switzerland. Using eight measurement channels, RALMO continually measures profiles of temperature, aerosol, and water vapour. Unlike the PCL, which operates only at night, RALMO operates during both day and night. In daytime, the background counts are strongly affected by the Sun, so daytime and nighttime scans are classified separately.
For each of these lidars, we took fewer than 2000 scans, and for each measurement channel we tested several supervised machine-learning algorithms for their classification ability. Among these algorithms, random forests, gradient boosted decision trees, and support vector machines successfully classified lidar scans into “bad scans”, “cloudy measurements”, “clear measurements”, and scans with “lower than usual” laser power. The cross-validated classification accuracy of each of these algorithms was above 95%.
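A comparison of this kind could be set up as in the following minimal sketch, which assumes scikit-learn and NumPy are available. The data are synthetic stand-ins: the three features per scan, the class centres, and all hyperparameters are hypothetical and do not reproduce the features or settings used for the PCL or RALMO.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for labelled scans: four classes, three hypothetical
# features per scan (for example background level, peak signal, signal slope).
class_names = ["bad", "cloudy", "clear", "low laser power"]
centres = np.array([[0, 0, 0], [6, 0, 0], [0, 6, 0], [0, 0, 6]], dtype=float)
n_per_class = 100
X = np.vstack([c + rng.normal(scale=1.0, size=(n_per_class, 3)) for c in centres])
y = np.repeat(np.arange(len(class_names)), n_per_class)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosted trees": GradientBoostingClassifier(random_state=0),
    # The SVM is scale-sensitive, so features are standardized first.
    "support vector machine": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation: every fold is scored on scans not used for fitting.
    fold_accuracy = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    scores[name] = float(fold_accuracy.mean())
    print(f"{name}: mean cross-validated accuracy = {scores[name]:.3f}")
```

With real scans, `X` would hold one row of features per scan and `y` the hand-assigned labels; the accuracies printed here reflect only the synthetic data.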
In machine learning, a common issue arises when the chosen model becomes too complex: it gives highly accurate results on the training set (often approaching 100% accuracy) but fails to predict well on the test set. This is known as the “over-fitting” problem. To guard against over-fitting, we employed cross-validation, in which the data are divided into ‘training’ and ‘test’ subsets, the test data are not used to train the classifier, and the reported accuracy is based solely on the unseen test data. This process is then repeated with different partitions of the dataset into ‘training’ and ‘test’ sets.
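The gap between training accuracy and cross-validated accuracy can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the data are synthetic with 20% label noise, the helper names are hypothetical, and a one-nearest-neighbour rule stands in for an overly flexible classifier because it memorizes its training set.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def predict_1nn(train_x, train_y, x):
    """Return the label of the nearest training point (memorizes the training set)."""
    nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[nearest]


def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test points whose predicted label matches the given label."""
    hits = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_x)


# Synthetic one-feature data: the true class is x > 0.5, with 20% of labels flipped.
rng = random.Random(1)
xs = [rng.random() for _ in range(200)]
ys = [(x > 0.5) != (rng.random() < 0.2) for x in xs]

# Scoring on the training data itself looks perfect because every point is
# its own nearest neighbour.
train_accuracy = accuracy(xs, ys, xs, ys)

# Cross-validation scores each fold only on points withheld from training.
fold_scores = []
for train_idx, test_idx in k_fold_indices(len(xs), k=5):
    fold_scores.append(
        accuracy(
            [xs[j] for j in train_idx], [ys[j] for j in train_idx],
            [xs[j] for j in test_idx], [ys[j] for j in test_idx],
        )
    )
cv_accuracy = sum(fold_scores) / len(fold_scores)

print(f"training accuracy: {train_accuracy:.2f}, cross-validated: {cv_accuracy:.2f}")
```

The training accuracy is perfect by construction, while the cross-validated figure is lower and is the one that indicates performance on unseen scans.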
Machine learning is a robust approach and can easily be adapted to other lidar systems. Using fewer than 2000 scans (less than a week of measurements), we developed code that can classify more than 20 years of measurements. Furthermore, the training phase for each of these algorithms takes on the order of a few seconds. Additionally, we are developing a method based on t-distributed Stochastic Neighbour Embedding (t-SNE) to cluster our lidar scans. t-SNE is an unsupervised method that requires no training set; it is completely data driven. One advantage over the supervised approach is that, if a specific night contains irregular scans of a kind not seen before, they can be separated and clustered as their own group without any training phase.
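One way such an unsupervised step might look is sketched below, assuming scikit-learn and NumPy. t-SNE itself only produces a low-dimensional embedding, so this sketch adds DBSCAN on the embedded points to obtain groups; that pairing is an assumption for illustration and not necessarily the method under development. The data are synthetic, and `eps` and `perplexity` are placeholder values that would need tuning on real scans.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in: two common scan types plus a small "irregular" group,
# each scan described by 10 hypothetical features.
regular_a = rng.normal(loc=0.0, scale=0.5, size=(80, 10))
regular_b = rng.normal(loc=5.0, scale=0.5, size=(80, 10))
irregular = rng.normal(loc=10.0, scale=0.5, size=(20, 10))
X = np.vstack([regular_a, regular_b, irregular])

# Embed the scans in two dimensions; no labels are supplied at any point.
embedding = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(X)

# Group nearby points in the embedding. DBSCAN does not need the number of
# clusters in advance; points it cannot assign are labelled -1 (noise).
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(embedding)

n_clusters = len(set(labels.tolist()) - {-1})
print(f"clusters found: {n_clusters}, points marked as noise: {int((labels == -1).sum())}")
```

Because the t-SNE embedding depends on perplexity and the random seed, the number of groups DBSCAN reports can change between runs and settings, so the result should be inspected rather than trusted blindly.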