However, some data quality issues, such as gradual data drift over time, remain undetectable by the existing quality control tests, while others, such as recurring spikes in a time series, might be better handled by automated methods. Data quality analysts and instrument mentors can often locate these issues manually using visualization plots, but given the extensive scale of the ARM database, manually documenting every time interval with quality issues is extremely time-consuming. To address these challenges, DQO has leveraged machine learning (ML) and deep learning (DL) algorithms. These modeling solutions enable the detection of data quality issues in archived historical data spanning more than 30 years, allowing for the creation of comprehensive and precise data quality reports. Meanwhile, the application of artificial intelligence helps identify anomalies in real-time streaming data, providing prompt, customized alerts to DQO and instrument mentors. This proactive approach allows for immediate intervention, reducing the potential for extended data loss.
The ARM user facility maintains a database of Data Quality Reports (DQRs), which can be harnessed as a labeling source for ML. However, because the DQRs were not originally designed for direct integration with ML, a feasible strategy is needed to generate accurate labeling files for supervised learning without the exhaustive task of manually labeling quality issues in DQ-Zoom, a user-friendly web application for visualizing ARM datastreams.
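To make the labeling idea concrete, the sketch below shows one possible way to turn DQR-reported time intervals into per-timestamp labels for supervised learning. This is a minimal illustration under stated assumptions, not ARM's production code; the `label_from_dqr_intervals` helper, the interval values, and the sampling frequency are all hypothetical.

```python
import pandas as pd

def label_from_dqr_intervals(series_index, dqr_intervals):
    """Return a 0/1 label per timestamp: 1 if the timestamp falls inside
    any DQR-reported quality-issue interval, else 0.

    series_index  : pandas.DatetimeIndex of the datastream timestamps
    dqr_intervals : iterable of (start, end) timestamp pairs from DQR records
    """
    labels = pd.Series(0, index=series_index, dtype="int8")
    for start, end in dqr_intervals:
        labels.loc[start:end] = 1  # mark the reported interval as anomalous
    return labels

# Hypothetical usage: one month of 1-minute data and two intervals
# taken from a DQR record.
index = pd.date_range("2020-12-01", "2020-12-31 23:59", freq="1min")
intervals = [("2020-12-05 03:00", "2020-12-05 06:30"),
             ("2020-12-18 14:10", "2020-12-18 14:55")]
labels = label_from_dqr_intervals(index, intervals)
print(labels.sum(), "of", len(labels), "timestamps flagged")
```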
After a comprehensive review and comparison of various unsupervised and supervised solutions, a method known as ‘transfer learning’ has been validated using 30 years of relative humidity data from an ARM datastream. This approach consists of four key steps:
- Manually labeling data spikes in a three-month segment of a meteorological datastream from the North Slope of Alaska (NSA) site, as highlighted by DQR#D201216.2 - Intermittent Drops in Humidity Data.
- Training a binary classification model to detect spikes using this labeled data.
- Applying the trained model to detect spikes in another datastream from the Southern Great Plains (SGP) site, and then creating the corresponding labels (a minimal sketch of this train-and-transfer step follows this list).
- Progressively identifying more spikes over an extensive period of historical data and incrementally enlarging the dataset for ML model training, which enhances the accuracy of the resulting models.
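As an illustration of the train-and-transfer step, the sketch below builds simple sliding-window features from a labeled relative humidity series, trains a binary spike classifier, and reuses it to propose labels for an unlabeled series from another site. This is a simplified assumption, not the exact DQO pipeline; the classifier choice, window size, and synthetic data are all hypothetical stand-ins for the NSA and SGP datastreams.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

WINDOW = 11  # odd window length centered on the point being classified

def window_features(values, window=WINDOW):
    """Stack each point with its neighbors so the classifier sees local shape."""
    half = window // 2
    padded = np.pad(values, half, mode="edge")
    return np.stack([padded[i:i + len(values)] for i in range(window)], axis=1)

def train_spike_classifier(values, labels):
    """Fit a binary classifier on a manually labeled segment (e.g., NSA data)."""
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                 random_state=0)
    clf.fit(window_features(values), labels)
    return clf

def transfer_labels(clf, values):
    """Apply the trained classifier to an unlabeled datastream (e.g., SGP data)
    to propose candidate spike labels."""
    return clf.predict(window_features(values))

# Hypothetical usage with synthetic relative humidity series:
rng = np.random.default_rng(0)
nsa_rh = 80 + rng.normal(0, 0.5, 5000)
nsa_labels = np.zeros(5000, dtype=int)
spike_idx = rng.choice(5000, 50, replace=False)
nsa_rh[spike_idx] -= 30          # inject intermittent drops
nsa_labels[spike_idx] = 1

sgp_rh = 60 + rng.normal(0, 0.5, 5000)
sgp_rh[rng.choice(5000, 40, replace=False)] -= 25

clf = train_spike_classifier(nsa_rh, nsa_labels)
sgp_labels = transfer_labels(clf, sgp_rh)
print("Proposed spike labels on SGP series:", int(sgp_labels.sum()))
```

In practice, the proposed labels for the new datastream would be spot-checked (for example, in DQ-Zoom) before being folded back into the training set, which is how the labeled dataset is incrementally enlarged over longer periods of historical data.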
Transfer learning has significantly improved our labeling efficiency, greatly accelerating the progress of anomaly detection within ARM data. Once an adequate number of labels has been generated, deep learning algorithms such as deep neural networks, convolutional neural networks (commonly used for computer vision), and recurrent neural networks can be applied to large-scale anomaly detection tasks for ARM.
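For a sense of what such a model could look like, the sketch below defines a small 1-D convolutional network that classifies a fixed-length window of readings as containing a spike or not. This is one possible architecture, offered as an assumption rather than the network ARM uses; the layer sizes, window length, and synthetic training data are illustrative only, and the real labels would come from the transfer-learning step above.

```python
import numpy as np
import tensorflow as tf

WINDOW = 64  # length of each time-series segment fed to the network

def build_spike_cnn(window=WINDOW):
    """A small 1-D CNN that labels a window of readings as spike (1) or clean (0)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window, 1)),
        tf.keras.layers.Conv1D(16, kernel_size=5, activation="relu"),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical usage: windows and labels produced by the labeling pipeline
# would replace this synthetic data.
x = np.random.rand(1000, WINDOW, 1).astype("float32")
y = np.random.randint(0, 2, size=(1000,))
model = build_spike_cnn()
model.fit(x, y, epochs=2, batch_size=32, validation_split=0.2)
```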

