However, some data quality issues, such as gradual data drift over time, remain undetectable by the existing quality control tests, while others, such as recurring spikes in a time series, might be better handled by automated methods. Data quality analysts and instrument mentors can often locate these issues manually using visualization plots, but given the extensive scale of the ARM database, manually documenting every time interval with quality issues is extremely time-consuming. To address these challenges, DQO has leveraged machine learning (ML) and deep learning (DL) algorithms. These modeling solutions enable the detection of data quality issues in archived historical data spanning more than 30 years, allowing for the creation of comprehensive and precise data quality reports. Meanwhile, the application of artificial intelligence helps identify anomalies in real-time streaming data, providing prompt, customized alerts to DQO and instrument mentors. This proactive approach allows for immediate intervention, reducing the potential for extended data loss.
The ARM user facility maintains a database of Data Quality Reports (DQRs), which can be harnessed as a labeling source for ML. However, because the DQRs were not originally designed for direct integration with ML, a feasible strategy is needed to generate accurate labeling files for supervised learning without the exhaustive task of manually labeling quality issues in DQ-Zoom, a user-friendly web application for visualizing ARM datastreams.
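To make the labeling idea concrete, the sketch below shows one possible way to turn DQR-reported time intervals into per-timestamp labels for supervised learning. This is a minimal illustration under stated assumptions, not ARM's production code; the `label_from_dqr_intervals` helper, the interval values, and the sampling frequency are all hypothetical.

```python
import pandas as pd

def label_from_dqr_intervals(series_index, dqr_intervals):
    """Return a 0/1 label per timestamp: 1 if the timestamp falls inside
    any DQR-reported quality-issue interval, else 0.

    series_index  : pandas.DatetimeIndex of the datastream timestamps
    dqr_intervals : iterable of (start, end) timestamp pairs from DQR records
    """
    labels = pd.Series(0, index=series_index, dtype="int8")
    for start, end in dqr_intervals:
        labels.loc[start:end] = 1  # mark the reported interval as anomalous
    return labels

# Hypothetical usage: one month of 1-minute data and two intervals
# taken from a DQR record.
index = pd.date_range("2020-12-01", "2020-12-31 23:59", freq="1min")
intervals = [("2020-12-05 03:00", "2020-12-05 06:30"),
             ("2020-12-18 14:10", "2020-12-18 14:55")]
labels = label_from_dqr_intervals(index, intervals)
print(labels.sum(), "of", len(labels), "timestamps flagged")
```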
After a comprehensive review and comparison of various unsupervised and supervised solutions, a method known as ‘transfer learning’ has been validated using 30 years of relative humidity data from an ARM datastream. This approach consists of four key steps:
- Manually labeling data spikes in a three-month segment of a meteorological datastream from the North Slope of Alaska (NSA) site, as highlighted by DQR#D201216.2 - Intermittent Drops in Humidity Data.
- Training a binary classification model to detect spikes using this labeled data.
- Applying the trained model to detect spikes in another datastream from the Southern Great Plains (SGP) site, and then creating the corresponding labels (a minimal sketch of this train-and-transfer step follows this list).
- Progressively identifying more spikes over an extensive period of historical data and incrementally enlarging the dataset for ML model training, which enhances the accuracy of the resulting models.
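As an illustration of the train-and-transfer step, the sketch below builds simple sliding-window features from a labeled relative humidity series, trains a binary spike classifier, and reuses it to propose labels for an unlabeled series from another site. This is a simplified assumption, not the exact DQO pipeline; the classifier choice, window size, and synthetic data are all hypothetical stand-ins for the NSA and SGP datastreams.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

WINDOW = 11  # odd window length centered on the point being classified

def window_features(values, window=WINDOW):
    """Stack each point with its neighbors so the classifier sees local shape."""
    half = window // 2
    padded = np.pad(values, half, mode="edge")
    return np.stack([padded[i:i + len(values)] for i in range(window)], axis=1)

def train_spike_classifier(values, labels):
    """Fit a binary classifier on a manually labeled segment (e.g., NSA data)."""
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                 random_state=0)
    clf.fit(window_features(values), labels)
    return clf

def transfer_labels(clf, values):
    """Apply the trained classifier to an unlabeled datastream (e.g., SGP data)
    to propose candidate spike labels."""
    return clf.predict(window_features(values))

# Hypothetical usage with synthetic relative humidity series:
rng = np.random.default_rng(0)
nsa_rh = 80 + rng.normal(0, 0.5, 5000)
nsa_labels = np.zeros(5000, dtype=int)
spike_idx = rng.choice(5000, 50, replace=False)
nsa_rh[spike_idx] -= 30          # inject intermittent drops
nsa_labels[spike_idx] = 1

sgp_rh = 60 + rng.normal(0, 0.5, 5000)
sgp_rh[rng.choice(5000, 40, replace=False)] -= 25

clf = train_spike_classifier(nsa_rh, nsa_labels)
sgp_labels = transfer_labels(clf, sgp_rh)
print("Proposed spike labels on SGP series:", int(sgp_labels.sum()))
```

In practice, the proposed labels for the new datastream would be spot-checked (for example, in DQ-Zoom) before being folded back into the training set, which is how the labeled dataset is incrementally enlarged over longer periods of historical data.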
Transfer learning has significantly improved our labeling efficiency, greatly accelerating the progress of anomaly detection within ARM data. Once an adequate number of labels has been generated, deep learning algorithms such as deep neural networks, convolutional neural networks (commonly used for computer vision), and recurrent neural networks can be applied to large-scale anomaly detection tasks for ARM.
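For a sense of what such a model could look like, the sketch below defines a small 1-D convolutional network that classifies a fixed-length window of readings as containing a spike or not. This is one possible architecture, offered as an assumption rather than the network ARM uses; the layer sizes, window length, and synthetic training data are illustrative only, and the real labels would come from the transfer-learning step above.

```python
import numpy as np
import tensorflow as tf

WINDOW = 64  # length of each time-series segment fed to the network

def build_spike_cnn(window=WINDOW):
    """A small 1-D CNN that labels a window of readings as spike (1) or clean (0)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window, 1)),
        tf.keras.layers.Conv1D(16, kernel_size=5, activation="relu"),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical usage: windows and labels produced by the labeling pipeline
# would replace this synthetic data.
x = np.random.rand(1000, WINDOW, 1).astype("float32")
y = np.random.randint(0, 2, size=(1000,))
model = build_spike_cnn()
model.fit(x, y, epochs=2, batch_size=32, validation_split=0.2)
```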

