9 A transferable framework to evaluate the quality of large data sets

Monday, 12 May 2014
Bellmont BC (Crowne Plaza Portland Downtown Convention Center Hotel)
Derek E. Smith, National Ecological Observatory Network (NEON), Boulder, CO; and S. Metzger and J. R. Taylor

The ability to assess the validity of data is essential to any investigation, and manual “eyes on” assessments of data quality have dominated in the past. Yet, as the size of collected data continues to increase, so does the required effort to assess their quality. This challenge is of particular concern for automated data collection in meteorological networks, and has resulted in the automation of many quality assurance and quality control (QA/QC) analyses. Unfortunately, the interpretation of the resulting data quality flags can become quite challenging with large data sets.

Therefore, we have developed an automated framework to summarize data quality information and facilitate its interpretation by the user. Briefly, the framework first compiles data quality information and then presents it through two separate mechanisms: a quality report and a quality summary. The quality report presents the results of specific quality analyses as they relate to individual observations. The quality summary takes a spatial or temporal aggregate of each quality analysis and provides a summary of the results. Included in the quality summary is a final quality flag, which further condenses the data quality information to indicate whether a data product is valid overall. The same framework can also incorporate “eyes on” information on data quality, e.g. for physically collected samples. Furthermore, the framework can aid problem tracking and resolution should sensor or system malfunctions arise.
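The relationship between the quality report, the quality summary, and the final quality flag can be illustrated with a minimal Python sketch. It is not NEON's implementation: the function names, the flag convention (1 = failed, 0 = passed), and the 10 % failure threshold used for the final flag are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical convention: the quality report maps each QA/QC test name to
# one flag per individual observation (1 = failed, 0 = passed).
QualityReport = Dict[str, List[int]]


def quality_summary(report: QualityReport) -> Dict[str, float]:
    """Aggregate each test over the averaging window: percent of failed observations."""
    return {test: 100.0 * sum(flags) / len(flags) for test, flags in report.items()}


def final_quality_flag(report: QualityReport, max_fail_pct: float = 10.0) -> int:
    """Condense all tests into one flag (1 = product suspect, 0 = product valid).

    Assumed rule: the product is suspect when the share of observations that
    failed at least one test exceeds max_fail_pct (an illustrative threshold).
    """
    n_obs = len(next(iter(report.values())))
    failed_any = sum(
        1 for i in range(n_obs) if any(flags[i] for flags in report.values())
    )
    return 1 if 100.0 * failed_any / n_obs > max_fail_pct else 0


# Example: four observations, two hypothetical tests.
example_report = {"range": [0, 0, 1, 0], "spike": [0, 0, 0, 0]}
example_summary = quality_summary(example_report)
example_flag = final_quality_flag(example_report)
```

In this example the quality summary reports 25 % failures for the range test and 0 % for the spike test, and the final quality flag is 1 because more than 10 % of the observations failed at least one test.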

The National Ecological Observatory Network (NEON) has implemented this framework for terrestrial sensor data in order to provide transparent data quality information to its users. To put this framework in perspective, over 150 terrestrial sensor observations will be made at a typical NEON site. Generally, 1- and 30-minute averages are produced from sensor observations that are typically acquired at a rate of 1 Hz, each subject to 8 different QA/QC analyses. Accordingly, the QA/QC results for terrestrial sensors number upwards of 1 × 10^8 per day, 2 × 10^9 per month, and 3 × 10^10 per year. Through the presented framework, the QA/QC information can be condensed by roughly 4 orders of magnitude for 1-minute averages and 6 orders of magnitude for 30-minute averages.
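The stated volumes can be checked with a short back-of-the-envelope calculation. The sketch below assumes exactly 150 sensors, all sampled at 1 Hz with 8 QA/QC analyses per observation, a 30-day month, and a 365-day year; these simplifications are assumptions, and the variable names are illustrative only.

```python
# Back-of-the-envelope count of QA/QC results at one site (assumed values).
N_SENSORS = 150          # terrestrial sensor observations at a typical site
SAMPLE_RATE_HZ = 1       # typical acquisition rate
N_TESTS = 8              # QA/QC analyses per observation
SECONDS_PER_DAY = 24 * 60 * 60

results_per_day = N_SENSORS * SAMPLE_RATE_HZ * SECONDS_PER_DAY * N_TESTS
results_per_month = results_per_day * 30    # assuming a 30-day month
results_per_year = results_per_day * 365    # assuming a 365-day year

print(f"per day:   {results_per_day:.1e}")
print(f"per month: {results_per_month:.1e}")
print(f"per year:  {results_per_year:.1e}")
```

Under these assumptions the daily count is 103,680,000, i.e. about 1 × 10^8, and the monthly and yearly counts likewise fall at or above the rounded figures quoted above.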
