J9.6 Comprehensive Large Array-data Stewardship System (CLASS): Data preservation activities

Thursday, 27 January 2011: 4:45 PM
607 (Washington State Convention Center)
Robert Rank, NOAA/NESDIS, Suitland, MD; and S. McCormick and C. Cremidis

The Comprehensive Large Array-data Stewardship System (CLASS) is the National Oceanic and Atmospheric Administration's (NOAA) enterprise-wide information technology system designed to support long-term, secure preservation of, and standards-based access to, environmental data collections and information. The system is owned and operated by the NOAA National Data Centers and supports ingest, quality control, archival storage of, and public access to, data and science information.

NOAA has directed CLASS to adopt the principles defined in the Open Archival Information System Reference Model (OAIS-RM). The OAIS-RM emphasizes the need for data preservation. CLASS is responsible for preservation of the data at the byte level, i.e., ensuring that all data that need to be archived are actually received and submitted for archival storage by CLASS; that data are not corrupted while being transported from the producer to CLASS; that data are properly and redundantly archived; that the archival storage area is continuously monitored; and that consumers have access to sufficient information to help them assess the integrity and completeness of the data received from CLASS. To meet these requirements, CLASS has adopted a set of practices, procedures, and tools that together allow CLASS to assure data preservation at the byte level.

For Data Acquisition, CLASS has instituted a requirement for Submission Manifests. Submission Manifests are XML-formatted files that describe the data submitted to CLASS for archival storage; among other tags, a Submission Manifest contains the name, size, and checksum value of each file. A Submission Manifest can contain information for one or more files, but every file submitted to CLASS for archival storage must be included in a Submission Manifest. CLASS uses the information in the Submission Manifest to validate the integrity of each file that is being ingested. Along with Submission Manifests, the producer also sends CLASS two activity reports: (1) the file activity report, which lists all files that were submitted for archival during the reporting period, and (2) the manifest activity report, which lists all Submission Manifests that were submitted for processing by CLASS during the reporting period. Combining the Submission Manifests with the two activity reports helps CLASS ensure the integrity and completeness of data submitted for archival storage.

For Archival Storage, CLASS has adopted the following procedures. For each file ingested by CLASS, the administrative metadata include a checksum value that can be used for future integrity validation. Files are stored on tape robotics at two locations: the National Climatic Data Center (NCDC) in Asheville, NC and the National Geophysical Data Center (NGDC) in Boulder, CO. Data are transferred from the ingest site to the archival storage sites using Fast Data Transfer (FDT), a tool developed by CERN that, along with high-performance data transfers, includes byte integrity validation (via checksums), thereby ensuring that data are not corrupted during transfers. Additionally, all data have their checksums validated before being migrated to tape.
Finally, CLASS has deployed StorSentry, a commercial software product that monitors the performance and health of tape libraries, tape drives, and tapes. It identifies recurrent errors and alerts operators to possible issues with tapes and tape drives, allowing them to take corrective action before media are corrupted or data become unavailable. Together, these procedures help CLASS ensure that data are not corrupted in transit from the receiving location to the archival storage areas at the two geographically separate sites, and that the archival storage areas are continuously monitored for any indication of corruption so that CLASS operators can take early preventive action.
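
The ingest-side checks described above (validating each file against its Submission Manifest entry, then reconciling the manifests against the file activity report) can be illustrated with a minimal sketch. This is not the CLASS implementation: the XML tag names (manifest, file, name, size, checksum), the choice of SHA-256 as the checksum algorithm, and the function names are all assumptions made for illustration, since the actual manifest schema and checksum type are not given here.

```python
import hashlib
import xml.etree.ElementTree as ET


def parse_manifest(xml_text):
    """Return {file name: (size in bytes, hex checksum)}.

    The tag names used here are hypothetical, not the real CLASS schema.
    """
    root = ET.fromstring(xml_text)
    entries = {}
    for node in root.iter("file"):
        name = node.findtext("name")
        size = int(node.findtext("size"))
        checksum = node.findtext("checksum").strip().lower()
        entries[name] = (size, checksum)
    return entries


def validate_file(data, size, checksum):
    """True if the received bytes match the manifest's size and checksum.

    SHA-256 is an assumption; the source does not name the algorithm.
    """
    return len(data) == size and hashlib.sha256(data).hexdigest() == checksum


def reconcile(manifest_entries, reported_names):
    """Compare manifest contents with a file activity report.

    Returns (reported but in no manifest, in a manifest but not reported).
    """
    listed = set(manifest_entries)
    reported = set(reported_names)
    return sorted(reported - listed), sorted(listed - reported)


# Build a one-file manifest for a small in-memory payload.
payload = b"hello CLASS"
manifest_xml = (
    "<manifest><file>"
    "<name>granule_001.dat</name>"
    f"<size>{len(payload)}</size>"
    f"<checksum>{hashlib.sha256(payload).hexdigest()}</checksum>"
    "</file></manifest>"
)

entries = parse_manifest(manifest_xml)
size, checksum = entries["granule_001.dat"]

intact_ok = validate_file(payload, size, checksum)            # unmodified bytes
corrupted_ok = validate_file(b"hello CLASs", size, checksum)  # one byte changed

# The activity report names a second file that no manifest describes.
missing, unreported = reconcile(entries, ["granule_001.dat", "granule_002.dat"])
```

In this sketch the per-file check catches corruption in transit, while the reconciliation step catches files that the producer reports as sent but that never arrived with a manifest, which is the completeness half of the requirement.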

For Dissemination, CLASS allows consumers to request delivery of Dissemination Manifests along with the data. Dissemination Manifests, like Submission Manifests, contain sufficient information to allow the consumer to validate the integrity of the data delivered from the CLASS system. Additionally, consumers can request a digital signature for each file delivered, which allows them to verify that CLASS is the source of the data. The combination of all these tools and procedures allows CLASS to provide the scientific community with a high level of confidence that data received from CLASS are free of corruption and faithfully reflect the data submitted by the producer.
