92nd American Meteorological Society Annual Meeting (January 22-26, 2012)

Thursday, 26 January 2012: 2:30 PM
Event-Driven Data Management and Processing Using An Integrated Rule Oriented Data System (iRODS)
Room 357 (New Orleans Convention Center)
Kenneth Galluppi, Arizona State University, Tempe, AZ; and R. W. Moore, L. Brieger, B. R. Nelson, and A. Hall

The organization, sharing, and analysis of environmental data are greatly facilitated through policy-based data management systems, such as the integrated Rule Oriented Data System (iRODS). The iRODS system is software middleware that organizes distributed data into a shareable collection. The collections can be organized as institutional repositories, such as the National Climatic Data Center (NCDC) archive of climate data records. Through application of management policies, the same collection can be shared with other institutions within a data grid. Processing pipelines can be constructed that retrieve data from the data grid, generate derived data products, and then store the results back into the repository. The data grid middleware facilitates access to the original and derived data through a wide variety of user interfaces, ranging from web browsers, to workflow systems, to digital library interfaces, to file system interfaces. The Renaissance Computing Institute (RENCI) and the Data-Intensive Cyber Environments (DICE) group at UNC, in collaboration with NCDC, have constructed a data processing pipeline for radar data sets. The data are accessed at NCDC through the iRODS data grid, replicated to storage resources at RENCI, re-analyzed to generate new estimates of precipitation, stored locally at RENCI, and replicated back to NCDC.

The National Climatic Data Center is adopting Q2, the NSSL's second-generation Quantitative Precipitation Estimate, which provides 1-km spatial resolution with 5-min temporal resolution. This will allow NOAA to build up a central archive of accurate, high-quality precipitation information, addressing issues in the Stage IV, Multi-sensor Precipitation Estimate product and supporting local-scale climate analysis and climate monitoring. As an exploratory pilot project, NCDC and RENCI seek to expand the current 3-year Q2 reanalysis to a 10-year period of record. iRODS is being used to handle all data management and to automate the use of computational resources. iRODS allows distributed data to be viewed as a global collection, supports controlled data sharing, enables server-side computational data services, and offers a framework for implementing data management policies. iRODS provides data infrastructure that can be specialized to a community's particular data support needs; it uses event-driven actions to automate data-oriented services.

In this project, sets of NCDC data are sent to RENCI, where Q2 reanalyses are computed, and the results are sent back to NCDC as well as stored in the local data grid. A specialized iRODS data grid at RENCI is set up to receive incoming data from NCDC; data ingestion from NCDC (one trigger event) prompts the automatic unpacking of the incoming tarballs and the sorting of the files into (possibly pre-existing) directories, using NCDC's own sorting scripts. At the end of the Q2 workflows, the arrival of final results in an output iRODS collection (another trigger event) causes the results to be automatically sent to NCDC's FTP servers, from which data are collected for ingestion into archives behind a firewall.
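The two trigger events described above can be sketched in plain Python. This is only an illustration of the event-driven pattern, not the project's implementation: iRODS expresses such triggers as rules in its own rule language, which this sketch does not use. The event names `ingest` and `result`, the suffix-based sorting (a stand-in for NCDC's own sorting scripts), and the `delivered` list (a stand-in for the FTP transfer) are all assumptions made for the example.

```python
import tarfile
from pathlib import Path
from typing import Callable, Dict, List

Handler = Callable[[Path], None]

# Registry mapping a (hypothetical) event name to the actions it triggers.
_handlers: Dict[str, List[Handler]] = {}

# Stand-in for the FTP transfer to NCDC: we only record what would be sent.
delivered: List[str] = []


def on(event: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a function as an action for an event."""
    def register(fn: Handler) -> Handler:
        _handlers.setdefault(event, []).append(fn)
        return fn
    return register


def fire(event: str, path: Path) -> None:
    """Run every action registered for the event, in registration order."""
    for fn in _handlers.get(event, []):
        fn(path)


@on("ingest")
def unpack_and_sort(tarball: Path) -> None:
    """Unpack an incoming tarball and sort its files into directories.

    Sorting by file suffix is an assumption; the real pipeline uses
    NCDC's own sorting scripts.
    """
    dest_root = tarball.parent / "sorted"
    with tarfile.open(tarball) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            name = Path(member.name).name  # drop directory components
            bucket = dest_root / (Path(name).suffix.lstrip(".") or "misc")
            bucket.mkdir(parents=True, exist_ok=True)  # may already exist
            src = tar.extractfile(member)
            (bucket / name).write_bytes(src.read())


@on("result")
def queue_for_delivery(result: Path) -> None:
    """Record a final result as sent; a real action would transfer the file."""
    delivered.append(result.name)
```

Calling `fire("ingest", Path("incoming.tar"))` when a tarball lands, or `fire("result", ...)` when a final product appears, runs the registered actions without anyone having to start them by hand, which is the property the paragraph above relies on.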

By automating the preparation and sorting of incoming data, the iRODS data grid has eliminated the need for special attention to data preparation as a prerequisite for the Q2 processing; preparation is simply done as soon as new data arrive. Similarly, because the data grid automatically sends output results to their destination at NCDC, results arrive there as soon as they are generated at RENCI, with no delay and no human intervention. Such actions, triggered by events such as the arrival or deletion of data in the iRODS data grid, allow us to implement many other policies or management processes as well, including control of computational processes (workflows). The iRODS infrastructure must simply be customized to implement the particular requirements of a given community. This is a very powerful tool for supporting data-centric environments.

Through use of the iRODS data grid, the multiple steps of the pipeline can be automated, eliminating much manual data management labor. At the same time, assertions can be made about the resulting data products that are based upon the data processing pipeline. Since all of the data within the record series are examined, properties of the collection can be analyzed and recorded. Examples are assertions about the completeness of the data coverage, the accuracy (error bars) of the precipitation estimates, and the algorithms used in the analysis. The provenance information describing each data product can be stored as metadata with each file, and registered into a database to support discovery.
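As a rough illustration of a completeness assertion, the sketch below counts how many of the expected 5-min time steps in one day are present and packages the result as a small provenance record. The 5-min step comes from the Q2 temporal resolution quoted above; the function names, the record fields, and the algorithm label passed in are hypothetical and are not taken from the project.

```python
from datetime import date, datetime, timedelta
from typing import Dict, Iterable, List, Optional, Union

STEP = timedelta(minutes=5)  # Q2 temporal resolution quoted in the abstract


def expected_times(day: date) -> List[datetime]:
    """All time steps expected for one day: 24 h / 5 min = 288 steps."""
    start = datetime(day.year, day.month, day.day)
    steps_per_day = int(timedelta(days=1) / STEP)
    return [start + i * STEP for i in range(steps_per_day)]


def coverage_record(
    day: date, present: Iterable[datetime], algorithm: str
) -> Dict[str, Optional[Union[str, int, float]]]:
    """Build a provenance-style record describing data coverage for a day.

    The field names are illustrative; a real system would store whatever
    metadata schema the community has agreed on.
    """
    expected = expected_times(day)
    have = set(present)
    missing = [t for t in expected if t not in have]
    return {
        "day": day.isoformat(),
        "algorithm": algorithm,
        "expected_steps": len(expected),
        "missing_steps": len(missing),
        "coverage": (len(expected) - len(missing)) / len(expected),
        "first_missing": missing[0].isoformat() if missing else None,
    }
```

A record of this kind could be attached to each data product as metadata and registered in a database, so that a later search can filter on coverage or on the algorithm used.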

The ability to organize distributed data, share data between institutions, track the derivation of analyses on the data, and then discover relevant environmental data has very strong implications for national and international environmental research. Through policy-based data management systems, the governing policies and procedures can be implemented based on a consensus of the designated user community. The policies can evolve to track the actual requirements of that community. Re-processing of the collection can be automated to apply next-generation analysis algorithms. The environmental data can then sit within an active data management environment, which continuously updates its indices and derived data content to track future research initiatives.
