Data access and storage in the LEAD cyberinfrastructure
A major goal of LEAD is to allow users to query, import, and manage data in the LEAD domain for purposes such as visualization, storage, and input to LEAD applications. LEAD data can be individual files, collections of files, or streams. LEAD data sets may be very large, which necessitates distributed storage or a mass storage system and requires support for subsetting and aggregation of data.
The LEAD data storage subsystem must handle three classes of data. Personal data is data that a user has brought into their space within the LEAD Data Repository. Users may store data and other resources there, and the orchestration also stores intermediate results in a user's personal space. Public LEAD data is data made available to the LEAD community by cooperating data providers, such as universities interested in sharing the data that they receive or generate. External data is any other data that a LEAD user knows about and would like to access; NCDC archival data is one example.
Both personal and public LEAD data are catalogued within LEAD. Via the catalogs, users can query the metadata to discover data that meets specified temporal, spatial, and field requirements. Other relevant requirements pertain to metadata generation, data quality, metadata quality, access control, and handling of proprietary data.
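To make the catalog query concrete, the following is a minimal sketch of filtering catalog entries by time range, spatial extent, and required fields. It is not the LEAD catalog interface: the `CatalogEntry` type, the `query` function, the bounding-box convention, and the sample entries are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, List, Tuple

# Hypothetical convention: (west, south, east, north) in degrees.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical metadata record for one catalogued data set."""
    name: str
    data_class: str          # "personal" or "public"; external data is not catalogued
    start: datetime          # first time covered by the data
    end: datetime            # last time covered by the data
    bbox: BBox               # spatial extent
    fields: FrozenSet[str]   # variables contained in the data set


def _time_overlaps(entry: CatalogEntry, start: datetime, end: datetime) -> bool:
    # Two closed intervals overlap when each starts before the other ends.
    return entry.start <= end and start <= entry.end


def _bbox_intersects(a: BBox, b: BBox) -> bool:
    # Simple rectangle test; does not handle boxes crossing the antimeridian.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query(catalog: Iterable[CatalogEntry], start: datetime, end: datetime,
          bbox: BBox, fields: Iterable[str]) -> List[CatalogEntry]:
    """Return entries that overlap the time range and region and hold all fields."""
    wanted = frozenset(fields)
    return [
        e for e in catalog
        if _time_overlaps(e, start, end)
        and _bbox_intersects(e.bbox, bbox)
        and wanted <= e.fields
    ]


# Invented sample entries, for illustration only.
CATALOG = [
    CatalogEntry("radar_ok_20050503", "public",
                 datetime(2005, 5, 3, 0), datetime(2005, 5, 3, 23),
                 (-103.0, 33.6, -94.4, 37.0),
                 frozenset({"reflectivity", "radial_velocity"})),
    CatalogEntry("forecast_run_17", "personal",
                 datetime(2005, 5, 3, 12), datetime(2005, 5, 4, 12),
                 (-100.0, 34.0, -96.0, 36.5),
                 frozenset({"temperature", "pressure"})),
    CatalogEntry("surface_obs_fl", "public",
                 datetime(2005, 6, 1), datetime(2005, 6, 2),
                 (-87.6, 24.5, -80.0, 31.0),
                 frozenset({"temperature"})),
]

if __name__ == "__main__":
    hits = query(CATALOG,
                 datetime(2005, 5, 3, 6), datetime(2005, 5, 3, 18),
                 (-98.0, 35.0, -97.0, 36.0),
                 ["temperature"])
    for entry in hits:
        print(entry.name, entry.data_class)
```

In this sketch the three criteria are combined conjunctively, so an entry must satisfy the time, spatial, and field requirements together. A real catalog would likely evaluate such queries against indexed metadata rather than scanning a list.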
This paper presents some of the requirements for data access, acquisition, storage, and retrieval that are shaping the architecture of the LEAD Data Subsystem. It also gives a high-level view of that architecture through a canonical use case and describes its current state of development.