J1B.5 ASDC’s Python-Based Metadata Extraction Pipeline for Suborbital Campaigns

Monday, 29 January 2024: 9:30 AM
336 (The Baltimore Convention Center)
Abraham Stephen Porter, ADNET Systems, Inc., Hampton, VA; and N. Jester, M. E. Buzanowicz, S. Leavor, G. Mojica, J. Kusterer, and C. Gao

The FAIRness of data products, especially their findability and accessibility, depends on rich metadata that, once extracted, allows for proper curation. Over the past few years, the Atmospheric Science Data Center (ASDC) suborbital science support team has developed a metadata extraction pipeline that retrieves the required metadata systematically, effectively, and efficiently so that the data can be used by a broad community. Developing the pipeline has presented many challenges, all of which had to be addressed to support archival and distribution for ASDC’s 30+ suborbital missions. Although instrument scientists provide sufficient metadata, it may not be readily machine-actionable because it arrives in different formats and templates. Further complicating extraction, our team has found that the metadata itself can be quite diverse, given the differences in measurement types, instruments, and measurement platforms.

The metadata extraction pipeline provides an efficient, plugin-based method for adding new parsers, a configuration system that lets non-developers customize how files are processed, and a system for identifying and logging metadata quality issues so that they are readily found and addressed. The pipeline identifies the critical pieces of metadata needed to promote data FAIRness, including location, file revision, and measurement start/end datetime, and it can be easily modified to extract further information (such as variables). Given the wide-ranging datasets, the pipeline has been modified to accommodate multiple file formats, including multiple versions of ICARTT (International Consortium for Atmospheric Research on Transport and Transformation), HDF (Hierarchical Data Format), netCDF (network Common Data Form), and multiple versions of the Ames File Format. The pipeline also supports building metadata for file formats from which metadata cannot easily be extracted, such as PDF (Portable Document Format) and GIF (Graphics Interchange Format). The pipeline has allowed our team to maintain a consistent flow of data and metadata to archival and distribution services, ensuring the ASDC meets the needs of the suborbital science community. This presentation will highlight the ASDC’s suborbital metadata extraction pipeline, its development, how it has been modified to support data FAIRness, and plans for maintaining the pipeline and adding new features.
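
The three design elements named above (plugin-based parsers, a configuration that non-developers can edit, and logging of metadata quality issues) might fit together roughly as in the following minimal Python sketch. It is not the ASDC implementation: the registry `PARSERS`, the decorator `register_parser`, the `CONFIG` rule layout, the `REQUIRED` field list, and the toy "key: value" header format are all hypothetical names and assumptions made for illustration, and no real ICARTT, HDF, or netCDF reader is used.

```python
import fnmatch
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("metadata_pipeline")  # hypothetical logger name

# Hypothetical plugin registry: parser name -> function(text) -> metadata dict.
PARSERS: Dict[str, Callable[[str], dict]] = {}


def register_parser(name: str):
    """Decorator that adds a parser function to the registry under `name`."""
    def decorator(func: Callable[[str], dict]) -> Callable[[str], dict]:
        PARSERS[name] = func
        return func
    return decorator


@register_parser("keyvalue")
def parse_keyvalue(text: str) -> dict:
    """Toy parser for 'key: value' header lines (an assumed format)."""
    meta = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)  # split on the first colon only
            meta[key.strip().lower()] = value.strip()
    return meta


@register_parser("static")
def parse_static(text: str) -> dict:
    """For formats such as PDF or GIF: nothing is read from the file itself."""
    return {}


# Hypothetical configuration: which parser handles which file name pattern,
# plus default values supplied when the file cannot provide them.
CONFIG = [
    {"pattern": "*.ict", "parser": "keyvalue", "defaults": {}},
    {"pattern": "*.pdf", "parser": "static", "defaults": {"revision": "R0"}},
]

# Assumed set of required fields, loosely based on those the abstract lists.
REQUIRED = ("start", "end", "revision")


def extract(filename: str, text: str, config=CONFIG) -> Tuple[dict, List[str]]:
    """Apply the first matching rule; return (metadata, quality issues)."""
    for rule in config:
        if fnmatch.fnmatch(filename, rule["pattern"]):
            meta = dict(rule.get("defaults", {}))
            meta.update(PARSERS[rule["parser"]](text))
            issues = [f"missing {key}" for key in REQUIRED if key not in meta]
            for issue in issues:
                logger.warning("%s: %s", filename, issue)  # log each issue
            return meta, issues
    raise ValueError(f"no rule matches {filename}")
```

In this sketch, supporting a new format means writing one decorated function and adding one rule to `CONFIG`; the dispatch code in `extract` does not change. Calling `extract("report.pdf", "")` returns only the configured default revision and reports the start and end times as missing, which is the kind of quality issue a curator would then follow up on.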
