J8A.2 Promoting Open Science: A Better Tool for Version Controlling Gigantic Binary Data Files - Git-Qing

Tuesday, 30 January 2024: 4:45 PM
337 (The Baltimore Convention Center)
Guoqing Ge, CIRES and NOAA/GSL, Boulder, CO; and M. Hu, T. T. Ladwig, S. S. Weygandt, S. Liu, and J. R. Carley

In recent years, the drive towards open science has been gaining momentum, with researchers recognizing the importance of transparent and accessible data sharing. However, version controlling large binary data files has remained a challenging hurdle in achieving comprehensive open science practices. Existing solutions such as Git LFS have provided some relief, but they struggle to handle gigantic binary data (a few hundred gigabytes to a few terabytes), running into problems such as network congestion, long waits when cloning repositories, heavy disk usage from multiple copies of the data, and high data-hosting costs. This talk introduces "git-qing," a cutting-edge tool specifically developed to meet the needs of version controlling gigantic binary data files.

By capitalizing on the latest advancements in data storage and retrieval, git-qing efficiently manages large binary files within Git repositories, ensuring they remain lightweight and effortlessly accessible. Leveraging unique algorithms, git-qing minimizes the impact of large binary files on version control operations, enhancing repository cloning speed and responsiveness.

Key features of git-qing include intelligent data deduplication, seamless data synchronization with existing storage systems, and comprehensive metadata management. Through its user-friendly interface, researchers can effortlessly integrate git-qing into their existing workflows, eliminating the complexities often associated with handling gigantic binary data.

Moreover, git-qing adheres to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles, fostering a culture of data sharing and collaboration within the scientific community. Researchers can confidently share their binary data files, knowing that git-qing facilitates effortless discovery, accessibility, and reproducibility of scientific findings.

This talk discusses the architecture, implementation, and performance evaluation of git-qing, comparing it with existing solutions to showcase its superiority in managing large binary data files. By promoting transparent data sharing and fostering collaboration, git-qing brings researchers one step closer to realizing the full potential of open and reproducible scientific endeavors.
