Integrated Cloud and High-Performance Computing Platform for Interactive Analysis of ARM Data

Hills, Spencer

Large-scale scientific experiments and observation networks around the world have produced an unprecedented wealth of data, providing insights into Earth and environmental processes at regional to global scales and over long time scales. For example, the archived data volume at the Department of Energy (DOE) Atmospheric Radiation Measurement (ARM) facility, collected from a range of sensors distributed across the globe, recently reached the 1.4 petabyte mark and continues to grow. Analysis and scientific discovery from such large data sets require computationally efficient algorithms and tools. High-performance computing (HPC) has enabled new scientific discoveries through efficient and scalable analysis of these large-scale datasets. Scientific discovery, however, is a creative, exploratory, and iterative process that often requires interactive analysis and development. JupyterLab, developed by Project Jupyter, provides an easily customizable environment for interactive development and analysis in a web-based user interface that supports Python and a range of other languages. In addition to computer code, Python notebooks, such as those used in JupyterLab, support rich text elements, enabling well-documented, reproducible, and shareable research.

To enable new discoveries in atmospheric sciences using ARM data, the ARM Data Center is developing computational infrastructure that combines the enhanced interactive development environment of JupyterLab, hosted on cloud computing infrastructure, with traditional high-performance computing and a petascale data archive. In this talk, we will demonstrate a Python notebook-based data analytics workflow for cloud type classification on long time series of ARM data.
