Data-Proximate Computing, Analytics, and Visualization Using Cloud-Hosted Workflows and Data Services

Ramamurthy, Mohan K.

Data services, software, and user support are critical components of geosciences cyberinfrastructure that help researchers advance science. Growing data volumes and the increasing diversity and complexity of those data bring challenges along with opportunities for discovery and scientific breakthroughs. While the potential for big data to transform the geosciences is enormous, realizing it depends on effectively managing, analyzing, and exploiting these heterogeneous data sources, and on extracting knowledge and useful information from them in ways that were previously impossible, to enable discoveries and gain new insights. At the same time, there is a growing focus on “Reproducibility or Replicability in Science,” which has important implications for Open Science.

The maturity of cloud computing technologies and tools has opened new avenues for addressing both big data and Open Science challenges to accelerate scientific discoveries. With the advent of data-centric workflows that can be optimized for clustered architectures, scale-out storage is now an attractive option for storage professionals across multiple industries. Software-defined storage solutions have become especially popular because of the power and flexibility they can provide. Deployment is not trivial, however; it needs careful preparation and planning.

There is broad consensus that, as data volumes grow rapidly, instead of moving data to processing systems near users, as has been traditional, one will need to bring processing, computing, analytics, and visualization to the data: so-called data-proximate workbench capabilities, sometimes also referred to as server-side processing. Data movement should be kept to a minimum. In addition, data providers need to give scientists an ecosystem or workbench that includes not only data but also an array of tools, workflows, and other end-to-end applications and services needed to perform analysis, integration, interpretation, and synthesis, all in the same environment or platform. Unidata’s ongoing efforts include:

* Providing remote access to many types of data from a cloud environment (e.g., via the THREDDS Data Server, RAMADDA, and EDEX servers);

* Developing and providing a range of pre-configured and well-integrated tools and services that any university can deploy in its own private or public cloud setting. Specifically, Unidata has developed “containerized applications” using Docker, for portability, easy deployment, and reuse;

* Exploring the use of Kubernetes for automatically deploying and scaling containerized applications and provisioning the associated resources. Containerized applications developed by Unidata include applications for data transport, access, analysis, and visualization: THREDDS Data Server, Integrated Data Viewer (IDV), Advanced Weather Information Processing System (AWIPS), Local Data Manager, RAMADDA Data Server, and Python tools;

* Leveraging Jupyter as a central platform, and JupyterHub with its powerful set of interlinked tools, to interactively connect data servers, Python scientific libraries, scripts, and workflows, using Siphon, a Unidata-developed tool for linking Jupyter notebooks to data servers. Jupyter notebooks are a powerful approach to enabling reproducibility and Open Science.
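
As a rough illustration of the server-side processing idea described above, the following Python sketch builds a subset request so that only the variable, region, and time window of interest would need to leave the data server. It is a minimal sketch under stated assumptions, not Unidata's implementation: the host name, dataset path, and variable name are hypothetical, the function `build_subset_url` is introduced here for illustration only, and the `ncss/grid` path segment and query parameter names follow the conventions of the THREDDS NetCDF Subset Service as we understand them, which may differ between server versions. Nothing is fetched; the code only constructs the URL.

```python
from urllib.parse import urlencode


def build_subset_url(server_root, dataset_path, variables, bbox,
                     time_start, time_end, accept="netcdf4",
                     service="ncss/grid"):
    """Build a subset-request URL for a THREDDS-style subset service.

    Hypothetical helper: only the URL is constructed, no request is sent.
    The `service` path segment is an assumption and may vary by server.
    """
    params = {
        "var": ",".join(variables),      # variables to extract
        "north": bbox["north"],          # bounding box in degrees
        "south": bbox["south"],
        "west": bbox["west"],
        "east": bbox["east"],
        "time_start": time_start,        # ISO 8601 time window
        "time_end": time_end,
        "accept": accept,                # requested return format
    }
    root = server_root.rstrip("/")
    path = dataset_path.strip("/")
    return f"{root}/{service}/{path}?{urlencode(params)}"


# Hypothetical server and dataset, used only to show the call shape.
url = build_subset_url(
    "https://thredds.example.edu/thredds",
    "grib/EXAMPLE/model_run.grib2",
    variables=["Temperature_isobaric"],
    bbox={"north": 45.0, "south": 35.0, "west": -110.0, "east": -100.0},
    time_start="2020-01-01T00:00:00Z",
    time_end="2020-01-01T06:00:00Z",
)
print(url)
```

In a Jupyter notebook, a client library such as Siphon would typically take care of building and sending this kind of request; the sketch uses only the Python standard library so that it runs without additional dependencies.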

In this presentation, we will describe our ongoing and exploratory work to facilitate a new paradigm for doing science by offering a suite of tools, resources, and platforms that leverage cloud services to address both big data and Open Science/reproducibility challenges, and we will engage in a community dialog about that paradigm. Along with early results from our experience thus far, we will discuss the opportunities as well as the technical, social, fiscal, and organizational challenges of the cloud-enabled paradigm for advancing the geoscience community.

6B.2 Data-Proximate Computing, Analytics, and Visualization Using Cloud-Hosted Workflows and Data Services