The JASMIN Data Analysis Facility for the Environmental Sciences Community and the Role of Data-as-a-Service

Kershaw, Philip J

JASMIN is a multi-petabyte data analysis facility for the UK environmental science community and their international collaborators. In operation since 2012, it is built around the paradigm of bringing the compute to the data as a response to the challenges of Big Data encountered in this and other research domains. Computing resources are provided around a central archive representing four data centres which CEDA operates, covering data predominantly from the atmospheric sciences and Earth observation disciplines. This infrastructure is realised through the provision of a global file system optimised for maximum I/O performance, collocated with a range of computing services to meet different needs. These services include the LOTUS batch compute cluster, a virtualisation infrastructure of pre-provisioned VMs for a range of custom applications, and a community cloud enabling external users to self-provision tailored computing environments on their own virtual machines. All three computing environments share high-speed access to the managed data, and each has an allocation of its own storage space.

Data access and data access services, which we collectively term here Data-as-a-Service, are critical to this model. Ideally, data access should be both performant and ubiquitous for the services and applications consuming it, be they local or external to the JASMIN infrastructure. Growth in both the volume of data and the user community supported is driving change in how data access is implemented. There are two key factors for consideration: the network architecture, which enables performance and isolation, and the interfaces used to access data.

A first example is external bulk data access, expedited by a dedicated Data Transfer Zone (DTZ) based on the ESnet concept of a “Science DMZ” outside the institutional firewall. Experience with CMIP5 and other large projects has demonstrated the importance of such customisation, which has become even more important as (1) the UK climate modelling community has been increasingly developing workflows which migrate data directly from HPC to JASMIN, and (2) the UK Earth observation community is increasingly relying on large scale transfers of data into JASMIN. Though developed for inter-institutional transfers, the DTZ concept is now being extended for download services for all classes of user. This pattern is being piloted for the deployment of ESGF (Earth System Grid Federation) software, with data download services hosted in the DTZ and other user-facing services such as portals and web services deployed in the private cloud environment where they can be more conveniently administered.

A second example of change has been triggered by the need to address a fundamental incompatibility between parallel file systems at scale and the cloud. JASMIN’s community cloud environment allows users to provision virtual machines via a web portal using an IaaS (Infrastructure as a Service) model. The POSIX interface of our existing file system assumes a global uid/gid space under a single administrative authority whereas the IaaS model respects no such bounds, enabling tenants to provision their own UNIX user management from the ground up. A solution has been developed in which IaaS provision is segregated into an isolated network enabling full autonomy for tenants but with access to the data archive mediated through FTP and HTTP interfaces (such as OPeNDAP).
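As an illustration of what HTTP-mediated access via OPeNDAP can look like from inside a tenant's virtual machine, the following minimal sketch builds a DAP2 constraint-expression URL that requests only a subset of one variable. The host name, dataset path and variable name are hypothetical and are not taken from the JASMIN deployment; only the `[start:stride:stop]` hyperslab syntax (with an inclusive stop index) and the response suffix follow the DAP2 convention.

```python
def dap2_constraint_url(dataset_url, variable, hyperslab, response="dods"):
    """Build a DAP2 URL requesting a hyperslab of one variable.

    hyperslab is a sequence of (start, stride, stop) tuples, one per
    dimension. DAP2 index ranges are inclusive of the stop index.
    """
    parts = []
    for start, stride, stop in hyperslab:
        if start < 0 or stride < 1 or stop < start:
            raise ValueError(f"invalid range: {(start, stride, stop)}")
        parts.append(f"[{start}:{stride}:{stop}]")
    projection = variable + "".join(parts)
    # The suffix selects the response type, e.g. "dds", "das", "dods", "ascii".
    return f"{dataset_url}.{response}?{projection}"


# Hypothetical dataset URL and variable name, for illustration only.
url = dap2_constraint_url(
    "https://example.org/opendap/archive/tas_monthly.nc",
    "tas",
    [(0, 1, 11), (10, 1, 20), (30, 1, 40)],
)
print(url)
```

Because the subset is expressed in the request itself, the tenant needs no shared uid/gid space with the archive: the server applies its own access control and returns only the requested slice. Some HTTP clients percent-encode the square brackets before sending the request.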

Even so, experience with the cloud has led us to believe that, to truly deliver on bringing the compute to the data, we cannot continue with existing parallel file system technology. This change is also being driven by the availability of so-called hyper-converged solutions, in which compute hardware and storage are combined, providing the ability to readily expand storage capacity.

Looking forward, a key development will be to attempt to migrate away from parallel file systems towards object stores for primary storage. The two main motivations for using a parallel file system in the first place were (1) performance for massive data handling, and (2) ease of management for petascale storage. However, before we can migrate, we need to ensure that we can deliver enough performance with object stores and, more importantly, do so in a way that fits with typical environmental workflows and codes. To those ends we are working, in partnership with colleagues in the European ESIWACE project, first to develop an HDF (and hence NetCDF4) server system that can be deployed over object stores. This must perform well enough that users interfacing with it via a RESTful API get performance comparable to that of direct access to a file system. Adoption of such an interface also provides the potential for interoperability with public clouds, thereby facilitating cloudbursting scenarios, with remote users having access to JASMIN data via the RESTful API.
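One common way to serve array data from an object store is to split each array into fixed-size chunks and store every chunk as a separate object, so that a subset request touches only the objects it overlaps. The sketch below illustrates that idea under stated assumptions: the chunk shape, the key naming scheme and the dataset identifier are all hypothetical, and nothing here is taken from the server being developed with ESIWACE.

```python
from itertools import product


def chunks_for_selection(chunk_shape, selection):
    """Return the chunk grid indices that a hyperslab overlaps.

    chunk_shape: chunk size along each dimension.
    selection: one (start, stop) pair per dimension, stop exclusive.
    """
    ranges = []
    for (start, stop), size in zip(selection, chunk_shape):
        if stop <= start:
            return []  # empty selection touches no chunks
        first = start // size
        last = (stop - 1) // size
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


def object_key(dataset_id, chunk_index):
    """Map a chunk grid index to a flat object-store key (hypothetical scheme)."""
    return dataset_id + "/" + ".".join(str(i) for i in chunk_index)


# A (time, lat, lon) array stored in chunks of 10 x 100 x 100 elements.
touched = chunks_for_selection((10, 100, 100), [(0, 5), (50, 150), (0, 100)])
keys = [object_key("tas_monthly", index) for index in touched]
print(keys)
```

A server built this way would fetch only the listed objects, in parallel where possible, and assemble the requested slice before returning it over the RESTful API. Chunk size therefore becomes a main tuning parameter for matching file-system performance.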

Finally, another important consideration in the adoption of object stores is how to address legacy scientific applications and their access to the file system via hierarchical directories. For them, we need an interface layer that abstracts the flat key-value object interface beneath a POSIX wrapper. The development of faceted search services, such as that created for ESGF (Earth System Grid Federation), has an important role to play. The ESGF DRS (Data Reference Syntax) defines a set of vocabulary terms, indexed from datasets, which together uniquely describe each dataset. In this respect it mimics the concept of a POSIX directory path but without the limitation of a fixed hierarchy. We intend to build a library exposing an API that both existing and new applications can use relatively easily to mimic directories, but with these faceted extensions.
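The idea of directory-like navigation over a flat object store can be sketched with a small in-memory facet index. This is an illustration only, not the planned library: the class name, method names, object keys and facet values are all hypothetical, and the facet names merely resemble the kind of vocabulary terms a DRS defines.

```python
from collections import defaultdict


class FacetedIndex:
    """Toy index from facet name/value pairs to flat object-store keys."""

    def __init__(self):
        self._by_facet = defaultdict(set)  # (name, value) -> object keys
        self._facets = {}                  # object key -> facet dict

    def add(self, object_key, facets):
        """Register an object under every one of its facet values."""
        self._facets[object_key] = dict(facets)
        for name, value in facets.items():
            self._by_facet[(name, value)].add(object_key)

    def search(self, **constraints):
        """Return sorted object keys matching all facet constraints."""
        if not constraints:
            return sorted(self._facets)
        matches = [self._by_facet.get(item, set()) for item in constraints.items()]
        return sorted(set.intersection(*matches))

    def listdir(self, facet, **constraints):
        """List distinct values of one facet, like listing a directory level."""
        keys = self.search(**constraints)
        return sorted({self._facets[k][facet] for k in keys if facet in self._facets[k]})


index = FacetedIndex()
index.add("obj-001", {"project": "CMIP5", "model": "HadGEM2-ES", "variable": "tas"})
index.add("obj-002", {"project": "CMIP5", "model": "HadGEM2-ES", "variable": "pr"})
index.add("obj-003", {"project": "CMIP5", "model": "MPI-ESM-LR", "variable": "tas"})

# "Descend" by model first, or by variable first: no fixed hierarchy is imposed.
print(index.listdir("variable", model="HadGEM2-ES"))
print(index.listdir("model", variable="tas"))
```

A POSIX-style wrapper could present any ordering of facets as a path, for example treating `model/variable` and `variable/model` as equally valid routes to the same objects.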

Any such steps will require a careful, stepped approach to their implementation, from deployment to full adoption, but it is hoped that the transition will enable an evolution of Data-as-a-Service that can more readily scale with the demands of Big Data and the needs of a multi-tenancy hosted computing environment.

J3.2 The JASMIN Data Analysis Facility for the Environmental Sciences Community and the Role of Data-as-a-Service