The first component of our solution is to leverage the distributed parallelism of a Big Data technology based on shared-nothing architecture (SNA). We observe that many of the data preparation tasks are pleasingly parallel, a pattern for which SNA is especially well suited. Moreover, SNA can utilize commodity hardware without sacrificing performance, making it more cost-effective than the conventional high-performance computing (HPC) architecture.
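To illustrate why such tasks map naturally onto an SNA cluster, the following sketch (in Python, with hypothetical granule names and a placeholder preparation function) processes independent data granules in parallel; because no worker needs data held by another, the work distributes with essentially no coordination overhead.

```python
# Minimal sketch of a pleasingly parallel data-preparation step.
# Granule names and the prepare() body are hypothetical placeholders.
from multiprocessing import Pool

def prepare(granule_path):
    """Prepare one data granule (e.g., decode, subset, reformat).

    Each granule is processed independently, so no communication
    between workers is required -- the task is pleasingly parallel.
    """
    # ... real decoding/reformatting would go here ...
    return f"{granule_path}: prepared"

if __name__ == "__main__":
    granules = [f"granule_{i:03d}.hdf" for i in range(8)]  # placeholder inputs
    with Pool(processes=4) as pool:
        results = pool.map(prepare, granules)  # one granule per worker, no sharing
    print("\n".join(results))
```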
The value of Big Data technologies is almost always best realized by first indexing the datasets that they ingest and operate on. With MapReduce (MR), or its open-source implementation Hadoop, this is manifested in sequence files of key-value pairs; with distributed database systems, in their internal indexing implementations. Indexed data allow for faster retrieval and subsetting than repeatedly reading from the original data files.
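As a toy illustration of this benefit (plain Python dictionaries standing in for sequence files or a database index; the records and keys are invented), subsetting indexed data reduces to lookups rather than a scan over every record:

```python
# Toy illustration of indexed retrieval vs. repeated full scans.
# Records and keys are made up for illustration only.
records = [
    {"key": (10, 20), "value": 0.61},
    {"key": (10, 21), "value": 0.58},
    {"key": (11, 20), "value": 0.64},
]

# Full scan: every subset query re-reads all records.
subset_scan = [r["value"] for r in records if r["key"][0] == 10]

# Indexed: build the key -> value map once, then answer queries by lookup.
index = {r["key"]: r["value"] for r in records}
subset_lookup = [index[k] for k in [(10, 20), (10, 21)]]

assert subset_scan == subset_lookup == [0.61, 0.58]
```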
We have selected SciDB as the target Big Data system. SciDB is an open-source, all-in-one data management and advanced analytics platform that features complex analytics inside a next-generation parallel array database. It is based on a shared-nothing architecture for data parallelism and supports data versioning and provenance. Because it is array-based, SciDB is better suited to scientific data analytics than traditional Relational Database Management Systems (RDBMSs). It provides an extensive and flexible set of operators that can be efficiently “wired” together for more complex operations, and it is extensible through User Defined Types (UDTs), User Defined Functions (UDFs), and User Defined Operators (UDOs).
The second component of our solution is the use of the Hierarchical Triangular Mesh (HTM) as a unified data model and array indexing scheme, serving as the basis for homogenizing the various ES data expressed primarily in three data models: Grid, Swath, and Point.
Grid is a mesh with fixed latitude and longitude spacing, so a simple linear relation exists between array indices and latitude-longitude geolocation coordinates. Swath retains the spaceborne instrument’s observation geometry (e.g., cross-track × along-track), where no such simple relation exists; instead, geolocation is specified individually for each Swath array element, i.e., each Instantaneous Field of View (IFOV). The Point model is used mostly for in situ observations made at irregularly distributed locations, which are stored in a vector (1D array); as with Swath, geolocation is specified individually for each point. The dissimilarities among these data models give rise to difficulties in integrative analysis. For example, simply determining the (approximate) common area covered by two Swath arrays can become algorithmically involved, especially if the swaths are from satellites with different orbit characteristics.
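The contrast among the three data models can be made concrete with a short sketch; the array shapes, grid spacing, and variable names below are illustrative only. For Grid, geolocation is computable from the array indices; for Swath and Point, it must be carried alongside the data.

```python
import numpy as np

# --- Grid: geolocation follows from the array indices (example: 1-degree spacing) ---
nlat, nlon = 180, 360
lat0, lon0, dlat, dlon = -89.5, -179.5, 1.0, 1.0   # example grid definition
grid_data = np.zeros((nlat, nlon))

def grid_latlon(i, j):
    """Linear relation between (i, j) indices and (lat, lon)."""
    return lat0 + i * dlat, lon0 + j * dlon

# --- Swath: geolocation stored per element (per IFOV), same shape as the data ---
nscan, npix = 2030, 1354                 # illustrative swath granule dimensions
swath_data = np.zeros((nscan, npix))
swath_lat = np.zeros((nscan, npix))      # one latitude per IFOV
swath_lon = np.zeros((nscan, npix))      # one longitude per IFOV

# --- Point: a 1-D vector of irregularly located in situ observations ---
npoints = 5000                           # illustrative number of observations
point_data = np.zeros(npoints)
point_lat = np.zeros(npoints)            # one latitude per observation
point_lon = np.zeros(npoints)            # one longitude per observation

print(grid_latlon(0, 0))                 # -> (-89.5, -179.5): derived from indices alone
```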
These data models, however, have one commonality: data values associated with geolocations, and this commonality can serve as the basis for a unified data model. Indeed, at the heart of HTM is an array indexing scheme that essentially assigns an “address” (index) to every surface element of Earth (i.e., every geolocation), up to a desired resolution. When all data are stored in arrays indexed by this “address”, data values associated with any given geolocation can be retrieved quickly, regardless of their original data models. HTM not only speeds up the comparison and integration of data with different geometries, but also supports partitioning placement co-alignment (PPC), which renders the most common requirement of ES analysis, i.e., spatiotemporal coincidence, pleasingly parallel, maximizing efficiency and performance.
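To make this “address” assignment concrete, the following is a simplified, self-contained sketch of the HTM idea in Python/NumPy: start from the eight spherical triangles of an octahedron, recursively subdivide the triangle containing a given latitude-longitude point, and append two bits per level to form its trixel index. The encoding convention and edge handling here are simplified relative to production HTM libraries.

```python
# Simplified sketch of HTM indexing: recursively subdivide the spherical
# triangles of an octahedron and record which child contains the point.
# The ID convention here is illustrative and may differ in detail from
# canonical HTM library encodings.
import numpy as np

def _norm(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def latlon_to_xyz(lat_deg, lon_deg):
    """Unit vector on the sphere for a (lat, lon) in degrees."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def _contains(tri, p, eps=1e-12):
    """True if p lies inside the spherical triangle tri (CCW vertices)."""
    a, b, c = tri
    return (np.dot(np.cross(a, b), p) >= -eps and
            np.dot(np.cross(b, c), p) >= -eps and
            np.dot(np.cross(c, a), p) >= -eps)

# The eight root triangles of the octahedron (S0..S3, N0..N3).
_V = [np.array(v, dtype=float) for v in
      [(0, 0, 1), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, -1)]]
_ROOTS = [(_V[1], _V[5], _V[2]), (_V[2], _V[5], _V[3]),
          (_V[3], _V[5], _V[4]), (_V[4], _V[5], _V[1]),
          (_V[1], _V[0], _V[4]), (_V[4], _V[0], _V[3]),
          (_V[3], _V[0], _V[2]), (_V[2], _V[0], _V[1])]

def htm_index(lat_deg, lon_deg, level=10):
    """Return an integer index ("address") for the trixel containing the point."""
    p = latlon_to_xyz(lat_deg, lon_deg)
    root = next(i for i, tri in enumerate(_ROOTS) if _contains(tri, p))
    index, (a, b, c) = root, _ROOTS[root]
    for _ in range(level):
        w0, w1, w2 = _norm(b + c), _norm(a + c), _norm(a + b)
        children = [(a, w2, w1), (b, w0, w2), (c, w1, w0), (w0, w1, w2)]
        child = next(k for k, tri in enumerate(children) if _contains(tri, p))
        index = index * 4 + child          # append 2 bits per subdivision level
        a, b, c = children[child]
    return index

# Nearby points typically share a trixel at a moderate resolution.
print(htm_index(38.9, -77.0, level=8), htm_index(38.9001, -77.0001, level=8))
```

Because points that fall in the same trixel receive the same index, spatial co-location across datasets reduces to an equality test on indices, which is what enables fast retrieval and alignment regardless of the original data model.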
The third and final component of our solution is a regridding/remapping toolset, implemented as UDOs or UDFs in SciDB, that allows users to conveniently remap data from one geospatial geometry to another.
Consider the use case in which we would like to compare model output in the Grid data model with satellite observations in the Swath data model. With the data arrays indexed by HTM and implemented in SciDB, we can quickly retrieve the region where the two datasets intersect for comparison. However, since their underlying geospatial geometries differ, meaningful comparisons can only be carried out after we map Swath to Grid, Grid to Swath, or both to a common third geometry. The regridding/remapping toolset is implemented exactly for this purpose.
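For the Swath-to-Grid direction of this use case, a minimal “drop-in-the-bucket” remap can be sketched with NumPy. The grid definition and synthetic swath below are purely illustrative, and a production remap (e.g., one implemented as a SciDB UDO or UDF) would also need to handle weighting, quality screening, and fill values.

```python
# Sketch of a Swath -> Grid remap by cell averaging: each swath IFOV is
# assigned to the grid cell containing its geolocation, and each cell value
# is the mean of the contributing IFOVs. Arrays and grid are illustrative.
import numpy as np

def remap_swath_to_grid(swath_val, swath_lat, swath_lon,
                        nlat=180, nlon=360, dlat=1.0, dlon=1.0):
    """Average swath values into a regular lat-lon grid."""
    i = np.clip(((swath_lat + 90.0) / dlat).astype(int), 0, nlat - 1)
    j = np.clip(((swath_lon + 180.0) / dlon).astype(int), 0, nlon - 1)
    total = np.zeros((nlat, nlon))
    count = np.zeros((nlat, nlon))
    np.add.at(total, (i, j), swath_val)   # accumulate values per grid cell
    np.add.at(count, (i, j), 1.0)         # count contributing IFOVs per cell
    with np.errstate(invalid="ignore", divide="ignore"):
        grid = total / count              # cells with no IFOVs become NaN
    return grid

# Example with synthetic swath geolocations and a smooth synthetic field.
rng = np.random.default_rng(0)
lat = rng.uniform(-60, 60, size=10000)
lon = rng.uniform(-180, 180, size=10000)
val = np.cos(np.radians(lat))
gridded = remap_swath_to_grid(val, lat, lon)
print(np.nanmean(gridded))
```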
With these three essential components, we have a solution that effectively addresses the variety challenge and eases the systematization of machine learning, thereby achieving greater productivity.