5.4 Respective Strengths and Weaknesses of SciDB, MapReduce-HDFS, and a Custom Technique for a Data-Intensive Analysis System

Tuesday, 8 January 2013: 4:15 PM
Room 12A (Austin Convention Center)
Kwo-Sen Kuo, NASA GSFC, Greenbelt, MD; and J. Rushing, R. Ramachandran, G. Fekete, D. Duffy, and T. Clune

A large portion of Earth Science investigations is phenomenon- or event-based, such as studies of Rossby waves, mesoscale convective systems, and tropical cyclones. However, except for a few high-impact phenomena (e.g., tropical cyclones), no comprehensive records exist of the occurrences, or events, of these phenomena. Phenomenon-based studies therefore often focus on a few prominent cases while lesser ones are overlooked. Without an automated means of gathering events, a comprehensive investigation of a phenomenon is time-consuming at best and impossible at worst.

The NASA Earth Science Technology Office (ESTO) has therefore recently funded a project to create an Automated Event Service (AES), based initially on NASA Modern-Era Retrospective Analysis for Research and Applications (MERRA) data, that will provide 1) an intuitive web interface for basic event definitions, 2) an Event Specification Language (ESL) modeled after popular scripting languages (e.g., Python) for more sophisticated event definitions, 3) a social component that lets scientists with similar interests collaborate on event definitions, 4) a database to catalogue potential event definitions and query results, and 5) a linkage to find corresponding data in NASA's vast store of remote sensing observations. The project team intends to make the service interactive; that is, we aim to return event query results in real time. This goal necessitates the application of data-intensive techniques.

In search of an efficient technique for AES, we have evaluated the following data-intensive techniques on a “junior” AES with a reduced set of capabilities and a subcollection of the intended datasets: SciDB, MapReduce (MR) combined with the Hadoop Distributed File System (HDFS), and a custom technique. In this presentation, we report the respective strengths and weaknesses of these techniques and the lessons learned from each.
