J1.4 Exploring Methods to Leverage Apache Spark in Meteorological Applications Through the Creation of SciSpark

Tuesday, 12 January 2016: 11:45 AM
Room 354 (New Orleans Ernest N. Morial Convention Center)
Kim D. Whitehall, JPL/California Institute of Technology, Pasadena, CA; and R. Palamuttam, R. Marroquin Mogrovejo, B. D. Wilson, C. Mattmann, L. J. McGibbney, R. Verma, and P. Ramirez

Apache Spark is a large-scale data processing system that builds upon the Apache big data stack, which includes Hadoop, Mahout, HBase, YARN, Storm, and others. The SciSpark project (PI Mattmann 2015) leverages Apache Spark's in-memory calculations and converged analytics platform to deliver scientific Resilient Distributed Datasets (sRDDs), which are created by space/time operations on earth science datasets and/or by other operations during analysis. The sRDD is an extension of Apache Spark's RDD, a fault-tolerant, in-memory, distributed array that promotes low-latency computations for terabyte-scale datasets (Zaharia et al. 2012). Applied to meteorological applications, such computational abilities can significantly improve data subsetting, ingestion, and analysis. SciSpark will enable scalable climate model evaluations in systems such as the Regional Climate Model Evaluation System (RCMES); will promote machine-learning-based clustering algorithms in the earth sciences, such as k-means clustering of atmospheric variables' probability distribution functions (PDFs) in climate studies; and will facilitate analysis of events in large dataset records with algorithms such as the graph-based algorithm for identifying and tracking Mesoscale Convective Complexes, known as the Grab ’em, Tag ’em, Graph ’em (GTG) method (Whitehall et al. 2014).

This talk will focus on leveraging the SciSpark system for the implementation of the GTG. Metrics quantifying the parallel speedups and the memory and disk usage associated with the SciSpark implementation of the GTG will be presented. Further details of the architecture of SciSpark, the design of the scientific RDD (sRDD) data structure, and methods to integrate them into climate science algorithms will also be explored.
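As a rough illustration of what a graph-based identify-and-track algorithm involves, the following minimal Python sketch labels contiguous cold regions in each time frame and links regions in consecutive frames that overlap. It is not the published GTG algorithm or SciSpark code: the function names, the 241 K threshold, the toy brightness-temperature grids, and the single-cell overlap criterion are all assumptions made for illustration.

```python
from collections import deque


def label_cold_regions(frame, threshold):
    """Return a list of sets of (row, col) cells, one set per 4-connected
    region whose values are at or below `threshold` (hypothetical helper)."""
    rows, cols = len(frame), len(frame[0])
    seen = set()
    regions = []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or frame[r][c] > threshold:
                continue
            # Breadth-first flood fill from this unvisited cold cell.
            region = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and frame[ny][nx] <= threshold):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def build_overlap_graph(frames, threshold):
    """Nodes are (time_index, region_index). A directed edge joins two regions
    in consecutive frames that share at least one grid cell (an assumed,
    simplified overlap rule)."""
    labelled = [label_cold_regions(f, threshold) for f in frames]
    edges = []
    for t in range(len(labelled) - 1):
        for i, a in enumerate(labelled[t]):
            for j, b in enumerate(labelled[t + 1]):
                if a & b:
                    edges.append(((t, i), (t + 1, j)))
    return labelled, edges


# Toy brightness temperatures in kelvin; values are made up for illustration.
frames = [
    [[300, 300, 300, 300], [300, 230, 230, 300], [300, 300, 300, 300]],
    [[300, 300, 300, 300], [300, 300, 230, 230], [300, 300, 300, 300]],
    [[220, 300, 300, 300], [300, 300, 300, 230], [300, 300, 300, 300]],
]
labelled, edges = build_overlap_graph(frames, threshold=241)
for src, dst in edges:
    print(src, "->", dst)
```

In this sketch the labelling of each frame is independent of every other frame, which is the kind of step that could plausibly be mapped across a distributed dataset; only the edge-building step needs pairs of adjacent frames.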

References

Mattmann, C. A., 2015. http://esto.nasa.gov/files/solicitations/AIST_14/ROSES2014_AIST_A41_awards.html#mattmann

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., ... & Stoica, I. (2012, April). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (pp. 2-2). USENIX Association.

Whitehall, K., C. Mattmann, G. Jenkins, M. Rwebangira, B. Demoz, D. Waliser, J. Kim, C. Goodale, A. Hart, P. Ramirez, M. Joyce, M. Boustani, P. Zimdars, P. Loikith, and H. Lee, 2014. Exploring a graph theory based algorithm for automated identification and characterization of large mesoscale convective systems in satellite datasets. Earth Science Informatics. DOI: 10.1007/s12145-014-0181-3
