Emerging Cyber Infrastructure for NASA's Large-Scale Climate Data Analytics

Spear, Carrie; Spear, Carrie

The resolution of NASA climate and weather simulations have grown dramatically over the past few years with the highest-fidelity models reaching down to 1.5 KM global resolutions. With each doubling of the resolution, the resulting data sets grow by a factor of eight in size. As the climate and weather models push the envelope even further, a new infrastructure to store data and provide large-scale data analytics is necessary.

The NASA Center for Climate Simulation (NCCS) has deployed the Data Analytics Storage Service (DASS) that combines scalable storage with the ability to perform in-situ analytics. Within this system, large, commonly used data sets are stored in a POSIX file system (write once/read many); examples of data stored include Landsat, MERRA2, observing system simulation experiments, and high-resolution downscaled reanalysis. The total size of this repository is on the order of 15 petabytes of storage.

In addition to the POSIX file system, the NCCS has deployed file system connectors to enable emerging analytics built on top of the Hadoop File System (HDFS) to run on the same storage servers within the DASS. Coupled with a custom spatiotemporal indexing approach, users can now run emerging analytical operations built on MapReduce and Spark on the same data files stored within the POSIX file system without having to make additional copies. This presentation will discuss the architecture of this system and present benchmark performance measurements from traditional TeraSort and Wordcount to large-scale climate analytical operations on NetCDF data.

2.4 Emerging Cyber Infrastructure for NASA's Large-Scale Climate Data Analytics