PySpark for \

Banihirwe, Anderson; Banihirwe, Anderson

Using NCAR’s high performance computing systems, scientists perform many kinds of atmospheric data analysis with a variety of tools and workflows. Some of these, such as climate data analysis, are time intensive from both a human and computer point of view. Often these analyses are “embarrassingly parallel,” and many traditional approaches are either not parallel or excessively complex for this kind of analysis. Therefore, this research project explores an alternative approach to parallelizing them. We used the PySpark Python interface to Apache Spark, a modern framework aimed at performing fast distributed computing on Big Data.

We have been successful installing, configuring, and utilizing PySpark on NCAR’s HPC platforms, such as Yellowstone and Cheyenne. For this purpose, we designed and developed a Python package (spark-xarray) to bridge the I/O gap between Spark and scientific data stored in netCDF format. We applied PySpark to several atmospheric data analysis use cases, including bias correction and per-county computation of atmospheric statistics (such as rainfall and temperature). In this presentation, we will show the results of using PySpark with these cases, comparing it to more traditional approaches from both the performance and programming flexibility points of view. We will show comparison of the numerical details, such as timing, scalability, and code examples.

2.1 PySpark for "Big" Atmospheric Data Analysis