We have successfully installed, configured, and used PySpark on NCAR’s HPC platforms, including Yellowstone and Cheyenne. For this purpose, we designed and developed a Python package (spark-xarray) that bridges the I/O gap between Spark and scientific data stored in netCDF format. We applied PySpark to several atmospheric data analysis use cases, including bias correction and per-county computation of atmospheric statistics (such as rainfall and temperature). In this presentation, we will show the results of using PySpark for these cases and compare it with more traditional approaches in terms of both performance and programming flexibility. The comparison will cover timing and scalability measurements, along with code examples.