Thursday, 14 January 2016
As computational resources grow in performance and scale, high-resolution atmospheric modeling is pushed to finer resolutions and larger domains to take advantage of these hardware enhancements. With this trend, data output sizes grow as well, placing greater demands on data storage systems and on the algorithms used to analyze the output. One example of this trend can be found in boundary layer meteorology, where Large Eddy Simulation (LES) is used extensively. The Weather Research and Forecasting (WRF) model configured for LES at a resolution of 12.5 m over a 9 km x 9 km x 4 km domain can produce over 2 terabytes (TB) of data for just 4 hours of simulated time. As these simulations are pushed toward larger domains extending over mesoscale regions, data output will only continue to grow. While the models used to generate these datasets are designed to scale across the parallel resources of High Performance Computing (HPC) environments, the tools used to analyze and explore the datasets often are not. One solution to this problem is to reimplement the analysis tools for traditional parallel environments using MPI, so that they can scale across the same computational resources used to create the data, decreasing data analysis time and helping to speed scientific innovation.

In this work, a method of parallelization is presented that takes advantage of the inherent spatial dependencies of various data analysis methods. The method determines whether a horizontal or a vertical decomposition of the dataset across multiple parallel computing resources yields the greater reduction in analysis runtime by creating many smaller, independent data analysis tasks. By finding these “embarrassingly parallel” solutions, performance is improved by up to 92% for some problems. Additional attention is given to solutions that enhance the performance of data-limited or computationally-limited computations. In data-limited problems, the time to transfer the data needed for a calculation outweighs the time needed to perform the calculation; in computationally-limited problems, the computation time outweighs the data transfer time. The results to be presented show that the ratio of processors to nodes affects each class of problem differently: simply increasing the processor count improves the execution time of computationally-limited problems, whereas for data-limited problems, increasing the node count yields the greatest improvement.

Lastly, to further improve computationally-limited problems, a GPU implementation built on CUDA, putting massively threaded GPUs to work, is compared against the traditional MPI environment. The GPU implementation offers up to 86% additional improvement over the MPI version of the analysis methods tested.
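To make the decomposition idea concrete, the sketch below (a minimal, illustrative example, not the authors' actual code) shows one plausible reading of a horizontal decomposition: each MPI rank claims a contiguous slab of vertical levels and analyzes it independently, so per-level statistics need no communication. The grid dimensions are assumptions derived from the 12.5 m resolution and 9 km x 9 km x 4 km domain quoted above; the file I/O and the analysis routine itself are elided.

    /* Hedged sketch: horizontal (level-slab) decomposition of a WRF-LES
     * output array across MPI ranks. A vertical decomposition would
     * instead partition the NY/NX horizontal dimensions into columns. */
    #include <mpi.h>
    #include <stdio.h>

    #define NZ 320   /* assumed vertical levels:    4 km / 12.5 m */
    #define NY 720   /* assumed south-north points: 9 km / 12.5 m */
    #define NX 720   /* assumed west-east points:   9 km / 12.5 m */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each rank owns levels [z0, z1); the tasks are independent,
         * which is what makes the problem "embarrassingly parallel". */
        int z0 = rank * NZ / nprocs;
        int z1 = (rank + 1) * NZ / nprocs;

        /* Read and analyze only this slab (I/O and analysis elided). */
        printf("rank %d of %d: levels [%d, %d)\n", rank, nprocs, z0, z1);

        MPI_Finalize();
        return 0;
    }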
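The data-limited versus computationally-limited distinction can be made operational by timing the two phases of a task separately, as in the following sketch; read_slab and analyze_slab are hypothetical stand-ins for the real I/O and analysis phases, and the classification rule reflects the finding above that node count helps data-limited problems while processor count helps computationally-limited ones.

    /* Hedged sketch: classify an analysis task by comparing its data
     * transfer time against its computation time using MPI_Wtime. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    static void read_slab(void)    { usleep(200000); } /* stand-in I/O  */
    static void analyze_slab(void) { usleep(50000);  } /* stand-in math */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        double t0 = MPI_Wtime();
        read_slab();
        double t_io = MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        analyze_slab();
        double t_comp = MPI_Wtime() - t0;

        if (t_io > t_comp)
            printf("data-limited: favor more nodes (more aggregate I/O bandwidth)\n");
        else
            printf("computationally-limited: favor more processors\n");

        MPI_Finalize();
        return 0;
    }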
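For the GPU comparison, the abstract does not say which analysis methods were ported, so the sketch below shows only the flavor of such a port: a CUDA kernel computing one horizontal mean per vertical level, with one thread block per level and a shared-memory reduction. The kernel, its launch configuration, and the placeholder data are all illustrative assumptions.

    /* Hedged sketch: a per-level horizontal mean on the GPU, the kind
     * of massively threaded analysis compared against MPI above. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define NZ 320
    #define NY 720
    #define NX 720

    __global__ void level_mean(const float *field, float *mean)
    {
        int k = blockIdx.x;              /* one block per vertical level */
        __shared__ float partial[256];
        float sum = 0.0f;

        /* Grid-stride loop over the NY*NX points of level k. */
        for (long i = threadIdx.x; i < (long)NY * NX; i += blockDim.x)
            sum += field[(long)k * NY * NX + i];
        partial[threadIdx.x] = sum;
        __syncthreads();

        /* Tree reduction in shared memory (blockDim.x is a power of 2). */
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            mean[k] = partial[0] / ((float)NY * NX);
    }

    int main(void)
    {
        size_t n = (size_t)NZ * NY * NX;   /* ~166M points, ~0.66 GB */
        float *d_field, *d_mean;
        cudaMalloc(&d_field, n * sizeof(float));
        cudaMalloc(&d_mean, NZ * sizeof(float));
        cudaMemset(d_field, 0, n * sizeof(float));  /* placeholder data */

        level_mean<<<NZ, 256>>>(d_field, d_mean);
        cudaDeviceSynchronize();

        printf("computed %d level means\n", NZ);
        cudaFree(d_field);
        cudaFree(d_mean);
        return 0;
    }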