Utilization of the GPU and CUDA Framework to Improve the Analysis Time of "Big Data" from Mesoscale WRF-LES Simulations
Massively-threaded architectures like the Graphics Processing Unit (GPU) and libraries such as NVIDIA's Compute Unified Device Architecture (CUDA) – which offers an interface for general-purpose use of NVIDIA GPU hardware – present researchers with a unique way to leverage many threads of execution without the disadvantages of very large clusters of conventional Central Processing Units (CPUs). Rather than large racks of servers with at most 16-24 CPU cores (each supporting only one or two threads of execution), costly interconnects, and large industrial cooling systems, GPUs can be added to existing servers over common PCIe interconnects, offer thousands of threads of execution, and can be run with the standard cooling system of an office desktop computer.
This study explores the benefits of estimating boundary layer heights on GPU hardware rather than conventional CPUs when confronted with “Big Data”. The output analyzed was generated by the WRF model configured as an LES with various horizontal and vertical grid resolutions over a domain with a horizontal extent characteristic of a mesoscale model. The analysis method used for benchmarking is the common maximum vertical potential temperature gradient method for calculating the Atmospheric Boundary Layer (ABL) height (zi) proposed by Sullivan et al. (1998). This method was chosen because it requires reading the entire three-dimensional model-generated potential temperature field and calculating the maximum vertical gradient in the column above each horizontal gridpoint. In serial implementations, this computation can take from a few hours to a few days for an entire model integration period, depending upon the size of the domain. Traditional parallel methods using clusters of CPUs require partitioning the data among nodes, adding overhead. On the GPU, the complete three-dimensional potential temperature field can be loaded into on-device memory and the calculation run at every gridpoint simultaneously. Given these strengths and weaknesses, three implementations of this method are used for the comparison: 1) a serial version using a conventional CPU; 2) a parallel version written for a cluster of CPUs; and 3) a parallel version written for a GPU using CUDA. The traditional CPU runs were completed on Texas Tech University's Hrothgar computing cluster, whose server nodes contain dual 2.8 GHz hex-core processors. The GPU runs were completed on a single NVIDIA Tesla M2090 with 512 CUDA processing cores and 6 GB of onboard GDDR5 memory.
To assess performance, a simple runtime comparison was used to determine the performance increase of the CUDA/GPU method over both the single-threaded version and the version parallelized for a conventional CPU cluster.
Results will be presented showing a speedup of greater than 95 times in execution time for the CUDA/GPU implementation over the single-threaded implementation of the gradient method applied to “Big Data”. The presentation will also detail how the thousands of GPU execution threads were leveraged to achieve this speedup.
This work was completed with financial support provided by the Korea Meteorological Administration Research and Development Program under the Weather Information Service Engine (WISE) project, Grant 153-3100-3133-302-350. Computing hardware and support resources were provided by Dr. Yong Chen (GPU nodes) and the High Performance Computing Center (HPCC) at Texas Tech University in Lubbock (CPU nodes).