Utilization of the GPU and CUDA Framework to Improve the Analysis Time of "Big Data" from Mesoscale WRF-LES Simulations

Thursday, 8 January 2015: 4:00 PM
128AB (Phoenix Convention Center - West and North Buildings)
Timothy S. Sliwinski, Texas Tech University, Lubbock, TX; and S. Kang and Y. Chen

As computational power continues to increase, numerical models are able to simulate atmospheric flows at very fine resolution. Large-Eddy Simulation (LES) is a numerical method by which most energy-containing turbulent eddies in the atmospheric boundary layer are explicitly resolved. The Weather Research and Forecasting (WRF) model can be used as an LES model as well as a mesoscale model, which allows the investigation of interactions between mesoscale and microscale atmospheric flows. However, extending an LES domain to mesoscale extents produces massive amounts of model output, on the order of terabytes, which in turn increases the time required to analyze the dataset. This raises a "Big Data" issue for atmospheric scientists, one for which new parallel processing methods will be needed if fine-resolution simulations are ever to be viable for operational forecasting.

Massively-threaded architectures like the Graphics Processing Unit (GPU) and libraries such as NVIDIA's Compute Unified Device Architecture (CUDA) – which offers an interface for general purpose use of NVIDIA GPU hardware – present researchers with a unique way to leverage many threads of execution without the disadvantages of very large clusters of conventional Central Processing Units (CPUs). Rather than large racks containing servers with only up to 16-24 CPU cores with 1-2 threads of execution each, costly interconnects, and large industrial cooling systems, GPUs can be added to existing servers using common PCIe interconnects, offer thousands of threads of execution, and be run with the standard cooling systems of an office desktop computer.

This study explores the benefits of estimating boundary layer heights using GPU hardware rather than conventional CPUs when confronted with "Big Data". The output used was generated using the WRF model configured as LES with various horizontal and vertical grid resolutions over a domain with a horizontal extent characteristic of a mesoscale model. The analysis method used for benchmarking is the common maximum vertical potential temperature gradient method for calculating the Atmospheric Boundary Layer (ABL) height (zi) proposed by Sullivan et al. (1998). This method was chosen because it requires reading the entire three-dimensional model-generated potential temperature field and calculating the maximum vertical gradient in the column above each horizontal gridpoint. In serial implementations, this computation can take between a few hours and a few days to complete for an entire model integration period, depending upon the size of the domain. Traditional parallel methods using clusters of CPUs require partitioning the data among nodes, adding communication overhead. For the GPU, the complete three-dimensional potential temperature field can be loaded into on-device memory and the gradient computed at every gridpoint simultaneously. Given these strengths and weaknesses, three implementations of the method are compared: (1) a serial version using a conventional CPU, (2) a parallel version created for use with a cluster of CPUs, and (3) a parallel version written for the GPU using CUDA. The CPU runs were completed using Texas Tech University's Hrothgar computing cluster, whose server nodes contain dual 2.8 GHz hex-core processors. The GPU runs were completed using a single NVIDIA Tesla M2090 with 512 CUDA processing cores and 6 GB of onboard GDDR5 memory.
To assess performance, a simple runtime comparison was used to quantify the speedup of the CUDA/GPU method over both the single-threaded version and the parallelized version run on a conventional CPU cluster.
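The column-by-column structure of the maximum vertical gradient method is what makes it so amenable to GPU parallelization. As a minimal sketch of the serial algorithm (not the authors' actual code), the following assumes a potential temperature array `theta` of shape (nz, ny, nx) and a 1-D array `z` of model level heights; the function name and layer-midpoint convention are illustrative assumptions:

```python
import numpy as np

def abl_height(theta, z):
    """Estimate the ABL height z_i at each horizontal gridpoint.

    Following the maximum vertical potential temperature gradient
    method, the gradient d(theta)/dz is computed between adjacent
    model levels in each column, and the height of the layer with
    the largest gradient is taken as the boundary layer top.
    """
    dtheta = np.diff(theta, axis=0)        # (nz-1, ny, nx) layer differences
    dz = np.diff(z)[:, None, None]         # (nz-1, 1, 1) layer thicknesses
    grad = dtheta / dz                     # vertical gradient per layer
    kmax = np.argmax(grad, axis=0)         # (ny, nx) index of max gradient
    zmid = 0.5 * (z[:-1] + z[1:])          # midpoint height of each layer
    return zmid[kmax]                      # (ny, nx) field of z_i
```

Because every (x, y) column is processed independently, a CUDA implementation can assign one thread per horizontal gridpoint, which is how the thousands of GPU threads described above can be applied to the problem at once.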

Results will be presented showing a speedup of more than 95 times for the gradient method applied to "Big Data" when the CUDA/GPU implementation is compared against the single-threaded implementation. The presentation will also detail how the thousands of GPU execution threads were leveraged to achieve this speedup.

This work was completed with financial support provided by the Korea Meteorological Administration Research and Development Program under the Weather Information Service Engine (WISE) project, Grant 153-3100-3133-302-350. Computing hardware and support resources were provided by Dr. Yong Chen (GPU nodes) and the High Performance Computing Center (HPCC) at Texas Tech University at Lubbock (CPU nodes).