239 Development of a GPU-Capable MYNN Surface Layer Parameterization Scheme for the Common Community Physics Package (CCPP)

Monday, 29 January 2024
Hall E (The Baltimore Convention Center)
Timothy Sliwinski, CIRA, Boulder, CO; and I. Jankov and D. S. Abdi

As part of the NOAA Software Engineering for Novel Architectures (SENA) project, work has been ongoing at NOAA's Global Systems Laboratory to enhance the Common Community Physics Package (CCPP) with code that can be offloaded to Graphics Processing Units (GPUs) to realize potential improvements in computational performance. The CCPP is a collection of physics parameterization schemes used by state-of-the-art atmospheric models that underlie weather forecast systems such as the Unified Forecast System (UFS). Previous work on this effort realized computational performance improvements in the Thompson microphysics scheme as well as the Grell-Freitas cumulus scheme (Abdi et al. 2023, 9th Annual AMS High Performance Computing Symposium, Denver, CO). Both of these efforts relied upon OpenACC, a directive-based framework that allows existing Fortran code to be annotated with statements identifying sections of code suitable for parallel execution on the host CPU or, more importantly for this effort, on GPU accelerator devices.
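As a minimal sketch of this approach (using illustrative names and a placeholder flux formula, not the actual CCPP source), a directive such as !$acc parallel loop marks an existing column loop for GPU execution while remaining an ordinary comment to a compiler built without OpenACC support:

    ! Hypothetical sketch: annotating a Fortran column loop with OpenACC.
    ! Variable names and the flux formula are illustrative only.
    subroutine surface_flux_sketch(ncols, u1, t1, flux)
      implicit none
      integer, intent(in)  :: ncols
      real,    intent(in)  :: u1(ncols), t1(ncols)
      real,    intent(out) :: flux(ncols)
      integer :: i

      ! To a non-OpenACC compiler the line below is just a comment, so the
      ! same source still builds and runs serially on the CPU.
      !$acc parallel loop copyin(u1, t1) copyout(flux)
      do i = 1, ncols
         flux(i) = 0.001 * u1(i) * t1(i)   ! placeholder bulk-flux computation
      end do
    end subroutine surface_flux_sketch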

The work described here extends these previous efforts to an additional parameterization scheme: the MYNN surface scheme. Algorithmically, the MYNN surface scheme is relatively simple to port to GPUs, as it consists largely of loops over a large number of independent 1D columns spanning the lowest two model layers. Because the loop iterations are independent, the data maps readily onto single-instruction multiple-data (SIMD) architectures such as the GPU. The scheme was validated against output from the CCPP Single Column Model as a baseline, with results showing only small changes in the least significant digit of real-valued variables, attributable primarily to differences in rounding between the CPU and GPU. Computational performance of the new GPU-capable code was evaluated on a single 10-core Intel Haswell CPU versus a single discrete NVIDIA P100 PCIe GPU. To replicate a real-world load on the scheme, the number of vertical columns was varied from 150,000 to 750,000, as might be encountered in a modern high-resolution global model. Because data movement between the host and device is a significant bottleneck in GPU computing, performance ranged from a 2-3x slowdown with unoptimized data movement to a 12-42x speedup with fully optimized data movement.
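One common way to achieve the kind of data-movement optimization described above is an enclosing OpenACC data region, which transfers arrays to the device once and keeps them resident across repeated kernel launches rather than paying PCIe transfer costs on every call. The sketch below illustrates the idea under assumed names and a placeholder computation; it is not the actual scheme interface:

    ! Hypothetical sketch of optimized data movement with an !$acc data
    ! region. Arrays move across PCIe once, outside the time loop, and the
    ! kernel only asserts their presence on the device.
    program data_region_sketch
      implicit none
      integer, parameter :: ncols = 750000, nsteps = 10
      real :: u1(ncols), t1(ncols), flux(ncols)
      integer :: i, n

      u1 = 5.0
      t1 = 280.0

      ! Without this region, per-call copyin/copyout clauses would transfer
      ! every array on every step (the unoptimized, slower case).
      !$acc data copyin(u1, t1) copyout(flux)
      do n = 1, nsteps
         !$acc parallel loop present(u1, t1, flux)
         do i = 1, ncols
            flux(i) = 0.001 * u1(i) * t1(i)   ! placeholder computation
         end do
      end do
      !$acc end data
    end program data_region_sketch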
