1.5
Next Generation HPC and Forecast Model Application Readiness at NCEP
Detailed analysis of column-physics components such as RRTMG radiation, along with other model components, shows that threads executing these packages are state-heavy and thereby exhaust local storage: cache memory on conventional CPU and MIC cores, or shared memory on GPUs. Fine-grained parallelism (vectorization on Xeon and thread parallelism on GPUs) is inhibited by data dependencies in the vertical dimension of column physics. And while fine-grained parallelism is available over the dependency-free horizontal dimensions of weather model domains, processing a vector of state-heavy grid columns per thread only exacerbates the aforementioned per-thread pressure on local storage. In spite of these constraints, various code and data restructuring techniques have yielded performance gains for RRTMG on next-generation processors. Better still, these changes also improve performance on the host processor.
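The tension between vertical dependencies and horizontal parallelism can be illustrated with a minimal sketch. This is a toy attenuation kernel, not RRTMG code: the function names `attenuate_columns` and `working_set_bytes`, the level-major data layout, and the field counts in the usage note are all assumptions made for illustration. The vertical loop carries a recurrence and must run in order, while the inner loop over a chunk of columns is dependency-free and is the part a compiler or GPU could map to vector lanes or threads.

```python
import math


def attenuate_columns(flux_top, tau, chunk):
    """Toy column kernel (hypothetical, for illustration only).

    flux_top[i] is the flux entering the top of column i.
    tau[k][i] is the optical depth of layer k in column i (level-major).
    chunk is the number of columns processed together by one "thread".
    Returns flux[k][i] at each of the nlev + 1 layer interfaces.
    """
    ncol = len(flux_top)
    nlev = len(tau)
    flux = [[0.0] * ncol for _ in range(nlev + 1)]
    for start in range(0, ncol, chunk):
        stop = min(start + chunk, ncol)
        for i in range(start, stop):
            flux[0][i] = flux_top[i]
        # Vertical loop: level k + 1 depends on level k, so it is serial.
        for k in range(nlev):
            # Horizontal loop: columns are independent, so this is the
            # loop that could be vectorized or spread across GPU threads.
            for i in range(start, stop):
                flux[k + 1][i] = flux[k][i] * math.exp(-tau[k][i])
    return flux


def working_set_bytes(chunk, nlev, nfields, bytes_per_value=8):
    """Rough per-thread state for a chunk of columns (assumed model)."""
    return chunk * nlev * nfields * bytes_per_value
```

Under these assumptions, the result does not depend on `chunk`, but the per-thread state grows linearly with it: `working_set_bytes(8, 60, 20)` gives 76800 bytes for a hypothetical 8-column chunk with 60 levels and 20 double-precision fields, against 9600 bytes for a single column. That growth is the pressure on cache or shared memory described above.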
We will also touch on efforts to improve software architecture and processes at NCEP for developing, maintaining, and using high-performance codes for operations and research, including the inherent trade-offs between modularity and performance. Finally, we will provide an overview of efforts underway to strengthen connections to the research community as we evaluate new dynamics, grid systems, and scale-appropriate physics to inform development of a Next Generation Global Prediction System (NGGPS) at NCEP.