Targeting Atmospheric Climate Models for Massively Parallel, Accelerated Computing Platforms
The stringent throughput requirements of climate simulation are the largest source of scaling difficulty. Climate time scales require simulating at roughly 5 Simulated Years Per Day (SYPD), or about 2,000× realtime, to obtain the needed solutions in a feasible amount of time. Reaching this throughput requires strong-scaling a fixed problem size out to many nodes, but doing so reduces the data simulated per node, a key scaling metric. At 43K cores on Titan, CAM-SE’s dynamical core is simulating only 600 KB of data per node. The tracer transport routines fare somewhat better at 3.5 MB of data per node with the roughly 30 tracers of the CAM5 physics package. Compare this against the many GB per node typical of other science domains, and it is clear how thinly atmospheric climate simulations are spread.
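The throughput figure above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a 365-day year; the function names and the 100-year run length are hypothetical, chosen only for illustration.

```python
# Sketch: convert Simulated Years Per Day (SYPD) into a realtime factor.
# Assumption: a 365-day year; names and the 100-year example are illustrative.
DAYS_PER_YEAR = 365.0


def realtime_factor(sypd):
    """Simulated days advanced per wall-clock day."""
    return sypd * DAYS_PER_YEAR


def wallclock_days(simulated_years, sypd):
    """Wall-clock days needed to simulate the given number of years."""
    return simulated_years / sypd


print(realtime_factor(5.0))        # 1825.0, i.e. roughly 2,000x realtime
print(wallclock_days(100.0, 5.0))  # 20.0 wall-clock days for a century
```

At 5 SYPD, a hypothetical century-long run takes about three weeks of wall-clock time, which illustrates why a much lower throughput quickly becomes impractical.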
Low data per node means higher latency and bandwidth overhead for MPI message passing between nodes relative to local computation costs. Grid refinement will degrade the situation further because the time step must shrink along with the grid spacing. A 2× reduction in horizontal grid spacing therefore requires 8× more work (4× more grid columns, each advanced over 2× more time steps), but it provides only 4× more data. Spreading 4× more data across 8× more nodes to keep the 5 SYPD throughput leaves half as much data per node. If the scaling is non-ideal, the situation degrades even more.
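The refinement argument can be written out as a short calculation. This is a sketch under the assumptions stated in the text: refinement is horizontal only, the time step shrinks in proportion to the grid spacing, and scaling is ideal so the node count grows with the work. The function name `refine` is hypothetical.

```python
# Sketch: how data per node changes under horizontal grid refinement
# when throughput is held fixed. Assumes ideal scaling and a time step
# proportional to the horizontal grid spacing.
def refine(factor):
    """Return growth ratios for a `factor`-fold reduction in grid spacing."""
    data = factor ** 2   # horizontal columns grow in two dimensions
    work = factor ** 3   # data growth times the growth in time-step count
    nodes = work         # nodes must grow with work to hold throughput
    return {
        "work": work,
        "data": data,
        "nodes": nodes,
        "data_per_node": data / nodes,
    }


for factor in (2, 4, 8):
    print(factor, refine(factor))
```

Under these assumptions, data per node falls as the inverse of the refinement factor, so each halving of the grid spacing halves the data available to each node.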
In addition, future atmospheric climate models must adapt to accelerators such as Nvidia’s Graphics Processing Units (GPUs) and Intel’s Many Integrated Core (MIC) chips. Leading a task at ORNL to port CAM-SE to GPUs has yielded a number of lessons about what will port efficiently to GPUs and about software engineering practices that can keep pace with a rapidly changing codebase.
Three issues will be addressed: (1) lessons learned from porting and profiling CAM-SE on Titan, (2) algorithmic approaches to improving MPI scaling and accelerator efficiency, and (3) merging our scientific goals with the requirements of the computing tools we use to reach them.
Regarding lessons learned in the porting and profiling of CAM-SE on Titan, most of the discussion will focus on the packing, exchanging, and unpacking of data between nodes, a dominant consideration for CAM-SE at scale. This discussion will include our approaches to overlapping GPU kernel execution, CPU code execution, PCI-e transfers, and MPI transfers as well as threading efficiency considerations. These lessons came from the Center for Accelerated Application Readiness (CAAR) effort at the Oak Ridge Leadership Computing Facility (OLCF).
Regarding algorithmic approaches to MPI and accelerator difficulties, there will be some mention of grids and classes of numerical methods. However, the focus will be on several promising spatial operators and their interactions with explicit time integration. Probably the dominant consideration for MPI scaling and accelerator efficiency is the order of accuracy of the scheme. The higher the order of the scheme, the more efficiently data is reused, and the greater the local computation on a node becomes relative to the overhead of communication. However, the choice of spatial operator also matters. Galerkin schemes, while optimally local, suffer severe (nearly quadratic) time step reductions with increasing order, and they still require extra communication when limiting is inevitably applied. Finite-Volume schemes, while requiring a half-stencil to be transferred between nodes, suffer no time step reduction with increasing order, and they can be robustly limited without any added parallel communication. Time integration matters as well because some methods, such as ADER, can integrate to arbitrarily high order of accuracy, non-linearly, over a large time step without any intermediate stages of parallel communication.
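The contrast in time step behavior can be illustrated with a rough scaling model. This is only a sketch: the text says the Galerkin reduction is "nearly quadratic", so the exponent of 2 below is an assumed idealization, and the function names, the reference order, and the default exponent are all hypothetical rather than taken from CAM-SE.

```python
# Sketch: relative stable time step as the order of accuracy increases.
# Assumption: the Galerkin time step scales as order**(-exponent) with
# exponent = 2 (an idealization of "nearly quadratic"); the Finite-Volume
# time step is taken to be independent of order, as stated in the text.
def relative_dt_galerkin(order, ref_order=1, exponent=2.0):
    """Time step at `order` relative to the time step at `ref_order`."""
    return (ref_order / order) ** exponent


def relative_dt_finite_volume(order, ref_order=1):
    """Finite-Volume time step does not shrink with increasing order."""
    return 1.0


for order in (2, 4, 8):
    print(order,
          relative_dt_galerkin(order, ref_order=2),
          relative_dt_finite_volume(order, ref_order=2))
```

In this simplified model, doubling the order of a Galerkin scheme quarters the stable time step, which offsets part of the gain in local work per communicated byte, while the Finite-Volume time step is unchanged.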
Finally, we should consider our science goals in terms of the capabilities of the computers we will use to achieve them. For a while, the trend of increasing the spatial resolution of atmospheric models has advanced faster than the inclusion of new physics. This trend is surely near an end if accelerated computing continues to advance as it has thus far. Higher spatial resolution means less data per node, not more, and we are already near the limits of what will give a speed-up on GPUs as well as MPI efficiency. Two potential improvements will be discussed: (1) inclusion of new, previously untenable physical phenomena, and (2) the use of significantly more ensembles for advances in uncertainty quantification.