3.1 Discussing Climate's Strong Scaling Challenges and Computationally Sustainable Approaches to Further Increasing Predictive Capability

Tuesday, 8 January 2019: 3:00 PM
North 123 (Phoenix Convention Center - West and North Buildings)
Matthew R. Norman, ORNL, Oak Ridge, TN

Climate modeling has a rich history in High Performance Computing (HPC) and has consistently been among the top HPC users at various computing centers. However, high-resolution climate simulation is now in a computationally difficult position. It is a unique HPC application because simulations must generally run at roughly 2,000x realtime in order to complete the centuries of simulated time needed within a feasible amount of wall-clock time. Reaching this high throughput requires extreme strong scaling. However, the computational path climate simulation has taken toward increased physical realism is asymptotically unsustainable in light of this throughput constraint.
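To put the throughput requirement in concrete terms (the century-scale figure below is an illustrative assumption, not a number from the talk), a single century of simulation at 2,000x realtime requires

\[
\frac{100~\text{simulated years}}{2000 \times \text{realtime}} \approx 0.05~\text{years} \approx 18~\text{days of wall-clock time},
\]

whereas the same century at only 100x realtime would occupy a full year of machine time.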
Every time you double the grid's resolution in the horizontal dimensions, you have 4x more data but 8x more work, because the time step must also be cut in half. If you need to complete that work in a fixed amount of wall-clock time, then you must use 8x more processors. Spreading 4x more data over 8x more processors means every time you double the grid's resolution, your workload per node cuts in half (as sketched below). Thinner workloads per node cause two problems. First, the overhead of exchanging data between compute nodes becomes relatively larger, meaning you get less benefit from scaling to more nodes. Second, accelerator devices, which are currently the only obvious path forward to exascale, have less work to do and are therefore less efficient. Because of data transfer overheads and accelerator inefficiency, straightforwardly increasing grid resolution will inevitably hit a barrier.
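A minimal sketch of the arithmetic behind this argument, assuming the time step is tied to the horizontal grid spacing by a CFL-type stability constraint (consistent with the "time step reduction" noted above):

\[
\underbrace{2 \times 2}_{\text{cells per direction}} = 4\times~\text{data}, \qquad
\Delta t \to \tfrac{1}{2}\Delta t \;\Rightarrow\; 4 \times 2 = 8\times~\text{work}.
\]

Holding wall-clock time fixed therefore requires \(8\times\) as many nodes, and the data per node becomes \(4\times / 8\times = \tfrac{1}{2}\) of its previous value.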
To illustrate the problem, on the Oak Ridge Leadership Computing Facility's (OLCF's) Titan computer, traditional Energy Exascale Earth System Model (E3SM) production runs spend roughly half of their time exchanging data rather than doing useful computation. In certain parts of the code, over 90% of the runtime is spent exchanging data. Keep in mind that E3SM uses extremely efficient, scalable, large-time-step methods and that its parallel performance has been heavily optimized. These bottlenecks demonstrate just how challenging high-resolution climate simulation is. Further, under the DOE Exascale Computing Project (ECP), which has developed a much more scalable model configuration, the E3SM GPU workloads are small enough that the overhead of launching work on the GPU is often equal to or greater than the time spent performing the work itself. While current efforts under the ECP are sufficient to gain allocations for OLCF's Summit supercomputer, it is not clear that this paradigm will suffice at exascale.
The goals of this presentation are: (1) to discuss the nature of this challenge, (2) to discuss the steps E3SM is currently taking to overcome it, and (3) to present some alternative approaches to increasing climate model predictive capability that are potentially more sustainable going forward. What is clear is that this challenge is not going to be resolved by GPU porting, alternative programming models, or algorithmic tweaks. It seems that some intelligent means of dimensionality reduction in the mathematical problem itself is necessary for climate simulation to move past this barrier and to thrive on exascale machines.