616
Optimizing performance of the Weather Research and Forecasting model at large core counts: a comparison between pure MPI and hybrid parallelism and an investigation into domain decomposition
Thursday, 6 February 2014
Hall C3 (The Georgia World Congress Center)
Two aspects of the performance and scaling of the Weather Research and Forecasting (WRF) model were investigated in this study: pure MPI parallelism was compared against hybrid MPI/OpenMP parallelism at scales from 512 to 64K cores, and the effect of domain decomposition on performance was examined. On the Yellowstone supercomputer, different combinations of MPI tasks per node and OpenMP threads per task were tested. The optimal hybrid combination was then scaled up and delivered performance comparable to the MPI-only case. Additionally, a comparison of performance scaling across runs with differing resolutions and parameterizations is presented. Domain decomposition effects were investigated by varying the patch (horizontal subset of the original domain) aspect ratio for a constant number of processors. The patch aspect ratio that yields optimal performance depends on the total number of grid points assigned to each patch. For decompositions that produce patches with many grid points, a more horizontally rectangular aspect ratio is optimal, and performance varies by no more than about 10% across aspect ratios. For decompositions with fewer grid points per patch, a less rectangular (more nearly square) aspect ratio is optimal, and a poor choice of decomposition can degrade performance by as much as 50%. The dependence of optimal aspect ratio on patch size is attributed to a trade-off between efficient memory access (dominant in the former case) and minimal MPI communication between patches (dominant in the latter case).
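To make the decomposition trade-off concrete, the sketch below enumerates candidate patch layouts for a fixed processor count and reports each patch's shape and halo perimeter, a rough proxy for MPI halo-exchange volume. This is an illustrative Python sketch, not part of WRF or the study itself; the domain dimensions and core count are hypothetical placeholders, and the mapping onto WRF's nproc_x/nproc_y namelist settings is an assumption about how one would apply such a choice in practice.

```python
# Sketch: compare candidate domain decompositions for a fixed core count.
# Patch point count is a proxy for per-patch memory footprint; patch
# perimeter is a proxy for MPI halo-exchange volume between neighbors.
from math import ceil


def decompositions(nprocs):
    """Yield all (nproc_x, nproc_y) factor pairs with nproc_x * nproc_y == nprocs."""
    for px in range(1, nprocs + 1):
        if nprocs % px == 0:
            yield px, nprocs // px


def patch_stats(e_we, e_sn, nproc_x, nproc_y):
    """Per-patch dimensions, point count, aspect ratio, and halo perimeter."""
    nx = ceil(e_we / nproc_x)          # west-east points per patch
    ny = ceil(e_sn / nproc_y)          # south-north points per patch
    return {
        "patch": (nx, ny),
        "points": nx * ny,             # memory-footprint proxy
        "aspect": nx / ny,             # >1 means horizontally rectangular
        "perimeter": 2 * (nx + ny),    # halo-exchange proxy
    }


if __name__ == "__main__":
    # Hypothetical 1024 x 1024 grid on 512 MPI ranks.
    e_we, e_sn, nprocs = 1024, 1024, 512
    for px, py in decompositions(nprocs):
        s = patch_stats(e_we, e_sn, px, py)
        print(f"nproc_x={px:4d} nproc_y={py:4d} "
              f"patch={s['patch'][0]}x{s['patch'][1]} "
              f"aspect={s['aspect']:8.3f} perimeter={s['perimeter']}")
```

Under the abstract's reasoning, one would favor wider patches (larger aspect) when patches are large and memory access dominates, and squarer patches (minimal perimeter) when patches are small and halo communication dominates.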