4.3 WRF Performance Optimization Targeting Intel Multicore and Manycore Architectures

Thursday, 14 January 2016: 4:00 PM
Room 344 (New Orleans Ernest N. Morial Convention Center)
Samuel J. Elliott, NCAR, Boulder, CO; and N. Sobhani, D. Del Vento, and D. Gill

The Weather Research and Forecasting (WRF) Model is a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. WRF is computationally expensive, and with such a large user base, performance optimization of the model is a high priority for the WRF community. This study targets a variety of performance optimizations to inform users how best to run their simulations in modern supercomputing environments. The benchmarks used in this study were run with WRF-ARW version 3.7 on the Texas Advanced Computing Center (TACC) Stampede cluster, which utilizes Intel Xeon E5-2660 CPUs and Xeon Phi SE10P coprocessors. It has been shown that, when properly configured, hybrid MPI and OpenMP parallelization of WRF on the host CPUs gives consistent performance benefits and has significantly stronger scaling properties than the standard pure-MPI implementation. Running WRF on Xeon Phi coprocessors has previously shown disappointing performance; we have shown that for large workloads per core, the Xeon Phi can outperform the host CPUs. Because efficiency per grid point remains consistent and MPI overhead is low in high-workload-per-core simulations, symmetric WRF execution utilizing both Xeon CPUs and Xeon Phi coprocessors can be used for highly efficient WRF runs. On systems such as TACC's Stampede with two Xeon Phi coprocessors per node, WRF users could therefore achieve 1.5 times more efficient usage of their allocations. Performance analysis on both Xeon and Xeon Phi exposes issues with the WRF model that can be improved. One such issue is WRF's OpenMP tiling strategy: the default OpenMP tiling produces imbalances between threads that hinder performance in low-workload-per-core simulations, and better tiling strategies show a consistent 10-20% speedup. Although these better strategies can be determined and explicitly set through namelist options, they are specific to each combination of grid size, number of available threads per CPU/coprocessor, MPI task tiling, and number of threads per task. This complexity, and the understanding of the methodology it requires, makes explicitly optimized MPI and OpenMP tiling strategies impractical for the majority of WRF users, which argues for WRF implementing these strategies itself at runtime.
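
As a concrete illustration of the tiling and hybrid-parallel settings discussed above, the sketch below shows where such options live in a WRF-ARW run. It assumes a WRF build configured for combined distributed- and shared-memory parallelism (dm+sm), and the specific tile count, thread count, and affinity choices are illustrative assumptions rather than recommended values, since the abstract notes that the best combination depends on grid size, the MPI task decomposition, and the threads per task.

    &domains                        ! excerpt of namelist.input (other entries omitted)
      numtiles  = 8,                ! OpenMP tiles per MPI patch (illustrative value)
      tile_sz_x = 0,                ! explicit tile dimensions; 0 leaves the tiling to numtiles
      tile_sz_y = 0,
    /

    # environment and launch for a hybrid MPI+OpenMP run on a TACC-style system (ibrun)
    export OMP_NUM_THREADS=4        # OpenMP threads per MPI task (example value)
    export KMP_AFFINITY=compact     # Intel OpenMP thread pinning; scatter is another option
    ibrun ./wrf.exe                 # MPI task count taken from the batch allocation

In practice, numtiles (or explicit tile sizes) must be chosen together with the number of MPI tasks per node and threads per task, which is exactly the bookkeeping the abstract argues WRF should eventually perform automatically at runtime.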