Allinea MAP, TAU (Tuning and Analysis Utilities), and Intel VTune Amplifier XE were used to profile the Weather Research and Forecasting (WRF) code and detect its hotspots. The profiling results show that advection is the most time-consuming routine in the ARW (Advanced Research WRF) dynamic solver for many WRF and WRF-Chem simulations. To investigate the hotspots further, the advection code was extracted from the WRF Fortran code as a separate kernel. The WRF model uses a 3rd-order Runge-Kutta (RK) scheme for time integration and 2nd- through 6th-order advection schemes in both the horizontal and vertical directions. This basic RK scheme can produce negative values and oscillations near sharp gradients because of numerical dispersion errors. Positive-definite and monotonic flux limiters are used to reduce these negative values and spurious oscillations.
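The effect of the limiter can be illustrated with a minimal one-dimensional sketch. It is not the WRF code: it assumes a periodic domain, a constant positive wind expressed as a Courant number `c` between 0 and 1, a 3rd-order upwind-biased flux, and a simplified positive-definite renormalization in which the low-order upwind flux is taken from the values at the start of the step. All function names are hypothetical and chosen for illustration only.

```python
def face_flux_3rd(q, c):
    """3rd-order upwind-biased flux at interface i+1/2 for u > 0 (Courant units)."""
    n = len(q)
    return [c * (2.0 * q[(i + 1) % n] + 5.0 * q[i] - q[i - 1]) / 6.0
            for i in range(n)]


def tendency(flux):
    """Flux divergence on a periodic grid: -(F[i+1/2] - F[i-1/2])."""
    return [-(flux[i] - flux[i - 1]) for i in range(len(flux))]


def rk3_step(q, c, limit=False):
    """One 3-stage RK step; optionally limit the final stage (assumed variant)."""
    n = len(q)
    t = tendency(face_flux_3rd(q, c))
    q1 = [q[i] + t[i] / 3.0 for i in range(n)]
    t = tendency(face_flux_3rd(q1, c))
    q2 = [q[i] + t[i] / 2.0 for i in range(n)]
    f_high = face_flux_3rd(q2, c)
    if not limit:
        t = tendency(f_high)
        return [q[i] + t[i] for i in range(n)]

    # Low-order (1st-order upwind) update; non-negative if q >= 0 and 0 <= c <= 1.
    f_low = [c * q[i] for i in range(n)]
    t = tendency(f_low)
    q_low = [q[i] + t[i] for i in range(n)]

    # High-order correction flux and the total correction leaving each cell.
    f_cor = [f_high[i] - f_low[i] for i in range(n)]
    out = [max(0.0, f_cor[i]) + max(0.0, -f_cor[i - 1]) for i in range(n)]

    # Scale outgoing corrections so no cell loses more than it holds.
    scale = [max(0.0, min(1.0, q_low[i] / (out[i] + 1e-15))) for i in range(n)]

    # Each interface flux is scaled by the factor of its donor cell.
    f_lim = [f_cor[i] * (scale[i] if f_cor[i] > 0.0 else scale[(i + 1) % n])
             for i in range(n)]
    t = tendency(f_lim)
    return [q_low[i] + t[i] for i in range(n)]


def advect(q, c, nsteps, limit=False):
    """Advance the field by nsteps time steps."""
    for _ in range(nsteps):
        q = rk3_step(q, c, limit)
    return q


if __name__ == "__main__":
    # Square wave: a sharp gradient that triggers undershoots without limiting.
    q0 = [1.0 if 10 <= i < 20 else 0.0 for i in range(40)]
    print("min without limiter:", min(advect(q0, 0.5, 20)))
    print("min with limiter:   ", min(advect(q0, 0.5, 20, limit=True)))
```

In this sketch the unlimited update undershoots below zero next to the edge of the square wave, while the limited update stays non-negative to within rounding and still conserves the total, because every interface flux is added to one cell and subtracted from its neighbour.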
We designed a testing framework for optimizing, compiling, profiling, and evaluating the kernel with different Fortran compilers, including the Intel, GNU, and PGI compilers. Several optimization techniques, including exposing more vectorization and improving cache hit rates, were applied to the advection modules, and the outputs were verified. The modified advection kernel achieves a reasonable speed-up with each of these compilers. The source code modifications were tested for bit-for-bit identical results and integrated back into the WRF source code. These optimizations will benefit the large community of research and operational forecasters by reducing computational costs.
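A bit-for-bit check of this kind can be sketched as follows. This is only an illustration in Python with a hypothetical helper name, not the testing framework described above; it assumes the reference and optimized outputs are available as sequences of double-precision values. It also shows why the check is strict: merely regrouping a floating-point sum can change the last bit of the result.

```python
import struct


def bit_identical(reference, candidate):
    """Return True only if both sequences hold the same IEEE-754 double bits."""
    if len(reference) != len(candidate):
        return False
    return all(struct.pack("<d", a) == struct.pack("<d", b)
               for a, b in zip(reference, candidate))


if __name__ == "__main__":
    # The same three numbers summed with two different groupings.
    left_to_right = (0.1 + 0.2) + 0.3
    regrouped = 0.1 + (0.2 + 0.3)
    print(bit_identical([left_to_right], [regrouped]))
    # Comparing bytes also separates +0.0 from -0.0, which == would not.
    print(bit_identical([0.0], [-0.0]))
```

Comparing the packed bytes instead of using a tolerance means that any optimization that reorders arithmetic is flagged, even when the numerical difference is far below the accuracy of the model.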