Performance Tuning of the JMA-NHM for the Super High-Resolution Experiment using the K Super Computer
In Japan, localized torrential rainfall sometimes causes severe disasters that affect society (e.g., the urban flash flood on the Toga River in Kobe city in July 2008, and the debris flow on Izu Ohshima Island in 2013). In these events, precipitation amounts differed greatly over short distances and were likely strongly affected by geographical features. In the Kobe flash flood, about 70% of the initial flow came from the urban area, which covers only about 30% of the entire catchment area (14 square kilometers). In the Izu Ohshima debris flow case, the island has two meteorological observation stations, one in the northern part and one in the middle part (near the damaged area), and they are only 4 km apart. Nevertheless, the precipitation observed at the middle station was about twice that observed at the northern station. Understanding these phenomena requires super high-resolution numerical weather simulation (grid spacing of several hundred meters). Previous studies have carried out super high-resolution experiments over limited domains, for example for tornadoes, but very few simulations cover a wide domain because of limited computational resources. The K computer has sufficient performance for numerical weather simulation over wide domains. However, NHM was developed on vector-type computing systems, whereas the K computer is a scalar-type system. Performance tuning of NHM for the K computer is therefore necessary for efficient operation. We improved the time integration part of the NHM code, the MPI communication, the memory allocation, and the file I/O system for the super high-resolution experiment.
The authors and Fujitsu Co. Ltd, the vendor of the K computer, have been tuning the time integration part of the NHM source code since 2011. In the tuning process, we investigated the computational cost and the thread-parallelization status of each loop, and obtained SIMD information from the compilation process. We then listed the 160 most expensive loops, which account for about 77% of the computational cost, and applied mainly five techniques: (1) Changing the partitioning method for thread parallelization: block partitioning was changed to cyclic partitioning. (2) Loop partitioning and prefetching: cyclic partitioning and prefetching were applied to loops with a high L1/L2 demand miss rate in order to increase memory throughput. (3) Merging several do loops that share arrays, to reduce the number of array element reads and improve performance. (4) Reducing the frequency of references to list arrays. (5) Reducing if statements to facilitate SIMD vectorization. As a result, we improved 144 loops. Compared with the original NHM, the improved NHM runs the time integration process 15% faster, and its performance reached 5.7% of peak (a 23% increase). We validated the NHM using 800 nodes and 82,944 nodes of the K computer under the same experimental conditions. The outputs of the original and improved NHM were completely identical.
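To make technique (1) concrete, the following minimal Python sketch shows how the same loop iterations are assigned to threads under block and under cyclic partitioning. It is only an illustration: the function names and the sizes (8 iterations, 4 threads) are hypothetical and do not come from the NHM code. In OpenMP terms the two layouts roughly correspond to `schedule(static)` and `schedule(static,1)`, although the abstract does not say which mechanism was actually used.

```python
def block_partition(n_iters, n_threads):
    """Block partitioning: each thread gets one contiguous chunk of iterations."""
    chunk = -(-n_iters // n_threads)  # ceiling division
    return [list(range(t * chunk, min((t + 1) * chunk, n_iters)))
            for t in range(n_threads)]


def cyclic_partition(n_iters, n_threads):
    """Cyclic partitioning: iteration i is handled by thread i % n_threads."""
    return [list(range(t, n_iters, n_threads)) for t in range(n_threads)]


if __name__ == "__main__":
    # Hypothetical sizes chosen only for readability.
    print(block_partition(8, 4))   # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print(cyclic_partition(8, 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With cyclic partitioning, threads running side by side work on neighboring iterations at any given moment, which changes the memory access pattern compared with block partitioning. Whether this helps depends on the loop, which is presumably why the tuning was done loop by loop.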
The K computer has 88,128 nodes (82,944 compute nodes and 5,184 I/O nodes). All of the K computer's nodes are needed to perform the high-resolution experiment. We optimized the MPI communication and the file I/O system of NHM for the K computer's specifications. The main problems and solutions are as follows: (1) Buffer error (MRQ overflow): an MRQ overflow occurs in point-to-point MPI communication when more than 50,000 nodes are used. The solution is to replace point-to-point MPI communication with collective (group) communication. (2) File I/O system: the K computer cannot handle sequential-access files whose record length exceeds 2 GB. To solve this problem, we improved the file I/O system. NHM had a decentralized input file system, but its output file system was centralized. We developed a decentralized output file system, together with a system for unifying the decentralized output files.
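The 2 GB record limit can be checked with simple arithmetic. The sketch below uses the HIGH grid size stated in this abstract (3200x2200x85) and assumes, for illustration only, that one record holds one full three-dimensional field of 4-byte reals and that "2 GB" means 2 GiB. The tile helpers are hypothetical and only show the idea of writing per-tile pieces and unifying them afterwards; they are not NHM's actual file format.

```python
RECORD_LIMIT_BYTES = 2 * 1024**3  # assumption: "2 GB" read as 2 GiB


def record_bytes(nx, ny, nz, bytes_per_value=4):
    """Size of one record holding a full nx*ny*nz field (assumed 4-byte reals)."""
    return nx * ny * nz * bytes_per_value


def split_into_tiles(field, tile_nx):
    """Cut a 2D field (list of rows) into column tiles of width tile_nx."""
    nx = len(field[0])
    return [[row[x0:x0 + tile_nx] for row in field]
            for x0 in range(0, nx, tile_nx)]


def unify_tiles(tiles):
    """Rebuild the global field by joining the tiles' rows from left to right."""
    n_rows = len(tiles[0])
    return [[v for tile in tiles for v in tile[j]] for j in range(n_rows)]


if __name__ == "__main__":
    whole = record_bytes(3200, 2200, 85)
    print(whole, whole > RECORD_LIMIT_BYTES)  # 2393600000 True

    # Toy 3x6 field split into three 3x2 tiles, then unified again.
    field = [[10 * j + i for i in range(6)] for j in range(3)]
    tiles = split_into_tiles(field, 2)
    print(unify_tiles(tiles) == field)  # True
```

Under these assumptions a single global field already exceeds the limit, while each tile written separately stays far below it, which is consistent with moving from centralized to decentralized output.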
After the tuning, we performed two experiments: a super high-resolution experiment (horizontal resolution of 250 m, HIGH) and a low-resolution experiment (horizontal resolution of 2 km, LOW). The experimental conditions are as follows: (1) Case study: the heavy rain on Izu Ohshima in October 2013. (2) Experiment domain: covers most of Japan. (3) Number of grid points: 800x550x60 for LOW and 3200x2200x85 for HIGH. (4) Planetary boundary layer schemes: LOW uses Mellor-Yamada level 3, and HIGH uses Deardorff. In the experiments, HIGH gave better results than LOW. The Radar-AMeDAS precipitation map (observation data) shows that the strong rainband completely covered the island. In the LOW case, the rainband does not completely cover the island, whereas in the HIGH case it does. We will present the details of the tuning and the experiments.
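As a rough indication of how much larger the HIGH problem is, the grid sizes listed in condition (3) can be compared directly. The short sketch below only multiplies the stated dimensions. It says nothing about time step, memory per grid point, or run time, none of which are given in this abstract.

```python
# Grid dimensions (nx, ny, nz) as stated in the experimental conditions.
GRIDS = {"LOW": (800, 550, 60), "HIGH": (3200, 2200, 85)}


def grid_points(nx, ny, nz):
    """Total number of grid points in an nx*ny*nz grid."""
    return nx * ny * nz


if __name__ == "__main__":
    low = grid_points(*GRIDS["LOW"])
    high = grid_points(*GRIDS["HIGH"])
    print(low)                    # 26400000
    print(high)                   # 598400000
    print(round(high / low, 2))   # 22.67
```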