1.14 MM5 Performance on an IBM SP: SMP and MPP Comparisons

Monday, 15 January 2001: 2:00 PM
Keith H. North, IBM, Omaha, NE; and J. Tuccillo and J. D. Benson

This paper examines the performance of MM5 on an inter- and intra-node basis and offers recommendations for optimal system performance on POWER2, POWER-PC, and POWER3 SP systems.

SMP nodes share system resources (memory, CPU, and disk) equally among multiple processors. An IBM Winterhawk-2 node, for example, can have up to 4 processors that all share the same resources, and a single copy of the operating system (AIX 4.3.3 in our case) schedules processes and threads on the node. MPP systems consist of autonomous nodes connected by a high-speed network; each node has its own processor, memory, and disk and runs its own copy of the operating system, so each node can be considered a workstation. Since MPP systems can be constructed from SMP nodes, such systems are considered hybrid systems. The Air Force Weather Agency (AFWA) runs MM5 on two such hybrid systems.
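
As an illustration only (this is not part of MM5), the short C program below shows what a single operating-system image on an SMP node sees: one hostname and a shared pool of processors. Run on every node of an MPP or hybrid system, each node reports its own name and its own resources. It assumes the widely available sysconf(_SC_NPROCESSORS_ONLN) extension.

/* Illustration only, not part of MM5: on an SMP node all processors are
 * visible to one operating-system image, so a single call reports the
 * node's CPU count; on an MPP/hybrid system each node reports its own
 * name and resources. Assumes sysconf(_SC_NPROCESSORS_ONLN) is available. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char host[256];
    long ncpus;

    gethostname(host, sizeof(host));         /* this node's identity        */
    ncpus = sysconf(_SC_NPROCESSORS_ONLN);   /* CPUs seen by this OS copy   */

    printf("node %s: %ld processors share this operating-system image\n",
           host, ncpus);
    return 0;
}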

Multi-threaded applications work best in an SMP environment because threads within a process share the available resources. The application developer can either explicitly create a multi-threaded executable using the POSIX threads library (pthreads) or allow the compiler to generate threaded executables. Explicit coding usually yields better performance and control; however, it requires more of the developer's time. A middle ground is to use compiler directives in conjunction with the automatic parallelization capability of the XL Fortran compiler to assist in cases where the compiler cannot detect independent pieces of work. Andersen (1998) provides details about SMP features and thread coding. Since all the threads in an application share the same address space, it is easy to reference data that other threads have updated.
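
As a minimal sketch of the explicit pthreads approach (illustrative C, not the MM5 source), the program below creates one worker thread per processor on a 4-way node. Because all threads share the process address space, each thread updates its slice of a shared array and the results are visible to the main thread after the joins, with no message passing.

/* Minimal pthreads sketch (illustration, not MM5 code): threads on one SMP
 * node share the process address space, so each worker updates its slice of
 * a shared array and the main thread reads the results after joining. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4          /* e.g. one thread per CPU on a 4-way node */
#define N        1000000

static double field[N];     /* shared data, visible to every thread */

struct slice { int begin, end; };

static void *update_slice(void *arg)
{
    struct slice *s = (struct slice *)arg;
    for (int i = s->begin; i < s->end; i++)
        field[i] = 2.0 * i;                 /* stand-in for real work */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct slice parts[NTHREADS];
    int chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        parts[t].begin = t * chunk;
        parts[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, update_slice, &parts[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    /* After the joins, data written by every thread is visible here. */
    printf("field[N-1] = %f\n", field[N - 1]);
    return 0;
}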

In the hybrid environment there are two approaches to running applications: (1) multiple single-threaded processes per node, or (2) one multi-threaded process per node. With the first approach, the number of processes is simply increased to match the number of processors per node; processes still communicate with each other using MPI calls whether they reside on the same node or on different nodes. On the Production-3 system at AFWA, which consists of 40 4-CPU POWER3 (Winterhawk) nodes, we run MM5 with 4 MPI tasks per node using a single-threaded executable for optimal performance. The second approach uses a single MPI task per node, so that intranode communication uses shared memory and internode communication uses message passing. On the Production-2 system at AFWA, which consists of 108 4-CPU POWER-PC (silver) nodes, we run MM5 with 1 MPI task per node using a multi-threaded executable for optimal performance.
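
The sketch below (illustrative C, not AFWA's MM5 configuration) shows the hybrid pattern in miniature: the same kind of executable can be launched as 4 single-threaded MPI tasks per node (approach 1) or as 1 MPI task per node with threads doing the intranode work (approach 2). MPI_THREAD_FUNNELED is requested so that only the main thread of each task makes MPI calls.

/* Hybrid MPI sketch (illustration, not the MM5 source). Approach 1: launch
 * 4 single-threaded MPI tasks on each 4-CPU node (e.g. 160 tasks across 40
 * Winterhawk nodes). Approach 2: launch 1 MPI task per node and let each
 * task run 4 threads, so intranode work uses shared memory and only
 * internode exchange goes through MPI messages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* FUNNELED: only the main thread makes MPI calls; worker threads
     * (pthreads or compiler-generated) stay inside the node. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("MPI task %d of %d (thread support level %d)\n",
           rank, size, provided);

    MPI_Finalize();
    return 0;
}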
