Running MITgcm on 70,000 cores at NAS – Challenges and Results

Wednesday, 7 January 2015
Harper Pryor, NASA, Moffett Field, CA; and D. Chan, R. Ciotti, C. Henze, R. Hood, and B. Nelson

Meeting scientists' need to run ocean, weather, and climate models at ever-increasing resolution, with ever more complex representations of physics, requires these models to execute on substantially larger High Performance Computing (HPC) systems than ever before. We are now running on tens of thousands of cores, and hundreds of thousands are on the horizon. Implementing these complex models on such large systems presents challenges in both hardware and software. This poster will discuss the challenges faced in implementing a model developed by the consortium for Estimating the Circulation and Climate of the Ocean (ECCO), a joint venture led by the Massachusetts Institute of Technology (MIT) and NASA's Jet Propulsion Laboratory (JPL) to study ocean currents and their interactions with Earth's atmosphere, sea ice, and marine-terminating glaciers. The objective is to help monitor and understand the ocean's role in climate variability and change, as well as to improve the representation of ocean-climate interactions in Earth system models. The ECCO project team uses the MIT general circulation model (MITgcm), a numerical model designed to study ocean, atmosphere, and sea-ice circulation. The MITgcm is combined with observational data from NASA satellites and in-situ ocean probes measuring sea level, temperature, salinity, and momentum, as well as sea-ice concentration, motion, and thickness. This model-data combination requires the solution of a huge nonlinear estimation problem. The result of this estimation is a realistic description of how ocean circulation, temperature, salinity, sea level, and sea ice interact on a global scale. The ECCO project team set a goal to run simulations at an unprecedented global resolution of 1/48° using the Pleiades supercomputer at the NASA Advanced Supercomputing (NAS) facility at NASA Ames Research Center, Moffett Field, CA. Pleiades is NASA's most powerful supercomputer.
It has over 11,000 nodes, each with two Intel Xeon processors having 6 to 10 cores per processor, interconnected by two separate InfiniBand fabrics organized as partial 12-D hypercubes and connected with over 65 miles of copper and fiber cables, making it the largest InfiniBand-connected cluster in the world. It has about 500 TB of memory and a peak performance of over 3.5 Pflop/s. It uses three generations of Xeon processors and three generations of InfiniBand technology. As the code was implemented on increasing numbers of cores, multiple challenges arose and were overcome. These included I/O bottlenecks, issues with libraries, hardware problems in InfiniBand and, importantly, a need to provide advanced visualization capability to help scientists understand the massive amounts of data produced by a model running at such high resolution. To have any chance of reasonable performance for a 70,000-core MITgcm job, the code's I/O had to be completely redesigned to eliminate a serious bottleneck. Instead of having all I/O go through a single MPI rank, the code was changed to add auxiliary processes that handle data compositing and I/O. For example, suppose there is a large MITgcm run that needs N MPI ranks for computing. The domain decomposition takes place in the horizontal plane (x, y) but not in z; that is, each rank is responsible for the full range of vertical points. The desired output is a collection of M full-range horizontal slices. The new approach is to use M auxiliary ranks for I/O. Each of the N compute ranks sends data to a subset of the M I/O ranks. The I/O ranks then shuffle data among themselves so that each has the data for the slice it is responsible for, and each I/O rank then writes out its plane. This is essentially a custom implementation of the level-3 collective-buffering optimization of an MPI_File_write_all, using auxiliary processes to do the data re-shuffling and output.
This approach minimizes the impact of the I/O and maximizes the overlap of I/O with computation. Several compute system issues had to be addressed to get MITgcm to run at scale on the Pleiades cluster. To solve InfiniBand issues that had plagued HPC sites for some time, several hundred marginal InfiniBand cables had to be identified and replaced, and the systems team had to install new firmware for the InfiniBand fabric and compute nodes. A new MPI library, SGI's MPT (Message Passing Toolkit), was provided to improve congestion control on the InfiniBand fabric, where congestion was impacting the MITgcm runs. To support scientists' need to visualize model output, the HECC Visualization group at NAS wrote a hyperwall application for visualizing the 3-petabyte (PB) dataset that resulted from the 1/48-degree simulation, the largest individual output dataset produced at NAS to date. Resolution of the ocean surface in the simulation is so high that it can only be displayed using the full 128-screen hyperwall. The application's key feature is its ability to use MPEG-compressed, pre-computed visualizations to compress the data 25:1, which reduces the required bandwidth to a manageable amount. Without the compression, it would be difficult to show the visualizations, because the Lustre filesystem could not readily deliver the required bandwidth. ECCO scientists can examine different ocean depths and scalar values by selecting a different MPEG animation in the application interface. With the success of this effort, ECCO researchers are contemplating even higher resolution runs on Pleiades. The poster will include additional examples of hyperwall output as well as performance data for the 70,000-core model run.
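
The auxiliary-rank I/O scheme described above (N compute ranks, M I/O ranks, one horizontal slice per I/O rank) can be illustrated with a small serial sketch. This is not the MITgcm implementation: no MPI is used, "ranks" are simply list indices, and the grid sizes, the round-robin mapping of compute ranks to I/O ranks, and the names compute_tile and gather_slices are assumptions made for illustration only.

```python
# Serial illustration of a two-stage gather; "ranks" are list indices, not MPI ranks.
# All sizes and the round-robin mapping below are hypothetical.

NX, NY, NZ = 4, 4, 3          # global grid; NZ plays the role of M output slices
PX, PY = 2, 2                 # compute ranks in x and y, so N = PX * PY
TX, TY = NX // PX, NY // PY   # horizontal tile size owned by each compute rank


def value(x, y, z):
    """Synthetic field value, chosen so the assembled planes can be checked."""
    return z * 100 + y * 10 + x


def compute_tile(rank):
    """A compute rank owns one horizontal tile over the full vertical range."""
    px, py = rank % PX, rank // PX
    return {
        z: [[value(px * TX + i, py * TY + j, z) for i in range(TX)]
            for j in range(TY)]
        for z in range(NZ)
    }


def gather_slices():
    n_compute, n_io = PX * PY, NZ

    # Stage 1: each compute rank hands its tile to one I/O rank (round-robin).
    inbox = [[] for _ in range(n_io)]
    for rank in range(n_compute):
        inbox[rank % n_io].append((rank, compute_tile(rank)))

    # Stage 2: I/O ranks shuffle so that I/O rank z holds every patch of level z.
    by_level = [[] for _ in range(n_io)]
    for io_rank in range(n_io):
        for rank, tile in inbox[io_rank]:
            for z, patch in tile.items():
                by_level[z].append((rank, patch))

    # Stage 3: each I/O rank assembles (and would then write) its full plane.
    planes = []
    for z in range(n_io):
        plane = [[None] * NX for _ in range(NY)]
        for rank, patch in by_level[z]:
            px, py = rank % PX, rank // PX
            for j in range(TY):
                for i in range(TX):
                    plane[py * TY + j][px * TX + i] = patch[j][i]
        planes.append(plane)
    return planes
```

In a real run, stages 1 and 2 would be message exchanges, and the compute ranks could resume time stepping as soon as stage 1 completes, which is where the overlap of I/O with computation would come from.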