Supporting the Data Flow of High Resolution Climate Modeling

Thursday, 8 January 2015: 3:45 PM
128AB (Phoenix Convention Center - West and North Buildings)
Amy R. Langenhorst, NOAA/GFDL, Princeton, NJ; and C. Wilson, K. Paffendorf, E. Mason, J. Durachta, and V. Balaji

As climate model resolution increases, so does the risk of being overwhelmed by the data these models produce. A recent high-resolution configuration at the Geophysical Fluid Dynamics Laboratory (GFDL) generates 2 TB of diagnostic data per simulation year, and automated post-processing, launched in tandem with the model run, can increase the data volume by a factor of four. Hardware and software architectures are also growing more complex. New approaches are needed to achieve the scale, reliability, and robustness that a climate scientist needs to understand, run, and manage a large model run and its output.
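
To make the scale concrete, the following back-of-the-envelope sketch estimates storage for a run of a given length. It assumes that the fourfold increase means the total volume after post-processing is four times the raw diagnostic volume; the abstract does not say whether the factor is inclusive of the raw output. The 100-year run length and the function name are hypothetical, chosen only for illustration.

```python
def estimate_volume_tb(sim_years, raw_tb_per_year=2.0, postproc_factor=4.0):
    """Return (raw_tb, total_tb) for a run of `sim_years` simulated years.

    Assumption: total volume = raw volume * postproc_factor.
    """
    raw_tb = sim_years * raw_tb_per_year
    total_tb = raw_tb * postproc_factor
    return raw_tb, total_tb


# Hypothetical 100-year simulation.
raw, total = estimate_volume_tb(100)
print(f"{raw:.0f} TB raw, {total:.0f} TB after post-processing")
```

Under these assumptions, a century-long run would approach the petabyte range, which suggests why disk space appears among the concerns listed later in the abstract.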

A redesign of the workflow software supporting GFDL's climate models is underway, starting with the automated post-processing of model output. Automated post-processing includes generating regridded time series and climatological averages, producing graphical output for a scientific audience, entering experiment metadata into a database, and publishing data on a public portal.
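
As an illustration of one post-processing step named above, the sketch below computes a monthly climatological average from a time series. It is not GFDL's implementation: the input layout (a list of year, month, value tuples) and the function name are assumptions made for the example, and a real pipeline would operate on gridded model output rather than scalars.

```python
from collections import defaultdict


def monthly_climatology(series):
    """Average values by calendar month across all years.

    `series` is an iterable of (year, month, value) tuples (assumed layout).
    Returns a dict mapping month number to the mean value for that month.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _year, month, value in series:
        sums[month] += value
        counts[month] += 1
    return {month: sums[month] / counts[month] for month in sorted(sums)}


# Hypothetical two-year sample with values for January and July only.
sample = [(2000, 1, 1.0), (2001, 1, 3.0), (2000, 7, 10.0), (2001, 7, 14.0)]
print(monthly_climatology(sample))
```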

A survey of tools used in the community will be presented, highlighting Cylc, the open source software project around which GFDL has chosen to build its framework. Cylc (http://cylc.github.io/cylc) is an engine designed to run and manage distributed suites of interdependent, cycling tasks using a dependency graph. Cylc will aid in monitoring and visualizing workflow status, handling task dependencies, and recovering from failed tasks.
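
The idea of cycling tasks driven by a dependency graph can be sketched without Cylc itself. The example below uses Python's standard-library `graphlib` module, not the Cylc API or its suite definition format. The task names, the cycle labels, and the rule that each model segment depends on the previous cycle's segment are all hypothetical, loosely based on the post-processing steps described above.

```python
from graphlib import TopologicalSorter

# Hypothetical per-cycle graph: each task maps to the tasks it depends on.
GRAPH = {
    "model_run": set(),
    "regrid": {"model_run"},
    "time_series": {"regrid"},
    "climatology": {"regrid"},
    "plots": {"time_series", "climatology"},
    "publish": {"plots"},
}


def expand_cycles(graph, cycles, chained_task="model_run"):
    """Replicate the graph once per cycle point.

    Nodes are named "task.cycle". The chained task at each cycle also
    depends on the same task at the previous cycle (an assumed rule).
    """
    expanded = {}
    for i, cycle in enumerate(cycles):
        for task, deps in graph.items():
            node_deps = {f"{dep}.{cycle}" for dep in deps}
            if task == chained_task and i > 0:
                node_deps.add(f"{task}.{cycles[i - 1]}")
            expanded[f"{task}.{cycle}"] = node_deps
    return expanded


# One valid execution order in which every task follows its dependencies.
order = list(TopologicalSorter(expand_cycles(GRAPH, ["1980", "1981"])).static_order())
for node in order:
    print(node)
```

A workflow engine adds scheduling, monitoring, and failure recovery on top of this ordering; the sketch shows only how dependencies within and between cycles constrain the order of execution.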

Challenges, goals, and lessons learned will be presented, based on GFDL's experience with workflows. Some goals are straightforward: in addition to adopting Cylc, we are reducing memory and disk space requirements, separating and parallelizing independent tasks, and implementing standard logging, retries, and error reporting. Larger challenges include dealing with distributed computing, aiming for node affinity, capturing provenance data, and addressing unexpected failure modes.
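
The goal of standard logging, retries, and error reporting might look something like the following minimal sketch. It is an illustration only, not the framework's code: the function `run_with_retries`, the logger name, the attempt limit, and the simulated failure are all hypothetical.

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("postproc")  # hypothetical logger name


def run_with_retries(task, name, max_attempts=3, delay_s=0.0):
    """Call `task()` up to `max_attempts` times, logging every outcome.

    Re-raises the last exception if all attempts fail, so the caller
    (for example, a workflow engine) can report the error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            log.info("task %s succeeded on attempt %d", name, attempt)
            return result
        except Exception as exc:
            log.warning("task %s failed on attempt %d/%d: %s",
                        name, attempt, max_attempts, exc)
            if attempt == max_attempts:
                log.error("task %s gave up after %d attempts", name, max_attempts)
                raise
            time.sleep(delay_s)


# Simulated task that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_regrid():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient I/O error")
    return "regrid done"


result = run_with_retries(flaky_regrid, "regrid")
```

Routing every task through one wrapper of this kind is one way to keep log format and retry policy uniform across otherwise independent tasks.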