Supporting the Data Flow of High Resolution Climate Modeling
A redesign of the workflow software supporting GFDL's climate models is underway, starting with the automated post-processing of model output. Automated post-processing includes generating regridded time series and climatological averages, producing graphical output for a scientific audience, recording experiment metadata in a database, and publishing data on a public portal.
A survey of tools used in the community will be presented, highlighting Cylc, the open-source software project GFDL has chosen to build its framework around. Cylc (http://cylc.github.io/cylc) is an engine designed to run and manage distributed suites of interdependent, cycling tasks using a dependency graph. Cylc will help monitor and visualize workflow status, handle task dependencies, and recover from failed tasks.
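To illustrate what scheduling from a dependency graph means, the following is a minimal Python sketch. It is not Cylc's configuration format or API. It uses the standard-library graphlib module (Python 3.9+), and the task names (model_output, regrid, timeseries, climatology, plots, publish) are hypothetical stand-ins for the post-processing steps described above. It models a single cycle only and ignores dependencies between cycles.

```python
from graphlib import TopologicalSorter

# Hypothetical post-processing tasks for one cycle (names are illustrative,
# not taken from GFDL's workflow). Each key maps a task to the set of tasks
# that must finish before it can start.
CYCLE_GRAPH = {
    "regrid": {"model_output"},
    "timeseries": {"regrid"},
    "climatology": {"regrid"},
    "plots": {"timeseries", "climatology"},
    "publish": {"plots"},
}


def run_order(graph):
    """Return a list of batches of task names.

    Tasks within one batch do not depend on each other, so a scheduler
    could run them in parallel. Batches must run in the order returned.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(run_order(CYCLE_GRAPH), start=1):
        print(number, batch)
```

In this sketch the time series and climatology tasks land in the same batch because both depend only on the regridding step, which is the kind of independence a workflow engine can exploit.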
Challenges, goals, and lessons learned will be presented, based on GFDL's experience with workflows. Some goals are straightforward: beyond adopting Cylc, we are reducing memory and disk space requirements, separating and parallelizing independent tasks, and implementing standard logging, retries, and error reporting. Larger challenges include dealing with distributed computing, aiming for node affinity, capturing provenance data, and addressing unexpected failure modes.
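As a rough illustration of the standard logging, retries, and error reporting goal, the sketch below wraps a task in a retry loop using only the Python standard library. The function name run_with_retries, its parameters, and the logger name "postprocess" are assumptions made for this example and do not describe GFDL's or Cylc's actual implementation.

```python
import logging
import time

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("postprocess")


def run_with_retries(task, name, max_attempts=3, delay_seconds=0.0):
    """Call task() up to max_attempts times, logging every outcome.

    Returns the task's result on success. If every attempt fails, the
    last exception is logged as an error and re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
        except Exception as exc:
            logger.warning(
                "task %s failed on attempt %d/%d: %s",
                name, attempt, max_attempts, exc,
            )
            if attempt == max_attempts:
                logger.error("task %s exhausted its retries", name)
                raise
            time.sleep(delay_seconds)  # fixed pause before the next attempt
        else:
            logger.info("task %s succeeded on attempt %d", name, attempt)
            return result
```

A fixed delay keeps the sketch short; a production workflow would more likely distinguish transient failures (for example, a busy file system) from permanent ones before retrying.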