85th AMS Annual Meeting

Monday, 10 January 2005
S4P-based data processing solutions at the Atmospheric Sciences Data Center
R. J. Walter, NASA, Hampton, VA; and F. Y. Wang
The NASA Langley Atmospheric Sciences Data Center (ASDC) has successfully implemented the Simple Scalable Script-based Science Processor (S4P) to schedule complex Earth system science processing schemes. S4P was developed by the Goddard Earth Sciences (GES) DAAC as a simplified, data-driven processing system that uses a factory assembly line paradigm to drive data processing. A series of “stations” is configured to represent the various stages of data production. When a “work order” arrives at a given station, an executable is run and an output work order is sent to a downstream station. Multiple concurrent instances of a job can be run at any given station, and output work orders can be sent to more than one downstream station. Each station executes independently of the others, which gives S4P a high level of modularity. The key components of each station are: 1) a daemon called stationmaster that monitors the station directory for work orders, with a single stationmaster daemon started for each station; and 2) a station configuration file. The two essential items in the configuration file are a map of tasks for stationmaster to execute in response to incoming work orders and a map of downstream stations to which output work orders are to be sent.
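The station model described above can be illustrated with a minimal Python sketch. This is not S4P code and does not reproduce its configuration format: the function `poll_station`, the `DO.<job_type>.<job_id>.wo` file naming, and the use of Python callables in place of external executables are all assumptions made for illustration. The sketch shows a single polling pass over one station directory, driven by a task map and a downstream map.

```python
import tempfile
from pathlib import Path


def poll_station(station_dir, tasks, downstream):
    """Process each pending work order in station_dir once.

    tasks:      maps a job type to a callable taking the work order text
                and returning the output work order text (a stand-in for
                running an executable).
    downstream: maps a job type to a list of downstream station directories.
    Returns the paths of the output work orders that were written.
    """
    station_dir = Path(station_dir)
    written = []
    # Hypothetical naming convention: DO.<job_type>.<job_id>.wo
    for order in sorted(station_dir.glob("DO.*.wo")):
        job_type = order.name.split(".")[1]
        task = tasks.get(job_type)
        if task is None:
            continue  # no task is mapped to this job type; leave the file
        output = task(order.read_text())
        # An output work order may go to more than one downstream station.
        for target in downstream.get(job_type, []):
            target = Path(target)
            target.mkdir(parents=True, exist_ok=True)
            out_path = target / order.name
            out_path.write_text(output)
            written.append(out_path)
        order.unlink()  # the input work order has been consumed
    return written


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    ingest = root / "ingest"
    ingest.mkdir()
    (ingest / "DO.ingest.001.wo").write_text("granule-001")
    results = poll_station(
        ingest,
        tasks={"ingest": lambda text: text + " processed"},
        downstream={"ingest": [root / "calibrate", root / "archive"]},
    )
    for path in results:
        print(path.relative_to(root))
```

A real daemon would repeat this pass in a loop and could run several jobs concurrently; the sketch keeps a single synchronous pass so that the roles of the two maps stay visible.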

Following a thorough evaluation, ASDC determined that S4P would provide an excellent foundation for implementing data processing systems for two of its major projects. The first of these is the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observations (CALIPSO) project. The CALIPSO satellite is scheduled for launch in April 2005. Building upon S4P, the ASDC developed a customized data processing system for CALIPSO. This system manages an intricate set of rules and dependencies that governs the automatic execution of 10 different science software executables. Because of some unique characteristics and data persistence requirements of the CALIPSO processing scenario, a backend database was integrated into the system to manage persistent information. The open source database MySQL is currently being used for this purpose.

The second of these projects is the Surface Radiation Budget (SRB) Release 2.5. SRB processing requires several thousand files to be staged from the ASDC archive in order to initiate processing for a given month. Unlike the CALIPSO system, no backend database is required. The system is able to track all of the files as well as initiate and monitor execution of the science software without the use of a database.
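The abstract does not say how the SRB system tracks its input files, so the following is only a sketch of one database-free approach: treat the staging directory itself as the record and compare it against the list of files expected for a month. The function names and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def missing_files(staging_dir, expected_names):
    """Return the expected file names not yet present in staging_dir."""
    staging_dir = Path(staging_dir)
    return sorted(
        name for name in expected_names
        if not (staging_dir / name).is_file()
    )


def month_is_ready(staging_dir, expected_names):
    """A month can start processing once no expected file is missing."""
    return not missing_files(staging_dir, expected_names)


if __name__ == "__main__":
    staging = Path(tempfile.mkdtemp())
    # Hypothetical input names for one month of processing.
    expected = ["input_a.dat", "input_b.dat", "input_c.dat"]
    (staging / "input_a.dat").write_text("")
    (staging / "input_c.dat").write_text("")
    print(missing_files(staging, expected))
    print(month_is_ready(staging, expected))
```

Because the check is recomputed from the file system on every call, no persistent state has to be kept between runs.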

The CALIPSO and SRB systems both use the Sun Grid Engine software, an open source project sponsored by Sun Microsystems, for managing their data production resources. This software was integrated easily with S4P.

The ASDC has successfully leveraged the S4P software to construct two very different data processing systems. S4P is not an “out of the box” solution; anyone who wants to use it will almost certainly need to do additional development. However, the modular and flexible nature of S4P allows for the implementation of low-cost, customizable data processing solutions with almost no artificial constraints.
