87th AMS Annual Meeting

Thursday, 18 January 2007: 9:15 AM
Monitoring and orchestrating a mesoscale meteorological cyberinfrastructure
216AB (Henry B. Gonzalez Convention Center)
Lavanya Ramakrishnan, Renaissance Computing Institute, Chapel Hill, NC; and D. Reed
The Linked Environments for Atmospheric Discovery (LEAD) project is developing a web service-based dynamic adaptive workflow orchestration system. Adaptation in LEAD can be triggered by events of many types, including changing weather conditions, forecast model analysis, inefficiencies in an ongoing workflow execution, and changing computational or network loads. The complexity and diversity of LEAD's dynamic characteristics make monitoring and understanding its behavior both more critical and more challenging. When reacting to crucial weather changes or changing resource availability, the LEAD system must proactively assess and detect anomalies, enable recovery and ensure continued operation.

To support real-time adaptation of distributed LEAD workflows, we are developing a new monitoring and orchestration infrastructure to collect workflow status, application performance, web service monitoring and resource monitoring data and apply that data for adaptive control. The monitoring infrastructure utilizes local sensors that collect data and monitor hardware and software resources for significant changes that should trigger adaptation. The resource sensors collect data on resource queues and status through Globus services and the Network Weather Service (NWS).
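One way a local sensor might decide that a change is "significant" can be sketched as follows. This is a minimal illustration and not the LEAD implementation: the `ThresholdSensor` class, the relative-change rule, the 50% threshold and the sample queue lengths are all assumptions made for the example.

```python
from typing import Optional


class ThresholdSensor:
    """Hypothetical sensor: reports a reading only when it differs enough
    from the last reported reading (relative change rule, assumed here)."""

    def __init__(self, relative_threshold: float) -> None:
        self.relative_threshold = relative_threshold
        self._last_reported: Optional[float] = None

    def observe(self, value: float) -> bool:
        """Return True when the change since the last reported value is significant."""
        if self._last_reported is None:
            # The first reading always establishes the baseline.
            self._last_reported = value
            return True
        # Guard against division by zero when the baseline is 0.
        baseline = max(abs(self._last_reported), 1e-9)
        if abs(value - self._last_reported) / baseline >= self.relative_threshold:
            self._last_reported = value
            return True
        return False


# Made-up queue lengths sampled from a resource queue.
sensor = ThresholdSensor(relative_threshold=0.5)
queue_lengths = [10, 11, 12, 20, 21, 9]
reported = [q for q in queue_lengths if sensor.observe(q)]
print(reported)
```

Only the readings that pass the rule would be forwarded for adaptation decisions, which keeps routine fluctuations from flooding downstream components.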

Sensors embedded in LEAD application services collect real-time load data, and the workflow monitor tracks workflow progress. These data are analyzed locally to detect anomalies, and decisions are broadcast to other components where adaptation is required. Actuators at critical infrastructure points (e.g., the workflow engine, service factory and resource broker) implement adaptations based on policy rules. The components communicate through the LEAD event broker, and all relevant components can subscribe to monitoring data and react to events based on the stated policy. For example, if a resource fails, the resource monitoring component might broadcast a message that would trigger the workflow engine and associated services to react and take appropriate action locally.
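The resource-failure example above can be sketched as a small publish/subscribe exchange. This is an in-process toy under stated assumptions, not the LEAD event broker or its API: the `EventBroker` and `WorkflowEngine` classes, the `resource.failed` topic name, the resource names and the "move tasks to a fallback resource" policy are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBroker:
    """Toy in-process stand-in for a publish/subscribe event broker."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Deliver the event to every handler registered for this topic.
        for handler in list(self._subscribers[topic]):
            handler(event)


class WorkflowEngine:
    """Hypothetical actuator: moves tasks off a failed resource (assumed policy)."""

    def __init__(self, placements: Dict[str, str], fallback: str) -> None:
        self.placements = placements  # task name -> resource name
        self.fallback = fallback

    def on_resource_failed(self, event: Event) -> None:
        failed = event["resource"]
        for task, resource in self.placements.items():
            if resource == failed:
                self.placements[task] = self.fallback


broker = EventBroker()
engine = WorkflowEngine(
    {"ingest": "cluster-a", "forecast": "cluster-b"}, fallback="cluster-c"
)
broker.subscribe("resource.failed", engine.on_resource_failed)

# The resource monitor broadcasts a failure; the engine reacts locally.
broker.publish("resource.failed", {"resource": "cluster-b"})
print(engine.placements)
```

The publisher does not need to know which components react, so further subscribers (for instance a service factory or resource broker) could be added without changing the monitoring side.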

We are also constructing a performance model of the LEAD workflow to estimate resource requirements and understand workflow behavior on a diverse resource set, including the resources in the NSF TeraGrid. Using the monitoring data, the performance model and failure models, the adaptive LEAD infrastructure can assess and detect system performance anomalies, ensure recovery and guarantee continued operation of weather forecasting. Monitoring also enables the LEAD system to respond to additional resource requests based on varying weather conditions.

Acknowledgements: LEAD is funded by the National Science Foundation under the following Cooperative Agreements: ATM-0331594 (University of Oklahoma), ATM-0331591 (Colorado State University), ATM-0331574 (Millersville University), ATM-0331480 (Indiana University), ATM-0331579 (University of Alabama in Huntsville), ATM-0331586 (Howard University), ATM-0331587 (University Corporation for Atmospheric Research), and ATM-0331578 (University of Illinois at Urbana-Champaign).
