7A.1 DyNamo: Scalable Weather Workflow Processing in the Academic Multicloud

Tuesday, 14 January 2020: 3:00 PM
157C (Boston Convention and Exhibition Center)
Eric Lyons, Univ. of Massachusetts Amherst, Amherst, MA; and M. Zink, A. Mandal, C. Wang, P. Ruth, C. Radhakrishnan, G. Papadimitriou, E. Deelman, K. Thareja, and I. Rodero

The utility of operational sensor networks very often derives not from the delivery of raw measurements to end users, but from products built on those measurements: derived data products, multi-instrument fusion, image generation, decision-support notifications, and, in general, groups of processes that convert data into user-friendly, task-specific information. In designing these workflows, the system engineer must balance available compute and networking infrastructure against voluminous, often asynchronous data streams that arrive continuously and must be turned into usable information as quickly as possible. In mesoscale weather analysis, CPU load, network utilization, and algorithmic processing times can depend on the meteorological phenomena contained in the data, with user need greatest precisely when the system is most heavily taxed. The traditional approach of estimating maximum load on a per-algorithm basis and dedicating hardware capable of handling it is expensive and difficult to scale as a sensor network grows and/or data rates increase. Previous work presented at AMS discussed the use of automatically detected meteorological triggers to spawn VMs preconfigured with algorithms and data requests for on-demand processing of an associated weather phenomenon. An example is hail detection, a process that is rarely necessary until, at the very least, some precursor signs of convective precipitation are observed. While that approach reduces overall dedicated processing by acquiring resources only when certain meteorological conditions are present, processing chains must still be specified in advance for some static workload, with all the challenges of varying input data rates and compute loads that dedicated systems face.
Here we present DyNamo, a more robust network and compute solution that can be easily instantiated on heterogeneous academic multiclouds, with a simplified structure for process-chain definition and a scalable method for applying processes to arbitrary amounts of input data. The DyNamo framework has been adopted by the CASA radar program and is used operationally to process data from a network of seven X-band Doppler weather radars in the Dallas/Fort Worth metroplex in North Texas, along with the KFWS NEXRAD S-band radar. DyNamo leverages: (1) a multicloud provisioning system called Mobius that provides Python-based command-line utilities; (2) Pegasus, a workflow management system for process-chain definition, monitoring, and efficient execution; (3) HTCondor for scalable distributed parallelization; (4) containerized data-processing algorithms served by Docker and/or Singularity; and (5) configurable Layer 2 networking for a dedicated data plane, shared virtual networks between remote sites, and bandwidth provisioning tunable to the I/O requirements of the workflows. The CASA use case for DyNamo can be applied in single- and multicloud environments. In 2019, we demonstrated DyNamo on ExoGENI, a federated OpenStack-based academic cloud with hardware on ~20 university campuses, virtual machine instances available to researchers, and flexible networking, and on Chameleon, an NSF-supported academic cloud located at the University of Chicago and the Texas Advanced Computing Center (TACC) in Austin, TX. Chameleon allows for deeply programmable infrastructure, including CPUs, memory, disks, and a wide array of node types (GPUs, FPGAs, SDN switches, etc.).
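To make the provisioning model concrete, the sketch below shows how a multicloud resource request in the spirit of Mobius's JSON definitions might look. This is purely illustrative: the field names, site identifiers, and structure are hypothetical and do not reflect Mobius's actual schema.

```python
import json

# Illustrative multicloud request in the spirit of a Mobius JSON definition.
# All field names here are hypothetical, not Mobius's actual schema.
request = {
    "workflow": "casa-hail",
    "resources": [
        # CPU-heavy workers on Chameleon bare metal
        {"site": "Chameleon:TACC", "type": "baremetal", "count": 4,
         "role": "htcondor-worker"},
        # Lighter management/storage VMs on ExoGENI
        {"site": "ExoGENI", "type": "vm", "count": 2,
         "role": "workflow-manager"},
    ],
    # Dedicated Layer 2 data plane with a bandwidth target
    "network": {"layer": 2, "bandwidth_mbps": 500},
}
payload = json.dumps(request, indent=2)
```

A single call carrying a definition like this would describe compute at two sites plus the connecting network, rather than requiring per-site provisioning steps.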
Standard server-type compute nodes at Chameleon are delivered as bare metal with much higher performance and are preferred for CPU-intensive meteorological algorithms, whereas the smaller, more accessible ExoGENI VMs are used for workflow management, storage, quicker data-processing routines such as simple spatial comparisons, and the delivery of notifications. The Mobius provisioning tool can create multi-node virtual infrastructure across clouds, connected with high-bandwidth Layer 2 stitchport networks, all with a single API call and easily configured JSON definition files. To evaluate the performance benefits of the DyNamo framework, we selected a set of meteorological workflows previously instantiated on dedicated hardware for CASA radar network operations. Our initial evaluation used nowcasting data as input: once per minute, the nowcasting algorithm produces a forecast of gridded radar reflectivity for every minute out to 30 minutes. As part of the user-driven workflow, these 31 grids (minutes 0-30) are plotted into PNG image files and contoured at various reflectivity levels that serve as heuristic proxies for severity. GIS-style GeoJSON polygons are created and used for automatic notifications to users based on their current locations and distance preferences. Contouring is a CPU-intensive task when thunderstorms are widespread in areal coverage. We found substantial differences in algorithmic runtimes across weather regimes, with worst cases ~200x slower than average. Pegasus optimization techniques greatly reduced overall processing time by efficiently balancing compute, and HTCondor worker nodes could be acquired and released as resource demand changed without any modification to the workflow structure. More recent testing used voluminous radar moment data as primary input, along with much smaller radiosonde data, to run hail identification algorithms.
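The contour-to-polygon step described above can be sketched in a few lines of Python. This is a minimal stand-in, not CASA's production code: it contours a synthetic Gaussian "storm cell" at assumed severity thresholds (20/35/50 dBZ) and serializes the resulting segments as GeoJSON Polygon features. The grid, thresholds, and property names are illustrative.

```python
import json
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

# Synthetic reflectivity field (dBZ): a single Gaussian storm cell
# on a hypothetical DFW-area lat/lon grid.
x = np.linspace(-97.5, -96.5, 100)   # longitudes
y = np.linspace(32.5, 33.5, 100)     # latitudes
X, Y = np.meshgrid(x, y)
Z = 55.0 * np.exp(-(((X + 97.0) ** 2 + (Y - 33.0) ** 2) / 0.02))

levels = [20, 35, 50]  # reflectivity thresholds as severity proxies
cs = plt.contour(X, Y, Z, levels=levels)

# Convert each contour segment into a GeoJSON Polygon feature.
features = []
for level, segs in zip(levels, cs.allsegs):
    for seg in segs:
        ring = seg.tolist()
        if ring[0] != ring[-1]:
            ring.append(ring[0])  # GeoJSON rings must be closed
        features.append({
            "type": "Feature",
            "properties": {"dbz_level": level},
            "geometry": {"type": "Polygon", "coordinates": [ring]},
        })

collection = {"type": "FeatureCollection", "features": features}
geojson_text = json.dumps(collection)
```

The `dbz_level` property is what a downstream notification service could key on when deciding which users to alert. Because each of the 31 forecast grids is contoured independently, this step parallelizes naturally across HTCondor workers.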
HTCondor worker nodes were deployed on Chameleon bare-metal nodes, and VLANs were set up with 500 Mbps of dedicated throughput from the CASA data portal to the cloud location via an AL2S connection established for this purpose. Prior to DyNamo, the CASA radar operations center only had enough processing capacity to derive hail from one PPI file per radar every two minutes (1/6th of the total data) without queuing and rapidly increasing latency. With only a few Chameleon worker nodes, we were able to process 100% of the PPI data for hail, generate images for live online display, extract contours represented as GeoJSON polygons describing areas of identified hail, and relay these to our alert control server for notification. Much like nowcasting, hail-data generation showed significant runtime differences depending on the weather, and as such it also lent itself to the seamless elasticity of the DyNamo approach. Based on our experiments, we believe that many meteorological workflows are a good fit for DyNamo, and that its open-source, well-documented toolset is an attractive option for researchers balancing equipment costs against variable resource demand, multiple processing steps, and expanding domains.
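The location- and distance-based alerting step can be illustrated with a small self-contained sketch. This is not CASA's alert-server logic: it uses a haversine great-circle distance to each polygon vertex as a conservative stand-in for true point-to-polygon distance, and the polygon and user coordinates are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def should_alert(user_lat, user_lon, polygon, max_distance_km):
    """Alert if any vertex of the hail polygon is within the user's
    distance preference (a simple proxy for point-to-polygon distance)."""
    return any(
        haversine_km(user_lat, user_lon, lat, lon) <= max_distance_km
        for lon, lat in polygon  # GeoJSON vertex order: [longitude, latitude]
    )

# Hypothetical hail polygon near Fort Worth, and two users with a 10 km preference:
hail_poly = [[-97.33, 32.75], [-97.30, 32.75], [-97.30, 32.78], [-97.33, 32.78]]
nearby_user = should_alert(32.76, -97.32, hail_poly, max_distance_km=10)   # ~1 km away
distant_user = should_alert(33.20, -96.60, hail_poly, max_distance_km=10)  # ~80 km away
```

A check like this runs per user per polygon, so it is cheap relative to the hail identification itself and fits naturally on the smaller ExoGENI VMs handling notification delivery.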