Dataflow of a Multiple Instrument On-Demand Processing Engine with Python and DPLKit

Monday, 5 January 2015
Joseph P. Garcia, Univ. of Wisconsin, Madison, WI; and E. Eloranta and R. K. Garcia

The University of Wisconsin LIDAR Group has implemented the High Spectral Resolution LIDAR (HSRL) on-demand data website and processing codebase in Python so that it is more easily deployable, extensible, and reusable. The processing engine uses a data flow programming toolkit and convention, known as DPLKit, developed together with other atmospheric research programmers at the University of Wisconsin - Madison. These techniques introduce a natural end-to-end data flow, moving incrementally toward encapsulated data actors with simple interfaces that can be independently tested. The objects created can be rearranged and reused as needed, and they allow a researcher to focus on coding just the module he or she wants, unencumbered by data intake or outflow concerns. The system already includes shared input layers that ease support for multiple instrument sources in different formats, resampling filters that grid multiple sources onto a single coordinate system, image creation, processed file creation, and the ability to plug into an interactive UI for active data sources. These components demonstrate how to create modules that replace what would otherwise be single-use ancillary code. The website, hsrl.ssec.wisc.edu, uses networks of actors to process data on demand from HSRL deployments since 2003. It also includes an expanding codebase for advancements in instrumentation and processing, such as integrating products from co-located instruments to create new, innovative measurements.

Understanding Frame Streams.

Many traditional scientific processing engines operate on files directly, neglecting proper handling of file borders and I/O optimization until those concerns have already infiltrated the science code. To separate these tasks, and to address both real-time and retrospective processing, a general approach to the task interfaces is necessary. The fundamental idea of data flow is to think of data as a unidirectional flow of measurement frames in time. I/O layers need only retrieve data in chronological order, presenting every time step once and only once. File borders disappear from functional code, leaving only the more approachable and repeatable framing boundaries. The specific content of the frames themselves is flexible, and relevant only to the narrow requirements of the task consuming them. Any repeatable task operating on these frames may be broken out into its own “actor,” allowing the actor's input to come from any compatible source, familiar or otherwise. Each actor in the pipeline advertises what its output frame will contain as a dictionary attribute, available to the immediate downstream actor at initialization time. In adapting existing code, the manner in which actors are chained together may start out as a fixed construct, but it evolves into a more shared interface as new actors are added that plug in without the awareness of other actors, upstream or downstream.
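To make this concrete, the following sketch shows a minimal frame source written against these conventions. The dict-based frames, the "provides" attribute name, and the reader callable are assumptions made for this illustration, not necessarily DPLKit's actual interfaces.

    # A minimal sketch of a frame-stream source, assuming frames are plain
    # dicts and that output contents are advertised in a 'provides' attribute.
    # These names are illustrative, not DPLKit's actual API.
    import glob

    class FileFrameSource(object):
        """Yield measurement frames in chronological order, hiding file borders."""

        def __init__(self, pattern, reader):
            self.pattern = pattern   # e.g. 'data/*.nc'
            self.reader = reader     # callable: filename -> iterable of frame dicts
            # Advertise what downstream actors can expect in each frame.
            self.provides = {'time': 'datetime of the frame',
                             'backscatter': 'profile array for that time step'}

        def __iter__(self):
            for filename in sorted(glob.glob(self.pattern)):
                for frame in self.reader(filename):
                    yield frame      # every time step presented once, in order

A downstream actor never sees which file a frame came from; it sees only the advertised contents and the chronological stream.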

Using Python: Actors as Pythonic generators.

The natural implementation of repetitive processing is a for-loop, which for our purposes means using iterators. Python generators add elegance by creating an iterable object with surprisingly little code overhead. In Python, the simplest actor will simply iterate upon its source (upstream) actor, complete its task on each frame, and yield the product for the downstream actor. Any parameters needed by the actor should be provided at initialization time. During an actor's initialization, its parameters and the source actors' frame descriptions allow it to configure itself completely, as well as present expected errors at that moment instead of mid-operation. All runtime state is maintained within the actor as necessary, or within the actor's __iter__ function scope when operating as a generator. Because iterators may operate in a disjoint rhythm, more complex actors may accumulate multiple frames to operate on each frame in the context of those in close proximity in time and space. These actors would appear to yield a time deferred product when compared to the input frames, but this is not at all a problem so long as chronological order is maintained. Furthermore, an actor can accumulate and assemble multiple frames into a multi-dimensional compound-frame or a single high-level observation as its product.
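As a sketch of both patterns, a simple transforming actor and an accumulating actor might look like the following. The field names, actor names, and the "provides" convention carried over from the earlier sketch are invented for this example.

    # Two generator-style actors, assuming dict frames and a 'provides'
    # attribute as in the earlier sketch; all names here are illustrative.

    class ScaleBackscatter(object):
        """Simplest actor: transform each upstream frame and yield it."""

        def __init__(self, source, gain):
            self.source = source
            self.gain = gain
            # Configure from the upstream description at initialization so
            # missing fields fail here rather than mid-operation.
            if 'backscatter' not in source.provides:
                raise ValueError('upstream actor does not provide backscatter')
            self.provides = dict(source.provides,
                                 scaled='gain-corrected backscatter')

        def __iter__(self):
            for frame in self.source:
                out = dict(frame)
                out['scaled'] = frame['backscatter'] * self.gain
                yield out

    class TimeAverager(object):
        """Accumulating actor: buffer several frames, yield one averaged frame."""

        def __init__(self, source, count):
            self.source = source
            self.count = count
            self.provides = source.provides

        def __iter__(self):
            buffered = []            # runtime state lives in __iter__ scope
            for frame in self.source:
                buffered.append(frame)
                if len(buffered) == self.count:
                    avg = dict(buffered[0])
                    avg['backscatter'] = sum(f['backscatter']
                                             for f in buffered) / self.count
                    yield avg        # time-deferred, but still chronological
                    buffered = []

Chaining is then just nesting, e.g. TimeAverager(ScaleBackscatter(source, gain), count=5), which itself produces a new iterable pipeline.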

Modular Design Benefits.

By separating tasks into different actors that flow data in one direction, each operating in its own context, we have also effectively created an efficient computational pipeline. Minimizing side effects and cross talk between actors makes it more feasible to parallelize their tasks. The output of one actor can be diverted to another thread or process, or that hand-off can be encapsulated entirely within an actor. With a method of transport and resource management integrated, this becomes a clear path to greater efficiency. With each actor doing a deliberate and finite task, generalized modules can be written with full focus on the purpose of that module. This avoids duplication and lets a developer benefit from robust implementations, having only to provide the proper initialization. Building on this, an actor can also easily operate by calling separately developed code. Heritage code, library functions, and scientific code written by an expert are all fair game, as Python already has many community libraries and language bridges. Actors can serve as a sandbox for any environment or scientific workspace.
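One way to realize the thread hand-off described above is sketched here. ThreadedSource is a hypothetical helper, not part of DPLKit, and it assumes that frames can safely cross thread boundaries.

    # A hypothetical helper that runs an upstream actor on its own thread and
    # relays its frames through a bounded queue; not part of DPLKit.
    import threading
    try:
        import queue             # Python 3
    except ImportError:
        import Queue as queue    # Python 2

    class ThreadedSource(object):
        """Pump frames from the upstream actor on a worker thread."""

        _DONE = object()         # sentinel marking end of stream

        def __init__(self, source, maxsize=8):
            self.source = source
            self.provides = getattr(source, 'provides', {})
            self.maxsize = maxsize

        def _pump(self, q):
            for frame in self.source:
                q.put(frame)
            q.put(self._DONE)

        def __iter__(self):
            q = queue.Queue(maxsize=self.maxsize)
            worker = threading.Thread(target=self._pump, args=(q,))
            worker.daemon = True
            worker.start()
            while True:
                frame = q.get()
                if frame is self._DONE:
                    break
                yield frame
            worker.join()

Because ThreadedSource presents the same iterable interface and the same advertised contents as its upstream actor, it can be dropped into an existing chain without any other actor noticing.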

Steering and Containing Divergent Code Evolution.

In the process of adapting and separating existing code into DPL constructs, discrete tasks will have a tendency to evolve in place. When these segments are found to be useful to more than one role, they can be extracted into library functions and used where needed. This allows existing modules to maintain the same stable and mature functionality, while new development code can reuse these functions, extending the API minimally to be more use-agnostic. This approach keeps dramatic changes localized without compromising core functionality. Properly allowing development and experimentation to independently diverge from stable code produces a much more approachable and versatile system as a whole. In time, the experimental code can stabilize and mature to contribute its own reusable functionality back into the stable system.
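A compact, hypothetical illustration of this pattern: the core computation is extracted into a library function, which a stable actor and a newer experimental actor both call. Every name here is invented for the example and is not drawn from the HSRL codebase.

    # Hypothetical example of extracting shared functionality; names invented.

    def attenuated_backscatter(combined, molecular, gain):
        """Shared library function: the extracted, reusable core computation."""
        return (combined - molecular) * gain

    class StableProfileActor(object):
        """Mature actor keeps its established behavior, now via the shared function."""

        def __init__(self, source, gain):
            self.source = source
            self.gain = gain
            self.provides = dict(source.provides, profile='attenuated backscatter')

        def __iter__(self):
            for frame in self.source:
                yield dict(frame, profile=attenuated_backscatter(
                    frame['combined'], frame['molecular'], self.gain))

    class ExperimentalProfileActor(StableProfileActor):
        """Development code reuses the same function while it diverges."""

        def __iter__(self):
            for frame in self.source:
                profile = attenuated_backscatter(
                    frame['combined'], frame['molecular'], self.gain)
                # experimental addition: flag weak-signal bins
                yield dict(frame, profile=profile, weak_signal=profile < 1e-7)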

Acknowledgements.

This work was partially supported by NSF Grant #ARC-0946359.