A Python-Based Automatic Data Aggregation Framework for Hydrology Models

Tuesday, 4 February 2014: 9:30 AM
Room C302 (The Georgia World Congress Center)
L. C. V. Real, IBM Brazil Research Laboratory, São Paulo, Brazil; and F. Liu and T. Osiecki

Typical environmental modeling (e.g., flood modeling in hydrology and numerical weather forecasting in meteorology) requires a large amount of data to set up suitable initial and boundary conditions. Hydrological models, for instance, may require inputs such as topography (i.e., digital elevation model data), land use data, soil and vegetation type, and the locations of water bodies and river channels. Quite often the relevant data sources are scattered across public and private domains, on different servers, and in different formats. Moreover, the spatial resolution, geo-coordinates, coordinate systems, and acquisition dates of the data can vary with the location of the area to be modeled and with the acquisition technique used. Therefore, before hydrological models can use the data, the data sets from multiple sources have to be properly transformed and aligned, a task that is time-consuming and error-prone when performed manually. In this presentation, we introduce “DomainBuilder”, a Python-based framework that performs automatic data aggregation and data transformation.

DomainBuilder has a modular architecture, with a set of core classes implementing essential routines and a group of classes dealing with specific data sources. The core classes handle conversions between coordinate systems, the manipulation of archives (e.g., zip files), message logging, and thread management. A dedicated class handles communication with external data sources over the FTP and HTTP(S) protocols using the urllib2 library. Another class interfaces with grids in raster format using the GDAL and NumPy libraries. One particularly valuable feature is NumPy's abstraction for accessing binary files as arrays. NumPy implements that feature efficiently through the memory map system call, which allows large files (of the order of gigabytes) to be processed with no need for an external subroutine written in a compiled language such as C.
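The memory-mapped access described above can be illustrated with a minimal sketch. It is not DomainBuilder's code: the function name `block_mean`, the file name `dem.bin`, the `float32` cell type, and the block size are all assumptions chosen for illustration, and the grid is synthetic. It shows how `numpy.memmap` lets a raw binary grid be processed in row blocks without loading the whole file.

```python
import os
import tempfile

import numpy as np


def block_mean(path, shape, dtype=np.float32, rows_per_block=256):
    """Return the mean of a raw binary grid, reading it block by block.

    The file is memory-mapped, so only the rows being summed need to be
    resident in memory at any time.
    """
    grid = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    total = 0.0
    for start in range(0, shape[0], rows_per_block):
        block = grid[start:start + rows_per_block]
        # Accumulate in float64 to limit rounding error on large grids.
        total += float(block.sum(dtype=np.float64))
    return total / grid.size


# Synthetic stand-in for a large elevation file (hypothetical data).
shape = (1000, 500)
path = os.path.join(tempfile.mkdtemp(), "dem.bin")
np.arange(shape[0] * shape[1], dtype=np.float32).reshape(shape).tofile(path)

mean_elevation = block_mean(path, shape)
```

The same pattern would apply to a multi-gigabyte file, since the operating system pages in only the rows touched by each block.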

The classes that download and process data from different data sources have been designed so that they can be invoked stand-alone (i.e., they have a “main” function) as well as being imported into other Python code. That decision was beneficial for unit testing and for the downloading of certain data without having to maintain an excessive number of option switches in the main DomainBuilder utility. Most of these classes also maintain a local cache of data downloaded in the past, which is useful when two or more areas of interest fall within the geographic extent covered by that data set.
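A minimal sketch of this pattern, a module that exposes a `main` function, can be imported elsewhere, and keeps a local download cache, might look as follows. All names here (`cache_path`, `fetch`, the `--cache-dir` option, the example URL) are hypothetical and not taken from DomainBuilder. The sketch uses `urllib.request`, the Python 3 successor of urllib2, and a stand-in opener so that it runs offline.

```python
import argparse
import hashlib
import io
import os
import tempfile
import urllib.request


def cache_path(url, cache_dir):
    """Map a URL to a stable file name inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    name = os.path.basename(url.split("?")[0]) or "download"
    return os.path.join(cache_dir, f"{digest}_{name}")


def fetch(url, cache_dir, opener=urllib.request.urlopen):
    """Download url unless a cached copy exists; return the local path."""
    os.makedirs(cache_dir, exist_ok=True)
    target = cache_path(url, cache_dir)
    if not os.path.exists(target):
        with opener(url) as response, open(target, "wb") as out:
            out.write(response.read())
    return target


def main(argv=None):
    """Stand-alone entry point: fetch one URL through the cache."""
    parser = argparse.ArgumentParser(description="Fetch a file via the cache.")
    parser.add_argument("url")
    parser.add_argument(
        "--cache-dir",
        default=os.path.join(tempfile.gettempdir(), "handler_cache"),
    )
    args = parser.parse_args(argv)
    print(fetch(args.url, args.cache_dir))
    return 0


# A script would normally end with:
#     if __name__ == "__main__":
#         raise SystemExit(main())

# Offline demonstration with a fake opener that records each request.
requests_made = []


def fake_opener(url):
    requests_made.append(url)
    return io.BytesIO(b"tile bytes")


demo_dir = tempfile.mkdtemp()
demo_url = "http://example.invalid/tiles/tile_001.zip"
first = fetch(demo_url, demo_dir, opener=fake_opener)
second = fetch(demo_url, demo_dir, opener=fake_opener)  # served from cache
```

Passing the opener as a parameter keeps the download logic testable without network access, which matches the unit-testing benefit mentioned above.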

The input to DomainBuilder is a pair of latitude and longitude points, the upper left and lower right corners of the area of interest. These points are also used to look up the country name through a reverse geocoding operation. If classes that obtain higher-resolution data for that country exist, the tool chooses them over the defaults.
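The input handling and source selection described above could be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the `BoundingBox` class, the `select_source` function, and the registry entries are hypothetical, and the reverse geocoding step itself is left out (the country code is simply passed in).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Area of interest in decimal degrees."""
    north: float
    west: float
    south: float
    east: float

    def __post_init__(self):
        if not -90.0 <= self.south < self.north <= 90.0:
            raise ValueError("latitudes must satisfy -90 <= south < north <= 90")
        if not -180.0 <= self.west < self.east <= 180.0:
            raise ValueError("longitudes must satisfy -180 <= west < east <= 180")

    @classmethod
    def from_corners(cls, upper_left, lower_right):
        """Build a box from (lat, lon) upper-left and lower-right corners."""
        north, west = upper_left
        south, east = lower_right
        return cls(north=north, west=west, south=south, east=east)

    def center(self):
        """Return the (lat, lon) midpoint, e.g. for a reverse geocoding query."""
        return ((self.north + self.south) / 2.0, (self.west + self.east) / 2.0)


def select_source(country_code, registry, default):
    """Prefer a country-specific data source when one is registered."""
    return registry.get(country_code, default)


# Hypothetical registry of country-specific elevation sources.
dem_sources = {"BR": "national_high_resolution_dem"}

box = BoundingBox.from_corners(upper_left=(-23.3, -46.9), lower_right=(-23.8, -46.3))
lat, lon = box.center()
chosen = select_source("BR", dem_sources, default="global_dem")
fallback = select_source("XX", dem_sources, default="global_dem")
```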

To facilitate integration with the simulation models, the data sets obtained by DomainBuilder are converted to a raster format and cropped, rescaled, rotated and/or translated so that they all match the same geographic extent and grid size. The digital elevation model (DEM), which describes the terrain elevation, is used as the reference for those operations. Data sets that are prone to noise or artifacts may also require post-processing. The delineation of the watershed basins, which determines the scope within which some flood models operate, is performed as a post-processing step as well.
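The alignment step can be sketched with a nearest-neighbour resampling onto the reference grid. This is an illustrative simplification, not DomainBuilder's method: it assumes both grids are north-up and share the same geographic coordinate system, so rotation and reprojection (which would typically go through GDAL) are not covered. The function name and the `(west, south, east, north)` extent convention are assumptions.

```python
import numpy as np


def resample_to_reference(src, src_extent, ref_shape, ref_extent, nodata=np.nan):
    """Crop and rescale src onto a reference grid by nearest neighbour.

    Extents are (west, south, east, north); row 0 is the northern edge.
    Reference cells that fall outside the source extent receive nodata.
    """
    src_w, src_s, src_e, src_n = src_extent
    ref_w, ref_s, ref_e, ref_n = ref_extent
    rows, cols = ref_shape

    # Coordinates of the reference cell centres.
    lons = ref_w + (np.arange(cols) + 0.5) * (ref_e - ref_w) / cols
    lats = ref_n - (np.arange(rows) + 0.5) * (ref_n - ref_s) / rows

    # Index of the source cell containing each reference cell centre.
    col_idx = np.floor((lons - src_w) / (src_e - src_w) * src.shape[1]).astype(int)
    row_idx = np.floor((src_n - lats) / (src_n - src_s) * src.shape[0]).astype(int)

    valid_c = (col_idx >= 0) & (col_idx < src.shape[1])
    valid_r = (row_idx >= 0) & (row_idx < src.shape[0])

    out = np.full(ref_shape, nodata, dtype=float)
    out[np.ix_(valid_r, valid_c)] = src[np.ix_(row_idx[valid_r], col_idx[valid_c])]
    return out


# A coarse 2x2 layer resampled onto a 4x4 reference grid of the same extent.
coarse = np.array([[1.0, 2.0], [3.0, 4.0]])
aligned = resample_to_reference(coarse, (0, 0, 2, 2), (4, 4), (0, 0, 2, 2))

# Cropping: a one-cell reference grid over the south-east quarter.
cropped = resample_to_reference(coarse, (0, 0, 2, 2), (1, 1), (1, 0, 2, 1))
```

Nearest neighbour suits categorical layers such as land use; continuous fields such as precipitation might be better served by an interpolating scheme.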

Various data sources are incorporated in DomainBuilder, ranging from NASA's public repositories to OpenStreetMap, the latter being the primary source of information on land use. OpenStreetMap's database includes detailed classified information for a number of regions around the globe. Data can be obtained from any server that implements the Overpass API. From various map features, including buildings, highways, natural areas, land use, and water bodies, DomainBuilder estimates the surface roughness coefficients needed by the hydrological models. The base layer of the vegetation map is taken from the same data source. Additional layers, derived from remotely sensed data sets and from optional files provided by the user, are combined with it to produce a unified map. The results may then be used to model canopy interception. Another notable data source is the Harmonized World Soil Database, which combines existing regional and national updates of soil information worldwide with the FAO-UNESCO soil map of the world.
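As a rough illustration of the OpenStreetMap step, the sketch below builds an Overpass QL query for a bounding box and maps feature tags to a roughness value. It is a sketch under assumptions: the function names are hypothetical, the roughness numbers are placeholder values chosen only for illustration (they are not DomainBuilder's coefficients), and no request is actually sent. Overpass QL expects bounding boxes in `(south, west, north, east)` order.

```python
def build_overpass_query(south, west, north, east, keys, timeout=60):
    """Return an Overpass QL query for ways carrying any of the given keys."""
    bbox = f"{south},{west},{north},{east}"
    selectors = "".join(f'way["{key}"]({bbox});' for key in keys)
    return f"[out:json][timeout:{timeout}];({selectors});out geom;"


# Placeholder roughness values per (key, value) tag; None matches any value.
ROUGHNESS_BY_TAG = {
    ("landuse", "forest"): 0.10,
    ("landuse", "residential"): 0.05,
    ("natural", "water"): 0.03,
    ("highway", None): 0.015,
}
DEFAULT_ROUGHNESS = 0.04  # placeholder for unclassified surfaces


def roughness_for(tags):
    """Look up a roughness value for one map feature's tags."""
    # Exact (key, value) matches take priority over wildcard matches.
    for key, value in tags.items():
        if (key, value) in ROUGHNESS_BY_TAG:
            return ROUGHNESS_BY_TAG[(key, value)]
    for key in tags:
        if (key, None) in ROUGHNESS_BY_TAG:
            return ROUGHNESS_BY_TAG[(key, None)]
    return DEFAULT_ROUGHNESS


query = build_overpass_query(-23.8, -46.9, -23.3, -46.3, ["landuse", "building"])
forest = roughness_for({"landuse": "forest", "name": "example"})
road = roughness_for({"highway": "primary"})
unknown = roughness_for({"amenity": "bench"})
```

The resulting query string would be sent as the `data` parameter of a request to whichever Overpass server is in use.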

Another important feature of DomainBuilder is its capability to download historical precipitation estimates and historical flood data. The precipitation estimates come from NASA's TMPA (TRMM Multi-satellite Precipitation Analysis) database, which is available at a 3-hour interval, starting in January 1998, with a spatial resolution of 0.25 x 0.25 degrees. These data are enhanced by the daily estimates from the G-WADI database, which has a higher resolution of 0.04 x 0.04 degrees. The combined results are used to compute the return periods of severe precipitation events using techniques from extreme value theory. The flood data come from the Flood Observatory database at the University of Colorado, which allows flood events to be paired with the precipitation grids.
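The abstract does not say which extreme value technique is used. As one common possibility, the sketch below fits a Gumbel distribution to annual precipitation maxima by the method of moments and converts between precipitation amounts and return periods. The choice of Gumbel, the function names, and the sample values (synthetic, in mm/day) are all assumptions made for illustration.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(annual_maxima):
    """Estimate Gumbel (location, scale) by the method of moments."""
    mean = statistics.mean(annual_maxima)
    std = statistics.stdev(annual_maxima)
    scale = std * math.sqrt(6.0) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale


def return_period(value, location, scale):
    """Return period in years of an annual maximum exceeding value."""
    # Exceedance probability 1 - F(value), with F the Gumbel CDF;
    # expm1 keeps precision when the probability is small.
    exceedance = -math.expm1(-math.exp(-(value - location) / scale))
    return 1.0 / exceedance


def return_level(period_years, location, scale):
    """Value expected to be exceeded once every period_years on average."""
    return location - scale * math.log(-math.log(1.0 - 1.0 / period_years))


# Synthetic annual maxima of daily precipitation (mm/day), not real data.
maxima = [82.0, 95.5, 71.2, 110.3, 88.7, 102.1, 76.4, 130.8, 91.0, 85.6]
location, scale = fit_gumbel(maxima)

level_10 = return_level(10, location, scale)
level_100 = return_level(100, location, scale)
period_back = return_period(level_100, location, scale)
```

With a record as short as the satellite era, estimates for long return periods carry wide uncertainty, so a fit like this is best read as indicative.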

The choice of Python over other languages enabled efficient code management (e.g., merging of new code and redesign) and improved portability across multiple deployment platforms. DomainBuilder currently comprises about 20 modules (or 80% of the total) written in pure Python. The remaining modules use Python as a glue layer between external utilities (e.g., SAGA GIS, ImageMagick, Subversion) or subroutines written in other languages (e.g., R and C++). This modular arrangement has allowed DomainBuilder to be extended continually through the incorporation of other data sets that cover specific regions with better fidelity, resulting in a tool that is easily adapted to a multitude of domains.