92nd American Meteorological Society Annual Meeting (January 22-26, 2012)

Monday, 23 January 2012: 11:15 AM
Timeseries Data Management: Acquire, Store and Visualize Using Open Source Python
Room 346/347 (New Orleans Convention Center)
Jonathan Rocher, Enthought Inc, Austin, TX

Analyzing and visualizing time series efficiently is a recurring yet difficult task in meteorological and climate research. In this talk we will explore the Python ecosystem for doing this effectively, using only open source packages and projects that are mature yet still under active development.

We will first discuss data structures for efficiently loading and holding time series: standard NumPy arrays, the recently modified datetime module of NumPy, and domain-specific libraries such as scikits.timeseries, Pandas, Larry, etc. We will walk through the strengths and use cases of each of these tools.
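As a rough illustration of the first option, the sketch below holds a time series as two parallel NumPy arrays: a `datetime64` time axis and a float array of values. It assumes a current NumPy release; the temperature values are synthetic and the variable names are chosen here for illustration only, not taken from the talk.

```python
import numpy as np

# Hourly time axis for one day, stored as NumPy datetime64 values.
times = np.arange("2012-01-23T00", "2012-01-24T00", dtype="datetime64[h]")

# Synthetic temperatures in degrees C: an invented diurnal cycle, not real data.
hours = np.arange(times.size)
temps = 10.0 + 5.0 * np.sin(2 * np.pi * (hours - 9) / 24.0)

# Boolean mask selecting the afternoon samples (12:00 through 17:00).
start = np.datetime64("2012-01-23T12")
stop = np.datetime64("2012-01-23T17")
afternoon = (times >= start) & (times <= stop)
afternoon_mean = temps[afternoon].mean()

# Subtracting two datetime64 values yields a timedelta64.
step = times[1] - times[0]

print(f"{afternoon.sum()} afternoon samples, mean {afternoon_mean:.2f} C, step {step}")
```

Keeping times and values in separate plain arrays is the most basic layout; the labeled containers offered by the domain-specific libraries bundle the two together and add alignment and resampling on top.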

Storing time series has evolved drastically in recent years. Common relational databases (SQLite, PostgreSQL, Oracle, MySQL, ...) can be accessed from Python through a set of standard modules. Python also provides access to standard hierarchical data formats (HDF5, netCDF, ...) with the packages needed to read and write them efficiently (PyTables or h5py). We will illustrate with a few examples how relational databases and hierarchical datasets can be used for storing time series. The focus will be on the trade-offs between the two data models.
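A minimal sketch of the relational side, using the `sqlite3` module from the Python standard library with an in-memory database. The table layout, column names, and observation values are hypothetical and serve only to show the pattern of one row per (station, timestamp) pair.

```python
import sqlite3

# Hypothetical observations: (ISO 8601 timestamp, station id, temperature in C).
rows = [
    ("2012-01-23T10:00:00", "KMSY", 14.2),
    ("2012-01-23T11:00:00", "KMSY", 15.1),
    ("2012-01-23T12:00:00", "KMSY", 16.3),
    ("2012-01-23T11:00:00", "KAUS", 12.8),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE obs ("
    " ts TEXT NOT NULL,"
    " station TEXT NOT NULL,"
    " temp_c REAL,"
    " PRIMARY KEY (station, ts))"
)
conn.executemany("INSERT INTO obs VALUES (?, ?, ?)", rows)
conn.commit()

# ISO 8601 strings sort chronologically, so a text comparison gives a time range.
result = conn.execute(
    "SELECT ts, temp_c FROM obs WHERE station = ? AND ts >= ? ORDER BY ts",
    ("KMSY", "2012-01-23T11:00:00"),
).fetchall()
conn.close()

for ts, temp_c in result:
    print(ts, temp_c)
```

The relational model makes queries across stations and time ranges easy to express, at the cost of one row per sample. A hierarchical format such as HDF5 would typically store each series as a contiguous array instead, which tends to suit bulk numerical reads.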

Data analysis in Python can likewise leverage many open source packages in SciPy and the surrounding scikits ecosystem. We will review the parts of SciPy that are useful for statistical analysis and for Monte Carlo simulations dealing with forecasts and statistical knowledge. Additional regression and statistical tools can be found in the statsmodels scikit, which will be illustrated if time permits.
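One way such an analysis might look, sketched with `scipy.stats`: fit a normal distribution to a sample of forecast errors, then estimate an exceedance probability by Monte Carlo sampling and compare it with the analytic value. The error sample is synthetic, and the 3 C threshold and the choice of a normal model are assumptions made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic forecast errors in degrees C; invented for illustration.
errors = rng.normal(loc=0.5, scale=2.0, size=200)

# Fit a normal distribution to the sample (maximum likelihood estimates).
mu, sigma = stats.norm.fit(errors)

# Monte Carlo: draw simulated errors from the fitted distribution and
# estimate the probability that an error exceeds the threshold.
threshold = 3.0
simulated = stats.norm.rvs(loc=mu, scale=sigma, size=100_000, random_state=rng)
p_mc = float(np.mean(simulated > threshold))

# Analytic value from the survival function, for comparison.
p_exact = float(stats.norm.sf(threshold, loc=mu, scale=sigma))

print(f"mu={mu:.2f} sigma={sigma:.2f} p_mc={p_mc:.4f} p_exact={p_exact:.4f}")
```

For a normal model the analytic answer is available directly, so the simulation is only a check here; the sampling approach becomes useful when the quantity of interest has no closed form.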

The final part of the talk will present tools for visualizing these time series with a powerful 2D visualization library: Chaco. Part of the Enthought Tool Suite, Chaco is an open source package that focuses on handling large datasets. It allows one to quickly develop custom tools for interacting with a plot: selection tools, overlays, ... We will illustrate this and, if time permits, demo how to embed this plotting functionality inside a complete application that provides a full user interface using the Traits package.

Combining the Python language with powerful packages like NumPy, PyTables and Chaco allows one to create a time-series data management platform that is robust, fast, maintainable and open.
