92nd American Meteorological Society Annual Meeting (January 22-26, 2012)

Monday, 23 January 2012: 11:45 AM
Lessons From Deploying the USHCN Pairwise Homogenization Algorithm in Python
Room 346/347 (New Orleans Convention Center)
Daniel Alexander Rothenberg, MIT, Cambridge, MA; and N. Barnes

Much of the software used for research in the atmospheric sciences - from analysis software and simple numerical algorithms to highly complex codes such as general circulation models - is freely available and open for public consumption. While the publication of these codes is meant to increase the transparency, openness, and reproducibility of research in our field, the codes themselves are, ironically, often met with scrutiny and characterized as incomprehensible black boxes. An example of software suffering this fate is the pairwise homogeneity adjustment routine used to detect and adjust undocumented inhomogeneities in the US Historical Climatology Network. Although the code behind this routine is freely available, ships with documentation and a ready-to-go test case based on synthetic data, and is described in detail in the peer-reviewed literature, the homogenization process is still a source of consternation among those critical of the observed surface temperature record.

A potential way to reduce this consternation is to independently deploy the algorithm as a new piece of software. If the development of the new code is rigorously documented, adheres to a set of software engineering principles, and regularly subjects working code to peer review, then a finished product that produces results similar to those of the original code should increase confidence and trust in the method. Using a modern, widely adopted language (such as Python) and well-documented libraries can also help demystify the code, leading to more transparency.

I carried out this development process on the aforementioned pairwise homogenization code and implemented the core algorithm in pure Python. In this talk, I will detail the considerations that went into this development, both numerical (precision, and the adjustments to methods that are necessary when switching from Fortran to Python) and computational (improving performance through better design philosophy, the multiprocessing module, and the potential use of NumPy). I will also detail how the independent development process helped identify bugs and glitches in the original code that had previously been overlooked, and discuss the practicality and potential benefits of porting complex legacy codes - like the homogenization routine - to a modern codebase.
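
As an illustration of the kind of precision consideration mentioned above, the following minimal sketch compares a mean accumulated in emulated 32-bit arithmetic with one computed in Python's native 64-bit floats. It assumes, purely for illustration, that the legacy Fortran code stores values in single precision; the abstract does not state this. The helper names (to_float32, mean_float32, mean_float64) and the temperature values are hypothetical and are not taken from the homogenization code.

```python
import math
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mean_float32(values):
    """Mean in which every intermediate result is rounded to single precision.

    This emulates a naive single-precision accumulation loop (an assumption
    about how a legacy code might behave, not a description of the original).
    """
    total = to_float32(0.0)
    for v in values:
        total = to_float32(total + to_float32(v))
    return to_float32(total / len(values))


def mean_float64(values):
    """Mean in native double precision, using an accurately rounded sum."""
    return math.fsum(values) / len(values)


# Hypothetical series of 1200 monthly temperatures (degrees C), made up here.
temps = [10.1 + 0.01 * i for i in range(1200)]

single = mean_float32(temps)
double = mean_float64(temps)
diff = abs(single - double)

print("single-precision mean:", single)
print("double-precision mean:", double)
print("absolute difference:  ", diff)
```

A discrepancy of this kind is small compared with typical temperature adjustments, but it can matter when a port is validated by comparing its output with the original, or when a value lies close to a statistical test threshold.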

Supplementary URL: http://code.google.com/p/ccf-homogenization/