J1.5 Making earth science data more accessible: experience with compression and chunking

Wednesday, 9 January 2013: 5:00 PM
Room 11AB (Austin Convention Center)
Russell K. Rew, UCAR, Boulder, CO

Best practices for data providers typically include use of standard data formats (such as netCDF or HDF5), standard conventions (such as the Climate and Forecast (CF) metadata conventions), and standard coordinate system representations (such as WKT, EPSG, OGC, or CF-proposed).

Following these best practices may be necessary to provide discoverable and accessible data suitable for current users and applications, but it is not sufficient for future users and unanticipated applications. Applying appropriate compression and chunking to datasets is an under-appreciated step that adds value to earth science data, whether the data is stored on servers for remote access or in local files for intensive analysis. Providers of observational data or the output of numerical models can improve the usefulness, accessibility, interoperability, and value of their data by understanding the benefits of compressing and chunking the data for efficient use. We describe some recently developed and improved tools for accomplishing this important step.

Huge datasets that contain missing values, fill values, special flag values, or more precision than any application needs may be too unwieldy for investigators other than those who collected the data. Applying appropriate data compression techniques can make it possible for others to use the data in a wider variety of contexts. Remote access to subsets of such data is often practical only if the compression strategy is chosen with both flexibility and performance in mind. For example, whole-file compression is not a practical strategy when most uses of the data require only small subsets of a file.
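Why fill-heavy data compresses so well can be illustrated with a small, self-contained sketch, using Python's zlib as a stand-in for the deflate filter that netCDF-4/HDF5 apply per chunk (the 90% sparsity is an invented proportion for the example; the fill value shown is the netCDF default for doubles):

```python
import random
import struct
import zlib

random.seed(42)

FILL = 9.969209968386869e+36  # netCDF default fill value for doubles
N = 20_000

# A sparse observational field: 90% fill values, 10% real measurements
# (proportions invented for illustration).
sparse = [FILL if random.random() < 0.9 else random.random() for _ in range(N)]
# A dense field of the same size, with no redundancy to exploit.
dense = [random.random() for _ in range(N)]

def deflated_size(values):
    """Serialize as 64-bit floats and deflate, as an HDF5 gzip filter would."""
    raw = struct.pack(f"{len(values)}d", *values)
    return len(zlib.compress(raw))

print("sparse field:", deflated_size(sparse), "bytes compressed")
print("dense field: ", deflated_size(dense), "bytes compressed")
```

The long runs of identical fill-value bytes in the sparse field deflate to a small fraction of the size of the dense field, even though both fields occupy the same 160,000 bytes uncompressed.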

Chunking, sometimes called multidimensional tiling, reorders gridded data to support efficient access along multiple axes. For example, judiciously rechunking a dataset that stores time as its most slowly varying dimension can preserve relatively efficient access by geographic area while speeding up time-series retrieval tremendously. Experience with rechunking large datasets has led to some insights and interesting heuristic algorithms that have been implemented in command-line tools.
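One family of such heuristics balances the number of chunks touched by the two most common access patterns: a full time series at one point, and a full horizontal slice at one time. The sketch below is a simplified version of that balancing idea, not the exact algorithm of any particular tool; the dimension names and default chunk-size target are assumptions:

```python
import math

def balanced_chunks(t, y, x, target_elems=250_000):
    """Pick chunk sizes (ct, cy, cx) for a (time, lat, lon) variable so that
    a time series at one point and a horizontal slice at one time each read
    roughly the same number of chunks, while each chunk holds about
    target_elems values.  Simplified sketch of a balancing heuristic."""
    n_chunks = t * y * x / target_elems          # total chunks in the variable
    ct = max(1, round(t / math.sqrt(n_chunks)))  # time series reads ~sqrt(n_chunks) chunks
    cy = max(1, round(y / n_chunks ** 0.25))     # slice reads (y/cy)*(x/cx) ~ sqrt(n_chunks)
    cx = max(1, round(x / n_chunks ** 0.25))
    return ct, cy, cx

# A 10,000-step series on a 100 x 100 grid:
print(balanced_chunks(10_000, 100, 100, target_elems=100_000))
```

For this example the heuristic yields chunks of about 316 × 18 × 18, so both access patterns read only a few dozen chunks, rather than the time-series access reading one chunk per time step as it would with contiguous per-timestep storage.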

Combining compression with chunking multiplies the choices facing data providers, but it also increases the opportunities for optimizing accessibility of the data. The choice of chunk shapes and sizes influences compression performance and efficiency. Locality in the data can affect compression, which in turn can constrain the chunk shapes worth considering.
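One side of this tradeoff is easy to demonstrate: because each chunk is compressed independently, very small chunks give the compressor little context and add per-chunk overhead. The sketch below compresses the same redundant byte stream once as a whole and once as many small independent pieces (the 64-byte chunk size is deliberately tiny to exaggerate the effect):

```python
import zlib

# Highly redundant synthetic data, standing in for a smooth geophysical field.
data = b"temperature pressure humidity " * 2000  # 60,000 bytes

# Compress once as a single stream.
whole = len(zlib.compress(data))

# Compress the same bytes as many small independent "chunks".
CHUNK = 64
chunked = sum(len(zlib.compress(data[i:i + CHUNK]))
              for i in range(0, len(data), CHUNK))

print(f"one stream: {whole} bytes; tiny chunks: {chunked} bytes")
```

The independently compressed tiny chunks total many times the size of the single stream, which is one reason chunk shapes that are good for access patterns must still be weighed against their effect on compression ratio.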

We demonstrate some of the benefits and tradeoffs of simple tools such as the netCDF nccopy utility and the HDF5 h5repack utility for compressing and rechunking data to significantly improve its value to users. We also report on benchmark results, showing how well these techniques work with real data, making it practical to access time series at selected points from large satellite datasets in which time was originally the most slowly varying dimension.
