Climate Data Science: A Framework for Improving Computational Climate Analysis

Rothenberg, Daniel

Data analysis and modeling in climate science are hard. But they shouldn't be.

Most students pursuing atmospheric science degrees are exposed to the basics of computer programming. For instance, many undergraduate programs require introductory courses where students learn flow control, problem decomposition, and the rudiments of "computational thinking". Regardless of which tools are introduced in these classes (Fortran, MATLAB, Python, Java, C/C++, etc.), they provide a critical first exposure to the basic toolkit that powers research and advances in our field. But a student's practical education in coding and data analysis is more often ad hoc and leans heavily on informal mentoring; you probably start out with the language, tools, and approaches that your research group has adopted, simply because that's how things have always been done.

This ad hoc, fragmented approach to learning analysis techniques in the atmospheric/climate/ocean sciences is not sustainable given the scientific and computational challenges on the horizon. A lack of standard practices and tools leads to inefficiency as different groups reinvent the wheel for the same core analyses. Reproducibility and transparency suffer greatly when there are no standards for publishing and documenting code. Fragmentation of methods and techniques complicates collaborations. Many traditional techniques simply do not readily apply to petascale analyses or distributed/cloud computing. But worst of all, a lack of "best practices" divorces our community from the much larger data science world, which is actively solving problems very similar to the ones we face in our research.

In this talk, I introduce the idea of "climate data science" - a framework for approaching computational and analysis problems which draws on techniques and lessons learned in the broader data science community, but with best practices tailored for the kinds of data commonly found in the climate domain. The climate data science framework is designed to minimize the friction between our data and the tools and paradigms that were developed in other fields but are readily available for scientific analysis. Adopting these paradigms could help foster interdisciplinary collaborations and facilitate knowledge transfer between fields. Ideally, this framework could serve as the basis for workshops or a short undergraduate- or graduate-level course. All of the examples I present are based on Python libraries and toolkits, but the fundamental techniques could be readily applied to other analysis languages such as R.
