8.3 Managing Large Datasets and Computation Workflows with Disco and Blaze (invited)

Wednesday, 9 January 2013: 11:30 AM
Room 12B (Austin Convention Center)
Ben Zaitlen, Continuum Analytics, Austin, TX

Data collection and generation have become dramatically easier over the last decade -- so much so that almost all scientific communities need new solutions for processing and managing this deluge of data. In particular, two main problems arise: 1) the computational power of a single machine is not enough to complete a task in a reasonable amount of time; and 2) the memory of a single machine is not large enough to load and process all the data. In this talk, I will step through two Python-based tools and services that can help to solve these problems -- tools which elevate the novice and give power to the domain expert. The first tool is Disco, whose Disco Distributed Filesystem (DDFS) provides a distributed data store accessible from disparate Python scripts. The second tool is Blaze, which provides the user with an array-oriented, data-centric programming environment where simple scripts can be used to transform arrays in a universal memory space.
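As a minimal sketch of the map/reduce programming model that Disco exposes: in a real Disco job these two functions would be handed to disco.core.Job and executed across a cluster, with inputs and outputs stored in DDFS; here the same logic runs locally in plain Python so only the shape of the code is shown. The function names and driver are illustrative, not Disco's API.

```python
# Word count in the map/reduce style used by Disco jobs.
# map_fn and reduce_fn mirror the map/reduce callables a Disco Job takes;
# run_local is a hypothetical stand-in for cluster job submission.
from itertools import groupby
from operator import itemgetter

def map_fn(line, params):
    # Emit (word, 1) for every word in one line of input.
    for word in line.split():
        yield word, 1

def reduce_fn(pairs, params):
    # Group sorted (word, count) pairs by word and sum the counts.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

def run_local(lines):
    # Local stand-in for job submission: map every input line, then reduce.
    mapped = (pair for line in lines for pair in map_fn(line, None))
    return dict(reduce_fn(mapped, None))

counts = run_local(["to be or not to be"])
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Because the per-record map and the per-key reduce are pure functions, the same two definitions scale from this local driver to a cluster, which is what lets the framework rather than the script handle problems 1) and 2) above.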