Large-scale and distributed data processing platforms

Involved members: 

Many platforms for big data processing have emerged in recent years, and many are still emerging with different platforms focussing on different types of data processing such as memory-based data processing, graph processing or stream-based processing.

However, most of them still require much expertise from the programmer to construct workflows that execute efficiently. We are interested in the challenge to make this easier and offer the data analyst a high-level declarative workflow language that comes with a workflow optimizer that can transform this to an efficient workflow execution on possibly different distributed data-processing back-ends.

There are currently two approaches that are being investigated:

  • Logic-based workflow specifications: The query-language Datalog is a very simple and well-understood, both in terms of its expressiv power as well as how it can be optimized, query language that is based on first-order logic. The reserach focuses on extensions of Datalog where arithmetic and aggregation is added, which defines a very powerful language that preserves most of the pleasant properties of Datalog.
  • Graph-based and Comprehension-based workfow specifications: Often simple workflows can be specified as directed acyclic graph (DAGs) or extended versions thereof. These give a very intuitive interface for used, yet can be relatively expressive when extended with constructs for iteration, selection, et cetera. There is also a close connection here with resarch on (subsets of) functional programming languages that can descrive very concisely the desired data processing workflow in a declarative manner.

In both approaches there are several questions that need to be addressed, such as expressive power and usability, but also about the reasoning over the equivalence of different workflow specifications and the optimization of their execution.