Real-time scalable graph analytics

FOSDEM 2016

I'll introduce differential dataflow, an open-source analytics platform, and describe how it enables fundamentally new approaches to large-scale graph processing. Specifically, we'll see how to write and run standard graph analyses whose outputs are automatically updated as their inputs change. On billion-edge graphs, this approach can be more efficient than platforms like GraphX while also providing sub-second update times.

(stolen from https://github.com/frankmcsherry/differential-dataflow)

Differential dataflow is a data-parallel programming framework designed to efficiently process large volumes of data and to quickly respond to changes in input collections.

Like many other data-parallel platforms, differential dataflow supports a variety of data-parallel operators such as group_by and join. Unlike most other data-parallel platforms, differential dataflow also includes an iterate operator, which repeatedly applies a differential dataflow subcomputation to a collection until it converges. This is especially useful in the context of graph processing.
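
To make the idea of iterating to convergence concrete, here is a minimal Python sketch of a fixed-point loop applied to graph reachability. It is a toy model, not differential dataflow's API: the names iterate, step, and reachable_from are hypothetical, and unlike the real system it recomputes each round from scratch rather than incrementally.

```python
def iterate(step, collection):
    """Apply `step` repeatedly until the collection stops changing."""
    current = frozenset(collection)
    while True:
        updated = frozenset(step(current))
        if updated == current:
            return current  # converged: another round changes nothing
        current = updated


def reachable_from(edges, roots):
    """Nodes reachable from `roots` by following directed `edges`."""
    def step(reached):
        # Join the reached nodes with the edges on the source node,
        # then add the roots back in so they are never lost.
        frontier = {dst for (src, dst) in edges if src in reached}
        return set(roots) | frontier

    return iterate(step, roots)


# Hypothetical example graph: 1 -> 2 -> 3, and a separate edge 4 -> 5.
edges = [(1, 2), (2, 3), (4, 5)]
print(sorted(reachable_from(edges, [1])))  # [1, 2, 3]
```

The loop stops as soon as one more application of step leaves the collection unchanged, which is what convergence means here.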

Once you have written a differential dataflow program, you then update the input collections and the implementation will respond with the correct updates to the output collections. These updates (both input and output) have the form (data, diff), where data is a typed record and diff is an integer indicating the change in the number of occurrences of data. A positive diff indicates more occurrences of data and a negative diff indicates fewer. If things are working correctly, you never see a zero diff.
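
The (data, diff) representation can be sketched with a plain multiset. The Python below is an illustrative model under stated assumptions, not the library's implementation: consolidate and apply_updates are hypothetical helper names, and a collection is modelled as a dict from record to occurrence count.

```python
from collections import Counter


def consolidate(updates):
    """Sum the diffs for each record and drop records whose net diff is zero."""
    totals = Counter()
    for data, diff in updates:
        totals[data] += diff
    return {data: diff for data, diff in totals.items() if diff != 0}


def apply_updates(collection, updates):
    """Return a new multiset (record -> count) with the updates applied."""
    merged = list(collection.items()) + list(updates)
    return consolidate(merged)


# Hypothetical edge collection containing one copy of the edge ("a", "b").
edges = {("a", "b"): 1}
updates = [
    (("a", "b"), -1),  # one fewer occurrence of ("a", "b")
    (("b", "c"), +1),  # one more occurrence of ("b", "c")
    (("c", "d"), +1),  # added and removed again: nets out to zero
    (("c", "d"), -1),
]
print(apply_updates(edges, updates))  # {('b', 'c'): 1}
```

Records whose diffs cancel never appear in the result, which mirrors the statement that a zero diff should never be observed.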

Differential dataflow is efficient because it communicates only in terms of differences. At its core is a computational engine which is also based on differences, and which performs only the work that corresponds to a change in the trace of the computation resulting from changes to the inputs. Achieving this property in the presence of iterative subcomputations is the main "unique" feature of differential dataflow.
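
As a rough illustration of doing work only where inputs change, the Python sketch below maintains out-degree counts for a graph and, for each batch of edge diffs, emits output diffs only for the nodes whose degree actually changed. The DegreeCount class is hypothetical and far simpler than the real engine; in particular it does not handle iteration.

```python
from collections import defaultdict


class DegreeCount:
    """Toy incremental operator: edge diffs in, (node, degree) diffs out."""

    def __init__(self):
        self.degree = {}  # node -> current out-degree (non-zero entries only)

    def update(self, edge_updates):
        # Net change in out-degree per source node for this batch.
        delta = defaultdict(int)
        for (src, _dst), diff in edge_updates:
            delta[src] += diff

        output = []
        for node, change in delta.items():
            if change == 0:
                continue  # diffs cancelled: no work and no output for this node
            old = self.degree.get(node, 0)
            new = old + change
            if old > 0:
                output.append(((node, old), -1))  # retract the old count
            if new > 0:
                output.append(((node, new), +1))  # assert the new count
            if new != 0:
                self.degree[node] = new
            else:
                self.degree.pop(node, None)
        return output


op = DegreeCount()
print(op.update([((1, 2), +1), ((1, 3), +1), ((2, 3), +1)]))
# [((1, 2), 1), ((2, 1), 1)]
print(op.update([((1, 3), -1)]))
# [((1, 2), -1), ((1, 1), 1)]
```

In the second batch only node 1 is touched, so node 2 produces no output and costs no work, however large the rest of the graph is.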

Speakers: Frank McSherry