Using Cascalog and Hadoop for rapid graph processing and exploration

FOSDEM 2012

Graphs are becoming increasingly popular as ways of modeling a wide variety of systems. As such, the label "graph processing" also covers a range of objectives and architectural constraints. At [Linkfluence][http://us.linkfluence.net/], we use graph processing on datasets produced with very different systems (Web crawler, Twitter and Facebook API, feed aggregator, etc.) We spend a lot of time doing exploratory programming, trying to use our eclectic datasets in interesting ways, and processing our data in asynchronous workflows. We have come to see [Hadoop][http://hadoop.apache.org/] and the processing framework [Cascalog][https://github.com/nathanmarz/cascalog] as essential tools in our toolbox when dealing with graphs, since it gives us architectural flexibility, scalability and the possibility of rapid prototyping. Cascalog is an open source framework built on top of Hadoop and [Cascading][http://www.cascading.org/]. Its syntactic and semantic roots come from Datalog and Prolog, which have been succesfully applied for a long time in the manipulation of graphs. Also, its ability to directly embed the expressive [Clojure][http://clojure.org/] language allows to very easily define custom operations and ad-hoc processing. In this talk, we will present the framework, consider its advantages and drawbacks when compared to other approaches, show concrete exemples of usage for graph processing and how we use them to complement graph databases.

Speakers: Nils Grünwald