Traditional searches take the form of individual queries over large mostly-stable corpuses of documents. In this talk, we'll show how we invert this paradigm to allow for searching over streams of documents by combining Samza, a distributed stream-processing framework, with Luwak, a library for efficiently running large numbers of queries over individual documents.
Real-time searching over streams is useful in a number of contexts. For example, companies may want to detect whenever they are mentioned in a news feed; or a Twitter user might want to see a continuous stream of tweets for a particular hashtag.
Luwak provides a mechanism for running many thousands of queries over a single document in a highly efficient manner, by filtering out queries that it can detect will not match. Luwak is designed to run on a single node, holding all registered queries in RAM. Scaling to higher document throughput, or to more queries, requires parallelization across multiple machines.
Samza provides a framework for such parallelization, by partitioning and recombining both the document streams and the query set (which can be treated as just another stream), and also provides fault-tolerance mechanisms that allows swift recovery from machine failure, without losing documents or queries.