Effective spelling correction with term relation graphs using Lucene FSTs

FOSDEM 2015

Our approach for doing spelling correction has deviated considerably from the default approach Lucene is offering. While Lucene's FuzzyQuery uses a compiled Automaton to filter the dictionary rather efficiently, we directly use normal Automatons and intersect that automaton with a separate FST. This proves to be more efficient for our use case, since it saves memory and time to compile the automaton and also part of the time to identify the matching terms in the dictionary.

This approach gives us the possibility to store any meta-information with each term, which is used then to pick the top N spellcorrections. We build a term co-occurence graph, where each vertex is a possible spelling correction of a term and each link is the co-occurence in the same document. Each vertex and link get a score based on the meta-information and edit distance. Then we use graph reduction techniques until the graph contains the desired number of spellcorrections.

Speakers: Anna Ohanyan