Regular expressions are commonly implemented using a backtracking algorithm, or by converting them into a nondeterministic finite automata using Thompson's construction. An alternate approach is to use Brzozowski derivatives to directly generate deterministic finite automata. This talk describes an implementation of Brzozowski derivatives in Python, based on a paper by Owens, Reppy and Turon. This implementation is used for generating efficient Unicode lexers.
Owens, Reppy and Turon [1] describe how Brzozowski derivatives may be used to convert an extended regular expression into near optimal deterministic finite state automaton. They observe that "RE derivatives have been lost in the sands of time, and few computer scientists are aware of them". Despite that, this technique is simple, elegant and straightforward to implement. The author has used Brzozowski derivatives to build a lexer generator in Python that supports Unicode.
The structure of the talk is:
- Overview of partial derivatives of regular expressions.
- How to deal with large character sets.
- Key design decisions for a Python implementation.
- Quick tour of the Python code.
- Real world results.
[1] Owens, S., Reppy, J. and Turon, A., 2009. Regular-expression derivatives re-examined. Journal of Functional Programming, 19(2), pp.173-190.