Writing efficient data pipelines in Python can be tricky. The standard recommendation is to use vectorized functions implemented in NumPy, Pandas, or the like. However, what can be done when the processing task does not fit these libraries? Using plain Python for processing can result in poor performance, in particular when handling large data sets. Rust is a modern, performance-oriented programming language that is already widely used in the Python community. Augmenting data processing steps with Rust can result in substantial speed-ups. In this talk, I will present strategies for using Rust in a larger Python data processing pipeline, with a particular focus on pragmatism and minimizing integration effort.
One common strategy is to wrap the Rust part as a Python extension module. With enough care, the extension module can have a Pythonic feel and substantially improve performance. While libraries such as PyO3 offer streamlined APIs, this task can still require a lot of work. An often simpler alternative is to package the Rust part as an executable and communicate via files or the network. This talk will focus on JSON messages exchanged via stdin/stdout and on dataframe-like data in Arrow-compatible files. JSON is broadly supported in both Python and Rust, and serialization can easily be handled with libraries such as Serde (Rust) or cattrs (Python). The Arrow in-memory format supports complex data types, such as structs, lists, maps, and unions. These files can then be processed efficiently in Python by an ever-growing list of libraries, most prominently pandas and polars. I will discuss the different strategies using real-world use cases and offer tips on how to implement them; the sketches below illustrate the basic shape of each approach. Finally, I will summarize the respective strengths and weaknesses of the approaches.
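To make the extension-module route concrete: once the Rust crate is compiled (PyO3 projects are commonly built with maturin), the result is imported and called like any other Python module. A minimal sketch, assuming a hypothetical module `fast_ops` that exposes a Rust-implemented `sum_of_squares` function (neither name comes from the talk):

```python
# Hypothetical extension module compiled from Rust with PyO3.
import fast_ops

values = [0.5, 1.5, 2.5]

# The loop over the values runs in compiled Rust, not in the interpreter.
total = fast_ops.sum_of_squares(values)
print(total)  # 8.75
```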
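For the executable route with JSON messages, the Python side needs little more than the standard library plus a (de)serialization helper. A minimal sketch, assuming a hypothetical Rust binary `./process` that reads one JSON record per line on stdin and writes one JSON result per line on stdout; the `Record` and `Result` classes and their fields are illustrative, not from the talk:

```python
import json
import subprocess
from dataclasses import dataclass

import cattrs


@dataclass
class Record:
    id: int
    value: float


@dataclass
class Result:
    id: int
    score: float


def process_records(records: list[Record]) -> list[Result]:
    """Send records as JSON lines to a Rust executable and parse its output."""
    # "./process" stands in for the compiled Rust binary.
    proc = subprocess.run(
        ["./process"],
        input="\n".join(json.dumps(cattrs.unstructure(r)) for r in records),
        capture_output=True,
        text=True,
        check=True,
    )
    # cattrs validates and converts the plain dicts back into typed objects.
    return [
        cattrs.structure(json.loads(line), Result)
        for line in proc.stdout.splitlines()
        if line
    ]
```

On the Rust side, the matching loop is a handful of Serde-annotated structs plus serde_json calls, which keeps both halves of the protocol declarative.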
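For larger, dataframe-like outputs, the Rust process can write an Arrow IPC file instead of streaming JSON, and Python picks it up with any Arrow-capable library. A sketch, assuming a hypothetical file `results.arrow` produced by the Rust side:

```python
import polars as pl
import pyarrow.ipc

# "results.arrow" stands in for an Arrow IPC file written by the Rust process.
df = pl.read_ipc("results.arrow")

# The same file is readable via pyarrow, e.g. for pandas interoperability.
reader = pyarrow.ipc.open_file("results.arrow")
table = reader.read_all()
pdf = table.to_pandas()
```

Because Arrow is a typed columnar format, nested values such as structs or lists survive the round trip without the stringly-typed encoding that JSON would force.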
Speaker: Christopher Prohm