Raised by Pandas, striving for more: An opinionated introduction to Polars

PyCon DE & PyData Berlin 2023

Pandas is the de facto standard for data manipulation in Python, and I personally love it for its flexible syntax and interoperability. But Pandas has well-known drawbacks, such as memory inefficiency, inconsistent handling of missing data and a lack of multicore support. Multiple open-source projects aim to solve those issues; the most interesting is Polars. Polars uses Rust and Apache Arrow to win all kinds of performance benchmarks, and it evolves fast. But is it already stable enough to migrate an existing Pandas codebase? And does it meet the high expectations that long-time Pandas lovers have for query-language flexibility? In this talk, I will explain how Polars can be that fast, and present my insights on where Polars shines and in which scenarios I stay with Pandas (at least for now!).

Pandas and Polars are both popular open-source libraries for data manipulation and analysis in Python. While both offer a range of powerful tools for working with data, there are several key differences that users should be aware of when choosing between them.

One of the main differences is the way they handle data processing and evaluation. Pandas uses a traditional, eager evaluation model, in which each operation is evaluated immediately and its result is returned. In contrast, Polars offers optional lazy evaluation, which delays the evaluation of operations until their results are actually needed. This can be especially useful for large or complex datasets: because the whole query is known before it runs, it can be optimized as a unit, which can reduce the amount of data that needs to be processed at any given time.

Another key difference is the way the two libraries handle data storage and indexing. Pandas is built around a powerful indexing system that allows users to quickly access and manipulate specific rows or columns of data. However, this indexing system can be complex and can sometimes lead to slower performance. In contrast, Polars does not use indexes, which can simplify the underlying data structure and improve performance.

In terms of functionality, Pandas has a number of features that are not currently available in Polars. For example, Pandas offers built-in plotting functionality, which can be useful for visualizing and interpreting data during exploratory data analysis. Additionally, Pandas is much more strongly integrated into the PyData ecosystem and is more widely used in data analysis and scientific computing. This can make it easier for users to find resources and support when working with Pandas.

One notable difference between the two libraries is the syntax and API. Polars is inspired by the popular distributed computing library Apache Spark, but uses a column-based API in contrast to the row-based API within Spark. In general, the Polars syntax will feel more familiar to Spark users than to Pandas users.

Overall, both Pandas and Polars are powerful libraries with a lot to offer for data manipulation and analysis in Python. Which library is the better choice depends on the specific needs and goals of the user. By understanding the differences between the two, users can make an informed decision about which one suits them best.
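
The contrast between eager and lazy evaluation described above can be illustrated with a toy sketch in plain Python. This is not how Polars is implemented: the `LazyPlan` class, its methods and the sample rows are hypothetical and exist only to show the idea that a lazy query records its steps and does the work once, when the result is requested.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


class LazyPlan:
    """Hypothetical toy query plan: records steps, runs them only in collect()."""

    def __init__(self, rows: List[Row], steps: Tuple = ()) -> None:
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyPlan":
        # Nothing is evaluated here; the step is only appended to the plan.
        return LazyPlan(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns: str) -> "LazyPlan":
        return LazyPlan(self._rows, self._steps + (("select", columns),))

    def collect(self) -> List[Row]:
        # Single pass over the data: every recorded step is applied row by row,
        # so no full intermediate table is ever materialized.
        result = []
        for row in self._rows:
            keep = True
            for kind, arg in self._steps:
                if kind == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:  # "select"
                    row = {name: row[name] for name in arg}
            if keep:
                result.append(row)
        return result


rows = [
    {"city": "Berlin", "temp": 21, "humidity": 40},
    {"city": "Hamburg", "temp": 17, "humidity": 60},
    {"city": "Munich", "temp": 24, "humidity": 35},
]

# Eager style: each step runs immediately and builds an intermediate list.
warm = [r for r in rows if r["temp"] > 20]
eager_result = [{"city": r["city"]} for r in warm]

# Lazy style: building the plan does no work; collect() triggers evaluation.
plan = LazyPlan(rows).filter(lambda r: r["temp"] > 20).select("city")
lazy_result = plan.collect()
```

Both styles produce the same rows; the difference is when the work happens and how much intermediate data is created. In Polars itself, the corresponding entry points are `DataFrame.lazy()` to start a lazy query and `LazyFrame.collect()` to execute it.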

Speakers: Nico Kreiling