The Python data ecosystem has matured during the last decade, and there are fewer and fewer reasons to rely solely on large batch processes executed on a Spark cluster. But as with every large ecosystem, putting the key pieces of technology together takes some effort. There are now better storage technologies, streaming execution engines, query planners, and low-level compute libraries. And modern hardware is far more powerful than you'd probably expect. In this workshop we will explore some global-warming-reducing techniques to build more efficient data transformation pipelines in Python, and a little bit of Rust.
Look at the big data architecture diagram of most corporations and you'll find a Spark cluster at the center; some of them have even adopted Spark as the de facto platform for ETL. If you already have a Spark cluster, it's fine to use it, but there may be other ways to extract, transform, and load large volumes of data more efficiently and with less overhead.
Some of the technologies that we'll cover are:
* DuckDB. Probably the hottest piece of technology of this decade (see the sketch right after this list).
* Polars.
* DataFusion, and a little bit of Rust.
* Microbatching.
* Statistical tests.
* We'll dive a little into what makes Parquet datasets so great.
* Filter pushdown and predicate pushdown.
* Overlapping communication and computation.
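
To give a taste of the DuckDB part, here is a minimal sketch of the kind of query we'll run against a Parquet dataset: DuckDB scans the files in-process and pushes the filter down to the Parquet reader, so only the matching row groups are read. The file layout and column names (`spins/*.parquet`, `spin_ts`, `table_id`, `outcome`) are made up for illustration; the workshop uses its own dataset.

```python
import duckdb

con = duckdb.connect()  # in-process database, no cluster required

# The WHERE clause is pushed down to the Parquet scan, so row groups whose
# min/max statistics rule out the predicate are never read from disk.
spins_per_table = con.execute(
    """
    SELECT table_id, outcome, count(*) AS spins
    FROM read_parquet('spins/*.parquet')
    WHERE spin_ts >= DATE '2024-01-01'
    GROUP BY table_id, outcome
    ORDER BY table_id, outcome
    """
).fetch_df()

print(spins_per_table.head())
```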
We'll work on a synthetic use case in which we'll try to find out whether an online casino is manipulating its roulette tables. To make things harder, we'll use an old, underpowered desktop PC, with roughly the computing power of a modern Raspberry Pi, to crunch around half a terabyte of data.
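
As a hint of the statistical side, here is a purely illustrative sketch of the kind of test we'll apply: a chi-squared goodness-of-fit test that checks whether the observed frequency of each roulette pocket deviates from what a fair wheel would produce. The counts below are randomly generated stand-ins, not the workshop data.

```python
import numpy as np
from scipy.stats import chisquare

# Simulated spin counts for the 37 pockets (0-36) of a European roulette wheel.
rng = np.random.default_rng(seed=0)
counts = rng.multinomial(100_000, [1 / 37] * 37)

# Null hypothesis: every pocket is equally likely. A tiny p-value would be
# evidence that the wheel (or the casino's software) is not behaving fairly.
stat, p_value = chisquare(counts)
print(f"chi2 = {stat:.1f}, p = {p_value:.3f}")
```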