WALD: A Modern & Sustainable Analytics Stack

PyCon DE & PyData Berlin 2023

The name **WALD**-stack stems from the four technologies it is composed of, i.e. a cloud-computing **W**arehouse like Snowflake or Google BigQuery, the open-source data integration engine **A**irbyte, the open-source full-stack BI platform **L**ightdash, and the open-source data transformation tool **D**BT. Using a Formula 1 Grand Prix dataset, I will give an overview of how these four tools complement each other for analytics tasks in an ELT approach. You will learn what each tool is used for and which features set it apart. My talk is based on a full tutorial, which you can find at [waldstack.org](https://waldstack.org).

The current zeitgeist is that the data lake concept from classical data engineering and modern data warehousing from business intelligence are converging more and more. This is also driving the shift from ETL to ELT, and so tools such as [dbt] are becoming increasingly important in combination with modern Big Data warehouses such as [Snowflake] and [Google BigQuery]. For typical data and ML engineers, this is quite a departure from familiar tools like [Spark]. Coming from a pure Spark and ETL background myself, I was motivated by this trend to explore the foreign realms of ELT, data warehousing and especially the buzz around [dbt]. In this talk I want to share my key insights with classical data and ML engineers who might have only heard about [Snowflake], [dbt], [Airbyte] and [Lightdash] but have never cared to dig deeper.

My talk is structured like this:

* short introduction to the differences between data lake and data warehouse, and between ETL and ELT
* high-level introduction of Snowflake, Airbyte, dbt, and Lightdash
* demonstration based on the [Kaggle Formula 1 World Championship dataset] to see those four tools in action
* my main take-aways and key insights

After this talk, you will have learned the differences between ETL & ELT, what these four tools do and in which cases you should consider the WALD stack. Also, you will know how to use Python instead of SQL to define models in dbt, which is a brand-new feature.

The WALD-stack is sustainable since it consists mainly of open-source technologies; however, all technologies are also offered as managed cloud services. The data warehouse itself, i.e. [Snowflake] or [Google BigQuery], is the only non-open-source technology in the WALD-stack. In my talk, I will focus on the open-source parts of the WALD-stack.

[dbt]: https://www.getdbt.com/
[Snowflake]: https://www.snowflake.com/
[Lightdash]: https://github.com/lightdash/lightdash
[Airbyte]: https://airbyte.com/
[Google BigQuery]: https://cloud.google.com/bigquery
[Kaggle Formula 1 World Championship dataset]: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020
[Spark]: https://spark.apache.org/
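
To make the ETL versus ELT distinction a bit more concrete, here is a minimal sketch that is not taken from the talk or the tutorial. It assumes an in-memory SQLite database as a stand-in for the warehouse, and the rows, table names, and function names are all made up for illustration. In the WALD-stack, the loading step would roughly correspond to Airbyte and the in-warehouse SQL transformation to dbt.

```python
import sqlite3

# Hypothetical raw rows, loosely shaped like race results (made-up values).
# Everything arrives as text, as it often does from a raw source.
RAW_RESULTS = [
    ("driver_a", "1", "25"),
    ("driver_b", "2", "18"),
    ("driver_a", "3", "15"),
]


def etl(conn, rows):
    """ETL: transform in application code, then load only the final shape."""
    totals = {}
    for driver, _position, points in rows:
        totals[driver] = totals.get(driver, 0) + int(points)
    conn.execute("CREATE TABLE driver_points_etl (driver TEXT, points INTEGER)")
    conn.executemany(
        "INSERT INTO driver_points_etl VALUES (?, ?)", sorted(totals.items())
    )


def elt(conn, rows):
    """ELT: load the raw data unchanged, then transform with SQL in the warehouse."""
    conn.execute(
        "CREATE TABLE raw_results (driver TEXT, position TEXT, points TEXT)"
    )
    conn.executemany("INSERT INTO raw_results VALUES (?, ?, ?)", rows)
    # The transformation is plain SQL executed where the data already lives.
    conn.execute(
        "CREATE TABLE driver_points_elt AS "
        "SELECT driver, SUM(CAST(points AS INTEGER)) AS points "
        "FROM raw_results GROUP BY driver"
    )


def run_demo():
    """Run both approaches against one in-memory database and return both results."""
    conn = sqlite3.connect(":memory:")
    etl(conn, RAW_RESULTS)
    elt(conn, RAW_RESULTS)
    etl_rows = conn.execute(
        "SELECT driver, points FROM driver_points_etl ORDER BY driver"
    ).fetchall()
    elt_rows = conn.execute(
        "SELECT driver, points FROM driver_points_elt ORDER BY driver"
    ).fetchall()
    conn.close()
    return etl_rows, elt_rows
```

Both functions produce the same aggregated table. The difference is that the ELT variant keeps the untouched `raw_results` table in the warehouse, so further models can be built on it later without re-extracting the source.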

Speakers: Florian Wilhelm