The Modern Data Stack has brought a wave of new buzzwords into the data engineering lexicon: "data mesh", "data observability", "reverse ETL", "data lineage", "analytics engineering". In this light-hearted talk we will demystify these terms and the evolving landscape that will define the future of data analytics & engineering teams.
Our journey begins with the PyData Stack: pandas pipelines powering ETL workflows... clean, tested code with data validation, perfect for in-memory workflows. As demand for self-serve analytics grows, new data sources bring more APIs to model, more code to maintain, DAG workflow orchestration tools, new nuances to capture ("the tax team defines revenue differently"), more dashboards, and more not-quite-bugs ("but my number says this...").
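For illustration, a minimal sketch of such an in-memory pandas pipeline might look like the following; the file name, column names and connection string are all hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: pull raw records from a source (a CSV stand-in for an API;
# "orders.csv" and its columns are hypothetical)
raw = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Validate: fail fast on data that would silently corrupt downstream metrics
assert raw["order_id"].is_unique, "duplicate order ids"
assert (raw["amount"] >= 0).all(), "negative order amounts"

# Transform: in-memory modelling, e.g. monthly revenue per region
monthly_revenue = (
    raw.assign(month=raw["order_date"].dt.to_period("M").astype(str))
       .groupby(["region", "month"], as_index=False)["amount"]
       .sum()
)

# Load: write the modelled table to the analytics database (placeholder DSN)
engine = create_engine("postgresql://user:pass@host/analytics")
monthly_revenue.to_sql("monthly_revenue", engine, if_exists="replace", index=False)
```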
This data maturity journey is a well-trodden path with common pitfalls & opportunities. After dashboards comes predictive modelling ("what will happen?"), then prescriptive modelling ("what should we do?"), and perhaps eventually automated decision making. Getting there is much easier with the advent of the Python-powered Modern Data Stack.
In this talk, we will cover the shift from ETL to ELT and the open-source Modern Data Stack tools you should know, with a focus on how dbt's new Python integration is changing how data pipelines are built, run, tested & maintained. By understanding the latest trends & buzzwords, attendees will gain deeper insight into Python's role at the core of the future of data engineering.
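As a taste of what that integration looks like, here is a minimal sketch of a dbt Python model (dbt >= 1.3), assuming a warehouse adapter with Python support such as Snowflake/Snowpark; the upstream model "stg_orders" and its column names are hypothetical:

```python
# models/monthly_revenue.py -- a dbt Python model lives alongside SQL models
def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() resolves the upstream model inside the warehouse (the "T" in ELT);
    # on Snowpark the result can be converted to pandas for in-memory work
    orders = dbt.ref("stg_orders").to_pandas()

    monthly = orders.groupby(["REGION", "MONTH"], as_index=False)["AMOUNT"].sum()

    # The returned DataFrame is materialised back into the warehouse as a table
    return monthly
```

Dependency resolution, testing and materialisation work just as they do for dbt's SQL models, so Python transformations slot straight into the existing DAG.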
This light-hearted talk will aim to introduce the audience to the theory and terminology of data pipelines and architectures past, present and future. The "Modern Data Stack", a set of interoperable tools, introduced a shift in how organisations can rapidly construct a data architecture that combines multiple data sources into a single unified data warehouse, with clean, analytics-ready tables into which BI tools, self-serve analytics dashboards, and ML models can be plugged.
Until recently, the complexity of data transformation and modelling was limited to what could be done in SQL, leaving the rich ecosystem of Python tooling for complex transformations, geospatial analytics, time series modelling, data validation and clean, tested, CI-enabled codebases mostly uninvited to the Modern Data Stack party. That changed in 2022, when a number of tools (most notably dbt) launched Python integrations, opening up a world of productivity and fast, scalable data processing for the PyData-savvy Pythonista.
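For example, a schema-validation library like pandera can turn an analytics-ready table into an executable contract; a minimal sketch with assumed column names:

```python
import pandas as pd
import pandera as pa

# Hypothetical analytics-ready table
df = pd.DataFrame({"region": ["EU", "US"], "amount": [120.0, 80.0]})

schema = pa.DataFrameSchema({
    "region": pa.Column(str),
    "amount": pa.Column(float, checks=pa.Check.ge(0)),  # revenue is never negative
})

schema.validate(df)  # raises pandera.errors.SchemaError if the contract is broken
```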
Another recent trend is an explosion of jargon, with analytics engineers getting into heated debates around whether data observability or metadata capture should be prioritised within a data mesh architecture. These are all important concepts, especially for organisations operating at a scale where reliable data governance is mission-critical. But not all organisations operate at that scale, and every organisation, large or small, is on its own data maturity journey.
My goal with this talk is to bring these concepts together, introduce attendees to these recent trends, and provide a framework they can take back into their organisations for accelerating their own data maturity journey using the latest tooling & best practices.