🎤
Honey, I broke the PyTorch model >.< - Debugging custom PyTorch models in a structured manner
Speakers:
👤
Clara Hoffmann
📅 Mon, 17 Apr 2023 at 10:50
When building PyTorch models for custom applications from scratch, there's usually one problem: the model does not learn anything. In a complex project, it can be tricky to identify the cause: Is it the data? A bug in the model? Choosing the wrong loss function at 3 am after an 8-hour coding session? In this talk, we will build a toolbox to find the culprits in a structured manner. We will focus on simple ways to ensure a training loop is correct, generate synthetic training data to determine whether we have a model bug or problematic real-world data, and leverage pytest to safely refactor PyTorch models. After this talk, attendees will be well equipped to take the right steps when a model is not learning, quickly identify the underlying reasons, and prevent bugs in the future.
PyTorch models for off-the-shelf applications are easy to build and debug. But in real-world ML applications, debugging can become quite tricky - especially when model complexity is high and only noisy real-world data is available. When our DNN is not learning, many factors can be at fault:
- Is there a bug in the model structure - for example, mixed-up channels or timesteps?
- Is our dataset not large or homogeneous enough to learn something? Have we mixed up labels in the preprocessing?
- Have we chosen incorrect losses, accidentally skipped layers, or chosen inappropriate activation functions?
The plethora of potential reasons can be overwhelming to engineers. This talk will introduce a structured approach and valuable tools for efficiently debugging PyTorch models. We'll start with techniques to check that training loops are correct, such as ensuring our model can overfit a single training example. In the second step, we'll investigate how to generate simple, synthetic data for arbitrary input and output formats to validate our model. Finally, we'll look at how to avoid model bugs altogether by setting up universal tests that can be used during development and refactoring to prevent breaking PyTorch models.
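As a rough illustration of the first check, here is a minimal sketch (model, data, and hyperparameters are placeholders, not from the talk) of the classic sanity test of overfitting a single batch: if the loss does not approach zero, the training loop or model likely has a bug.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and a single fixed batch (shapes are arbitrary assumptions).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(8, 10)          # one small batch, reused every step
y = torch.randint(0, 2, (8,))   # fixed labels for that batch

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# A healthy training loop should drive the loss on this single batch close to zero.
assert loss.item() < 0.01, f"failed to overfit a single batch (loss={loss.item():.3f})"
```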
🎤
Cooking up a ML Platform: Growing pains and lessons learned
Speakers:
👤
Cole Bailey
📅 Mon, 17 Apr 2023 at 10:50
What is an ML platform and do you even need one? When should you consider investing in your own ML platform? What challenges can you expect building and maintaining one? Tune in and discover (some) answers to these questions and more! I will share a first-hand account of our ongoing journey towards becoming an ML platform team within Delivery Hero's Logistics department, including how we got here, how we structure our work, and what challenges and tools we are focusing on next.
🎤
Apache StreamPipes for Pythonistas: IIoT data handling made easy!
Speakers:
👤
Tim Bossenmaier
👤
Sven Oehler
📅 Mon, 17 Apr 2023 at 10:50
The industrial environment offers a lot of interesting use cases for data enthusiasts. There are myriads of interesting challenges that can be solved by data scientists. However, collecting industrial data in general, and industrial IoT (IIoT) data in particular, is cumbersome and not really appealing for anyone who just wants to work with data. Apache StreamPipes addresses this pitfall and allows anyone to extract data from IIoT data sources without messing around with (old-fashioned) protocols. In addition, StreamPipes' newly developed Python client now gives Pythonistas the ability to programmatically access this data and work with it in a Pythonic way. This talk will provide a basic introduction to the functionality of Apache StreamPipes itself, followed by a deeper discussion of the Python client. Finally, a live demo will show how IIoT data can be easily retrieved in Python and used directly for visualization and ML model training.
The industrial environment is becoming an increasingly attractive use case for data enthusiasts, with challenges ranging from predictive maintenance to robotics to autonomous vehicles. Building a full-fledged IIoT architecture is a big endeavor, especially for small and medium-sized companies with limited resources. It requires IIoT specialists with extensive knowledge of industrial protocols, software architects capable of designing an IIoT platform, and cloud specialists able to operate an infrastructure at scale that can handle potentially massive data streams. However, the added value lies not in the technical infrastructure, but in the data itself. Therefore, it should be as easy as possible for data scientists to analyze data and gain new insights without worrying about underlying technical details. But such a project has many pitfalls, which is why many projects are not even initiated because the costs seem too high.

These pitfalls are addressed by Apache StreamPipes, an end-to-end toolbox that allows anyone to easily extract, explore and analyze IIoT data. With its new Python client, it targets Python data enthusiasts (e.g., data scientists) who want to work with IIoT data but don't want to get their hands dirty interacting with industrial systems. Via an easy-to-use Python client, developers can get streaming or historic data from StreamPipes' internal data management layer in a Pythonic representation such as dictionaries or pandas DataFrames. This allows data scientists to work with their familiar tech stack and use the extracted data directly for analytics, visualizations, or even machine learning. StreamPipes handles all the infrastructure, such as the message broker or time-series storage, and provides many out-of-the-box features that ease data analytics of industrial sources: more than 20 data adapters for quickly getting access to a variety of industrial protocols, built-in pre-processing rules to harmonize sensor and other data on the fly, and a pipeline editor featuring over 100 algorithms and a rich user interface to interactively build data processing pipelines.

Apache StreamPipes is a large and mature open-source project which started as a research project in 2015 and made its way to an Apache top-level project in November 2022, with a community of currently more than 25 active contributors.

The talk will provide a basic introduction to Apache StreamPipes, followed by a deeper discussion of the Python client focusing on the target audience (Python developers). The main part is about data handling with Python, and design decisions within the client for common patterns will be discussed in detail. To conclude, we will show how IIoT data can be extracted via Apache StreamPipes and used for further analytics within the Python world. Attendees will get familiar with Apache StreamPipes in general, its mission, and its core modules. In addition, common IIoT patterns will be presented and illustrated using the Python client of Apache StreamPipes. The presentation includes an extensive demo with many hands-on examples.
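As a rough sketch of what the client-side workflow looks like (connection details are placeholders, and the calls follow the pattern in the StreamPipes Python client documentation at the time; treat the exact names as assumptions):

```python
from streampipes.client import StreamPipesClient
from streampipes.client.config import StreamPipesClientConfig
from streampipes.client.credential_provider import StreamPipesApiKeyCredentials

# Placeholder connection settings for a local StreamPipes instance.
config = StreamPipesClientConfig(
    credential_provider=StreamPipesApiKeyCredentials(
        username="user@example.com", api_key="<api-key>"
    ),
    host_address="localhost",
    https_disabled=True,
    port=80,
)
client = StreamPipesClient(client_config=config)

# Overview of the measurements stored in the StreamPipes data lake ...
measures = client.dataLakeMeasureApi.all()
# ... pulled into a familiar pandas DataFrame for further analysis.
df = measures.to_pandas()
```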
🎤
Pandas 2.0 and beyond
Speakers:
👤
Joris Van den Bossche
👤
Patrick Hoefler
📅 Mon, 17 Apr 2023 at 10:50
Pandas has reached a 2.0 milestone in 2023. But what does that mean? And what is coming after 2.0? This talk will give an overview of what happened in the latest releases of pandas and highlight some topics and major new features the pandas project is working on.
The pandas 2.0 release is targeted for the first quarter of 2023. This is a major milestone for the pandas project, and this talk will start with an overview of this release. Pandas 2.0 includes some new (experimental) features, but mostly means enforcing deprecations that were accumulated in the 1.x series, along with some necessary breaking changes. But that doesn't mean there are no interesting features to talk about! The main part of the presentation will showcase some new features, some already released as opt-in features and others to come in future releases:
- Support for non-nanosecond resolution datetimes, allowing time spans ranging over a billion years.
- Improved support for nullable data types, including easy opt-in options for I/O functions.
- Experimental integration with pyarrow to back columns of a DataFrame (beyond the string dtype).
A major change that is under way is a change to the copy and view semantics of operations in pandas (related to the well-known (or hated) SettingWithCopyWarning). This is already available as an experimental opt-in to test and use the new behaviour, and will probably be a highlight of pandas 3.0.
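To make the opt-ins concrete, here is a small sketch of how these features are enabled in pandas 2.0 (the data is a toy example; the flags are those documented in the 2.0 release notes):

```python
import io
import pandas as pd

# Copy-on-Write: opt in to the new copy/view semantics.
pd.options.mode.copy_on_write = True

# Nullable / pyarrow-backed dtypes via the dtype_backend keyword of I/O functions.
csv = io.StringIO("a,b\n1,x\n,y\n")
df = pd.read_csv(csv, dtype_backend="pyarrow")   # or "numpy_nullable"

# Non-nanosecond datetimes: second resolution supports spans far beyond 1677-2262.
s = pd.Series(["1000-01-01", "3000-01-01"], dtype="datetime64[s]")

print(df.dtypes, s.dt.year, sep="\n")
```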
🎤
How to teach NLP to a newbie & get them started on their first project
Speakers:
👤
Lisa Andreevna Chalaguine
📅 Mon, 17 Apr 2023 at 10:50
The materials presented during this tutorial are open source and can be used by coaches and tutors who want to teach their students how to use Python for text processing and text classification. A minimal understanding of programming (in any language) is required of students.
The materials presented at this tutorial were initially created for high school and university students to help them get started with their first machine learning project using textual data. Machine learning on textual data is more accessible for beginners because it does not involve missing data imputation, normalisation and scaling. It is also easier to analyse and interpret the results (e.g. why something was misclassified). There are many introductory NLP courses on the internet; however, they are not free, and they either only cover the complete basics¹ or do not cover machine learning algorithms² and treat models as a black box. They also do not show how to do research correctly (e.g. setting a baseline, making design decisions based on correct validation, etc.). These materials, in the form of Jupyter notebooks, can be used by teachers to guide their students through an NLP research project from start to finish. The materials are of course not limited to teachers and tutors at academic institutions. Many companies rely on customer reviews, social media, client records, and various other content created in natural language, but often use sub-optimal solutions to analyse it (like MS Excel). These materials will give working professionals all the tools to get started with text analysis, as well as teach them the fundamentals of machine learning, so they can automate document labelling and other manual tasks with the help of document classification (e.g. Is a customer review positive or negative? Is a certain document about topic X or topic Y?). A minimal understanding of programming (in any language) is required. However, all necessary Python libraries will be covered. The aim of the tutorial is to present the materials, which contain 7 “lectures”, several practical exercises with solutions, and a case study, and hence can be covered either in 10 hours (10 weeks) over a term or in a 2-day workshop. ¹https://www.udemy.com/course/natural-language-processing/ ²https://www.udemy.com/course/nlp-natural-language-processing-with-python/
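As a flavour of the kind of starter project such materials cover, a minimal text classification baseline in scikit-learn might look like this (toy data; not taken from the tutorial materials themselves):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny toy dataset: reviews labelled positive (1) or negative (0).
texts = [
    "great product, works perfectly",
    "terrible, broke after one day",
    "absolutely love it",
    "waste of money",
]
labels = [1, 0, 1, 0]

# TF-IDF features + logistic regression: a classic, interpretable baseline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["love this, would buy again"]))  # expected: [1]
```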
🎤
Accelerate Python with Julia
Speakers:
👤
Stephan Sahm
📅 Mon, 17 Apr 2023 at 10:50
Speeding up Python code has traditionally been achieved by writing C/C++ — an alien world for most Python users. Today, you can write high-performance code in Julia instead, which is much easier for Python users to pick up. This tutorial will give you hands-on experience writing a Python library that incorporates Julia for performance optimization.
Julia is a modern data science language which solves the two-language problem by being both easy to use and highly performant. Although different from Python, the language can be learned quickly by Python users, making it a good choice for speeding up pieces of code. In addition to being a similar language, Julia is designed for high-performance applied mathematics and has high-quality libraries for multi-dimensional arrays, dataframes, distributed computing and more. The older alternative of writing pieces of code in C, C++, Cython or Rust is much more cumbersome: here, programmers have to cope with a low-level language, static types, pointers, no garbage collector, a lack of scientific libraries and other difficulties not normally faced by Python users. Until now, there was simply no better alternative. The tutorial will be fully hands-on, using Jupyter Notebook and Binder to provide a smooth and easy-to-use environment for each participant. Both Julia and Python work seamlessly within Jupyter. We start with an introduction to the basics of Julia, focusing on the core differences with Python and how to work around common translation difficulties. Then we'll take a look at the interfaces between Julia and Python and build a Python sample project that runs Julia code. Finally, we will benchmark our solution. After the tutorial you will be able to use Julia to speed up your Python code.
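One way to bridge the two languages (the tutorial's exact setup may differ) is the juliacall package from PythonCall.jl; a minimal sketch:

```python
import numpy as np
from juliacall import Main as jl  # pip install juliacall

# Define a Julia function and call it from Python.
jl.seval("""
function sumsq(xs)
    s = 0.0
    @inbounds @simd for x in xs
        s += x * x
    end
    return s
end
""")

xs = np.array([1.0, 2.0, 3.0])   # numpy arrays map cheaply to Julia arrays
print(jl.sumsq(xs))              # 14.0, computed by compiled Julia code
```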
🎤
From notebook to pipeline in no time with LineaPy
Speakers:
👤
Thomas Fraunholz
📅 Mon, 17 Apr 2023 at 10:50
The nightmare before data science production: you found a working prototype for your problem using a Jupyter notebook, and now it's time to build a production-grade solution from that notebook. Unfortunately, your notebook looks anything but production grade. The good news is, there's finally a cure! The open-source Python package LineaPy aims to automate data science workflow generation and expedite the process of going from data science development to production. And truly, it transforms messy notebooks into data pipelines for frameworks like Apache Airflow, DVC, Argo, Kubeflow, and many more. And if you can't find your favorite orchestration framework, you are welcome to work with the creators of LineaPy to contribute a plugin for it! In this talk, you will learn the basic concepts of LineaPy and how it supports your everyday tasks as a data practitioner. For this purpose, we will transform a notebook step by step together to create a DVC pipeline. Finally, we will discuss what place LineaPy will take in the MLOps universe. Will you only have to check in your notebook in the future?
The nightmare before data science production: you found a working prototype for your problem using a Jupyter notebook, and now it's time to build a production-grade solution from that notebook. Unfortunately, your notebook looks anything but production grade. You embark on a time-consuming journey of refactoring the notebook. You come across irrelevant and relevant code snippets that are scattered across different cells, but you persevere. Midway through your journey, you realize that your refactoring is not immune to the reproducibility issues caused by deleted cells and out-of-order cell executions. And we haven't even talked about the creation of a pipeline from that notebook yet! A desperate situation indeed. The good news is, there's finally a cure! The open-source Python package LineaPy aims to automate data science workflow generation and expedite the process of going from data science development to production. And truly, it transforms messy notebooks into data pipelines for frameworks like Apache Airflow, DVC, Argo, Kubeflow, and many more. And if you can't find your favorite orchestration framework, you are welcome to work with the creators of LineaPy to contribute a plugin for it! In this talk, you will learn the basic concepts of LineaPy and how it supports your everyday tasks as a data practitioner. For this purpose, we will transform a notebook step by step together to create a DVC pipeline. Finally, we will discuss what place LineaPy will take in the MLOps universe. Will you only have to check in your notebook in the future?
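The core API is small; here is a sketch of the typical flow, following LineaPy's documented save/to_pipeline pattern (artifact names, file names, and arguments are illustrative assumptions, not from the talk):

```python
import lineapy
import pandas as pd

# Inside the notebook: do your usual work ...
df = pd.read_csv("data.csv")     # placeholder dataset
summary = df.describe()          # stand-in for real feature engineering / training

# ... then mark the values you care about as artifacts.
lineapy.save(df, "cleaned_data")
lineapy.save(summary, "feature_summary")

# LineaPy slices the notebook down to the code these artifacts need
# and emits a runnable pipeline for the chosen framework.
lineapy.to_pipeline(
    artifacts=["cleaned_data", "feature_summary"],
    framework="DVC",
    pipeline_name="demo_pipeline",
    output_dir="./pipeline",
)
```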
🎤
Large Scale Feature Engineering and Data Science with Python & Snowflake
Speakers:
👤
Michael Gorkow
📅 Mon, 17 Apr 2023 at 11:40
[Snowflake](https://www.snowflake.com/en/) as a data platform is the core data repository of many large organizations. With the introduction of Snowflake's [Snowpark for Python](https://github.com/snowflakedb/snowpark-python), Python developers can now collaborate and build on one platform with a secure Python sandbox that provides dynamic scalability and elasticity as well as security and compliance. In this talk I'll explain the core concepts of Snowpark for Python and how they can be used for large-scale feature engineering and data science.
This talk is for technical people who would like a deep dive into how Snowflake enables large-scale feature engineering and data science via Snowpark for Python. During this talk we'll explore Snowflake's Python capabilities using a simple machine learning use case. After this talk you will:
* know how Snowpark avoids data movement and keeps existing security & governance intact,
* understand the concept of the Snowpark DataFrame API and how it enables accelerated performance compared to standard pandas DataFrames,
* know how to distribute hyperparameter tuning and the training of multiple models,
* understand the concept of vectorized User-Defined Functions and how they can be used to perform large-scale model inference.
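For a feel of the DataFrame API (connection parameters are placeholders; the table and column names are invented for illustration), a pushed-down aggregation might look like this:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

# Placeholder credentials; in practice these come from a secure config.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# The query is built lazily and executed inside Snowflake,
# so the raw data never leaves the platform.
features = (
    session.table("ORDERS")                      # hypothetical table
    .filter(col("AMOUNT") > 0)
    .group_by("CUSTOMER_ID")
    .agg(avg("AMOUNT").alias("AVG_AMOUNT"))
)
df = features.to_pandas()  # only the aggregated result is pulled locally
```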
🎤
AutoGluon: AutoML for Tabular, Multimodal and Time Series Data
Speakers:
👤
Caner Turkmen
👤
Oleksandr Shchur
📅 Mon, 17 Apr 2023 at 11:40
AutoML, or automated machine learning, offers the promise of transforming raw data into accurate predictions with minimal human intervention, expertise, and manual experimentation. In this talk, we will introduce AutoGluon, a cutting-edge toolkit that enables AutoML for tabular, multimodal and time series data. AutoGluon emphasizes usability, enabling a wide variety of tasks from regression to time series forecasting and image classification through a unified and intuitive API. We will specifically focus on tabular and time series tasks, where AutoGluon is the current state of the art, and demonstrate how AutoGluon can be used to achieve competitive performance on tabular and time series competition data sets. We will also discuss the techniques used to automatically build and train these models, peeking under the hood of AutoGluon.
[AutoGluon](http://auto.gluon.ai) is a Python machine learning library which offers cutting-edge accuracy and value-for-compute on a wide variety of tasks. These tasks include regression, classification and quantile regression on tabular data, as well as multimodal tasks such as image classification, image-to-text and text-to-text similarity. A recent addition to AutoGluon is AutoGluon-TimeSeries, the library's module for time series forecasting tasks. AutoGluon is organized into modules for tabular, multimodal and time series tasks, all of which share an intuitive scikit-learn-like API for fitting and performing inference with cutting-edge machine learning in as little as three lines of code, without requiring an in-depth understanding of ML. AutoGluon is widely considered the state of the art in tabular tasks, as confirmed by the independent [AutoML Benchmark](https://openml.github.io/automlbenchmark/papers.html), and is the current top performer on multimodal tasks on the RAFT leaderboard. In this talk, we will focus on the tabular and time series modules and showcase how the library can be used to get competitive results on competition platforms such as Kaggle. AutoGluon also differs quite significantly under the hood from other AutoML frameworks. The library does not take AutoML to primarily mean hyperparameter optimization, but leans heavily into building (stack) ensembles of strong but varied learning algorithms to achieve superior results. We will also showcase some of the theory and building blocks of AutoGluon, describing how we built an AutoML system that takes model ensembling as a central element.
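The "three lines of code" claim maps to an API like this (the dataset path and label column are placeholders):

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Placeholder CSV files with a "label" target column.
train = TabularDataset("train.csv")
predictor = TabularPredictor(label="label").fit(train)
predictions = predictor.predict(TabularDataset("test.csv"))
```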
🎤
Incorporating GPT-3 into practical NLP workflows
Speakers:
👤
Ines Montani
📅 Mon, 17 Apr 2023 at 11:40
In this talk, I'll show how large language models such as GPT-3 complement rather than replace existing machine learning workflows. Initial annotations are gathered from the OpenAI API via zero- or few-shot learning, and then corrected by a human decision maker using an annotation tool. The resulting annotations can then be used to train and evaluate models as normal. This process results in higher accuracy than can be achieved from the OpenAI API alone, with the added benefit that you'll own and control the model for runtime.
Software engineering is all about getting computers to do what we want them to do. As machine learning methods have improved, they've introduced a new way to specify the desired behaviour. Instead of writing code, you can prepare example data. Large language models are now starting to introduce a third option: instead of example data, you can provide a natural language prompt. Writing a prompt is far quicker than building a good set of training examples, but it's also a much less precise way to get the behaviour you want. There's also no reliable way to incrementally improve the results, even if better performance would be very valuable to you. Essentially, this new approach has a high floor, but a low ceiling. In this talk, I'll show how large language models such as GPT-3 complement rather than replace existing machine learning workflows. Initial annotations are gathered from the OpenAI API via zero- or few-shot learning, and then corrected by a human decision maker using the Prodigy annotation tool. The resulting annotations can then be used to train and evaluate models as normal. This process results in higher accuracy than can be achieved from the OpenAI API alone, with the added benefit that you'll own and control the model for runtime.
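A bootstrap step of this kind could look roughly like the following, using the OpenAI completions API as it existed in early 2023 (the prompt, labels, and model choice are illustrative, not the talk's actual recipe):

```python
import openai  # openai<1.0 style API, current at the time of the talk

openai.api_key = "sk-..."  # placeholder

LABELS = ["positive", "negative", "neutral"]

def zero_shot_label(text: str) -> str:
    """Ask the model for a draft label; a human corrects these later."""
    prompt = (
        f"Classify the following review as one of: {', '.join(LABELS)}.\n"
        f"Review: {text}\nLabel:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=3,
        temperature=0,
    )
    return response["choices"][0]["text"].strip().lower()

draft = zero_shot_label("The battery died after two days.")  # e.g. "negative"
```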
🎤
An unbiased evaluation of environment management and packaging tools
Speakers:
👤
Anna-Lena Popkes
📅 Mon, 17 Apr 2023 at 11:40
Python packaging is quickly evolving and new tools pop up on a regular basis. Lots of talks and posts on packaging exist but none of them give a structured, unbiased overview of the available tools. This talk will shed light on the jungle of packaging and environment management tools, comparing them on a basis of predefined features.
Python packaging is quickly evolving and new tools pop up on a regular basis. Lots of talks and posts on packaging exist but none of them give a structured, unbiased overview of the available tools. This talk will shed light on the jungle of packaging and environment management tools, comparing them on a basis of predefined features. We will categorize tools using the following categories:
- Python version management
- Environment management
- Package management
- Package building
- Package publishing
A lot of tools exist, including pyenv, pip, venv, poetry, hatch, and many more. We will categorize all of them and discuss some in more detail, e.g. hatch. Most importantly, we will evaluate the tools on the basis of features that are important for developers, like:
- Does the tool manage dependencies?
- Can it manage Python installations?
- Does it have a clean build/publish flow?
- Does it allow for plugins?
- Does it support important PEPs, e.g. PEP 660, PEP 621, PEP 582?
## Audience
This talk is intended for developers who
- Have used packaging and want to get to know new tools
- Want to have an overview of existing tools and their capabilities
## Existing talks on the topic of packaging
- PyCon US 2021: Jeremy Paige / Packaging Python in 2021
- PyCon US 2021 Tutorial: Bernát Gabor / Python Packaging Demystified
- EuroPython 2022: Packaging in Python in 2022
🎤
Hyperparameter optimization for the impatient
Speakers:
👤
Martin Wistuba
📅 Mon, 17 Apr 2023 at 11:40
In the last years, Hyperparameter Optimization (HPO) became a fundamental step in the training of Machine Learning (ML) models and in the creation of automatic ML pipelines. Unfortunately, while HPO improves the predictive performance of the final model, it comes with a significant cost both in terms of computational resources and waiting time. This leads many practitioners to try to lower the cost of HPO by employing unreliable heuristics. In this talk we will provide simple and practical algorithms for users who want to train models with almost-optimal predictive performance, while incurring a significantly lower cost and waiting time. The presented algorithms are agnostic to the application and the model being trained, so they can be useful in a wide range of scenarios. We provide results from an extensive experimental activity on public benchmarks, including comparisons with well-known techniques like Bayesian Optimization (BO), ASHA, and Successive Halving. We will describe in which scenarios the biggest gains are observed (up to 30x) and provide examples of how to use these algorithms in a real-world environment. All the code used for this talk is available on [GitHub](https://github.com/awslabs/syne-tune).
In this talk we will present simple and practical solutions to perform HPO quickly, with results on par with well-known (and costly) techniques. Our claims are supported by empirical evidence obtained on public standardized benchmarks, and our work has been accepted at peer-reviewed workshops (and is currently under submission to a conference). Specifically, [1] has been accepted at the AutoML Conference Workshop Track and [2] has been accepted at the AutoML workshop at ICML 2021. All the code regarding the algorithms is available in the Syne Tune package under the Apache 2.0 license (https://github.com/awslabs/syne-tune). References: [1] https://arxiv.org/abs/2207.06940 [2] https://arxiv.org/abs/2103.16111
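To give a feel for Syne Tune's API (the training script, search space, and scheduler choice here are illustrative placeholders following the package's documented examples, not the talk's recommended configuration):

```python
from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend import LocalBackend
from syne_tune.config_space import loguniform
from syne_tune.optimizer.baselines import ASHA

# Hypothetical training script that reports a validation metric per epoch.
config_space = {"lr": loguniform(1e-5, 1e-1), "epochs": 20}

tuner = Tuner(
    trial_backend=LocalBackend(entry_point="train.py"),
    scheduler=ASHA(
        config_space,
        metric="val_loss",
        resource_attr="epoch",
        max_resource_attr="epochs",
        mode="min",
    ),
    stop_criterion=StoppingCriterion(max_wallclock_time=600),
    n_workers=4,  # trials evaluated in parallel
)
tuner.run()
```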
🎤
Keynote - A journey through 4 industries with Python: Python's versatile problem-solving toolkit
Speakers:
👤
Susan Shu Chang
📅 Mon, 17 Apr 2023 at 13:55
In this keynote, I will share the lessons learned from using Python in 4 industries. Apart from machine learning applications that I build in my day to day as a data scientist and machine learning engineer, I also use Python to develop games for my own gaming company, Quill Game Studios. There is a lot of versatility in Python, and it's been my pleasure to use it to solve many interesting problems. I hope that this talk can give inspiration to various types of applications in your own industry as well.
🎤
Common issues with Time Series data and how to solve them
Speakers:
👤
Vadim Nelidov
📅 Mon, 17 Apr 2023 at 15:10
Time-series data is all around us: from logistics to digital marketing, from pricing to stock markets. It’s hard to imagine a modern business that has no time series data to forecast. However, mastering such forecasting is not an easy task. For this talk, together with other domain experts, I have collected a list of common time series issues that data professionals commonly run into. After this talk, you will learn to identify, understand, and resolve such issues. This will include stabilising divergent time series, organising delayed / irregular data, handling missing values without anomaly propagation, and reducing the impact of noise and outliers on your forecasting models.
This talk will walk you through 4 common issues with time series and illustrate them in the context of energy demand forecasting. For each of these issues you will learn to identify, understand, and resolve them better. These issues are: time series instability, delayed and irregular time series data, hard-to-impute missing values, and the impact of noise and outliers on forecasting models. The talk is therefore split into 4 parts, each with some room for questions. Each part will provide some high-level background, explanations, examples and code snippets, while avoiding unnecessary in-depth computations and formulas. The whole talk is therefore accessible both to specialists with experience in time series analytics and to those without such experience who nonetheless intend to broaden their understanding of this field and gain some valuable insights for the business problems they are likely to encounter in the future. Data scientists and analysts who work with time series data, understand at least the basics of the pandas and scikit-learn libraries, and know what a time series forecasting problem entails would benefit the most from this talk. However, other less technical specialists (management, product owners etc.) can still gain valuable domain knowledge in this field.
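As one illustrative example of the second and third issues (toy data; the talk's own examples use energy demand), regularizing an irregular series and imputing gaps without letting an outlier propagate might look like this:

```python
import pandas as pd

# Irregular, delayed measurements with a gap and one obvious outlier.
ts = pd.Series(
    [10.2, 10.4, 500.0, 10.1, 9.8],
    index=pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 00:07", "2023-01-01 00:19",
        "2023-01-01 01:02", "2023-01-01 01:11",
    ]),
)

# Clip outliers against a rolling median before imputing,
# so the spike does not leak into interpolated values.
med = ts.rolling(3, center=True, min_periods=1).median()
clean = ts.where((ts - med).abs() < 5 * med.abs(), med)

# Resample to a regular 15-minute grid and interpolate short gaps only.
regular = clean.resample("15min").mean().interpolate(limit=2)
print(regular)
```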
🎤
How to baseline in NLP and where to go from there
Speakers:
👤
Tobias Sterbak
📅 Mon, 17 Apr 2023 at 15:10
In this talk, we will explore the build-measure-learn paradigm and the role of baselines in natural language processing (NLP). We will cover the common NLP tasks of classification, clustering, search, and named entity recognition, and describe the baseline approaches that can be used for each task. We will also discuss how to move beyond these baselines through weak learning and transfer learning. By the end of this talk, attendees will have a better understanding of how to establish and improve upon baselines in NLP.
In this talk, we will explore the role of baselines in natural language processing (NLP) and discuss how to move beyond these baselines through weak learning and transfer learning. First, I will introduce the build-measure-learn paradigm, which is a framework for developing and improving products or systems. This paradigm involves building a solution, measuring its performance, and learning from the results to iteratively improve the solution. Baselines are an essential part of this process because they provide a starting point for comparison and a benchmark to measure against. Next, I will delve into the common NLP tasks of classification, clustering, search, and named entity recognition (NER). For each task, I will describe the baseline approaches that can be used. These baselines may not be the most advanced or sophisticated solutions, but they are often quick and easy to implement, and they can serve as a useful reference and guidance for further improvement. Finally, I will discuss how to move on from these baselines. One option is to use insights from the baselines to build a weak learning system, which is a machine learning model that relies on human-generated rules or patterns rather than a large dataset. Another option is to leverage transfer learning, which involves adapting a pre-trained model to a new task or domain by fine-tuning its parameters on a smaller dataset. In conclusion, this talk will provide a practical guide to establishing baselines in NLP and moving beyond them through weak learning and transfer learning.
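For the classification case, the cheapest possible baseline (a toy example, not from the talk) is worth writing down before anything else, since every later model has to beat it:

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["refund please", "great service", "broken on arrival", "works fine"]
labels = ["negative", "positive", "negative", "positive"]

# Majority-class baseline: the floor any real model must beat.
floor = DummyClassifier(strategy="most_frequent").fit(texts, labels)

# First "real" baseline: bag-of-words features + a linear model.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

print(floor.score(texts, labels), baseline.score(texts, labels))
```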
🎤
Exploring the Power of Cyclic Boosting: A Pure-Python, Explainable, and Efficient ML Method
Speakers:
👤
Felix Wick
📅 Mon, 17 Apr 2023 at 15:10
We have recently open-sourced a pure-Python implementation of Cyclic Boosting, a family of general-purpose, supervised machine learning algorithms. Its predictions are fully explainable on individual sample level, and yet Cyclic Boosting can deliver highly accurate and robust models. For this, it requires little hyperparameter tuning and minimal data pre-processing (including support for missing information and categorical variables of high cardinality), making it an ideal off-the-shelf method for structured, heterogeneous data sets. Furthermore, it is computationally inexpensive and fast, allowing for rapid improvement iterations. The modeling process, especially the infamous but unavoidable feature engineering, is facilitated by automatic creation of an extensive set of visualizations for data dependencies and training results. In this presentation, we will provide an overview of the inner workings of Cyclic Boosting, along with a few sample use cases, and demonstrate the usage of the new Python library. You can find Cyclic Boosting on GitHub: https://github.com/Blue-Yonder-OSS/cyclic-boosting
🎤
The CPU in your browser: WebAssembly demystified
Speakers:
👤
Antonio Cuni
📅 Mon, 17 Apr 2023 at 15:10
In recent years we have seen an explosion of usage of Python in the browser: Pyodide, CPython on WASM, PyScript, etc. All of this is possible thanks to the powerful functionality of the underlying platform, WebAssembly, which is essentially a virtual CPU inside the browser.
In recent years we have seen an explosion of usage of Python in the browser: Pyodide, CPython on WASM, PyScript, etc. All of this is possible thanks to the powerful functionality of the underlying platform, WebAssembly. In this talk we will examine what exactly WebAssembly is, what its strong and weak points are, what its limitations are, and what the future will bring us. We will also see why and how WebAssembly is useful and used outside the browser. This talk is targeted at an intermediate/advanced audience: no prior knowledge of WebAssembly is required, but a basic understanding of what a compiler and an interpreter are, and of the concept of bytecode, is required. The introduction will cover the basics to make sure that the talk is understandable also by people who are completely new to the WebAssembly world, but after that we will dive into the low-level technical details, with a special focus on those that are relevant to the Python world, such as WASI vs emscripten, dynamic linking, JIT compilation, interoperability with other languages, etc.
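As a small taste of WebAssembly outside the browser (not from the talk itself), the wasmtime Python package can compile and run a hand-written module; the WAT text below defines a single exported add function:

```python
from wasmtime import Engine, Instance, Module, Store  # pip install wasmtime

engine = Engine()
store = Store(engine)

# A minimal WebAssembly module in text format: one exported i32 add function.
module = Module(engine, """
(module
  (func (export "add") (param i32 i32) (result i32)
    local.get 0
    local.get 1
    i32.add))
""")

instance = Instance(store, module, [])
add = instance.exports(store)["add"]
print(add(store, 2, 3))  # 5, executed by the embedded WebAssembly runtime
```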
🎤
Staying Alert: How to Implement Continuous Testing for Machine Learning Models
Speakers:
👤
Emeli Dral
📅 Mon, 17 Apr 2023 at 15:10
Proper monitoring of machine learning models in production is essential to avoid performance issues. Setting up monitoring can be easy for a single model, but it often becomes challenging at scale or when you face alert fatigue based on many metrics and dashboards. In this talk, I will introduce the concept of test-based ML monitoring. I will explore how to prioritize metrics based on risks and model use cases, integrate checks in the prediction pipeline and standardize them across similar models and model lifecycle. I will also take an in-depth look at batch model monitoring architecture and the use of open-source tools for setup and analysis.
Have you ever deployed a machine learning model in production only to realize that it wasn't performing as well as you thought it would, or been late to detect a model performance drop caused by corrupted data? Proper monitoring can help avoid this. Typically, this involves checking the quality of the input data, monitoring the model's responses, and detecting any changes that might lead to model quality drops. However, setting up monitoring is often easier said than done. First, while it is easy to write a few assertions for data quality checks or track accuracy for a single model you created, it is much more challenging to do so consistently and at scale as the number of models, pipelines, and the volume of data increases. Second, building monitoring dashboards to track many metrics often leads to alert fatigue and does not help with root cause analysis of the problem. In this talk, I will introduce the idea of test-based ML monitoring and how it can help you keep your models in check in production. I will cover the following:
- The difference between testing and monitoring, and when one is better than the other
- How to prioritize metrics and tests for each model based on risks and model use cases
- How to integrate checks into the model prediction pipeline and standardize them across similar models and the model lifecycle
- An in-depth look at batch model monitoring architecture, including setup and analysis of results using open-source tools
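As an example of what a test-based check can look like in code, here is a minimal sketch using the open-source Evidently library, one possible tool for this (API as of early 2023; the data files and thresholds are placeholders):

```python
import pandas as pd
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset
from evidently.tests import TestShareOfMissingValues

# Placeholder data: a reference window and the current production batch.
reference = pd.read_csv("reference.csv")
current = pd.read_csv("current_batch.csv")

suite = TestSuite(tests=[
    TestShareOfMissingValues(lte=0.05),  # fail if >5% of values are missing
    DataDriftTestPreset(),               # per-column drift checks
])
suite.run(reference_data=reference, current_data=current)

# Pass/fail semantics integrate naturally into a prediction pipeline
# (result structure assumed from the library's docs at the time).
if not suite.as_dict()["summary"]["all_passed"]:
    raise RuntimeError("monitoring checks failed for this batch")
```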
🎤
Practical Session: Learning on Heterogeneous Graphs with PyG
Speakers:
👤
Ramona Bendias
👤
Matthias Fey
📅 Mon, 17 Apr 2023 at 15:10
Learn how to build and analyze heterogeneous graphs using PyG, a graph machine learning library in Python. This workshop will provide a practical introduction to the concept of heterogeneous graphs and their applications, including their ability to capture the complexity and diversity of real-world systems. Participants will gain experience in creating a heterogeneous graph from multiple data tables, preparing a dataset, and implementing and training a model using PyG.
Heterogeneous graphs are powerful tools for representing and analyzing complex systems. They are able to capture the complexity and diversity of data, provide more accurate and relevant insights, integrate multiple data sources, and support the development of sophisticated graph algorithms. In this workshop, we will use PyG, a graph machine learning library in Python, to build and analyze heterogeneous graphs. We will start with a discussion of the concept of heterogeneous graphs and their applications, and then move on to a practical session. Participants will learn how to create a heterogeneous graph from multiple data tables and use PyG to implement and train a model. By the end of the workshop, participants will have a solid understanding of the benefits and capabilities of heterogeneous graphs, as well as practical skills for building and analyzing them with PyG.
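A minimal sketch of the data structure involved (the node/edge types and feature sizes are invented for illustration):

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Two node types, each with its own feature matrix.
data["author"].x = torch.randn(100, 16)   # 100 authors, 16 features each
data["paper"].x = torch.randn(500, 32)    # 500 papers, 32 features each

# A typed edge: which author wrote which paper (COO edge index).
data["author", "writes", "paper"].edge_index = torch.tensor([
    [0, 0, 1, 2],   # author ids
    [0, 1, 1, 3],   # paper ids
])

print(data.metadata())  # node and edge types, as consumed by to_hetero()
```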
🎤
Raised by Pandas, striving for more: An opinionated introduction to Polars
Speakers:
👤
Nico Kreiling
📅 Mon, 17 Apr 2023 at 15:10
Pandas is the de-facto standard for data manipulation in Python, which I personally love for its flexible syntax and interoperability. But Pandas has well-known drawbacks such as memory inefficiency, inconsistent missing-data handling and lacking multicore support. Multiple open-source projects aim to solve these issues; the most interesting is Polars. Polars uses Rust and Apache Arrow to win all kinds of performance benchmarks and evolves fast. But is it already stable enough to migrate an existing Pandas codebase? And does it meet the high expectations of long-time Pandas lovers regarding query-language flexibility? In this talk, I will explain how Polars can be that fast, and present my insights on where Polars shines and in which scenarios I stay with Pandas (at least for now!)
Pandas and Polars are both popular open-source libraries for data manipulation and analysis in Python. While both libraries offer a range of powerful tools for working with data, there are several key differences that users should be aware of when choosing which library to use.

One of the main differences between Pandas and Polars is the way that they handle data processing and evaluation. Pandas uses a traditional, eager evaluation model, in which operations are immediately evaluated and the results are returned. In contrast, Polars offers optional lazy evaluation, which allows users to delay the evaluation of certain operations until they are actually needed. This can be especially useful for large or complex datasets, as it can improve performance by reducing the amount of data that needs to be processed at any given time.

Another key difference between the two libraries is the way they handle data storage and indexing. Pandas is built around a powerful indexing system that allows users to quickly access and manipulate specific rows or columns of data. However, this indexing system can be complex and can sometimes lead to slower performance. In contrast, Polars does not use indexes, which can simplify the underlying data structure and improve performance.

In terms of functionality, Pandas has a number of features that are not currently available in Polars. For example, Pandas offers built-in plotting functionality, which can be useful during exploratory data analysis for visualizing and interpreting data. Additionally, Pandas has a much stronger integration in the PyData ecosystem and is more widely used in data analysis and scientific computing. This can make it easier for users to find resources and support when working with Pandas.

One notable difference between the two libraries is the syntax and API. Polars is inspired by the popular distributed computing library Apache Spark, but uses a column-based API in contrast to the row-based API within Spark. Generally, the Polars syntax will be more familiar to Spark users.

Overall, both Pandas and Polars are powerful libraries with a lot to offer for data manipulation and analysis in Python. Which library is the best choice will depend on the specific needs and goals of the user. By understanding the differences between the two libraries, users can make an informed decision about which one is best suited for their needs.
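To make the eager-vs-lazy contrast concrete, here is a small sketch (file and column names invented; the groupby method naming has shifted between Polars versions):

```python
import polars as pl

# Lazy mode: scan_csv only reads metadata; the query below is planned,
# optimized (e.g. predicate pushdown), and executed only at .collect().
lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 0)
    .groupby("region")   # renamed to `group_by` in newer Polars versions
    .agg(pl.col("amount").mean().alias("avg_amount"))
)
df = lazy.collect()      # execution happens here, in parallel across cores
```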
🎤
A concrete guide to time-series databases with Python
Speakers:
👤
Heiner Tholen
👤
Ellen König
📅 Mon, 17 Apr 2023 at 15:45
We evaluated time-series databases and complementary services to stream-process sensor data. In this talk, we will present our evaluation, show the final implementation, and share the Python tools we built and the lessons learned along the way.
Understanding time-series data is essential to handle automatically generated data, be it from server logs, IoT devices or any other continuous measurement. In order to handle the large amounts of incoming data from concrete mixing trucks, we evaluated a number of time-series databases as well as services to stream-process the data. For all of those decisions a key question was, of course, how well any of these tools integrate with our existing, all-Python backend. The right angle on time-series data will help you move tons of data with little engineering effort. In this talk, you’ll learn from our practical experiences of choosing and implementing a time-series database in a Python context. You’ll go away with a better understanding of how you can efficiently store, analyse and exploit streaming data.
🎤
Have your cake and eat it too: Rapid model development and stable, high-performance deployments
Speakers:
👤
Christian Bourjau
👤
Jakub Bachurski
📅 Mon, 17 Apr 2023 at 15:45
At the boundary of model development and MLOps lies the balance between the speed of deploying new models and ensuring operational constraints. These include factors like low-latency prediction, the absence of vulnerabilities in dependencies, and the need for model behavior to stay reproducible for years. The longer the list of constraints, the longer it usually takes to take a model from its development environment into production. In this talk, we present how we managed to square the circle and have both rapid, highly dynamic model development and a stable, high-performance deployment.
At QuantCo, we ship sklearn-based models in a real-time service that guarantees 24/7 uptime with low-latency (ms) responses. Simultaneously, we adhere to strict regulatory and security policies, where every model must remain available for 3-5 years while its dependencies are kept up-to-date. As the basis, we use ONNX as a technology to transform our dynamic Python pipelines into static, low-overhead model definitions. To ensure the cost of the model transformation does not slow down our data scientists, we have developed an open-source library named Spox to streamline these operations as much as possible. Combined with an apt model serving infrastructure, we can satisfy the needs of our data scientists (fast development and deployment) and those of corporate IT (vulnerability-free, year-long stability) without compromising efficiency.
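Spox exposes ONNX operators as ordinary Python functions; a minimal sketch following the pattern in the Spox documentation (the shapes and names are illustrative):

```python
import numpy as np
import spox.opset.ai.onnx.v17 as op
from spox import Tensor, argument, build

# Declare a typed graph input: a float32 vector of dynamic length N.
x = argument(Tensor(np.float32, ("N",)))

# Compose ONNX operators like normal Python calls.
y = op.mul(op.add(x, x), x)  # (x + x) * x

# Build a static, self-contained ONNX model from named inputs and outputs.
model = build(inputs={"x": x}, outputs={"y": y})
```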
🎤
Performing Root Cause Analysis with DoWhy, a Causal Machine-Learning Library
Speakers:
👤
Patrick Blöbaum
📅 Mon, 17 Apr 2023 at 15:45
In this talk, we will introduce the audience to [DoWhy](https://www.pywhy.org/dowhy), a library for causal machine-learning (ML). We will introduce typical problems where causal ML can be applied and will specifically do a deep dive on root cause analysis using DoWhy. To do this, we will lay out what typical problem spaces for causal ML look like, what kind of problems we're trying to solve, and then show how to use DoWhy's API to solve these problems. Expect to see a lot of code with a hands-on example. We will close this session by zooming out a bit and also talk about the PyWhy organization governing DoWhy.
_"Much like machine learning libraries have done for prediction, DoWhy is a Python library that aims to spark causal thinking and analysis. DoWhy provides a wide variety of algorithms for effect estimation, causal structure learning, diagnosis of causal structures, root cause analysis, interventions and counterfactuals."_ The field of causal machine-learning (ML) is not as well-known as typical machine-learning problems and libraries. DoWhy is one of the more popular open-source libraries for causal ML. And not for nothing: DoWhy is based on the two major scientific frameworks, Potential Outcome and Graphical Causal Models and offers a large variety of features. Problems where causal ML can be applied, come from any imaginable domain, be that distributed computer systems, supply chain, workflow management, manufacturing, etc. As long as a complex system can be represented as a causal graph, one can also apply causal ML. In the talk, we will specifically dive into a microservice architecture, as this is an example which an audience like the one at PyCon can most likely relate to. We will present some data and then inject outliers (or anomalies) into that data, see how those propagate through the system, and then use DoWhy's algorithms to show us the root cause. By the end of the talk, the audience should have a good understanding of typical problem domains for causal ML and a good sense of how to use DoWhy to solve such problems.
🎤
WALD: A Modern & Sustainable Analytics Stack
Speakers:
👤
Florian Wilhelm
📅 Mon, 17 Apr 2023 at 15:45
The name **WALD**-stack stems from the four technologies it is composed of, i.e. a cloud-computing **W**arehouse like Snowflake or Google BigQuery, the open-source data integration engine **A**irbyte, the open-source full-stack BI platform **L**ightdash, and the open-source data transformation tool **D**BT. Using a Formula 1 Grand Prix dataset, I will give an overview of how these four tools complement each other perfectly for analytics tasks in an ELT approach. You will learn the specific uses of each tool as well as their particular features. My talk is based on a full tutorial, which you can find under [waldstack.org](https://waldstack.org).
The current zeitgeist is that the data lake concept from classical data engineering and modern data warehousing from business intelligence are converging more and more. This is also driving the shift from ETL to ELT, and so tools such as [dbt] are becoming increasingly important in combination with modern Big Data warehouses such as [Snowflake] and [Google BigQuery]. For typical data and ML engineers, this is quite a departure from familiar tools like [Spark]. Having a pure Spark and ETL background myself, this trend motivated me to explore the foreign realms of ELT, data warehousing and especially the buzz around [dbt]. In this talk I want to share my key insights with classical data / ML engineers who might have only heard about [Snowflake], [dbt], [Airbyte] and [Lightdash] but have never cared to dig deeper. My talk is structured like this:
* a short introduction to the differences of data lake vs data warehouse, and ETL vs ELT
* a high-level introduction of Snowflake, Airbyte, dbt, and Lightdash
* a demonstration based on the [Kaggle Formula 1 World Championship dataset] to see those four tools in action
* my main take-aways and key insights
After this talk, you will have learned the differences between ETL & ELT, what these four tools do and in which cases you should consider the WALD-stack. Also, you will know how to use Python instead of SQL to define models in dbt, which is a brand-new feature. The WALD-stack is sustainable since it consists mainly of open-source technologies; however, all technologies are also offered as managed cloud services. The data warehouse itself, i.e. [Snowflake] or [Google BigQuery], is the only non-open-source technology in the WALD-stack. In my talk, I will focus on the open-source parts of the WALD-stack.
[dbt]: https://www.getdbt.com/
[Snowflake]: https://www.snowflake.com/
[Lightdash]: https://github.com/lightdash/lightdash
[Airbyte]: https://airbyte.com/
[Google BigQuery]: https://cloud.google.com/bigquery
[Kaggle Formula 1 World Championship dataset]: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020
[Spark]: https://spark.apache.org/
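For the brand-new feature mentioned above, a dbt Python model is just a file in the models/ directory defining a model() function. A sketch (the upstream model, table columns, and transformation are illustrative; on Snowflake, dbt hands the function a Snowpark session and DataFrames):

```python
# models/fastest_laps.py -- a dbt Python model (dbt >= 1.3)
def model(dbt, session):
    # Reference an upstream model, exactly like {{ ref(...) }} in SQL models.
    laps = dbt.ref("stg_lap_times")

    # On Snowflake this is a Snowpark DataFrame; convert to pandas if preferred.
    df = laps.to_pandas()

    # Arbitrary Python transformation, e.g. the fastest lap per race.
    result = df.loc[df.groupby("RACE_ID")["MILLISECONDS"].idxmin()]

    # Whatever is returned is materialized as a table in the warehouse.
    return result
```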
🎤
Polars - make the switch to lightning-fast dataframes
Speakers:
👤
Thomas Bierhance
📅 Mon, 17 Apr 2023 at 15:45
In this talk, we will report on our experiences switching from Pandas to Polars in a real-world ML project. Polars is a new high-performance dataframe library for Python, based on Apache Arrow and written in Rust. We will compare the performance of Polars with the popular Pandas library and show how Polars can provide significant speed improvements for data manipulation and analysis tasks. We will also discuss the unique features of Polars, such as its ability to handle large datasets that do not fit into memory, and how it feels in practice to make the switch from Pandas. This talk is aimed at data scientists, analysts, and anyone interested in fast and efficient data processing in Python.
The pandas library is one of the most widely used tools for working with data in the Python ecosystem. However, pandas can be slow for medium and larger datasets, and many users have been looking for faster alternatives. In this talk, we introduce the new Polars library, a high-performance dataframe library for Python based on Apache Arrow and written in Rust. We will report on our experiences switching from Pandas to Polars in a real-world ML project. We will compare the performance of Polars with Pandas across various use cases and show how Polars can provide significant speed improvements for common data manipulation and analysis tasks. Due to its speed, it can even be an alternative in cases where people normally use distributed systems like Spark. For example, we will demonstrate how Polars can process large datasets with minimal overhead, and how its massive use of parallelization can provide an additional speed boost. We will also discuss how Polars compares to other popular options like DuckDB and cuDF. This talk is aimed at data scientists, analysts, and anyone interested in fast and efficient data processing in Python. Whether you are a pandas user looking for a faster alternative, or a Spark user interested in a simpler alternative, this talk will provide valuable insights and practical examples.
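The larger-than-memory claim maps to Polars' streaming engine; a brief sketch (the file name is invented; the streaming flag reflects the API as of early 2023):

```python
import polars as pl

# Streaming execution processes the file in batches, so the full dataset
# never has to fit into RAM at once.
result = (
    pl.scan_parquet("events_larger_than_ram.parquet")
    .filter(pl.col("status") == "error")
    .groupby("service")   # `group_by` in newer Polars versions
    .agg(pl.count().alias("n_errors"))
    .collect(streaming=True)
)
```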
🎤
Driving down the Memray lane - Profiling your data science work
Speakers:
👤
Cheuk Ting Ho
📅 Mon, 17 Apr 2023 at 15:45
When handling a large amount of data, memory profiling the data science workflow becomes more important. It gives you insight into which processes consume lots of memory. In this talk, we will introduce Memray, a Python memory profiling tool, and its new Jupyter plugin.
In this talk, we will explore what memory profiling is and how it can help with data science work. We will start the talk with a basic explanation of how Python arranges memory for various objects. This lays the foundation for explaining why we need a special tool to memory-profile Python programs. Then we will go through a data science use case where we memory-profile part of the process with the Memray Jupyter plug-in. This will be a use case that a data science practitioner or learner is familiar with, so they can see how memory profiling can be useful. We will then explain how to interpret the flame graph in Memray, a commonly used diagram in memory profiling that shows how much memory a process and its sub-processes use. For a new user this can be hard to understand, leaving them unsure what to look at. From this example, audiences will see what they can learn from the flame graph.
## Goal
This talk is for data scientists, learners or anyone who is interested in memory profiling their Python program. Although the talk will be using a data science use case as an example, the explanation and the tool can be extended to any Python program. However, for data science practitioners and learners who have been using Python to process data, this may be a step forward for them to improve their data workflow and prevent memory leaks in their programs.
## Outline
- Introduction (5 mins)
- Why we need a special tool for memory profiling (5 mins)
- How to use Memray in Jupyter notebook (5 mins)
- Demonstration of using Memray in data science work (5 mins)
- How to interpret a flame graph (5 mins)
- Conclusion (5 mins)
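In a notebook, the plugin boils down to an IPython extension and a cell magic (as documented in Memray's Jupyter integration; the profiled cell body is a toy example):

```python
# Load the Memray extension once per notebook session.
%load_ext memray
```

```python
%%memray_flamegraph
# Everything in this cell is profiled; Memray renders a flame graph inline.
data = [bytes(1_000_000) for _ in range(100)]  # allocate roughly 100 MB
del data
```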
🎤
Specifying behavior with Protocols, Typeclasses or Traits. Who wears it better (Python, Scala 3, Rust)?
Speakers:
👤
Kolja Maier
📅 Mon, 17 Apr 2023 at 16:20
In this talk, we will explore the use of Python's `typing.Protocol`, Scala's Typeclasses, and Rust's Traits. They all offer a very powerful & elegant mechanism for abstracting over various concepts (such as Serialization) in a modular manner. We will compare and contrast the syntax and implementation of these constructs in each language and discuss their strengths and weaknesses. We will also look at real-world examples of how these features are used in each language to specify behavior, and consider differences in terms of type system expressiveness and effectiveness. By the end of the talk, attendees will have a better understanding of the differences and similarities between these three language features, and will be able to make informed decisions about which one is best suited for their needs.
Within simple applications, abstractions are only needed to a certain degree. E.g., why would someone need a complex class hierarchy if the task at hand could be solved more pragmatically? However, as applications and the business get more complex, abstractions can become crucial for improving the quality and maintainability of your code. With `typing.Protocol`, a great Python language feature was introduced which allows abstraction and modularization while also having static typing. This allows for very robust software development. How do other languages solve that problem? Besides `typing.Protocol`, we'll also dive into the world of Scala Typeclasses and Rust Traits, and explore how these features are used in each language to ensure the correctness and safety of code. All these mechanisms have in common that they specify behavior for types in a very flexible and safe manner.
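In Python, the mechanism looks like this: a `typing.Protocol` specifies required behavior structurally, and a static type checker verifies conformance without any inheritance (serialization chosen as the example, echoing the abstract):

```python
from typing import Protocol

class Serializable(Protocol):
    """Anything with a matching serialize() method satisfies this protocol."""
    def serialize(self) -> bytes: ...

class User:
    def __init__(self, name: str) -> None:
        self.name = name

    # No inheritance from Serializable needed: matching the shape is enough.
    def serialize(self) -> bytes:
        return self.name.encode("utf-8")

def store(obj: Serializable) -> None:
    payload = obj.serialize()
    print(f"storing {len(payload)} bytes")

store(User("Ada"))  # type-checks statically and runs fine
```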
🎤
FastAPI and Celery: Building Reliable Web Applications with TDD
Speakers:
👤
Avanindra Kumar Pandeya
📅 Mon, 17 Apr 2023 at 16:20
In this talk, we will explore how to use the FastAPI web framework and the Celery task queue to build reliable and scalable web applications in a test-driven manner. We will start by setting up a testing environment and writing unit tests for the core functionality of our application. Next, we will use FastAPI to create an API that performs a long-running task. Finally, we will see how Celery can help us offload long-running tasks and improve the performance of our application. By the end of this talk, attendees will have a strong understanding of TDD and how to apply it to FastAPI and Celery projects, and will be able to write tests that ensure the reliability and maintainability of their code.
1. Introduction (1 min)
   - Title of the talk and speaker's name: This section introduces the title of the talk, the speaker's name, and the speaker's current role.
   - Overview of the topics covered in the talk: This section introduces the main themes and goals of the talk, and gives the audience a sense of what they can expect to learn.
2. What is Test-Driven Development (TDD)? (2 min)
   - Definition of TDD and how it fits into the software development process: This section defines TDD and explains how it fits into the software development process. It will highlight the benefits of TDD such as improved quality, reduced debugging time, and faster development.
3. Setting up a dockerized Development Environment for a Math API (5 min)
   - Installing the necessary tools and libraries with Docker: This section covers the steps to install the necessary tools and libraries for testing, such as FastAPI, Celery, and a testing framework.
   - Setting up a testing database with Docker: This subsection explains how to set up a testing database (PostgreSQL) using Docker, including pulling the Docker image, running the container, and configuring the connection.
   - Configuring the application to use the testing database: This subsection covers the steps to configure the application to use the testing database during testing, for example via environment variables or config files that switch between databases.
   - Writing a basic test case: This subsection provides an example of a basic test case that verifies the setup of the testing environment, including a demonstration of running the test and checking the results.
4. Writing Unit Tests (7 min)
   - Identifying the core functionality and behavior of the application: This section discusses how to identify the core functionality and behavior of the application, and how to break it down into smaller pieces that can be tested separately. It includes tips on how to prioritize the tests and focus on the most important or risky areas of the code.
   - Writing test cases to cover the different scenarios and edge cases: This subsection covers the steps to write test cases for the core functionality of the application, with examples of different types of tests, such as positive, negative, and boundary tests.
   - Using mocks and fixtures to isolate the tests: This subsection explains how to use mocks and fixtures to isolate the tests from external dependencies and control the input and output, with examples of how to test different parts of the application in isolation.
5. Building the API with FastAPI and Celery (8 min)
   - Setting up a FastAPI application: This section introduces FastAPI and explains its key features and benefits, including a demonstration of how to build a simple API using TDD.
   - Setting up a Celery worker and task queue: This subsection explains how to set up a Celery worker and task queue, and how to configure the application to use them, including how to install Celery, create a Celery instance, and define the queue and backend.
   - Defining tasks as functions and decorating them with Celery's @task decorator: This subsection covers the steps to define tasks as functions and decorate them with Celery's @task decorator, with examples of how to define tasks and pass arguments and options to them.
   - Using the Celery client to trigger tasks and receive the results: This subsection explains how to use the Celery client to trigger tasks and receive the results, including how to send tasks, wait for the results, and handle errors and exceptions. (A minimal sketch of this pattern follows the outline below.)
6. Conclusion and Next Steps (2 min)
   - Recap of the main points and takeaways from the talk, highlighting the key skills and knowledge the attendees have learned.
   - Suggestions for further learning and resources, such as tutorials and documentation for TDD and FastAPI/Celery development.
   - Encouragement for attendees to apply these techniques to their own projects and to share their experiences and feedback with the community.
7. Question/Answer (5 min)
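To make the core pattern concrete, here is a minimal sketch of the FastAPI/Celery split (module layout, the Redis URL, and the `add` task are illustrative assumptions, not the talk's actual code):

```python
# tasks.py -- a Celery app with one long-running task.
from celery import Celery

celery_app = Celery("tasks", broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")

@celery_app.task
def add(x: int, y: int) -> int:
    return x + y


# main.py -- a FastAPI app that offloads work to the Celery worker.
from celery.result import AsyncResult
from fastapi import FastAPI

app = FastAPI()

@app.post("/add")
def submit_add(x: int, y: int) -> dict:
    result = add.delay(x, y)  # enqueue instead of blocking the request
    return {"task_id": result.id}

@app.get("/result/{task_id}")
def get_result(task_id: str) -> dict:
    result = AsyncResult(task_id, app=celery_app)
    return {"ready": result.ready(),
            "value": result.result if result.ready() else None}
```

A pytest unit test can run the task eagerly with `add.apply(args=(2, 3))`, or exercise the endpoints through FastAPI's `TestClient`.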
🎤
How to build observability into a ML Platform
Speakers:
👤
Alicia Bargar
📅 Mon, 17 Apr 2023 at 16:20
show details
As machine learning becomes more prevalent across nearly every business and industry, making sure that these technologies are working and delivering quality is critical. In her talk, Alicia will discuss the importance of machine learning observability and why it should be a fundamental tool of modern machine learning architectures. Not only does it ensure models are accurate, but it helps teams iterate and improve models quicker. Alicia will dive into how Shopify has been prototyping building observability into different parts of its machine learning platform. This talk will provide insights on how to track model performance, how to catch any unexpected or erroneous behaviour, what types of behavior to look for in your data (e.g. drift, quality metrics) and in your model/predictions, and how observability could work with large language models and Chat AIs.
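To make "drift" concrete, here is a small sketch of one commonly used signal, the population stability index (PSI), comparing a training distribution against production data (the binning choices and thresholds are illustrative conventions, not Shopify's implementation):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population stability index between two samples of one feature.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    e_pct = np.clip(e_counts / len(expected), 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(1)
train = rng.normal(0, 1, 10_000)
prod = rng.normal(0.5, 1, 10_000)  # production data shifted by half a sigma
print(psi(train, prod))            # moderate-to-major drift
```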
🎤
BHAD: Explainable unsupervised anomaly detection using Bayesian histograms
Speakers:
👤
Alexander Vosseler
📅 Mon, 17 Apr 2023 at 16:20
show details
The detection of outliers or anomalous data patterns is one of the most prominent machine learning use cases in industrial applications. I present a Bayesian histogram anomaly detector (BHAD), where the number of bins is treated as an additional unknown model parameter with an assigned prior distribution. BHAD scales linearly with the sample size and enables a straightforward explanation of individual scores, which makes it very suitable for industrial applications when model interpretability is crucial. I study the predictive performance of the proposed BHAD algorithm against various state-of-the-art anomaly detection approaches using simulated data and popular benchmark datasets for outlier detection. The reported results indicate that BHAD has very competitive predictive accuracy compared to other more complex and computationally more expensive algorithms, while being explainable and fast.
I present an unsupervised and explainable Bayesian anomaly detection algorithm. For this I consider the posterior predictive distribution of a Categorical-Dirichlet model and use it to construct a Bayesian histogram-based anomaly detector (BHAD). BHAD scales linearly with the size of the data and allows a direct explanation of individual anomaly scores due to its simple linear functional form, which makes it very suitable for practical applications when model interpretability is crucial. Based on simulated data and popular benchmark datasets for outlier detection, I analyze the predictive performance of the candidate models and also compare them with outlier ensemble approaches. The results suggest that the proposed BHAD model has very competitive performance compared to other more complex models like variational autoencoders; in fact, it is among the best-performing candidates while offering individual and global model explainability.
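To illustrate the core idea, here is a heavily simplified numpy sketch of histogram-based scoring with a Categorical-Dirichlet posterior predictive (a fixed number of bins and a toy dataset; not the actual BHAD package, which also treats the bin count as a model parameter):

```python
import numpy as np

def histogram_anomaly_scores(X: np.ndarray, n_bins: int = 20,
                             alpha: float = 1.0) -> np.ndarray:
    """Score per row: sum over features of the log posterior predictive
    probability of the sample's bin. Lower (more negative) = more anomalous."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        # Categorical-Dirichlet posterior predictive: (n_k + alpha) / (N + K * alpha)
        probs = (counts + alpha) / (n + n_bins * alpha)
        # Map each sample to its bin index (interior edges only).
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        scores += np.log(probs[idx])
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[0] = [8.0, -8.0, 8.0]                        # an obvious outlier
print(histogram_anomaly_scores(X).argmin())    # -> 0
```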
🎤
Building a Personal Assistant With GPT and Haystack: How to Feed Facts to Large Language Models and Reduce Hallucination.
Speakers:
👤
Mathis Lucka
📅 Mon, 17 Apr 2023 at 16:20
show details
Large Language Models (LLMs), like ChatGPT, have shown miraculous performance on various tasks. But there are still unsolved issues with these models: they can be confidently wrong and their knowledge becomes outdated. GPT also does not have any of the information that you have stored in your own data. In this talk, you'll learn how to use Haystack, an open source framework, to chain LLMs with other models and components to overcome these issues. We will build a practical application using these techniques. And you will walk away with a deeper understanding of how to use LLMs to build NLP products that work.
You can apply LLMs to solve various NLP and NLU tasks, such as summarization or question answering. These models have billions of parameters they can use to effectively store some of the information they saw during pre-training. This enables them to show deep knowledge of a subject, even if they weren't explicitly trained on it. Yet, this capability also comes with issues. The information stored in the parameters can’t easily be updated, and the model's knowledge might become stale. The model won’t have any of your custom data, your company’s knowledge base for example. Sometimes, the model makes things up. We call that hallucination. Cases of hallucination can be hard to spot. The model may be very confident while making up a response. It may even make up fake citations and research papers to support its claims. Haystack is an open source NLP framework for pragmatic builders. Developers use it to build NLP applications, such as question answering systems, neural search engines, or summarization services. Haystack provides all the components you need to build an actual NLP application, which differentiates it from other NLP frameworks. It provides document conversion, pre-processing, data storage, vector databases, and model inference. It also wraps all these components in a neat pipeline abstraction. You can use a pipeline to run your application as a reliable and scalable service in production. In this talk, machine learning engineers, data scientists, and NLP developers will learn how Haystack integrates with LLMs, such as GPT-3. We will show how to use the pipeline abstraction and retrieval-augmented generation to address issues like stale knowledge and hallucination. We will also provide a practical example by showing how to create a personal assistant for knowledge workers. Each step will be accompanied by open source code examples. By the end of the talk, you will have seen these concepts applied in practice and you will be able to build an assistant for your own use case.
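As a rough sketch of what a retrieval-augmented pipeline looks like in Haystack 1.x (exact class and template names changed between versions, and the documents, model choice, and API key are placeholders):

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, PromptNode
from haystack.pipelines import Pipeline

# A toy document store standing in for your company knowledge base.
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([
    {"content": "PyData Berlin 2023 takes place in April."},
    {"content": "Haystack chains retrievers and LLMs into pipelines."},
])

retriever = BM25Retriever(document_store=document_store, top_k=3)
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",       # assumption: an OpenAI backend
    api_key="YOUR_OPENAI_KEY",                # placeholder
    default_prompt_template="deepset/question-answering",  # template name varies by version
)

# Query -> Retriever -> PromptNode: retrieved documents ground the LLM's answer.
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

result = pipeline.run(query="When does PyData Berlin 2023 take place?")
```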
🎤
Keynote - How Are We Managing? Data Teams Management IRL
Speakers:
👤
Noa Tamir
📅 Tue, 18 Apr 2023 at 09:15
show details
The title “Data Scientist” has been in use for 15 years now. We have been attending PyData conferences for over 10 years as well. The hype around data science and AI seems higher than ever before. But how are we managing?
Most of our conferences are about practical applications, methodologies, and platforms. In this talk, I want to focus on contemporary data science management. Including: - Our patterns and antipatterns. - The challenges we are facing as individual contributors, teams, managers, and leaders. - How the data science function has matured. - The unique aspects of Data Science compared to management in general, and software engineering in particular. If you are a data scientist, or work with some of us, you might be interested to learn about what makes us tick, what makes us great colleagues, and yes, even what makes us challenging to work with 😉.
🎤
Aspect-oriented Programming - Diving deep into Decorators
Speakers:
👤
Mike Müller
📅 Tue, 18 Apr 2023 at 10:30
show details
The aspect-oriented programming paradigm can support the separation of cross-cutting concerns such as logging, caching, or checking of permissions. This can improve code modularity and maintainability. Python offers decorators to implement reusable code for cross-cutting tasks. This tutorial is an in-depth introduction to decorators. It covers the usage of decorators and how to implement simple and more advanced decorators. Use cases demonstrate how to work with decorators. In addition to showing how functions can use closures to create decorators, the tutorial introduces callable class instances as an alternative. Class decorators can solve problems that used to be tasks for metaclasses. The tutorial provides use cases for class decorators. While the focus is on best practices and practical applications, the tutorial also provides deeper insight into how Python works behind the scenes. After the tutorial participants will feel comfortable with functions that take functions and return new functions.
## Audience This tutorial is for intermediate Python programmers who want to dive deeper. Solid working knowledge of the basics of functions and classes is required. ## Format The tutorial will be hands-on. I will start with a blank notebook for each topic and develop the content step by step. The participants are encouraged to type along. My typing speed is usually appropriate and allows participants to follow. The students will receive a comprehensive PDF with all course content as well as Python source code files for all use cases and large code blocks I use. I will load these files in my notebook. The students can do the same or open the files in their preferred editor or IDE. I also explicitly ask for feedback if I am too fast or things are unclear. I encourage questions at any time. In fact, questions and my answers are often an important part of my teaching, making the learning experience much more lively and typically more useful. So the participants will be active throughout the whole tutorial. There will be two exercises that each participant has to do on their own during the tutorial (or in breakout rooms if the tutorial is remote). We will look at the solutions during the tutorial. I also supply a solutions PDF after the tutorial. ## Outline * Examples of using decorators * from the standard library * from third-party packages * Closures for decorators * Write a simple decorator * Best practice * Use case: Caching * Use case: Logging * Parameterizing decorators * Chaining decorators * Callable instances instead of functions * Use case: Argument checking * Use case: Registration * Class decorators * Wrap-up and questions
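For readers new to the topic, here is a minimal sketch of the kind of decorator the tutorial builds up to (a hypothetical `timed` decorator, not necessarily the tutorial's exact code):

```python
import functools
import time


def timed(func):
    """Report how long each call to `func` takes."""
    @functools.wraps(func)  # preserve __name__, __doc__, etc. of the wrapped function
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper


@timed
def slow_sum(n):
    return sum(range(n))


slow_sum(1_000_000)  # prints something like: slow_sum took 0.0183s
```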
🎤
The State of Production Machine Learning in 2023
Speakers:
👤
Alejandro Saucedo
📅 Tue, 18 Apr 2023 at 10:30
show details
As the number of production machine learning use cases increases, we find ourselves facing new and bigger challenges where more is at stake. Because of this, it's critical to identify the key areas to focus our efforts on, so we can ensure our machine learning pipelines are reliable and scalable. In this talk we dive into the state of production machine learning in the Python ecosystem, and we will cover the concepts that make production machine learning so challenging, as well as some of the recommended tools available to tackle these challenges. This talk will cover key principles, patterns and frameworks around the open source frameworks powering single or multiple phases of the end-to-end ML lifecycle, including model training, deployment, and monitoring. We will give a high-level overview of the production ML ecosystem and dive into best practices that have been abstracted from production use cases of machine learning operations at scale, as well as how to leverage tools that will allow us to deploy, explain, secure, monitor and scale production machine learning systems.
As the number of production machine learning use cases increases, we find ourselves facing new and bigger challenges where more is at stake. Because of this, it's critical to identify the key areas to focus our efforts on, so we can ensure our machine learning pipelines are reliable and scalable. In this talk we dive into the state of production machine learning in the Python ecosystem, and we will cover the concepts that make production machine learning so challenging, as well as some of the recommended tools available to tackle these challenges. This talk will cover key principles, patterns and frameworks around the open source frameworks powering single or multiple phases of the end-to-end ML lifecycle, including model training, deployment, and monitoring. We will give a high-level overview of the production ML ecosystem and dive into best practices that have been abstracted from production use cases of machine learning operations at scale, as well as how to leverage tools that will allow us to deploy, explain, secure, monitor and scale production machine learning systems. This talk will be relevant for keen Python practitioners and seasoned ML practitioners interested in an updated overview of the state of the production ML ecosystem, covering a broad range of sub-fields in the space. It will benefit the Python ecosystem by providing cross-functional knowledge, bringing together best practices from data scientists, software engineers and DevOps engineers to tackle the challenge of machine learning at scale. During this talk we will shed light on some of the more popular and up-and-coming libraries to watch in this space, and we will provide a conceptual and practical hands-on deep dive that will allow the community both to tackle these issues and to further the discussion.
🎤
What could possibly go wrong? - An incomplete guide on how to prevent, detect & mitigate biases in data products
Speakers:
👤
Lea Petters
📅 Tue, 18 Apr 2023 at 10:30
show details
Within this talk, I want to look at the topic of data ethics with a practical lens and facilitate the discussion about how we can establish ethical data practices in our day-to-day work. I will shed some light on the multiple sources of biases in data applications: Where are potential pitfalls and how can we prevent, detect and mitigate them early so they never become a risk for our data product. I will walk you through the different stages of a data product lifecycle and dive deeper into the questions we as data professionals have to ask ourselves throughout the process. Furthermore, I will present methods, tools and libraries that can support our work. Being well aware that there is no universal solution as tools and strategies need to be chosen to specifically address requirements of the use-case and models at hand, my talk will provide a good starting point for your own data ethics journey.
Terms like trustworthy, responsible or ethical AI have been popular buzzwords for some time. But while we've seen some startling examples of ‘AI gone wrong’, such as when Facebook falsely classified black persons as ‘Primates’, Amazon’s hiring algorithm discriminated against women, or the A-level algorithmic grading fiasco in the UK, for many data projects ethical considerations only come into play as an afterthought - if at all. Experience has shown that more accountability and transparency are needed in AI systems, and regulatory initiatives such as the EU AI Act make it increasingly important to treat the topic as a first-class citizen throughout the whole development process. While the implementation of legal initiatives and ethics guidelines raises awareness and brings the topic into focus, it often remains quite abstract and difficult to translate into our day-to-day work. Therefore, I want to look at the topic with a practical lens and facilitate the discussion about how we can establish ethical data practices. I will shed some light on the multiple sources of biases in data applications: Where are potential pitfalls and how can we prevent, detect and mitigate them early so they never become a risk for our data product. I will walk you through the different stages of a data product lifecycle and dive deeper into the questions we as data professionals have to ask ourselves throughout the process. Furthermore, I will present methods, tools and libraries that can support our work. Being well aware that there is no universal solution as tools and strategies need to be chosen to specifically address requirements of the use-case and models at hand, my talk will provide a good starting point for your own data ethics journey.
🎤
Geospatial Data Processing with Python: A Comprehensive Tutorial
Speakers:
👤
Martin Christen
📅 Tue, 18 Apr 2023 at 10:30
show details
In this tutorial, you will learn about the various Python modules for processing geospatial data, including GDAL, Rasterio, Pyproj, Shapely, Folium, Fiona, OSMnx, Libpysal, Geopandas, Pydeck, Whitebox, ESDA, and Leaflet. You will gain hands-on experience working with real-world geospatial data and learn how to perform tasks such as reading and writing spatial data, reprojecting data, performing spatial analyses, and creating interactive maps. This tutorial is suitable for beginners as well as intermediate Python users who want to expand their knowledge in the field of geospatial data processing.
Geospatial data, which refers to data that has a geographic component, is a crucial part of many fields, including geography, urban planning, and environmental science. In this tutorial, you will learn about the various Python modules that are available for working with geospatial data. We will start by introducing the **GDAL** (Geospatial Data Abstraction Library) and **Rasterio** modules, which are used for reading and writing raster data (data stored in a grid of cells, where each cell has a value). You will learn how to read and write common raster formats such as GeoTIFF and ESRI ASCII, as well as how to perform common raster operations such as resampling and reprojecting. Next, we will cover the **Pyproj** module, which is used for performing coordinate system transformations. You will learn how to convert between different coordinate systems and how to perform common tasks such as converting latitude and longitude coordinates to UTM (Universal Transverse Mercator) coordinates. After that, we will introduce the **Shapely** module, which is used for working with geometric objects in Python. You will learn how to create and manipulate points, lines, and polygons, as well as how to perform spatial operations such as intersection and union. Then, we will cover the **Folium** module, which is used for creating interactive maps in Python. You will learn how to create simple maps, add markers and popups, and customize the appearance of your maps. Next, we will introduce the **Fiona** module, which is used for reading and writing vector data (data stored as individual features, each with its own geometry and attributes). You will learn how to read and write common vector formats such as ESRI Shapefile and GeoJSON, as well as how to access and manipulate the attributes of vector features. After that, we will cover the **OSMnx** module, which is used for working with OpenStreetMap data in Python. You will learn how to download and manipulate street networks, buildings, and other geospatial data from OpenStreetMap. Next, we will introduce the **Libpysal** module, which is used for performing spatial statistics and econometrics in Python. You will learn how to calculate spatial weights, perform spatial autocorrelation tests, and estimate spatial econometric models. Then, we will cover the **Geopandas** module, which is used for working with geospatial data in a Pandas DataFrame. You will learn how to load and manipulate vector data, perform spatial joins, and create choropleth maps. After that, we will introduce the **Pydeck** module, which is used for creating interactive 3D maps in Python. You will learn how to create 3D point clouds, 3D building models, and other 3D geospatial visualizations. Next, we will cover the **Whitebox** module, which is a powerful open-source GIS toolkit for performing geospatial data processing and analysis. You will learn how to use Whitebox to perform tasks such as raster reclassification, terrain analysis, and hydrological modeling. Finally, we will introduce the **ESDA** (Exploratory Spatial Data Analysis) and **LeafMap** modules, which are used for exploring and visualizing spatial patterns and relationships in data. You will learn how to calculate spatial statistics such as Moran's I and local spatial autocorrelation statistics, and how to create interactive choropleth maps.
By the end of this tutorial, you will have a solid understanding of the various Python modules that are available for working with geospatial data and will have hands-on experience applying these tools to real-world data. This tutorial is suitable for beginners as well as intermediate Python users who want to expand their knowledge in the field of geospatial data processing.
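As a flavor of the hands-on parts, here is a small sketch combining Shapely geometries, GeoPandas, and a Pyproj-backed reprojection (coordinates and the output file name are illustrative):

```python
import geopandas as gpd
from shapely.geometry import Point

# Build a GeoDataFrame from plain coordinates (WGS84 lon/lat).
cities = gpd.GeoDataFrame(
    {"name": ["Berlin", "Basel"]},
    geometry=[Point(13.405, 52.52), Point(7.588, 47.559)],
    crs="EPSG:4326",
)

# Reproject to UTM zone 32N so that distances are in meters.
cities_utm = cities.to_crs("EPSG:32632")
print(cities_utm.distance(cities_utm.geometry.iloc[0]))

# Buffer each city by 10 km and write the result to GeoJSON.
cities_utm.buffer(10_000).to_file("buffers.geojson", driver="GeoJSON")
```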
🎤
Bayesian Marketing Science: Solving Marketing's 3 Biggest Problems
Speakers:
👤
Dr. Thomas Wiecki
📅 Tue, 18 Apr 2023 at 10:30
show details
In this talk I will present two new open-source packages that make up a powerful and state-of-the-art marketing analytics toolbox. Specifically, PyMC-Marketing is a new library built on top of the popular Bayesian modeling library PyMC. PyMC-Marketing allows robust estimation of customer acquisition costs (via media mix modeling) as well as customer lifetime value. In addition, I will show how we can estimate the effectiveness of marketing campaigns using a new Bayesian causal inference package called CausalPy. The talk will be applied with a real-world case-study and many code examples. Special emphasis will be placed on the interplay between these tools and how they can be combined together to make optimal marketing budget decisions in complex scenarios.
Marketing data science attempts to answer three main questions: 1. How much does it cost to acquire a customer on a given channel? 2. How much do I earn from an acquired customer over their lifetime? 3. What is the causal impact of my marketing campaigns? While seemingly straightforward, robust estimation of these quantities on noisy, non-stationary and highly structured data is quite tricky. Moreover, while these questions are intimately related, they are often answered separately. In this talk I will present two new open-source packages that make up a powerful and state-of-the-art marketing analytics toolbox. Specifically, PyMC-Marketing is a new library built on top of the popular Bayesian modeling library PyMC. PyMC-Marketing allows robust estimation of customer acquisition costs (via media mix modeling) as well as customer lifetime value. In addition, I will show how we can estimate the effectiveness of marketing campaigns using a new Bayesian causal inference package called CausalPy. The talk will be applied, with a real-world case study and many code examples. Special emphasis will be placed on the interplay between these tools and how they can be combined. Together, the tools demonstrated provide a powerful open-source suite to solve today's biggest marketing analytics challenges.
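To give a flavor of the modeling style, here is a heavily simplified media mix model written in plain PyMC (a sketch of the general idea with synthetic data and a `tanh` saturation transform; this is not the PyMC-Marketing API):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)
n_weeks = 104
spend = rng.gamma(2.0, 1.0, size=(n_weeks, 2))  # weekly spend on two channels
sales = (3 + 1.5 * np.tanh(spend[:, 0]) + 0.8 * np.tanh(spend[:, 1])
         + rng.normal(0, 0.3, n_weeks))

with pm.Model() as mmm:
    intercept = pm.Normal("intercept", 0, 5)
    beta = pm.HalfNormal("beta", 2, shape=2)  # channel effects are non-negative
    # tanh as a simple saturation transform: returns diminish with spend.
    mu = intercept + (beta * pm.math.tanh(spend)).sum(axis=-1)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("sales", mu=mu, sigma=sigma, observed=sales)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=42)
```

The posterior over `beta` then quantifies, with uncertainty, how much each channel contributes per unit of (saturated) spend.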
🎤
Software Design Pattern for Data Science
Speakers:
👤
Theodore Meynard
📅 Tue, 18 Apr 2023 at 10:30
show details
Even if every data science work is special, a lot can be learned from similar problems solved in the past. In this talk, I will share some specific software design concepts that data scientists can use to build better data products.
Data science has evolved from magic models measured by accuracy to software components with an ML core. As such, data scientists’ work should also follow best practices and have a suitable architecture. This is where design patterns can help advance the discipline. A design pattern is a reusable solution to a commonly occurring problem. It is not a concrete piece of code that can be used directly, but identifying a pattern helps us understand the problem and build a common language around it. In this talk, I will share some specific software design concepts that data scientists can use to build better data products. I will not focus on patterns that will improve the performance of your model (you can already find a lot about that online) but on the ones that will help you bring your model to production.
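As one concrete illustration (an invented example in the spirit of the talk, not its actual content), the strategy pattern lets a data product swap models without touching the surrounding serving code:

```python
from typing import Protocol, Sequence


class Model(Protocol):
    def predict(self, features: Sequence[float]) -> float: ...


class BaselineMean:
    def predict(self, features: Sequence[float]) -> float:
        return 0.0  # a trivial fallback model


class LinearModel:
    def __init__(self, weights: Sequence[float]) -> None:
        self.weights = list(weights)

    def predict(self, features: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, features))


def serve(model: Model, features: Sequence[float]) -> float:
    # The serving code depends only on the Model interface,
    # so models can be exchanged or A/B-tested freely.
    return model.predict(features)


print(serve(LinearModel([0.5, 2.0]), [1.0, 3.0]))  # 6.5
print(serve(BaselineMean(), [1.0, 3.0]))           # 0.0
```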
🎤
Improving Machine Learning from Human Feedback
Speakers:
👤
Erin Mikail Staples
👤
Nikolai
📅 Tue, 18 Apr 2023 at 10:30
show details
Large generative models rely upon massive data sets that are collected automatically. For example, GPT-3 was trained with data from “Common Crawl” and “Web Text”, among other sources. As the saying goes — bigger isn’t always better. While powerful, these data sets (and the models that they create) often come at a cost, bringing their “internet-scale biases” along with their “internet-trained models.” While powerful, these models beg the question — is unsupervised learning the best future for machine learning? ML researchers have developed new model-tuning techniques to address the known biases within existing models and improve their performance (as measured by response preference, truthfulness, toxicity, and result generalization). All of this at a fraction of the initial training cost. In this talk, we will explore these techniques, known as Reinforcement Learning from Human Feedback (RLHF), and how open-source machine learning tools like PyTorch and Label Studio can be used to tune off-the-shelf models using direct human feedback.
Large generative models rely upon massive data sets that are collected automatically. For example, GPT-3 was trained with data from “Common Crawl” and “Web Text”, among other sources. As the saying goes — bigger isn’t always better. While powerful, these data sets (and the models that they create) often come at a cost, bringing their “internet-scale biases” along with their “internet-trained models.” While powerful, these models beg the question — is unsupervised learning the best future for machine learning? ML researchers have developed new model-tuning techniques to address the known biases within existing models and improve the model’s performance (as measured by response preference, truthfulness, toxicity, and result generalization). All of this at a fraction of the initial training cost. This talk will explore these Reinforcement Learning from Human Feedback (RLHF) techniques and how open-source machine learning tools like PyTorch and Label Studio can tune off-the-shelf models using direct human feedback. We’ll start by covering traditional RLHF, in which a model is given a set of prompts to generate outputs. These prompt/output pairs are then graded by human annotators, who rank pairs according to a desired metric; the rankings are then used as a reinforcement learning data set to optimize the model to produce results closer to the metric criteria. Next, we’ll discuss recent advances within this field and the advantages they provide. One advance we’ll dive into is the use of Human Language Feedback, in which ranks are replaced with human-language summaries that take full advantage of the “full expressiveness of language that humans use.” This contextual feedback, along with the original prompt and output of the model, is used to generate a new set of model refinements. The model is then tuned with these refinements to match the new output to the human feedback. In a 2022 study, researchers at NYU reported that “using only 100 samples of human-written feedback finetunes a GPT-3 model to roughly human-level summarization ability.” It’s advances like these that are providing advantages in terms of accuracy and bias reduction. Finally, we’ll leave you with examples and resources on implementing these training methods using publicly available models and open-source tools like PyTorch and Label Studio to help retrain models for targeted applications. As this industry continues to grow, evolve, and develop into more widespread applications, we must approach this space with ethics and sustainability in mind. By combining the power and expansiveness of these widely-popular “internet-scale models” with specific, targeted, human approaches, we can avoid the “internet-scale biases” that threaten the legitimacy and trustworthiness of the industry as a whole.
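To make the reward-modeling step concrete, here is a minimal PyTorch sketch of the pairwise ranking loss used to train a reward model from human preference pairs (toy random tensors stand in for real response embeddings):

```python
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps a (pooled) response embedding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)


model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: embeddings of the human-preferred and rejected responses.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Bradley-Terry-style objective: preferred responses should score higher.
loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
optimizer.step()
```

The trained reward model then scores new generations, and an RL algorithm such as PPO tunes the language model to maximize that reward.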
🎤
Rusty Python: A Case Study
Speakers:
👤
Robin Raymond
📅 Tue, 18 Apr 2023 at 11:05
show details
Python is a very expressive and powerful language, but it is not always the fastest option for performance-critical parts of an application. Rust, on the other hand, is known for its lightning-fast runtime and low-level control, making it an attractive option for speeding up performance-sensitive portions of Python programs. In this talk, we will present a case study of using Rust to speed up a critical component of a Python application. We will cover the following topics: * An overview of Rust and its benefits for Python developers * Profiling and identifying performance bottlenecks in a Python application * Implementing a solution in Rust and integrating it with the Python application using PyO3 * Measuring the performance improvements and comparing them to other optimization techniques Attendees will learn about the potential for using Rust to boost the performance of their Python programs and how to go about doing so in their own projects.
# Context In the past, C and C++ were the go-to languages for optimizing Python code while still maintaining a high-level interface. This approach was used by well-known numerical libraries such as NumPy and Pandas. However, with the increasing popularity of Rust and the emergence of PyO3, this is no longer the only solution available. Rust's impressive performance and expressive syntax, combined with its comprehensive library ecosystem, make it a viable alternative for optimizing performance-sensitive parts of Python applications. Additionally, Rust's mature support for asynchronous programming gives it an advantage over C foreign function interfaces when interacting with Python coroutines. Some library maintainers are even considering using Rust for their projects, such as Pydantic, which is implementing version 2 in Rust and achieving similar speed improvements to those obtained using C. # Timeplan In minutes * 0-2: Welcome, explanation of the title * 2-7: What is Rust and how is it different from other "bare metal" languages * 7-10: Introducing the case study, running the code, getting a feel for its performance * 10-15: Profiling the code, finding the bottleneck * 15-17: Introducing PyO3 * 17-22: Walking through the Rust code that optimizes the bottleneck * 22-25: Running the code live, showing the speedup * 25-28: Extensions provided by PyO3, caveats, and what code might not be a good target for optimization; tradeoffs compared to other foreign function interfaces * 28-30: Buffer / Q&A
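As a taste of the profiling step, here is a minimal sketch using the standard library's cProfile to locate a hot spot before rewriting it in Rust (the `slow_pairwise` function is a made-up stand-in for the case study's bottleneck):

```python
import cProfile
import pstats


def slow_pairwise(points):
    """O(n^2) pure-Python distance sum: a typical candidate for a Rust rewrite."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        for x2, y2 in points[i + 1:]:
            total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total


points = [(i * 0.1, i * 0.2) for i in range(1000)]
cProfile.run("slow_pairwise(points)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)
```

Once PyO3 comes in, the Rust rewrite is compiled into an extension module and imported like any other Python module.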
🎤
How Chatbots work – We need to talk!
Speakers:
👤
Yuqiong Weng
👤
Katrin Reininger
📅 Tue, 18 Apr 2023 at 11:05
show details
Chatbots are fun to use, ranging from simple chit-chat (“How are you today?”) to more sophisticated use cases like shopping assistants, or the diagnosis of technical or medical problems. Despite their mostly simple user interaction, chatbots must combine various complex NLP concepts to deliver convincing, intelligent, or even witty results. With the advancing development of machine learning models and the availability of open source frameworks and libraries, chatbots are becoming more powerful every day and at the same time easier to implement. Yet, depending on the concrete use case, the implementation must be approached in specific ways. In the design process of chatbots it is crucial to define the language processing tasks thoroughly and to choose from a variety of techniques wisely. In this talk, we will look together at common concepts and techniques in modern chatbot implementation as well as practical experiences from an E-mobility bot that was developed using the Rasa framework.
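For a flavor of how Rasa-based bots are extended in Python, here is a minimal custom action sketch (the action name, slot, and message are invented for illustration, not taken from the E-mobility bot):

```python
from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher


class ActionFindChargingStation(Action):
    """A hypothetical custom action for an e-mobility bot."""

    def name(self) -> Text:
        return "action_find_charging_station"

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        city = tracker.get_slot("city")  # assumes a 'city' slot in the domain
        dispatcher.utter_message(text=f"Looking for charging stations in {city}...")
        return []
```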
🎤
BLE and Python: How to build a simple BLE project on Linux with Python
Speakers:
👤
Bruno Vollmer
📅 Tue, 18 Apr 2023 at 11:05
show details
Bluetooth Low Energy (BLE) is a part of the Bluetooth standard aimed at bringing wireless technology to low-power devices, and it's getting into everything - lightbulbs, robots, personal health and fitness devices, and plenty more. One of the main advantages of BLE is that everybody can integrate those devices into their tools or projects. However, BLE is not the most developer-friendly protocol and these devices most of the time don't come with good documentation. In addition, there are not a lot of good open-source tools, examples, and tutorials on how to use Python with BLE. Especially if one wants to build both sides of the communication. In this talk, I will introduce the concepts and properties used in BLE interactions and look at how we can use the Linux Bluetooth Stack (Bluez) to communicate with other devices. We will look at a simple example and learn along the way about common pitfalls and debugging options while working with BLE and Python. This talk is for everybody that has a basic understanding of Python and wants to have a deeper understanding of how BLE works and how one could use it in a private project.
Slides can be found here: https://drive.google.com/file/d/1rDkSKriobmW71ZMYU6pqdx7Yal1eUgXm/view?usp=sharing The problem that this talk addresses is the difficulty of using Bluetooth Low Energy (BLE) with Python, particularly for those who are new to the protocol. One issue is that BLE is not necessarily beginner-friendly, with a steep learning curve that can be intimidating for those who are just starting out. Additionally, there are not many examples available for creating a BLE server using Python, which makes it difficult for people to learn and understand the process. This is most likely due to the fact that writing a BLE (GATT) server is often only done in professional contexts. Finally, one has to interact with the system's Bluetooth stack, which adds complexity, particularly on Linux where the use of DBus is required. Overall, these challenges can make it difficult for people to effectively use BLE and Python together. The problem of using BLE with Python is relevant to the audience because BLE is a widely used technology that allows users to add a variety of peripherals to their projects, both personal and professional. Over the years, more and more devices have come to support configuration or use through BLE. For example, BLE is often used in home automation systems, wearable devices, and Internet of Things (IoT) applications. By understanding how to use BLE with Python, the audience can take advantage of the many possibilities that this technology offers and create innovative projects that leverage the capabilities of many different types of BLE devices. In this talk, I will introduce the different technologies that are involved in using BLE with Python, including BLE itself, Bluez (the Linux Bluetooth stack), and DBus (a software system for inter-process communication). This is followed by a showcase of a simple GATT server example using Python, which will demonstrate how to use these technologies effectively. In addition to this, I will explain a possible development process for creating BLE projects with Python, including debugging tools and common pitfalls to avoid. Finally, I will point the audience toward further resources that they can use to continue learning about BLE and Python and to help them get started with their own projects.
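For the client side of such a project, here is a minimal sketch using the cross-platform `bleak` library (the talk's GATT server side uses BlueZ and DBus directly; the characteristic UUID below is the standard Bluetooth battery-level UUID):

```python
import asyncio

from bleak import BleakClient, BleakScanner

BATTERY_UUID = "00002a19-0000-1000-8000-00805f9b34fb"  # standard battery level


async def main() -> None:
    # Scan for nearby BLE devices for five seconds.
    devices = await BleakScanner.discover(timeout=5.0)
    for d in devices:
        print(d.address, d.name)

    if devices:
        # Connect to the first device found and read one characteristic.
        async with BleakClient(devices[0].address) as client:
            value = await client.read_gatt_char(BATTERY_UUID)
            print("Battery level:", int(value[0]))


asyncio.run(main())
```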
🎤
“Who is an NLP expert?” - Lessons Learned from building an in-house QA-system
Speakers:
👤
Nico Kreiling
👤
Alina Bickel
📅 Tue, 18 Apr 2023 at 11:05
show details
Innovations such as sentence-transformers, neural search and vector databases have fueled a very fast development of question-answering systems recently. At scieneers, we wanted to test those components to satisfy our own information needs using a slack-bot that answers our questions by reading through our internal documents and slack conversations. We therefore leveraged the HayStack QA framework in combination with a Weaviate vector database and many fine-tuned NLP models. This talk will give you insights into both the technical challenges we faced and the organizational lessons we learned.
🎤
Actionable Machine Learning in the Browser with PyScript
Speakers:
👤
Valerio Maggio
📅 Tue, 18 Apr 2023 at 11:05
show details
PyScript brings the full PyData stack to the browser, opening up unprecedented use cases for interactive data-intensive applications. In this scenario, the web browser becomes a ubiquitous computing platform, operating within a (nearly) _zero-installation_ & _server-less_ environment. In this talk, we will explore how to create full-fledged interactive front-end machine learning applications using PyScript. We will dive into the main features of the PyScript platform (e.g. _built-in Javascript integration_ and _local modules_), discussing _new_ data & design patterns (e.g. _loading heterogeneous data in the browser_) required to adapt to and overcome the limitations imposed by the new operating environment (i.e. the browser).
PyScript is the new open source platform that brings Python to web front-end applications. In fact, PyScript makes it possible to inject *standard* Python code into HTML, which is then _interpreted_ and _executed_ directly in the browser. And all that, with **no server-side** technology needed, and **no installation** required (_not even a local Python interpreter!, ed._) 🔮. But there's more! Thanks to its built-in integration with [`pyodide`](https://pyodide.org/en/stable/), PyScript brings the [full](https://pyodide.org/en/stable/usage/packages-in-pyodide.html) PyData stack into the browser, along with a native integration with the Javascript interpreter, enabling full support for front-end interactivity. As a result, PyScript has the potential to radically change the way in which interactive data-driven web apps are designed and developed: the seamless bi-directional integration of **Python** and **Javascript** is complemented by full support for reliable numerical computation, enabled by the Python scientific ecosystem (e.g. `numpy`, `scikit-learn`), using the browser as a ubiquitous virtual machine. In this talk, we will explore how PyScript enables the creation of full-fledged front-end _interactive machine learning_ (`ML`) apps. Multiple examples of supervised and unsupervised ML apps will be presented and analysed in detail, in order to fully understand how PyScript works and what key features are provided (e.g. _built-in Javascript integration_; _local modules_). Similarly, we will also discuss _new_ data & design patterns (e.g. _loading heterogeneous data in the browser_; _multi-core vs multi-threading_; _performance considerations_) which are required to adapt to the new _atypical_ environment in which we operate: the **browser**. No specific prior knowledge is required to attend the talk. Familiarity with Python programming and the main `pydata` packages (i.e. `numpy`, `scikit-learn`, `Matplotlib`) is desirable, along with a general understanding of how the web DOM works (for the Javascript interaction part) and basic principles of data processing. **Domain** knowledge: _Novice_; **Python** knowledge: _Intermediate_
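As a minimal flavor (PyScript's tag names and APIs changed across early releases, so treat this as an assumption-laden sketch): the Python below would live inside a `<py-script>` tag in an HTML page that loads PyScript, and it reaches the browser DOM through Pyodide's `js` bridge; the `output` element id is a placeholder:

```python
# Runs in the browser via PyScript/Pyodide, inside a <py-script> tag.
import js  # the JavaScript world, exposed to Python by Pyodide
import numpy as np

# Tiny "ML" in the browser: fit a least-squares line with numpy.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + np.random.default_rng(0).normal(0, 0.5, 10)
slope, intercept = np.polyfit(x, y, deg=1)

# Write the result into a DOM element with id="output".
js.document.getElementById("output").innerText = (
    f"fitted line: y = {slope:.2f}x + {intercept:.2f}"
)
```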
🎤
How Python enables future computer chips
Speakers:
👤
Tim Hoffmann
📅 Tue, 18 Apr 2023 at 11:40
show details
At the semiconductor division of Carl Zeiss it's our mission to continuously make computer chips faster and more energy efficient. To do so, we go to the very limits of what is possible, both physically and technologically. This is only possible through massive research and development efforts. In this talk, we tell the story how Python became a central tool for our R&D activities. This includes technical aspects as well as organization and culture. How do you make sure that hundreds of people work in consistent environments? – How do you get all people on board to work together with Python? – You have lots of domain experts without much software background. How do you prevent them from creating a mess when projects get larger?
🎤
Using transformers – a drama in 512 tokens
Speakers:
👤
Marianne Stecklina
📅 Tue, 18 Apr 2023 at 11:40
show details
“Got an NLP problem nowadays? Use transformers! Just download a pretrained model from the hub!” - every blog article ever As if it’s that easy, because nearly all pretrained models have a very annoying limitation: they can only process short input sequences. Not every NLP practitioner happens to work on tweets, but instead many of us have to deal with longer input sequences. What started as a minor design choice for BERT, got cemented by the research community over the years and now turns out to be my biggest headache: the 512 tokens limit. In this talk, we’ll ask a lot of dumb questions and get an equal number of unsatisfying answers: 1. How much text actually fits into 512 tokens? Spoiler: not enough to solve my use case, and I bet a lot of your use cases, too. 2. I can feed a sequence of any length into an RNN, why do transformers even have a limit? We’ll look into the architecture in more detail to understand that. 3. Somebody smart must have thought about this sequence length issue before, or not? Prepare yourself for a rant about benchmarks in NLP research. 4. So what can we do to handle longer input sequences? Enjoy my collection of mediocre workarounds.
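One workaround in the spirit of the talk, sketched with Hugging Face tokenizers (the model choice and stride are illustrative): split a long document into overlapping 512-token windows and process each chunk separately:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_text = "your very long document " * 2000

# Overlapping windows: each chunk holds up to 512 tokens, with 128 tokens
# shared between neighbours to preserve some context across the cut.
enc = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    stride=128,
    return_overflowing_tokens=True,
)

print(len(enc["input_ids"]))     # number of chunks
print(len(enc["input_ids"][0]))  # 512 tokens each
```

Each chunk is then fed through the model, and the per-chunk outputs are aggregated (e.g. max or mean pooling over chunk predictions).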
🎤
Maps with Django
Speakers:
👤
Paolo Melchiorre
📅 Tue, 18 Apr 2023 at 11:40
show details
Keeping in mind the **Pythonic** principle that _“simple is better than complex”_ we'll see how to create a web **map** with the **Python** based _web framework_ **Django** using its **GeoDjango** module, storing _geographic data_ in your _local database_ on which to run _geospatial queries_.
A *map* in a website is the best way to make geographic data easily accessible to users, because it represents, in a simple way, the information relating to a specific geographical area, and it is in fact used by many online services. Implementing a web *map* can be complex, and many adopt the strategy of using external services, but in most cases this strategy turns out to be a major data and cost management problem. In this talk we'll see how to create a web *map* with the **Python**-based web framework **Django** using its **GeoDjango** module, storing geographic data in your local database and running geospatial queries on it. Through this talk you'll learn how to add a *map* to your website, starting from a simple *map* based on **Spatialite/SQLite** up to a more complex and interactive *map* based on **PostGIS/PostgreSQL**.
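For a taste of GeoDjango (the `Shop` model is an invented example, and the code assumes a Django project configured with `django.contrib.gis` and a spatial database backend), a model with a geographic field and a radius query might look like this:

```python
# models.py -- requires django.contrib.gis and a spatial database backend
from django.contrib.gis.db import models


class Shop(models.Model):
    name = models.CharField(max_length=100)
    location = models.PointField()


# Somewhere in a view: all shops within 5 km of a point.
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D

here = Point(13.405, 52.52, srid=4326)  # lon/lat for Berlin
nearby = Shop.objects.filter(location__distance_lte=(here, D(km=5)))
```

Spatialite/SQLite covers simple cases like this; PostGIS/PostgreSQL unlocks the full range of geospatial lookups.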
🎤
Observability for Distributed Computing with Dask
Speakers:
👤
Hendrik Makait
📅 Tue, 18 Apr 2023 at 11:40
show details
Debugging is hard. Distributed debugging is hell. Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease. However, when things go wrong life can become difficult due to all of the moving parts. These parts include your code, other PyData libraries like NumPy/pandas, the machines you’re running on, the network between them, storage, the cloud, and of course issues with Dask itself. It can be difficult to understand what is going on, especially when things seem slower than they should be or fail unexpectedly. Observability is the key to sanity and success. In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild. This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.
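As a concrete starting point, here is a small sketch of Dask's built-in diagnostics (the computation is a toy example; a local cluster stands in for a real deployment):

```python
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # local cluster; the dashboard URL is available as client.dashboard_link

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Capture scheduler and worker diagnostics for this block into an HTML report.
with performance_report(filename="dask-report.html"):
    result = (x @ x.T).mean().compute()

print(result)
print(client.get_worker_logs())  # recent log lines from each worker
```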
🎤
5 Things about fastAPI I wish we had known beforehand
Speakers:
👤
Alexander CS Hendorf
📅 Tue, 18 Apr 2023 at 11:40
show details
An exchange of views on fastAPI in practice. FastAPI is great, it helps many developers create REST APIs based on the OpenAPI standard and run them asynchronously. It has a thriving community and educational documentation. FastAPI does a great job of getting people started with APIs quickly. This talk will point out some obstacles and dark spots that I wish we had known about before. In this talk we want to highlight solutions.
An exchange of views on fastAPI in practice. FastAPI is great, it helps many developers create REST APIs based on the OpenAPI standard and run them asynchronously. It has a thriving community and educational documentation. FastAPI does a great job of getting people started with APIs quickly. This talk will point out some obstacles and dark spots that I wish we had known about before. In this talk we want to highlight solutions. This talk will include the following: ### fastAPI is built on the shoulders of giants I: [pydantic](https://docs.pydantic.dev/) FastAPI makes extensive use of [pydantic](https://docs.pydantic.dev/). [pydantic](https://docs.pydantic.dev/) parses data, can validate (and transform) data, and has built-in interfaces to export OpenAPI definitions among many other features. ### fastAPI is built on the shoulders of giants II: [starlette](https://www.starlette.io) Routes and middleware are managed by [starlette](https://www.starlette.io). In this section we will explore how to create custom middleware and what we learned along the way. ### fastAPI has tutorials, but is this documentation? The fastAPI page provides a good introduction. The more we worked with fastAPI, the harder it was to find accurate documentation. Looking at the source code, we really missed DocStrings! Introspection to the rescue - this will probably include a rant about missing DocStrings! ### DRY ("Don't repeat yourself") with pydantic For our use case, we decided to use strict models to validate our data structures, as we work in a highly regulated industry where no mistakes are allowed to happen. Setting up the REST API was much easier than developing consistent models that generalise well. We follow the "single source of truth" paradigm; entering redundant definitions is an absolute no-go. In this section we show how to create highly reusable pydantic model pools with inheritance for use in fastAPI. For testing, we also created models from metadata! ### "The road not taken": pydantic Depends()! API routes often consist of a request model and a response model. But what about cases where the models alone don't work and a model and e.g. query parameters need to be mixed? Apart from flake8 complaining about having callables in the signature, this can be quite a difficult use case. Strategies for resolving model/parameter conflicts. Bonus - if time: ### Integrating fastAPI with Sphinx. Demonstrate how to integrate OpenAPI with your Sphinx documentation. The talk will show how fastAPI is built and how well introspection can help you understand what is going on under the hood and which library is actually doing the heavy lifting where.
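As a small illustration of the `Depends()` section (a sketch, not the talk's code): a pydantic model can double as a container for query parameters via `Depends`, keeping parameter definitions in one place:

```python
from fastapi import Depends, FastAPI
from pydantic import BaseModel

app = FastAPI()


class Paging(BaseModel):
    offset: int = 0
    limit: int = 10


@app.get("/items")
def list_items(paging: Paging = Depends()) -> dict:
    # FastAPI builds Paging from the query string: /items?offset=20&limit=5
    return {"offset": paging.offset, "limit": paging.limit}
```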
🎤
Keynote - Towards Learned Database Systems
Speakers:
👤
Carsten Binnig
📅 Tue, 18 Apr 2023 at 13:15
show details
Database Management Systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. For providing high performance, many of the most complex DBMS components such as query optimizers or schedulers involve solving non-trivial problems. To tackle such problems, very recent work has outlined a new direction of so-called learned DBMSs where core parts of DBMSs are being replaced by machine learning (ML) models which has shown to provide significant performance benefits. However, a major drawback of the current approaches to enabling learned DBMS components is that they not only cause very high overhead for training an ML model to replace a DBMS component but that the overhead occurs repeatedly which renders these approaches far from practical. Hence, in this talk, I present my vision of Learned DBMS Components 2.0 to tackle these issues. First, I will introduce data-driven learning where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications such as cardinality estimation or approximate query processing, many DBMS tasks such as physical cost estimation cannot be supported. I thus propose a second technique called zero-shot learning which is a general paradigm for learned DBMS components. Here, the idea is to train models that generalize to unseen data sets out of the box.
Database Management Systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. For providing high performance, many of the most complex DBMS components such as query optimizers or schedulers involve solving non-trivial problems. To tackle such problems, very recent work has outlined a new direction of so-called learned DBMSs where core parts of DBMSs are being replaced by machine learning (ML) models which has shown to provide significant performance benefits. However, a major drawback of the current approaches to enabling learned DBMS components is that they not only cause very high overhead for training an ML model to replace a DBMS component but that the overhead occurs repeatedly which renders these approaches far from practical. Hence, in this talk, I present my vision of Learned DBMS Components 2.0 to tackle these issues. First, I will introduce data-driven learning where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications such as cardinality estimation or approximate query processing, many DBMS tasks such as physical cost estimation cannot be supported. I thus propose a second technique called zero-shot learning which is a general paradigm for learned DBMS components. Here, the idea is to train models that generalize to unseen data sets out of the box. The idea is to train a model that has observed a variety of workloads on different data sets and can thus generalize. Initial results on the task of physical cost estimates suggest the feasibility of this approach. Finally, I discuss further opportunities which are enabled by zero-shot learning.
🎤
Data Kata: Ensemble programming with Pydantic #1
Speakers:
👤
Lev Konstantinovskiy
👤
Gregor Riegler
👤
Nitsan Avni
📅 Tue, 18 Apr 2023 at 14:05
show details
Write code as an ensemble to solve a data validation problem with Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.
The How: We will play a "collaborative game" - write code together to solve a problem. Each small group of 5, an "ensemble", will be guided by a facilitator. An ensemble has only one screen and one keyboard, so participants rotate the roles of typing and talking. The goals are to have fun, learn how to use Pydantic, write better code with Test Driven Development, listen to colleagues, make typos in front of everyone, become a supportive team, defend our ideas and sometimes even accept criticism. Exercise: "Read data from a CSV and check data types, range of values, consistency between columns using Pydantic." See data and starting code in the [repo](https://github.com/tmylk/data-kata/tree/main/validation/pydantic). This is part 1 out of 2 of our data validation tutorial. Part 2 is doing the same task using a different Python framework - Pandera instead of Pydantic. You can attend both or just one part of this tutorial. Format: - Ensemble programming with a facilitator. We will all collaborate as one team, switching the person on the keyboard every 5 mins. - You don't need to have any previous experience with ensemble programming to join. - You don't need to have any previous experience with data validation to join. Schedule: - Intros - 10 mins - Ensemble programming - 30 mins - Interim Retrospective - 10 mins - Ensemble programming - 30 mins - Final Retrospective - 10 mins - Closing Things to note: - We will use gitpod.io as a shared VS Code IDE work environment
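A minimal sketch of the kind of validation the exercise asks for, in pydantic v1 style (the CSV columns here are hypothetical, not the tutorial's actual data):

```python
import csv
from pydantic import BaseModel, validator

class Trip(BaseModel):
    start_km: float
    end_km: float

    @validator("start_km", "end_km")
    def non_negative(cls, v):
        # range-of-values check
        if v < 0:
            raise ValueError("distances must be non-negative")
        return v

    @validator("end_km")
    def end_after_start(cls, v, values):
        # cross-column consistency check
        if "start_km" in values and v < values["start_km"]:
            raise ValueError("end_km must be >= start_km")
        return v

with open("trips.csv") as f:
    trips = [Trip(**row) for row in csv.DictReader(f)]
```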
🎤
Let's contribute to pandas (3 hours) #1
Speakers:
👤
Noa Tamir
👤
Patrick Hoefler
📅 Tue, 18 Apr 2023 at 14:05
show details
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people. pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted! If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html .
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people. pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted! If you don’t finish your contribution during the event, we hope you will continue to work on it after the workshop. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html . ❓Any other requirements ❓ 1. Bring your own laptop 2. Have a GitHub account: https://github.com 3. Have git installed: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git • Format for the session: First 15 minutes : an introduction - what you can contribute, how to contribute, and how to set up your development environment or use gitpod; The rest : "office hours", during which you'll be mentored through setting up a development environment and making a contribution to pandas. • Preparation (optional) For those who are more keen on using the workshop to work on their contribution to pandas, you may want to start setting up your development environment in advance. This way, by the time you arrive you are ready to get started on picking issues, and starting to contribute. Please be aware that it could take longer to set up a development environment on a computer running a Windows operating system compared to macOS or Unix. We will guide you through the steps, and they are useful to learn for many open source projects. We also offer a development environment on gitpod. It can take some minutes to load, but provides you with an instant and fresh development environment for each new task directly from your browser, using VS Code. Documentation is in the works and will be provided before the workshop. To get the most out of the session, it's encouraged (but not required) that you have a look at the contributing guide beforehand: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html. Particularly, the development environment instructions: https://pandas.pydata.org/docs/dev/development/contributing_environment.html • Audience level Everyone is welcome to attend this session! If you've never contributed to open source software before, then you will learn how to, and if you have experience contributing, then you can either help mentor other attendees or you can work on more challenging contributions. It is useful to have some pandas, git, and Python experience. If you don't have much experience with them, you might expect to spend time "learning by doing".
🎤
Pragmatic ways of using Rust in your data project
Speakers:
👤
Christopher Prohm
📅 Tue, 18 Apr 2023 at 14:10
show details
Writing efficient data pipelines in Python can be tricky. The standard recommendation is to use vectorized functions implemented in NumPy, pandas, or the like. However, what to do when the processing task does not fit these libraries? Using plain Python for processing can result in poor performance, in particular when handling large data sets. Rust is a modern, performance-oriented programming language that is already widely used by the Python community. Augmenting data processing steps with Rust can result in substantial speed-ups. In this talk, I will present strategies for using Rust in a larger Python data processing pipeline, with a particular focus on pragmatism and minimizing integration effort.
One common strategy is to wrap the Rust part as a Python extension module. With enough care, the extension module can have a pythonic feel and substantially improve performance. While libraries such as PyO3 offer streamlined APIs, this task can still require a lot of work. An often simpler alternative is to package the Rust part as an executable and communicate via files or the network. This talk will focus on JSON messages exchanged via stdin / stdout or dataframe-like data in Arrow-compatible files. JSON is broadly supported in both Python and Rust, and serialization can easily be handled with libraries such as Serde (Rust) or cattrs (Python). The Arrow in-memory format supports complex data types, such as structs, lists, maps, or unions. These files can then be efficiently processed in Python by an ever-growing list of libraries, most prominently pandas and Polars. I will discuss the different strategies using real-world use cases and offer tips on how to implement them. Finally, I will summarize the respective strengths and weaknesses of the approaches.
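A minimal Python-side sketch of the executable-plus-JSON strategy (`./transform` is a hypothetical Rust binary that reads a JSON object on stdin and writes a JSON result to stdout):

```python
import json
import subprocess

payload = json.dumps({"values": [1.0, 2.0, 3.0]})

# Hand the data to the Rust executable and read its JSON reply back
result = subprocess.run(
    ["./transform"],
    input=payload,
    capture_output=True,
    text=True,
    check=True,
)
processed = json.loads(result.stdout)
print(processed)
```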
🎤
Getting started with JAX
Speakers:
👤
Simon Pressler
📅 Tue, 18 Apr 2023 at 14:10
show details
DeepMind's JAX ecosystem provides deep learning practitioners with an appealing alternative to TensorFlow and PyTorch. Among its strengths are great functionalities such as native TPU support, as well as easy vectorization and parallelization. Nevertheless, making your first steps in JAX can feel complicated given some of its idiosyncrasies. This talk helps new users get started in this promising ecosystem by sharing practical tips and best practices.
DeepMind's JAX ecosystem provides deep learning practitioners with an appealing alternative to TensorFlow and PyTorch. Among its strengths are great functionalities such as native TPU support, as well as easy vectorization and parallelization, which make JAX and its ecosystem an attractive option for your deep learning projects. Nevertheless, making your first steps can feel complicated. From pure functions and the resulting differences in coding style, to avoiding recompilation, JAX comes with its own set of restrictions and design decisions to be taken by the user. This talk wants to help new and prospective users in their JAX learning journey by providing guidance on practical problems they are likely to encounter when transitioning into the JAX ecosystem. Having recently switched to JAX and Flax for my daily work, I will share some of the insights I gained and help new users avoid some of the mistakes I made early on. The talk takes a systematic look at selected situations in which JAX presents users with choices, examining how the options differ and which one is best under different circumstances. The talk covers: - Why bother switching to JAX? - A brief introduction to JAX including a list of JAX’s idiosyncrasies - Pure functions and the resulting architectural decisions - To JIT or not to JIT - A speed and memory comparison of the different iteration options - Memory management and profiling
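For readers new to JAX, a minimal sketch of the pure-function style and jax.jit (the model and data are illustrative only):

```python
import jax
import jax.numpy as jnp

@jax.jit  # traced and compiled on the first call; retraced when shapes or dtypes change
def mse(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

x = jnp.ones((32, 4))
y = jnp.zeros((32,))
w = jnp.zeros((4,))

grad_fn = jax.jit(jax.grad(mse))  # gradient of a pure function, also JIT-compiled
print(mse(w, x, y), grad_fn(w, x, y))
```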
🎤
Data-driven design for the Dask scheduler
Speakers:
👤
Guido Imperiale
📅 Tue, 18 Apr 2023 at 14:10
show details
Historically, changes in the scheduling algorithm of Dask have often been based on theory, single use cases, or even gut feeling. Coiled has now moved to using hard, comprehensive performance metrics for all changes - and it's been a turning point!
Any developer worth their salt scrupulously practices functional regression testing: all functionality is covered by automated tests, and every time anybody changes something all tests must remain green. Performance testing, however, is a much fuzzier and often neglected area, typically because measuring realistic performance requires a production-sized test bench, and because performance always includes some degree of variance. Historically, changes to the scheduling algorithm in Dask have suffered from this problem. There have always been plenty of functional unit tests verifying that the scheduler makes whatever minute decisions the developers expect, but until recently there weren't any end-to-end, production-sized test benches on realistic use cases to measure performance. At Coiled, we have now implemented a new test suite that does just that - statistical analysis of performance metrics - letting us understand whether a change is beneficial or detrimental in terms of runtime and memory usage. This presentation delves into how we collect data, visualize it, and act on it, and how much it has changed our development process for the better.
🎤
Methods for Text Style Transfer: Text Detoxification Case
Speakers:
👤
Daryna Dementieva
📅 Tue, 18 Apr 2023 at 14:10
show details
Global access to the Internet has enabled the spread of information throughout the world and has offered many new possibilities. On the other hand, alongside the advantages, the exponential and uncontrolled growth of user-generated content on the Internet has also facilitated the spread of toxicity and hate speech. Much work has been done in the direction of offensive speech detection. However, there is another, more proactive way to fight toxic speech: offering the user a detoxified version of their message as a suggestion. In this presentation, we will provide an overview of how the text detoxification task can be solved. The proposed approaches can be reused for any text style transfer task, for both monolingual and multilingual use cases.
Firstly, we will briefly introduce the research direction of NLP for Social Good. Then, we will present the main directions of research in the text style transfer field. This field suffers from a lack of parallel data. We will describe our approach to collecting such a parallel dataset and show that it can be applied to any language. Then, we will show how monolingual, multilingual, and cross-lingual models can be trained for text detoxification. In the end, we will discuss ethical issues connected with this task and with tackling toxic and hate speech in general. The presented work is based on peer-reviewed papers from ACL and EMNLP conferences.
🎤
You are what you read: Building a personal internet front-page with spaCy and Prodigy
Speakers:
👤
Victoria Slocum
📅 Tue, 18 Apr 2023 at 14:10
show details
Sometimes the internet can be a bit overwhelming, so I thought I would make a tool to create a personalized summary of it! In this talk, I'll demonstrate a personal front-page project that allows me to filter info on the internet on a certain topic, built using spaCy, an open-source library for NLP, and Prodigy, a scriptable annotation tool. With this project, I learned about the power of working with tools that provide extensive customizability without sacrificing ease of use. Throughout the talk, I'll also discuss how design concepts of developer tools can improve the development experience when building complex and adaptable software.
Sometimes the internet can be a bit overwhelming, so I thought I would make a tool to create a personalized summary of it! In this talk, I'll demonstrate an open-source front-page project that allows me to filter info on the internet on a certain topic, customized and adapted to the user's preference. While building this project, I have been able to further explore the open-source NLP library, spaCy, and the scriptable annotation tool, Prodigy. Part of this talk will discuss how this project was implemented with regard to data collection, annotation and modeling. I developed a custom annotation interface, created a spaCy NLP pipeline, and explored different model architectures. Through the project, I learned about the power of working with tools that offer both good guide-rails and extensive customizability. In this talk, we'll also look at the design concepts of spaCy and Prodigy and how they've enhanced the developer experience for different types of projects, including my personal front-page. I'll discuss what I've discovered about how customizable tooling can improve the developer experience when building complex and adaptable software.
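To illustrate the kind of building block involved (not the talk's actual pipeline), a minimal spaCy sketch that extracts entities and content words from a headline, assuming the small English model is installed:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("New open-source NLP tooling was announced in Berlin this week.")
print([(ent.text, ent.label_) for ent in doc.ents])    # named entities
print([tok.lemma_ for tok in doc if not tok.is_stop])  # content words for topic filtering
```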
🎤
Visualizing your computer vision data is not a luxury, it's a necessity: without it, your models are blind and so are you.
Speakers:
👤
Chazareix Arnault
📅 Tue, 18 Apr 2023 at 14:45
show details
Are you ready to take your Computer Vision projects to the next level? Then don't miss this talk! Data visualization is a crucial ingredient for the success of any computer vision project. It allows you to assess the quality of your data, grasp the intricacies of your project, and communicate effectively with stakeholders. In this talk, we'll showcase the power of data visualization with compelling examples. You'll learn about the benefits of data visualization and discover practical methods and tools to elevate your projects. Don't let this opportunity pass you by: join us and learn how to make data visualization a core feature of your Computer Vision projects.
This talk is suitable for computer vision professionals and enthusiasts who want to learn about best practices for visualizing and exploring datasets and how to apply them to their projects. It will provide a valuable foundation for building better machine learning models and producing high-quality results. Data scientists from other domains may also find eye-opening information and ideas. We will explore examples of data issues in various computer vision datasets and tasks, such as object detection, few-shot learning, and visual question answering. We will then examine tools and strategies for inspecting datasets and the results of models, including FiftyOne, KnowYourData, and Streamlit. By the end of the talk, attendees will have a deeper understanding of the importance of visualizing and exploring computer vision datasets and be equipped with the knowledge and skills to apply these techniques in their own projects.
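As a taste of one of the tools covered, a minimal FiftyOne sketch that loads a demo dataset and opens the interactive app (the dataset choice is illustrative):

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Small demo dataset with images, ground-truth labels, and model predictions
dataset = foz.load_zoo_dataset("quickstart")

# Browse samples, labels, and predictions interactively in the browser
session = fo.launch_app(dataset)
session.wait()
```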
🎤
Delivering AI at Scale
Speakers:
👤
Severin Schmitt
👤
Anna Achenbach
👤
Thorsten Kranz
📅 Tue, 18 Apr 2023 at 14:45
show details
Everybody knows our yellow vans, trucks and planes around the world. But do you know how data drives our business and how we leverage algorithms and technology in our core operations? We will share some “behind the scenes” insights on Deutsche Post DHL Group’s journey towards a Data-Driven Company. • Large-Scale Use Cases: Challenging and high impact Use Cases in all major areas of logistics, including Computer Vision and NLP • Fancy Algorithms: Deep-Neural Networks, TSP Solvers and the standard toolkit of a Data Scientist • Modern Tooling: Cloud Platforms, Kubernetes, Kubeflow, Auto ML • No rusty working mode: small, self-organized, agile project teams, combining state of the art Machine Learning with MLOps best practices • A young, motivated and international team – German skills are only “nice to have” But we have more to offer than slides filled with buzzwords. We will demonstrate our passion for our work, deep dive into our largest use cases that impact your everyday life and share our approach for a time series forecasting library - combining data science, software engineering and technology for efficient and easy-to-maintain machine learning projects.
🎤
Accelerating Public Consultations with Large Language Models: A Case Study from the UK Planning Inspectorate
Speakers:
👤
Michele Dallachiesa
👤
Andreas Leed
📅 Tue, 18 Apr 2023 at 14:45
show details
Local Planning Authorities (LPAs) in the UK rely on written representations from the community to inform their Local Plans which outline development needs for their area. With an average of 2000 representations per consultation and 4 rounds of consultation per Local Plan, the volume of information can be overwhelming for both LPAs and the Planning Inspectorate tasked with examining the legality and soundness of plans. In this study, we investigate the potential for Large Language Models (LLMs) to streamline representation analysis. We find that LLMs have the potential to significantly reduce the time and effort required to analyse representations, with simulations on historical Local Plans projecting a reduction in processing time by over 30%, and experiments showing classification accuracy of up to 90%. In this presentation, we discuss our experimental process which used a distributed experimentation environment with Jupyter Lab and cloud resources to evaluate the performance of the BERT, RoBERTa, DistilBERT, and XLNet models. We also discuss the design and prototyping of web applications to support the aided processing of representations using Voilà, FastAPI, and React. Finally, we highlight successes and challenges encountered and suggest areas for future improvement.
In the United Kingdom, Local Planning Authorities (LPAs) are responsible for creating Local Plans that outline the development needs of their areas, including land allocation, infrastructure requirements, housing needs, and environmental protection measures. This process involves consulting with the local community and interested parties multiple times, which often results in hundreds or thousands of written representations that must be organised and analysed. On average, LPAs receive approx. 2000 written representations per consultation, and each Local Plan requires 4 rounds of consultation. The process of analysing these representations takes approx. 3.5 months per round of consultation to complete. The Planning Inspectorate is tasked with examining Local Plans to ensure they follow national policy and legislation. The Inspectorate examines approx. 60 Local Plans a year, each examination lasting around a year. The volume of information included in each Local Plan significantly outweighs the capacity of the Planning Inspectorate to read and analyse the content in detail. This can lead to important issues being overlooked and potential problems with the review process or legal challenges. Conducting a thorough and meticulous analysis of representations takes a lot of time and effort for both LPAs and the Planning Inspectorate. Together with the Planning Inspectorate, we conducted an AI discovery to explore how Large Language Models (LLMs) can help reduce the time taken to analyze representations, improve resource planning, increase consistency in decision-making, and mitigate the risk of a key issue of material concern being missed. We assessed the performance of competing models and demonstrated their effectiveness with proof-of-concept apps for both LPAs and the Planning Inspectorate that unify and streamline the aided processing of representations. Our simulations on historical Local Plans resulted in a projected reduction of the time taken to analyze representations by more than 30%, and experiments show that we are able to classify representations to the relevant policy in Local Plans with up to 90% accuracy. In this talk, we share our experimental process based on Python and the experimental results. We delve into how we approached the problem, sourced and cleaned the data, and used a distributed experimentation environment with Jupyter Lab and cloud resources to evaluate the performance of BERT, RoBERTa, DistilBERT, and XLNet models. We also discuss our strategies for dealing with limited training data. Finally, we present the design and prototyping of two web applications using Voilà, and demonstrate how we iterated on them using FastAPI and React. Throughout the presentation, we highlight the successes and challenges we encountered, and suggest areas for future improvement.
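For illustration of the classification setup (model choice and label count here are assumptions, not the project's exact configuration), a minimal Hugging Face sketch:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical: classify a representation against 12 Local Plan policies;
# the head is randomly initialized here - fine-tuning would come first in practice
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=12
)

text = "The plan does not allocate enough land for affordable housing."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # index of the predicted policy
```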
🎤
Writing Plugin Friendly Python Applications
Speakers:
👤
Travis Hathaway
📅 Tue, 18 Apr 2023 at 14:45
show details
In modern software engineering, plugin systems are a ubiquitous way to extend and modify the behavior of applications and libraries. When software is written in a way that is plugin friendly, it encourages the use of modular organization where the contracts between the core software and the plugin have been well thought out. In this talk, we cover exactly how to define this contract and how you can start designing your software to be more plugin friendly. Throughout the talk we will be creating our own plugin friendly application using the [pluggy](https://pluggy.readthedocs.io/en/stable/) library to show these design principles in action. At the end of the talk, I also cover a real-life case study of how the package manager [conda](https://github.com/conda/conda) is currently making its 10 year old code more plugin friendly to illustrate how to retrofit an existing project.
This talk begins with a general discussion about what plugins are and how they are used in software. We cover important theoretical concepts and show just how pervasive plugins are in much of the software we use every day. With a firm idea about what plugins allow us to do, we will begin creating our own command line application that downloads images via APIs given a search term. We will write our application with plugins in mind so that we can quickly expand and support any number of image searching backends (e.g. Google, Unsplash, etc.). The presentation will focus on everything we have to do to let plugin authors extend our application and add their own backends. A fully functional implementation of this application can be found here: [https://github.com/travishathaway/latz](https://github.com/travishathaway/latz). After building our own application, I will then present how the [conda](https://github.com/conda/conda) project approaches making its software plugin friendly. Much of what I show in the example also applies to conda's plugin architecture. This talk should prepare those interested in writing their own plugin friendly applications to get started with the [pluggy](https://pluggy.readthedocs.io/en/stable/) library. The [example project](https://github.com/travishathaway/latz) will also provide a great starting point and inspiration for new and existing applications.
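A minimal sketch of the hookspec/hookimpl contract with pluggy (the project and backend names are illustrative, not latz's actual API):

```python
import pluggy

hookspec = pluggy.HookspecMarker("imgsearch")
hookimpl = pluggy.HookimplMarker("imgsearch")

class SearchSpec:
    @hookspec
    def search(self, query):
        """Return a list of image URLs for the query."""

class DummyBackend:
    @hookimpl
    def search(self, query):
        return [f"https://example.com/{query}.jpg"]

pm = pluggy.PluginManager("imgsearch")
pm.add_hookspecs(SearchSpec)
pm.register(DummyBackend())

# Each registered backend contributes its own result list
print(pm.hook.search(query="sunset"))
```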
🎤
When A/B testing isn’t an option: an introduction to quasi-experimental methods
Speakers:
👤
Inga Janczuk
📅 Tue, 18 Apr 2023 at 14:45
show details
Identifying causal relationships by running experiments is not always possible. In this talk, I discuss an alternative approach: quasi-experimental frameworks. Additionally, I will present how to adjust well-known machine-learning algorithms so they can be used to quantify causal relationships.
### What problem is the talk addressing? Experiments are a gold standard for estimating causal relationships. That being said, they are not always possible. Experiments can be costly, long-lasting, unethical, or illegal. In other cases, the underlying assumptions for identification cannot be met, e.g. it is not possible to split subjects into control and treatment groups randomly or avoid interactions between them. ### Why is the problem relevant to the audience? Understanding the magnitude of treatment effects is a prerequisite for policy makers and stakeholders to design optimal strategies. ### What are the solutions to the problem? Prediction-driven algorithms are not necessarily well suited to the accurate identification of causal links. In this talk I will show how to shift the goalposts of those algorithms from prediction towards identification of treatment effects. First, I will cover classical quasi-experimental frameworks such as difference-in-differences and regression discontinuity design. Then, I will shed some light on how to augment those methods with out-of-the-box machine-learning techniques. To this end, orthogonal machine learning will be discussed. ### What are the main takeaways from the talk? I will reiterate that correlation does not imply causation. The audience will become familiar with causal-inference methods used when controlled experiments are not feasible. The participants will learn how to adjust off-the-shelf machine-learning algorithms to identify conditional average treatment effects.
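As a worked example of the simplest framework mentioned, difference-in-differences, here is a sketch on synthetic data (the true effect of 2.0 is recovered by the interaction coefficient):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),  # treatment group indicator
    "post": rng.integers(0, 2, n),     # post-intervention period indicator
})
# Outcome with group and time effects plus a true treatment effect of 2.0
df["y"] = (1.0 + 0.5 * df["treated"] + 0.3 * df["post"]
           + 2.0 * df["treated"] * df["post"] + rng.normal(0, 1, n))

model = smf.ols("y ~ treated * post", data=df).fit()
print(model.params["treated:post"])  # the DiD estimate, close to 2.0
```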
🎤
Let's contribute to pandas (3 hours) #2
Speakers:
👤
Noa Tamir
👤
Patrick Hoefler
📅 Tue, 18 Apr 2023 at 15:45
show details
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people. pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted! If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html .
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people. pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted! If you don’t finish your contribution during the event, we hope you will continue to work on it after the workshop. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html . ❓Any other requirements ❓ 1. Bring your own laptop 2. Have a GitHub account: https://github.com 3. Have git installed: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git • Format for the session: First 15 minutes : an introduction - what you can contribute, how to contribute, and how to set up your development environment or use gitpod; The rest : "office hours", during which you'll be mentored through setting up a development environment and making a contribution to pandas. • Preparation (optional) For those who are more keen on using the workshop to work on their contribution to pandas, you may want to start setting up your development environment in advance. This way, by the time you arrive you are ready to get started on picking issues, and starting to contribute. Please be aware that it could take longer to set up a development environment on a computer running a Windows operating system compared to macOS or Unix. We will guide you through the steps, and they are useful to learn for many open source projects. We also offer a development environment on gitpod. It can take some minutes to load, but provides you with an instant and fresh development environment for each new task directly from your browser, using VS Code. Documentation is in the works and will be provided before the workshop. To get the most out of the session, it's encouraged (but not required) that you have a look at the contributing guide beforehand: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html. Particularly, the development environment instructions: https://pandas.pydata.org/docs/dev/development/contributing_environment.html • Audience level Everyone is welcome to attend this session! If you've never contributed to open source software before, then you will learn how to, and if you have experience contributing, then you can either help mentor other attendees or you can work on more challenging contributions. It is useful to have some pandas, git, and Python experience. If you don't have much experience with them, you might expect to spend time "learning by doing".
🎤
Data Kata: Ensemble programming with Pydantic #2
Speakers:
👤
Lev Konstantinovskiy
👤
Gregor Riegler
👤
Nitsan Avni
📅 Tue, 18 Apr 2023 at 15:45
show details
Write code as an ensemble to solve a data validation problem using Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.
The How: We will play a "collaborative game" - write code together to solve a problem. Each small group of 5, an "ensemble", will be guided by a facilitator. An ensemble has only one screen and one keyboard, so participants rotate the roles of typing and talking. The goals are to have fun, learn how to use Pandera, write better code with Test Driven Development, listen to colleagues, make typos in front of everyone, become a supportive team, defend our ideas and sometimes even accept criticism. Exercise: "Read data from a CSV and check data types, range of values, consistency between columns using Pandera." See data and starting code in the [repo](https://github.com/tmylk/data-kata/tree/main/validation/pydantic). This is part 2 of our data validation tutorial. Part 1 is doing the same task using a different Python framework - Pydantic instead of Pandera. You can attend both or just one part of this tutorial. Format: - Ensemble programming with a facilitator. We will all collaborate as one team, switching the person on the keyboard every 5 mins. - You don't need to have any previous experience with ensemble programming to join. - You don't need to have any previous experience with data validation to join. Schedule: - Intros - 10 mins - Ensemble programming - 30 mins - Interim Retrospective - 10 mins - Ensemble programming - 30 mins - Final Retrospective - 10 mins - Closing Things to note: - We will use gitpod.io as a shared VS Code IDE work environment
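A minimal sketch of the same exercise in Pandera (column names are hypothetical, not the tutorial's actual data):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "start_km": pa.Column(float, pa.Check.ge(0)),
        "end_km": pa.Column(float, pa.Check.ge(0)),
    },
    # cross-column consistency check over the whole DataFrame
    checks=pa.Check(lambda df: df["end_km"] >= df["start_km"]),
)

df = pd.read_csv("trips.csv")
validated = schema.validate(df)  # raises a SchemaError on violations
```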
🎤
MLOps in practice: our journey from batch to real-time inference
Speakers:
👤
Theodore Meynard
📅 Tue, 18 Apr 2023 at 16:00
show details
I will present the challenges we encountered while migrating an ML model from batch to real-time predictions and how we handled them. In particular, I will focus on the design decisions and open-source tools we built to test the code, data and models as part of the CI/CD pipeline and enable us to ship fast with confidence.
At GetYourGuide we build a marketplace for travel experiences. The ranking of activities on the platform is one of the most essential machine-learning products for the business. In this talk, I will explain how we gradually migrated our ranking from global precomputed scores to a live reranking service. Building such a service with high availability requirements and constant modifications brings challenges. I will dive into the design decisions and open-source tools we built to enable us to test code, data, and models as part of the CI/CD pipeline. This allows us to ship fast with confidence without losing ourselves in cumbersome tests or a mocking hell. At the end of the talk, you will have actionable insights you can apply to your Machine Learning products and understand how to introduce good MLOps practices using open-source tools.
🎤
Enabling Machine Learning: How to Optimize Infrastructure, Tools and Teams for ML Workflows
Speakers:
👤
Yann Lemonnier
📅 Tue, 18 Apr 2023 at 16:00
show details
In this talk, we will explore the role of a machine learning enabler engineer in facilitating the development and deployment of machine learning models. We will discuss best practices for optimizing infrastructure and tools to streamline the machine learning workflow, reduce time to deployment, and enable data scientists to extract insights and value from data more efficiently. We will also examine case studies and examples of successful machine learning enabler engineering projects and share practical tips and insights for anyone interested in this field.
🎤
Introducing FastKafka
Speakers:
👤
Tvrtko Sternak
📅 Tue, 18 Apr 2023 at 16:00
show details
FastKafka is a Python library that makes it easy to connect to Apache Kafka queues and send and receive messages. In this talk, we will introduce the library and its features for working with Kafka queues in Python. We will discuss the motivations for creating the library, how it compares to other Kafka client libraries, and how to use its decorators to define functions for consuming and producing messages. We will also demonstrate how to use these functions to build a simple application that sends and receives messages from the queue. This talk will be of interest to Python developers looking for an easy-to-use solution for working with Kafka. The documentation of the library can be found here: https://fastkafka.airt.ai/
FastKafka is a Python library that simplifies the process of connecting to Apache Kafka queues and sending and receiving messages. It follows a decorator-based approach inspired by the popular FastAPI library, making it easy to define functions for consuming messages from the queue and producing and sending new ones. In this talk, we will introduce FastKafka and its features for working with Kafka in Python. We will start by discussing the motivations for creating the library and how it compares to other Kafka client libraries. We will then delve into a live demonstration of the library's features, showing how to use the decorators to define functions for consuming and producing messages, and how to use these functions to build a simple application that sends and receives messages from the queue. Finally, we will discuss some real-world use cases for FastKafka and how it can be used to build scalable, high-performance applications that need to process and transmit large amounts of data. This talk will be of particular interest to Python developers looking for an easy-to-use solution for working with Kafka.
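A minimal sketch in the decorator style the library documents (broker settings and message models are illustrative, and exact signatures may differ between versions):

```python
from pydantic import BaseModel
from fastkafka import FastKafka

class InputData(BaseModel):
    user_id: int
    value: float

class Prediction(BaseModel):
    user_id: int
    score: float

# Illustrative broker configuration for a local Kafka instance
kafka_brokers = {"localhost": {"url": "localhost", "port": 9092}}
app = FastKafka(title="Demo app", kafka_brokers=kafka_brokers)

@app.consumes(topic="input_data")
async def on_input_data(msg: InputData):
    await to_predictions(msg.user_id, msg.value * 2)

@app.produces(topic="predictions")
async def to_predictions(user_id: int, score: float) -> Prediction:
    return Prediction(user_id=user_id, score=score)  # returned message is published
```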
🎤
The bumps in the road: A retrospective on my data visualisation mistakes
Speakers:
👤
Artem Kislovskiy
📅 Tue, 18 Apr 2023 at 16:00
show details
We will delve into the importance of effective data visualisation in today's world. We will explore how it can help convey insights from data using Matplotlib and best practices for creating informative visualisations. We will also discuss the limitations of static visualisations and examine the role of continuous integration in streamlining the process and avoiding common pitfalls. By the end of this talk, you will have gained valuable insights and techniques for creating informative and accurate data visualisations, no matter what tools you're using.
In today's world, effective visualisation is crucial for conveying insights from data. We will explore best practices for creating visualisations with Matplotlib. We will discuss the limitations of static visualisations and how continuous integration can help streamline the process and avoid common pitfalls. I will share my practical experiences and lessons learned from working with analytics, drawing on the insights of well-known experts such as Edward Tufte, Stephen Few, Alberto Cairo, and Dona Wong. The work of these authors has helped shape our understanding of how to create informative and accurate visualisations. I will reflect on what I wish I had known about the best practices in this field. This talk is suitable for professionals who work with data and want to improve the effectiveness of analytics and reporting. Data visualisation is a form of communication, and learning to apply it well is essential for conveying the stories that data tells us. By the end of this talk, you will have gained valuable techniques for creating informative analytics and an understanding of how CI can support your data visualisation projects.
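One small example of the kind of practice discussed - explicit labels, less chartjunk, and a scripted export so CI can regenerate the figure (the data is illustrative):

```python
import matplotlib.pyplot as plt

months = list(range(1, 13))
revenue = [12, 14, 13, 17, 19, 18, 21, 24, 23, 26, 28, 31]  # illustrative data

fig, ax = plt.subplots(figsize=(7, 3.5))
ax.plot(months, revenue, color="tab:blue", linewidth=2)
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (kEUR)")
ax.set_title("Monthly revenue, 2022")
for side in ("top", "right"):
    ax.spines[side].set_visible(False)  # reduce chartjunk, per Tufte
fig.savefig("revenue.png", dpi=150, bbox_inches="tight")  # reproducible in CI
```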
🎤
Neo4j graph databases for climate policy
Speakers:
👤
Marcus Tedesco
📅 Tue, 18 Apr 2023 at 16:35
show details
In this talk we walk through our experience using Neo4j and Python to model climate policy as a graph database. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!
As the ambition and complexity of climate regulations and policies grows, it is becoming increasingly difficult to represent them in relational databases. For example, the EU Sustainable Taxonomy regulation contains thousands of interrelated legal clauses, many of which also reference other legal texts and entities. Graph databases such as Neo4j present a possible alternative well suited to modelling the complicated, interrelated and evolving structure of climate regulations. In this talk we walk through our experience using Neo4j and Python to model climate policy such as the EU Sustainable Taxonomy as a graph database. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!
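A minimal sketch of the Python side (clause identifiers are invented; assumes the v5 Neo4j Python driver):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_clauses(tx, clause_id, referenced_id):
    # model "clause A references clause B" as a relationship in the graph
    tx.run(
        "MERGE (a:Clause {id: $a}) "
        "MERGE (b:Clause {id: $b}) "
        "MERGE (a)-[:REFERENCES]->(b)",
        a=clause_id, b=referenced_id,
    )

with driver.session() as session:
    session.execute_write(link_clauses, "Article 10(1)", "Article 17")
driver.close()
```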
🎤
Use Spark from anywhere: A Spark client in Python powered by Spark Connect
Speakers:
👤
Martin Grund
📅 Tue, 18 Apr 2023 at 16:35
show details
Over the past decade, developers, researchers, and the community have successfully built tens of thousands of data applications using Spark. Since then, use cases and requirements of data applications have evolved: Today, every application, from web services that run in application servers, interactive environments such as notebooks and IDEs, to phones and edge devices such as smart home devices, wants to leverage the power of data. However, Spark's driver architecture is monolithic, running client applications on top of a scheduler, optimizer and analyzer. This architecture makes it hard to address these new requirements: there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL. Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages. This talk highlights how simple it is to connect to Spark using Spark Connect from any data application or IDE. We will do a deep dive into the architecture of Spark Connect and give an outlook of how the community can participate in the extension of Spark Connect for new programming languages and frameworks - to bring the power of Spark everywhere.
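Connecting through Spark Connect looks like this in PySpark 3.4+ (host and port are illustrative; 15002 is the default server port):

```python
from pyspark.sql import SparkSession

# The client builds unresolved logical plans; only results travel back
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(100)
print(df.filter(df.id % 2 == 0).count())
```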
🎤
Ask-A-Question: an FAQ-answering service for when there's little to no data
Speakers:
👤
Suzin You
📅 Tue, 18 Apr 2023 at 16:35
show details
Doing data science in international development often means finding the right-sized solution in resource-constrained settings. This talk walks you through how my team helped answer thousands of questions from pregnant folks and new parents on a South African maternal and child health helpline, which model we ended up choosing and why (hint: resource constraints!), and how we've packaged everything into a service that anyone can start for themselves. By the end of the talk, I hope you'll know how to start your own FAQ-answering service and learn about one example of doing data science in international development.
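The talk keeps its chosen model as a teaser, but to make the task concrete, here is a generic embedding-based FAQ-matching sketch (model and questions are illustrative, not the team's solution):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

faqs = [
    "When should my baby get vaccinated?",
    "Is it safe to exercise while pregnant?",
]
faq_emb = model.encode(faqs, convert_to_tensor=True)

# Match an incoming question against the FAQ bank by cosine similarity
query_emb = model.encode("can I do sport when I am pregnant", convert_to_tensor=True)
best = util.cos_sim(query_emb, faq_emb).argmax().item()
print(faqs[best])
```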
🎤
Keynote - Lorem ipsum dolor sit amet
Speakers:
👤
Miroslav Šedivý
📅 Wed, 19 Apr 2023 at 09:10
show details
A life without joy is like software without meaningful test data - it's uncertain and unreliable. The search for the perfect test data is a challenge. Real data should not be too real. Random data should not be too random. This is a randomly real and a really random journey to discover the balance between these two, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
🎤
Building Hexagonal Python Services
Speakers:
👤
Shahriyar Rzayev
📅 Wed, 19 Apr 2023 at 10:00
show details
The importance of enterprise architecture patterns is well known, and they are applicable to varied types of tasks. Thinking about the architecture from the beginning of the journey is crucial for having a maintainable, testable, and flexible code base. We are going to explore the Ports and Adapters (Hexagonal) pattern by showing a simple web app using the Repository, Unit of Work, and Services (Use Cases) patterns tied together with Dependency Injection. All those patterns are quite famous in other languages but relatively new to the Python ecosystem, where they fill a crucial gap. As a web framework, we are going to use FastAPI, which could be replaced with any other framework in no time because of the abstractions we have added.
Nearly all Python web application tutorials start with installing a web framework and a database server; the next step is to build database models and then use an ORM, etc. But wait, there is a problem with this classical approach: we lose the core business domain discussion - the so-called core domain models just get lost inside some classes and functions. How about reversing our approach? How about starting by thinking about and modeling our business and core domain, and then testing it properly? Afterward, how about adding an abstraction layer over the database, then another abstraction over the actual services and use cases? But wait, how are we going to manage all transactional usage? Okay, let's add another layer with the Unit of Work pattern to manage our work as units. Sounds cryptic? Here is a step-by-step guide to starting our project: * We are going to start with domain modeling and adding tests for our domain models * The database layer will be abstracted using a Repository pattern * The database transactions will be managed by the Unit of Work pattern * The business logic actions will be encapsulated in Use Cases The question can arise: where are our web framework and database server? Answer: good architecture lets us defer those choices until the end, because the web framework and the database server are details as far as our core application is concerned. The web framework will be treated as an entry point to our application, and the database layer will be encapsulated using the SQLAlchemy ORM - but the ORM itself is hidden behind the Repository and UoW patterns. This allows us to change the ORM library if there is any need in the future. The most important part is to understand how we are going to build our application using the Ports and Adapters (Hexagonal) pattern: all the aforementioned patterns will be divided into Ports (abstract base classes) and Adapters (the actual implementations). We can think of this as a contract between our abstractions and their implementations.
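A minimal sketch of a port, an adapter, and a use case (the domain and names are invented for illustration):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Order:  # core domain model, free of framework and ORM details
    id: int
    total: float

class OrderRepository(ABC):  # port: the contract our application depends on
    @abstractmethod
    def add(self, order: Order) -> None: ...

    @abstractmethod
    def get(self, order_id: int) -> Optional[Order]: ...

class InMemoryOrderRepository(OrderRepository):  # adapter: one concrete implementation
    def __init__(self) -> None:
        self._orders: Dict[int, Order] = {}

    def add(self, order: Order) -> None:
        self._orders[order.id] = order

    def get(self, order_id: int) -> Optional[Order]:
        return self._orders.get(order_id)

def place_order(repo: OrderRepository, order_id: int, total: float) -> Order:
    # use case: depends only on the port, so the adapter can be swapped freely
    order = Order(order_id, total)
    repo.add(order)
    return order

repo = InMemoryOrderRepository()
print(place_order(repo, 1, 99.0))
```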
🎤
Accelerating Python Code
Speakers:
👤
Jens Nie
📅 Wed, 19 Apr 2023 at 10:00
show details
Python is a beautiful language for fast prototyping and sketching ideas quickly. However, people often struggle to get their code into production, for various reasons. Besides all the security and safety concerns that usually are not addressed from the very beginning when playing around with an algorithmic idea, performance concerns are quite frequently a reason for not taking Python code to the next level. We will look at the "missing performance" worries using a simple numerical problem and see how to speed the corresponding Python code up to top-notch performance.
We all know how much fun it is to play around with an algorithmic idea in Python. It's very satisfying to see the idea develop and do what it's supposed to do, and to see how simple and elegant the code finally looks. Python being so feature complete with its standard library and the 3rd party universe of libraries and packages allows development to be very quick. And we're all very grateful to be able to focus on the problem itself, not on the language specifics, to solve it. But when we arrive at the point where everything just works, there is this one last step that needs to be mastered: get it into production to finally let it do what it was supposed to be doing and make life easier for all of us. But at that stage there are those final hurdles - and they usually feel giant - that raise unpleasant questions. Will the algorithm really do what it was supposed to be doing under all circumstances? Will it be safe? What if it fails? Will it actually be fast enough for all the data it needs to process in production? Will it be capable of doing its job in the future, when the amount of work grows? Whilst the first worries usually can be addressed well using established software engineering habits and patterns, the performance-related issue is often seen as the killer on the way to production use, as Python is still considered to be slow just based on the fact that it is an interpreted language. Quite often, code is rewritten after the prototyping phase in languages considered to be fast, such as C++, for this very reason. We'll look at exactly this point and explore ways to accelerate Python code by simple modifications and by using third-party libraries to support us. To do that we will look at some code to solve a simple numerical problem - calculating the Mandelbrot Set - as it is well suited for this and quite simple to follow. Yet it generates stunning and beautiful results, entertaining us throughout the presentation. The strategies shown to accelerate the code, based on concepts taken from the standard library, PyPy, NumPy, Numba, and Dask, are however transferable to other algorithmic problems as well. We will analyse the advantages as well as the drawbacks of each concept to see the overall effect and where else the solution might apply.
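To make the idea concrete, a minimal Numba sketch of the Mandelbrot workload (resolution and iteration count are arbitrary, not the talk's exact benchmark):

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code on first call
def mandelbrot(width, height, max_iter=100):
    image = np.zeros((height, width), dtype=np.int64)
    for i in range(height):
        for j in range(width):
            c = complex(-2.5 + 3.5 * j / width, -1.25 + 2.5 * i / height)
            z = 0j
            n = 0
            while abs(z) <= 2.0 and n < max_iter:
                z = z * z + c
                n += 1
            image[i, j] = n  # escape time determines the pixel's colour
    return image

img = mandelbrot(800, 600)  # first call compiles; repeated calls run at native speed
```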
🎤
Advanced Visual Search Engine with Self-Supervised Learning (SSL) Representations and Milvus
Speakers:
👤
Antoine Toubhans
👤
Noé Achache
📅 Wed, 19 Apr 2023 at 10:00
show details
Image retrieval is the process of searching for images in a large database that are similar to one or more query images. A classical approach is to transform the database images and the query images into embeddings via a feature extractor (e.g., a CNN or a ViT), so that they can be compared via a distance metric. Self-supervised learning (SSL) can be used to train a feature extractor without the need for expensive and time-consuming labeled training data. We will use DINO's SSL method to build a feature extractor and Milvus, an open-source vector database built for scalable similarity search, to index image representation vectors for efficient retrieval. We will compare the SSL approach with supervised and pre-trained feature extractors.
[Image Retrieval](https://en.wikipedia.org/wiki/Image_retrieval) is the task of searching a large database for the images most similar to one or more query images. It has many applications in various fields, e.g., validating whether a person's photo is contained in your database of people's photos, building a visual recommendation system, or creating a video deduplication system. Huge progress in Computer Vision in the deep learning era highlighted [Content-based Image Retrieval](https://en.wikipedia.org/wiki/Content-based_image_retrieval) (CBIR) techniques that use the image contents (features, colors, shapes, etc.) rather than metadata (keywords, tags). This gets rid of time-consuming, costly, and error-prone human annotation to produce the metadata. A classic CBIR approach consists of three steps:

1. A deep neural network called **the feature extractor** (typically a CNN or a [ViT](https://arxiv.org/pdf/2010.11929.pdf)) computes a representation of each image of the database in the form of an embedding vector.
2. The same *feature extractor* is used to compute an embedding of a query image.
3. The search is performed by retrieving the **closest** representations in this vector space using a distance metric (cosine, L1, or more complex ones).

Thereafter, two main challenges arise:

- **Quality of image representations** - the embeddings should capture the visual features that are relevant to your searches/tasks. For instance, if you intend to do face recognition, embeddings should encode eye/hair color, skin texture, nose position, etc. Traditionally, the feature extractor is trained in a supervised way. The relevance of the representations therefore hugely depends on 1) how close the training dataset is to the query images and 2) the potential visual biases in the annotations (see a [famous example here](https://medium.com/hackernoon/dogs-wolves-data-science-and-why-machines-must-learn-like-humans-do-41c43bc7f982)).
- **Speed of search in the representation space** - comparing each query image to every single image in the searched database in near real-time is challenging and expensive with large datasets.

In this talk, we will build a [Visual Search Engine](https://en.wikipedia.org/wiki/Visual_search_engine):

- We will introduce **[Self-Supervised Learning](https://en.wikipedia.org/wiki/Self-supervised_learning) (SSL)** in the context of computer vision and the [data2vec](https://arxiv.org/pdf/2202.03555.pdf) approach. Labelling data can be a time-consuming and expensive process, especially if it requires specialized knowledge or expertise. SSL does not require labelled training data to learn good representations, so it lowers the cost and time of building a model that produces good representations for our visual search engine.
- As a concrete example for this talk, we will use [DINO](https://arxiv.org/pdf/2104.14294.pdf)'s SSL method to build a feature extractor.
- We will compare the DINO feature extractor with supervised pre-trained feature extractors. We will show the main differences between the obtained representations: SSL ones are generally richer (more visual features are in the representation), whereas supervised learning introduces a natural semantic bias in the representations. In addition, we will present practical tools to understand the visual features encoded in the embeddings (activation maps, grad-cams, self-attention maps for transformers).
- We will present [Milvus](https://milvus.io/), a vector database built for scalable similarity search: it's an open-source search engine tool (14.5k stars on GitHub) that is suitable for production use cases, as it can be easily scaled and managed. Milvus uses [Approximate Nearest Neighbors (ANN) methods](https://milvus.io/docs/v2.0.x/index.md#Selecting-an-Index-Best-Suited-for-Your-Scenario) to build vector indexes that improve retrieval efficiency by sacrificing accuracy within an acceptable range.
- We will use the Milvus Python API to index the image representation vectors: as a result, the images most similar to a query image can be retrieved in a split second, even for datasets containing millions of vectors.

By the end of the session, participants will have learned how to build a Visual Search Engine using Milvus with pre-trained self-supervised and supervised models.
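As a rough sketch of the indexing step described above, assuming a locally running Milvus instance and random vectors standing in for the DINO embeddings (the collection layout and index parameters are illustrative, not the speakers' exact setup):

```python
import numpy as np
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

connections.connect(host="localhost", port="19530")  # assumes a local Milvus

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
]
collection = Collection("images", CollectionSchema(fields))

# random vectors standing in for DINO embeddings of the image database
vectors = np.random.rand(1000, 384).tolist()
collection.insert([vectors])

collection.create_index("embedding", {"index_type": "IVF_FLAT",
                                      "metric_type": "L2",
                                      "params": {"nlist": 128}})
collection.load()

# retrieve the five nearest neighbours of a query embedding
hits = collection.search(data=[vectors[0]], anns_field="embedding",
                         param={"metric_type": "L2", "params": {"nprobe": 10}},
                         limit=5)
print(hits[0].ids)
```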
🎤
Why GPU Clusters Don't Need to Go Brrr? Leverage Compound Sparsity to Achieve the Fastest Inference Performance on CPUs
Speakers:
👤
Damian Bogunowicz
📅 Wed, 19 Apr 2023 at 10:00
show details
Forget specialized hardware. Get GPU-class performance on your commodity CPUs with compound sparsity and sparsity-aware inference execution. This talk will demonstrate the power of compound sparsity for model compression and inference speedup for NLP and CV domains, with a special focus on the recently popular Large Language Models. The combination of structured + unstructured pruning (to 90%+ sparsity), quantization, and knowledge distillation can be used to create models that run an order of magnitude faster than their dense counterparts, without a noticeable drop in accuracy. The session participants will learn the theory behind compound sparsity, state-of-the-art techniques, and how to apply it in practice using the Neural Magic platform.
By intelligently applying SOTA compound sparsity techniques, we can remove 95%+ of the weights and reduce the remaining 5% to 8-bit precision on modern models such as BERT, while maintaining 99%+ of their baseline accuracy. In this talk, we'll cover how we can build up to this extreme sparsity and how to harness it to achieve an order-of-magnitude speedup for CPU inference. This talk will focus on the success story of utilizing sparsity to run fast inference of modern neural networks on CPUs. We will focus on the popular Large Language Models, with the goal of learning how the recent state of the art in model compression can help dramatically lower the computational budget when it comes to model inference. Today's ML hardware acceleration is headed towards chips that apply a petaflop of compute to a cell-phone-size memory. Our brains, on the other hand, are biologically the equivalent of applying a cell phone of compute to a petabyte of memory. In this sense, the direction being taken by hardware designers is the opposite of that proven by nature. Why? Simply because we don't know the algorithms nature uses. GPUs bring data in and out quickly, but have little locality of reference because of their small caches. They are geared towards applying a lot of compute to little data, not little compute to a lot of data. The networks are designed to run on them full layer after full layer in order to saturate their computational pipeline. CPUs, on the other hand, have large, much faster caches than GPUs, and have an abundance of memory (terabytes). A typical CPU server can have memory equivalent to tens or even hundreds of GPUs. CPUs are perfect for a brain-like ML world in which parts of an extremely large network are executed piecemeal, as needed. This is the problem Neural Magic set out to solve, and the perspective that led to the creation of DeepSparse, a custom computational engine designed to mimic, on commodity hardware, the way brains compute. It uses neural network sparsity combined with locality of reference, utilizing the CPU's large fast caches and its very large memory.
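As a small illustration of the unstructured-pruning ingredient of compound sparsity, here is plain PyTorch magnitude pruning on a single layer; Neural Magic's own tooling drives this kind of sparsification with recipes and then exploits it at inference time, which this sketch does not attempt:

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(768, 768)

# magnitude pruning: zero out the 90% of weights with the smallest magnitude
prune.l1_unstructured(layer, name="weight", amount=0.9)

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.1%}")  # ~90.0%
```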
🎤
Create interactive Jupyter websites with JupyterLite
Speakers:
👤
Jeremy Tuloup
📅 Wed, 19 Apr 2023 at 10:00
show details
Jupyter notebooks are a popular tool for data science and scientific computing, allowing users to mix code, text, and multimedia in a single document. However, sharing Jupyter notebooks can be challenging, as they require installing a specific software environment to be viewed and executed. JupyterLite is a Jupyter distribution that runs entirely in the web browser without any server components. A significant benefit of this approach is the ease of deployment: with JupyterLite, the only requirement to provide a live computing environment is a collection of static assets. In this talk, we will show how you can create such a static website and deploy it to your users.
We will cover the basics of JupyterLite, including how to use its command-line interface to generate and customize the appearance and behavior of your Jupyter website. This will be a guided walkthrough with step-by-step instructions for adding content, extensions, and configuration. By the end of this tutorial, you will be able to create your own interactive Jupyter website using JupyterLite. Outline:

- Introduction to Jupyter and JupyterLite
- Examples of JupyterLite used for interactive documentation and educational content (NumPy, Try Jupyter, SymPy)
- Step-by-step demo for creating a Jupyter website
  - Quickstart with the demo repository
  - Adding content: notebooks, files and static assets
  - Adding extensions to the user interface
  - Adding packages to the Python runtime
  - Customization and custom settings
  - Deploy JupyterLite as a static website on GitHub Pages, Vercel or your own server
- Conclusion and next steps for learning more about the Jupyter ecosystem

The tutorial will be based on resources already publicly available:

- Try JupyterLite in your browser: https://jupyterlite.github.io/demo/
- The JupyterLite documentation: https://jupyterlite.readthedocs.io/en/latest/quickstart/deploy.html
- The JupyterLite repositories: https://github.com/jupyterlite

At the end of the tutorial the attendees will have something very concrete to show: a functioning Jupyter website.
🎤
The Spark of Big Data: An Introduction to Apache Spark
Speakers:
👤
Pasha Finkelshteyn
📅 Wed, 19 Apr 2023 at 10:00
show details
Get ready to level up your big data processing skills! Join us for an introductory talk on Apache Spark, the distributed computing system used by tech giants like Netflix and Amazon. We'll cover PySpark DataFrames and how to use them. Whether you're a Python developer new to big data or looking to explore new technologies, this talk is for you. You'll gain foundational knowledge about Apache Spark and its capabilities, and learn how to leverage DataFrames and SQL APIs to efficiently process large amounts of data. Don't miss out on this opportunity to up your big data game!
🎤
Monorepos with Python
Speakers:
👤
AbdealiLoKo
📅 Wed, 19 Apr 2023 at 10:00
show details
Working with Python is fun. Managing Python packaging, linters, tests, CI, etc. is not as fun. Every maintainer needs to worry about consistent styling, quality, speed of tests, etc. as the project grows. Monorepos have been successful in other communities - how does it work in Python?
As a Python project grows (within 2-3 years), you will go down one of two paths:

- Create a monolith
- Modularize your code into smaller packages

Either way, you will be affected by the many other libraries you use, and modularity is a requirement for any good project. But managing multiple modular packages becomes tough over time:

1. How do you ensure coding standards (quality, styling, etc.) are consistent across them?
2. How do we ensure all the packages work correctly without spending hours and hours of CI time?
3. How can common logical pieces be modularized further and still stay DRY?

These are common issues I have faced by the 2-3 year mark in any active project, and if not addressed early, they can cause your project to get messy very quickly. This talk aims to discuss these common issues and how a monorepo structure, widely popular in other communities like NodeJS, can also be applied to Python. We also discuss the crux of the issue:

- Making your code structure machine-understandable
- How this structured information can then be used to optimize workloads
- How this structured information can be used to automate tasks

And we go into how **monorepo tools** like pants, bazel, nx, etc. leverage this code-structure information to simplify your life as a maintainer.
🎤
Thou Shall Judge But With Fairness: Methods to Ensure an Unbiased Model
Speakers:
👤
Nandana Sreeraj
📅 Wed, 19 Apr 2023 at 10:50
show details
Is your model prejudicial? Is your model deviating from the predictions it ought to have made? Has your model misunderstood the concept? In the world of artificial intelligence and machine learning, the word "fairness" is particularly common. It is described as having the quality of being impartial or fair. Fairness in ML is essential for contemporary businesses. It helps build consumer confidence and demonstrates to customers that their issues are important. Additionally, it aids in ensuring adherence to guidelines established by authorities, thus guaranteeing that the idea of responsible AI is upheld. In this talk, let's explore how certain sensitive features influence a model and introduce bias into it. We'll also look at how we can make it better.
We cannot escape thinking about fairness through numbers and math. Models are not fair simply because they are mathematical, contrary to popular belief. AI systems are subject to bias. It may be inherent, due to historical bias in the training dataset. There may be label bias, which occurs when the set of labeled data is not a full representation of the entire universe of potential labels. Another potential bias is sampling bias, which occurs when certain people in the intended universe have a higher or lower sampling probability than others. Models learn from such biased datasets, which may lead to unfair decisions, and as cascading models are developed, this bias continues to spread. Model fairness is a pressing concern. Unfair AI systems can create recurring losses for businesses and damage a company's commercial value, eroding customer trust, inviting reputational harm, and decreasing transparency. As a result, model fairness is becoming increasingly necessary. In this talk, I will gently introduce you to the above concepts and some open-source libraries that help us assess ML models' fairness. Lastly, I will walk you through how to assess the fairness of a model for a law school dataset using Fairlearn, an open-source library by Microsoft, and the measures that can be taken to mitigate unfairness. My talk will focus on:

1. What metrics need to be considered for assessing the fairness of an ML model?
2. What mitigation measures can be implemented?
3. Python code to gauge the fairness of a model trained on a law school dataset using Fairlearn, and steps to mitigate bias in the model.
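As a taste of the tooling, a minimal Fairlearn sketch on toy data (the arrays and the choice of demographic parity as the metric are illustrative only):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, demographic_parity_difference

# toy predictions with a binary sensitive feature
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
sex = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])

# accuracy broken down by group
mf = MetricFrame(metrics=accuracy_score, y_true=y_true, y_pred=y_pred,
                 sensitive_features=sex)
print(mf.by_group)

# gap in positive prediction rates between groups (0 means parity)
print(demographic_parity_difference(y_true, y_pred, sensitive_features=sex))
```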
🎤
Unlocking Information - Creating Synthetic Data for Open Access.
Speakers:
👤
Antonia Scherz
📅 Wed, 19 Apr 2023 at 10:50
show details
Many good project ideas fail before they even start due to the sensitive personal data required. The good news: a synthetic version of this data does not need protection. Synthetic data copies the actual data's structure and statistical properties without recreating personally identifiable information. The bad news: it is difficult to create synthetic data for open-access use without recreating an exact copy of the actual data. This talk will give hands-on insights into synthetic data creation and the challenges along its lifecycle. We will learn how to create and evaluate synthetic data for any use case using the open-source package Synthetic Data Vault. We will find answers to why it takes so long to synthesize the huge amount of data dormant in public administration. The talk addresses data owners who want to open up access to their private data as well as analysts looking to use synthetic data. After this session, listeners will know which steps to take to generate synthetic data for multi-purpose use, and its limitations for real-world analyses.
A vast amount of private data lies dormant in public institutions, hidden from the research community. Synthesizing complex, anonymized data could allow researchers access without disclosing personally identifiable information while keeping information loss minimal. The tools to do this exist, so why is it still difficult to realize synthetic solutions? One challenge is reaching the minimum viable quality to serve as many use cases as possible. Ideally, the synthetic data allows data exploration with the same results as the real data. We will guide you through the challenges of creating synthetic data and shine a light on its lifecycle. We will explore the different levels of quality of generated structured data and discuss their potential. Finally, we will link these issues to the domain of public administration, but the main insights are generally applicable to all kinds of domains. In particular, we will focus on four key questions:

1. How can we create synthetic data from private data?
2. How can synthetic data creation be integrated into institutions that sit on piles of unused, highly private data?
3. Can SOTA methods for synthetic data fulfill all needs of the research community? When is access to the actual, private data needed?
4. Which quality measures are adequate for synthetic data?

As we address these questions, we'll use the Synthetic Data Vault to create and evaluate synthetic data. After the talk, listeners will have understood the concept of synthetic data and will be able to evaluate synthetic data for a plethora of use cases. As a plus, they will also gain a deeper understanding of why open data access is (not yet) solved by synthetic data.
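A minimal sketch of the fit/sample workflow, assuming the pre-1.0 `sdv.tabular` API (current SDV releases expose the same idea through `sdv.single_table` synthesizers and a metadata object):

```python
import pandas as pd
from sdv.tabular import GaussianCopula  # pre-1.0 API; newer SDV uses sdv.single_table

real = pd.DataFrame({
    "age": [34, 51, 29, 43, 62, 38],
    "income": [42_000, 58_000, 31_000, 50_000, 71_000, 45_000],
})

model = GaussianCopula()
model.fit(real)                       # learn structure and statistics
synthetic = model.sample(num_rows=6)  # draw new rows that mimic the real data
print(synthetic)
```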
🎤
Teaching Neural Networks a Sense of Geometry
Speakers:
👤
Jens Agerberg
📅 Wed, 19 Apr 2023 at 10:50
show details
By taking neural networks back to the school bench and teaching them some elements of geometry and topology we can build algorithms that can reason about the shape of data. Surprisingly these methods can be useful not only for computer vision – to model input data such as images or point clouds through global, robust properties – but in a wide range of applications, such as evaluating and improving the learning of embeddings, or the distribution of samples originating from generative models. This is the promise of the emerging field of Topological Data Analysis (TDA) which we will introduce and review recent works at its intersection with machine learning. TDA can be seen as being part of the increasingly popular movement of Geometric Deep Learning which encourages us to go beyond seeing data only as vectors in Euclidean spaces and instead consider machine learning algorithms that encode other geometric priors. In the past couple of years TDA has started to take a step out of the academic bubble, to a large extent thanks to powerful Python libraries written as extensions to scikit-learn or PyTorch.
Researchers have hypothesised that a sense of geometry is something that sets the intelligence of humans apart from that of other animals. This intriguing fact motivates why geometric reasoning can be an interesting direction for AI. How can we incorporate geometric concepts into deep learning? We can tap into the mathematical fields of geometry and topology and see how methods in these fields can be adapted for use in data analysis and machine learning. This is the aim of Topological Data Analysis. Starting from hierarchical clustering, which many data scientists are familiar with, we gently introduce a method used in TDA, where we look at the clustering of a data set at different thresholds and form a topological summary which represents the creation and destruction of clusters (an example of a topological feature) at different thresholds. We then look at a few examples where these methods can be useful:

- In neuroscience, we can use these methods to model neuronal or glial trees, capturing properties of important branching structures and incorporating the invariances that these objects have.
- In image segmentation, we would like to teach a neural network to take the shape of the segmentation masks into consideration, where some of the classical loss functions can't account for these kinds of global properties.
- For dimensionality reduction, we can argue that minimising a reconstruction loss is not enough; instead, we would like to somehow make sure that the shape of the original dataset and its dimensionality-reduced version are similar.
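For a concrete feel of the topological summary described above, here is a small sketch using giotto-tda, one of the scikit-learn-style TDA libraries the abstract alludes to (the noisy circle is a toy stand-in for real data):

```python
import numpy as np
from gtda.homology import VietorisRipsPersistence

# one noisy circle: a point cloud whose "shape" contains a single loop
theta = np.random.uniform(0, 2 * np.pi, 100)
cloud = np.stack([np.cos(theta), np.sin(theta)], axis=1)
cloud += np.random.normal(scale=0.05, size=cloud.shape)

# track connected components (H0) and loops (H1) across scales
vr = VietorisRipsPersistence(homology_dimensions=(0, 1))
diagrams = vr.fit_transform(cloud[None, :, :])  # shape: (1, n_features, 3)
print(diagrams.shape)  # each row is a (birth, death, dimension) triple
```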
🎤
Shrinking gigabyte sized scikit-learn models for deployment
Speakers:
👤
Pavel Zwerschke
👤
Yasin Tatar
📅 Wed, 19 Apr 2023 at 10:50
show details
We present an open-source library to shrink pickled scikit-learn and LightGBM models. We will provide insights into how pickling ML models works and how to improve the on-disk representation. With this approach, we can reduce the deployment size of machine learning applications by up to 6x.
At QuantCo, we create value from data using machine learning. To that end, we frequently build gigabyte-sized machine learning models. However, deploying and sharing those models can be a challenge because of their size. We built and open-sourced a library to aggressively compress tree-based machine learning models: [slim-trees](https://github.com/pavelzw/slim-trees). In this talk, we share our journey and the ideas that went into the making of slim-trees. We delve into the internals of scikit-learn's tree-based models to understand their memory footprint. Afterwards, we explore different techniques that allow us to reduce model size without sacrificing predictive performance. Finally, we present how to include slim-trees in your project and give an outlook on what's to come.
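A sketch of the problem and the intended usage; the baseline size measurement is plain scikit-learn, while the `dump_sklearn_compressed` call follows the slim-trees README as we recall it and should be checked against the repository:

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=5_000, n_features=20, random_state=0)
model = RandomForestRegressor(n_estimators=100).fit(X, y)

# the baseline: a plain pickle of the fitted forest
print(f"pickle size: {len(pickle.dumps(model)) / 1e6:.1f} MB")

# slim-trees stores the tree arrays more compactly; the function name
# is taken from the project README as we recall it -- verify against the repo
from slim_trees import dump_sklearn_compressed

dump_sklearn_compressed(model, "model.pkl.lzma")
```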
🎤
Haystack for climate Q/A
Speakers:
👤
Vibha Vikram Rao
📅 Wed, 19 Apr 2023 at 10:50
show details
How can NLP and Haystack help answer sustainability questions and fight climate change? In this talk we walk through our experience using Haystack to build Question Answering models for the climate change and sustainability domain. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!
Haystack is a framework that enables you to build powerful, production-ready pipelines for different search use cases. You can use the state-of-the-art NLP models in Haystack to provide unique search experiences and allow your users to query in natural language. It is built in a modular fashion so that you can combine the best technology from other open-source projects like Transformers, Elasticsearch, etc. We use Haystack pipelines to build Question Answering systems that answer domain-specific questions about climate change and sustainability topics. We would like to talk about the challenges we faced, how we did it, and how using Haystack can help companies build quicker POCs and eventually take them to production.
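A minimal extractive QA pipeline of the kind described, assuming a 2023-era Haystack 1.x install (the documents and model choice are illustrative):

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

store = InMemoryDocumentStore(use_bm25=True)
store.write_documents([
    {"content": "Scope 1 covers direct emissions from owned sources."},
    {"content": "Scope 2 covers indirect emissions from purchased energy."},
])

pipeline = ExtractiveQAPipeline(
    reader=FARMReader(model_name_or_path="deepset/roberta-base-squad2"),
    retriever=BM25Retriever(document_store=store),
)
result = pipeline.run(query="What does Scope 2 cover?",
                      params={"Retriever": {"top_k": 2}, "Reader": {"top_k": 1}})
print(result["answers"][0].answer)
```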
🎤
Most of you don't need Spark. Large-scale data management on a budget with Python
Speakers:
👤
Guillem Borrell Nogueras
📅 Wed, 19 Apr 2023 at 11:40
show details
The Python data ecosystem has matured during the last decade, and there are fewer and fewer reasons to rely only on large batch processes executed in a Spark cluster; but as with every large ecosystem, putting together the key pieces of technology takes some effort. There are now better storage technologies, streaming execution engines, query planners, and low-level compute libraries. And modern hardware is way more powerful than what you'd probably expect. In this workshop we will explore some global-warming-reducing techniques to build more efficient data transformation pipelines in Python, and a little bit of Rust.
When one looks at the architecture diagram for the big data ecosystem of most corporations, there's a Spark cluster in the center. Some of these corporations have even adopted Spark as the "de facto" platform for ETL. If you have a Spark cluster, it's fine to use it, but maybe there are other ways to extract, transform, and load large volumes of data more efficiently and with less overhead. Some of the technologies that we'll cover are:

* DuckDB. Probably the hottest piece of technology of this decade.
* Polars.
* Datafusion, and a little bit of Rust.
* Microbatching.
* Statistical tests.
* We'll dive a little into what makes Parquet datasets so great.
* Filter pushdown and predicate pushdown.
* Overlapping communication and computation.

We'll work on a synthetic use case where we'll try to find out if an online casino is trying to manipulate the roulette boards. To make things harder, we'll use an old and crappy low-power desktop PC with the equivalent computing power of a modern Raspberry Pi to crunch around half a terabyte of data.
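As a taste of the single-node approach, a DuckDB sketch with made-up file paths and column names; the point is that the Parquet scan only reads the columns and row groups the query actually needs:

```python
import duckdb

con = duckdb.connect()  # in-process: no cluster, no server

# file layout and column names are invented for this sketch; projections
# and filters are pushed down into the Parquet scan, so only the needed
# columns and row groups are read from disk
df = con.execute("""
    SELECT wheel_id, outcome, count(*) AS spins
    FROM read_parquet('spins/*.parquet')
    WHERE spun_at >= DATE '2023-01-01'
    GROUP BY wheel_id, outcome
""").df()
print(df.head())
```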
🎤
Workshop on Privilege and Ethics in Data
Speakers:
👤
Tereza Iofciu
👤
Paula Gonzalez Avalos
📅 Wed, 19 Apr 2023 at 11:40
show details
Data-driven products are becoming more and more ubiquitous. Humans build data-driven products. Humans are intrinsically biased. This bias goes into the data-driven products, confirming and amplifying the original bias. In this tutorial, you will learn how to identify your own -often unperceived- biases and reflect on and discuss the consequences of unchecked biases in Data Products.
Data-driven products are becoming more and more ubiquitous across industries. Data-driven products are built by humans. Humans are intrinsically biased. This bias goes into the data-driven products, which then amplify the original bias. As a consequence, the power imbalances in a data-driven world tend to get bigger instead of smaller, most of the time unintentionally. This is particularly prevalent in the tech sector, where teams are not diverse. One of the obvious solutions is to get diverse teams, but when considering all the intersections of diversity, achieving full diversity is practically an impossible task. Therefore we see education and awareness as foundational steps towards working for a more equitable data world. This tutorial has two parts. In the first exercise, we will start by revisiting our own privileges, as a tool to better educate ourselves in order to identify our individual - often unperceived - biases. In the second part, we will evaluate what happens when these biases occur on a group level and go unchecked into our data products, based on the Data Feminism book and enriched with our own experiences as data professionals. Education about privilege and ethics in the data-driven world can only improve how we see and work with data, and help us better understand how our work with data can affect others.
🎤
Prompt Engineering 101: Beginner intro to LangChain, the shovel of our ChatGPT gold rush
Speakers:
👤
Lev Konstantinovskiy
📅 Wed, 19 Apr 2023 at 11:50
show details
"A modern AI start-up is a front-end developer plus a prompt engineer" is a popular joke on Twitter. This talk is about LangChain, a Python open-source tool for prompt engineering. You can use it with completely open-source language models or with ChatGPT. I will show you how to create a prompt and get an answer from an LLM. As an example application, I will show a demo of an intelligent agent that uses web search and generates Python code to answer questions about this conference.
There is a gold rush to apply AI to anything nowadays. Anyone can do it; you no longer need to be a Machine Learning Engineer! Just write some prompts for ChatGPT. There is a saying: "During a gold rush, sell shovels." This talk is about a wonderful tool, LangChain, as easy to use as a good shovel. LangChain is a Python open-source tool for prompt engineering. You can use it with completely open-source language models or with ChatGPT. The project started 6 months ago and now has 25k GitHub stars and has raised $10 million. What is all this about? This talk is a gentle introduction. It will show how to:

- create a simple prompt
- get an answer from a Large Language Model of your choice - local or API
- chain requests together to search the web and use the Python REPL
- make the LLM choose which tools to use for complex questions
- answer questions over a collection of long documents

As an example application, we will code an AI agent to answer "When is the PyCon DE & PyData Berlin 2023 conference? How many days are between that date and today?" using web search and the Python REPL.
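The smallest possible prompt-to-answer round trip looks roughly like this, assuming the 2023-era LangChain API and an OpenAI key in the environment (the prompt itself is just an example):

```python
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# a prompt template with a single variable to fill in
prompt = PromptTemplate(
    input_variables=["question"],
    template="Answer in one sentence: {question}",
)

# chain the template to an LLM; assumes OPENAI_API_KEY is set
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
print(chain.run(question="What is prompt engineering?"))
```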
🎤
The future of the Jupyter Notebook interface
Speakers:
👤
Jeremy Tuloup
📅 Wed, 19 Apr 2023 at 11:50
show details
Jupyter Notebooks have been a widely popular tool for data science in recent years due to their ability to combine code, text, and visualizations in a single document. Despite its popularity, the core functionality and user experience of the Classic Jupyter Notebook interface has remained largely unchanged over the past years. Lately the Jupyter Notebook project decided to base its next major version 7 on JupyterLab components and extensions, which means many JupyterLab features are also available to Jupyter Notebook users. In this presentation, we will demo the new features coming in Jupyter Notebook version 7 and how they are relevant to existing users of the Classic Notebook.
Jupyter Notebook 7 is based on the JupyterLab codebase but provides an equivalent user experience to the current (version 6) application. Notebook 7 keeps the document-centric user experience at its core and brings many new features that were not previously available:

- Debugger
- Real-time collaboration
- Theming and dark mode
- Internationalization
- Improved Web Content Accessibility Guidelines (WCAG) compliance
- Support for many JupyterLab extensions, including Jupyter LSP (Language Server Protocol) for enhanced code completions
- Performance improvements

This talk will demo the new features coming to Notebook 7 and discuss how users of the Classic Notebook interface should approach the transition. We will also cover other aspects mentioned in the related Jupyter Enhancement Proposal, such as support for popular extensions and future developments: https://jupyter.org/enhancement-proposals/79-notebook-v7/notebook-v7.html
🎤
Modern typed python: dive into a mature ecosystem from web dev to machine learning
Speakers:
👤
samsja
📅 Wed, 19 Apr 2023 at 11:50
show details
Typing is at the center of "modern Python", and tools (mypy, beartype) and libraries (FastAPI, SQLModel, Pydantic, DocArray) based on it are slowly eating the Python world. This talk explores the benefits of Python type hints and shows how they are infiltrating the next big domain: machine learning.
The talk will focus on **modern Python** and its extensive usage of **type hints and static type analysis**. There will be a special focus on **DocArray and multi-modal AI applications**. The talk will cover different topics around modern Python:

- The history of Python and type hints. How did Python go from being a language without static typing to having static type analysis?
- The state of the modern Python ecosystem in 2023:
  - Powerful development tools like mypy and beartype. A parallel with TypeScript.
  - Powerful libraries that leverage type hints: Pydantic, FastAPI, SQLModel, Typer, DocArray
- A deep dive on DocArray and the future of AI-based web apps:
  - Why is modern Python key to speeding up the development of multi-modal AI applications (stable diffusion, neural search, …)?
  - What is DocArray, and how does it extend Pydantic with multi-modal AI in mind?
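A tiny sketch of the pattern the talk revolves around: the same type hints serve mypy at development time and Pydantic at runtime (the `Offer` model is a made-up example):

```python
from pydantic import BaseModel, ValidationError

class Offer(BaseModel):  # hypothetical model, just to show the pattern
    title: str
    price: float

def total(offers: list[Offer]) -> float:  # mypy checks this signature statically
    return sum(o.price for o in offers)

print(total([Offer(title="book", price=9.99)]))

try:
    Offer(title="broken", price="not a number")
except ValidationError as err:
    print(err)  # the type hints are enforced at runtime, too
```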
🎤
Grokking Anchors: Uncovering What a Machine-Learning Model Relies On
Speakers:
👤
Kilian Kluge
📅 Wed, 19 Apr 2023 at 11:50
show details
Assessing the robustness of models is an essential step in developing machine-learning systems. To determine if a model is sound, it often helps to know which and how many input features its output hinges on. This talk introduces the fundamentals of “anchor” explanations that aim to provide that information.
Many data scientists are familiar with algorithms like Integrated Gradients, SHAP, or LIME that determine the importance of input features. But that’s not always the information we need to determine whether a model’s output is sound. Is there a specific feature value that will make or break the decision? Does the outcome solely depend on artifacts in an image? These questions require a different explanation method. First introduced in 2018, “anchors” are a model-agnostic method to uncover what parts of the input a machine-learning model's output hinges on. Their computation is based on a search-based approach that can be applied to different modalities such as image, text, and tabular data. In this talk, to truly grok the concept of anchor explanations, we will implement a basic anchor algorithm from scratch. Starting with nothing but a text document and a machine learning model, we will create a sampling, encoding, and search component and finally compute an anchor. No knowledge of machine learning is required to follow this talk. Aside from familiarity with the basics of `numpy` arrays, all you need is your curiosity.
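To make the idea concrete, here is a toy version of the precision estimate at the heart of anchor search, with a keyword rule standing in for a real model; the actual algorithm searches over candidate anchors with a bandit-style strategy, which this sketch omits:

```python
import numpy as np

def predict(text: str) -> str:
    # stand-in "model": a simple keyword rule
    return "positive" if "great" in text else "negative"

def anchor_precision(tokens, anchor_idx, n_samples=1000, p_keep=0.5):
    """How often does the prediction survive random removal of
    non-anchor tokens? (the precision of a candidate anchor)"""
    rng = np.random.default_rng(0)
    base = predict(" ".join(tokens))
    hits = 0
    for _ in range(n_samples):
        keep = rng.random(len(tokens)) < p_keep
        keep[list(anchor_idx)] = True  # anchor tokens always stay
        hits += predict(" ".join(t for t, k in zip(tokens, keep) if k)) == base
    return hits / n_samples

# "great" (token 3) anchors the prediction: precision should be ~1.0
print(anchor_precision("this movie was great fun".split(), anchor_idx=[3]))
```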
🎤
What are you yield from?
Speakers:
👤
Maxim Danilov
📅 Wed, 19 Apr 2023 at 11:50
show details
Many developers avoid using generators. For example, many well-known Python libraries use lists instead of generators. Generators themselves are slower than plain list loops, but their use in code can greatly increase the speed of an application. Let's discover why.
Many developers avoid using generators in regular Python code: they are hard to debug, not easy to profile, not obvious to refactor, and they require special algorithms. In this talk I speak about generator pipelines, one-line generators, built-in generators, and custom generators with yield and yield from. I will show how to use generators and why we should use them. We will also learn about situations where we can't use generators and how to change our thinking to avoid such situations in the future. I give some hints and examples of how big Python frameworks use lists instead of generators and therefore lose performance. At the end we will see how the built-in zip function works in another world, where developers always use generators in their own code. Let's see what we can yield from this talk…
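A small illustration of the generator-pipeline style the talk advocates: each stage produces values lazily, so the pipeline processes ten million items without ever materialising a list:

```python
def squares(limit):
    for n in range(limit):
        yield n * n  # produced lazily, one value at a time

def evens(values):
    for v in values:
        if v % 2 == 0:
            yield v

# the pipeline never materialises a list, so memory use stays flat
total = sum(evens(squares(10_000_000)))
print(total)
```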
🎤
Maximizing Efficiency and Scalability in Open-Source MLOps: A Step-by-Step Approach
Speakers:
👤
Paul Elvers
📅 Wed, 19 Apr 2023 at 12:25
show details
This talk presents a novel approach to MLOps that combines the benefits of open-source technologies with the power and cost-effectiveness of cloud computing platforms. By using tools such as Terraform, MLflow, and Feast, we demonstrate how to build a scalable and maintainable ML system on the cloud that is accessible to ML Engineers and Data Scientists. Our approach leverages cloud managed services for the entire ML lifecycle, reducing the complexity and overhead of maintenance and eliminating the vendor lock-in and additional costs associated with managed MLOps SaaS services. This innovative approach to MLOps allows organizations to take full advantage of the potential of machine learning while minimizing cost and complexity.
Building a machine learning (ML) system on a cloud platform can be a challenging and time-consuming task, especially when it comes to selecting the right tools and technologies. In this talk, we will present a comprehensive solution for building scalable and maintainable ML systems on the cloud using open-source technologies like MLflow, Feast, and Terraform. MLflow is a powerful open-source platform that simplifies the end-to-end ML lifecycle, including experimentation, reproducibility, and deployment. It allows you to track and compare different runs of your ML models and deploy them to various environments, such as production or staging, with ease. Feast is an innovative open-source feature store that enables you to store and serve features for training, serving, and evaluating ML models. It integrates seamlessly with MLflow, enabling you to track feature versions and dependencies and deploy feature sets to different environments. Terraform is a widely used open-source infrastructure-as-code (IaC) tool that enables you to define and manage your cloud resources in a declarative manner. It allows you to automate the provisioning and management of your ML infrastructure, such as compute clusters, databases, and message brokers, saving you time and effort. In this talk, we will demonstrate how these open-source technologies can be used together to build an ML system on the cloud and discuss the benefits and trade-offs of using them. We will also share best practices and lessons learned from our own experiences building ML systems on the cloud, providing valuable insights and guidance for attendees looking to do the same.
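For readers new to the stack, the experiment-tracking part boils down to a few calls (the tracking URI and the logged values are placeholders):

```python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # assumed tracking server

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("n_estimators", 200)  # record the configuration...
    mlflow.log_metric("rmse", 0.42)        # ...and the resulting quality
```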
🎤
How to connect your application to the world (and avoid sleepless nights)
Speakers:
👤
Luis Fernando Alvarez
📅 Wed, 19 Apr 2023 at 12:25
show details
Let's say you are the ruler of a remote island. For it to succeed and thrive, you can't expect it to be isolated from the world. You need to establish trade routes, offer your products to other islands, and import items from them. Doing this will certainly make your economy grow! We're not going to talk about land masses or commerce, however; you should think of your application as an island that needs to connect to other applications to succeed. Unfortunately, the sea is treacherous and not always very consistent, much like the networks you use to connect your application to the world. We will explore some techniques and libraries in the Python ecosystem used to make your life easier while dealing with external services. From asynchronicity, caching, and testing to building abstractions on top of the APIs you consume, you will definitely learn some strategies to build your connected application gracefully and avoid those pesky 2 AM errors that keep you awake.
This talk will explore best practices for distributed programming in Python and how to solve some of the more common issues when dealing with external systems. We will be exploring a few techniques that can help your system be reliable and available, even if your external services aren't. Agenda:

- Introduction - 2 min
- The problems around distributed computing - 3 min
- Caching - 5 min
- Asynchronous task queuing - 5 min
- Building API abstractions - 5 min
- Testing - 5 min
- Closing statements - 5 min
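One of the simplest defences against a flaky external service is retrying with exponential backoff; here is a sketch using the `tenacity` library and a hypothetical endpoint (the talk may well use different tools):

```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, max=30))
def fetch_rates():
    # hypothetical endpoint; timeouts keep a slow network from hanging us
    resp = requests.get("https://api.example.com/rates", timeout=5)
    resp.raise_for_status()  # treat HTTP errors as failures worth retrying
    return resp.json()
```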
🎤
Dynamic pricing at Flix
Speakers:
👤
Amit Verma
📅 Wed, 19 Apr 2023 at 12:25
show details
In the talk we give a brief overview of how we use dynamic pricing to tune the prices for rides based on demand, time of purchase, unexpected events such as strikes, and other criteria to fulfil our business requirements.
Dynamic pricing is more challenging at Flixbus than at other travel companies, as we do not differentiate prices by categories such as business and economy class, which are often used by trains and airlines. In the talk, we describe the challenges we faced and discuss how we designed innovative solutions to solve them. The main topic I want to present is how we implemented a real-time pipeline to calculate prices based on current demand, and at the same time how it reacts to changes, for example bookings and route changes. I will also present some of the efficient data structures we use to apply changes quickly and efficiently.
🎤
Streamlit meets WebAssembly - stlite
Speakers:
👤
Yuichiro Tachibana
📅 Wed, 19 Apr 2023 at 12:25
show details
Streamlit, a pure-Python data app framework, has been ported to Wasm as "stlite". See its power and convenience with many live examples and explore its internals from a technical perspective. You will learn to quickly create interactive in-browser apps using only Python.
Streamlit lets you create interactive web apps with Python, and its WebAssembly port "stlite" extends its power to in-browser apps. "stlite" offers offline capability, data privacy, scalability, and multi-platform portability, while keeping Streamlit's original strengths such as Python productivity and its rich ecosystem. In this talk, after a short intro to Streamlit, we will review stlite in the context of the recent emergence of various Wasm-based Python frameworks such as PyScript, and show you what's possible with stlite. We will also look at its internals from a technical perspective, which may inspire you with ideas about how to make use of Pyodide and how to port Python frameworks to the Pyodide/Wasm runtime.
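For context, this is the kind of ordinary Streamlit script that stlite can run entirely in the browser, unchanged:

```python
import random

import streamlit as st

st.title("Dice roller")  # runs the same under Streamlit or stlite
n = st.slider("Number of dice", 1, 10, 2)
if st.button("Roll"):
    st.write([random.randint(1, 6) for _ in range(n)])
```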
🎤
Code Cleanup: A Data Scientist's Guide to Sparkling Code
Speakers:
👤
Corrie Bartelheimer
📅 Wed, 19 Apr 2023 at 12:25
show details
Does your production code look like it’s been copied from Untitled12.ipynb? Are your engineers complaining about the code but you can’t find the time to work on improving the code base? This talk will go through some of the basics of clean coding and how to best implement them in a data science team.
Data scientists often have a different background and priorities than software engineers. A lot of the code Data Scientists write never makes it to production, and as a result, the code might not always meet the same standards as production-ready code in a developer team. While it makes sense to have rather lax requirements on code for one-off analyses, this can lead to difficulties in maintaining production code and collaborating on projects with software engineers. Since production code is not (always) the main output of a data science team, it can also be hard to prioritize code quality. In this presentation, we will go over some of the main principles of clean code and talk about practical steps that data science teams can take to improve their code. We will specifically focus on strategies that teams can implement to slowly and steadily improve the existing code base. This talk is aimed at data scientists who may not have a strong background in software engineering, but are interested in improving code quality and collaborating more effectively with software engineering teams.
🎤
You've got trust issues, we've got solutions: Differential Privacy
Speakers:
👤
Vikram Waradpande
👤
Sarthika Dhawan
📅 Wed, 19 Apr 2023 at 14:00
show details
As we are in an era of big data where large volumes of information are assimilated and analyzed for insights into human behavior, data privacy has become a hot topic. Since there is a lot of private information which, once leaked, can be misused, not all data can be released for research. This talk aims to discuss Differential Privacy, a cutting-edge technique of cybersecurity that claims to preserve an individual's privacy, how it is employed to minimize the risks associated with private data, its applications in various domains, and how Python eases the task of employing it in our models with PyDP.
Since there is a lot of private information which, once leaked, can be misused, how should privacy be protected? One might think that simply making the personally identifiable fields in a dataset anonymous is enough, but this can render the entire dataset useless and unfit for analysis. And research has proven that by statistically studying an anonymized dataset together with related datasets, private information can easily be re-extracted! The session will start with a brief on the current standards of privacy and the possible risks of handling customer data. This will lay the foundation for introducing Differential Privacy, a cutting-edge technique of cybersecurity that claims to preserve an individual's privacy by manipulating data in such a way as to not render it useless for data analysis. Developers will gain an insight into the concept of Differential Privacy, how it is employed to minimize the risks associated with private data, its practical applications in various domains, and how Python eases the task of employing it in our models with PyDP. As the talk progresses, a walkthrough of a real-life practical example, along with a nifty visualization, will acquaint the audience with PyDP and show how differentially private results approximate what the unfiltered data would have provided.
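A minimal sketch of the kind of query PyDP enables: a mean with calibrated Laplace noise. The exact constructor signature has varied across PyDP releases, so treat the arguments as an assumption and check the current docs:

```python
# constructor arguments per recent PyDP docs; signatures have varied
# across releases, so verify against the version you install
from pydp.algorithms.laplacian import BoundedMean

ages = [23, 35, 47, 31, 29, 52, 40, 38]

# epsilon=1.0; the bounds clamp each contribution to [18, 80]
dp_mean = BoundedMean(epsilon=1.0, lower_bound=18, upper_bound=80, dtype="float")
print(dp_mean.quick_result(ages))  # a noisy, privacy-preserving mean
```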
🎤
Apache Arrow: connecting and accelerating dataframe libraries across the PyData ecosystem
Speakers:
👤
Joris Van den Bossche
📅 Wed, 19 Apr 2023 at 14:00
show details
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing, and it is becoming the de facto standard for tabular data. This talk will give an overview of recent developments, both in Apache Arrow itself and in how it is being adopted across the PyData ecosystem (and beyond), and show how it can improve your day-to-day data analytics workflows.
The Apache Arrow (https://arrow.apache.org/) project specifies a standardized, language-independent columnar memory format for tabular data. It enables shared computational libraries, zero-copy shared memory, efficient (inter-process) communication without serialization overhead, etc. Nowadays, Apache Arrow is supported by many programming languages and projects, and it is becoming the de facto standard for tabular data. But what does that mean in practice? There is a growing set of tools in the Python bindings, PyArrow, and a growing number of projects that use (Py)Arrow to accelerate data interchange and actual data processing. This talk will give an overview of recent developments, both in Apache Arrow itself and in how it is being adopted across the PyData ecosystem (and beyond), and show how it can improve your day-to-day data analytics workflows.
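At its simplest, working with Arrow from Python looks like this; the columnar `Table` is the interchange currency that pandas, Polars, DuckDB, and friends can all consume:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# an in-memory columnar table
table = pa.table({"city": ["Berlin", "Paris"], "temp_c": [11.5, 13.2]})

# Parquet round trip: Arrow is the in-memory sibling of the on-disk format
pq.write_table(table, "weather.parquet")
df = pq.read_table("weather.parquet").to_pandas()
print(df)
```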
🎤
Bringing NLP to Production (an end to end story about some multi-language NLP services)
Speakers:
👤
Larissa Haas
👤
Jonathan Brandt
📅 Wed, 19 Apr 2023 at 14:00
show details
Models in Natural Language Processing are fun to train but can be difficult to deploy. The size of their models, libraries, and necessary files can be challenging, especially in a microservice environment. When services should be built as lightweight and slim as possible, large (language) models can lead to a lot of problems. Using a recent real-world use case as an example, one that has been running in production for over a year in 10 different languages, I will walk you through my experiences with deploying NLP models. What kinds of pitfalls, shortcuts, and tricks are possible while bringing an NLP model to production? In this talk, you will learn about different ways and possibilities to deploy NLP services. I will speak briefly about the path leading from data to model to a running service (without going into much detail) before focusing on the MLOps part at the end. I will take you with me on my past journey of struggles and successes so that you don't need to take these detours yourselves.
Models in Natural Language Processing are fun to train but can be difficult to deploy. The size of their models, libraries, and necessary files can be challenging, especially in a microservice environment. When services should be built as lightweight and slim as possible, large (language) models can lead to a lot of problems. All the way from brainstorming the use case, receiving and cleaning the data, and training and optimizing the model, through service building, deployment, and quality monitoring, lots of important data-science-related decisions need to be made, which will ultimately influence the selection of deployment tools and infrastructure. And most often, those architectural decisions are rather long-term, so they should be chosen thoughtfully in order to fit into the rest of the architecture. Using a recent real-world use case as an example, one that has been running in production for over a year in 10 different languages, I will walk you through my experiences with deploying NLP models. What kinds of pitfalls, shortcuts, and tricks are possible while bringing an NLP model to production? How can different model types and approaches influence architectural decisions? What are the most important questions to ask when evaluating deployment platforms, when there are several options to choose from? In this talk, you will learn about different ways and possibilities to deploy NLP services. I will speak briefly about the path leading from data to model to a running service (without going into much detail) before focusing on the MLOps part at the end. I will take you with me on my past journey of struggles and successes so that you don't need to take these detours yourselves. To follow this talk, you will need to know the basic concepts of deployment and MLOps, but no deeper knowledge of Python or Natural Language Processing. My goal is to enable you to ask important questions about deployment and going into production right at the beginning of every NLP project. I want you to be aware of problems that might occur so that working on NLP projects will be fun and not overshadowed by deployment issues.
🎤
Behind the Scenes of tox: The Journey of Rewriting a Python Tool with more than 10 Million Monthly Downloads
Speakers:
👤
Jürgen Gmach
📅 Wed, 19 Apr 2023 at 14:00
show details
tox is a widely-used tool for automating testing in Python. In this talk, we will go behind the scenes of the creation of tox 4, the latest version of the tool. We will discuss the motivations for the rewrite, the challenges and lessons learned during the development process. We will have a look at the new features and improvements introduced in tox 4. But most importantly, you will get to know the maintainers.
Do you recall what developer legend Joel Spolsky called the "single worst strategic mistake" in "Things You Should Never Do"? Rewriting software from scratch. That is what we did. For the tox test automation tool. A tool downloaded more than 10 million times a month, heavily used both in the open-source community and in multi-billion-dollar companies. I invite you to join me on the very personal journey of rewriting tox, a journey which started back in 2019. We will have a look at the initial motivations for the rewrite, the design decisions, the challenges, and the lessons learned. We will reconstruct why it took more than three years from the idea to the release, and why this was a good thing. I will explain what we did to ensure the release would cause as few issues as possible, and why we still received dozens and dozens of bug reports about regressions in the days after the release. And finally, I will answer the question: was it worth it?
🎤
Machine Learning Lifecycle for NLP Classification in E-Commerce
Speakers:
👤
Gunar Maiwald
👤
Tobias Senst
📅 Wed, 19 Apr 2023 at 14:00
show details
Running machine learning models in a production environment brings its own challenges. In this talk we would like to present our machine learning lifecycle solution for the text-based catalog classification system at idealo.de. We will share lessons learned and talk about our experiences during the lifecycle migration from a hosted cluster to a cloud solution over the last 3 years. In addition, we will outline how we embedded our ML components in the overall idealo.de processing architecture.
idealo.de offers a price comparison service for millions of products from a wide variety of categories. The automated classification of the offers is carried out using both traditional and deep learning-based approaches. Our machine learning components are part of a fully automated lifecycle and process up to 500 million offers daily at peak times. In addition to the enormous amount of data that we process, we particularly face the challenges of being online 24/7 while adapting to an ever-changing catalog structure. This requires a high level of reliability from our inference service and continuous automated retraining and model deployment. In this talk we would like to share and present our view on MLOps:

- How we integrate our CI/CD and continuous training pipelines with GitHub and AWS SageMaker
- How we migrated the lifecycle from a hosted cluster (running Kubernetes, Argo Workflows and ArgoCD) to the cloud (running AWS SageMaker and a data lake)
- How we keep monitoring of our models, data, and performance indicators up to date and alert in case of disruptions
- How we embed the classifiers in an event-driven heterogeneous software architecture (based on Kotlin and Python)

And share lessons learned on:

- How we keep reliability high while deploying, updating, and scaling our classification inference services
- How we strike a workable compromise between performance and cost requirements
🎤
The Battle of Giants: Causality vs NLP => From Theory to Practice
Speakers:
👤
Aleksander Molak
📅 Wed, 19 Apr 2023 at 14:10
show details
With an average of 3.2 new papers published on arXiv every day in 2022, causal inference has exploded in popularity, attracting a large amount of talent and interest from top researchers and institutions, including industry giants like Amazon and Microsoft. Text data, with its high complexity, poses an exciting challenge for the causal inference community. In the workshop, we'll review the latest advances in the field of Causal NLP and implement a causal Transformer model to demonstrate how to translate these developments into a practical solution that can bring real business value. All in Python!
Join us for a workshop exploring the exciting field of causal inference and its applications in natural language processing (NLP). The workshop is addressed to people who want to enrich their NLP and/or causal inference toolkits and enhance their perspective on contemporary machine learning. The workshop will start with an overview of modern causality frameworks. We'll discuss the most prominent ideas in Causal NLP and present an overview of Causal NLP tasks. Finally, we'll implement the CausalBERT model and demonstrate how it can be used to estimate causal effects in practical contexts. The workshop is open to everyone, yet to fully enjoy the content, it's recommended that you:

• Have a solid understanding of Python fundamentals (lists, dicts, the scientific stack)
• Understand the basics of graph theory (nodes, directed and undirected edges)
• Have a good understanding of deep learning basics
• Have a good understanding of NLP concepts like tokens and embeddings

The goal of this workshop is to give you a practical understanding of how to implement Causal NLP methods and inspire you to explore the fast-growing world of causality.
🎤
Contributing to an open-source content library for NLP
Speakers:
👤
Leonard Püttmann
📅 Wed, 19 Apr 2023 at 14:10
show details
Bricks is an open-source content library for natural language processing, which provides the building blocks to quickly and easily enrich, transform or analyze text data for machine learning projects. For many Pythonistas, contributing to an open-source project seems scary and intimidating. In this tutorial, we offer a hands-on experience in which programmers and data scientists learn how to code their own building blocks and share their creations with the community with ease.
We will prepare some easy use cases so that attendees with novice machine learning and NLP skills can participate in the session. A basic understanding of Python is required, but everyone who wants to learn more about machine learning, NLP, or open-source contributions is welcome. A brick is a modular piece of software that enriches, transforms, or analyzes text data for natural language processing, a sub-domain of machine learning. What sets a brick apart from a simple code snippet is its suitability for multiple execution environments. A brick module can also be executed in a demo playground, allowing users to try out different inputs to see if the brick meets their needs. In this session, we will begin by outlining some ideas for building a brick. After substantiating our ideas, we will make the code usable in different environments, such as the playground for testing inputs. Since spaCy is commonly used in many NLP projects, we will also build a variant of the code that takes a spaCy document as input. Add some documentation, and voilà! You now have a brick.
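The heart of a brick is usually just a small, well-documented function; here is a toy example of the spaCy-flavoured variant (not the actual bricks module layout, and it assumes the small English model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

def entity_count(text: str) -> int:
    """Toy enrichment of the kind a brick wraps: count named entities."""
    doc = nlp(text)
    return len(doc.ents)

print(entity_count("Berlin hosted PyCon DE and PyData in April 2023."))
```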
🎤
Introduction to Async programming
Speakers:
👤
Dishant Sethi
📅 Wed, 19 Apr 2023 at 14:35
show details
Asynchronous programming is a type of parallel programming in which a unit of work is allowed to run separately from the primary application thread. After execution, it notifies the main thread about the completion or failure of the worker thread. There are numerous benefits to using it, such as improved application performance, enhanced responsiveness, and effective usage of the CPU. Asynchronicity seems to be a big reason why Node.js is so popular for server-side programming. Most of the code we write, especially in heavy-IO applications like websites, depends on external resources. This could be anything from a remote database call to a POST request to an API. As soon as you ask for any of these resources, your code is waiting around with nothing to do while the process completes. With asynchronous programming, you allow your code to handle other tasks while waiting for these other resources to respond. In this session, we are going to talk about asynchronous programming in Python: its benefits and multiple ways to implement it.
How do we implement asynchronicity in Python?

1. Multiple processes: The most obvious way is to use multiple processes. From the terminal, you can start several scripts, and they will all run independently, at the same time. The operating system underneath takes care of sharing your CPU resources among all those instances. Alternatively, you can use the multiprocessing library, which supports spawning processes, as shown in the sketch after this list.

2. Multiple threads: The next way to run multiple things at once is to use threads. A thread is a line of execution, much like a process, but you can have multiple threads in the context of one process, and they all share access to common resources. Because of this, threaded code is difficult to write correctly. Again, the operating system does the heavy lifting of sharing the CPU, but the global interpreter lock (GIL) allows only one thread to run Python code at a given time, even when you have multiple threads running. So, in CPython, the GIL prevents multi-core concurrency: you are effectively running on a single core even though you may have two, four, or more.

3. Coroutines using yield: Coroutines are generalizations of subroutines. They are used for cooperative multitasking, where a process voluntarily yields (gives away) control periodically, or when idle, in order to enable multiple applications to run simultaneously.

4. Asynchronous programming: The fourth way, in which the OS does not participate, is asyncio. Asyncio is the concurrency module introduced in Python 3.4. It is designed to use coroutines and futures to simplify asynchronous code and make it almost as readable as synchronous code, since there are no callbacks.

5. Redis and Redis Queue (RQ): Using asyncio and aiohttp may not always be an option, especially if you are on an older version of Python. There are also scenarios where you want to distribute your tasks across different servers. In that case, we can leverage RQ (Redis Queue), a simple Python library for queueing jobs and processing them in the background with workers. It is backed by Redis, a key/value data store.

A practical definition of async is that it is a style of concurrent programming in which tasks release the CPU during waiting periods so that other tasks can use it. In Python, there are several ways to achieve concurrency; based on our requirements, code flow, data manipulation, architecture design, and use cases, we can select any of these methods.
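The multiprocessing example referenced in the first item might look like this minimal sketch (the worker function and URLs are placeholders):

```python
from multiprocessing import Process

def fetch(url: str) -> None:
    # placeholder for real work, e.g. downloading and parsing a page
    print(f"processing {url}")

if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b"]
    workers = [Process(target=fetch, args=(url,)) for url in urls]
    for p in workers:
        p.start()  # each worker runs in its own OS process
    for p in workers:
        p.join()   # wait for all workers to finish
```

For comparison, the same idea written with asyncio, where `asyncio.sleep` stands in for real I/O such as an HTTP request:

```python
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(1)  # stands in for real I/O, e.g. an HTTP request
    return f"done: {url}"

async def main() -> None:
    urls = ["https://example.com/a", "https://example.com/b"]
    # both coroutines wait concurrently, so this takes about one second
    # in total rather than two
    results = await asyncio.gather(*(fetch(u) for u in urls))
    print(results)

asyncio.run(main())
```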
🎤
The Beauty of Zarr
Speakers:
👤
Sanket Verma
📅 Wed, 19 Apr 2023 at 14:35
show details
In this talk, I’d be talking about [Zarr](https://zarr.dev/), an open-source data format for storing chunked, compressed N-dimensional arrays. The talk presents a systematic approach to understanding and implementing Zarr: how it works, why you’d use it, and a hands-on session at the end. Zarr is based on an open [technical specification](https://zarr.readthedocs.io/en/stable/spec/v2.html), making implementations across several languages possible. I’d mainly talk about [Zarr’s Python](https://github.com/zarr-developers/zarr-python) implementation and show how it beautifully interoperates with the existing libraries in the PyData stack.
[Zarr](https://zarr.dev/) is a data format for storing chunked, compressed N-dimensional arrays. Zarr is based on an open [technical specification](https://zarr.readthedocs.io/en/stable/spec/v2.html) and has [implementations](https://github.com/zarr-developers/zarr_implementations) in several languages, with [Zarr-Python](https://github.com/zarr-developers/zarr-python) being the most used. Zarr is a [NumFOCUS sponsored project](https://numfocus.org/sponsored-projects) and is under their umbrella.

### Outline:

First, I’d be talking about:

### What’s, Why’s, and How’s of Zarr (15 mins.)

- How does Zarr work?
  - Talking about the motivation and functionality of Zarr
- What’s the need for using Zarr?
  - When, where and why to use it?
- Pluggable compressors and file storage
  - Talking about the several compressors and file-storage systems available in Zarr
- Managing (selection, resizing, writing, reading) chunked arrays using Zarr functions
  - Using built-in functions to manage compressed chunks
- How is Zarr different from other storage formats?
  - Talking briefly about the technical specification, which allows Zarr to have implementations in several languages
  - Pros and cons compared to other storage formats
- Zarr community
  - What is the Zarr community, and how do we do things?

Then, I’d be doing a hands-on session, which would cover the following:

### Hands-on (10 mins.)

- Creating and using Zarr arrays
  - Using built-in functions to create Zarr arrays and read and write data to them
- Looking under the hood
  - Using store functions to explain how your Zarr data is stored
- Consolidating metadata
  - Consolidating the metadata for an entire group into a single object
- Writing to and reading from cloud object storage
  - Using S3/GCS/Azure to create Zarr arrays and write data to them
- Showing how Zarr interoperates with the PyData stack
  - How Zarr interoperates with the PyData stack (NumPy, Dask and Xarray), and how you can write data to your Zarr chunks at incredibly high speed in parallel using Dask

I’d be closing the talk with:

### Conclusion (5 mins.)

- Key takeaways
- How can you contribute to Zarr?
- Q&A

This talk is aimed at an audience who work with large amounts of data and are in search of a data format that is transparent, easy to use and friendly to the environment. Zarr is widely used in the bioimaging, geospatial and research communities, so Zarr is your one-stop solution if you’re part of a community or an organisation dealing with high-volume data. Anyone who is curious and wants to learn about Zarr and how to use it is also most welcome. The tone of the talk is informative, along with a hands-on session; I’m happy to adjust the style according to the audience in the room. Intermediate knowledge of Python and NumPy arrays is required for attendees to get the most out of this talk.

### After this talk, you’d learn:

- Basic use cases for Zarr and how to use it
- The basics of data storage in Zarr
- The basics of compressors and file-storage systems in Zarr
- How to make a better and more informed decision about what data format to use for your data
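As a taste of the hands-on portion, a minimal sketch of creating and reading a chunked, compressed Zarr array with Zarr-Python; the shape, chunking and on-disk path are arbitrary choices here, and exact defaults (e.g. the compressor) may vary between versions:

```python
import numpy as np
import zarr

# create a 10,000 x 10,000 float64 array stored as 1,000 x 1,000 chunks
z = zarr.open("example.zarr", mode="w", shape=(10_000, 10_000),
              chunks=(1_000, 1_000), dtype="f8")

# writes only touch the chunks the selection overlaps
z[0, :] = np.arange(10_000)

# reads decompress only the chunks needed for the selection
print(z[0, :5])
print(z.info)  # summary: shape, chunks, compressor, storage size
```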
🎤
Cloud Infrastructure From Python Code: How Far Could We Go?
Speakers:
👤
Etzik Bega
👤
Asher Sterkin
📅 Wed, 19 Apr 2023 at 14:35
show details
Discover how Infrastructure From Code (IfC) can revolutionize Cloud DevOps automation by generating cloud deployment templates directly from Python code. Learn how this technology empowers Python developers to easily deploy and operate cost-effective, secure, reliable, and sustainable cloud software. Join us to explore the strategic potential of IfC.
## Audience

The talk is a call to action for the whole Python community to take an active part in unlocking Python's full potential as a truly cloud-native programming language, by adapting its runtime and compiler to work optimally with cloud resources.

## Why are SDK programming and Infrastructure as Code not enough anymore?

Developing cloud software using a cloud SDK combined with deployment automation using Infrastructure as Code (IaC) templates has some serious limitations. Both SDKs and IaC operate at a relatively low level, require special expertise that takes time to acquire, are disconnected from each other, and are too often prepared by separate engineering teams. Applying SDK+IaC to multiple test, staging, and production environments can exacerbate complexity and size issues. As a result, there is a need for a more efficient and automated approach to cloud infrastructure management that integrates tightly with application code.

## What is Infrastructure From Code?

Infrastructure from Code (IfC) is a newer and more advanced approach than IaC. It interprets mainstream programming language code and automatically generates the specifications needed to configure a cloud environment. Advanced solutions like ServerlessCloud, Ampt, and Nitric have been proposed for the TypeScript ecosystem. This talk will explore the current state of IfC for Python, its potential, and what needs to be done to make Python a truly cloud-native programming language.

## Talk Outline

1. Infrastructure from Python Code (PyIfC) Mission
2. The Challenges of SDK Programming Combined with Infrastructure as Code (IaC)
3. The PyIfC Approach: How It Works and Its Benefits
4. Sample Code and Demo
5. A Closer Look at PyIfC's Inner Workings
6. Overcoming Deployment Location Optimization and Sustainability Challenges
7. Overview of the Existing Solutions Landscape for PyIfC
8. Unleashing the Full Potential of the Python Ecosystem
9. The Intersection of PyIfC and Domain-Driven Design
10. Advancing PyIfC: What Needs to Be Done
11. Key Takeaways and Next Steps
12. Q&A

# Tags

Cloud, Deployment, Automation, Serverless, Infrastructure as Code, IaC, Infrastructure From Code, IfC, Python
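To make the idea concrete, here is a purely hypothetical sketch of what an IfC-style Python API could look like. The `ifc` module, decorator, and class names are invented for illustration and do not correspond to any specific IfC product:

```python
# hypothetical IfC-style API: nothing here is a real library
from ifc import Bucket, on_http

uploads = Bucket("uploads")  # the toolchain infers an object store is needed

@on_http("/resize")
def resize(request):
    # reading from the bucket tells the IfC analyzer this function needs
    # read access, so a least-privilege IAM policy can be generated for it
    image = uploads.get(request.params["key"])
    return {"size": len(image)}

# a deploy command would statically analyze this module and emit the cloud
# deployment template (function, bucket, permissions, HTTP route)
```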
🎤
Giving and Receiving Great Feedback through PRs
Speakers:
👤
David Andersson
📅 Wed, 19 Apr 2023 at 14:35
show details
Do you struggle with PRs? Have you ever had to change code even though you disagreed with the change just to land the PR? Have you ever given feedback that would have improved the code only to get into a comment war? We'll discuss how to give and receive feedback to extract maximum value from it and avoid all the communication problems that come with PRs.
Do you struggle with PRs? Have you ever had to change code even though you disagreed with the change just to land the PR? Have you ever given feedback that would have improved the code only to get into a comment war? We'll discuss how to give and receive feedback to extract maximum value from it and avoid all the communication problems that come with PRs. We'll start with some thoughts about what PRs are intended to achieve. Then we'll discuss how to give feedback that will be well received and result in improvements to the code, followed by how to extract maximum value from the feedback you receive without agreeing to suboptimal changes. Finally, we will look at a checklist for giving and receiving feedback that you can use as you go through reviews, both as an author and as a reviewer.
🎤
evosax: JAX-Based Evolution Strategies
Speakers:
👤
Robert Lange
📅 Wed, 19 Apr 2023 at 14:35
show details
Tired of having to handle asynchronous processes for neuroevolution? Do you want to leverage massive vectorization and high-throughput accelerators for evolution strategies (ES)? [evosax](https://github.com/RobertTLange/evosax) allows you to leverage JAX, XLA compilation and auto-vectorization/parallelization to scale ES to your favorite accelerators. In this talk we will get to know the core API and how to solve distributed black-box optimization problems with evolution strategies.
The deep learning revolution has been greatly accelerated by the 'hardware lottery': recent advances in modern hardware accelerators and compilers paved the way for large-scale batch gradient optimization. Evolutionary optimization, on the other hand, has mainly relied on CPU parallelism, e.g. using Dask scheduling and distributed multi-host infrastructure. Here we argue that modern evolutionary computation can also benefit significantly from the massive computational throughput provided by GPUs and TPUs. In order to better harness these resources and to enable the next generation of black-box optimization algorithms, we release [evosax](https://github.com/RobertTLange/evosax): a JAX-based library of evolution strategies which allows researchers to leverage powerful function transformations such as just-in-time compilation, automatic vectorization and hardware parallelization. [evosax](https://github.com/RobertTLange/evosax) implements 30 evolutionary optimization algorithms, including finite-difference-based and estimation-of-distribution evolution strategies as well as various genetic algorithms. Every algorithm can be executed directly on hardware accelerators and automatically vectorized or parallelized across devices using a single line of code. The library is designed in a modular fashion and allows for flexible usage via a simple ask-evaluate-tell API. We thereby hope to facilitate a new wave of scalable evolutionary optimization algorithms.
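For a flavour of the ask-evaluate-tell API mentioned above, a minimal sketch of optimizing a toy objective with CMA-ES; the call signatures follow the library's README at the time of writing and may differ between versions:

```python
import jax
import jax.numpy as jnp
from evosax import CMA_ES

def sphere(x):
    # toy black-box objective: minimum at the origin
    return jnp.sum(x ** 2)

rng = jax.random.PRNGKey(0)
strategy = CMA_ES(popsize=32, num_dims=5)
es_params = strategy.default_params
state = strategy.initialize(rng, es_params)

for gen in range(50):
    rng, rng_ask = jax.random.split(rng)
    # ask: sample a population of candidate solutions
    x, state = strategy.ask(rng_ask, state, es_params)
    # evaluate: vectorized fitness over the whole population
    fitness = jax.vmap(sphere)(x)
    # tell: update the search distribution with the results
    state = strategy.tell(x, fitness, state, es_params)

print(state.best_fitness)
```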
🎤
Postmodern Architecture: The Python Powered Modern Data Stack
Speakers:
👤
John Sandall
📅 Wed, 19 Apr 2023 at 15:10
show details
The Modern Data Stack has brought a lot of new buzzwords into the data engineering lexicon: "data mesh", "data observability", "reverse ETL", "data lineage", "analytics engineering". In this light-hearted talk we will demystify the evolving revolution that will define the future of data analytics & engineering teams. Our journey begins with the PyData Stack: pandas pipelines powering ETL workflows...clean code, tested code, data validation, perfect for in-memory workflows. As demand for self-serve analytics grows, new data sources bring more APIs to model, more code to maintain, DAG workflow orchestration tools, new nuances to capture ("the tax team defines revenue differently"), more dashboards, more not-quite-bugs ("but my number says this..."). This data maturity journey is a well-trodden path with common pitfalls & opportunities. After dashboards comes predictive modelling ("what will happen"), prescriptive modelling ("what should we do?"), perhaps eventually automated decision making. Getting there is much easier with the advent of the Python Powered Modern Data Stack. In this talk, we will cover the shift from ETL to ELT, the open-source Modern Data Stack tools you should know, with a focus on how dbt's new Python integration is changing how data pipelines are built, run, tested & maintained. By understanding the latest trends & buzzwords, attendees will gain a deeper insight into Python's role at the core of the future of data engineering.
This light-hearted talk will aim to introduce the audience to the theory and terminology of data pipelines and architectures past, present and future. The "Modern Data Stack", a set of interoperable tools, introduced a shift in how organisations can rapidly construct a data architecture that combines multiple data sources into a single unified data warehouse, with clean analytics-ready tables for plugging in BI tools, self-serve analytics dashboards, and ML models. Until recently, the complexity of data transformation and modelling was limited to what could be done with SQL, leaving the rich ecosystem of Python tooling for complex transformations, geospatial analytics, time series modelling, data validation, and clean, tested, CI-enabled codebases mostly uninvited to the Modern Data Stack party. One recent trend is the number of tools that launched Python integrations in 2022 (most notably dbt), opening up a world of productivity and fast, scalable data processing for the PyData-savvy Pythonista. Another recent trend is an explosion of jargon, with analytics engineers getting into heated debates around whether data observability or metadata capture should be prioritised within a data mesh architecture. These are all important concepts, especially for organisations operating at a scale where reliable data governance is mission-critical. But not all organisations operate at that scale, and every organisation, large or small, is on its own data maturity journey. My goal with this talk is to bring these concepts together, introduce attendees to these recent trends, and provide a framework they can take back into their organisations for accelerating their own data maturity journey using the latest tooling & best practices.
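As a flavour of the dbt Python integration mentioned above, a minimal sketch of a dbt Python model; the upstream model and column names are invented, and the concrete dataframe type returned by `dbt.ref` depends on the warehouse adapter (e.g. Snowpark or PySpark):

```python
def model(dbt, session):
    # tell dbt how to materialize this model in the warehouse
    dbt.config(materialized="table")

    # "stg_orders" is a hypothetical upstream model; on Snowflake this
    # returns a Snowpark DataFrame, which we convert to pandas here
    orders = dbt.ref("stg_orders").to_pandas()

    orders["revenue_eur"] = orders["amount_cents"] / 100

    # the returned dataframe becomes the model's table in the warehouse
    return orders
```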
🎤
Fear the mutants. Love the mutants.
Speakers:
👤
Max Kahan
📅 Wed, 19 Apr 2023 at 15:10
show details
Developers often use code coverage as a target, which makes it a bad measure of test quality. Mutation testing changes the game: create mutant versions of your code that break your tests, and you'll quickly start to write better tests! Come and learn to use it as part of your CI/CD process. I promise, you'll never look at penguins the same way again!
Code coverage (the percentage of your code exercised by your tests) is a great metric. However, coverage doesn't tell you how good your tests are at picking up changes to your codebase: if your tests aren't well designed, changes can pass your unit tests but break production. Mutation testing is a great (and massively underrated) way to quantify how much you can trust your tests. Mutation tests work by changing your code in subtle ways, then applying your unit tests to these new, "mutant" versions of your code. If your tests fail, great! If they pass… that's a change that might cause a bug in production. In this talk, I'll show you how to get started with mutation testing and how to integrate it into your CI/CD pipeline. After the session, you'll be ready to use mutation testing with wild abandon. Soon, catching mutant code will be a routine part of your release engineering process, and you'll never look at penguins the same way again!
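To make the core idea concrete, a small hypothetical illustration: the first test below executes every line of `is_adult`, so line coverage reads 100%, yet a mutant that changes `>=` to `>` still passes it. Adding the boundary case kills that mutant:

```python
def is_adult(age: int) -> bool:
    return age >= 18

def test_is_adult():
    # 100% line coverage, but a `>=` -> `>` mutant would still pass
    assert is_adult(30)
    assert not is_adult(5)

def test_is_adult_boundary():
    # this boundary check is what actually kills the mutant
    assert is_adult(18)
```

Mutation-testing tools such as mutmut generate and run mutants like this automatically, reporting every mutant that survives your test suite.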
🎤
Rethinking codes of conduct
Speakers:
👤
Tereza Iofciu
📅 Wed, 19 Apr 2023 at 15:10
show details
Did you know that the Python Software Foundation Code of Conduct is turning 10 years old in 2023? It was voted in because the PSF felt the community was “unbalanced and not seeing the true spectrum of the greater community”. Why is that a big thing? Come to my talk and find out!
Did you know that the Python Software Foundation Code of Conduct is turning 10 years old in 2023? It was voted in because the PSF felt the community was “unbalanced and not seeing the true spectrum of the greater community”, and thought that with time it could “advance towards a more diverse representation.”[1] Why is that a big thing? Codes of conduct are an important part of any community, outlining its values. They establish clear guidelines for acceptable behavior and help to create a safe and inclusive environment. This can prevent discrimination and promote equal opportunities for all members. In this talk, we will explore the role of codes of conduct in communities and their history in the PSF, and discuss strategies for rethinking what it means to have and enforce a code of conduct. We will look at the challenges that Python communities face when implementing codes of conduct and talk about possible solutions. What does it look like when it works well, and when it doesn't? As codes of conduct are an essential part of any open source project, reflecting on these guidelines can help ensure that projects are successful and sustainable in the long term. Thinking back to Python, which also turns 20 in 2023: “Python got to where it is by being open, and it'll continue to prosper by remaining open”. It's important we continue this mission; after all, one of the things many people love about Python is the community. [1] https://pyfound.blogspot.com/2013/06/announcing-code-of-conduct-for-use-by.html
🎤
How to increase diversity in open source communities
Speakers:
👤
Maren Westermann
📅 Wed, 19 Apr 2023 at 15:10
show details
Today, state-of-the-art technology and scientific research strongly depend on open source libraries. The demographic of the contributors to these libraries is predominantly white and male [1][2][3][4]. This situation creates problems both for individual contributors outside this demographic and for the open source projects themselves: lost career opportunities for the former and less robust technologies for the latter [1][7]. In recent years there have been a number of recommendations and initiatives to increase participation in open source projects by groups who are underrepresented in this domain [1][3][5][6]. While these efforts are valuable and much needed, contributor diversity remains a challenge in open source communities [2][3][7]. This talk highlights the underlying problems and explores how we can overcome them.
In this talk we’ll first examine the problems encountered by people belonging to marginalised groups in open source as well as by project maintainers with respect to contributing to and increasing the diversity of open source projects, respectively [1][2][3][4][5][6]. Building on this overview, we’ll go over what kind of actions have been taken to increase diversity in open source projects, with special focus on scientific libraries, and the effects they have had [1][6][7]. Lastly, we’ll look at ideas that are currently being tested and next steps. By the end of this talk, the audience will have a good understanding of why contributor diversity is low in open source, the efforts that have been made so far to address this problem, and what can further be done to increase the presence of underrepresented groups in technology in general, and in open source in particular. References: [1] https://www.wired.com/2017/06/diversity-open-source-even-worse-tech-overall [2] https://arxiv.org/pdf/1706.02777.pdf [3] https://ieeexplore.ieee.org/abstract/document/8870179 [4] https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9354402 [5] https://biancatrink.github.io/files/papers/JISA2021.pdf [6] https://arxiv.org/pdf/2105.08777.pdf [7] https://blog.scikit-learn.org/events/sprints-value/
🎤
Great Security Is One Question Away
Speakers:
👤
Wiktoria Dalach
📅 Wed, 19 Apr 2023 at 15:10
show details
After a decade of writing code, I joined the application security team. During the transition process, I discovered that there are many myths about security and how difficult it is. Often devs choose to ignore it because they think that writing more secure code would take them ages. That is not true. Security doesn't have to be scary. From my talk, you will learn the most useful pieces of application security theory. It will be practical and not boring at all.
There are so many myths about security and how difficult it is. Often devs choose to ignore it because they think that writing more secure code would take them ages. That is not true. Security doesn't have to be scary. In my talk, I share five tips that can almost immediately make a product more secure. After a decade of writing code, I joined the application security team. During the transition process, I discovered that there are a few pieces of security theory that would have made my life as a developer much less painful if I had known them earlier:
- Always validate the input (see the sketch after this list)
- Do not commit credentials to your repository
- Use scanners to find vulnerabilities
- Learn the CIA triad: Confidentiality, Integrity and Availability form a useful framework for developing a security mindset. It is a simple yet powerful piece of theory: it can be the basis for threat modeling a whole project, but it also works at the level of a single user story.
- When in doubt, ask your security team for help
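As a minimal illustration of the first tip, a hypothetical input-validation sketch using pydantic; the request fields and constraints are invented, and any validation library (or plain Python checks) works just as well:

```python
# pydantic v1-style models; field names and limits are illustrative
from pydantic import BaseModel, ValidationError, conint, constr

class SignupRequest(BaseModel):
    username: constr(strip_whitespace=True, min_length=3, max_length=30)
    age: conint(ge=0, le=150)

def handle_signup(raw: dict) -> str:
    try:
        req = SignupRequest(**raw)
    except ValidationError as err:
        # reject malformed input at the boundary instead of letting it
        # leak into business logic or database queries
        return f"rejected: {err.errors()[0]['msg']}"
    return f"welcome, {req.username}"

print(handle_signup({"username": "ada", "age": 36}))  # welcome, ada
print(handle_signup({"username": "a", "age": -1}))    # rejected: ...
```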