🎤
Honey, I broke the PyTorch model >.< - Debugging custom PyTorch models in a structured manner
Speakers:
👤
Clara Hoffmann
📅 Mon, 17 Apr 2023 at 10:50
When building PyTorch models for custom applications from scratch, there's usually one problem: the model does not learn anything. In a complex project, it can be tricky to identify the cause: Is it the data? A bug in the model? Choosing the wrong loss function at 3 am after an 8-hour coding session? In this talk, we will build a toolbox to find the culprits in a structured manner. We will focus on simple ways to ensure a training loop is correct, generate synthetic training data to determine whether we have a model bug or problematic real-world data, and leverage pytest to safely refactor PyTorch models. After this talk, attendees will be well equipped to take the right steps when a model is not learning, quickly identify the underlying reasons, and prevent bugs in the future.
PyTorch models for off-the-shelf applications are easy to build and debug. But in real-world ML applications, debugging can become quite tricky - especially when model complexity is high and only noisy real-world data is available. When our DNN is not learning, many factors can be at fault:
- Is there a bug in the model structure - for example, mixed-up channels or timesteps?
- Is our dataset not large or homogeneous enough to learn something? Have we mixed up labels in the preprocessing?
- Have we chosen incorrect losses, accidentally skipped layers, or chosen inappropriate activation functions?
The plethora of potential reasons can be overwhelming to engineers. This talk will introduce a structured approach and valuable tools for efficiently debugging PyTorch models. We'll start with techniques to check that training loops are correct, such as ensuring our model can overfit a single training example. In the second step, we'll investigate how to generate simple, synthetic data for arbitrary input and output formats to validate our model. Finally, we'll look at how to avoid model bugs altogether by setting up universal tests that can be used during development and refactoring to prevent breaking PyTorch models.
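As a rough illustration of the first check, here is a minimal sketch (model, data, and hyperparameters are placeholders, not from the talk) of the classic sanity test of overfitting a single batch: if the loss does not approach zero, the training loop or model likely has a bug.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and a single fixed batch (shapes are arbitrary assumptions).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(8, 10)          # one small batch, reused every step
y = torch.randint(0, 2, (8,))   # fixed labels for that batch

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# A healthy training loop should drive the loss on this single batch close to zero.
assert loss.item() < 0.01, f"failed to overfit a single batch (loss={loss.item():.3f})"
```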
🎤
Cooking up a ML Platform: Growing pains and lessons learned
Speakers:
👤
Cole Bailey
📅 Mon, 17 Apr 2023 at 10:50
What is an ML platform and do you even need one? When should you consider investing in your own ML platform? What challenges can you expect building and maintaining one? Tune in and discover (some) answers to these questions and more! I will share a first-hand account of our ongoing journey towards becoming an ML platform team within Delivery Hero's Logistics department, including how we got here, how we structure our work, and what challenges and tools we are focusing on next.
🎤
Apache StreamPipes for Pythonistas: IIoT data handling made easy!
Speakers:
👤
Tim Bossenmaier
👤
Sven Oehler
📅 Mon, 17 Apr 2023 at 10:50
The industrial environment offers a lot of interesting use cases for data enthusiasts. There are myriads of interesting challenges that can be solved by data scientists. However, collecting industrial data in general, and industrial IoT (IIoT) data in particular, is cumbersome and not really appealing for anyone who just wants to work with data. Apache StreamPipes addresses this pitfall and allows anyone to extract data from IIoT data sources without messing around with (old-fashioned) protocols. In addition, StreamPipes' newly developed Python client now gives Pythonistas the ability to programmatically access this data and work with it in a Pythonic way. This talk will provide a basic introduction to the functionality of Apache StreamPipes itself, followed by a deeper discussion of the Python client. Finally, a live demo will show how IIoT data can be easily retrieved in Python and used directly for visualization and ML model training.
The industrial environment is becoming an increasingly attractive use case for data enthusiasts, with challenges ranging from predictive maintenance to robotics to autonomous vehicles. Building a full-fledged IIoT architecture is a big endeavor, especially for small and medium-sized companies with limited resources. It requires IIoT specialists with extensive knowledge of industrial protocols, software architects capable of designing an IIoT platform, and cloud specialists able to operate an infrastructure at scale that can handle potentially massive data streams. However, the added value lies not in the technical infrastructure, but in the data itself. Therefore, it should be as easy as possible for data scientists to analyze data and gain new insights without worrying about underlying technical details. But such a project has many pitfalls, which is why many projects are not even initiated because the costs seem too high.

These pitfalls are addressed by Apache StreamPipes, an end-to-end toolbox that allows anyone to easily extract, explore and analyze IIoT data. With its new Python client, it targets Python data enthusiasts (e.g., data scientists) who want to work with IIoT data but don't want to get their hands dirty interacting with industrial systems. Via an easy-to-use Python client, developers can get streaming or historic data from StreamPipes' internal data management layer in a Pythonic representation such as dictionaries or pandas DataFrames. This allows data scientists to work with their familiar tech stack and use the extracted data directly for analytics, visualizations, or even machine learning. StreamPipes handles all the infrastructure, such as the message broker or time-series storage, and provides many out-of-the-box features that ease data analytics of industrial sources: more than 20 data adapters for quickly getting access to a variety of industrial protocols, built-in pre-processing rules to harmonize sensor and other data on the fly, and a pipeline editor featuring over 100 algorithms and a rich user interface to interactively build data processing pipelines.

Apache StreamPipes is a large and mature open-source project which started as a research project in 2015 and made its way to an Apache top-level project in November 2022, with a community of currently more than 25 active contributors.

The talk will provide a basic introduction to Apache StreamPipes, followed by a deeper discussion of the Python client focusing on the target audience (Python developers). The main part is about data handling with Python, and design decisions within the client for common patterns will be discussed in detail. To conclude, we will show how IIoT data can be extracted via Apache StreamPipes and used for further analytics within the Python world. Attendees will get familiar with Apache StreamPipes in general, its mission, and its core modules. In addition, common IIoT patterns will be presented and illustrated using the Python client of Apache StreamPipes. The presentation includes an extensive demo with many hands-on examples.
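As a rough sketch of what the client-side workflow looks like (connection details are placeholders, and the calls follow the pattern in the StreamPipes Python client documentation at the time; treat the exact names as assumptions):

```python
from streampipes.client import StreamPipesClient
from streampipes.client.config import StreamPipesClientConfig
from streampipes.client.credential_provider import StreamPipesApiKeyCredentials

# Placeholder connection settings for a local StreamPipes instance.
config = StreamPipesClientConfig(
    credential_provider=StreamPipesApiKeyCredentials(
        username="user@example.com", api_key="<api-key>"
    ),
    host_address="localhost",
    https_disabled=True,
    port=80,
)
client = StreamPipesClient(client_config=config)

# Overview of the measurements stored in the StreamPipes data lake ...
measures = client.dataLakeMeasureApi.all()
# ... pulled into a familiar pandas DataFrame for further analysis.
df = measures.to_pandas()
```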
🎤
Pandas 2.0 and beyond
Speakers:
👤
Joris Van den Bossche
👤
Patrick Hoefler
📅 Mon, 17 Apr 2023 at 10:50
Pandas has reached a 2.0 milestone in 2023. But what does that mean? And what is coming after 2.0? This talk will give an overview of what happened in the latest releases of pandas and highlight some topics and major new features the pandas project is working on.
The pandas 2.0 release is targeted for the first quarter of 2023. This is a major milestone for the pandas project, and this talk will start with an overview of this release. Pandas 2.0 includes some new (experimental) features, but mostly means enforcing deprecations that were accumulated in the 1.x series, along with some necessary breaking changes. But that doesn't mean there are no interesting features to talk about! The main part of the presentation will showcase some new features, some already released as opt-in features and others to come in future releases:
- Support for non-nanosecond resolution datetimes, allowing time spans ranging over a billion years.
- Improved support for nullable data types, including easy opt-in options for I/O functions.
- Experimental integration with pyarrow to back columns of a DataFrame (beyond the string dtype).
A major change that is under way is a change to the copy and view semantics of operations in pandas (related to the well-known (or hated) SettingWithCopyWarning). This is already available as an experimental opt-in to test and use the new behaviour, and will probably be a highlight of pandas 3.0.
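To make the opt-ins concrete, here is a small sketch of how these features are enabled in pandas 2.0 (the data is a toy example; the flags are those documented in the 2.0 release notes):

```python
import io
import pandas as pd

# Copy-on-Write: opt in to the new copy/view semantics.
pd.options.mode.copy_on_write = True

# Nullable / pyarrow-backed dtypes via the dtype_backend keyword of I/O functions.
csv = io.StringIO("a,b\n1,x\n,y\n")
df = pd.read_csv(csv, dtype_backend="pyarrow")   # or "numpy_nullable"

# Non-nanosecond datetimes: second resolution supports spans far beyond 1677-2262.
s = pd.Series(["1000-01-01", "3000-01-01"], dtype="datetime64[s]")

print(df.dtypes, s.dt.year, sep="\n")
```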
🎤
How to teach NLP to a newbie & get them started on their first project
Speakers:
👤
Lisa Andreevna Chalaguine
📅 Mon, 17 Apr 2023 at 10:50
The materials presented during this tutorial are open source and can be used by coaches and tutors who want to teach their students how to use Python for text processing and text classification. A minimal understanding of programming (in any language) is required of students.
The materials presented at this tutorial were initially created for high school and university students to help them get started with their first machine learning project using textual data. Machine learning on textual data is more accessible for beginners because it does not involve missing data imputation, normalisation and scaling. It is also easier to analyse and interpret the results (e.g. why something was misclassified). There are many introductory NLP courses on the internet; however, they are not free, and they either only cover the complete basics¹ or do not cover machine learning algorithms² and treat models as a black box. They also do not show how to do research correctly (e.g. setting a baseline, making design decisions based on correct validation, etc.). These materials, in the form of Jupyter notebooks, can be used by teachers to guide their students through an NLP research project from start to finish. The materials are of course not limited to teachers and tutors at academic institutions. Many companies rely on customer reviews, social media, client records, and various other content created in natural language, but often use sub-optimal solutions to analyse it (like MS Excel). These materials will give working professionals all the tools to get started with text analysis, as well as teach them the fundamentals of machine learning, so they can automate document labelling and other manual tasks with the help of document classification (e.g. Is a customer review positive or negative? Is a certain document about topic X or topic Y?). A minimal understanding of programming (in any language) is required. However, all necessary Python libraries will be covered. The aim of the tutorial is to present the materials, which contain 7 “lectures”, several practical exercises with solutions, and a case study, and hence can be covered either in 10 hours (10 weeks) over a term or in a 2-day workshop. ¹https://www.udemy.com/course/natural-language-processing/ ²https://www.udemy.com/course/nlp-natural-language-processing-with-python/
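As a flavour of the kind of starter project such materials cover, a minimal text classification baseline in scikit-learn might look like this (toy data; not taken from the tutorial materials themselves):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny toy dataset: reviews labelled positive (1) or negative (0).
texts = [
    "great product, works perfectly",
    "terrible, broke after one day",
    "absolutely love it",
    "waste of money",
]
labels = [1, 0, 1, 0]

# TF-IDF features + logistic regression: a classic, interpretable baseline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["love this, would buy again"]))  # expected: [1]
```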
🎤
Accelerate Python with Julia
Speakers:
👤
Stephan Sahm
📅 Mon, 17 Apr 2023 at 10:50
Speeding up Python code has traditionally been achieved by writing C/C++ — an alien world for most Python users. Today, you can write high-performance code in Julia instead, which is much easier for Python users to pick up. This tutorial will give you hands-on experience writing a Python library that incorporates Julia for performance optimization.
Julia is a modern data science language which solves the two-language problem by being both easy to use and highly performant. Although different from Python, the language can be learned quickly by Python users, making it a good choice for speeding up pieces of code. In addition to being a similar language, Julia is designed for high-performance applied mathematics and has high-quality libraries for multi-dimensional arrays, dataframes, distributed computing and more. The older alternative of writing pieces of code in C, C++, Cython or Rust is much more cumbersome: here, programmers have to cope with a low-level language, static types, pointers, no garbage collector, a lack of scientific libraries and other difficulties not normally faced by Python users. Until now, there was simply no better alternative. The tutorial will be fully hands-on, using Jupyter Notebook and Binder to provide a smooth and easy-to-use environment for each participant. Both Julia and Python work seamlessly within Jupyter. We start with an introduction to the basics of Julia, focusing on the core differences with Python and how to work around common translation difficulties. Then we'll take a look at the interfaces between Julia and Python and build a Python sample project that runs Julia code. Finally, we will benchmark our solution. After the tutorial you will be able to use Julia to speed up your Python code.
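One way to bridge the two languages (the tutorial's exact setup may differ) is the juliacall package from PythonCall.jl; a minimal sketch:

```python
import numpy as np
from juliacall import Main as jl  # pip install juliacall

# Define a Julia function and call it from Python.
jl.seval("""
function sumsq(xs)
    s = 0.0
    @inbounds @simd for x in xs
        s += x * x
    end
    return s
end
""")

xs = np.array([1.0, 2.0, 3.0])   # numpy arrays map cheaply to Julia arrays
print(jl.sumsq(xs))              # 14.0, computed by compiled Julia code
```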
🎤
From notebook to pipeline in no time with LineaPy
Speakers:
👤
Thomas Fraunholz
📅 Mon, 17 Apr 2023 at 10:50
The nightmare before data science production: you found a working prototype for your problem using a Jupyter notebook, and now it's time to build a production-grade solution from that notebook. Unfortunately, your notebook looks anything but production grade. The good news is, there's finally a cure! The open-source Python package LineaPy aims to automate data science workflow generation and expedite the process of going from data science development to production. And truly, it transforms messy notebooks into data pipelines for frameworks like Apache Airflow, DVC, Argo, Kubeflow, and many more. And if you can't find your favorite orchestration framework, you are welcome to work with the creators of LineaPy to contribute a plugin for it! In this talk, you will learn the basic concepts of LineaPy and how it supports your everyday tasks as a data practitioner. For this purpose, we will transform a notebook step by step together to create a DVC pipeline. Finally, we will discuss what place LineaPy will take in the MLOps universe. Will you only have to check in your notebook in the future?
The nightmare before data science production: you found a working prototype for your problem using a Jupyter notebook, and now it's time to build a production-grade solution from that notebook. Unfortunately, your notebook looks anything but production grade. You embark on a time-consuming journey of refactoring the notebook. You come across irrelevant and relevant code snippets that are scattered across different cells, but you persevere. Midway through your journey, you realize that your refactoring is not immune to the reproducibility issues caused by deleted cells and out-of-order cell executions. And we haven't even talked about the creation of a pipeline from that notebook yet! A desperate situation indeed. The good news is, there's finally a cure! The open-source Python package LineaPy aims to automate data science workflow generation and expedite the process of going from data science development to production. And truly, it transforms messy notebooks into data pipelines for frameworks like Apache Airflow, DVC, Argo, Kubeflow, and many more. And if you can't find your favorite orchestration framework, you are welcome to work with the creators of LineaPy to contribute a plugin for it! In this talk, you will learn the basic concepts of LineaPy and how it supports your everyday tasks as a data practitioner. For this purpose, we will transform a notebook step by step together to create a DVC pipeline. Finally, we will discuss what place LineaPy will take in the MLOps universe. Will you only have to check in your notebook in the future?
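The core API is small; here is a sketch of the typical flow, following LineaPy's documented save/to_pipeline pattern (artifact names, file names, and arguments are illustrative assumptions, not from the talk):

```python
import lineapy
import pandas as pd

# Inside the notebook: do your usual work ...
df = pd.read_csv("data.csv")     # placeholder dataset
summary = df.describe()          # stand-in for real feature engineering / training

# ... then mark the values you care about as artifacts.
lineapy.save(df, "cleaned_data")
lineapy.save(summary, "feature_summary")

# LineaPy slices the notebook down to the code these artifacts need
# and emits a runnable pipeline for the chosen framework.
lineapy.to_pipeline(
    artifacts=["cleaned_data", "feature_summary"],
    framework="DVC",
    pipeline_name="demo_pipeline",
    output_dir="./pipeline",
)
```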
🎤
Large Scale Feature Engineering and Data Science with Python & Snowflake
Speakers:
👤
Michael Gorkow
📅 Mon, 17 Apr 2023 at 11:40
[Snowflake](https://www.snowflake.com/en/) as a data platform is the core data repository of many large organizations. With the introduction of Snowflake's [Snowpark for Python](https://github.com/snowflakedb/snowpark-python), Python developers can now collaborate and build on one platform with a secure Python sandbox that provides dynamic scalability and elasticity as well as security and compliance. In this talk I'll explain the core concepts of Snowpark for Python and how they can be used for large-scale feature engineering and data science.
This talk is for technical people who would like a deep dive into how Snowflake enables large-scale feature engineering and data science via Snowpark for Python. During this talk we'll explore Snowflake's Python capabilities using a simple machine learning use case. After this talk you will:
* know how Snowpark avoids data movement and keeps existing security & governance intact,
* understand the concept of the Snowpark DataFrame API and how it enables accelerated performance compared to standard pandas DataFrames,
* know how to distribute hyperparameter tuning and the training of multiple models,
* understand the concept of vectorized User-Defined Functions and how they can be used to perform large-scale model inference.
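For a feel of the DataFrame API (connection parameters are placeholders; the table and column names are invented for illustration), a pushed-down aggregation might look like this:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

# Placeholder credentials; in practice these come from a secure config.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# The query is built lazily and executed inside Snowflake,
# so the raw data never leaves the platform.
features = (
    session.table("ORDERS")                      # hypothetical table
    .filter(col("AMOUNT") > 0)
    .group_by("CUSTOMER_ID")
    .agg(avg("AMOUNT").alias("AVG_AMOUNT"))
)
df = features.to_pandas()  # only the aggregated result is pulled locally
```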
🎤
AutoGluon: AutoML for Tabular, Multimodal and Time Series Data
Speakers:
👤
Caner Turkmen
👤
Oleksandr Shchur
📅 Mon, 17 Apr 2023 at 11:40
AutoML, or automated machine learning, offers the promise of transforming raw data into accurate predictions with minimal human intervention, expertise, and manual experimentation. In this talk, we will introduce AutoGluon, a cutting-edge toolkit that enables AutoML for tabular, multimodal and time series data. AutoGluon emphasizes usability, enabling a wide variety of tasks from regression to time series forecasting and image classification through a unified and intuitive API. We will specifically focus on tabular and time series tasks, where AutoGluon is the current state of the art, and demonstrate how AutoGluon can be used to achieve competitive performance on tabular and time series competition data sets. We will also discuss the techniques used to automatically build and train these models, peeking under the hood of AutoGluon.
[AutoGluon](http://auto.gluon.ai) is a Python machine learning library which offers cutting-edge accuracy and value-for-compute on a wide variety of tasks. These tasks include regression, classification and quantile regression on tabular data, as well as multimodal tasks such as image classification, image-to-text and text-to-text similarity. A recent addition to AutoGluon is AutoGluon-TimeSeries, the library's module for time series forecasting tasks. AutoGluon is organized into modules for tabular, multimodal and time series tasks, all of which share an intuitive scikit-learn-like API for fitting and performing inference with cutting-edge machine learning in as little as three lines of code, without requiring an in-depth understanding of ML. AutoGluon is widely considered the state of the art in tabular tasks, as confirmed by the independent [AutoML Benchmark](https://openml.github.io/automlbenchmark/papers.html), and is the current top performer on multimodal tasks on the RAFT leaderboard. In this talk, we will focus on the tabular and time series modules and showcase how the library can be used to get competitive results on competition platforms such as Kaggle. AutoGluon also differs quite significantly under the hood from other AutoML frameworks. The library does not take AutoML to primarily mean hyperparameter optimization, but leans heavily into building (stack) ensembles of strong but varied learning algorithms to achieve superior results. We will also showcase some of the theory and building blocks of AutoGluon, describing how we built an AutoML system that takes model ensembling as a central element.
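The "three lines of code" claim maps to an API like this (the dataset path and label column are placeholders):

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Placeholder CSV files with a "label" target column.
train = TabularDataset("train.csv")
predictor = TabularPredictor(label="label").fit(train)
predictions = predictor.predict(TabularDataset("test.csv"))
```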
🎤
Incorporating GPT-3 into practical NLP workflows
Speakers:
👤
Ines Montani
📅 Mon, 17 Apr 2023 at 11:40
In this talk, I'll show how large language models such as GPT-3 complement rather than replace existing machine learning workflows. Initial annotations are gathered from the OpenAI API via zero- or few-shot learning, and then corrected by a human decision maker using an annotation tool. The resulting annotations can then be used to train and evaluate models as normal. This process results in higher accuracy than can be achieved from the OpenAI API alone, with the added benefit that you'll own and control the model for runtime.
Software engineering is all about getting computers to do what we want them to do. As machine learning methods have improved, they've introduced a new way to specify the desired behaviour. Instead of writing code, you can prepare example data. Large language models are now starting to introduce a third option: instead of example data, you can provide a natural language prompt. Writing a prompt is far quicker than building a good set of training examples, but it's also a much less precise way to get the behaviour you want. There's also no reliable way to incrementally improve the results, even if better performance would be very valuable to you. Essentially, this new approach has a high floor, but a low ceiling. In this talk, I'll show how large language models such as GPT-3 complement rather than replace existing machine learning workflows. Initial annotations are gathered from the OpenAI API via zero- or few-shot learning, and then corrected by a human decision maker using the Prodigy annotation tool. The resulting annotations can then be used to train and evaluate models as normal. This process results in higher accuracy than can be achieved from the OpenAI API alone, with the added benefit that you'll own and control the model for runtime.
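A bootstrap step of this kind could look roughly like the following, using the OpenAI completions API as it existed in early 2023 (the prompt, labels, and model choice are illustrative, not the talk's actual recipe):

```python
import openai  # openai<1.0 style API, current at the time of the talk

openai.api_key = "sk-..."  # placeholder

LABELS = ["positive", "negative", "neutral"]

def zero_shot_label(text: str) -> str:
    """Ask the model for a draft label; a human corrects these later."""
    prompt = (
        f"Classify the following review as one of: {', '.join(LABELS)}.\n"
        f"Review: {text}\nLabel:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=3,
        temperature=0,
    )
    return response["choices"][0]["text"].strip().lower()

draft = zero_shot_label("The battery died after two days.")  # e.g. "negative"
```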
🎤
An unbiased evaluation of environment management and packaging tools
Speakers:
👤
Anna-Lena Popkes
📅 Mon, 17 Apr 2023 at 11:40
Python packaging is quickly evolving and new tools pop up on a regular basis. Lots of talks and posts on packaging exist but none of them give a structured, unbiased overview of the available tools. This talk will shed light on the jungle of packaging and environment management tools, comparing them on a basis of predefined features.
Python packaging is quickly evolving and new tools pop up on a regular basis. Lots of talks and posts on packaging exist but none of them give a structured, unbiased overview of the available tools. This talk will shed light on the jungle of packaging and environment management tools, comparing them on a basis of predefined features. We will categorize tools using the following categories:
- Python version management
- Environment management
- Package management
- Package building
- Package publishing
A lot of tools exist, including pyenv, pip, venv, poetry, hatch, and many more. We will categorize all of them and discuss some in more detail, e.g. hatch. Most importantly, we will evaluate the tools on the basis of features that are important for developers, like:
- Does the tool manage dependencies?
- Can it manage Python installations?
- Does it have a clean build/publish flow?
- Does it allow for plugins?
- Does it support important PEPs, e.g. PEP 660, PEP 621, PEP 582?
## Audience
This talk is intended for developers who
- Have used packaging and want to get to know new tools
- Want to have an overview of existing tools and their capabilities
## Existing talks on the topic of packaging
- PyCon US 2021: Jeremy Paige / Packaging Python in 2021
- PyCon US 2021 Tutorial: Bernát Gabor / Python Packaging Demystified
- EuroPython 2022: Packaging in Python in 2022
🎤
Hyperparameter optimization for the impatient
Speakers:
👤
Martin Wistuba
📅 Mon, 17 Apr 2023 at 11:40
In the last years, Hyperparameter Optimization (HPO) became a fundamental step in the training of Machine Learning (ML) models and in the creation of automatic ML pipelines. Unfortunately, while HPO improves the predictive performance of the final model, it comes with a significant cost both in terms of computational resources and waiting time. This leads many practitioners to try to lower the cost of HPO by employing unreliable heuristics. In this talk we will provide simple and practical algorithms for users who want to train models with almost-optimal predictive performance, while incurring a significantly lower cost and waiting time. The presented algorithms are agnostic to the application and the model being trained, so they can be useful in a wide range of scenarios. We provide results from an extensive experimental activity on public benchmarks, including comparisons with well-known techniques like Bayesian Optimization (BO), ASHA, and Successive Halving. We will describe in which scenarios the biggest gains are observed (up to 30x) and provide examples of how to use these algorithms in a real-world environment. All the code used for this talk is available on [GitHub](https://github.com/awslabs/syne-tune).
In this talk we will present simple and practical solutions to perform HPO quickly, with results on par with well-known (and costly) techniques. Our claims are supported by empirical evidence obtained on public standardized benchmarks, and our work has been accepted at peer-reviewed workshops (and is currently under submission to a conference). Specifically, [1] has been accepted at the AutoML Conference Workshop Track and [2] has been accepted at the AutoML workshop at ICML 2021. All the code regarding the algorithms is available in the Syne Tune package under the Apache 2.0 license (https://github.com/awslabs/syne-tune). References: [1] https://arxiv.org/abs/2207.06940 [2] https://arxiv.org/abs/2103.16111
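To give a feel for Syne Tune's API (the training script, search space, and scheduler choice here are illustrative placeholders following the package's documented examples, not the talk's recommended configuration):

```python
from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend import LocalBackend
from syne_tune.config_space import loguniform
from syne_tune.optimizer.baselines import ASHA

# Hypothetical training script that reports a validation metric per epoch.
config_space = {"lr": loguniform(1e-5, 1e-1), "epochs": 20}

tuner = Tuner(
    trial_backend=LocalBackend(entry_point="train.py"),
    scheduler=ASHA(
        config_space,
        metric="val_loss",
        resource_attr="epoch",
        max_resource_attr="epochs",
        mode="min",
    ),
    stop_criterion=StoppingCriterion(max_wallclock_time=600),
    n_workers=4,  # trials evaluated in parallel
)
tuner.run()
```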
🎤
Keynote - A journey through 4 industries with Python: Python's versatile problem-solving toolkit
Speakers:
👤
Susan Shu Chang
📅 Mon, 17 Apr 2023 at 13:55
In this keynote, I will share the lessons learned from using Python in 4 industries. Apart from machine learning applications that I build in my day to day as a data scientist and machine learning engineer, I also use Python to develop games for my own gaming company, Quill Game Studios. There is a lot of versatility in Python, and it's been my pleasure to use it to solve many interesting problems. I hope that this talk can give inspiration to various types of applications in your own industry as well.
🎤
Common issues with Time Series data and how to solve them
Speakers:
👤
Vadim Nelidov
📅 Mon, 17 Apr 2023 at 15:10
Time-series data is all around us: from logistics to digital marketing, from pricing to stock markets. It’s hard to imagine a modern business that has no time series data to forecast. However, mastering such forecasting is not an easy task. For this talk, together with other domain experts, I have collected a list of common time series issues that data professionals commonly run into. After this talk, you will learn to identify, understand, and resolve such issues. This will include stabilising divergent time series, organising delayed / irregular data, handling missing values without anomaly propagation, and reducing the impact of noise and outliers on your forecasting models.
This talk will walk you through 4 common issues with time series and illustrate them in the context of energy demand forecasting. For each of these issues you will learn to identify, understand, and resolve them better. These issues are: time series instability, delayed and irregular time series data, hard-to-impute missing values, and the impact of noise and outliers on forecasting models. The talk is therefore split into 4 parts, each with some room for questions. Each part will provide some high-level background, explanations, examples and code snippets, while avoiding unnecessary in-depth computations and formulas. The whole talk is therefore accessible both to specialists with experience in time series analytics and to those without such experience who nonetheless intend to broaden their understanding of this field and gain some valuable insights for the business problems they are likely to encounter in the future. Data scientists and analysts who work with time series data, understand at least the basics of the pandas and scikit-learn libraries, and know what a time series forecasting problem entails would benefit the most from this talk. However, other less technical specialists (management, product owners etc.) can still gain valuable domain knowledge in this field.
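As one illustrative example of the second and third issues (toy data; the talk's own examples use energy demand), regularizing an irregular series and imputing gaps without letting an outlier propagate might look like this:

```python
import pandas as pd

# Irregular, delayed measurements with a gap and one obvious outlier.
ts = pd.Series(
    [10.2, 10.4, 500.0, 10.1, 9.8],
    index=pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 00:07", "2023-01-01 00:19",
        "2023-01-01 01:02", "2023-01-01 01:11",
    ]),
)

# Clip outliers against a rolling median before imputing,
# so the spike does not leak into interpolated values.
med = ts.rolling(3, center=True, min_periods=1).median()
clean = ts.where((ts - med).abs() < 5 * med.abs(), med)

# Resample to a regular 15-minute grid and interpolate short gaps only.
regular = clean.resample("15min").mean().interpolate(limit=2)
print(regular)
```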
🎤
How to baseline in NLP and where to go from there
Speakers:
👤
Tobias Sterbak
📅 Mon, 17 Apr 2023 at 15:10
In this talk, we will explore the build-measure-learn paradigm and the role of baselines in natural language processing (NLP). We will cover the common NLP tasks of classification, clustering, search, and named entity recognition, and describe the baseline approaches that can be used for each task. We will also discuss how to move beyond these baselines through weak learning and transfer learning. By the end of this talk, attendees will have a better understanding of how to establish and improve upon baselines in NLP.
In this talk, we will explore the role of baselines in natural language processing (NLP) and discuss how to move beyond these baselines through weak learning and transfer learning. First, I will introduce the build-measure-learn paradigm, which is a framework for developing and improving products or systems. This paradigm involves building a solution, measuring its performance, and learning from the results to iteratively improve the solution. Baselines are an essential part of this process because they provide a starting point for comparison and a benchmark to measure against. Next, I will delve into the common NLP tasks of classification, clustering, search, and named entity recognition (NER). For each task, I will describe the baseline approaches that can be used. These baselines may not be the most advanced or sophisticated solutions, but they are often quick and easy to implement, and they can serve as a useful reference and guidance for further improvement. Finally, I will discuss how to move on from these baselines. One option is to use insights from the baselines to build a weak learning system, which is a machine learning model that relies on human-generated rules or patterns rather than a large dataset. Another option is to leverage transfer learning, which involves adapting a pre-trained model to a new task or domain by fine-tuning its parameters on a smaller dataset. In conclusion, this talk will provide a practical guide to establishing baselines in NLP and moving beyond them through weak learning and transfer learning.
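For the classification case, the cheapest possible baseline (a toy example, not from the talk) is worth writing down before anything else, since every later model has to beat it:

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["refund please", "great service", "broken on arrival", "works fine"]
labels = ["negative", "positive", "negative", "positive"]

# Majority-class baseline: the floor any real model must beat.
floor = DummyClassifier(strategy="most_frequent").fit(texts, labels)

# First "real" baseline: bag-of-words features + a linear model.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

print(floor.score(texts, labels), baseline.score(texts, labels))
```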
🎤
Exploring the Power of Cyclic Boosting: A Pure-Python, Explainable, and Efficient ML Method
Speakers:
👤
Felix Wick
📅 Mon, 17 Apr 2023 at 15:10
We have recently open-sourced a pure-Python implementation of Cyclic Boosting, a family of general-purpose, supervised machine learning algorithms. Its predictions are fully explainable on individual sample level, and yet Cyclic Boosting can deliver highly accurate and robust models. For this, it requires little hyperparameter tuning and minimal data pre-processing (including support for missing information and categorical variables of high cardinality), making it an ideal off-the-shelf method for structured, heterogeneous data sets. Furthermore, it is computationally inexpensive and fast, allowing for rapid improvement iterations. The modeling process, especially the infamous but unavoidable feature engineering, is facilitated by automatic creation of an extensive set of visualizations for data dependencies and training results. In this presentation, we will provide an overview of the inner workings of Cyclic Boosting, along with a few sample use cases, and demonstrate the usage of the new Python library. You can find Cyclic Boosting on GitHub: https://github.com/Blue-Yonder-OSS/cyclic-boosting
🎤
The CPU in your browser: WebAssembly demystified
Speakers:
👤
Antonio Cuni
📅 Mon, 17 Apr 2023 at 15:10
In recent years we have seen an explosion of usage of Python in the browser: Pyodide, CPython on WASM, PyScript, etc. All of this is possible thanks to the powerful functionality of the underlying platform, WebAssembly, which is essentially a virtual CPU inside the browser.
In recent years we have seen an explosion of usage of Python in the browser: Pyodide, CPython on WASM, PyScript, etc. All of this is possible thanks to the powerful functionality of the underlying platform, WebAssembly. In this talk we will examine what exactly WebAssembly is, what its strong and weak points are, what its limitations are, and what the future will bring us. We will also see why and how WebAssembly is useful and used outside the browser. This talk is targeted at an intermediate/advanced audience: no prior knowledge of WebAssembly is required, but a basic understanding of what a compiler and an interpreter are, and of the concept of bytecode, is required. The introduction will cover the basics to make sure that the talk is understandable also by people who are completely new to the WebAssembly world, but after that we will dive into the low-level technical details, with a special focus on those that are relevant to the Python world, such as WASI vs emscripten, dynamic linking, JIT compilation, interoperability with other languages, etc.
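As a small taste of WebAssembly outside the browser (not from the talk itself), the wasmtime Python package can compile and run a hand-written module; the WAT text below defines a single exported add function:

```python
from wasmtime import Engine, Instance, Module, Store  # pip install wasmtime

engine = Engine()
store = Store(engine)

# A minimal WebAssembly module in text format: one exported i32 add function.
module = Module(engine, """
(module
  (func (export "add") (param i32 i32) (result i32)
    local.get 0
    local.get 1
    i32.add))
""")

instance = Instance(store, module, [])
add = instance.exports(store)["add"]
print(add(store, 2, 3))  # 5, executed by the embedded WebAssembly runtime
```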
🎤
Staying Alert: How to Implement Continuous Testing for Machine Learning Models
Speakers:
👤
Emeli Dral
📅 Mon, 17 Apr 2023 at 15:10
Proper monitoring of machine learning models in production is essential to avoid performance issues. Setting up monitoring can be easy for a single model, but it often becomes challenging at scale or when you face alert fatigue based on many metrics and dashboards. In this talk, I will introduce the concept of test-based ML monitoring. I will explore how to prioritize metrics based on risks and model use cases, integrate checks in the prediction pipeline and standardize them across similar models and model lifecycle. I will also take an in-depth look at batch model monitoring architecture and the use of open-source tools for setup and analysis.
Have you ever deployed a machine learning model in production only to realize that it wasn't performing as well as you thought it would, or been late to detect a model performance drop caused by corrupted data? Proper monitoring can help avoid this. Typically, this involves checking the quality of the input data, monitoring the model's responses, and detecting any changes that might lead to model quality drops. However, setting up monitoring is often easier said than done. First, while it is easy to write a few assertions for data quality checks or track accuracy for a single model you created, it is much more challenging to do so consistently and at scale as the number of models, pipelines, and the volume of data increases. Second, building monitoring dashboards to track many metrics often leads to alert fatigue and does not help with root cause analysis of the problem. In this talk, I will introduce the idea of test-based ML monitoring and how it can help you keep your models in check in production. I will cover the following:
- The difference between testing and monitoring, and when one is better than the other
- How to prioritize metrics and tests for each model based on risks and model use cases
- How to integrate checks into the model prediction pipeline and standardize them across similar models and the model lifecycle
- An in-depth look at batch model monitoring architecture, including setup and analysis of results using open-source tools
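As an example of what a test-based check can look like in code, here is a minimal sketch using the open-source Evidently library, one possible tool for this (API as of early 2023; the data files and thresholds are placeholders):

```python
import pandas as pd
from evidently.test_suite import TestSuite
from evidently.test_preset import DataDriftTestPreset
from evidently.tests import TestShareOfMissingValues

# Placeholder data: a reference window and the current production batch.
reference = pd.read_csv("reference.csv")
current = pd.read_csv("current_batch.csv")

suite = TestSuite(tests=[
    TestShareOfMissingValues(lte=0.05),  # fail if >5% of values are missing
    DataDriftTestPreset(),               # per-column drift checks
])
suite.run(reference_data=reference, current_data=current)

# Pass/fail semantics integrate naturally into a prediction pipeline
# (result structure assumed from the library's docs at the time).
if not suite.as_dict()["summary"]["all_passed"]:
    raise RuntimeError("monitoring checks failed for this batch")
```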
🎤
Practical Session: Learning on Heterogeneous Graphs with PyG
Speakers:
👤
Ramona Bendias
👤
Matthias Fey
📅 Mon, 17 Apr 2023 at 15:10
Learn how to build and analyze heterogeneous graphs using PyG, a graph machine learning library in Python. This workshop will provide a practical introduction to the concept of heterogeneous graphs and their applications, including their ability to capture the complexity and diversity of real-world systems. Participants will gain experience in creating a heterogeneous graph from multiple data tables, preparing a dataset, and implementing and training a model using PyG.
Heterogeneous graphs are powerful tools for representing and analyzing complex systems. They are able to capture the complexity and diversity of data, provide more accurate and relevant insights, integrate multiple data sources, and support the development of sophisticated graph algorithms. In this workshop, we will use PyG, a graph machine learning library in Python, to build and analyze heterogeneous graphs. We will start with a discussion of the concept of heterogeneous graphs and their applications, and then move on to a practical session. Participants will learn how to create a heterogeneous graph from multiple data tables and use PyG to implement and train a model. By the end of the workshop, participants will have a solid understanding of the benefits and capabilities of heterogeneous graphs, as well as practical skills for building and analyzing them with PyG.
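A minimal sketch of the data structure involved (the node/edge types and feature sizes are invented for illustration):

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Two node types, each with its own feature matrix.
data["author"].x = torch.randn(100, 16)   # 100 authors, 16 features each
data["paper"].x = torch.randn(500, 32)    # 500 papers, 32 features each

# A typed edge: which author wrote which paper (COO edge index).
data["author", "writes", "paper"].edge_index = torch.tensor([
    [0, 0, 1, 2],   # author ids
    [0, 1, 1, 3],   # paper ids
])

print(data.metadata())  # node and edge types, as consumed by to_hetero()
```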
🎤
Raised by Pandas, striving for more: An opinionated introduction to Polars
Speakers:
👤
Nico Kreiling
📅 Mon, 17 Apr 2023 at 15:10
Pandas is the de-facto standard for data manipulation in Python, which I personally love for its flexible syntax and interoperability. But Pandas has well-known drawbacks such as memory inefficiency, inconsistent missing-data handling and lacking multicore support. Multiple open-source projects aim to solve these issues; the most interesting is Polars. Polars uses Rust and Apache Arrow to win all kinds of performance benchmarks and evolves fast. But is it already stable enough to migrate an existing Pandas codebase? And does it meet the high expectations of long-time Pandas lovers regarding query-language flexibility? In this talk, I will explain how Polars can be that fast, and present my insights on where Polars shines and in which scenarios I stay with Pandas (at least for now!)
Pandas and Polars are both popular open-source libraries for data manipulation and analysis in Python. While both libraries offer a range of powerful tools for working with data, there are several key differences that users should be aware of when choosing which library to use.

One of the main differences between Pandas and Polars is the way that they handle data processing and evaluation. Pandas uses a traditional, eager evaluation model, in which operations are immediately evaluated and the results are returned. In contrast, Polars offers optional lazy evaluation, which allows users to delay the evaluation of certain operations until they are actually needed. This can be especially useful for large or complex datasets, as it can improve performance by reducing the amount of data that needs to be processed at any given time.

Another key difference between the two libraries is the way they handle data storage and indexing. Pandas is built around a powerful indexing system that allows users to quickly access and manipulate specific rows or columns of data. However, this indexing system can be complex and can sometimes lead to slower performance. In contrast, Polars does not use indexes, which can simplify the underlying data structure and improve performance.

In terms of functionality, Pandas has a number of features that are not currently available in Polars. For example, Pandas offers built-in plotting functionality, which can be useful during exploratory data analysis for visualizing and interpreting data. Additionally, Pandas has a much stronger integration in the PyData ecosystem and is more widely used in data analysis and scientific computing. This can make it easier for users to find resources and support when working with Pandas.

One notable difference between the two libraries is the syntax and API. Polars is inspired by the popular distributed computing library Apache Spark, but uses a column-based API in contrast to the row-based API within Spark. Generally, the Polars syntax will be more familiar to Spark users.

Overall, both Pandas and Polars are powerful libraries with a lot to offer for data manipulation and analysis in Python. Which library is the best choice will depend on the specific needs and goals of the user. By understanding the differences between the two libraries, users can make an informed decision about which one is best suited for their needs.
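To make the eager-vs-lazy contrast concrete, here is a small sketch (file and column names invented; the groupby method naming has shifted between Polars versions):

```python
import polars as pl

# Lazy mode: scan_csv only reads metadata; the query below is planned,
# optimized (e.g. predicate pushdown), and executed only at .collect().
lazy = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 0)
    .groupby("region")   # renamed to `group_by` in newer Polars versions
    .agg(pl.col("amount").mean().alias("avg_amount"))
)
df = lazy.collect()      # execution happens here, in parallel across cores
```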
🎤
A concrete guide to time-series databases with Python
Speakers:
👤
Heiner Tholen
👤
Ellen König
📅 Mon, 17 Apr 2023 at 15:45
We evaluated time-series databases and complementary services to stream-process sensor data. In this talk, we will present our evaluation, show the final implementation, and share the Python tools we built and the lessons learned along the way.
Understanding time-series data is essential to handle automatically generated data, be it from server logs, IoT devices or any other continuous measurement. In order to handle the large amounts of incoming data from concrete mixing trucks, we evaluated a number of time-series databases as well as services to stream-process the data. For all of those decisions a key question was, of course, how well any of these tools integrate with our existing, all-Python backend. The right angle on time-series data will help you move tons of data with little engineering effort. In this talk, you’ll learn from our practical experiences of choosing and implementing a time-series database in a Python context. You’ll go away with a better understanding of how you can efficiently store, analyse and exploit streaming data.
🎤
Have your cake and eat it too: Rapid model development and stable, high-performance deployments
Speakers:
👤
Christian Bourjau
👤
Jakub Bachurski
📅 Mon, 17 Apr 2023 at 15:45
At the boundary of model development and MLOps lies the balance between the speed of deploying new models and ensuring operational constraints. These include factors like low-latency prediction, the absence of vulnerabilities in dependencies, and the need for model behavior to stay reproducible for years. The longer the list of constraints, the longer it usually takes to take a model from its development environment into production. In this talk, we present how we managed to square the circle and have both rapid, highly dynamic model development and a stable, high-performance deployment.
At QuantCo, we ship sklearn-based models in a real-time service that guarantees 24/7 uptime with low-latency (ms) responses. Simultaneously, we adhere to strict regulatory and security policies, where every model must remain available for 3-5 years while its dependencies are kept up-to-date. As the basis, we use ONNX as a technology to transform our dynamic Python pipelines into static, low-overhead model definitions. To ensure the cost of the model transformation does not slow down our data scientists, we have developed an open-source library named Spox to streamline these operations as much as possible. Combined with an apt model serving infrastructure, we can satisfy the needs of our data scientists (fast development and deployment) and those of corporate IT (vulnerability-free, year-long stability) without compromising efficiency.
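Spox exposes ONNX operators as ordinary Python functions; a minimal sketch following the pattern in the Spox documentation (the shapes and names are illustrative):

```python
import numpy as np
import spox.opset.ai.onnx.v17 as op
from spox import Tensor, argument, build

# Declare a typed graph input: a float32 vector of dynamic length N.
x = argument(Tensor(np.float32, ("N",)))

# Compose ONNX operators like normal Python calls.
y = op.mul(op.add(x, x), x)  # (x + x) * x

# Build a static, self-contained ONNX model from named inputs and outputs.
model = build(inputs={"x": x}, outputs={"y": y})
```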
🎤
Performing Root Cause Analysis with DoWhy, a Causal Machine-Learning Library
Speakers:
👤
Patrick Blöbaum
📅 Mon, 17 Apr 2023 at 15:45
In this talk, we will introduce the audience to [DoWhy](https://www.pywhy.org/dowhy), a library for causal machine-learning (ML). We will introduce typical problems where causal ML can be applied and will specifically do a deep dive on root cause analysis using DoWhy. To do this, we will lay out what typical problem spaces for causal ML look like, what kind of problems we're trying to solve, and then show how to use DoWhy's API to solve these problems. Expect to see a lot of code with a hands-on example. We will close this session by zooming out a bit and also talk about the PyWhy organization governing DoWhy.
_"Much like machine learning libraries have done for prediction, DoWhy is a Python library that aims to spark causal thinking and analysis. DoWhy provides a wide variety of algorithms for effect estimation, causal structure learning, diagnosis of causal structures, root cause analysis, interventions and counterfactuals."_ The field of causal machine-learning (ML) is not as well-known as typical machine-learning problems and libraries. DoWhy is one of the more popular open-source libraries for causal ML. And not for nothing: DoWhy is based on the two major scientific frameworks, Potential Outcome and Graphical Causal Models and offers a large variety of features. Problems where causal ML can be applied, come from any imaginable domain, be that distributed computer systems, supply chain, workflow management, manufacturing, etc. As long as a complex system can be represented as a causal graph, one can also apply causal ML. In the talk, we will specifically dive into a microservice architecture, as this is an example which an audience like the one at PyCon can most likely relate to. We will present some data and then inject outliers (or anomalies) into that data, see how those propagate through the system, and then use DoWhy's algorithms to show us the root cause. By the end of the talk, the audience should have a good understanding of typical problem domains for causal ML and a good sense of how to use DoWhy to solve such problems.
🎤
WALD: A Modern & Sustainable Analytics Stack
Speakers:
👤
Florian Wilhelm
📅 Mon, 17 Apr 2023 at 15:45
The name **WALD**-stack stems from the four technologies it is composed of, i.e. a cloud-computing **W**arehouse like Snowflake or Google BigQuery, the open-source data integration engine **A**irbyte, the open-source full-stack BI platform **L**ightdash, and the open-source data transformation tool **D**BT. Using a Formula 1 Grand Prix dataset, I will give an overview of how these four tools complement each other perfectly for analytics tasks in an ELT approach. You will learn the specific uses of each tool as well as their particular features. My talk is based on a full tutorial, which you can find under [waldstack.org](https://waldstack.org).
The current zeitgeist is that the data lake concept from classical data engineering and modern data warehousing from business intelligence are converging more and more. This is also driving the shift from ETL to ELT, and so tools such as [dbt] are becoming increasingly important in combination with modern Big Data warehouses such as [Snowflake] and [Google BigQuery]. For typical data and ML engineers, this is quite a departure from familiar tools like [Spark]. Having a pure Spark and ETL background myself, this trend motivated me to explore the foreign realms of ELT, data warehousing and especially the buzz around [dbt]. In this talk I want to share my key insights with classical data / ML engineers who might have only heard about [Snowflake], [dbt], [Airbyte] and [Lightdash] but have never cared to dig deeper. My talk is structured like this:
* a short introduction to the differences of data lake vs data warehouse, and ETL vs ELT
* a high-level introduction of Snowflake, Airbyte, dbt, and Lightdash
* a demonstration based on the [Kaggle Formula 1 World Championship dataset] to see those four tools in action
* my main take-aways and key insights
After this talk, you will have learned the differences between ETL & ELT, what these four tools do and in which cases you should consider the WALD-stack. Also, you will know how to use Python instead of SQL to define models in dbt, which is a brand-new feature. The WALD-stack is sustainable since it consists mainly of open-source technologies; however, all technologies are also offered as managed cloud services. The data warehouse itself, i.e. [Snowflake] or [Google BigQuery], is the only non-open-source technology in the WALD-stack. In my talk, I will focus on the open-source parts of the WALD-stack.
[dbt]: https://www.getdbt.com/
[Snowflake]: https://www.snowflake.com/
[Lightdash]: https://github.com/lightdash/lightdash
[Airbyte]: https://airbyte.com/
[Google BigQuery]: https://cloud.google.com/bigquery
[Kaggle Formula 1 World Championship dataset]: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020
[Spark]: https://spark.apache.org/
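For the brand-new feature mentioned above, a dbt Python model is just a file in the models/ directory defining a model() function. A sketch (the upstream model, table columns, and transformation are illustrative; on Snowflake, dbt hands the function a Snowpark session and DataFrames):

```python
# models/fastest_laps.py -- a dbt Python model (dbt >= 1.3)
def model(dbt, session):
    # Reference an upstream model, exactly like {{ ref(...) }} in SQL models.
    laps = dbt.ref("stg_lap_times")

    # On Snowflake this is a Snowpark DataFrame; convert to pandas if preferred.
    df = laps.to_pandas()

    # Arbitrary Python transformation, e.g. the fastest lap per race.
    result = df.loc[df.groupby("RACE_ID")["MILLISECONDS"].idxmin()]

    # Whatever is returned is materialized as a table in the warehouse.
    return result
```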
🎤
Polars - make the switch to lightning-fast dataframes
Speakers:
👤
Thomas Bierhance
📅 Mon, 17 Apr 2023 at 15:45
In this talk, we will report on our experiences switching from Pandas to Polars in a real-world ML project. Polars is a new high-performance dataframe library for Python, based on Apache Arrow and written in Rust. We will compare the performance of Polars with the popular Pandas library and show how Polars can provide significant speed improvements for data manipulation and analysis tasks. We will also discuss the unique features of Polars, such as its ability to handle large datasets that do not fit into memory, and how it feels in practice to make the switch from Pandas. This talk is aimed at data scientists, analysts, and anyone interested in fast and efficient data processing in Python.
The pandas library is one of the most widely used tools for working with data in the Python ecosystem. However, pandas can be slow for medium and larger datasets, and many users have been looking for faster alternatives. In this talk, we introduce the new Polars library, a high-performance dataframe library for Python based on Apache Arrow and written in Rust. We will report on our experiences switching from Pandas to Polars in a real-world ML project. We will compare the performance of Polars with Pandas across various use cases and show how Polars can provide significant speed improvements for common data manipulation and analysis tasks. Due to its speed, it can even be an alternative in cases where people normally use distributed systems like Spark. For example, we will demonstrate how Polars can process large datasets with minimal overhead, and how its massive use of parallelization can provide an additional speed boost. We will also discuss how Polars compares to other popular options like DuckDB and cuDF. This talk is aimed at data scientists, analysts, and anyone interested in fast and efficient data processing in Python. Whether you are a pandas user looking for a faster alternative, or a Spark user interested in a simpler alternative, this talk will provide valuable insights and practical examples.
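The larger-than-memory claim maps to Polars' streaming engine; a brief sketch (the file name is invented; the streaming flag reflects the API as of early 2023):

```python
import polars as pl

# Streaming execution processes the file in batches, so the full dataset
# never has to fit into RAM at once.
result = (
    pl.scan_parquet("events_larger_than_ram.parquet")
    .filter(pl.col("status") == "error")
    .groupby("service")   # `group_by` in newer Polars versions
    .agg(pl.count().alias("n_errors"))
    .collect(streaming=True)
)
```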
🎤
Driving down the Memray lane - Profiling your data science work
Speakers:
👤
Cheuk Ting Ho
📅 Mon, 17 Apr 2023 at 15:45
When handling a large amount of data, memory profiling the data science workflow becomes more important. It gives you insight into which processes consume lots of memory. In this talk, we will introduce Memray, a Python memory profiling tool, and its new Jupyter plugin.
In this talk, we will explore what memory profiling is and how it can help with data science work. We will start the talk with a basic explanation of how Python arranges memory for various objects. This lays the foundation for explaining why we need a special tool to memory-profile Python programs. Then we will go through a data science use case where we memory-profile part of the process with the Memray Jupyter plug-in. This will be a use case that a data science practitioner or learner is familiar with, so they can see how memory profiling can be useful. We will then explain how to interpret the flame graph in Memray, a commonly used diagram in memory profiling that shows how much memory a process and its sub-processes use. For a new user this can be hard to understand, leaving them unsure what to look at. From this example, audiences will see what they can learn from the flame graph.
## Goal
This talk is for data scientists, learners or anyone who is interested in memory profiling their Python program. Although the talk will be using a data science use case as an example, the explanation and the tool can be extended to any Python program. However, for data science practitioners and learners who have been using Python to process data, this may be a step forward for them to improve their data workflow and prevent memory leaks in their programs.
## Outline
- Introduction (5 mins)
- Why we need a special tool for memory profiling (5 mins)
- How to use Memray in Jupyter notebook (5 mins)
- Demonstration of using Memray in data science work (5 mins)
- How to interpret a flame graph (5 mins)
- Conclusion (5 mins)
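In a notebook, the plugin boils down to an IPython extension and a cell magic (as documented in Memray's Jupyter integration; the profiled cell body is a toy example):

```python
# Load the Memray extension once per notebook session.
%load_ext memray
```

```python
%%memray_flamegraph
# Everything in this cell is profiled; Memray renders a flame graph inline.
data = [bytes(1_000_000) for _ in range(100)]  # allocate roughly 100 MB
del data
```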
🎤
Specifying behavior with Protocols, Typeclasses or Traits. Who wears it better (Python, Scala 3, Rust)?
Speakers:
👤
Kolja Maier
📅 Mon, 17 Apr 2023 at 16:20
In this talk, we will explore the use of Python's `typing.Protocol`, Scala's Typeclasses, and Rust's Traits. They all offer a very powerful & elegant mechanism for abstracting over various concepts (such as Serialization) in a modular manner. We will compare and contrast the syntax and implementation of these constructs in each language and discuss their strengths and weaknesses. We will also look at real-world examples of how these features are used in each language to specify behavior, and consider differences in terms of type system expressiveness and effectiveness. By the end of the talk, attendees will have a better understanding of the differences and similarities between these three language features, and will be able to make informed decisions about which one is best suited for their needs.
Within simple applications, abstractions are only needed to a certain degree. E.g., why would someone need a complex class hierarchy if the task at hand could be solved more pragmatically? However, as applications and the business get more complex, abstractions can become crucial for improving the quality and maintainability of your code. With `typing.Protocol`, a great Python language feature was introduced which allows abstraction and modularization while also having static typing. This allows for very robust software development. How do other languages solve that problem? Besides `typing.Protocol`, we'll also dive into the world of Scala Typeclasses and Rust Traits, and explore how these features are used in each language to ensure the correctness and safety of code. All these mechanisms have in common that they specify behavior for types in a very flexible and safe manner.
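In Python, the mechanism looks like this: a `typing.Protocol` specifies required behavior structurally, and a static type checker verifies conformance without any inheritance (serialization chosen as the example, echoing the abstract):

```python
from typing import Protocol

class Serializable(Protocol):
    """Anything with a matching serialize() method satisfies this protocol."""
    def serialize(self) -> bytes: ...

class User:
    def __init__(self, name: str) -> None:
        self.name = name

    # No inheritance from Serializable needed: matching the shape is enough.
    def serialize(self) -> bytes:
        return self.name.encode("utf-8")

def store(obj: Serializable) -> None:
    payload = obj.serialize()
    print(f"storing {len(payload)} bytes")

store(User("Ada"))  # type-checks statically and runs fine
```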
🎤
FastAPI and Celery: Building Reliable Web Applications with TDD
Speakers:
👤
Avanindra Kumar Pandeya
📅 Mon, 17 Apr 2023 at 16:20
In this talk, we will explore how to use the FastAPI web framework and the Celery task queue to build reliable and scalable web applications in a test-driven manner. We will start by setting up a testing environment and writing unit tests for the core functionality of our application. Next, we will use FastAPI to create an API that performs a long-running task. Finally, we will see how Celery can help us offload long-running tasks and improve the performance of our application. By the end of this talk, attendees will have a strong understanding of TDD and how to apply it to FastAPI and Celery projects, and will be able to write tests that ensure the reliability and maintainability of their code.
1. Introduction (1 min)
   - Title of the talk and speaker's name: This section introduces the title of the talk, the speaker's name, and the speaker's current role.
   - Overview of the topics covered in the talk: This section introduces the main themes and goals of the talk, and gives the audience a sense of what they can expect to learn.
2. What is Test-Driven Development (TDD)? (2 min)
   - Definition of TDD and how it fits into the software development process: This section defines TDD and explains how it fits into the software development process. It will highlight the benefits of TDD such as improved quality, reduced debugging time, and faster development.
3. Setting up a dockerized Development Environment for a Math API (5 min)
   - Installing the necessary tools and libraries with Docker: This section covers the steps to install the necessary tools and libraries for testing, such as FastAPI, Celery, and a testing framework.
   - Setting up a testing database with Docker: This subsection explains how to set up a testing database (PostgreSQL) using Docker, including pulling the Docker image, running the container, and configuring the connection.
   - Configuring the application to use the testing database: This subsection covers the steps to configure the application to use the testing database during testing, for example via environment variables or config files that switch between databases.
   - Writing a basic test case: This subsection provides an example of a basic test case that verifies the setup of the testing environment, including a demonstration of running the test and checking the results.
4. Writing Unit Tests (7 min)
   - Identifying the core functionality and behavior of the application: This section discusses how to identify the core functionality and behavior of the application, and how to break it down into smaller pieces that can be tested separately. It includes tips on how to prioritize the tests and focus on the most important or risky areas of the code.
   - Writing test cases to cover the different scenarios and edge cases: This subsection covers the steps to write test cases for the core functionality of the application, with examples of different types of tests, such as positive, negative, and boundary tests.
   - Using mocks and fixtures to isolate the tests: This subsection explains how to use mocks and fixtures to isolate the tests from external dependencies and control the input and output, with examples of how to test different parts of the application in isolation.
5. Building the API with FastAPI and Celery (8 min)
   - Setting up a FastAPI application: This section introduces FastAPI and explains its key features and benefits, including a demonstration of how to build a simple API using TDD.
   - Setting up a Celery worker and task queue: This subsection explains how to set up a Celery worker and task queue, and how to configure the application to use them, including how to install Celery, create a Celery instance, and define the queue and backend.
   - Defining tasks as functions and decorating them with Celery's @task decorator: This subsection covers the steps to define tasks as functions and decorate them with Celery's @task decorator, with examples of how to define tasks and pass arguments and options to them.
   - Using the Celery client to trigger tasks and receive the results: This subsection explains how to use the Celery client to trigger tasks and receive the results, including how to send tasks, wait for the results, and handle errors and exceptions. (A minimal sketch of this pattern follows the outline below.)
6. Conclusion and Next Steps (2 min)
   - Recap of the main points and takeaways from the talk, highlighting the key skills and knowledge the attendees have learned.
   - Suggestions for further learning and resources, such as tutorials and documentation for TDD and FastAPI/Celery development.
   - Encouragement for attendees to apply these techniques to their own projects and to share their experiences and feedback with the community.
7. Question/Answer (5 min)
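To make the core pattern concrete, here is a minimal sketch of the FastAPI/Celery split (module layout, the Redis URL, and the `add` task are illustrative assumptions, not the talk's actual code):

```python
# tasks.py -- a Celery app with one long-running task.
from celery import Celery

celery_app = Celery("tasks", broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")

@celery_app.task
def add(x: int, y: int) -> int:
    return x + y


# main.py -- a FastAPI app that offloads work to the Celery worker.
from celery.result import AsyncResult
from fastapi import FastAPI

app = FastAPI()

@app.post("/add")
def submit_add(x: int, y: int) -> dict:
    result = add.delay(x, y)  # enqueue instead of blocking the request
    return {"task_id": result.id}

@app.get("/result/{task_id}")
def get_result(task_id: str) -> dict:
    result = AsyncResult(task_id, app=celery_app)
    return {"ready": result.ready(),
            "value": result.result if result.ready() else None}
```

A pytest unit test can run the task eagerly with `add.apply(args=(2, 3))`, or exercise the endpoints through FastAPI's `TestClient`.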
🎤
How to build observability into a ML Platform
Speakers:
👤
Alicia Bargar
📅 Mon, 17 Apr 2023 at 16:20
show details
As machine learning becomes more prevalent across nearly every business and industry, making sure that these technologies are working and delivering quality is critical. In her talk, Alicia will discuss the importance of machine learning observability and why it should be a fundamental tool of modern machine learning architectures. Not only does it ensure models are accurate, but it helps teams iterate and improve models quicker. Alicia will dive into how Shopify has been prototyping building observability into different parts of its machine learning platform. This talk will provide insights on how to track model performance, how to catch any unexpected or erroneous behaviour, what types of behavior to look for in your data (e.g. drift, quality metrics) and in your model/predictions, and how observability could work with large language models and Chat AIs.
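To make "drift" concrete, here is a small sketch of one commonly used signal, the population stability index (PSI), comparing a training distribution against production data (the binning choices and thresholds are illustrative conventions, not Shopify's implementation):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population stability index between two samples of one feature.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    e_pct = np.clip(e_counts / len(expected), 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(1)
train = rng.normal(0, 1, 10_000)
prod = rng.normal(0.5, 1, 10_000)  # production data shifted by half a sigma
print(psi(train, prod))            # moderate-to-major drift
```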
🎤
BHAD: Explainable unsupervised anomaly detection using Bayesian histograms
Speakers:
👤
Alexander Vosseler
📅 Mon, 17 Apr 2023 at 16:20
show details
The detection of outliers or anomalous data patterns is one of the most prominent machine learning use cases in industrial applications. I present a Bayesian histogram anomaly detector (BHAD), where the number of bins is treated as an additional unknown model parameter with an assigned prior distribution. BHAD scales linearly with the sample size and enables a straightforward explanation of individual scores, which makes it very suitable for industrial applications when model interpretability is crucial. I study the predictive performance of the proposed BHAD algorithm against various state-of-the-art anomaly detection approaches using simulated data and popular benchmark datasets for outlier detection. The reported results indicate that BHAD has very competitive predictive accuracy compared to other more complex and computationally more expensive algorithms, while being explainable and fast.
I present an unsupervised and explainable Bayesian anomaly detection algorithm. For this I consider the posterior predictive distribution of a Categorical-Dirichlet model and use it to construct a Bayesian histogram-based anomaly detector (BHAD). BHAD scales linearly with the size of the data and allows a direct explanation of individual anomaly scores due to its simple linear functional form, which makes it very suitable for practical applications when model interpretability is crucial. Based on simulated data and popular benchmark datasets for outlier detection, I analyze the predictive performance of the candidate models and also compare them with outlier ensemble approaches. The results suggest that the proposed BHAD model has very competitive performance compared to other more complex models like variational autoencoders; in fact, it is among the best-performing candidates while offering individual and global model explainability.
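To illustrate the core idea, here is a heavily simplified numpy sketch of histogram-based scoring with a Categorical-Dirichlet posterior predictive (a fixed number of bins and a toy dataset; not the actual BHAD package, which also treats the bin count as a model parameter):

```python
import numpy as np

def histogram_anomaly_scores(X: np.ndarray, n_bins: int = 20,
                             alpha: float = 1.0) -> np.ndarray:
    """Score per row: sum over features of the log posterior predictive
    probability of the sample's bin. Lower (more negative) = more anomalous."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        # Categorical-Dirichlet posterior predictive: (n_k + alpha) / (N + K * alpha)
        probs = (counts + alpha) / (n + n_bins * alpha)
        # Map each sample to its bin index (interior edges only).
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        scores += np.log(probs[idx])
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[0] = [8.0, -8.0, 8.0]                        # an obvious outlier
print(histogram_anomaly_scores(X).argmin())    # -> 0
```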
🎤
Building a Personal Assistant With GPT and Haystack: How to Feed Facts to Large Language Models and Reduce Hallucination.
Speakers:
👤
Mathis Lucka
📅 Mon, 17 Apr 2023 at 16:20
show details
Large Language Models (LLMs), like ChatGPT, have shown miraculous performance on various tasks. But there are still unsolved issues with these models: they can be confidently wrong and their knowledge becomes outdated. GPT also does not have any of the information that you have stored in your own data. In this talk, you'll learn how to use Haystack, an open source framework, to chain LLMs with other models and components to overcome these issues. We will build a practical application using these techniques. And you will walk away with a deeper understanding of how to use LLMs to build NLP products that work.
You can apply LLMs to solve various NLP and NLU tasks, such as summarization or question answering. These models have billions of parameters they can use to effectively store some of the information they saw during pre-training. This enables them to show deep knowledge of a subject, even if they weren't explicitly trained on it. Yet, this capability also comes with issues. The information stored in the parameters can’t easily be updated, and the model's knowledge might become stale. The model won’t have any of your custom data, your company’s knowledge base for example. Sometimes, the model makes things up. We call that hallucination. Cases of hallucination can be hard to spot. The model may be very confident while making up a response. It may even make up fake citations and research papers to support its claims. Haystack is an open source NLP framework for pragmatic builders. Developers use it to build NLP applications, such as question answering systems, neural search engines, or summarization services. Haystack provides all the components you need to build an actual NLP application, which differentiates it from other NLP frameworks. It provides document conversion, pre-processing, data storage, vector databases, and model inference. It also wraps all these components in a neat pipeline abstraction. You can use a pipeline to run your application as a reliable and scalable service in production. In this talk, machine learning engineers, data scientists, and NLP developers will learn how Haystack integrates with LLMs, such as GPT-3. We will show how to use the pipeline abstraction and retrieval-augmented generation to address issues like stale knowledge and hallucination. We will also provide a practical example by showing how to create a personal assistant for knowledge workers. Each step will be accompanied by open source code examples. By the end of the talk, you will have seen these concepts applied in practice and you will be able to build an assistant for your own use case.
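As a rough sketch of what a retrieval-augmented pipeline looks like in Haystack 1.x (exact class and template names changed between versions, and the documents, model choice, and API key are placeholders):

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, PromptNode
from haystack.pipelines import Pipeline

# A toy document store standing in for your company knowledge base.
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([
    {"content": "PyData Berlin 2023 takes place in April."},
    {"content": "Haystack chains retrievers and LLMs into pipelines."},
])

retriever = BM25Retriever(document_store=document_store, top_k=3)
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",       # assumption: an OpenAI backend
    api_key="YOUR_OPENAI_KEY",                # placeholder
    default_prompt_template="deepset/question-answering",  # template name varies by version
)

# Query -> Retriever -> PromptNode: retrieved documents ground the LLM's answer.
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

result = pipeline.run(query="When does PyData Berlin 2023 take place?")
```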
🎤
Keynote - How Are We Managing? Data Teams Management IRL
Speakers:
👤
Noa Tamir
📅 Tue, 18 Apr 2023 at 09:15
show details
The title “Data Scientist” has been in use for 15 years now. We have been attending PyData conferences for over 10 years as well. The hype around data science and AI seems higher than ever before. But how are we managing?
Most of our conferences are about practical applications, methodologies, and platforms. In this talk, I want to focus on contemporary data science management. Including: - Our patterns and antipatterns. - The challenges we are facing as individual contributors, teams, managers, and leaders. - How the data science function has matured. - The unique aspects of Data Science compared to management in general, and software engineering in particular. If you are a data scientist, or work with some of us, you might be interested to learn about what makes us tick, what makes us great colleagues, and yes, even what makes us challenging to work with 😉.
🎤
Aspect-oriented Programming - Diving deep into Decorators
Speakers:
👤
Mike Müller
📅 Tue, 18 Apr 2023 at 10:30
show details
The aspect-oriented programming paradigm can support the separation of cross-cutting concerns such as logging, caching, or checking of permissions. This can improve code modularity and maintainability. Python offers decorators to implement reusable code for cross-cutting tasks. This tutorial is an in-depth introduction to decorators. It covers the usage of decorators and how to implement simple and more advanced decorators. Use cases demonstrate how to work with decorators. In addition to showing how functions can use closures to create decorators, the tutorial introduces callable class instances as an alternative. Class decorators can solve problems that used to be tasks for metaclasses. The tutorial provides use cases for class decorators. While the focus is on best practices and practical applications, the tutorial also provides deeper insight into how Python works behind the scenes. After the tutorial participants will feel comfortable with functions that take functions and return new functions.
## Audience This tutorial is for intermediate Python programmers who want to dive deeper. Solid working knowledge of the basics of functions and classes is required. ## Format The tutorial will be hands-on. I will start with a blank notebook for each topic and develop the content step by step. The participants are encouraged to type along. My typing speed is usually appropriate and allows participants to follow. The students will receive a comprehensive PDF with all course content as well as Python source code files for all use cases and large code blocks I use. I will load these files in my notebook. The students can do the same or open the files in their preferred editor or IDE. I also explicitly ask for feedback if I am too fast or things are unclear. I encourage questions at any time. In fact, questions and my answers are often an important part of my teaching, making the learning experience much more lively and typically more useful. So the participants will be active throughout the whole tutorial. There will be two exercises that each participant has to do on their own during the tutorial (or in breakout rooms if the tutorial is remote). We will look at the solutions during the tutorial. I also supply a solutions PDF after the tutorial. ## Outline * Examples of using decorators * from the standard library * from third-party packages * Closures for decorators * Write a simple decorator * Best practice * Use case: Caching * Use case: Logging * Parameterizing decorators * Chaining decorators * Callable instances instead of functions * Use case: Argument checking * Use case: Registration * Class decorators * Wrap-up and questions
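For readers new to the topic, here is a minimal sketch of the kind of decorator the tutorial builds up to (a hypothetical `timed` decorator, not necessarily the tutorial's exact code):

```python
import functools
import time


def timed(func):
    """Report how long each call to `func` takes."""
    @functools.wraps(func)  # preserve __name__, __doc__, etc. of the wrapped function
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper


@timed
def slow_sum(n):
    return sum(range(n))


slow_sum(1_000_000)  # prints something like: slow_sum took 0.0183s
```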
🎤
The State of Production Machine Learning in 2023
Speakers:
👤
Alejandro Saucedo
📅 Tue, 18 Apr 2023 at 10:30
show details
As the number of production machine learning use cases increases, we find ourselves facing new and bigger challenges where more is at stake. Because of this, it's critical to identify the key areas to focus our efforts on, so we can ensure our machine learning pipelines are reliable and scalable. In this talk we dive into the state of production machine learning in the Python ecosystem, and we will cover the concepts that make production machine learning so challenging, as well as some of the recommended tools available to tackle these challenges. This talk will cover key principles, patterns and frameworks around the open source frameworks powering single or multiple phases of the end-to-end ML lifecycle, including model training, deployment, and monitoring. We will give a high-level overview of the production ML ecosystem and dive into best practices that have been abstracted from production use cases of machine learning operations at scale, as well as how to leverage tools that will allow us to deploy, explain, secure, monitor and scale production machine learning systems.
As the number of production machine learning use cases increases, we find ourselves facing new and bigger challenges where more is at stake. Because of this, it's critical to identify the key areas to focus our efforts on, so we can ensure our machine learning pipelines are reliable and scalable. In this talk we dive into the state of production machine learning in the Python ecosystem, and we will cover the concepts that make production machine learning so challenging, as well as some of the recommended tools available to tackle these challenges. This talk will cover key principles, patterns and frameworks around the open source frameworks powering single or multiple phases of the end-to-end ML lifecycle, including model training, deployment, and monitoring. We will give a high-level overview of the production ML ecosystem and dive into best practices that have been abstracted from production use cases of machine learning operations at scale, as well as how to leverage tools that will allow us to deploy, explain, secure, monitor and scale production machine learning systems. This talk will be relevant for keen Python practitioners and seasoned ML practitioners interested in an updated overview of the state of the production ML ecosystem, covering a broad range of sub-fields in the space. It will benefit the Python ecosystem by providing cross-functional knowledge, bringing together best practices from data scientists, software engineers and DevOps engineers to tackle the challenge of machine learning at scale. During this talk we will shed light on some of the more popular and up-and-coming libraries to watch in this space, and we will provide a conceptual and practical hands-on deep dive that will allow the community both to tackle these issues and to further the discussion.
🎤
What could possibly go wrong? - An incomplete guide on how to prevent, detect & mitigate biases in data products
Speakers:
👤
Lea Petters
📅 Tue, 18 Apr 2023 at 10:30
show details
Within this talk, I want to look at the topic of data ethics with a practical lens and facilitate the discussion about how we can establish ethical data practices in our day-to-day work. I will shed some light on the multiple sources of biases in data applications: Where are potential pitfalls and how can we prevent, detect and mitigate them early so they never become a risk for our data product. I will walk you through the different stages of a data product lifecycle and dive deeper into the questions we as data professionals have to ask ourselves throughout the process. Furthermore, I will present methods, tools and libraries that can support our work. Being well aware that there is no universal solution as tools and strategies need to be chosen to specifically address requirements of the use-case and models at hand, my talk will provide a good starting point for your own data ethics journey.
Terms like trustworthy, responsible or ethical AI have been popular buzzwords for some time. But while we've seen some startling examples of ‘AI gone wrong’, such as when Facebook falsely classified black persons as ‘Primates’, Amazon’s hiring algorithm discriminated against women, or the A-level algorithmic grading fiasco in the UK, for many data projects ethical considerations only come into play as an afterthought - if at all. Experience has shown that more accountability and transparency are needed in AI systems, and regulatory initiatives such as the EU AI Act make it increasingly important to treat the topic as a first-class citizen throughout the whole development process. While the implementation of legal initiatives and ethics guidelines raises awareness and brings the topic into focus, it often remains quite abstract and difficult to translate into our day-to-day work. Therefore, I want to look at the topic with a practical lens and facilitate the discussion about how we can establish ethical data practices. I will shed some light on the multiple sources of biases in data applications: Where are potential pitfalls and how can we prevent, detect and mitigate them early so they never become a risk for our data product. I will walk you through the different stages of a data product lifecycle and dive deeper into the questions we as data professionals have to ask ourselves throughout the process. Furthermore, I will present methods, tools and libraries that can support our work. Being well aware that there is no universal solution as tools and strategies need to be chosen to specifically address requirements of the use-case and models at hand, my talk will provide a good starting point for your own data ethics journey.
🎤
Geospatial Data Processing with Python: A Comprehensive Tutorial
Speakers:
👤
Martin Christen
📅 Tue, 18 Apr 2023 at 10:30
show details
In this tutorial, you will learn about the various Python modules for processing geospatial data, including GDAL, Rasterio, Pyproj, Shapely, Folium, Fiona, OSMnx, Libpysal, Geopandas, Pydeck, Whitebox, ESDA, and Leaflet. You will gain hands-on experience working with real-world geospatial data and learn how to perform tasks such as reading and writing spatial data, reprojecting data, performing spatial analyses, and creating interactive maps. This tutorial is suitable for beginners as well as intermediate Python users who want to expand their knowledge in the field of geospatial data processing.
Geospatial data, which refers to data that has a geographic component, is a crucial part of many fields, including geography, urban planning, and environmental science. In this tutorial, you will learn about the various Python modules that are available for working with geospatial data. We will start by introducing the **GDAL** (Geospatial Data Abstraction Library) and **Rasterio** modules, which are used for reading and writing raster data (data stored in a grid of cells, where each cell has a value). You will learn how to read and write common raster formats such as GeoTIFF and ESRI ASCII, as well as how to perform common raster operations such as resampling and reprojecting. Next, we will cover the **Pyproj** module, which is used for performing coordinate system transformations. You will learn how to convert between different coordinate systems and how to perform common tasks such as converting latitude and longitude coordinates to UTM (Universal Transverse Mercator) coordinates. After that, we will introduce the **Shapely** module, which is used for working with geometric objects in Python. You will learn how to create and manipulate points, lines, and polygons, as well as how to perform spatial operations such as intersection and union. Then, we will cover the **Folium** module, which is used for creating interactive maps in Python. You will learn how to create simple maps, add markers and popups, and customize the appearance of your maps. Next, we will introduce the **Fiona** module, which is used for reading and writing vector data (data stored as individual features, each with its own geometry and attributes). You will learn how to read and write common vector formats such as ESRI Shapefile and GeoJSON, as well as how to access and manipulate the attributes of vector features. After that, we will cover the **OSMnx** module, which is used for working with OpenStreetMap data in Python. You will learn how to download and manipulate street networks, buildings, and other geospatial data from OpenStreetMap. Next, we will introduce the **Libpysal** module, which is used for performing spatial statistics and econometrics in Python. You will learn how to calculate spatial weights, perform spatial autocorrelation tests, and estimate spatial econometric models. Then, we will cover the **Geopandas** module, which is used for working with geospatial data in a Pandas DataFrame. You will learn how to load and manipulate vector data, perform spatial joins, and create choropleth maps. After that, we will introduce the **Pydeck** module, which is used for creating interactive 3D maps in Python. You will learn how to create 3D point clouds, 3D building models, and other 3D geospatial visualizations. Next, we will cover the **Whitebox** module, which is a powerful open-source GIS toolkit for performing geospatial data processing and analysis. You will learn how to use Whitebox to perform tasks such as raster reclassification, terrain analysis, and hydrological modeling. Finally, we will introduce the **ESDA** (Exploratory Spatial Data Analysis) and **LeafMap** modules, which are used for exploring and visualizing spatial patterns and relationships in data. You will learn how to calculate spatial statistics such as Moran's I and local spatial autocorrelation statistics, and how to create interactive choropleth maps.
By the end of this tutorial, you will have a solid understanding of the various Python modules that are available for working with geospatial data and will have hands-on experience applying these tools to real-world data. This tutorial is suitable for beginners as well as intermediate Python users who want to expand their knowledge in the field of geospatial data processing.
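As a flavor of the hands-on parts, here is a small sketch combining Shapely geometries, GeoPandas, and a Pyproj-backed reprojection (coordinates and the output file name are illustrative):

```python
import geopandas as gpd
from shapely.geometry import Point

# Build a GeoDataFrame from plain coordinates (WGS84 lon/lat).
cities = gpd.GeoDataFrame(
    {"name": ["Berlin", "Basel"]},
    geometry=[Point(13.405, 52.52), Point(7.588, 47.559)],
    crs="EPSG:4326",
)

# Reproject to UTM zone 32N so that distances are in meters.
cities_utm = cities.to_crs("EPSG:32632")
print(cities_utm.distance(cities_utm.geometry.iloc[0]))

# Buffer each city by 10 km and write the result to GeoJSON.
cities_utm.buffer(10_000).to_file("buffers.geojson", driver="GeoJSON")
```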
🎤
Bayesian Marketing Science: Solving Marketing's 3 Biggest Problems
Speakers:
👤
Dr. Thomas Wiecki
📅 Tue, 18 Apr 2023 at 10:30
show details
In this talk I will present two new open-source packages that make up a powerful and state-of-the-art marketing analytics toolbox. Specifically, PyMC-Marketing is a new library built on top of the popular Bayesian modeling library PyMC. PyMC-Marketing allows robust estimation of customer acquisition costs (via media mix modeling) as well as customer lifetime value. In addition, I will show how we can estimate the effectiveness of marketing campaigns using a new Bayesian causal inference package called CausalPy. The talk will be applied with a real-world case-study and many code examples. Special emphasis will be placed on the interplay between these tools and how they can be combined together to make optimal marketing budget decisions in complex scenarios.
Marketing data science attempts to answer three main questions: 1. How much does it cost to acquire a customer on a given channel? 2. How much do I earn from an acquired customer over their lifetime? 3. What is the causal impact of my marketing campaigns? While seemingly straightforward, robust estimation of these quantities on noisy, non-stationary and highly structured data is quite tricky. Moreover, while these questions are intimately related, they are often answered separately. In this talk I will present two new open-source packages that make up a powerful and state-of-the-art marketing analytics toolbox. Specifically, PyMC-Marketing is a new library built on top of the popular Bayesian modeling library PyMC. PyMC-Marketing allows robust estimation of customer acquisition costs (via media mix modeling) as well as customer lifetime value. In addition, I will show how we can estimate the effectiveness of marketing campaigns using a new Bayesian causal inference package called CausalPy. The talk will be applied, with a real-world case study and many code examples. Special emphasis will be placed on the interplay between these tools and how they can be combined. Together, the tools demonstrated provide a powerful open-source suite to solve today's biggest marketing analytics challenges.
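To give a flavor of the modeling style, here is a heavily simplified media mix model written in plain PyMC (a sketch of the general idea with synthetic data and a `tanh` saturation transform; this is not the PyMC-Marketing API):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)
n_weeks = 104
spend = rng.gamma(2.0, 1.0, size=(n_weeks, 2))  # weekly spend on two channels
sales = (3 + 1.5 * np.tanh(spend[:, 0]) + 0.8 * np.tanh(spend[:, 1])
         + rng.normal(0, 0.3, n_weeks))

with pm.Model() as mmm:
    intercept = pm.Normal("intercept", 0, 5)
    beta = pm.HalfNormal("beta", 2, shape=2)  # channel effects are non-negative
    # tanh as a simple saturation transform: returns diminish with spend.
    mu = intercept + (beta * pm.math.tanh(spend)).sum(axis=-1)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("sales", mu=mu, sigma=sigma, observed=sales)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=42)
```

The posterior over `beta` then quantifies, with uncertainty, how much each channel contributes per unit of (saturated) spend.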
🎤
Software Design Pattern for Data Science
Speakers:
👤
Theodore Meynard
📅 Tue, 18 Apr 2023 at 10:30
show details
Even if every data science work is special, a lot can be learned from similar problems solved in the past. In this talk, I will share some specific software design concepts that data scientists can use to build better data products.
Data science has evolved from magic models measured by accuracy to software components with an ML core. As such, data scientists’ work should also follow best practices and have a suitable architecture. This is where design patterns can help advance the discipline. A design pattern is a reusable solution to a commonly occurring problem. It is not a concrete piece of code that can be used directly, but identifying a pattern helps us understand the problem and build a common language around it. In this talk, I will share some specific software design concepts that data scientists can use to build better data products. I will not focus on patterns that will improve the performance of your model (you can already find a lot about that online) but on the ones that will help you bring your model to production.
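As one concrete illustration (an invented example in the spirit of the talk, not its actual content), the strategy pattern lets a data product swap models without touching the surrounding serving code:

```python
from typing import Protocol, Sequence


class Model(Protocol):
    def predict(self, features: Sequence[float]) -> float: ...


class BaselineMean:
    def predict(self, features: Sequence[float]) -> float:
        return 0.0  # a trivial fallback model


class LinearModel:
    def __init__(self, weights: Sequence[float]) -> None:
        self.weights = list(weights)

    def predict(self, features: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, features))


def serve(model: Model, features: Sequence[float]) -> float:
    # The serving code depends only on the Model interface,
    # so models can be exchanged or A/B-tested freely.
    return model.predict(features)


print(serve(LinearModel([0.5, 2.0]), [1.0, 3.0]))  # 6.5
print(serve(BaselineMean(), [1.0, 3.0]))           # 0.0
```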
🎤
Improving Machine Learning from Human Feedback
Speakers:
👤
Erin Mikail Staples
👤
Nikolai
📅 Tue, 18 Apr 2023 at 10:30
show details
Large generative models rely upon massive data sets that are collected automatically. For example, GPT-3 was trained with data from “Common Crawl” and “Web Text”, among other sources. As the saying goes — bigger isn’t always better. While powerful, these data sets (and the models that they create) often come at a cost, bringing their “internet-scale biases” along with their “internet-trained models.” While powerful, these models beg the question — is unsupervised learning the best future for machine learning? ML researchers have developed new model-tuning techniques to address the known biases within existing models and improve their performance (as measured by response preference, truthfulness, toxicity, and result generalization). All of this at a fraction of the initial training cost. In this talk, we will explore these techniques, known as Reinforcement Learning from Human Feedback (RLHF), and how open-source machine learning tools like PyTorch and Label Studio can be used to tune off-the-shelf models using direct human feedback.
Large generative models rely upon massive data sets that are collected automatically. For example, GPT-3 was trained with data from “Common Crawl” and “Web Text”, among other sources. As the saying goes — bigger isn’t always better. While powerful, these data sets (and the models that they create) often come at a cost, bringing their “internet-scale biases” along with their “internet-trained models.” While powerful, these models beg the question — is unsupervised learning the best future for machine learning? ML researchers have developed new model-tuning techniques to address the known biases within existing models and improve the model’s performance (as measured by response preference, truthfulness, toxicity, and result generalization). All of this at a fraction of the initial training cost. This talk will explore these Reinforcement Learning from Human Feedback (RLHF) techniques and how open-source machine learning tools like PyTorch and Label Studio can tune off-the-shelf models using direct human feedback. We’ll start by covering traditional RLHF, in which a model is given a set of prompts to generate outputs. These prompt/output pairs are then graded by human annotators, who rank pairs according to a desired metric; the rankings are then used as a reinforcement learning data set to optimize the model to produce results closer to the metric criteria. Next, we’ll discuss recent advances within this field and the advantages they provide. One advance we’ll dive into is the use of Human Language Feedback, in which ranks are replaced with human-language summaries that take full advantage of the “full expressiveness of language that humans use.” This contextual feedback, along with the original prompt and output of the model, is used to generate a new set of model refinements. The model is then tuned with these refinements to match the new output to the human feedback. In a 2022 study, researchers at NYU reported that “using only 100 samples of human-written feedback finetunes a GPT-3 model to roughly human-level summarization ability.” It’s advances like these that are providing advantages in terms of accuracy and bias reduction. Finally, we’ll leave you with examples and resources on implementing these training methods using publicly available models and open-source tools like PyTorch and Label Studio to help retrain models for targeted applications. As this industry continues to grow, evolve, and develop into more widespread applications, we must approach this space with ethics and sustainability in mind. By combining the power and expansiveness of these widely-popular “internet-scale models” with specific, targeted, human approaches, we can avoid the “internet-scale biases” that threaten the legitimacy and trustworthiness of the industry as a whole.
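To make the reward-modeling step concrete, here is a minimal PyTorch sketch of the pairwise ranking loss used to train a reward model from human preference pairs (toy random tensors stand in for real response embeddings):

```python
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps a (pooled) response embedding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)


model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: embeddings of the human-preferred and rejected responses.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Bradley-Terry-style objective: preferred responses should score higher.
loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
optimizer.step()
```

The trained reward model then scores new generations, and an RL algorithm such as PPO tunes the language model to maximize that reward.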
🎤
Rusty Python: A Case Study
Speakers:
👤
Robin Raymond
📅 Tue, 18 Apr 2023 at 11:05
show details
Python is a very expressive and powerful language, but it is not always the fastest option for performance-critical parts of an application. Rust, on the other hand, is known for its lightning-fast runtime and low-level control, making it an attractive option for speeding up performance-sensitive portions of Python programs. In this talk, we will present a case study of using Rust to speed up a critical component of a Python application. We will cover the following topics: * An overview of Rust and its benefits for Python developers * Profiling and identifying performance bottlenecks in a Python application * Implementing a solution in Rust and integrating it with the Python application using PyO3 * Measuring the performance improvements and comparing them to other optimization techniques Attendees will learn about the potential for using Rust to boost the performance of their Python programs and how to go about doing so in their own projects.
# Context In the past, C and C++ were the go-to languages for optimizing Python code while still maintaining a high-level interface. This approach was used by well-known numerical libraries such as NumPy and Pandas. However, with the increasing popularity of Rust and the emergence of PyO3, this is no longer the only solution available. Rust's impressive performance and expressive syntax, combined with its comprehensive library ecosystem, make it a viable alternative for optimizing performance-sensitive parts of Python applications. Additionally, Rust's mature support for asynchronous programming gives it an advantage over C foreign function interfaces when interacting with Python coroutines. Some library maintainers are even considering using Rust for their projects, such as Pydantic, which is implementing version 2 in Rust and achieving similar speed improvements to those obtained using C. # Timeplan In minutes * 0-2: Welcome, explanation of the title * 2-7: What is Rust and how is it different from other "bare metal" languages * 7-10: Introducing the case study, running the code, getting a feel for its performance * 10-15: Profiling the code, finding the bottleneck * 15-17: Introducing PyO3 * 17-22: Walking through the Rust code that optimizes the bottleneck * 22-25: Running the code live, showing the speedup * 25-28: Extensions provided by PyO3, caveats, and what code might not be a good target for optimization; tradeoffs compared to other foreign function interfaces * 28-30: Buffer / Q&A
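As a taste of the profiling step, here is a minimal sketch using the standard library's cProfile to locate a hot spot before rewriting it in Rust (the `slow_pairwise` function is a made-up stand-in for the case study's bottleneck):

```python
import cProfile
import pstats


def slow_pairwise(points):
    """O(n^2) pure-Python distance sum: a typical candidate for a Rust rewrite."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        for x2, y2 in points[i + 1:]:
            total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total


points = [(i * 0.1, i * 0.2) for i in range(1000)]
cProfile.run("slow_pairwise(points)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)
```

Once PyO3 comes in, the Rust rewrite is compiled into an extension module and imported like any other Python module.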
🎤
How Chatbots work – We need to talk!
Speakers:
👤
Yuqiong Weng
👤
Katrin Reininger
📅 Tue, 18 Apr 2023 at 11:05
show details
Chatbots are fun to use, ranging from simple chit-chat (“How are you today?”) to more sophisticated use cases like shopping assistants, or the diagnosis of technical or medical problems. Despite their mostly simple user interaction, chatbots must combine various complex NLP concepts to deliver convincing, intelligent, or even witty results. With the advancing development of machine learning models and the availability of open source frameworks and libraries, chatbots are becoming more powerful every day and at the same time easier to implement. Yet, depending on the concrete use case, the implementation must be approached in specific ways. In the design process of chatbots it is crucial to define the language processing tasks thoroughly and to choose from a variety of techniques wisely. In this talk, we will look together at common concepts and techniques in modern chatbot implementation as well as practical experiences from an E-mobility bot that was developed using the Rasa framework.
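For a flavor of how Rasa-based bots are extended in Python, here is a minimal custom action sketch (the action name, slot, and message are invented for illustration, not taken from the E-mobility bot):

```python
from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher


class ActionFindChargingStation(Action):
    """A hypothetical custom action for an e-mobility bot."""

    def name(self) -> Text:
        return "action_find_charging_station"

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        city = tracker.get_slot("city")  # assumes a 'city' slot in the domain
        dispatcher.utter_message(text=f"Looking for charging stations in {city}...")
        return []
```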
🎤
BLE and Python: How to build a simple BLE project on Linux with Python
Speakers:
👤
Bruno Vollmer
📅 Tue, 18 Apr 2023 at 11:05
show details
Bluetooth Low Energy (BLE) is a part of the Bluetooth standard aimed at bringing wireless technology to low-power devices, and it's getting into everything - lightbulbs, robots, personal health and fitness devices, and plenty more. One of the main advantages of BLE is that everybody can integrate those devices into their tools or projects. However, BLE is not the most developer-friendly protocol and these devices most of the time don't come with good documentation. In addition, there are not a lot of good open-source tools, examples, and tutorials on how to use Python with BLE. Especially if one wants to build both sides of the communication. In this talk, I will introduce the concepts and properties used in BLE interactions and look at how we can use the Linux Bluetooth Stack (Bluez) to communicate with other devices. We will look at a simple example and learn along the way about common pitfalls and debugging options while working with BLE and Python. This talk is for everybody that has a basic understanding of Python and wants to have a deeper understanding of how BLE works and how one could use it in a private project.
Slides can be found here: https://drive.google.com/file/d/1rDkSKriobmW71ZMYU6pqdx7Yal1eUgXm/view?usp=sharing The problem that this talk addresses is the difficulty of using Bluetooth Low Energy (BLE) with Python, particularly for those who are new to the protocol. One issue is that BLE is not necessarily beginner-friendly, with a steep learning curve that can be intimidating for those who are just starting out. Additionally, there are not many examples available for creating a BLE server using Python, which makes it difficult for people to learn and understand the process. This is most likely due to the fact that writing a BLE (GATT) server is often only done in professional contexts. Finally, one has to interact with the system's Bluetooth stack, which adds complexity, particularly on Linux where the use of DBus is required. Overall, these challenges can make it difficult for people to effectively use BLE and Python together. The problem of using BLE with Python is relevant to the audience because BLE is a widely used technology that allows users to add a variety of peripherals to their projects, both personal and professional. Over the years, more and more devices have come to support configuration or use through BLE. For example, BLE is often used in home automation systems, wearable devices, and Internet of Things (IoT) applications. By understanding how to use BLE with Python, the audience can take advantage of the many possibilities that this technology offers and create innovative projects that leverage the capabilities of many different types of BLE devices. In this talk, I will introduce the different technologies that are involved in using BLE with Python, including BLE itself, Bluez (the Linux Bluetooth stack), and DBus (a software system for inter-process communication). This is followed by a showcase of a simple GATT server example using Python, which will demonstrate how to use these technologies effectively. In addition to this, I will explain a possible development process for creating BLE projects with Python, including debugging tools and common pitfalls to avoid. Finally, I will point the audience toward further resources that they can use to continue learning about BLE and Python and to help them get started with their own projects.
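For the client side of such a project, here is a minimal sketch using the cross-platform `bleak` library (the talk's GATT server side uses BlueZ and DBus directly; the characteristic UUID below is the standard Bluetooth battery-level UUID):

```python
import asyncio

from bleak import BleakClient, BleakScanner

BATTERY_UUID = "00002a19-0000-1000-8000-00805f9b34fb"  # standard battery level


async def main() -> None:
    # Scan for nearby BLE devices for five seconds.
    devices = await BleakScanner.discover(timeout=5.0)
    for d in devices:
        print(d.address, d.name)

    if devices:
        # Connect to the first device found and read one characteristic.
        async with BleakClient(devices[0].address) as client:
            value = await client.read_gatt_char(BATTERY_UUID)
            print("Battery level:", int(value[0]))


asyncio.run(main())
```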
🎤
“Who is an NLP expert?” - Lessons Learned from building an in-house QA-system
Speakers:
👤
Nico Kreiling
👤
Alina Bickel
📅 Tue, 18 Apr 2023 at 11:05
show details
Innovations such as sentence-transformers, neural search and vector databases have fueled a very fast development of question-answering systems recently. At scieneers, we wanted to test those components to satisfy our own information needs using a slack-bot that answers our questions by reading through our internal documents and slack conversations. We therefore leveraged the HayStack QA framework in combination with a Weaviate vector database and many fine-tuned NLP models. This talk will give you insights into both the technical challenges we faced and the organizational lessons we learned.
🎤
Actionable Machine Learning in the Browser with PyScript
Speakers:
👤
Valerio Maggio
📅 Tue, 18 Apr 2023 at 11:05
show details
PyScript brings the full PyData stack to the browser, opening up unprecedented use cases for interactive data-intensive applications. In this scenario, the web browser becomes a ubiquitous computing platform, operating within a (nearly) _zero-installation_ & _server-less_ environment. In this talk, we will explore how to create full-fledged interactive front-end machine learning applications using PyScript. We will dive into the main features of the PyScript platform (e.g. _built-in Javascript integration_ and _local modules_), discussing _new_ data & design patterns (e.g. _loading heterogeneous data in the browser_) required to adapt to and overcome the limitations imposed by the new operating environment (i.e. the browser).
PyScript is the new open source platform that brings Python to web front-end applications. In fact, PyScript makes it possible to inject *standard* Python code into HTML, which is then _interpreted_ and _executed_ directly in the browser. And all that, with **no server-side** technology needed, and **no installation** required (_not even a local Python interpreter!, ed._) 🔮. But there's more! Thanks to its built-in integration with [`pyodide`](https://pyodide.org/en/stable/), PyScript brings the [full](https://pyodide.org/en/stable/usage/packages-in-pyodide.html) PyData stack into the browser, along with a native integration with the Javascript interpreter, enabling full support for front-end interactivity. As a result, PyScript has the potential to radically change the way in which interactive data-driven web apps are designed and developed: the seamless bi-directional integration of **Python** and **Javascript** is complemented by full support for reliable numerical computation, enabled by the Python scientific ecosystem (e.g. `numpy`, `scikit-learn`), using the browser as a ubiquitous virtual machine. In this talk, we will explore how PyScript enables the creation of full-fledged front-end _interactive machine learning_ (`ML`) apps. Multiple examples of supervised and unsupervised ML apps will be presented and analysed in detail, in order to fully understand how PyScript works and what key features are provided (e.g. _built-in Javascript integration_; _local modules_). Similarly, we will also discuss _new_ data & design patterns (e.g. _loading heterogeneous data in the browser_; _multi-core vs multi-threading_; _performance considerations_) which are required to adapt to the new _atypical_ environment in which we operate: the **browser**. No specific prior knowledge is required to attend the talk. Familiarity with Python programming and the main `pydata` packages (i.e. `numpy`, `scikit-learn`, `Matplotlib`) is desirable, along with a general understanding of how the web DOM works (for the Javascript interaction part) and basic principles of data processing. **Domain** knowledge: _Novice_; **Python** knowledge: _Intermediate_
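As a minimal flavor (PyScript's tag names and APIs changed across early releases, so treat this as an assumption-laden sketch): the Python below would live inside a `<py-script>` tag in an HTML page that loads PyScript, and it reaches the browser DOM through Pyodide's `js` bridge; the `output` element id is a placeholder:

```python
# Runs in the browser via PyScript/Pyodide, inside a <py-script> tag.
import js  # the JavaScript world, exposed to Python by Pyodide
import numpy as np

# Tiny "ML" in the browser: fit a least-squares line with numpy.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + np.random.default_rng(0).normal(0, 0.5, 10)
slope, intercept = np.polyfit(x, y, deg=1)

# Write the result into a DOM element with id="output".
js.document.getElementById("output").innerText = (
    f"fitted line: y = {slope:.2f}x + {intercept:.2f}"
)
```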
🎤
How Python enables future computer chips
Speakers:
👤
Tim Hoffmann
📅 Tue, 18 Apr 2023 at 11:40
show details
At the semiconductor division of Carl Zeiss it's our mission to continuously make computer chips faster and more energy efficient. To do so, we go to the very limits of what is possible, both physically and technologically. This is only possible through massive research and development efforts. In this talk, we tell the story how Python became a central tool for our R&D activities. This includes technical aspects as well as organization and culture. How do you make sure that hundreds of people work in consistent environments? – How do you get all people on board to work together with Python? – You have lots of domain experts without much software background. How do you prevent them from creating a mess when projects get larger?
🎤
Using transformers – a drama in 512 tokens
Speakers:
👤
Marianne Stecklina
📅 Tue, 18 Apr 2023 at 11:40
show details
“Got an NLP problem nowadays? Use transformers! Just download a pretrained model from the hub!” - every blog article ever As if it’s that easy, because nearly all pretrained models have a very annoying limitation: they can only process short input sequences. Not every NLP practitioner happens to work on tweets, but instead many of us have to deal with longer input sequences. What started as a minor design choice for BERT, got cemented by the research community over the years and now turns out to be my biggest headache: the 512 tokens limit. In this talk, we’ll ask a lot of dumb questions and get an equal number of unsatisfying answers: 1. How much text actually fits into 512 tokens? Spoiler: not enough to solve my use case, and I bet a lot of your use cases, too. 2. I can feed a sequence of any length into an RNN, why do transformers even have a limit? We’ll look into the architecture in more detail to understand that. 3. Somebody smart must have thought about this sequence length issue before, or not? Prepare yourself for a rant about benchmarks in NLP research. 4. So what can we do to handle longer input sequences? Enjoy my collection of mediocre workarounds.
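One workaround in the spirit of the talk, sketched with Hugging Face tokenizers (the model choice and stride are illustrative): split a long document into overlapping 512-token windows and process each chunk separately:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_text = "your very long document " * 2000

# Overlapping windows: each chunk holds up to 512 tokens, with 128 tokens
# shared between neighbours to preserve some context across the cut.
enc = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    stride=128,
    return_overflowing_tokens=True,
)

print(len(enc["input_ids"]))     # number of chunks
print(len(enc["input_ids"][0]))  # 512 tokens each
```

Each chunk is then fed through the model, and the per-chunk outputs are aggregated (e.g. max or mean pooling over chunk predictions).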
🎤
Maps with Django
Speakers:
👤
Paolo Melchiorre
📅 Tue, 18 Apr 2023 at 11:40
show details
Keeping in mind the **Pythonic** principle that _“simple is better than complex”_ we'll see how to create a web **map** with the **Python** based _web framework_ **Django** using its **GeoDjango** module, storing _geographic data_ in your _local database_ on which to run _geospatial queries_.
A *map* in a website is the best way to make geographic data easily accessible to users, because it represents, in a simple way, the information relating to a specific geographical area, and it is in fact used by many online services. Implementing a web *map* can be complex, and many adopt the strategy of using external services, but in most cases this strategy turns out to be a major data and cost management problem. In this talk we'll see how to create a web *map* with the **Python**-based web framework **Django** using its **GeoDjango** module, storing geographic data in your local database and running geospatial queries on it. Through this talk you'll learn how to add a *map* to your website, starting from a simple *map* based on **Spatialite/SQLite** up to a more complex and interactive *map* based on **PostGIS/PostgreSQL**.
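For a taste of GeoDjango (the `Shop` model is an invented example, and the code assumes a Django project configured with `django.contrib.gis` and a spatial database backend), a model with a geographic field and a radius query might look like this:

```python
# models.py -- requires django.contrib.gis and a spatial database backend
from django.contrib.gis.db import models


class Shop(models.Model):
    name = models.CharField(max_length=100)
    location = models.PointField()


# Somewhere in a view: all shops within 5 km of a point.
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D

here = Point(13.405, 52.52, srid=4326)  # lon/lat for Berlin
nearby = Shop.objects.filter(location__distance_lte=(here, D(km=5)))
```

Spatialite/SQLite covers simple cases like this; PostGIS/PostgreSQL unlocks the full range of geospatial lookups.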
🎤
Observability for Distributed Computing with Dask
Speakers:
👤
Hendrik Makait
📅 Tue, 18 Apr 2023 at 11:40
show details
Debugging is hard. Distributed debugging is hell. Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease. However, when things go wrong life can become difficult due to all of the moving parts. These parts include your code, other PyData libraries like NumPy/pandas, the machines you’re running on, the network between them, storage, the cloud, and of course issues with Dask itself. It can be difficult to understand what is going on, especially when things seem slower than they should be or fail unexpectedly. Observability is the key to sanity and success. In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild. This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.
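As a concrete starting point, here is a small sketch of Dask's built-in diagnostics (the computation is a toy example; a local cluster stands in for a real deployment):

```python
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # local cluster; the dashboard URL is available as client.dashboard_link

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Capture scheduler and worker diagnostics for this block into an HTML report.
with performance_report(filename="dask-report.html"):
    result = (x @ x.T).mean().compute()

print(result)
print(client.get_worker_logs())  # recent log lines from each worker
```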
🎤
5 Things about fastAPI I wish we had known beforehand
Speakers:
👤
Alexander CS Hendorf
📅 Tue, 18 Apr 2023 at 11:40
show details
An exchange of views on fastAPI in practice. FastAPI is great, it helps many developers create REST APIs based on the OpenAPI standard and run them asynchronously. It has a thriving community and educational documentation. FastAPI does a great job of getting people started with APIs quickly. This talk will point out some obstacles and dark spots that I wish we had known about before. In this talk we want to highlight solutions.
An exchange of views on fastAPI in practice. FastAPI is great, it helps many developers create REST APIs based on the OpenAPI standard and run them asynchronously. It has a thriving community and educational documentation. FastAPI does a great job of getting people started with APIs quickly. This talk will point out some obstacles and dark spots that I wish we had known about before. In this talk we want to highlight solutions. This talk will include the following: ### fastAPI is built on the shoulders of giants I: [pydantic](https://docs.pydantic.dev/) FastAPI makes extensive use of [pydantic](https://docs.pydantic.dev/). [pydantic](https://docs.pydantic.dev/) parses data, can validate (and transform) data, and has built-in interfaces to export OpenAPI definitions among many other features. ### fastAPI is built on the shoulders of giants II: [starlette](https://www.starlette.io) Routes and middleware are managed by [starlette](https://www.starlette.io). In this section we will explore how to create custom middleware and what we learned along the way. ### fastAPI has tutorials, but is this documentation? The fastAPI page provides a good introduction. The more we worked with fastAPI, the harder it was to find accurate documentation. Looking at the source code, we really missed DocStrings! Introspection to the rescue - this will probably include a rant about missing DocStrings! ### DRY ("Don't repeat yourself") with pydantic For our use case, we decided to use strict models to validate our data structures, as we work in a highly regulated industry where no mistakes are allowed to happen. Setting up the REST API was much easier than developing consistent models that generalise well. We follow the "single source of truth" paradigm; entering redundant definitions is an absolute no-go. In this section we show how to create highly reusable pydantic model pools with inheritance for use in fastAPI. For testing, we also created models from metadata! ### "The road not taken": pydantic Depends()! API routes often consist of a request model and a response model. But what about cases where the models alone don't work and a model and e.g. query parameters need to be mixed? Apart from flake8 complaining about having callables in the signature, this can be quite a difficult use case. Strategies for resolving model/parameter conflicts. Bonus - if time: ### Integrating fastAPI with Sphinx. Demonstrate how to integrate OpenAPI with your Sphinx documentation. The talk will show how fastAPI is built and how well introspection can help you understand what is going on under the hood and which library is actually doing the heavy lifting where.
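As a small illustration of the `Depends()` section (a sketch, not the talk's code): a pydantic model can double as a container for query parameters via `Depends`, keeping parameter definitions in one place:

```python
from fastapi import Depends, FastAPI
from pydantic import BaseModel

app = FastAPI()


class Paging(BaseModel):
    offset: int = 0
    limit: int = 10


@app.get("/items")
def list_items(paging: Paging = Depends()) -> dict:
    # FastAPI builds Paging from the query string: /items?offset=20&limit=5
    return {"offset": paging.offset, "limit": paging.limit}
```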
🎤
Keynote - Towards Learned Database Systems
Speakers:
👤
Carsten Binnig
📅 Tue, 18 Apr 2023 at 13:15
show details
Database Management Systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. For providing high performance, many of the most complex DBMS components such as query optimizers or schedulers involve solving non-trivial problems. To tackle such problems, very recent work has outlined a new direction of so-called learned DBMSs where core parts of DBMSs are being replaced by machine learning (ML) models which has shown to provide significant performance benefits. However, a major drawback of the current approaches to enabling learned DBMS components is that they not only cause very high overhead for training an ML model to replace a DBMS component but that the overhead occurs repeatedly which renders these approaches far from practical. Hence, in this talk, I present my vision of Learned DBMS Components 2.0 to tackle these issues. First, I will introduce data-driven learning where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications such as cardinality estimation or approximate query processing, many DBMS tasks such as physical cost estimation cannot be supported. I thus propose a second technique called zero-shot learning which is a general paradigm for learned DBMS components. Here, the idea is to train models that generalize to unseen data sets out of the box.
Database Management Systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. For providing high performance, many of the most complex DBMS components such as query optimizers or schedulers involve solving non-trivial problems. To tackle such problems, very recent work has outlined a new direction of so-called learned DBMSs where core parts of DBMSs are being replaced by machine learning (ML) models which has shown to provide significant performance benefits. However, a major drawback of the current approaches to enabling learned DBMS components is that they not only cause very high overhead for training an ML model to replace a DBMS component but that the overhead occurs repeatedly which renders these approaches far from practical. Hence, in this talk, I present my vision of Learned DBMS Components 2.0 to tackle these issues. First, I will introduce data-driven learning where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications such as cardinality estimation or approximate query processing, many DBMS tasks such as physical cost estimation cannot be supported. I thus propose a second technique called zero-shot learning which is a general paradigm for learned DBMS components. Here, the idea is to train models that generalize to unseen data sets out of the box. The idea is to train a model that has observed a variety of workloads on different data sets and can thus generalize. Initial results on the task of physical cost estimates suggest the feasibility of this approach. Finally, I discuss further opportunities which are enabled by zero-shot learning.
🎤
Data Kata: Ensemble programming with Pydantic #1
Speakers:
👤
Lev Konstantinovskiy
👤
Gregor Riegler
👤
Nitsan Avni
📅 Tue, 18 Apr 2023 at 14:05
show details
Write code as an ensemble to solve a data validation problem with Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.
The How: We will play a "collaborative game" - write code together to solve a problem. Each small group of 5, an "ensemble", will be guided by a facilitator. An ensemble has only one screen and one keyboard, so participants rotate the roles of typing and talking. The goals are to have fun, learn how to use Pydantic, write better code with Test Driven Development, listen to colleagues, make typos in front of everyone, become a supportive team, defend our ideas and sometimes even accept criticism. Exercise: "Read data from a CSV and check data types, range of values, consistency between columns using Pydantic." See data and starting code in the [repo](https://github.com/tmylk/data-kata/tree/main/validation/pydantic). This is part 1 out of 2 of our data validation tutorial. Part 2 is doing the same task using a different Python framework - Pandera instead of Pydantic. You can attend both or just one part of this tutorial. Format: - Ensemble programming with a facilitator. We will all collaborate as one team, switching the person on the keyboard every 5 mins. - You don't need to have any previous experience with ensemble programming to join. - You don't need to have any previous experience with data validation to join. Schedule: - Intros - 10 mins - Ensemble programming - 30 mins - Interim Retrospective - 10 mins - Ensemble programming - 30 mins - Final Retrospective - 10 mins - Closing Things to note: - We will use gitpod.io as a shared VS Code IDE work environment
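A minimal sketch of the kind of validation the exercise asks for, in pydantic v1 style (the CSV columns here are hypothetical, not the tutorial's actual data):

```python
import csv
from pydantic import BaseModel, validator

class Trip(BaseModel):
    start_km: float
    end_km: float

    @validator("start_km", "end_km")
    def non_negative(cls, v):
        # range-of-values check
        if v < 0:
            raise ValueError("distances must be non-negative")
        return v

    @validator("end_km")
    def end_after_start(cls, v, values):
        # cross-column consistency check
        if "start_km" in values and v < values["start_km"]:
            raise ValueError("end_km must be >= start_km")
        return v

with open("trips.csv") as f:
    trips = [Trip(**row) for row in csv.DictReader(f)]
```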
🎤
Let's contribute to pandas (3 hours) #1
Speakers:
👤
Noa Tamir
👤
Patrick Hoefler
📅 Tue, 18 Apr 2023 at 14:05
show details
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people. pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted! If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html .
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people. pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted! If you don’t finish your contribution during the event, we hope you will continue to work on it after the workshop. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html . ❓Any other requirements ❓ 1. Bring your own laptop 2. Have a GitHub account: https://github.com 3. Have git installed: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git • Format for the session: First 15 minutes : an introduction - what you can contribute, how to contribute, and how to set up your development environment or use gitpod; The rest : "office hours", during which you'll be mentored through setting up a development environment and making a contribution to pandas. • Preparation (optional) For those who are more keen on using the workshop to work on their contribution to pandas, you may want to start setting up your development environment in advance. This way, by the time you arrive you are ready to get started on picking issues, and starting to contribute. Please be aware that it could take longer to set up a development environment on a computer running a Windows operating system compared to macOS or Unix. We will guide you through the steps, and they are useful to learn for many open source projects. We also offer a development environment on gitpod. It can take some minutes to load, but provides you with an instant and fresh development environment for each new task directly from your browser, using VS Code. Documentation is in the works and will be provided before the workshop. To get the most out of the session, it's encouraged (but not required) that you have a look at the contributing guide beforehand: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html. Particularly, the development environment instructions: https://pandas.pydata.org/docs/dev/development/contributing_environment.html • Audience level Everyone is welcome to attend this session! If you've never contributed to open source software before, then you will learn how to, and if you have experience contributing, then you can either help mentor other attendees or you can work on more challenging contributions. It is useful to have some pandas, git, and Python experience. If you don't have much experience with them, you might expect to spend time "learning by doing".
🎤
Pragmatic ways of using Rust in your data project
Speakers:
👤
Christopher Prohm
📅 Tue, 18 Apr 2023 at 14:10
show details
Writing efficient data pipelines in Python can be tricky. The standard recommendation is to use vectorized functions implemented in NumPy, pandas, or the like. However, what to do when the processing task does not fit these libraries? Using plain Python for processing can result in poor performance, in particular when handling large data sets. Rust is a modern, performance-oriented programming language that is already widely used by the Python community. Augmenting data processing steps with Rust can result in substantial speed-ups. In this talk, I will present strategies for using Rust in a larger Python data processing pipeline, with a particular focus on pragmatism and minimizing integration effort.
One common strategy is to wrap the Rust part as a Python extension module. With enough care, the extension module can have a pythonic feel and substantially improve performance. While libraries such as PyO3 offer streamlined APIs, this task can still require a lot of work. An often simpler alternative is to package the Rust part as an executable and communicate via files or the network. This talk will focus on JSON messages exchanged via stdin / stdout or dataframe-like data in Arrow-compatible files. JSON is broadly supported in both Python and Rust, and serialization can easily be handled with libraries such as Serde (Rust) or cattrs (Python). The Arrow in-memory format supports complex data types, such as structs, lists, maps, or unions. These files can then be efficiently processed in Python by an ever-growing list of libraries, most prominently pandas and Polars. I will discuss the different strategies using real-world use cases and offer tips on how to implement them. Finally, I will summarize the respective strengths and weaknesses of the approaches.
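A minimal Python-side sketch of the executable-plus-JSON strategy (`./transform` is a hypothetical Rust binary that reads a JSON object on stdin and writes a JSON result to stdout):

```python
import json
import subprocess

payload = json.dumps({"values": [1.0, 2.0, 3.0]})

# Hand the data to the Rust executable and read its JSON reply back
result = subprocess.run(
    ["./transform"],
    input=payload,
    capture_output=True,
    text=True,
    check=True,
)
processed = json.loads(result.stdout)
print(processed)
```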
🎤
Getting started with JAX
Speakers:
👤
Simon Pressler
📅 Tue, 18 Apr 2023 at 14:10
show details
DeepMind's JAX ecosystem provides deep learning practitioners with an appealing alternative to TensorFlow and PyTorch. Among its strengths are great functionalities such as native TPU support, as well as easy vectorization and parallelization. Nevertheless, making your first steps in JAX can feel complicated given some of its idiosyncrasies. This talk helps new users get started in this promising ecosystem by sharing practical tips and best practices.
DeepMind's JAX ecosystem provides deep learning practitioners with an appealing alternative to TensorFlow and PyTorch. Among its strengths are great functionalities such as native TPU support, as well as easy vectorization and parallelization, which make JAX and its ecosystem an attractive option for your deep learning projects. Nevertheless, making your first steps can feel complicated. From pure functions and the resulting differences in coding style, to avoiding recompilation, JAX comes with its own set of restrictions and design decisions to be taken by the user. This talk wants to help new and prospective users in their JAX learning journey by providing guidance on practical problems they are likely to encounter when transitioning into the JAX ecosystem. Having recently switched to JAX and Flax for my daily work, I will share some of the insights I gained and help new users avoid some of the mistakes I made early on. The talk takes a systematic look at selected situations in which JAX presents users with choices, examining how the options differ and which one is best under different circumstances. The talk covers: - Why bother switching to JAX? - A brief introduction to JAX including a list of JAX’s idiosyncrasies - Pure functions and the resulting architectural decisions - To JIT or not to JIT - A speed and memory comparison of the different iteration options - Memory management and profiling
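For readers new to JAX, a minimal sketch of the pure-function style and jax.jit (the model and data are illustrative only):

```python
import jax
import jax.numpy as jnp

@jax.jit  # traced and compiled on the first call; retraced when shapes or dtypes change
def mse(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

x = jnp.ones((32, 4))
y = jnp.zeros((32,))
w = jnp.zeros((4,))

grad_fn = jax.jit(jax.grad(mse))  # gradient of a pure function, also JIT-compiled
print(mse(w, x, y), grad_fn(w, x, y))
```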
🎤
Data-driven design for the Dask scheduler
Speakers:
👤
Guido Imperiale
📅 Tue, 18 Apr 2023 at 14:10
show details
Historically, changes in the scheduling algorithm of Dask have often been based on theory, single use cases, or even gut feeling. Coiled has now moved to using hard, comprehensive performance metrics for all changes - and it's been a turning point!
Any developer worth their salt scrupulously practices functional regression testing: all functionality is covered by automated tests, and every time anybody changes something all tests must remain green. Performance testing, however, is a much fuzzier and often neglected area, typically because measuring realistic performance requires a production-sized test bench, and because performance always includes some degree of variance. Historically, changes to the scheduling algorithm in Dask have suffered from this problem. There have always been plenty of functional unit tests verifying that the scheduler makes whatever minute decisions the developers expect, but until recently there weren't any end-to-end, production-sized test benches on realistic use cases to measure performance. At Coiled, we have now implemented a new test suite that does just that - statistical analysis of performance metrics - letting us understand whether a change is beneficial or detrimental in terms of runtime and memory usage. This presentation delves into how we collect data, visualize it, and act on it, and how much it has changed our development process for the better.
🎤
Methods for Text Style Transfer: Text Detoxification Case
Speakers:
👤
Daryna Dementieva
📅 Tue, 18 Apr 2023 at 14:10
show details
Global access to the Internet has enabled the spread of information throughout the world and has offered many new possibilities. On the other hand, alongside the advantages, the exponential and uncontrolled growth of user-generated content on the Internet has also facilitated the spread of toxicity and hate speech. Much work has been done in the direction of offensive speech detection. However, there is another, more proactive way to fight toxic speech: offering the user a detoxified version of their message as a suggestion. In this presentation, we will provide an overview of how the text detoxification task can be solved. The proposed approaches can be reused for any text style transfer task, for both monolingual and multilingual use cases.
Firstly, we will briefly introduce the research direction of NLP for Social Good. Then, we will present the main directions of research in the text style transfer field. This field suffers from a lack of parallel data. We will describe our approach to collecting such a parallel dataset and show that it can be applied to any language. Then, we will show how monolingual, multilingual, and cross-lingual models can be trained for text detoxification. In the end, we will discuss ethical issues connected with this task and with tackling toxic and hate speech in general. The presented work is based on peer-reviewed papers from ACL and EMNLP conferences.
🎤
You are what you read: Building a personal internet front-page with spaCy and Prodigy
Speakers:
👤
Victoria Slocum
📅 Tue, 18 Apr 2023 at 14:10
show details
Sometimes the internet can be a bit overwhelming, so I thought I would make a tool to create a personalized summary of it! In this talk, I'll demonstrate a personal front-page project that allows me to filter info on the internet on a certain topic, built using spaCy, an open-source library for NLP, and Prodigy, a scriptable annotation tool. With this project, I learned about the power of working with tools that provide extensive customizability without sacrificing ease of use. Throughout the talk, I'll also discuss how design concepts of developer tools can improve the development experience when building complex and adaptable software.
Sometimes the internet can be a bit overwhelming, so I thought I would make a tool to create a personalized summary of it! In this talk, I'll demonstrate an open-source front-page project that allows me to filter info on the internet on a certain topic, customized and adapted to the user's preference. While building this project, I have been able to further explore the open-source NLP library, spaCy, and the scriptable annotation tool, Prodigy. Part of this talk will discuss how this project was implemented with regard to data collection, annotation and modeling. I developed a custom annotation interface, created a spaCy NLP pipeline, and explored different model architectures. Through the project, I learned about the power of working with tools that offer both good guide-rails and extensive customizability. In this talk, we'll also look at the design concepts of spaCy and Prodigy and how they've enhanced the developer experience for different types of projects, including my personal front-page. I'll discuss what I've discovered about how customizable tooling can improve the developer experience when building complex and adaptable software.
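To illustrate the kind of building block involved (not the talk's actual pipeline), a minimal spaCy sketch that extracts entities and content words from a headline, assuming the small English model is installed:

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("New open-source NLP tooling was announced in Berlin this week.")
print([(ent.text, ent.label_) for ent in doc.ents])    # named entities
print([tok.lemma_ for tok in doc if not tok.is_stop])  # content words for topic filtering
```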
🎤
Visualizing your computer vision data is not a luxury, it's a necessity: without it, your models are blind and so are you.
Speakers:
👤
Chazareix Arnault
📅 Tue, 18 Apr 2023 at 14:45
show details
Are you ready to take your Computer Vision projects to the next level? Then don't miss this talk! Data visualization is a crucial ingredient for the success of any computer vision project. It allows you to assess the quality of your data, grasp the intricacies of your project, and communicate effectively with stakeholders. In this talk, we'll showcase the power of data visualization with compelling examples. You'll learn about the benefits of data visualization and discover practical methods and tools to elevate your projects. Don't let this opportunity pass you by: join us and learn how to make data visualization a core feature of your Computer Vision projects.
This talk is suitable for computer vision professionals and enthusiasts who want to learn about best practices for visualizing and exploring datasets and how to apply them to their projects. It will provide a valuable foundation for building better machine learning models and producing high-quality results. Data scientists from other domains may also find eye-opening information and ideas. We will explore examples of data issues in various computer vision datasets and tasks, such as object detection, few-shot learning, and visual question answering. We will then examine tools and strategies for inspecting datasets and the results of models, including FiftyOne, KnowYourData, and Streamlit. By the end of the talk, attendees will have a deeper understanding of the importance of visualizing and exploring computer vision datasets and be equipped with the knowledge and skills to apply these techniques in their own projects.
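As a taste of one of the tools covered, a minimal FiftyOne sketch that loads a demo dataset and opens the interactive app (the dataset choice is illustrative):

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Small demo dataset with images, ground-truth labels, and model predictions
dataset = foz.load_zoo_dataset("quickstart")

# Browse samples, labels, and predictions interactively in the browser
session = fo.launch_app(dataset)
session.wait()
```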
🎤
Delivering AI at Scale
Speakers:
👤
Severin Schmitt
👤
Anna Achenbach
👤
Thorsten Kranz
📅 Tue, 18 Apr 2023 at 14:45
show details
Everybody knows our yellow vans, trucks and planes around the world. But do you know how data drives our business and how we leverage algorithms and technology in our core operations? We will share some “behind the scenes” insights on Deutsche Post DHL Group’s journey towards a Data-Driven Company. • Large-Scale Use Cases: Challenging and high impact Use Cases in all major areas of logistics, including Computer Vision and NLP • Fancy Algorithms: Deep-Neural Networks, TSP Solvers and the standard toolkit of a Data Scientist • Modern Tooling: Cloud Platforms, Kubernetes, Kubeflow, Auto ML • No rusty working mode: small, self-organized, agile project teams, combining state of the art Machine Learning with MLOps best practices • A young, motivated and international team – German skills are only “nice to have” But we have more to offer than slides filled with buzzwords. We will demonstrate our passion for our work, deep dive into our largest use cases that impact your everyday life and share our approach for a time series forecasting library - combining data science, software engineering and technology for efficient and easy-to-maintain machine learning projects.
🎤
Accelerating Public Consultations with Large Language Models: A Case Study from the UK Planning Inspectorate
Speakers:
👤
Michele Dallachiesa
👤
Andreas Leed
📅 Tue, 18 Apr 2023 at 14:45
show details
Local Planning Authorities (LPAs) in the UK rely on written representations from the community to inform their Local Plans which outline development needs for their area. With an average of 2000 representations per consultation and 4 rounds of consultation per Local Plan, the volume of information can be overwhelming for both LPAs and the Planning Inspectorate tasked with examining the legality and soundness of plans. In this study, we investigate the potential for Large Language Models (LLMs) to streamline representation analysis. We find that LLMs have the potential to significantly reduce the time and effort required to analyse representations, with simulations on historical Local Plans projecting a reduction in processing time by over 30%, and experiments showing classification accuracy of up to 90%. In this presentation, we discuss our experimental process which used a distributed experimentation environment with Jupyter Lab and cloud resources to evaluate the performance of the BERT, RoBERTa, DistilBERT, and XLNet models. We also discuss the design and prototyping of web applications to support the aided processing of representations using Voilà, FastAPI, and React. Finally, we highlight successes and challenges encountered and suggest areas for future improvement.
In the United Kingdom, Local Planning Authorities (LPAs) are responsible for creating Local Plans that outline the development needs of their areas, including land allocation, infrastructure requirements, housing needs, and environmental protection measures. This process involves consulting with the local community and interested parties multiple times, which often results in hundreds or thousands of written representations that must be organised and analysed. On average, LPAs receive approx. 2000 written representations per consultation, and each Local Plan requires 4 rounds of consultation. The process of analysing these representations takes approx. 3.5 months per round of consultation to complete. The Planning Inspectorate is tasked with examining Local Plans to ensure they follow national policy and legislation. The Inspectorate examines approx. 60 Local Plans a year, each examination lasting around a year. The volume of information included in each Local Plan significantly outweighs the capacity of the Planning Inspectorate to read and analyse the content in detail. This can lead to important issues being overlooked and potential problems with the review process or legal challenges. Conducting a thorough and meticulous analysis of representations takes a lot of time and effort for both LPAs and the Planning Inspectorate. Together with the Planning Inspectorate, we conducted an AI discovery to explore how Large Language Models (LLMs) can help reduce the time taken to analyze representations, improve resource planning, increase consistency in decision-making, and mitigate the risk of a key issue of material concern being missed. We assessed the performance of competing models and demonstrated their effectiveness with proof-of-concept apps for both LPAs and the Planning Inspectorate that unify and streamline the aided processing of representations. Our simulations on historical Local Plans resulted in a projected reduction of the time taken to analyze representations by more than 30%, and experiments show that we are able to classify representations to the relevant policy in Local Plans with up to 90% accuracy. In this talk, we share our experimental process based on Python and the experimental results. We delve into how we approached the problem, sourced and cleaned the data, and used a distributed experimentation environment with Jupyter Lab and cloud resources to evaluate the performance of BERT, RoBERTa, DistilBERT, and XLNet models. We also discuss our strategies for dealing with limited training data. Finally, we present the design and prototyping of two web applications using Voilà, and demonstrate how we iterated on them using FastAPI and React. Throughout the presentation, we highlight the successes and challenges we encountered, and suggest areas for future improvement.
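For illustration of the classification setup (model choice and label count here are assumptions, not the project's exact configuration), a minimal Hugging Face sketch:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical: classify a representation against 12 Local Plan policies;
# the head is randomly initialized here - fine-tuning would come first in practice
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=12
)

text = "The plan does not allocate enough land for affordable housing."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # index of the predicted policy
```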
🎤
Writing Plugin Friendly Python Applications
Speakers:
👤
Travis Hathaway
📅 Tue, 18 Apr 2023 at 14:45
show details
In modern software engineering, plugin systems are a ubiquitous way to extend and modify the behavior of applications and libraries. When software is written in a way that is plugin friendly, it encourages the use of modular organization where the contracts between the core software and the plugin have been well thought out. In this talk, we cover exactly how to define this contract and how you can start designing your software to be more plugin friendly. Throughout the talk we will be creating our own plugin friendly application using the [pluggy](https://pluggy.readthedocs.io/en/stable/) library to show these design principles in action. At the end of the talk, I also cover a real-life case study of how the package manager [conda](https://github.com/conda/conda) is currently making its 10 year old code more plugin friendly to illustrate how to retrofit an existing project.
This talk begins with a general discussion about what plugins are and how they are used in software. We cover important theoretical concepts and show just how pervasive plugins are in much of the software we use every day. With a firm idea about what plugins allow us to do, we will begin creating our own command line application that downloads images via APIs given a search term. We will write our application with plugins in mind so that we can quickly expand and support any number of image searching backends (e.g. Google, Unsplash, etc.). The presentation will focus on everything we have to do to let plugin authors extend our application and add their own backends. A fully functional implementation of this application can be found here: [https://github.com/travishathaway/latz](https://github.com/travishathaway/latz). After building our own application, I will then present how the [conda](https://github.com/conda/conda) project approaches making its software plugin friendly. Much of what I show in the example also applies to conda's plugin architecture. This talk should prepare those interested in writing their own plugin friendly applications to get started with the [pluggy](https://pluggy.readthedocs.io/en/stable/) library. The [example project](https://github.com/travishathaway/latz) will also provide a great starting point and inspiration for new and existing applications.
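A minimal sketch of the hookspec/hookimpl contract with pluggy (the project and backend names are illustrative, not latz's actual API):

```python
import pluggy

hookspec = pluggy.HookspecMarker("imgsearch")
hookimpl = pluggy.HookimplMarker("imgsearch")

class SearchSpec:
    @hookspec
    def search(self, query):
        """Return a list of image URLs for the query."""

class DummyBackend:
    @hookimpl
    def search(self, query):
        return [f"https://example.com/{query}.jpg"]

pm = pluggy.PluginManager("imgsearch")
pm.add_hookspecs(SearchSpec)
pm.register(DummyBackend())

# Each registered backend contributes its own result list
print(pm.hook.search(query="sunset"))
```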
🎤
When A/B testing isn’t an option: an introduction to quasi-experimental methods
Speakers:
👤
Inga Janczuk
📅 Tue, 18 Apr 2023 at 14:45
show details
Identifying causal relationships by running experiments is not always possible. In this talk, I discuss an alternative approach: quasi-experimental frameworks. Additionally, I will present how to adjust well-known machine-learning algorithms so they can be used to quantify causal relationships.
### What problem is the talk addressing? Experiments are a gold standard for estimating causal relationships. That being said, they are not always possible. Experiments can be costly, long-lasting, unethical, or illegal. In other cases, the underlying assumptions for identification cannot be met, e.g. it is not possible to split subjects into control and treatment groups randomly or avoid interactions between them. ### Why is the problem relevant to the audience? Understanding the magnitude of treatment effects is a prerequisite for policy makers and stakeholders to design optimal strategies. ### What are the solutions to the problem? Prediction-driven algorithms are not necessarily well suited to the accurate identification of causal links. In this talk I will show how to shift the goalposts of those algorithms from prediction towards identification of treatment effects. First, I will cover classical quasi-experimental frameworks such as difference-in-differences and regression discontinuity design. Then, I will shed some light on how to augment those methods with out-of-the-box machine-learning techniques. To this end, orthogonal machine learning will be discussed. ### What are the main takeaways from the talk? I will reiterate that correlation does not imply causation. The audience will become familiar with causal-inference methods used when controlled experiments are not feasible. The participants will learn how to adjust off-the-shelf machine-learning algorithms to identify conditional average treatment effects.
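As a worked example of the simplest framework mentioned, difference-in-differences, here is a sketch on synthetic data (the true effect of 2.0 is recovered by the interaction coefficient):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),  # treatment group indicator
    "post": rng.integers(0, 2, n),     # post-intervention period indicator
})
# Outcome with group and time effects plus a true treatment effect of 2.0
df["y"] = (1.0 + 0.5 * df["treated"] + 0.3 * df["post"]
           + 2.0 * df["treated"] * df["post"] + rng.normal(0, 1, n))

model = smf.ols("y ~ treated * post", data=df).fit()
print(model.params["treated:post"])  # the DiD estimate, close to 2.0
```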
🎤
Let's contribute to pandas (3 hours) #2
Speakers:
👤
Noa Tamir
👤
Patrick Hoefler
📅 Tue, 18 Apr 2023 at 15:45
show details
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people. pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted! If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html .
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people. pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted! If you don’t finish your contribution during the event, we hope you will continue to work on it after the workshop. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html . ❓Any other requirements ❓ 1. Bring your own laptop 2. Have a GitHub account: https://github.com 3. Have git installed: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git • Format for the session: First 15 minutes : an introduction - what you can contribute, how to contribute, and how to set up your development environment or use gitpod; The rest : "office hours", during which you'll be mentored through setting up a development environment and making a contribution to pandas. • Preparation (optional) For those who are more keen on using the workshop to work on their contribution to pandas, you may want to start setting up your development environment in advance. This way, by the time you arrive you are ready to get started on picking issues, and starting to contribute. Please be aware that it could take longer to set up a development environment on a computer running a Windows operating system compared to macOS or Unix. We will guide you through the steps, and they are useful to learn for many open source projects. We also offer a development environment on gitpod. It can take some minutes to load, but provides you with an instant and fresh development environment for each new task directly from your browser, using VS Code. Documentation is in the works and will be provided before the workshop. To get the most out of the session, it's encouraged (but not required) that you have a look at the contributing guide beforehand: https://pandas.pydata.org/pandas-docs/dev/development/contributing.html. Particularly, the development environment instructions: https://pandas.pydata.org/docs/dev/development/contributing_environment.html • Audience level Everyone is welcome to attend this session! If you've never contributed to open source software before, then you will learn how to, and if you have experience contributing, then you can either help mentor other attendees or you can work on more challenging contributions. It is useful to have some pandas, git, and Python experience. If you don't have much experience with them, you might expect to spend time "learning by doing".
🎤
Data Kata: Ensemble programming with Pydantic #2
Speakers:
👤
Lev Konstantinovskiy
👤
Gregor Riegler
👤
Nitsan Avni
📅 Tue, 18 Apr 2023 at 15:45
show details
Write code as an ensemble to solve a data validation problem using Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.
The How: We will play a "collaborative game" - write code together to solve a problem. Each small group of 5, an "ensemble", will be guided by a facilitator. An ensemble has only one screen and one keyboard, so participants rotate the roles of typing and talking. The goals are to have fun, learn how to use Pandera, write better code with Test Driven Development, listen to colleagues, make typos in front of everyone, become a supportive team, defend our ideas and sometimes even accept criticism. Exercise: "Read data from a CSV and check data types, range of values, consistency between columns using Pandera." See data and starting code in the [repo](https://github.com/tmylk/data-kata/tree/main/validation/pydantic). This is part 2 of our data validation tutorial. Part 1 is doing the same task using a different Python framework - Pydantic instead of Pandera. You can attend both or just one part of this tutorial. Format: - Ensemble programming with a facilitator. We will all collaborate as one team, switching the person on the keyboard every 5 mins. - You don't need to have any previous experience with ensemble programming to join. - You don't need to have any previous experience with data validation to join. Schedule: - Intros - 10 mins - Ensemble programming - 30 mins - Interim Retrospective - 10 mins - Ensemble programming - 30 mins - Final Retrospective - 10 mins - Closing Things to note: - We will use gitpod.io as a shared VS Code IDE work environment
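A minimal sketch of the same exercise in Pandera (column names are hypothetical, not the tutorial's actual data):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "start_km": pa.Column(float, pa.Check.ge(0)),
        "end_km": pa.Column(float, pa.Check.ge(0)),
    },
    # cross-column consistency check over the whole DataFrame
    checks=pa.Check(lambda df: df["end_km"] >= df["start_km"]),
)

df = pd.read_csv("trips.csv")
validated = schema.validate(df)  # raises a SchemaError on violations
```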
🎤
MLOps in practice: our journey from batch to real-time inference
Speakers:
👤
Theodore Meynard
📅 Tue, 18 Apr 2023 at 16:00
show details
I will present the challenges we encountered while migrating an ML model from batch to real-time predictions and how we handled them. In particular, I will focus on the design decisions and open-source tools we built to test the code, data and models as part of the CI/CD pipeline and enable us to ship fast with confidence.
At GetYourGuide we build a marketplace for travel experiences. The ranking of activities on the platform is one of the most essential machine-learning products for the business. In this talk, I will explain how we gradually migrated our ranking from global precomputed scores to a live reranking service. Building such a service with high availability requirements and constant modifications brings challenges. I will dive into the design decisions and open-source tools we built to enable us to test code, data, and models as part of the CI/CD pipeline. This allows us to ship fast with confidence without losing ourselves in cumbersome tests or a mocking hell. At the end of the talk, you will have actionable insights you can apply to your Machine Learning products and understand how to introduce good MLOps practices using open-source tools.
🎤
Enabling Machine Learning: How to Optimize Infrastructure, Tools and Teams for ML Workflows
Speakers:
👤
Yann Lemonnier
📅 Tue, 18 Apr 2023 at 16:00
show details
In this talk, we will explore the role of a machine learning enabler engineer in facilitating the development and deployment of machine learning models. We will discuss best practices for optimizing infrastructure and tools to streamline the machine learning workflow, reduce time to deployment, and enable data scientists to extract insights and value from data more efficiently. We will also examine case studies and examples of successful machine learning enabler engineering projects and share practical tips and insights for anyone interested in this field.
🎤
Introducing FastKafka
Speakers:
👤
Tvrtko Sternak
📅 Tue, 18 Apr 2023 at 16:00
show details
FastKafka is a Python library that makes it easy to connect to Apache Kafka queues and send and receive messages. In this talk, we will introduce the library and its features for working with Kafka queues in Python. We will discuss the motivations for creating the library, how it compares to other Kafka client libraries, and how to use its decorators to define functions for consuming and producing messages. We will also demonstrate how to use these functions to build a simple application that sends and receives messages from the queue. This talk will be of interest to Python developers looking for an easy-to-use solution for working with Kafka. The documentation of the library can be found here: https://fastkafka.airt.ai/
FastKafka is a Python library that simplifies the process of connecting to Apache Kafka queues and sending and receiving messages. It follows a decorator-based approach inspired by the popular FastAPI library, making it easy to define functions for consuming messages from the queue and producing and sending new ones. In this talk, we will introduce FastKafka and its features for working with Kafka in Python. We will start by discussing the motivations for creating the library and how it compares to other Kafka client libraries. We will then delve into a live demonstration of the library's features, showing how to use the decorators to define functions for consuming and producing messages, and how to use these functions to build a simple application that sends and receives messages from the queue. Finally, we will discuss some real-world use cases for FastKafka and how it can be used to build scalable, high-performance applications that need to process and transmit large amounts of data. This talk will be of particular interest to Python developers looking for an easy-to-use solution for working with Kafka.
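A minimal sketch in the decorator style the library documents (broker settings and message models are illustrative, and exact signatures may differ between versions):

```python
from pydantic import BaseModel
from fastkafka import FastKafka

class InputData(BaseModel):
    user_id: int
    value: float

class Prediction(BaseModel):
    user_id: int
    score: float

# Illustrative broker configuration for a local Kafka instance
kafka_brokers = {"localhost": {"url": "localhost", "port": 9092}}
app = FastKafka(title="Demo app", kafka_brokers=kafka_brokers)

@app.consumes(topic="input_data")
async def on_input_data(msg: InputData):
    await to_predictions(msg.user_id, msg.value * 2)

@app.produces(topic="predictions")
async def to_predictions(user_id: int, score: float) -> Prediction:
    return Prediction(user_id=user_id, score=score)  # returned message is published
```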
🎤
The bumps in the road: A retrospective on my data visualisation mistakes
Speakers:
👤
Artem Kislovskiy
📅 Tue, 18 Apr 2023 at 16:00
show details
We will delve into the importance of effective data visualisation in today's world. We will explore how it can help convey insights from data using Matplotlib and best practices for creating informative visualisations. We will also discuss the limitations of static visualisations and examine the role of continuous integration in streamlining the process and avoiding common pitfalls. By the end of this talk, you will have gained valuable insights and techniques for creating informative and accurate data visualisations, no matter what tools you're using.
In today's world, effective visualisation is crucial for conveying insights from data. We will explore best practices for creating visualisations with Matplotlib. We will discuss the limitations of static visualisations and how continuous integration can help streamline the process and avoid common pitfalls. I will share my practical experiences and lessons learned from working with analytics, drawing on the insights of well-known experts such as Edward Tufte, Stephen Few, Alberto Cairo, and Dona Wong. The work of these authors has helped shape our understanding of how to create informative and accurate visualisations. I will reflect on what I wish I had known about the best practices in this field. This talk is suitable for professionals who work with data and want to improve the effectiveness of analytics and reporting. Data visualisation is a form of communication, and learning to apply it well is essential for conveying the stories that data tells us. By the end of this talk, you will have gained valuable techniques for creating informative analytics and an understanding of how CI can support your data visualisation projects.
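One small example of the kind of practice discussed - explicit labels, less chartjunk, and a scripted export so CI can regenerate the figure (the data is illustrative):

```python
import matplotlib.pyplot as plt

months = list(range(1, 13))
revenue = [12, 14, 13, 17, 19, 18, 21, 24, 23, 26, 28, 31]  # illustrative data

fig, ax = plt.subplots(figsize=(7, 3.5))
ax.plot(months, revenue, color="tab:blue", linewidth=2)
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (kEUR)")
ax.set_title("Monthly revenue, 2022")
for side in ("top", "right"):
    ax.spines[side].set_visible(False)  # reduce chartjunk, per Tufte
fig.savefig("revenue.png", dpi=150, bbox_inches="tight")  # reproducible in CI
```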
🎤
Neo4j graph databases for climate policy
Speakers:
👤
Marcus Tedesco
📅 Tue, 18 Apr 2023 at 16:35
show details
In this talk we walk through our experience using Neo4j and Python to model climate policy as a graph database. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!
As the ambition and complexity of climate regulations and policies grows, it is becoming increasingly difficult to represent them in relational databases. For example, the EU Sustainable Taxonomy regulation contains thousands of interrelated legal clauses, many of which also reference other legal texts and entities. Graph databases such as Neo4j present a possible alternative well suited to modelling the complicated, interrelated and evolving structure of climate regulations. In this talk we walk through our experience using Neo4j and Python to model climate policy such as the EU Sustainable Taxonomy as a graph database. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!
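A minimal sketch of the Python side (clause identifiers are invented; assumes the v5 Neo4j Python driver):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link_clauses(tx, clause_id, referenced_id):
    # model "clause A references clause B" as a relationship in the graph
    tx.run(
        "MERGE (a:Clause {id: $a}) "
        "MERGE (b:Clause {id: $b}) "
        "MERGE (a)-[:REFERENCES]->(b)",
        a=clause_id, b=referenced_id,
    )

with driver.session() as session:
    session.execute_write(link_clauses, "Article 10(1)", "Article 17")
driver.close()
```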
🎤
Use Spark from anywhere: A Spark client in Python powered by Spark Connect
Speakers:
👤
Martin Grund
📅 Tue, 18 Apr 2023 at 16:35
show details
Over the past decade, developers, researchers, and the community have successfully built tens of thousands of data applications using Spark. Since then, use cases and requirements of data applications have evolved: Today, every application, from web services that run in application servers, interactive environments such as notebooks and IDEs, to phones and edge devices such as smart home devices, wants to leverage the power of data. However, Spark's driver architecture is monolithic, running client applications on top of a scheduler, optimizer and analyzer. This architecture makes it hard to address these new requirements: there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL. Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages. This talk highlights how simple it is to connect to Spark using Spark Connect from any data application or IDE. We will do a deep dive into the architecture of Spark Connect and give an outlook of how the community can participate in the extension of Spark Connect for new programming languages and frameworks - to bring the power of Spark everywhere.
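Connecting through Spark Connect looks like this in PySpark 3.4+ (host and port are illustrative; 15002 is the default server port):

```python
from pyspark.sql import SparkSession

# The client builds unresolved logical plans; only results travel back
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(100)
print(df.filter(df.id % 2 == 0).count())
```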
🎤
Ask-A-Question: an FAQ-answering service for when there's little to no data
Speakers:
👤
Suzin You
📅 Tue, 18 Apr 2023 at 16:35
show details
Doing data science in international development often means finding the right-sized solution in resource-constrained settings. This talk walks you through how my team helped answer thousands of questions from pregnant folks and new parents on a South African maternal and child health helpline, which model we ended up choosing and why (hint: resource constraints!), and how we've packaged everything into a service that anyone can start for themselves. By the end of the talk, I hope you'll know how to start your own FAQ-answering service and learn about one example of doing data science in international development.
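The talk keeps its chosen model as a teaser, but to make the task concrete, here is a generic embedding-based FAQ-matching sketch (model and questions are illustrative, not the team's solution):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

faqs = [
    "When should my baby get vaccinated?",
    "Is it safe to exercise while pregnant?",
]
faq_emb = model.encode(faqs, convert_to_tensor=True)

# Match an incoming question against the FAQ bank by cosine similarity
query_emb = model.encode("can I do sport when I am pregnant", convert_to_tensor=True)
best = util.cos_sim(query_emb, faq_emb).argmax().item()
print(faqs[best])
```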
🎤
Keynote - Lorem ipsum dolor sit amet
Speakers:
👤
Miroslav Šedivý
📅 Wed, 19 Apr 2023 at 09:10
show details
A life without joy is like software without meaningful test data - it's uncertain and unreliable. The search for the perfect test data is a challenge. Real data should not be too real. Random data should not be too random. This is a randomly real and a really random journey to discover the balance between these two, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
🎤
Building Hexagonal Python Services
Speakers:
👤
Shahriyar Rzayev
📅 Wed, 19 Apr 2023 at 10:00
show details
The importance of enterprise architecture patterns is well known, and they are applicable to varied types of tasks. Thinking about the architecture from the beginning of the journey is crucial for having a maintainable, testable, and flexible code base. We are going to explore the Ports and Adapters (Hexagonal) pattern by showing a simple web app using the Repository, Unit of Work, and Services (Use Cases) patterns tied together with Dependency Injection. All those patterns are quite famous in other languages but relatively new to the Python ecosystem, where they fill a crucial gap. As a web framework, we are going to use FastAPI, which could be replaced with any other framework in no time because of the abstractions we have added.
Nearly all Python web application tutorials start with installing a web framework and a database server; the next step is to build database models and then use an ORM, etc. But wait, there is a problem with this classical approach: we lose the core business domain discussion - the so-called core domain models just get lost inside some classes and functions. How about reversing our approach? How about starting by thinking about and modeling our business and core domain, and then testing it properly? Afterward, how about adding an abstraction layer over the database, then another abstraction over the actual services and use cases? But wait, how are we going to manage all transactional usage? Okay, let's add another layer with the Unit of Work pattern to manage our work as units. Sounds cryptic? Here is a step-by-step guide to starting our project: * We are going to start with domain modeling and adding tests for our domain models * The database layer will be abstracted using a Repository pattern * The database transactions will be managed by the Unit of Work pattern * The business logic actions will be encapsulated in Use Cases The question can arise: where are our web framework and database server? Answer: good architecture lets us defer those choices until the end, because the web framework and the database server are details as far as our core application is concerned. The web framework will be treated as an entry point to our application, and the database layer will be encapsulated using the SQLAlchemy ORM - but the ORM itself is hidden behind the Repository and UoW patterns. This allows us to change the ORM library if there is any need in the future. The most important part is to understand how we are going to build our application using the Ports and Adapters (Hexagonal) pattern: all the aforementioned patterns will be divided into Ports (abstract base classes) and Adapters (the actual implementations). We can think of this as a contract between our abstractions and their implementations.
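A minimal sketch of a port, an adapter, and a use case (the domain and names are invented for illustration):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Order:  # core domain model, free of framework and ORM details
    id: int
    total: float

class OrderRepository(ABC):  # port: the contract our application depends on
    @abstractmethod
    def add(self, order: Order) -> None: ...

    @abstractmethod
    def get(self, order_id: int) -> Optional[Order]: ...

class InMemoryOrderRepository(OrderRepository):  # adapter: one concrete implementation
    def __init__(self) -> None:
        self._orders: Dict[int, Order] = {}

    def add(self, order: Order) -> None:
        self._orders[order.id] = order

    def get(self, order_id: int) -> Optional[Order]:
        return self._orders.get(order_id)

def place_order(repo: OrderRepository, order_id: int, total: float) -> Order:
    # use case: depends only on the port, so the adapter can be swapped freely
    order = Order(order_id, total)
    repo.add(order)
    return order

repo = InMemoryOrderRepository()
print(place_order(repo, 1, 99.0))
```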
🎤
Accelerating Python Code
Speakers:
👤
Jens Nie
📅 Wed, 19 Apr 2023 at 10:00
show details
Python is a beautiful language for fast prototyping and sketching ideas quickly. However, people often struggle to get their code into production, for various reasons. Besides all the security and safety concerns that usually are not addressed from the very beginning when playing around with an algorithmic idea, performance concerns are quite frequently a reason for not taking Python code to the next level. We will look at the "missing performance" worries using a simple numerical problem and see how to speed the corresponding Python code up to top-notch performance.
We all know how much fun it is to play around with an algorithmic idea in Python. It's very satisfying to see the idea develop and do what it's supposed to do, and to see how simple and elegant the code finally looks. Python being so feature complete with its standard library and the 3rd party universe of libraries and packages allows development to be very quick. And we're all very grateful to be able to focus on the problem itself, not on the language specifics, to solve it. But when we arrive at the point where everything just works, there is this one last step that needs to be mastered: get it into production to finally let it do what it was supposed to be doing and make life easier for all of us. But at that stage there are those final hurdles - and they usually feel giant - that raise unpleasant questions. Will the algorithm really do what it was supposed to be doing under all circumstances? Will it be safe? What if it fails? Will it actually be fast enough for all the data it needs to process in production? Will it be capable of doing its job in the future, when the amount of work grows? Whilst the first worries usually can be addressed well using established software engineering habits and patterns, the performance-related issue is often seen as the killer on the way to production use, as Python is still considered to be slow just based on the fact that it is an interpreted language. Quite often, code is rewritten after the prototyping phase in languages considered to be fast, such as C++, for this very reason. We'll look at exactly this point and explore ways to accelerate Python code by simple modifications and by using third-party libraries to support us. To do that we will look at some code to solve a simple numerical problem - calculating the Mandelbrot Set - as it is well suited for this and quite simple to follow. Yet it generates stunning and beautiful results, entertaining us throughout the presentation. The strategies shown to accelerate the code, based on concepts taken from the standard library, PyPy, NumPy, Numba, and Dask, are however transferable to other algorithmic problems as well. We will analyse the advantages as well as the drawbacks of each concept to see the overall effect and where else the solution might apply.
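To make the idea concrete, a minimal Numba sketch of the Mandelbrot workload (resolution and iteration count are arbitrary, not the talk's exact benchmark):

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code on first call
def mandelbrot(width, height, max_iter=100):
    image = np.zeros((height, width), dtype=np.int64)
    for i in range(height):
        for j in range(width):
            c = complex(-2.5 + 3.5 * j / width, -1.25 + 2.5 * i / height)
            z = 0j
            n = 0
            while abs(z) <= 2.0 and n < max_iter:
                z = z * z + c
                n += 1
            image[i, j] = n  # escape time determines the pixel's colour
    return image

img = mandelbrot(800, 600)  # first call compiles; repeated calls run at native speed
```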
🎤
Advanced Visual Search Engine with Self-Supervised Learning (SSL) Representations and Milvus
Speakers:
👤
Antoine Toubhans
👤
Noé Achache
📅 Wed, 19 Apr 2023 at 10:00
show details
Image retrieval is the process of searching for images in a large database that are similar to one or more query images. A classical approach is to transform the database images and the query images into embeddings via a feature extractor (e.g., a CNN or a ViT), so that they can be compared via a distance metric. Self-supervised learning (SSL) can be used to train a feature extractor without the need for expensive and time-consuming labeled training data. We will use DINO's SSL method to build a feature extractor and Milvus, an open-source vector database built for scalable similarity search, to index image representation vectors for efficient retrieval. We will compare the SSL approach with supervised and pre-trained feature extractors.
[Image Retrieval](https://en.wikipedia.org/wiki/Image_retrieval) is the task of searching a large database for the images most similar to one or more query images. It has many applications in various fields, e.g., validating whether a person's photo is contained in your database of people's photos, building a visual recommendation system, or creating a video deduplication system. Huge progress in Computer Vision in the deep learning era highlighted [Content-based Image Retrieval](https://en.wikipedia.org/wiki/Content-based_image_retrieval) (CBIR) techniques that use the image contents (features, colors, shapes, etc.) rather than metadata (keywords, tags). This gets rid of time-consuming, costly, and error-prone human annotation to produce the metadata. A classic CBIR approach consists of three steps:

1. A deep neural network called **the feature extractor** (typically a CNN or a [ViT](https://arxiv.org/pdf/2010.11929.pdf)) computes a representation of each image of the database in the form of an embedding vector.
2. The same *feature extractor* is used to compute an embedding of a query image.
3. The search is performed by retrieving the **closest** representations in this vector space using a distance metric (cosine, L1, or more complex ones).

Thereafter, two main challenges arise:

- **Quality of image representations** - the embeddings should capture the visual features that are relevant to your searches/tasks. For instance, if you intend to do face recognition, embeddings should encode eye/hair color, skin texture, nose position, etc. Traditionally, the feature extractor is trained in a supervised way. The relevance of the representations therefore hugely depends on 1) how close the training dataset is to the query images and 2) the potential visual biases in the annotations (see a [famous example here](https://medium.com/hackernoon/dogs-wolves-data-science-and-why-machines-must-learn-like-humans-do-41c43bc7f982)).
- **Speed of search in the representation space** - comparing each query image to every single image in the searched database in near real-time is challenging and expensive with large datasets.

In this talk, we will build a [Visual Search Engine](https://en.wikipedia.org/wiki/Visual_search_engine):

- We will introduce **[Self-Supervised Learning](https://en.wikipedia.org/wiki/Self-supervised_learning) (SSL)** in the context of computer vision and the [data2vec](https://arxiv.org/pdf/2202.03555.pdf) approach. Labelling data can be a time-consuming and expensive process, especially if it requires specialized knowledge or expertise. SSL does not require labelled training data to learn good representations, so it lowers the cost and time of building a model that produces good representations for our visual search engine.
- As a concrete example for this talk, we will use [DINO](https://arxiv.org/pdf/2104.14294.pdf)'s SSL method to build a feature extractor.
- We will compare the DINO feature extractor with supervised pre-trained feature extractors. We will show the main differences between the obtained representations: SSL ones are generally richer (more visual features are in the representation), whereas supervised learning introduces a natural semantic bias in the representations. In addition, we will present practical tools to understand the visual features encoded in the embeddings (activation maps, grad-cams, self-attention maps for transformers).
- We will present [Milvus](https://milvus.io/), a vector database built for scalable similarity search: it's an open-source search engine tool (14.5k stars on GitHub) that is suitable for production use cases, as it can be easily scaled and managed. Milvus uses [Approximate Nearest Neighbors (ANN) methods](https://milvus.io/docs/v2.0.x/index.md#Selecting-an-Index-Best-Suited-for-Your-Scenario) to build vector indexes that improve retrieval efficiency by sacrificing accuracy within an acceptable range.
- We will use the Milvus Python API to index the image representation vectors: as a result, the images most similar to a query image can be retrieved in a split second, even for datasets containing millions of vectors.

By the end of the session, participants will have learned how to build a Visual Search Engine using Milvus with pre-trained self-supervised and supervised models.
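As a rough sketch of the indexing step described above, assuming a locally running Milvus instance and random vectors standing in for the DINO embeddings (the collection layout and index parameters are illustrative, not the speakers' exact setup):

```python
import numpy as np
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

connections.connect(host="localhost", port="19530")  # assumes a local Milvus

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
]
collection = Collection("images", CollectionSchema(fields))

# random vectors standing in for DINO embeddings of the image database
vectors = np.random.rand(1000, 384).tolist()
collection.insert([vectors])

collection.create_index("embedding", {"index_type": "IVF_FLAT",
                                      "metric_type": "L2",
                                      "params": {"nlist": 128}})
collection.load()

# retrieve the five nearest neighbours of a query embedding
hits = collection.search(data=[vectors[0]], anns_field="embedding",
                         param={"metric_type": "L2", "params": {"nprobe": 10}},
                         limit=5)
print(hits[0].ids)
```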
🎤
Why GPU Clusters Don't Need to Go Brrr? Leverage Compound Sparsity to Achieve the Fastest Inference Performance on CPUs
Speakers:
👤
Damian Bogunowicz
📅 Wed, 19 Apr 2023 at 10:00
show details
Forget specialized hardware. Get GPU-class performance on your commodity CPUs with compound sparsity and sparsity-aware inference execution. This talk will demonstrate the power of compound sparsity for model compression and inference speedup for NLP and CV domains, with a special focus on the recently popular Large Language Models. The combination of structured + unstructured pruning (to 90%+ sparsity), quantization, and knowledge distillation can be used to create models that run an order of magnitude faster than their dense counterparts, without a noticeable drop in accuracy. The session participants will learn the theory behind compound sparsity, state-of-the-art techniques, and how to apply it in practice using the Neural Magic platform.
By intelligently applying SOTA compound sparsity techniques, we can remove 95%+ of the weights and reduce the remaining 5% to 8-bit precision on modern models such as BERT, while maintaining 99%+ of their baseline accuracy. In this talk, we'll cover how we can build up to this extreme sparsity and how to harness it to achieve an order-of-magnitude speedup for CPU inference. This talk will focus on the success story of utilizing sparsity to run fast inference of modern neural networks on CPUs. We will focus on the popular Large Language Models, with the goal of learning how the recent state of the art in model compression can help dramatically lower the computational budget when it comes to model inference. Today's ML hardware acceleration is headed towards chips that apply a petaflop of compute to a cell-phone-size memory. Our brains, on the other hand, are biologically the equivalent of applying a cell phone of compute to a petabyte of memory. In this sense, the direction being taken by hardware designers is the opposite of that proven by nature. Why? Simply because we don't know the algorithms nature uses. GPUs bring data in and out quickly, but have little locality of reference because of their small caches. They are geared towards applying a lot of compute to little data, not little compute to a lot of data. The networks are designed to run on them full layer after full layer in order to saturate their computational pipeline. CPUs, on the other hand, have large, much faster caches than GPUs, and have an abundance of memory (terabytes). A typical CPU server can have memory equivalent to tens or even hundreds of GPUs. CPUs are perfect for a brain-like ML world in which parts of an extremely large network are executed piecemeal, as needed. This is the problem Neural Magic set out to solve, and the perspective that led to the creation of DeepSparse, a custom computational engine designed to mimic, on commodity hardware, the way brains compute. It uses neural network sparsity combined with locality of reference, utilizing the CPU's large fast caches and its very large memory.
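As a small illustration of the unstructured-pruning ingredient of compound sparsity, here is plain PyTorch magnitude pruning on a single layer; Neural Magic's own tooling drives this kind of sparsification with recipes and then exploits it at inference time, which this sketch does not attempt:

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(768, 768)

# magnitude pruning: zero out the 90% of weights with the smallest magnitude
prune.l1_unstructured(layer, name="weight", amount=0.9)

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.1%}")  # ~90.0%
```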
🎤
Create interactive Jupyter websites with JupyterLite
Speakers:
👤
Jeremy Tuloup
📅 Wed, 19 Apr 2023 at 10:00
show details
Jupyter notebooks are a popular tool for data science and scientific computing, allowing users to mix code, text, and multimedia in a single document. However, sharing Jupyter notebooks can be challenging, as they require installing a specific software environment to be viewed and executed. JupyterLite is a Jupyter distribution that runs entirely in the web browser without any server components. A significant benefit of this approach is the ease of deployment: with JupyterLite, the only requirement to provide a live computing environment is a collection of static assets. In this talk, we will show how you can create such a static website and deploy it to your users.
We will cover the basics of JupyterLite, including how to use its command-line interface to generate and customize the appearance and behavior of your Jupyter website. This will be a guided walkthrough with step-by-step instructions for adding content, extensions, and configuration. By the end of this tutorial, you will be able to create your own interactive Jupyter website using JupyterLite. Outline:

- Introduction to Jupyter and JupyterLite
- Examples of JupyterLite used for interactive documentation and educational content (NumPy, Try Jupyter, SymPy)
- Step-by-step demo for creating a Jupyter website
  - Quickstart with the demo repository
  - Adding content: notebooks, files and static assets
  - Adding extensions to the user interface
  - Adding packages to the Python runtime
  - Customization and custom settings
  - Deploy JupyterLite as a static website on GitHub Pages, Vercel or your own server
- Conclusion and next steps for learning more about the Jupyter ecosystem

The tutorial will be based on resources already publicly available:

- Try JupyterLite in your browser: https://jupyterlite.github.io/demo/
- The JupyterLite documentation: https://jupyterlite.readthedocs.io/en/latest/quickstart/deploy.html
- The JupyterLite repositories: https://github.com/jupyterlite

At the end of the tutorial the attendees will have something very concrete to show: a functioning Jupyter website.
🎤
The Spark of Big Data: An Introduction to Apache Spark
Speakers:
👤
Pasha Finkelshteyn
📅 Wed, 19 Apr 2023 at 10:00
show details
Get ready to level up your big data processing skills! Join us for an introductory talk on Apache Spark, the distributed computing system used by tech giants like Netflix and Amazon. We'll cover PySpark DataFrames and how to use them. Whether you're a Python developer new to big data or looking to explore new technologies, this talk is for you. You'll gain foundational knowledge about Apache Spark and its capabilities, and learn how to leverage DataFrames and SQL APIs to efficiently process large amounts of data. Don't miss out on this opportunity to up your big data game!
🎤
Monorepos with Python
Speakers:
👤
AbdealiLoKo
📅 Wed, 19 Apr 2023 at 10:00
show details
Working with Python is fun. Managing Python packaging, linters, tests, CI, etc. is not as fun. Every maintainer needs to worry about consistent styling, quality, speed of tests, etc. as the project grows. Monorepos have been successful in other communities - how does it work in Python?
As a Python project grows (within 2-3 years), you will go down one of two paths:

- Create a monolith
- Modularize your code into smaller packages

Either way, you will be affected by the many other libraries you use, and modularity is a requirement for any good project. But managing multiple modular packages becomes tough over time:

1. How do you ensure coding standards (quality, styling, etc.) are consistent across them?
2. How do we ensure all the packages work correctly without spending hours and hours of CI time?
3. How can common logical pieces be modularized further and still stay DRY?

These are common issues I have faced by the 2-3 year mark in any active project, and if not addressed early, they can cause your project to get messy very quickly. This talk aims to discuss these common issues and how a monorepo structure, widely popular in other communities like NodeJS, can also be applied to Python. We also discuss the crux of the issue:

- Making your code structure machine-understandable
- How this structured information can then be used to optimize workloads
- How this structured information can be used to automate tasks

And we go into how **monorepo tools** like pants, bazel, nx, etc. leverage this code-structure information to simplify your life as a maintainer.
🎤
Thou Shall Judge But With Fairness: Methods to Ensure an Unbiased Model
Speakers:
👤
Nandana Sreeraj
📅 Wed, 19 Apr 2023 at 10:50
show details
Is your model prejudicial? Is your model deviating from the predictions it ought to have made? Has your model misunderstood the concept? In the world of artificial intelligence and machine learning, the word "fairness" is particularly common. It is described as having the quality of being impartial or fair. Fairness in ML is essential for contemporary businesses. It helps build consumer confidence and demonstrates to customers that their issues are important. Additionally, it aids in ensuring adherence to guidelines established by authorities, thus guaranteeing that the idea of responsible AI is upheld. In this talk, let's explore how certain sensitive features influence a model and introduce bias into it. We'll also look at how we can make it better.
We cannot escape thinking about fairness through numbers and math. Models are not fair simply because they are mathematical, contrary to popular belief. AI systems are subject to bias. It may be inherent, due to historical bias in the training dataset. There may be label bias, which occurs when the set of labeled data is not a full representation of the entire universe of potential labels. Another potential bias is sampling bias, which occurs when certain people in the intended universe have a higher or lower sampling probability than others. Models learn from such biased datasets, which may lead to unfair decisions, and as cascading models are developed, this bias continues to spread. Model fairness is a pressing concern. Unfair AI systems can create recurring losses for businesses and damage a company's commercial value, eroding customer trust, inviting reputational harm, and decreasing transparency. As a result, model fairness is becoming increasingly necessary. In this talk, I will gently introduce you to the above concepts and some open-source libraries that help us assess ML models' fairness. Lastly, I will walk you through how to assess the fairness of a model for a law school dataset using Fairlearn, an open-source library by Microsoft, and the measures that can be taken to mitigate unfairness. My talk will focus on:

1. What metrics need to be considered for assessing the fairness of an ML model?
2. What mitigation measures can be implemented?
3. Python code to gauge the fairness of a model trained on a law school dataset using Fairlearn, and steps to mitigate bias in the model.
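As a taste of the tooling, a minimal Fairlearn sketch on toy data (the arrays and the choice of demographic parity as the metric are illustrative only):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, demographic_parity_difference

# toy predictions with a binary sensitive feature
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
sex = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])

# accuracy broken down by group
mf = MetricFrame(metrics=accuracy_score, y_true=y_true, y_pred=y_pred,
                 sensitive_features=sex)
print(mf.by_group)

# gap in positive prediction rates between groups (0 means parity)
print(demographic_parity_difference(y_true, y_pred, sensitive_features=sex))
```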
🎤
Unlocking Information - Creating Synthetic Data for Open Access.
Speakers:
👤
Antonia Scherz
📅 Wed, 19 Apr 2023 at 10:50
show details
Many good project ideas fail before they even start due to the sensitive personal data required. The good news: a synthetic version of this data does not need protection. Synthetic data copies the actual data's structure and statistical properties without recreating personally identifiable information. The bad news: it is difficult to create synthetic data for open-access use without recreating an exact copy of the actual data. This talk will give hands-on insights into synthetic data creation and the challenges along its lifecycle. We will learn how to create and evaluate synthetic data for any use case using the open-source package Synthetic Data Vault. We will find answers to why it takes so long to synthesize the huge amount of data dormant in public administration. The talk addresses data owners who want to open up access to their private data as well as analysts looking to use synthetic data. After this session, listeners will know which steps to take to generate synthetic data for multi-purpose use, and its limitations for real-world analyses.
A vast amount of private data lies dormant in public institutions, hidden from the research community. Synthesizing complex, anonymized data could allow researchers access without disclosing personally identifiable information while keeping information loss minimal. The tools to do this exist, so why is it still difficult to realize synthetic solutions? One challenge is reaching the minimum viable quality to serve as many use cases as possible. Ideally, the synthetic data allows data exploration with the same results as the real data. We will guide you through the challenges of creating synthetic data and shine a light on its lifecycle. We will explore the different levels of quality of generated structured data and discuss their potential. Finally, we will link these issues to the domain of public administration, but the main insights are generally applicable to all kinds of domains. In particular, we will focus on four key questions:

1. How can we create synthetic data from private data?
2. How can synthetic data creation be integrated into institutions that sit on piles of unused, highly private data?
3. Can SOTA methods for synthetic data fulfill all needs of the research community? When is access to the actual, private data needed?
4. Which quality measures are adequate for synthetic data?

As we address these questions, we'll use the Synthetic Data Vault to create and evaluate synthetic data. After the talk, listeners will have understood the concept of synthetic data and will be able to evaluate synthetic data for a plethora of use cases. As a plus, they will also gain a deeper understanding of why open data access is (not yet) solved by synthetic data.
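A minimal sketch of the fit/sample workflow, assuming the pre-1.0 `sdv.tabular` API (current SDV releases expose the same idea through `sdv.single_table` synthesizers and a metadata object):

```python
import pandas as pd
from sdv.tabular import GaussianCopula  # pre-1.0 API; newer SDV uses sdv.single_table

real = pd.DataFrame({
    "age": [34, 51, 29, 43, 62, 38],
    "income": [42_000, 58_000, 31_000, 50_000, 71_000, 45_000],
})

model = GaussianCopula()
model.fit(real)                       # learn structure and statistics
synthetic = model.sample(num_rows=6)  # draw new rows that mimic the real data
print(synthetic)
```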
🎤
Teaching Neural Networks a Sense of Geometry
Speakers:
👤
Jens Agerberg
📅 Wed, 19 Apr 2023 at 10:50
show details
By taking neural networks back to the school bench and teaching them some elements of geometry and topology we can build algorithms that can reason about the shape of data. Surprisingly these methods can be useful not only for computer vision – to model input data such as images or point clouds through global, robust properties – but in a wide range of applications, such as evaluating and improving the learning of embeddings, or the distribution of samples originating from generative models. This is the promise of the emerging field of Topological Data Analysis (TDA) which we will introduce and review recent works at its intersection with machine learning. TDA can be seen as being part of the increasingly popular movement of Geometric Deep Learning which encourages us to go beyond seeing data only as vectors in Euclidean spaces and instead consider machine learning algorithms that encode other geometric priors. In the past couple of years TDA has started to take a step out of the academic bubble, to a large extent thanks to powerful Python libraries written as extensions to scikit-learn or PyTorch.
Researchers have hypothesised that a sense of geometry is something that sets the intelligence of humans apart from that of other animals. This intriguing fact motivates why geometric reasoning can be an interesting direction for AI. How can we incorporate geometric concepts into deep learning? We can tap into the mathematical fields of geometry and topology and see how methods in these fields can be adapted for use in data analysis and machine learning. This is the aim of Topological Data Analysis. Starting from hierarchical clustering, which many data scientists are familiar with, we gently introduce a method used in TDA, where we look at the clustering of a data set at different thresholds and form a topological summary which represents the creation and destruction of clusters (an example of a topological feature) at different thresholds. We then look at a few examples where these methods can be useful:

- In neuroscience, we can use these methods to model neuronal or glial trees, capturing properties of important branching structures and incorporating the invariances that these objects have.
- In image segmentation, we would like to teach a neural network to take the shape of the segmentation masks into consideration, where some of the classical loss functions can't account for these kinds of global properties.
- For dimensionality reduction, we can argue that minimising a reconstruction loss is not enough; instead, we would like to somehow make sure that the shape of the original dataset and its dimensionality-reduced version are similar.
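For a concrete feel of the topological summary described above, here is a small sketch using giotto-tda, one of the scikit-learn-style TDA libraries the abstract alludes to (the noisy circle is a toy stand-in for real data):

```python
import numpy as np
from gtda.homology import VietorisRipsPersistence

# one noisy circle: a point cloud whose "shape" contains a single loop
theta = np.random.uniform(0, 2 * np.pi, 100)
cloud = np.stack([np.cos(theta), np.sin(theta)], axis=1)
cloud += np.random.normal(scale=0.05, size=cloud.shape)

# track connected components (H0) and loops (H1) across scales
vr = VietorisRipsPersistence(homology_dimensions=(0, 1))
diagrams = vr.fit_transform(cloud[None, :, :])  # shape: (1, n_features, 3)
print(diagrams.shape)  # each row is a (birth, death, dimension) triple
```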
🎤
Shrinking gigabyte sized scikit-learn models for deployment
Speakers:
👤
Pavel Zwerschke
👤
Yasin Tatar
📅 Wed, 19 Apr 2023 at 10:50
show details
We present an open-source library to shrink pickled scikit-learn and LightGBM models. We will provide insights into how pickling ML models works and how to improve the on-disk representation. With this approach, we can reduce the deployment size of machine learning applications by up to 6x.
At QuantCo, we create value from data using machine learning. To that end, we frequently build gigabyte-sized machine learning models. However, deploying and sharing those models can be a challenge because of their size. We built and open-sourced a library to aggressively compress tree-based machine learning models: [slim-trees](https://github.com/pavelzw/slim-trees). In this talk, we share our journey and the ideas that went into the making of slim-trees. We delve into the internals of scikit-learn's tree-based models to understand their memory footprint. Afterwards, we explore different techniques that allow us to reduce model size without sacrificing predictive performance. Finally, we present how to include slim-trees in your project and give an outlook on what's to come.
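A sketch of the problem and the intended usage; the baseline size measurement is plain scikit-learn, while the `dump_sklearn_compressed` call follows the slim-trees README as we recall it and should be checked against the repository:

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=5_000, n_features=20, random_state=0)
model = RandomForestRegressor(n_estimators=100).fit(X, y)

# the baseline: a plain pickle of the fitted forest
print(f"pickle size: {len(pickle.dumps(model)) / 1e6:.1f} MB")

# slim-trees stores the tree arrays more compactly; the function name
# is taken from the project README as we recall it -- verify against the repo
from slim_trees import dump_sklearn_compressed

dump_sklearn_compressed(model, "model.pkl.lzma")
```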
🎤
Haystack for climate Q/A
Speakers:
👤
Vibha Vikram Rao
📅 Wed, 19 Apr 2023 at 10:50
show details
How can NLP and Haystack help answer sustainability questions and fight climate change? In this talk we walk through our experience using Haystack to build Question Answering models for the climate change and sustainability domain. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!
Haystack is a framework that enables you to build powerful, production-ready pipelines for different search use cases. You can use the state-of-the-art NLP models in Haystack to provide unique search experiences and allow your users to query in natural language. It is built in a modular fashion so that you can combine the best technology from other open-source projects like Transformers, Elasticsearch, etc. We use Haystack pipelines to build Question Answering systems that answer domain-specific questions about climate change and sustainability topics. We would like to talk about the challenges we faced, how we did it, and how using Haystack can help companies build quicker POCs and eventually take them to production.
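A minimal extractive QA pipeline of the kind described, assuming a 2023-era Haystack 1.x install (the documents and model choice are illustrative):

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

store = InMemoryDocumentStore(use_bm25=True)
store.write_documents([
    {"content": "Scope 1 covers direct emissions from owned sources."},
    {"content": "Scope 2 covers indirect emissions from purchased energy."},
])

pipeline = ExtractiveQAPipeline(
    reader=FARMReader(model_name_or_path="deepset/roberta-base-squad2"),
    retriever=BM25Retriever(document_store=store),
)
result = pipeline.run(query="What does Scope 2 cover?",
                      params={"Retriever": {"top_k": 2}, "Reader": {"top_k": 1}})
print(result["answers"][0].answer)
```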
🎤
Most of you don't need Spark. Large-scale data management on a budget with Python
Speakers:
👤
Guillem Borrell Nogueras
📅 Wed, 19 Apr 2023 at 11:40
show details
The Python data ecosystem has matured during the last decade, and there are fewer and fewer reasons to rely only on large batch processes executed in a Spark cluster; but as with every large ecosystem, putting together the key pieces of technology takes some effort. There are now better storage technologies, streaming execution engines, query planners, and low-level compute libraries. And modern hardware is way more powerful than what you'd probably expect. In this workshop we will explore some global-warming-reducing techniques to build more efficient data transformation pipelines in Python, and a little bit of Rust.
When one looks at the architecture diagram for the big data ecosystem of most corporations, there's a Spark cluster in the center. Some of these corporations have even adopted Spark as the "de facto" platform for ETL. If you have a Spark cluster, it's fine to use it, but maybe there are other ways to extract, transform, and load large volumes of data more efficiently and with less overhead. Some of the technologies that we'll cover are:

* DuckDB. Probably the hottest piece of technology of this decade.
* Polars.
* Datafusion, and a little bit of Rust.
* Microbatching.
* Statistical tests.
* We'll dive a little into what makes Parquet datasets so great.
* Filter pushdown and predicate pushdown.
* Overlapping communication and computation.

We'll work on a synthetic use case where we'll try to find out if an online casino is trying to manipulate the roulette boards. To make things harder, we'll use an old and crappy low-power desktop PC with the equivalent computing power of a modern Raspberry Pi to crunch around half a terabyte of data.
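As a taste of the single-node approach, a DuckDB sketch with made-up file paths and column names; the point is that the Parquet scan only reads the columns and row groups the query actually needs:

```python
import duckdb

con = duckdb.connect()  # in-process: no cluster, no server

# file layout and column names are invented for this sketch; projections
# and filters are pushed down into the Parquet scan, so only the needed
# columns and row groups are read from disk
df = con.execute("""
    SELECT wheel_id, outcome, count(*) AS spins
    FROM read_parquet('spins/*.parquet')
    WHERE spun_at >= DATE '2023-01-01'
    GROUP BY wheel_id, outcome
""").df()
print(df.head())
```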
🎤
Workshop on Privilege and Ethics in Data
Speakers:
👤
Tereza Iofciu
👤
Paula Gonzalez Avalos
📅 Wed, 19 Apr 2023 at 11:40
show details
Data-driven products are becoming more and more ubiquitous. Humans build data-driven products. Humans are intrinsically biased. This bias goes into the data-driven products, confirming and amplifying the original bias. In this tutorial, you will learn how to identify your own -often unperceived- biases and reflect on and discuss the consequences of unchecked biases in Data Products.
Data-driven products are becoming more and more ubiquitous across industries. Data-driven products are built by humans. Humans are intrinsically biased. This bias goes into the data-driven products, which then amplify the original bias. As a consequence, the power imbalances in a data-driven world tend to get bigger instead of smaller, most of the time unintentionally. This is particularly prevalent in the tech sector, where teams are not diverse. One of the obvious solutions is to get diverse teams, but when considering all the intersections of diversity, achieving full diversity is practically an impossible task. Therefore we see education and awareness as foundational steps towards working for a more equitable data world. This tutorial has two parts. In the first exercise, we will start by revisiting our own privileges, as a tool to better educate ourselves in order to identify our individual - often unperceived - biases. In the second part, we will evaluate what happens when these biases occur on a group level and go unchecked into our data products, based on the Data Feminism book and enriched with our own experiences as data professionals. Education about privilege and ethics in the data-driven world can only improve how we see and work with data, and help us better understand how our work with data can affect others.
🎤
Prompt Engineering 101: Beginner intro to LangChain, the shovel of our ChatGPT gold rush
Speakers:
👤
Lev Konstantinovskiy
📅 Wed, 19 Apr 2023 at 11:50
show details
"A modern AI start-up is a front-end developer plus a prompt engineer" is a popular joke on Twitter. This talk is about LangChain, a Python open-source tool for prompt engineering. You can use it with completely open-source language models or with ChatGPT. I will show you how to create a prompt and get an answer from an LLM. As an example application, I will show a demo of an intelligent agent that uses web search and generates Python code to answer questions about this conference.
There is a gold rush to apply AI to anything nowadays. Anyone can do it; you no longer need to be a Machine Learning Engineer! Just write some prompts for ChatGPT. There is a saying: "During a gold rush, sell shovels." This talk is about a wonderful tool, LangChain, as easy to use as a good shovel. LangChain is a Python open-source tool for prompt engineering. You can use it with completely open-source language models or with ChatGPT. The project started 6 months ago and now has 25k GitHub stars and has raised $10 million. What is all this about? This talk is a gentle introduction. It will show how to:

- create a simple prompt
- get an answer from a Large Language Model of your choice - local or API
- chain requests together to search the web and use the Python REPL
- make the LLM choose which tools to use for complex questions
- answer questions over a collection of long documents

As an example application, we will code an AI agent to answer "When is the PyCon DE & PyData Berlin 2023 conference? How many days are between that date and today?" using web search and the Python REPL.
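The smallest possible prompt-to-answer round trip looks roughly like this, assuming the 2023-era LangChain API and an OpenAI key in the environment (the prompt itself is just an example):

```python
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# a prompt template with a single variable to fill in
prompt = PromptTemplate(
    input_variables=["question"],
    template="Answer in one sentence: {question}",
)

# chain the template to an LLM; assumes OPENAI_API_KEY is set
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
print(chain.run(question="What is prompt engineering?"))
```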
🎤
The future of the Jupyter Notebook interface
Speakers:
👤
Jeremy Tuloup
📅 Wed, 19 Apr 2023 at 11:50
show details
Jupyter Notebooks have been a widely popular tool for data science in recent years due to their ability to combine code, text, and visualizations in a single document. Despite its popularity, the core functionality and user experience of the Classic Jupyter Notebook interface has remained largely unchanged over the past years. Lately the Jupyter Notebook project decided to base its next major version 7 on JupyterLab components and extensions, which means many JupyterLab features are also available to Jupyter Notebook users. In this presentation, we will demo the new features coming in Jupyter Notebook version 7 and how they are relevant to existing users of the Classic Notebook.
Jupyter Notebook 7 is based on the JupyterLab codebase but provides an equivalent user experience to the current (version 6) application. Notebook 7 keeps the document-centric user experience at its core and brings many new features that were not previously available:

- Debugger
- Real-time collaboration
- Theming and dark mode
- Internationalization
- Improved Web Content Accessibility Guidelines (WCAG) compliance
- Support for many JupyterLab extensions, including Jupyter LSP (Language Server Protocol) for enhanced code completions
- Performance improvements

This talk will demo the new features coming to Notebook 7 and discuss how users of the Classic Notebook interface should approach the transition. We will also cover other aspects mentioned in the related Jupyter Enhancement Proposal, such as support for popular extensions and future developments: https://jupyter.org/enhancement-proposals/79-notebook-v7/notebook-v7.html
🎤
Modern typed python: dive into a mature ecosystem from web dev to machine learning
Speakers:
👤
samsja
📅 Wed, 19 Apr 2023 at 11:50
show details
Typing is at the center of "modern Python", and tools (mypy, beartype) and libraries (FastAPI, SQLModel, Pydantic, DocArray) based on it are slowly eating the Python world. This talk explores the benefits of Python type hints and shows how they are infiltrating the next big domain: machine learning.
The talk will focus on **modern Python** and its extensive usage of **type hints and static type analysis**. There will be a special focus on **DocArray and multi-modal AI applications**. The talk will cover different topics around modern Python:

- The history of Python and type hints. How did Python go from being a language without static typing to having static type analysis?
- The state of the modern Python ecosystem in 2023:
  - Powerful development tools like mypy and beartype. A parallel with TypeScript.
  - Powerful libraries that leverage type hints: Pydantic, FastAPI, SQLModel, Typer, DocArray
- A deep dive on DocArray and the future of AI-based web apps:
  - Why is modern Python key to speeding up the development of multi-modal AI applications (stable diffusion, neural search, …)?
  - What is DocArray, and how does it extend Pydantic with multi-modal AI in mind?
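A tiny sketch of the pattern the talk revolves around: the same type hints serve mypy at development time and Pydantic at runtime (the `Offer` model is a made-up example):

```python
from pydantic import BaseModel, ValidationError

class Offer(BaseModel):  # hypothetical model, just to show the pattern
    title: str
    price: float

def total(offers: list[Offer]) -> float:  # mypy checks this signature statically
    return sum(o.price for o in offers)

print(total([Offer(title="book", price=9.99)]))

try:
    Offer(title="broken", price="not a number")
except ValidationError as err:
    print(err)  # the type hints are enforced at runtime, too
```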
🎤
Grokking Anchors: Uncovering What a Machine-Learning Model Relies On
Speakers:
👤
Kilian Kluge
📅 Wed, 19 Apr 2023 at 11:50
show details
Assessing the robustness of models is an essential step in developing machine-learning systems. To determine if a model is sound, it often helps to know which and how many input features its output hinges on. This talk introduces the fundamentals of “anchor” explanations that aim to provide that information.
Many data scientists are familiar with algorithms like Integrated Gradients, SHAP, or LIME that determine the importance of input features. But that’s not always the information we need to determine whether a model’s output is sound. Is there a specific feature value that will make or break the decision? Does the outcome solely depend on artifacts in an image? These questions require a different explanation method. First introduced in 2018, “anchors” are a model-agnostic method to uncover what parts of the input a machine-learning model's output hinges on. Their computation is based on a search-based approach that can be applied to different modalities such as image, text, and tabular data. In this talk, to truly grok the concept of anchor explanations, we will implement a basic anchor algorithm from scratch. Starting with nothing but a text document and a machine learning model, we will create a sampling, encoding, and search component and finally compute an anchor. No knowledge of machine learning is required to follow this talk. Aside from familiarity with the basics of `numpy` arrays, all you need is your curiosity.
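To make the idea concrete, here is a toy version of the precision estimate at the heart of anchor search, with a keyword rule standing in for a real model; the actual algorithm searches over candidate anchors with a bandit-style strategy, which this sketch omits:

```python
import numpy as np

def predict(text: str) -> str:
    # stand-in "model": a simple keyword rule
    return "positive" if "great" in text else "negative"

def anchor_precision(tokens, anchor_idx, n_samples=1000, p_keep=0.5):
    """How often does the prediction survive random removal of
    non-anchor tokens? (the precision of a candidate anchor)"""
    rng = np.random.default_rng(0)
    base = predict(" ".join(tokens))
    hits = 0
    for _ in range(n_samples):
        keep = rng.random(len(tokens)) < p_keep
        keep[list(anchor_idx)] = True  # anchor tokens always stay
        hits += predict(" ".join(t for t, k in zip(tokens, keep) if k)) == base
    return hits / n_samples

# "great" (token 3) anchors the prediction: precision should be ~1.0
print(anchor_precision("this movie was great fun".split(), anchor_idx=[3]))
```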
🎤
What are you yield from?
Speakers:
👤
Maxim Danilov
📅 Wed, 19 Apr 2023 at 11:50
show details
Many developers avoid using generators. For example, many well-known Python libraries use lists instead of generators. Generators themselves are slower than plain list loops, but their use in code can greatly increase the speed of an application. Let's discover why.
Many developers avoid using generators in regular Python code: they are hard to debug, not easy to profile, not obvious to refactor, and they require special algorithms. In this talk I speak about generator pipelines, one-line generators, built-in generators, and custom generators with yield and yield from. I will show how to use generators and why we should use them. We will also learn about situations where we can't use generators and how to change our thinking to avoid such situations in the future. I give some hints and examples of how big Python frameworks use lists instead of generators and therefore lose performance. At the end we will see how the built-in zip function works in another world, where developers always use generators in their own code. Let's see what we can yield from this talk…
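A small illustration of the generator-pipeline style the talk advocates: each stage produces values lazily, so the pipeline processes ten million items without ever materialising a list:

```python
def squares(limit):
    for n in range(limit):
        yield n * n  # produced lazily, one value at a time

def evens(values):
    for v in values:
        if v % 2 == 0:
            yield v

# the pipeline never materialises a list, so memory use stays flat
total = sum(evens(squares(10_000_000)))
print(total)
```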
🎤
Maximizing Efficiency and Scalability in Open-Source MLOps: A Step-by-Step Approach
Speakers:
👤
Paul Elvers
📅 Wed, 19 Apr 2023 at 12:25
show details
This talk presents a novel approach to MLOps that combines the benefits of open-source technologies with the power and cost-effectiveness of cloud computing platforms. By using tools such as Terraform, MLflow, and Feast, we demonstrate how to build a scalable and maintainable ML system on the cloud that is accessible to ML Engineers and Data Scientists. Our approach leverages cloud managed services for the entire ML lifecycle, reducing the complexity and overhead of maintenance and eliminating the vendor lock-in and additional costs associated with managed MLOps SaaS services. This innovative approach to MLOps allows organizations to take full advantage of the potential of machine learning while minimizing cost and complexity.
Building a machine learning (ML) system on a cloud platform can be a challenging and time-consuming task, especially when it comes to selecting the right tools and technologies. In this talk, we will present a comprehensive solution for building scalable and maintainable ML systems on the cloud using open-source technologies like MLflow, Feast, and Terraform. MLflow is a powerful open-source platform that simplifies the end-to-end ML lifecycle, including experimentation, reproducibility, and deployment. It allows you to track and compare different runs of your ML models and deploy them to various environments, such as production or staging, with ease. Feast is an innovative open-source feature store that enables you to store and serve features for training, serving, and evaluating ML models. It integrates seamlessly with MLflow, enabling you to track feature versions and dependencies and deploy feature sets to different environments. Terraform is a widely used open-source infrastructure-as-code (IaC) tool that enables you to define and manage your cloud resources in a declarative manner. It allows you to automate the provisioning and management of your ML infrastructure, such as compute clusters, databases, and message brokers, saving you time and effort. In this talk, we will demonstrate how these open-source technologies can be used together to build an ML system on the cloud and discuss the benefits and trade-offs of using them. We will also share best practices and lessons learned from our own experiences building ML systems on the cloud, providing valuable insights and guidance for attendees looking to do the same.
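For readers new to the stack, the experiment-tracking part boils down to a few calls (the tracking URI and the logged values are placeholders):

```python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # assumed tracking server

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("n_estimators", 200)  # record the configuration...
    mlflow.log_metric("rmse", 0.42)        # ...and the resulting quality
```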
🎤
How to connect your application to the world (and avoid sleepless nights)
Speakers:
👤
Luis Fernando Alvarez
📅 Wed, 19 Apr 2023 at 12:25
show details
Let's say you are the ruler of a remote island. For it to succeed and thrive, you can't expect it to be isolated from the world. You need to establish trade routes, offer your products to other islands, and import items from them. Doing this will certainly make your economy grow! We're not going to talk about land masses or commerce, however; you should think of your application as an island that needs to connect to other applications to succeed. Unfortunately, the sea is treacherous and not always very consistent, much like the networks you use to connect your application to the world. We will explore some techniques and libraries in the Python ecosystem used to make your life easier while dealing with external services. From asynchronicity, caching, and testing to building abstractions on top of the APIs you consume, you will definitely learn some strategies to build your connected application gracefully and avoid those pesky 2 AM errors that keep you awake.
This talk will explore best practices for distributed programming in Python and how to solve some of the more common issues when dealing with external systems. We will be exploring a few techniques that can help your system be reliable and available, even if your external services aren't. Agenda:

- Introduction - 2 min
- The problems around distributed computing - 3 min
- Caching - 5 min
- Asynchronous task queuing - 5 min
- Building API abstractions - 5 min
- Testing - 5 min
- Closing statements - 5 min
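One of the simplest defences against a flaky external service is retrying with exponential backoff; here is a sketch using the `tenacity` library and a hypothetical endpoint (the talk may well use different tools):

```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, max=30))
def fetch_rates():
    # hypothetical endpoint; timeouts keep a slow network from hanging us
    resp = requests.get("https://api.example.com/rates", timeout=5)
    resp.raise_for_status()  # treat HTTP errors as failures worth retrying
    return resp.json()
```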
🎤
Dynamic pricing at Flix
Speakers:
👤
Amit Verma
📅 Wed, 19 Apr 2023 at 12:25
show details
In the talk we give a brief overview of how we use dynamic pricing to tune the prices for rides based on demand, time of purchase, unexpected events such as strikes, and other criteria to fulfil our business requirements.
Dynamic pricing is more challenging at Flixbus than at other travel companies, as we do not differentiate prices by categories such as business and economy class, which are often used by trains and airlines. In the talk, we describe the challenges we faced and discuss how we designed innovative solutions to solve them. The main topic I want to present is how we implemented a real-time pipeline to calculate prices based on current demand, and at the same time how it reacts to changes, for example bookings and route changes. I will also present some of the efficient data structures we use to apply changes quickly and efficiently.
🎤
Streamlit meets WebAssembly - stlite
Speakers:
👤
Yuichiro Tachibana
📅 Wed, 19 Apr 2023 at 12:25
show details
Streamlit, a pure-Python data app framework, has been ported to Wasm as "stlite". See its power and convenience with many live examples and explore its internals from a technical perspective. You will learn to quickly create interactive in-browser apps using only Python.
Streamlit lets you create interactive web apps with Python, and its WebAssembly port "stlite" extends its power to in-browser apps. "stlite" offers offline capability, data privacy, scalability, and multi-platform portability, while keeping Streamlit's original strengths such as Python productivity and its rich ecosystem. In this talk, after a short intro to Streamlit, we will review stlite in the context of the recent emergence of various Wasm-based Python frameworks such as PyScript, and show you what's possible with stlite. We will also look at its internals from a technical perspective, which may inspire you with ideas about how to make use of Pyodide and how to port Python frameworks to the Pyodide/Wasm runtime.
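For context, this is the kind of ordinary Streamlit script that stlite can run entirely in the browser, unchanged:

```python
import random

import streamlit as st

st.title("Dice roller")  # runs the same under Streamlit or stlite
n = st.slider("Number of dice", 1, 10, 2)
if st.button("Roll"):
    st.write([random.randint(1, 6) for _ in range(n)])
```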
🎤
Code Cleanup: A Data Scientist's Guide to Sparkling Code
Speakers:
👤
Corrie Bartelheimer
📅 Wed, 19 Apr 2023 at 12:25
show details
Does your production code look like it’s been copied from Untitled12.ipynb? Are your engineers complaining about the code but you can’t find the time to work on improving the code base? This talk will go through some of the basics of clean coding and how to best implement them in a data science team.
Data scientists often have a different background and priorities than software engineers. A lot of the code Data Scientists write never makes it to production, and as a result, the code might not always meet the same standards as production-ready code in a developer team. While it makes sense to have rather lax requirements on code for one-off analyses, this can lead to difficulties in maintaining production code and collaborating on projects with software engineers. Since production code is not (always) the main output of a data science team, it can also be hard to prioritize code quality. In this presentation, we will go over some of the main principles of clean code and talk about practical steps that data science teams can take to improve their code. We will specifically focus on strategies that teams can implement to slowly and steadily improve the existing code base. This talk is aimed at data scientists who may not have a strong background in software engineering, but are interested in improving code quality and collaborating more effectively with software engineering teams.
🎤
You've got trust issues, we've got solutions: Differential Privacy
Speakers:
👤
Vikram Waradpande
👤
Sarthika Dhawan
📅 Wed, 19 Apr 2023 at 14:00
show details
As we are in an era of big data where large volumes of information are assimilated and analyzed for insights into human behavior, data privacy has become a hot topic. Since there is a lot of private information which, once leaked, can be misused, not all data can be released for research. This talk aims to discuss Differential Privacy, a cutting-edge technique of cybersecurity that claims to preserve an individual's privacy, how it is employed to minimize the risks associated with private data, its applications in various domains, and how Python eases the task of employing it in our models with PyDP.
Since there is a lot of private information which, once leaked, can be misused, how should privacy be protected? One might think that simply making the personally identifiable fields in a dataset anonymous is enough, but this can render the entire dataset useless and unfit for analysis. And research has proven that by statistically studying an anonymized dataset together with related datasets, private information can easily be re-extracted! The session will start with a brief on the current standards of privacy and the possible risks of handling customer data. This will lay the foundation for introducing Differential Privacy, a cutting-edge technique of cybersecurity that claims to preserve an individual's privacy by manipulating data in such a way as to not render it useless for data analysis. Developers will gain an insight into the concept of Differential Privacy, how it is employed to minimize the risks associated with private data, its practical applications in various domains, and how Python eases the task of employing it in our models with PyDP. As the talk progresses, a walkthrough of a real-life practical example, along with a nifty visualization, will acquaint the audience with PyDP and show how differentially private results approximate what the unfiltered data would have provided.
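A minimal sketch of the kind of query PyDP enables: a mean with calibrated Laplace noise. The exact constructor signature has varied across PyDP releases, so treat the arguments as an assumption and check the current docs:

```python
# constructor arguments per recent PyDP docs; signatures have varied
# across releases, so verify against the version you install
from pydp.algorithms.laplacian import BoundedMean

ages = [23, 35, 47, 31, 29, 52, 40, 38]

# epsilon=1.0; the bounds clamp each contribution to [18, 80]
dp_mean = BoundedMean(epsilon=1.0, lower_bound=18, upper_bound=80, dtype="float")
print(dp_mean.quick_result(ages))  # a noisy, privacy-preserving mean
```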
🎤
Apache Arrow: connecting and accelerating dataframe libraries across the PyData ecosystem
Speakers:
👤
Joris Van den Bossche
📅 Wed, 19 Apr 2023 at 14:00
show details
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing, and it is becoming the de facto standard for tabular data. This talk will give an overview of recent developments, both in Apache Arrow itself and in how it is being adopted across the PyData ecosystem (and beyond), and show how it can improve your day-to-day data analytics workflows.
The Apache Arrow (https://arrow.apache.org/) project specifies a standardized, language-independent columnar memory format for tabular data. It enables shared computational libraries, zero-copy shared memory, efficient (inter-process) communication without serialization overhead, etc. Nowadays, Apache Arrow is supported by many programming languages and projects, and it is becoming the de facto standard for tabular data. But what does that mean in practice? There is a growing set of tools in the Python bindings, PyArrow, and a growing number of projects that use (Py)Arrow to accelerate data interchange and actual data processing. This talk will give an overview of recent developments, both in Apache Arrow itself and in how it is being adopted across the PyData ecosystem (and beyond), and show how it can improve your day-to-day data analytics workflows.
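At its simplest, working with Arrow from Python looks like this; the columnar `Table` is the interchange currency that pandas, Polars, DuckDB, and friends can all consume:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# an in-memory columnar table
table = pa.table({"city": ["Berlin", "Paris"], "temp_c": [11.5, 13.2]})

# Parquet round trip: Arrow is the in-memory sibling of the on-disk format
pq.write_table(table, "weather.parquet")
df = pq.read_table("weather.parquet").to_pandas()
print(df)
```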
🎤
Bringing NLP to Production (an end to end story about some multi-language NLP services)
Speakers:
👤
Larissa Haas
👤
Jonathan Brandt
📅 Wed, 19 Apr 2023 at 14:00
show details
Models in Natural Language Processing are fun to train but can be difficult to deploy. The size of their models, libraries, and necessary files can be challenging, especially in a microservice environment. When services should be built as lightweight and slim as possible, large (language) models can lead to a lot of problems. Using a recent real-world use case as an example, one that has been running in production for over a year in 10 different languages, I will walk you through my experiences with deploying NLP models. What kinds of pitfalls, shortcuts, and tricks are possible while bringing an NLP model to production? In this talk, you will learn about different ways and possibilities to deploy NLP services. I will speak briefly about the path leading from data to model to a running service (without going into much detail) before focusing on the MLOps part at the end. I will take you with me on my past journey of struggles and successes so that you don't need to take these detours yourselves.
Models in Natural Language Processing are fun to train but can be difficult to deploy. The size of their models, libraries, and necessary files can be challenging, especially in a microservice environment. When services should be built as lightweight and slim as possible, large (language) models can lead to a lot of problems. All the way from brainstorming the use case, receiving and cleaning the data, and training and optimizing the model, through service building, deployment, and quality monitoring, lots of important data-science-related decisions need to be made, which will ultimately influence the selection of deployment tools and infrastructure. And most often, those architectural decisions are rather long-term, so they should be chosen thoughtfully in order to fit into the rest of the architecture. Using a recent real-world use case as an example, one that has been running in production for over a year in 10 different languages, I will walk you through my experiences with deploying NLP models. What kinds of pitfalls, shortcuts, and tricks are possible while bringing an NLP model to production? How can different model types and approaches influence architectural decisions? What are the most important questions to ask when evaluating deployment platforms, when there are several options to choose from? In this talk, you will learn about different ways and possibilities to deploy NLP services. I will speak briefly about the path leading from data to model to a running service (without going into much detail) before focusing on the MLOps part at the end. I will take you with me on my past journey of struggles and successes so that you don't need to take these detours yourselves. To follow this talk, you will need to know the basic concepts of deployment and MLOps, but no deeper knowledge of Python or Natural Language Processing. My goal is to enable you to ask important questions about deployment and going into production right at the beginning of every NLP project. I want you to be aware of problems that might occur so that working on NLP projects will be fun and not overshadowed by deployment issues.
🎤
Behind the Scenes of tox: The Journey of Rewriting a Python Tool with more than 10 Million Monthly Downloads
Speakers:
👤
Jürgen Gmach
📅 Wed, 19 Apr 2023 at 14:00
show details
tox is a widely-used tool for automating testing in Python. In this talk, we will go behind the scenes of the creation of tox 4, the latest version of the tool. We will discuss the motivations for the rewrite, the challenges and lessons learned during the development process. We will have a look at the new features and improvements introduced in tox 4. But most importantly, you will get to know the maintainers.
Do you recall what developer legend Joel Spolsky called the "single worst strategic mistake" in "Things You Should Never Do"? Rewriting software from scratch. That is what we did. For the tox test automation tool. A tool downloaded more than 10 million times a month, heavily used both in the open-source community and in multi-billion-dollar companies. I invite you to join me on the very personal journey of rewriting tox, a journey which started back in 2019. We will have a look at the initial motivations for the rewrite, the design decisions, the challenges, and the lessons learned. We will reconstruct why it took more than three years from the idea to the release, and why this was a good thing. I will explain what we did to ensure the release would cause as few issues as possible, and why we still received dozens and dozens of bug reports about regressions in the days after the release. And finally, I will answer the question: was it worth it?
🎤
Machine Learning Lifecycle for NLP Classification in E-Commerce
Speakers:
👤
Gunar Maiwald
👤
Tobias Senst
📅 Wed, 19 Apr 2023 at 14:00
show details
Running machine learning models in a production environment brings its own challenges. In this talk we would like to present our machine learning lifecycle solution for the text-based catalog classification system at idealo.de. We will share lessons learned and talk about our experiences during the lifecycle migration from a hosted cluster to a cloud solution over the last 3 years. In addition, we will outline how we embedded our ML components in the overall idealo.de processing architecture.
idealo.de offers a price comparison service for millions of products from a wide variety of categories. The automated classification of the offers is carried out using both traditional and deep learning-based approaches. Our machine learning components are part of a fully automated lifecycle and process up to 500 million offers daily at peak times. In addition to the enormous amount of data that we process, we particularly face the challenges of being online 24/7 while adapting to an ever-changing catalog structure. This requires a high level of reliability from our inference service and continuous automated retraining and model deployment. In this talk we would like to share and present our view on MLOps:

- How we integrate our CI/CD and continuous training pipelines with GitHub and AWS SageMaker
- How we migrated the lifecycle from a hosted cluster (running Kubernetes, Argo Workflows and ArgoCD) to the cloud (running AWS SageMaker and a data lake)
- How we keep monitoring of our models, data, and performance indicators up to date and alert in case of disruptions
- How we embed the classifiers in an event-driven heterogeneous software architecture (based on Kotlin and Python)

And share lessons learned on:

- How we keep reliability high while deploying, updating, and scaling our classification inference services
- How we strike a workable compromise between performance and cost requirements
🎤
The Battle of Giants: Causality vs NLP => From Theory to Practice
Speakers:
👤
Aleksander Molak
📅 Wed, 19 Apr 2023 at 14:10
show details
With an average of 3.2 new papers published on arXiv every day in 2022, causal inference has exploded in popularity, attracting a large amount of talent and interest from top researchers and institutions, including industry giants like Amazon and Microsoft. Text data, with its high complexity, poses an exciting challenge for the causal inference community. In the workshop, we'll review the latest advances in the field of Causal NLP and implement a causal Transformer model to demonstrate how to translate these developments into a practical solution that can bring real business value. All in Python!
Join us for a workshop exploring the exciting field of causal inference and its applications in natural language processing (NLP). The workshop is addressed to people who want to enrich their NLP and/or causal inference toolkits and enhance their perspective on contemporary machine learning. The workshop will start with an overview of modern causality frameworks. We'll discuss the most prominent ideas in Causal NLP and present an overview of Causal NLP tasks. Finally, we'll implement the CausalBERT model and demonstrate how it can be used to estimate causal effects in practical contexts. The workshop is open to everyone, yet to fully enjoy the content, it's recommended that you:

• Have a solid understanding of Python fundamentals (lists, dicts, the scientific stack)
• Understand the basics of graph theory (nodes, directed and undirected edges)
• Have a good understanding of deep learning basics
• Have a good understanding of NLP concepts like tokens and embeddings

The goal of this workshop is to give you a practical understanding of how to implement Causal NLP methods and inspire you to explore the fast-growing world of causality.
🎤
Contributing to an open-source content library for NLP
Speakers:
👤
Leonard Püttmann
📅 Wed, 19 Apr 2023 at 14:10
show details
Bricks is an open-source content library for natural language processing, which provides the building blocks to quickly and easily enrich, transform or analyze text data for machine learning projects. For many Pythonistas, contributing to an open-source project seems scary and intimidating. In this tutorial, we offer a hands-on experience in which programmers and data scientists learn how to code their own building blocks and share their creations with the community with ease.
We will prepare some easy use cases so that attendees with novice machine learning and NLP skills can participate in the session. A basic understanding of Python is required, but everyone who wants to learn more about machine learning, NLP, or open-source contributions is welcome. A brick is a modular piece of software that enriches, transforms, or analyzes text data for natural language processing, a sub-domain of machine learning. What sets a brick apart from a simple code snippet is its suitability for multiple execution environments. A brick module can also be executed in a demo playground, allowing users to try out different inputs to see if the brick meets their needs. In this session, we will begin by outlining some ideas for building a brick. After substantiating our ideas, we will make the code usable in different environments, such as the playground for testing inputs. Since spaCy is commonly used in many NLP projects, we will also build a variant of the code that takes a spaCy document as input. Add some documentation, and voilà! You now have a brick.
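The heart of a brick is usually just a small, well-documented function; here is a toy example of the spaCy-flavoured variant (not the actual bricks module layout, and it assumes the small English model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

def entity_count(text: str) -> int:
    """Toy enrichment of the kind a brick wraps: count named entities."""
    doc = nlp(text)
    return len(doc.ents)

print(entity_count("Berlin hosted PyCon DE and PyData in April 2023."))
```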
🎤
Introduction to Async programming
Speakers:
👤
Dishant Sethi
📅 Wed, 19 Apr 2023 at 14:35
show details
Asynchronous programming is a type of parallel programming in which a unit of work is allowed to run separately from the primary application thread. After execution, it notifies the main thread about the completion or failure of the worker thread. There are numerous benefits to using it, such as improved application performance, enhanced responsiveness, and effective usage of the CPU. Asynchronicity seems to be a big reason why Node.js is so popular for server-side programming. Most of the code we write, especially in heavy-IO applications like websites, depends on external resources. This could be anything from a remote database call to a POST request to an API. As soon as you ask for any of these resources, your code is waiting around with nothing to do while the process completes. With asynchronous programming, you allow your code to handle other tasks while waiting for these other resources to respond. In this session, we are going to talk about asynchronous programming in Python: its benefits and multiple ways to implement it.
How do we implement asynchronicity in Python?

1. Multiple processes: The most obvious way is to use multiple processes. From the terminal, you can start several scripts, and they will all run independently, at the same time. The operating system underneath takes care of sharing your CPU resources among all those instances. Alternatively, you can use the multiprocessing library, which supports spawning processes, as shown in the sketch after this list.

2. Multiple threads: The next way to run multiple things at once is to use threads. A thread is a line of execution, much like a process, but you can have multiple threads in the context of one process, and they all share access to common resources. Because of this, threaded code is difficult to write correctly. Again, the operating system does the heavy lifting of sharing the CPU, but the global interpreter lock (GIL) allows only one thread to run Python code at a given time, even when you have multiple threads running. So, in CPython, the GIL prevents multi-core concurrency: you are effectively running on a single core even though you may have two, four, or more.

3. Coroutines using yield: Coroutines are generalizations of subroutines. They are used for cooperative multitasking, where a process voluntarily yields (gives away) control periodically, or when idle, in order to enable multiple applications to run simultaneously.

4. Asynchronous programming: The fourth way, in which the OS does not participate, is asyncio. Asyncio is the concurrency module introduced in Python 3.4. It is designed to use coroutines and futures to simplify asynchronous code and make it almost as readable as synchronous code, since there are no callbacks.

5. Redis and Redis Queue (RQ): Using asyncio and aiohttp may not always be an option, especially if you are on an older version of Python. There are also scenarios where you want to distribute your tasks across different servers. In that case, we can leverage RQ (Redis Queue), a simple Python library for queueing jobs and processing them in the background with workers. It is backed by Redis, a key/value data store.

A practical definition of async is that it is a style of concurrent programming in which tasks release the CPU during waiting periods so that other tasks can use it. In Python, there are several ways to achieve concurrency; based on our requirements, code flow, data manipulation, architecture design, and use cases, we can select any of these methods.
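The multiprocessing example referenced in the first item might look like this minimal sketch (the worker function and URLs are placeholders):

```python
from multiprocessing import Process

def fetch(url: str) -> None:
    # placeholder for real work, e.g. downloading and parsing a page
    print(f"processing {url}")

if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b"]
    workers = [Process(target=fetch, args=(url,)) for url in urls]
    for p in workers:
        p.start()  # each worker runs in its own OS process
    for p in workers:
        p.join()   # wait for all workers to finish
```

For comparison, the same idea written with asyncio, where `asyncio.sleep` stands in for real I/O such as an HTTP request:

```python
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(1)  # stands in for real I/O, e.g. an HTTP request
    return f"done: {url}"

async def main() -> None:
    urls = ["https://example.com/a", "https://example.com/b"]
    # both coroutines wait concurrently, so this takes about one second
    # in total rather than two
    results = await asyncio.gather(*(fetch(u) for u in urls))
    print(results)

asyncio.run(main())
```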
🎤
The Beauty of Zarr
Speakers:
👤
Sanket Verma
📅 Wed, 19 Apr 2023 at 14:35
show details
In this talk, I’d be talking about [Zarr](https://zarr.dev/), an open-source data format for storing chunked, compressed N-dimensional arrays. The talk presents a systematic approach to understanding and implementing Zarr: how it works, why you’d use it, and a hands-on session at the end. Zarr is based on an open [technical specification](https://zarr.readthedocs.io/en/stable/spec/v2.html), making implementations across several languages possible. I’d mainly talk about [Zarr’s Python](https://github.com/zarr-developers/zarr-python) implementation and show how it beautifully interoperates with the existing libraries in the PyData stack.
[Zarr](https://zarr.dev/) is a data format for storing chunked, compressed N-dimensional arrays. Zarr is based on an open [technical specification](https://zarr.readthedocs.io/en/stable/spec/v2.html) and has [implementations](https://github.com/zarr-developers/zarr_implementations) in several languages, with [Zarr-Python](https://github.com/zarr-developers/zarr-python) being the most used. Zarr is a [NumFOCUS sponsored project](https://numfocus.org/sponsored-projects) and is under their umbrella.

### Outline:

First, I’d be talking about:

### What’s, Why’s, and How’s of Zarr (15 mins.)

- How does Zarr work?
  - Talking about the motivation and functionality of Zarr
- What’s the need for using Zarr?
  - When, where and why to use it?
- Pluggable compressors and file storage
  - Talking about the several compressors and file-storage systems available in Zarr
- Managing (selection, resizing, writing, reading) chunked arrays using Zarr functions
  - Using built-in functions to manage compressed chunks
- How is Zarr different from other storage formats?
  - Talking briefly about the technical specification, which allows Zarr to have implementations in several languages
  - Pros and cons compared to other storage formats
- Zarr community
  - What is the Zarr community, and how do we do things?

Then, I’d be doing a hands-on session, which would cover the following:

### Hands-on (10 mins.)

- Creating and using Zarr arrays
  - Using built-in functions to create Zarr arrays and read and write data to them
- Looking under the hood
  - Using store functions to explain how your Zarr data is stored
- Consolidating metadata
  - Consolidating the metadata for an entire group into a single object
- Writing to and reading from cloud object storage
  - Using S3/GCS/Azure to create Zarr arrays and write data to them
- Showing how Zarr interoperates with the PyData stack
  - How Zarr interoperates with the PyData stack (NumPy, Dask and Xarray), and how you can write data to your Zarr chunks at incredibly high speed in parallel using Dask

I’d be closing the talk with:

### Conclusion (5 mins.)

- Key takeaways
- How can you contribute to Zarr?
- Q&A

This talk is aimed at an audience who work with large amounts of data and are in search of a data format that is transparent, easy to use and friendly to the environment. Zarr is widely used in the bioimaging, geospatial and research communities, so Zarr is your one-stop solution if you’re part of a community or an organisation dealing with high-volume data. Anyone who is curious and wants to learn about Zarr and how to use it is also most welcome. The tone of the talk is informative, along with a hands-on session; I’m happy to adjust the style according to the audience in the room. Intermediate knowledge of Python and NumPy arrays is required for attendees to get the most out of this talk.

### After this talk, you’d learn:

- Basic use cases for Zarr and how to use it
- The basics of data storage in Zarr
- The basics of compressors and file-storage systems in Zarr
- How to make a better and more informed decision about what data format to use for your data
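As a taste of the hands-on portion, a minimal sketch of creating and reading a chunked, compressed Zarr array with Zarr-Python; the shape, chunking and on-disk path are arbitrary choices here, and exact defaults (e.g. the compressor) may vary between versions:

```python
import numpy as np
import zarr

# create a 10,000 x 10,000 float64 array stored as 1,000 x 1,000 chunks
z = zarr.open("example.zarr", mode="w", shape=(10_000, 10_000),
              chunks=(1_000, 1_000), dtype="f8")

# writes only touch the chunks the selection overlaps
z[0, :] = np.arange(10_000)

# reads decompress only the chunks needed for the selection
print(z[0, :5])
print(z.info)  # summary: shape, chunks, compressor, storage size
```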
🎤
Cloud Infrastructure From Python Code: How Far Could We Go?
Speakers:
👤
Etzik Bega
👤
Asher Sterkin
📅 Wed, 19 Apr 2023 at 14:35
show details
Discover how Infrastructure From Code (IfC) can revolutionize Cloud DevOps automation by generating cloud deployment templates directly from Python code. Learn how this technology empowers Python developers to easily deploy and operate cost-effective, secure, reliable, and sustainable cloud software. Join us to explore the strategic potential of IfC.
## Audience

The talk is a call to action for the whole Python community to take an active part in unlocking Python's full potential as a truly cloud-native programming language, by adapting its runtime and compiler to work optimally with cloud resources.

## Why are SDK programming and Infrastructure as Code not enough anymore?

Developing cloud software using a cloud SDK combined with deployment automation using Infrastructure as Code (IaC) templates has some serious limitations. Both SDKs and IaC operate at a relatively low level, require special expertise that takes time to acquire, are disconnected from each other, and are too often prepared by separate engineering teams. Applying SDK+IaC to multiple test, staging, and production environments can exacerbate complexity and size issues. As a result, there is a need for a more efficient and automated approach to cloud infrastructure management that integrates tightly with application code.

## What is Infrastructure From Code?

Infrastructure from Code (IfC) is a newer and more advanced approach than IaC. It interprets mainstream programming language code and automatically generates the specifications needed to configure a cloud environment. Advanced solutions like ServerlessCloud, Ampt, and Nitric have been proposed for the TypeScript ecosystem. This talk will explore the current state of IfC for Python, its potential, and what needs to be done to make Python a truly cloud-native programming language.

## Talk Outline

1. Infrastructure from Python Code (PyIfC) Mission
2. The Challenges of SDK Programming Combined with Infrastructure as Code (IaC)
3. The PyIfC Approach: How It Works and Its Benefits
4. Sample Code and Demo
5. A Closer Look at PyIfC's Inner Workings
6. Overcoming Deployment Location Optimization and Sustainability Challenges
7. Overview of the Existing Solutions Landscape for PyIfC
8. Unleashing the Full Potential of the Python Ecosystem
9. The Intersection of PyIfC and Domain-Driven Design
10. Advancing PyIfC: What Needs to Be Done
11. Key Takeaways and Next Steps
12. Q&A

# Tags

Cloud, Deployment, Automation, Serverless, Infrastructure as Code, IaC, Infrastructure From Code, IfC, Python
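To make the idea concrete, here is a purely hypothetical sketch of what an IfC-style Python API could look like. The `ifc` module, decorator, and class names are invented for illustration and do not correspond to any specific IfC product:

```python
# hypothetical IfC-style API: nothing here is a real library
from ifc import Bucket, on_http

uploads = Bucket("uploads")  # the toolchain infers an object store is needed

@on_http("/resize")
def resize(request):
    # reading from the bucket tells the IfC analyzer this function needs
    # read access, so a least-privilege IAM policy can be generated for it
    image = uploads.get(request.params["key"])
    return {"size": len(image)}

# a deploy command would statically analyze this module and emit the cloud
# deployment template (function, bucket, permissions, HTTP route)
```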
🎤
Giving and Receiving Great Feedback through PRs
Speakers:
👤
David Andersson
📅 Wed, 19 Apr 2023 at 14:35
show details
Do you struggle with PRs? Have you ever had to change code even though you disagreed with the change just to land the PR? Have you ever given feedback that would have improved the code only to get into a comment war? We'll discuss how to give and receive feedback to extract maximum value from it and avoid all the communication problems that come with PRs.
Do you struggle with PRs? Have you ever had to change code even though you disagreed with the change just to land the PR? Have you ever given feedback that would have improved the code only to get into a comment war? We'll discuss how to give and receive feedback to extract maximum value from it and avoid all the communication problems that come with PRs. We'll start with some thoughts about what PRs are intended to achieve. Then we'll discuss how to give feedback that will be well received and result in improvements to the code, followed by how to extract maximum value from the feedback you receive without agreeing to suboptimal changes. Finally, we will look at a checklist for giving and receiving feedback that you can use as you go through reviews, both as an author and as a reviewer.
🎤
evosax: JAX-Based Evolution Strategies
Speakers:
👤
Robert Lange
📅 Wed, 19 Apr 2023 at 14:35
show details
Tired of having to handle asynchronous processes for neuroevolution? Do you want to leverage massive vectorization and high-throughput accelerators for evolution strategies (ES)? [evosax](https://github.com/RobertTLange/evosax) allows you to leverage JAX, XLA compilation and auto-vectorization/parallelization to scale ES to your favorite accelerators. In this talk we will get to know the core API and how to solve distributed black-box optimization problems with evolution strategies.
The deep learning revolution has been greatly accelerated by the 'hardware lottery': recent advances in modern hardware accelerators and compilers paved the way for large-scale batch gradient optimization. Evolutionary optimization, on the other hand, has mainly relied on CPU parallelism, e.g. using Dask scheduling and distributed multi-host infrastructure. Here we argue that modern evolutionary computation can also benefit significantly from the massive computational throughput provided by GPUs and TPUs. In order to better harness these resources and to enable the next generation of black-box optimization algorithms, we release [evosax](https://github.com/RobertTLange/evosax): a JAX-based library of evolution strategies which allows researchers to leverage powerful function transformations such as just-in-time compilation, automatic vectorization and hardware parallelization. [evosax](https://github.com/RobertTLange/evosax) implements 30 evolutionary optimization algorithms, including finite-difference-based and estimation-of-distribution evolution strategies as well as various genetic algorithms. Every algorithm can be executed directly on hardware accelerators and automatically vectorized or parallelized across devices using a single line of code. The library is designed in a modular fashion and allows for flexible usage via a simple ask-evaluate-tell API. We thereby hope to facilitate a new wave of scalable evolutionary optimization algorithms.
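For a flavour of the ask-evaluate-tell API mentioned above, a minimal sketch of optimizing a toy objective with CMA-ES; the call signatures follow the library's README at the time of writing and may differ between versions:

```python
import jax
import jax.numpy as jnp
from evosax import CMA_ES

def sphere(x):
    # toy black-box objective: minimum at the origin
    return jnp.sum(x ** 2)

rng = jax.random.PRNGKey(0)
strategy = CMA_ES(popsize=32, num_dims=5)
es_params = strategy.default_params
state = strategy.initialize(rng, es_params)

for gen in range(50):
    rng, rng_ask = jax.random.split(rng)
    # ask: sample a population of candidate solutions
    x, state = strategy.ask(rng_ask, state, es_params)
    # evaluate: vectorized fitness over the whole population
    fitness = jax.vmap(sphere)(x)
    # tell: update the search distribution with the results
    state = strategy.tell(x, fitness, state, es_params)

print(state.best_fitness)
```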
🎤
Postmodern Architecture: The Python Powered Modern Data Stack
Speakers:
👤
John Sandall
📅 Wed, 19 Apr 2023 at 15:10
show details
The Modern Data Stack has brought a lot of new buzzwords into the data engineering lexicon: "data mesh", "data observability", "reverse ETL", "data lineage", "analytics engineering". In this light-hearted talk we will demystify the evolving revolution that will define the future of data analytics & engineering teams. Our journey begins with the PyData Stack: pandas pipelines powering ETL workflows...clean code, tested code, data validation, perfect for in-memory workflows. As demand for self-serve analytics grows, new data sources bring more APIs to model, more code to maintain, DAG workflow orchestration tools, new nuances to capture ("the tax team defines revenue differently"), more dashboards, more not-quite-bugs ("but my number says this..."). This data maturity journey is a well-trodden path with common pitfalls & opportunities. After dashboards comes predictive modelling ("what will happen"), prescriptive modelling ("what should we do?"), perhaps eventually automated decision making. Getting there is much easier with the advent of the Python Powered Modern Data Stack. In this talk, we will cover the shift from ETL to ELT, the open-source Modern Data Stack tools you should know, with a focus on how dbt's new Python integration is changing how data pipelines are built, run, tested & maintained. By understanding the latest trends & buzzwords, attendees will gain a deeper insight into Python's role at the core of the future of data engineering.
This light-hearted talk will aim to introduce the audience to the theory and terminology of data pipelines and architectures past, present and future. The "Modern Data Stack", a set of interoperable tools, introduced a shift in how organisations can rapidly construct a data architecture that combines multiple data sources into a single unified data warehouse, with clean analytics-ready tables for plugging in BI tools, self-serve analytics dashboards, and ML models. Until recently, the complexity of data transformation and modelling was limited to what could be done with SQL, leaving the rich ecosystem of Python tooling for complex transformations, geospatial analytics, time series modelling, data validation, and clean, tested, CI-enabled codebases mostly uninvited to the Modern Data Stack party. One recent trend is the number of tools that launched Python integrations in 2022 (most notably dbt), opening up a world of productivity and fast, scalable data processing for the PyData-savvy Pythonista. Another recent trend is an explosion of jargon, with analytics engineers getting into heated debates around whether data observability or metadata capture should be prioritised within a data mesh architecture. These are all important concepts, especially for organisations operating at a scale where reliable data governance is mission-critical. But not all organisations operate at that scale, and every organisation, large or small, is on its own data maturity journey. My goal with this talk is to bring these concepts together, introduce attendees to these recent trends, and provide a framework they can take back into their organisations for accelerating their own data maturity journey using the latest tooling & best practices.
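As a flavour of the dbt Python integration mentioned above, a minimal sketch of a dbt Python model; the upstream model and column names are invented, and the concrete dataframe type returned by `dbt.ref` depends on the warehouse adapter (e.g. Snowpark or PySpark):

```python
def model(dbt, session):
    # tell dbt how to materialize this model in the warehouse
    dbt.config(materialized="table")

    # "stg_orders" is a hypothetical upstream model; on Snowflake this
    # returns a Snowpark DataFrame, which we convert to pandas here
    orders = dbt.ref("stg_orders").to_pandas()

    orders["revenue_eur"] = orders["amount_cents"] / 100

    # the returned dataframe becomes the model's table in the warehouse
    return orders
```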
🎤
Fear the mutants. Love the mutants.
Speakers:
👤
Max Kahan
📅 Wed, 19 Apr 2023 at 15:10
show details
Developers often use code coverage as a target, which makes it a bad measure of test quality. Mutation testing changes the game: create mutant versions of your code that break your tests, and you'll quickly start to write better tests! Come and learn to use it as part of your CI/CD process. I promise, you'll never look at penguins the same way again!
Code coverage (the percentage of your code exercised by your tests) is a great metric. However, coverage doesn't tell you how good your tests are at picking up changes to your codebase: if your tests aren't well designed, changes can pass your unit tests but break production. Mutation testing is a great (and massively underrated) way to quantify how much you can trust your tests. Mutation tests work by changing your code in subtle ways, then applying your unit tests to these new, "mutant" versions of your code. If your tests fail, great! If they pass… that's a change that might cause a bug in production. In this talk, I'll show you how to get started with mutation testing and how to integrate it into your CI/CD pipeline. After the session, you'll be ready to use mutation testing with wild abandon. Soon, catching mutant code will be a routine part of your release engineering process, and you'll never look at penguins the same way again!
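To make the core idea concrete, a small hypothetical illustration: the first test below executes every line of `is_adult`, so line coverage reads 100%, yet a mutant that changes `>=` to `>` still passes it. Adding the boundary case kills that mutant:

```python
def is_adult(age: int) -> bool:
    return age >= 18

def test_is_adult():
    # 100% line coverage, but a `>=` -> `>` mutant would still pass
    assert is_adult(30)
    assert not is_adult(5)

def test_is_adult_boundary():
    # this boundary check is what actually kills the mutant
    assert is_adult(18)
```

Mutation-testing tools such as mutmut generate and run mutants like this automatically, reporting every mutant that survives your test suite.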
🎤
Rethinking codes of conduct
Speakers:
👤
Tereza Iofciu
📅 Wed, 19 Apr 2023 at 15:10
show details
Did you know that the Python Software Foundation Code of Conduct is turning 10 years old in 2023? It was voted in because the PSF felt the community was “unbalanced and not seeing the true spectrum of the greater community”. Why is that a big thing? Come to my talk and find out!
Did you know that the Python Software Foundation Code of Conduct is turning 10 years old in 2023? It was voted in because the PSF felt the community was “unbalanced and not seeing the true spectrum of the greater community”, and thought that with time it could “advance towards a more diverse representation.”[1] Why is that a big thing? Codes of conduct are an important part of any community, outlining its values. They establish clear guidelines for acceptable behavior and help to create a safe and inclusive environment. This can prevent discrimination and promote equal opportunities for all members. In this talk, we will explore the role of codes of conduct in communities and their history in the PSF, and discuss strategies for rethinking what it means to have and enforce a code of conduct. We will look at the challenges that Python communities face when implementing codes of conduct and talk about possible solutions. What does it look like when it works well, and when it doesn't? As codes of conduct are an essential part of any open source project, reflecting on these guidelines can help ensure that projects are successful and sustainable in the long term. Thinking back to Python, which also turns 20 in 2023: “Python got to where it is by being open, and it'll continue to prosper by remaining open”. It's important we continue this mission; after all, one of the things many people love about Python is the community. [1] https://pyfound.blogspot.com/2013/06/announcing-code-of-conduct-for-use-by.html
🎤
How to increase diversity in open source communities
Speakers:
👤
Maren Westermann
📅 Wed, 19 Apr 2023 at 15:10
show details
Today, state-of-the-art technology and scientific research strongly depend on open source libraries. The demographic of the contributors to these libraries is predominantly white and male [1][2][3][4]. This situation creates problems both for individual contributors outside this demographic and for the open source projects themselves: lost career opportunities for the former and less robust technologies for the latter [1][7]. In recent years there have been a number of recommendations and initiatives to increase participation in open source projects by groups who are underrepresented in this domain [1][3][5][6]. While these efforts are valuable and much needed, contributor diversity remains a challenge in open source communities [2][3][7]. This talk highlights the underlying problems and explores how we can overcome them.
In this talk we’ll first examine the problems encountered by people belonging to marginalised groups in open source as well as by project maintainers with respect to contributing to and increasing the diversity of open source projects, respectively [1][2][3][4][5][6]. Building on this overview, we’ll go over what kind of actions have been taken to increase diversity in open source projects, with special focus on scientific libraries, and the effects they have had [1][6][7]. Lastly, we’ll look at ideas that are currently being tested and next steps. By the end of this talk, the audience will have a good understanding of why contributor diversity is low in open source, the efforts that have been made so far to address this problem, and what can further be done to increase the presence of underrepresented groups in technology in general, and in open source in particular. References: [1] https://www.wired.com/2017/06/diversity-open-source-even-worse-tech-overall [2] https://arxiv.org/pdf/1706.02777.pdf [3] https://ieeexplore.ieee.org/abstract/document/8870179 [4] https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9354402 [5] https://biancatrink.github.io/files/papers/JISA2021.pdf [6] https://arxiv.org/pdf/2105.08777.pdf [7] https://blog.scikit-learn.org/events/sprints-value/
🎤
Great Security Is One Question Away
Speakers:
👤
Wiktoria Dalach
📅 Wed, 19 Apr 2023 at 15:10
show details
After a decade of writing code, I joined the application security team. During the transition process, I discovered that there are many myths about security and how difficult it is. Often devs choose to ignore it because they think that writing more secure code would take them ages. That is not true. Security doesn't have to be scary. From my talk, you will learn the most useful pieces of application security theory. It will be practical and not boring at all.
There are so many myths about security and how difficult it is. Often devs choose to ignore it because they think that writing more secure code would take them ages. That is not true. Security doesn't have to be scary. In my talk, I share five tips that can almost immediately make a product more secure. After a decade of writing code, I joined the application security team. During the transition process, I discovered that there are a few pieces of security theory that would have made my life as a developer much less painful if I had known them earlier:
- Always validate the input (see the sketch after this list)
- Do not commit credentials to your repository
- Use scanners to find vulnerabilities
- Learn the CIA triad: Confidentiality, Integrity and Availability form a useful framework for developing a security mindset. It is a simple yet powerful piece of theory: it can be the basis for threat modeling a whole project, but it also works at the level of a single user story.
- When in doubt, ask your security team for help
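As a minimal illustration of the first tip, a hypothetical input-validation sketch using pydantic; the request fields and constraints are invented, and any validation library (or plain Python checks) works just as well:

```python
# pydantic v1-style models; field names and limits are illustrative
from pydantic import BaseModel, ValidationError, conint, constr

class SignupRequest(BaseModel):
    username: constr(strip_whitespace=True, min_length=3, max_length=30)
    age: conint(ge=0, le=150)

def handle_signup(raw: dict) -> str:
    try:
        req = SignupRequest(**raw)
    except ValidationError as err:
        # reject malformed input at the boundary instead of letting it
        # leak into business logic or database queries
        return f"rejected: {err.errors()[0]['msg']}"
    return f"welcome, {req.username}"

print(handle_signup({"username": "ada", "age": 36}))  # welcome, ada
print(handle_signup({"username": "a", "age": -1}))    # rejected: ...
```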