Multi-language Data Wrangling and Acquisition Conversational Agents

FOSDEM 2022

In this presentation we discuss Conversational Agent (CA) designs for two closely related problem areas:

Data Acquisition Workflows (DAWs)
Data Transformation Workflows (DTWs)

The CA perspective is taken mostly for exposition and didactic purposes. Nevertheless, we emphasise the practical applicability of the underlying designs and implementations.

Although data acquisition is, operationally, a prerequisite for data wrangling, we discuss data wrangling first: the corresponding DTW designs and implementations are more mature, and the related materials are more universal, since they apply to multiple programming languages.

Anton Antonov

Outline

Data Wrangling

In the first part of the presentation we show and compare data wrangling examples in different programming languages using different packages.

Here is a list of the programming languages and packages we consider:

Julia-DataFrames
Python-pandas
R
R-tidyverse
WL (Wolfram Language)

We look into common data wrangling workflows and into how to design a conversational agent that translates natural language commands into data wrangling code for Julia, Python, R, SQL, and WL.

WL's external evaluator features are used heavily.
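
To make the idea of translating natural language commands into code for several targets concrete, here is a minimal sketch in Python. It is not the implementation presented in the talk: the command phrases, the `PATTERNS` and `TEMPLATES` tables, and the `translate` function are hypothetical names chosen for illustration, and only two targets (pandas and R-tidyverse) are covered. The generated strings use calls that exist in those packages (`pd.read_csv`, `DataFrame.query`, `groupby(...).size()`, `readr::read_csv`, `dplyr::filter`, `dplyr::group_by`, `dplyr::count`).

```python
import re

# Hypothetical command patterns: each maps a natural language phrase to a step name.
PATTERNS = [
    (re.compile(r"^load (?:the )?file (\S+)$", re.IGNORECASE), "load"),
    (re.compile(r"^filter (?:by|with) (.+)$", re.IGNORECASE), "filter"),
    (re.compile(r"^group by (\w+)$", re.IGNORECASE), "group"),
    (re.compile(r"^(?:show )?counts?$", re.IGNORECASE), "count"),
]

# Hypothetical code templates, one table per target language/package.
TEMPLATES = {
    "pandas": {
        "load": 'obj = pd.read_csv("{arg}")',
        "filter": 'obj = obj.query("{arg}")',
        "group": 'obj = obj.groupby("{arg}")',
        "count": "obj = obj.size()",
    },
    "tidyverse": {
        "load": 'obj <- readr::read_csv("{arg}")',
        "filter": "obj <- dplyr::filter(obj, {arg})",
        "group": "obj <- dplyr::group_by(obj, {arg})",
        "count": "obj <- dplyr::count(obj)",
    },
}


def parse(commands: str) -> list[tuple[str, str]]:
    """Split a ';'-separated command string into (step, argument) pairs."""
    steps = []
    for raw in commands.split(";"):
        text = raw.strip()
        if not text:
            continue
        for pattern, step in PATTERNS:
            match = pattern.match(text)
            if match:
                arg = match.group(1) if match.groups() else ""
                steps.append((step, arg))
                break
        else:
            raise ValueError(f"Unrecognized command: {text!r}")
    return steps


def translate(commands: str, target: str) -> str:
    """Render the parsed workflow as code for the given target."""
    templates = TEMPLATES[target]
    return "\n".join(templates[step].format(arg=arg) for step, arg in parse(commands))


if __name__ == "__main__":
    workflow = "load file titanic.csv; filter by Age > 30; group by Sex; show counts"
    for target in TEMPLATES:
        print(f"# {target}")
        print(translate(workflow, target))
```

The point of the design is that the parsing step is shared and only the template table changes per target, so adding Julia, SQL, or WL would mean adding one more table. A real agent would need a proper grammar rather than regular expressions, and would have to handle quoting in filter conditions.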

Data Acquisition Workflows

In the second part of the presentation we discuss the following facets of a data acquisition system:

Conversational Agent based on a Finite State Machine
Gathering and utilizing metadata taxonomies
Making dataset recommender systems and search engines
- In/for both R and WL
Making (ingredient) variables queries
Introspection queries
Random data generation specifications
Data obfuscation specifications
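
As an illustration of the first two facets, a conversational agent driven by a finite state machine and a lookup over metadata tags can be sketched as below. This is a toy under stated assumptions, not the system presented in the talk: the `CATALOGUE` contents, the state names, and the `AcquisitionAgent` class are all hypothetical, and matching is done by simple word overlap rather than by a recommender or search engine.

```python
from dataclasses import dataclass, field

# Hypothetical metadata catalogue: dataset name -> taxonomy tags.
CATALOGUE = {
    "titanic": {"passengers", "survival", "categorical"},
    "iris": {"flowers", "numeric", "classification"},
    "mtcars": {"cars", "numeric", "regression"},
}


@dataclass
class AcquisitionAgent:
    """Toy finite state machine: WaitForRequest -> WaitForChoice -> Acquired."""

    state: str = "WaitForRequest"
    candidates: list[str] = field(default_factory=list)

    def step(self, utterance: str) -> str:
        words = set(utterance.lower().split())

        if self.state == "WaitForRequest":
            # Keep datasets whose tags overlap with the words of the request.
            self.candidates = sorted(
                name for name, tags in CATALOGUE.items() if tags & words
            )
            if not self.candidates:
                return "No datasets match; please rephrase."
            if len(self.candidates) == 1:
                self.state = "Acquired"
                return f"Acquired dataset: {self.candidates[0]}"
            self.state = "WaitForChoice"
            return "Which one: " + ", ".join(self.candidates) + "?"

        if self.state == "WaitForChoice":
            chosen = [name for name in self.candidates if name in words]
            if len(chosen) == 1:
                self.state = "Acquired"
                return f"Acquired dataset: {chosen[0]}"
            return "Please name one of: " + ", ".join(self.candidates)

        return "Session finished."


if __name__ == "__main__":
    agent = AcquisitionAgent()
    print(agent.step("show me numeric datasets"))
    print(agent.step("take iris"))
```

Each call to `step` consumes one user utterance and returns the agent's reply, with the current state deciding how the utterance is interpreted. Further states, for example for introspection queries or random data generation specifications, would be added as extra branches with their own transitions.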

Extensions to ML model acquisition workflows

Speakers: Anton Antonov