Unlocking Information - Creating Synthetic Data for Open Access.

PyCon DE & PyData Berlin 2023

Many good project ideas fail before they even start because they require sensitive personal data. The good news: a synthetic version of this data does not need protection. Synthetic data copies the actual data's structure and statistical properties without recreating personally identifiable information. The bad news: it is difficult to create synthetic data for open-access use without recreating an exact copy of the actual data. This talk will give hands-on insights into synthetic data creation and the challenges along its lifecycle. We will learn how to create and evaluate synthetic data for any use case using the open-source package Synthetic Data Vault. We will also look at why it takes so long to synthesize the huge amounts of data lying dormant in public administration. The talk addresses data owners who want to open up access to their private data as well as analysts looking to use synthetic data. After this session, listeners will know which steps to take to generate synthetic data for multi-purpose use and what its limitations are for real-world analyses.

A vast amount of private data lies dormant in public institutions, hidden from the research community. Synthesizing complex, anonymized data could give researchers access without disclosing personally identifiable information while keeping information loss minimal. The tools to do this exist, so why is it still difficult to put synthetic solutions into practice? One challenge is reaching the minimum viable quality needed to serve as many use cases as possible. Ideally, exploring the synthetic data yields the same results as exploring the real data. We will guide you through the challenges of creating synthetic data and shine a light on its lifecycle. We will explore the different quality levels of generated structured data and discuss their potential. Finally, we will link these issues to the domain of public administration, although the main insights apply to all kinds of domains. In particular, we will focus on four key questions:

1. How can we create synthetic data from private data?
2. How can synthetic data creation be integrated into institutions that sit on piles of unused, highly private data?
3. Can state-of-the-art methods for synthetic data fulfill all needs of the research community? When is access to the actual, private data needed?
4. Which quality measures are adequate for synthetic data?

As we address these questions, we'll use the Synthetic Data Vault to create and evaluate synthetic data. After the talk, listeners will understand the concept of synthetic data and will be able to evaluate synthetic data for a plethora of use cases. As a plus, they will also gain a deeper understanding of why open data access is not (yet) solved by synthetic data.
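To make the tension between statistical quality and privacy concrete, here is a minimal sketch in plain Python. It does not use the Synthetic Data Vault and is not the method shown in the talk; the toy synthesizer, the function names, and the example rows are all assumptions made up for illustration. It samples each column independently from the real data, then computes two simple checks: how closely a column's category frequencies match the real ones (total variation distance, 0 means identical) and what share of synthetic rows are exact copies of real rows.

```python
import random
from collections import Counter


def sample_independent_marginals(rows, n, seed=0):
    """Toy synthesizer (hypothetical): draw every column independently
    from its empirical distribution. Keeps per-column frequencies in
    expectation but ignores relationships between columns."""
    rng = random.Random(seed)
    columns = list(zip(*rows))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]


def total_variation_distance(real_values, synthetic_values):
    """Half the summed absolute difference between category frequencies.
    0.0 means identical distributions, 1.0 means no overlap at all."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synthetic_values))
        for c in categories
    )


def exact_copy_rate(real_rows, synthetic_rows):
    """Share of synthetic rows that reproduce a real row exactly."""
    real_set = set(real_rows)
    return sum(row in real_set for row in synthetic_rows) / len(synthetic_rows)


# Made-up example records: (age group, region, employment status).
real_rows = [
    ("18-29", "north", "employed"),
    ("30-49", "south", "employed"),
    ("30-49", "north", "unemployed"),
    ("50+", "south", "retired"),
]

synthetic_rows = sample_independent_marginals(real_rows, n=8)

real_age = [row[0] for row in real_rows]
synthetic_age = [row[0] for row in synthetic_rows]
print("TVD for age group:", total_variation_distance(real_age, synthetic_age))
print("Exact copy rate:", exact_copy_rate(real_rows, synthetic_rows))
```

A low distance on single columns says nothing about whether relationships between columns survive, and a high exact copy rate hints that real records leak into the synthetic data. Both numbers are only a starting point for the kind of quality discussion the questions above raise.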

Speakers: Antonia Scherz