Machine learning with missing values

EuroSciPy 2022

This talk will cover how to build predictive models that handle missing values well, using scikit-learn. On the one hand, it will present the statistical considerations, covering both the classic statistical theory of missing values and recent developments in machine learning; on the other hand, it will show how to code solutions efficiently.

In many data-science applications, the data come with missing values. There is a rich statistical literature on performing analysis with missing values. However, machine learning brings new tradeoffs: how should missing values be handled at test time? Should we really care about recovering the model suitable for fully observed data? I will cover both the classic theory and recent theoretical advances. I will show how scikit-learn can be used to implement various solutions, and how these illustrate the theory.

Tentative outline:
- The classic statistical view on missing values
- Missing at Random settings: why they are important
- Imputation, and corresponding scikit-learn tools
- Prediction with missing values
- Simple predictors need very good imputers
- Rich predictors work with simple imputers, even outside Missing at Random settings

Speakers: Gaël Varoquaux