Evaluating your machine learning models: beyond the basics
EuroSciPy 2022

This tutorial will guide you towards good evaluation of machine-learning models, choosing metrics and procedures that match the intended usage, with code examples using the latest scikit-learn features. We will discuss how good metrics should characterize all aspects of error, e.g. on the positive and the negative class; the probability of a detection, or the probability of a true event given a detection; and how they may need to cater for class imbalance. Metrics may also evaluate confidence scores, e.g. their calibration. Model-evaluation procedures should gauge not only the expected generalization performance, but also its variations.
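
For instance, a single accuracy number can look excellent on an imbalanced problem while errors on the minority class remain large. The minimal sketch below is not taken from the tutorial materials; the synthetic dataset, the logistic-regression classifier, and the 95/5 class split are illustrative assumptions. It contrasts accuracy with balanced accuracy, recall (probability of detection given the true class) and precision (probability of a true event given a detection), all derived from the confusion matrix.

```python
# Minimal sketch (illustrative assumptions, not the tutorial's code):
# contrasting summary metrics on an imbalanced binary problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score,
    precision_score, recall_score, confusion_matrix,
)

# Imbalanced binary problem: roughly 5% positives (an assumed split)
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("accuracy          ", accuracy_score(y_test, y_pred))
print("balanced accuracy ", balanced_accuracy_score(y_test, y_pred))
# Recall: probability of a detection given the true (positive) class
print("recall            ", recall_score(y_test, y_pred))
# Precision: probability of a true event given a detection
print("precision         ", precision_score(y_test, y_pred))
```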

Model evaluation is a crucial aspect of machine learning, whether to choose the best model or to decide if a given model is good enough for production. This tutorial gives a didactic introduction to the statistical aspects of model evaluation: which aspects of model prediction are important to capture, how the different metrics available in scikit-learn capture them, and how to devise a model-evaluation procedure that is best suited to select the best model or to check that a model is fit for its intended usage. The tutorial goes beyond the mere application of scikit-learn, and we expect even experts to learn useful considerations. It will be loosely based on the following preprint, https://hal.archives-ouvertes.fr/hal-03682454, but with code examples for each important concept.

A tentative outline is as follows:

### Performance metrics

#### Metrics for classification

- Binary classification
  - Confusion matrix
  - Simple summaries and their pitfalls
  - Probability of detection given true class, or vice versa?
  - Summary metrics for low prevalence
  - Metrics for shifts in prevalence
  - Multi-threshold metrics
  - Confidence scores and calibration
- Multi-class classification
  - Adapting binary metrics to multi-class settings
- Multilabel classification

#### Metrics for regression

- R2 score
- Absolute error measures
- Assessing the distribution of errors

### Evaluation strategies

#### Evaluating a learning procedure

- Cross-validation strategies
- Driving model choices: nested cross-validation (a sketch follows the outline)
- Statistical testing
- Sources of variance
- Accounting for benchmarking variance

#### Evaluating generalization to an external population

- The notion of external validity
- Confidence intervals for external validation
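As a taste of the evaluation-strategies part, here is a minimal sketch of nested cross-validation; it is an assumption of what such an example could look like, not the tutorial's actual code (the breast-cancer dataset, the scaled SVC pipeline, and the grid over C are illustrative choices). The inner loop drives hyper-parameter selection, while the outer loop estimates generalization performance and its variation across folds.

```python
# Minimal sketch (illustrative assumptions, not the tutorial's code):
# nested cross-validation separating model choice from performance estimation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyper-parameter selection over the regularization parameter C
model = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10]},
    cv=inner_cv,
)

# Outer loop: generalization estimate; report the spread, not a single number
scores = cross_val_score(model, X, y, cv=outer_cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread of the outer-fold scores alongside their mean gives a first handle on the variance of the benchmark, which the evaluation-strategies part of the outline discusses in more depth.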

Speakers: Gaël Varoquaux, Arturo Amor