This tutorial will guide you toward sound evaluation of machine-learning models, choosing metrics and procedures that match the intended usage, with code examples using the latest scikit-learn features. We will discuss how good metrics should characterize all aspects of error, for instance errors on the positive and the negative class, or the probability of detection given a true event versus the probability of a true event given a detection, and how they may need to cater for class imbalance. Metrics may also evaluate confidence scores, e.g. their calibration. Model-evaluation procedures should gauge not only the expected generalization performance, but also its variation.
Model evaluation is a crucial aspect of machine learning, whether to choose the best model or to decide if a given model is good enough for production. This tutorial gives a didactic introduction to the various statistical aspects of model evaluation: which aspects of model prediction are important to capture, how the different metrics available in scikit-learn capture them, and how to devise a model-evaluation procedure that is best suited to select the best model or to check that a model is suited for its intended usage. The tutorial goes beyond the mere application of scikit-learn, and we expect even experts to learn useful considerations.
The tutorial will be loosely based on the following preprint, https://hal.archives-ouvertes.fr/hal-03682454, but with code examples for each important concept. A tentative outline is as follows:
### Performance metrics
#### Metrics for classification
- Binary classification
  - Confusion matrix
  - Simple summaries and their pitfalls
  - Probability of detection given true class, or vice versa?
  - Summary metrics for low prevalence
  - Metrics for shifts in prevalence
  - Multi-threshold metrics
  - Confidence scores and calibration
- Multi-class classification
  - Adapting binary metrics to multi-class settings
- Multilabel classification
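
As an illustration of the classification metrics above, here is a minimal sketch that computes them with scikit-learn on a synthetic, imbalanced binary problem; the dataset, the logistic-regression model, and the 10% prevalence are placeholders chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    balanced_accuracy_score,
    precision_score,   # probability of a true event given a detection
    recall_score,      # probability of detection given a true event
    roc_auc_score,     # multi-threshold metric
    brier_score_loss,  # evaluates calibration of confidence scores
)

# Imbalanced binary problem, chosen only for illustration (10% positives)
X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))
print("recall (sensitivity):", recall_score(y_test, y_pred))
print("precision (PPV):     ", precision_score(y_test, y_pred))
print("balanced accuracy:   ", balanced_accuracy_score(y_test, y_pred))
print("ROC AUC:             ", roc_auc_score(y_test, y_score))
print("Brier score:         ", brier_score_loss(y_test, y_score))
```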
#### Metrics for regression
- R2 score
- Absolute error measures
- Assessing the distribution of errors
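
A minimal sketch of the regression metrics above; the synthetic dataset and the ridge model are placeholders, and the last lines look beyond single summaries at the distribution of errors.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score, mean_absolute_error, median_absolute_error

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=2_000, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

y_pred = Ridge().fit(X_train, y_train).predict(X_test)

print("R2:                   ", r2_score(y_test, y_pred))
print("mean absolute error:  ", mean_absolute_error(y_test, y_pred))
print("median absolute error:", median_absolute_error(y_test, y_pred))

# Beyond single summaries: inspect the distribution of errors
errors = y_test - y_pred
print("error quantiles (5%, 50%, 95%):", np.percentile(errors, [5, 50, 95]))
```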
### Evaluation strategies
#### Evaluating a learning procedure
- Cross-validation strategies
- Driving model choices: nested cross-validation
- Statistical testing
- Sources of variance
- Accounting for benchmarking variance
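
The sketch below illustrates nested cross-validation and the variation of scores across splits; the SVC estimator, its parameter grid, and the number of splits are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1_000, random_state=0)

# Outer cross-validation estimates generalization performance and its
# variation across splits; many splits give a better view of the variance.
outer_cv = ShuffleSplit(n_splits=20, test_size=0.25, random_state=0)

# Inner cross-validation (inside GridSearchCV) drives the model choice;
# nesting it in the outer loop avoids an optimistic bias.
inner_model = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)

scores = cross_val_score(inner_model, X, y, cv=outer_cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```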
#### Evaluating generalization to an external population
- The notion of external validity
- Confidence intervals for external validation
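
One common way to attach a confidence interval to performance measured on an external validation set is a percentile bootstrap over that set; the sketch below assumes this approach, and the `bootstrap_ci` helper and the toy labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_ci(y_true, y_pred, metric=accuracy_score,
                 n_bootstrap=1_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric computed
    on a fixed external validation set (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_bootstrap):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    low, high = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high

# Hypothetical predictions on an external cohort
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0])
print("95% CI for accuracy:", bootstrap_ci(y_true, y_pred))
```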