This tutorial will guide attendees toward sound evaluation of machine-learning models, choosing metrics and procedures that match the intended usage of the model, with code examples using the latest scikit-learn features. We will discuss how good metrics should characterize all aspects of error, e.g. on the positive and the negative class, the probability of detection given a true event, or the probability of a true event given a detection, and how they may need to cater for class imbalance. Metrics may also evaluate confidence scores, e.g. their calibration. Model-evaluation procedures should gauge not only the expected generalization performance, but also its variation.
        
Model evaluation is a crucial aspect of machine learning, whether to choose the best model or to decide if a given model is good enough for production. This tutorial will give a didactic introduction to the various statistical aspects of model evaluation: which aspects of model predictions are important to capture, and how the different metrics available in scikit-learn capture them; and how to devise a model-evaluation procedure best suited to select the best model, or to check that a model is suited for its intended usage. This tutorial goes beyond the mere application of scikit-learn, and we expect even experts to learn useful considerations.
The tutorial will be loosely based on the following preprint, https://hal.archives-ouvertes.fr/hal-03682454, but with code examples for each important concept. A tentative outline is as follows:
### Performance metrics
#### Metrics for classification
- Binary classification
    - Confusion matrix
    - Simple summaries and their pitfalls
    - Probability of detection given true class, or vice versa?
    - Summary metrics for low prevalence
    - Metrics for shifts in prevalence
    - Multi-threshold metrics
    - Confidence scores and calibration
- Multi-class classification
    - Adapting binary metrics to multi-class settings
    - Multilabel classification
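
To make the classification topics above concrete, here is a minimal sketch of the binary case; the synthetic imbalanced dataset, the logistic-regression model, and the particular metrics shown are illustrative choices rather than prescriptions of the outline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, classification_report, balanced_accuracy_score,
    roc_auc_score, average_precision_score, brier_score_loss,
)

# Synthetic data with a low-prevalence positive class (illustrative)
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]

# Full picture of the errors on the positive and negative classes
print(confusion_matrix(y_test, y_pred))
# Probability of detection given the true class (recall) and
# probability of a true event given a detection (precision)
print(classification_report(y_test, y_pred))
# Summaries that behave better under low prevalence than raw accuracy
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("average precision:", average_precision_score(y_test, y_score))
# Multi-threshold metric computed on the continuous score
print("ROC AUC:", roc_auc_score(y_test, y_score))
# Calibration-related loss on the confidence scores
print("Brier score:", brier_score_loss(y_test, y_score))
```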
#### Metrics for regression
- R2 score
- Absolute error measures
- Assessing the distribution of errors
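
Similarly, a minimal sketch of the regression metrics above, on a synthetic dataset with a ridge model as a placeholder; looking at error quantiles is just one way of assessing the distribution of errors.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import (
    r2_score, mean_absolute_error, median_absolute_error,
)

X, y = make_regression(n_samples=1_000, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = Ridge().fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("R2:", r2_score(y_test, y_pred))
print("mean absolute error:", mean_absolute_error(y_test, y_pred))
print("median absolute error:", median_absolute_error(y_test, y_pred))

# Beyond single-number summaries: look at the distribution of the errors
errors = y_test - y_pred
print("error quantiles (5%, 50%, 95%):", np.percentile(errors, [5, 50, 95]))
```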
### Evaluation strategies
#### Evaluating a learning procedure
- Cross-validation strategies
- Driving model choices: nested cross-validation
- Statistical testing
    - Sources of variance
    - Accounting for benchmarking variance
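
A sketch of the strategies above, with an SVC and a small hyper-parameter grid as placeholders: repeated cross-validation exposes the variance of the performance estimate, and nested cross-validation evaluates the full learning procedure when model choices are driven by a search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    cross_val_score, RepeatedStratifiedKFold, GridSearchCV,
)
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Repeated cross-validation: the spread of the scores reflects the
# variance of the evaluation, not only its expected value
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Nested cross-validation: the hyper-parameter search (inner loop) is
# part of the learning procedure evaluated by the outer loop
inner_model = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]})
nested_scores = cross_val_score(inner_model, X, y, cv=cv)
print(f"nested CV accuracy: {nested_scores.mean():.3f} "
      f"+/- {nested_scores.std():.3f}")
```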
#### Evaluating generalization to an external population
- The notion of external validity
- Confidence intervals for external validation
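
Finally, a sketch of a confidence interval on an external-validation score; here a held-out split stands in for the external population, and a percentile bootstrap of the test set is used as one possible (assumed) resampling scheme.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1],
                           random_state=0)
# The held-out split plays the role of data from an external population
X_train, X_ext, y_train, y_ext = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
y_score = clf.predict_proba(X_ext)[:, 1]

rng = np.random.default_rng(0)
boot_scores = []
for _ in range(1_000):
    idx = rng.integers(0, len(y_ext), len(y_ext))  # resample with replacement
    if len(np.unique(y_ext[idx])) < 2:  # skip resamples with a single class
        continue
    boot_scores.append(roc_auc_score(y_ext[idx], y_score[idx]))

low, high = np.percentile(boot_scores, [2.5, 97.5])
print(f"ROC AUC: {roc_auc_score(y_ext, y_score):.3f} "
      f"(95% bootstrap CI: {low:.3f} to {high:.3f})")
```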