scikit-learn and fairness, tools and challenges

EuroSciPy 2022

Fairness, accountability, and transparency in machine learning have become a major part of the ML discourse. Since these issues have attracted attention from the public, and certain legislation are being put in place regulating the usage of machine learning in certain domains, the industry has been catching up with the topic and a few groups have been developing toolboxes to allow practitioners incorporate fairness constraints into their pipelines and make their models more transparent and accountable. Some examples are fairlearn, AIF360, LiFT, fairness-indicators (TF), ... This talk explores some of the tools existing in this domain and discusses work being done in scikit-learn to make it easier for practitioners to adopt these tools.

On the machine learning side, scikit-learn has been one of the most commonly used libraries which has been extended by third party libraries such as imbalanced-learn and scikit-lego. However, when it comes to incorporating fairness constraints in a usual scikit-learn pipeline, there are challenges and limitations related to the API, which has made developing a scikit-learn compatible fairness focused package challenging and hampering the adoption of these tools in the industry. In this talk, we start with a common classification pipeline, then we assess fairness/bias of the data/outputs using disparate impact ratio as an example metric, and finally mitigate the unfair outputs and search for hyperparameters which give the best accuracy while satisfying fairness constraints. This workflow will expose the limitations of the API related to passing around feature names and/or sample metadata in a pipeline down to the scorers. We discuss certain workarounds and then talk about the work being done to address these issues and show how the final solution would look like. After this talk, you will be able to follow the related discussions happening in these open source communities and know where to look for them. The code and the presentation will be publicly available on github.

Speakers: Adrin Jalali