🔬 Research Summary by Bhavya Ghai, a PhD candidate in the Computer Science Department at Stony Brook University, working on the identification and mitigation of social biases in ML systems.
[Original paper by Bhavya Ghai, Mihir Mishra, Klaus Mueller]
Overview: Recent years have seen a huge surge in fairness-enhancing interventions, each of which mitigates social bias at a particular stage of the ML pipeline rather than across the pipeline as a whole. In this work, we undertake an extensive empirical study to investigate whether fairness across the ML pipeline can be enhanced by applying multiple interventions at different stages, and what the possible fallouts might be.
Introduction
Algorithmic bias can emerge from virtually any stage of the machine learning pipeline, or from several stages at once: problem formulation, dataset selection/creation, model formulation, deployment, and so on. The existing literature focuses on curbing algorithmic bias by intervening at a particular stage of the ML pipeline, so bias can still flourish via the other stages/components. An intuitive way to enhance fairness across the ML pipeline is to apply multiple fixes (interventions) at the different stages where bias can emerge. We refer to such a series of fairness-enhancing interventions as cascaded interventions. For example, one might debias the dataset, train a fairness-aware classifier over it, and then post-process the model’s predictions to achieve more fairness. This approach is in line with the real world, where different laws/policies/guidelines try to alleviate social inequality by intervening at multiple stages of life such as education, employment, promotion, etc. Examples include affirmative action in the US and caste-based reservation in India. This raises the question of whether it is possible to achieve more fairness in the ML world by intervening at multiple different stages of the ML pipeline. In this work, we perform an extensive empirical study to answer the following research questions:
R1. Effect of Cascaded Interventions on Fairness Metrics
Does intervening at multiple stages reduce bias even further? If so, does it always hold true? What is the impact on group fairness metrics and individual fairness metrics?
R2. Effect of Cascaded Interventions on Utility Metrics
How do utility metrics like accuracy and F1 score vary with different numbers of interventions? Existing literature discusses the presence of a fairness-utility tradeoff for individual interventions. Does it hold true for cascaded interventions?
R3. Impact of Cascaded Interventions on Population Groups
How are the privileged and unprivileged groups impacted by cascaded interventions in terms of F1 score, false negative rate, etc.? Are there any negative impacts on either group?
R4. How do different cascaded interventions compare on fairness and utility metrics?
Key Insights
We have used IBM’s AIF360 open-source toolkit to conduct all experiments for this paper. To execute multiple interventions in conjunction, we feed the output of one intervention as input to the next stage of the ML pipeline, as sketched below.
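To illustrate this chaining, the minimal sketch below wires three AIF360 components into one cascaded pipeline: Reweighing at the data stage, PrejudiceRemover at the modeling stage, and RejectOptionClassification at the post-modeling stage. This particular combination, its hyperparameters, and the use of the Adult dataset are our own illustrative choices rather than the exact configurations reported in the paper, and whether an in-processing method actually consumes the instance weights produced by Reweighing depends on its implementation.

```python
# Minimal sketch of a 3-stage cascaded intervention with AIF360 (illustrative choices).
# Assumes the AIF360 Adult dataset files have been downloaded locally.
from aif360.datasets import AdultDataset
from aif360.algorithms.preprocessing import Reweighing
from aif360.algorithms.inprocessing import PrejudiceRemover
from aif360.algorithms.postprocessing import RejectOptionClassification
from aif360.metrics import ClassificationMetric

privileged = [{'sex': 1}]
unprivileged = [{'sex': 0}]

train, test = AdultDataset().split([0.7], shuffle=True)

# Stage 1 (data): reweigh training instances to balance group/label combinations.
rw = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
train_transf = rw.fit_transform(train)

# Stage 2 (model): train a fairness-aware classifier on the transformed data.
pr = PrejudiceRemover(eta=25.0, sensitive_attr='sex')
pr.fit(train_transf)
test_pred = pr.predict(test)

# Stage 3 (post-model): adjust predictions to reduce a group fairness disparity.
roc = RejectOptionClassification(unprivileged_groups=unprivileged,
                                 privileged_groups=privileged,
                                 metric_name="Statistical parity difference")
roc.fit(test, test_pred)               # in practice, fit on a held-out validation split
test_debiased = roc.predict(test_pred)

# Evaluate fairness and utility of the cascaded pipeline.
metric = ClassificationMetric(test, test_debiased,
                              unprivileged_groups=unprivileged,
                              privileged_groups=privileged)
print(metric.statistical_parity_difference(), metric.accuracy())
```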
Experimental Setup
We simulated multiple 3-stage ML pipelines that are acted upon by different individual and cascaded interventions. We considered 9 different interventions: 2 operate at the data stage, 4 at the modeling stage, and 3 at the post-modeling stage. Apart from these individual interventions, we also execute different combinations of these interventions in groups of 2 and 3. For example, one might choose to intervene at any 2 stages (say, the data stage and the post-modeling stage) or at all 3 stages of the ML pipeline. To form all possible combinations of these interventions, we cycle through all available options (interventions) for a given ML stage along with a ‘No Intervention’ option and repeat this for all 3 stages. In total, we perform 9 individual interventions, 50 different combinations of interventions, and a baseline case (no intervention at any stage) for each of the 4 datasets. We used datasets such as the Adult Income dataset and the COMPAS Recidivism dataset, which have been used extensively in the fairness literature. We measure the impact of all these interventions on 9 fairness metrics and 2 utility metrics. The utility metrics are accuracy and F1 score, which measure the ability of an ML model to learn the underlying patterns from the training dataset. The fairness metrics include individual fairness metrics like consistency and the Theil index, and group fairness metrics like false positive rate difference, statistical parity difference, etc.
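To make the enumeration concrete, here is a minimal sketch of how such an intervention grid can be generated. The intervention names are placeholders standing in for whichever 2 data-stage, 4 modeling-stage, and 3 post-modeling-stage methods are used, with None marking ‘No Intervention’ at a stage; the counts (1 baseline, 9 individual, 50 cascaded, 60 total) follow directly from the setup described above.

```python
from itertools import product

# Candidate interventions per stage; None means "No Intervention" at that stage.
# The names below are placeholders, not necessarily the exact methods in the paper.
data_stage = [None, "Reweighing", "DisparateImpactRemover"]                      # 2 interventions
model_stage = [None, "PrejudiceRemover", "AdversarialDebiasing",
               "GerryFairClassifier", "ExponentiatedGradientReduction"]          # 4 interventions
post_stage = [None, "RejectOptionClassification", "EqOddsPostprocessing",
              "CalibratedEqOddsPostprocessing"]                                  # 3 interventions

pipelines = list(product(data_stage, model_stage, post_stage))  # 3 * 5 * 4 = 60 pipelines

def n_active(pipeline):
    """Number of stages where an actual intervention is applied."""
    return sum(step is not None for step in pipeline)

baseline   = [p for p in pipelines if n_active(p) == 0]   # 1 pipeline (no intervention)
individual = [p for p in pipelines if n_active(p) == 1]   # 9 individual interventions
cascaded   = [p for p in pipelines if n_active(p) >= 2]   # 50 cascaded combinations
print(len(baseline), len(individual), len(cascaded))      # -> 1 9 50
```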
Key Findings
- Applying multiple interventions results in better fairness and lower utility than individual interventions on aggregate.
- Adding more interventions does not always result in better fairness or worse utility.
- The likelihood of achieving high performance (F1 Score) along with high fairness increases with larger numbers of interventions.
- Fairness-enhancing interventions can negatively impact different population groups, especially the privileged group. In aggregate, different interventions disproportionately misclassify members of the privileged group with the unfavorable outcome and members of the unprivileged group with the favorable outcome.
- This study highlights the need for new fairness metrics that account for the impact on different population groups apart from just the disparity between groups.
- We offer a list of combinations of interventions that perform best for different fairness and utility metrics to aid the design of fair ML pipelines. See the paper for details.
Between the lines
It is important to note that all the insights and analyses presented in this work are based on empirical evidence, so they may or may not generalize to other datasets, interventions, or metrics. Moreover, the scope of this work is limited to binary classification on tabular datasets. Future work might conduct similar studies for other data types like text, images, etc., consider other problem types such as regression, clustering, etc., and include more datasets, interventions, and fairness metrics. The source code and experimental data have been made publicly available at this GitHub link for easy reproducibility and so that anyone can analyze the data in their own way. We hope the insights provided by this study will help guide future research and assist ML practitioners in designing fair ML pipelines.