🔬 Research Summary by Michael Feffer, a Societal Computing PhD student at Carnegie Mellon University.
[Original paper by Michael Feffer, Martin Hirzel, Samuel C. Hoffman, Kiran Kate, Parikshit Ram, Avraham Shinnar]
Overview: To deal with algorithmic bias at a technological level rather than a societal one, researchers have proposed a myriad of bias mitigation strategies. Unfortunately, these strategies are typically unstable across different data splits, meaning their fairness impact can differ between training and production settings. This paper analyzes whether bias mitigation can be made more stable with ensemble learning, and explores the space of mitigators and ensembles across several learning objectives.
Introduction
Given how pervasive and widely used artificial intelligence (AI) and machine learning (ML) systems have become, biased algorithms can have negative impacts on members of certain subgroups. Because model-building from data is a garbage-in-garbage-out process, fixing biased data would lead to less biased downstream models. However, this task may be ill-defined (what constitutes “good data”?) or otherwise beyond the capabilities of researchers and practitioners (technologists cannot singlehandedly eliminate societal biases).
While these issues still warrant attention, they each require a long-term solution. In the interim, ML researchers have proposed numerous bias mitigation methods. Unfortunately, these methods are unstable across dataset splits, meaning a model can appear unbiased during training time only to yield biased predictions in production (or vice versa). One common method to remedy accuracy instability is ensemble modeling (e.g. by using bagging, boosting, stacking, or voting ensembles), but it has been largely unexplored in the context of algorithmic bias mitigation. This paper and corresponding open-source code empirically evaluate ensembles when used with bias mitigators in terms of accuracy, fairness, and resource consumption. We find that the “best” combination of components and hyperparameters depends on the metric of interest (i.e. no configuration is completely dominant in all ways).
Key Insights
Defining the Search Space
Datasets
We utilize 13 datasets for our experiments. Some of these datasets are found elsewhere in the fairness literature (such as COMPAS, Adult, and credit-g) while others are novel additions to this research area (e.g. Nursery, TAE). We use binary classification with all datasets, and all have fairness implications with respect to correlations between outcomes and subgroup membership. See the paper for the full list of datasets used.
Bias Mitigators
We distinguish between three types of bias mitigators, taking the estimator to be the classifier being mitigated:
- A pre-estimator mitigator modifies the training set to remove bias before the estimator is trained.
- An in-estimator mitigator is a special type of estimator that tries to resist bias in the training set while it is being fit.
- A post-estimator mitigator alters the fitted estimator’s predictions to remove bias.
In our experiments, we consider 10 mitigators from these three categories.
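To make the three intervention points concrete, here is a minimal scikit-learn sketch of where each mitigator type acts. The functions reweigh_training_data and adjust_predictions are hypothetical placeholders (and the threshold values are illustrative); the paper’s actual mitigators come from AIF360.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 5 features, a binary protected attribute, and a biased label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
groups = rng.integers(0, 2, size=200)
y = (X[:, 0] + 0.5 * groups + rng.normal(size=200) > 0).astype(int)

def reweigh_training_data(y, groups):
    """Pre-estimator mitigation (placeholder): weight examples so every
    (group, label) cell contributes equally, in the spirit of reweighing."""
    weights = np.ones_like(y, dtype=float)
    for g in np.unique(groups):
        for label in np.unique(y):
            cell = (groups == g) & (y == label)
            weights[cell] = len(y) / (4.0 * max(cell.sum(), 1))
    return weights

def adjust_predictions(scores, groups):
    """Post-estimator mitigation (placeholder): group-specific decision
    thresholds to bring positive rates closer together."""
    thresholds = {0: 0.5, 1: 0.4}  # illustrative values only
    return np.array([int(s >= thresholds[g]) for s, g in zip(scores, groups)])

# Pre-estimator: transform (here, reweight) the training data, then fit.
pre_model = LogisticRegression().fit(X, y, sample_weight=reweigh_training_data(y, groups))

# In-estimator: the estimator itself resists bias while fitting (e.g.
# adversarial debiasing); a plain classifier merely stands in here.
in_model = LogisticRegression().fit(X, y)

# Post-estimator: fit normally, then correct the outputs.
fair_preds = adjust_predictions(in_model.predict_proba(X)[:, 1], groups)
```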
Ensembles
We consider four different ensemble learning methods that train and use collections of base estimators, the estimators that form the building blocks of an ensemble (a scikit-learn sketch follows the list):
- Bagging involves training n identical base estimators on different training data subsets.
- Boosting involves training n identical base estimators in sequence, with each subsequent estimator trained to place greater weight on the examples its predecessors misclassified.
- Voting involves training n (not necessarily identical) base estimators and using a majority vote on base estimator outputs to determine the overall output.
- Stacking is similar to voting, except that instead of taking a majority vote, a final estimator uses the base estimator outputs as input features to make the overall predictions. If the hyperparameter passthrough is True, the final estimator also receives the original features passed to the base estimators; if passthrough is False, it sees only the base estimator outputs.
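For concreteness, here is a minimal scikit-learn sketch of the four ensemble types; the base estimators and hyperparameter values are illustrative choices, not the grid explored in the paper.

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=3)
lr = LogisticRegression(max_iter=1000)

# Bagging: n copies of the same base estimator, each on a different data subset.
bagging = BaggingClassifier(tree, n_estimators=10)

# Boosting: base estimators trained in sequence, each emphasizing the
# examples its predecessors got wrong.
boosting = AdaBoostClassifier(n_estimators=10)

# Voting: possibly different base estimators combined by majority vote.
voting = VotingClassifier(estimators=[("tree", tree), ("lr", lr)], voting="hard")

# Stacking: a final estimator learns from the base estimator outputs;
# passthrough=True also feeds it the original input features.
stacking = StackingClassifier(
    estimators=[("tree", tree), ("lr", lr)],
    final_estimator=LogisticRegression(max_iter=1000),
    passthrough=True,
)
```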
Cartesian Product
Key contributions of our work include providing interoperability between scikit-learn modeling and AIF360 bias mitigation, as well as evaluating the performance of various configurations on 13 datasets. Helper methods that facilitate this interoperability and that download these datasets with default preprocessing (for reproducibility and, ideally, future adoption) can be found in our open-source Python package, Lale.
Given that we have 13 datasets, 10 mitigators, 4 ensembles, and various hyperparameters for the mitigators and ensembles, a search over this full Cartesian product is costly. Moreover, while all types of mitigation can happen at the estimator level (on a base estimator), pre-estimator and post-estimator mitigation can also happen at the ensemble level (mitigating the inputs to or outputs from an ensemble). To deal with this large space, we run experiments in two steps (sketched below):
1. Perform a grid search of all bias mitigators on all datasets while varying hyperparameters to determine the “best” pre-estimator mitigator, in-estimator mitigator, and post-estimator mitigator for each dataset, and
2. For each dataset and each type of ensemble (again with “various hyperparameters”), apply the “best” mitigator of each type of mitigation (pre-estimator, in-estimator, and post-estimator) for that dataset at both the estimator level and the ensemble level, except for in-estimator mitigation, which is only applied at the estimator level.
See the paper for more details regarding how we define “best” and “various hyperparameters”.
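The overall structure of this two-step search can be outlined roughly as follows. Everything here is an illustrative placeholder (only 3 of the 13 datasets are listed, and grid_search and evaluate are stubs), not the paper’s actual code.

```python
from itertools import product

datasets = ["compas", "adult", "credit-g"]        # 3 of the 13 for brevity
ensembles = ["bagging", "boosting", "voting", "stacking"]
mitigator_kinds = ["pre", "in", "post"]

def grid_search(dataset, kind):
    """Stand-in for step 1: pick the best mitigator of this kind."""
    return f"best-{kind}-mitigator-for-{dataset}"

def evaluate(dataset, ensemble, mitigator, level):
    """Stand-in for step 2: fit and score one configuration."""
    print(dataset, ensemble, mitigator, level)

# Step 1: per dataset, find the best pre-, in-, and post-estimator mitigator.
best = {(d, k): grid_search(d, k) for d, k in product(datasets, mitigator_kinds)}

# Step 2: combine each ensemble with those best mitigators, applied at the
# estimator level and (for pre/post mitigation only) at the ensemble level.
for d, e, k in product(datasets, ensembles, mitigator_kinds):
    for level in ["estimator", "ensemble"]:
        if k == "in" and level == "ensemble":
            continue  # in-estimator mitigation only happens inside base estimators
        evaluate(d, e, best[(d, k)], level)
```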
Results
Metrics
For each configuration in the portion of the Cartesian product that we explore, we run 5 trials of 3-fold cross-validation with folds that approximately preserve fairness (as per methods described here). For each trial-fold, we measure predictive performance, group fairness, and resource consumption in terms of the space (in megabytes) and time (in seconds) required to fit the configuration to the given data. We aggregate the results for a given configuration by computing the mean and standard deviation of each metric across the 15 trial-folds.
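As a rough sketch of this protocol (with plain StratifiedKFold standing in for the fairness-preserving splitter, and accuracy plus fit time standing in for the full set of metrics), the aggregation over 15 trial-folds looks like this:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=600, random_state=0)
model = LogisticRegression(max_iter=1000)

scores, fit_times = [], []
for trial in range(5):                                   # 5 trials ...
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=trial)
    for train_idx, test_idx in cv.split(X, y):           # ... of 3-fold CV
        start = time.perf_counter()
        model.fit(X[train_idx], y[train_idx])
        fit_times.append(time.perf_counter() - start)
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Aggregate each metric as mean +/- standard deviation over the 15 trial-folds.
print(f"accuracy:     {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
print(f"fit time (s): {np.mean(fit_times):.4f} +/- {np.std(fit_times):.4f}")
```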
To account for differences in baseline fairness and fitting difficulty between datasets, we apply min-max scaling to each metric of interest within each dataset’s results, so that performance on one dataset can be compared to performance on every other dataset.
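A minimal pandas sketch of this per-dataset min-max scaling, assuming a results table with one row per (dataset, configuration) and hypothetical metric column names:

```python
import pandas as pd

results = pd.DataFrame({
    "dataset":          ["compas", "compas", "adult", "adult"],
    "config":           ["bagging", "stacking", "bagging", "stacking"],
    "accuracy":         [0.67, 0.70, 0.84, 0.86],
    "disparate_impact": [0.75, 0.82, 0.60, 0.71],
})

def min_max_per_dataset(df, metric_cols):
    """Rescale each metric to [0, 1] within each dataset so that scores
    are comparable across datasets with different baselines."""
    def scale(col):
        lo, hi = col.min(), col.max()
        return (col - lo) / (hi - lo) if hi > lo else col * 0.0
    out = df.copy()
    out[metric_cols] = df.groupby("dataset")[metric_cols].transform(scale)
    return out

scaled = min_max_per_dataset(results, ["accuracy", "disparate_impact"])
print(scaled)
```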
Guidance
We isolated a few overall trends from our exploration. First, we find that using ensembles without mitigation does not improve fairness on average, but it does improve fairness stability. Conversely, using ensembles with mitigation typically decreases predictive performance, but it increases fairness stability and average fairness. Lastly, estimator-level mitigation produces better fairness results than ensemble-level mitigation at the cost of more time and memory used for fitting.
Beyond these trends, what constitutes “best” largely depends on the dataset and metric of interest. To that end, we constructed a guidance diagram to suggest to practitioners which approach(es) they should attempt based on their goals and setup.
Between the lines
To our knowledge, this is the first extensive piece of research that explores combining scikit-learn ensemble models with bias mitigators from AIF360. We also perform these experiments with 13 datasets, several of which are seldom seen elsewhere in the fairness literature, and we have distilled best practices and released our key components as an open-source package to encourage others to adopt and reproduce our work.
However, there are more than a few aspects of our research that warrant further analysis. For instance, our datasets are relatively small, tabular, and limited to binary classification problems. Existing research (such as this work) has highlighted that the algorithmic fairness research community needs to grapple with bias in problems beyond tabular binary classification, such as image recognition or text translation. Additionally, our guidance diagram would benefit from more interactivity and generalizability. If we could develop more responsive guidance, it might be of greater benefit to practitioners.