🔬 Research Summary by Angelina Wang, a PhD student in computer science at Princeton University studying issues of machine learning fairness and algorithmic bias.
[Original paper by Angelina Wang, Vikram V. Ramaswamy, Olga Russakovsky]
Overview: We consider the complexities of incorporating intersectionality into machine learning, which involves much more than simply including more axes of identity. We examine the next steps: deciding which identities to include, handling the increasingly small number of individuals in each group, and performing fairness evaluation over a large number of groups.
Introduction
Machine learning has typically considered demographic attributes along a single axis and as binary. However, gender is not just male and female; race is not just Black and white; and many axes of identity interact to produce unique forms of discrimination. Thus, extending machine learning to intersectionality is not as easy as simply incorporating more axes of identity. In this work, we investigate three considerations that arise along the machine learning pipeline and offer substantive steps forward:
- Data collection: which identities to include
- Model training: how to handle the increasingly small number of individuals in each group
- Evaluation: how to perform fairness evaluation on a large number of groups
Key Insights
In our work, we focus on the algorithmic effects of discrimination against demographic subgroups (rather than individuals). Specifically, we conduct empirical studies of five fairness algorithms across a suite of five tabular datasets derived from the US Census, with target variables such as income and travel time to work. We do so within the canonical machine learning fairness setting: supervised binary classification of a socially important target label, balancing accuracy against one mathematical notion of fairness across a finite set of discretely defined demographic groups. These groups may result from a conjunction of identities.
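To make this setup concrete, here is a minimal sketch (not code from the paper) of how intersectional groups arise as conjunctions of identity axes in tabular, census-style data; the DataFrame and column names are assumptions for illustration.

```python
# Minimal sketch: binary classification on tabular, census-style data,
# with demographic groups defined by the conjunction of identity axes.
# The DataFrame and column names are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "race":   ["Black", "white", "Black", "white", "Asian"],
    "gender": ["female", "male", "male", "female", "female"],
    "income_above_50k": [0, 1, 1, 0, 1],   # binary target of social importance
})

# Each intersectional group is a conjunction of identities, e.g. "Black female".
df["group"] = df["race"] + " " + df["gender"]

# The number of groups grows multiplicatively with the axes included,
# and the per-group counts shrink accordingly.
print(df["group"].value_counts())
```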
Selecting which Identities to Include
As a first step, we must consider which identities are relevant to include for the task. If we include too many axes of identity, or too many identities along each axis, the problem risks becoming computationally intractable. But if we include too few, we miss critical intersectional differences in how different groups are treated. In our work, we find that a priori knowledge of which identities to include may be insufficient, and we recommend combining domain knowledge with empirical validation. Domain knowledge about which identities are relevant to a particular downstream task is essential, but it alone cannot tell us whether, for example, two groups share similar input distributions in a given domain and would empirically benefit from being treated as a single group rather than separated.
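One possible form of empirical validation, sketched below under assumptions not drawn from the paper, is a classifier two-sample test: if a model cannot distinguish the input features of two candidate groups, that is one signal that the groups might reasonably be treated as a single group for this task.

```python
# Sketch of a classifier two-sample test as one heuristic for checking whether
# two candidate groups have similar input distributions. Illustrative only;
# this is not the validation procedure from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def groups_look_similar(X_a, X_b, auc_threshold=0.55, seed=0):
    """Train a classifier to tell group A's features from group B's.
    An AUC near 0.5 means the classifier cannot separate the groups,
    suggesting similar input distributions (one signal in favor of merging)."""
    X = np.vstack([X_a, X_b])
    y = np.concatenate([np.zeros(len(X_a)), np.ones(len(X_b))])
    clf = LogisticRegression(max_iter=1000, random_state=seed)
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    return auc < auc_threshold, auc
```

This heuristic only speaks to input distributions; domain knowledge is still needed to judge whether merging two groups is normatively appropriate.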
Handling Progressively Smaller Groups
In model training, we need to consider how to train a model when each group is likely to contain very few individuals as a result of the added axes of identity. We first warn that normative implications may disqualify certain existing machine learning techniques – for example, generating synthetic facial images carries harmful historical parallels that raise concerns. We instead suggest a potential alternative: leveraging structure in the data, such as the fact that people in different groups who share an identity may have more features in common than those who share none.
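As one hypothetical illustration of what leveraging such structure could look like (a made-up weighting scheme, not the paper's method), a tiny subgroup could borrow statistical strength from real examples in groups that share an identity axis with it, rather than from synthetic individuals. The column names and weight values below are assumptions.

```python
# Sketch: up-weight real examples from groups that share an identity axis with a
# small target subgroup. Illustrative assumption, not the paper's method.
import pandas as pd

def shared_identity_weights(df, target=("Black", "female"),
                            race_col="race", gender_col="gender",
                            full_weight=1.0, partial_weight=0.3):
    """Per-row sample weights: full weight for the target subgroup,
    partial weight for rows sharing exactly one identity axis, zero otherwise."""
    race, gender = target
    shares_race = (df[race_col] == race)
    shares_gender = (df[gender_col] == gender)
    weights = pd.Series(0.0, index=df.index)
    weights[shares_race ^ shares_gender] = partial_weight   # shares one axis
    weights[shares_race & shares_gender] = full_weight      # the subgroup itself
    return weights
```

The resulting weights could be passed as `sample_weight` to a standard classifier when fitting a model intended to serve the small subgroup.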
Evaluating a Large Number of Groups
For model evaluation, we need to consider how fairness can be evaluated across a large number of groups, since most existing fairness metrics are designed for only two groups and measure the difference between them on some performance metric. When there is a larger number of groups, researchers frequently use variations of these pairwise comparisons, such as the difference between the groups with the highest and lowest true positive rates. However, all of these metrics are unchanged if you, for example, swap the labels of the subgroups labeled Black female and white male. This is surprising in the context of intersectionality, where the historical ways different groups have been treated are vitally important. Thus, we suggest both using more deliberate, contextual pairwise comparisons and incorporating additional evaluation metrics that better capture existing disparities in the data. For example, we propose a metric that measures the correlation between the ranking of groups by their rate of positive labels in the dataset and the ranking of groups by the true positive rate the model achieves for them.
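A minimal sketch of this kind of rank-correlation metric follows, assuming NumPy arrays of binary labels, binary predictions, and group identifiers; the exact formulation in the paper may differ.

```python
# Sketch of a rank-correlation style metric: compare how groups rank by their
# rate of positive labels in the data versus how they rank by the model's
# per-group true positive rate. Array names are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr

def group_rank_correlation(y_true, y_pred, groups):
    """Spearman correlation between per-group positive-label rates and
    per-group true positive rates."""
    base_rates, tprs = [], []
    for g in np.unique(groups):
        mask = (groups == g)
        yt, yp = y_true[mask], y_pred[mask]
        base_rates.append(yt.mean())                          # P(Y=1 | group g)
        pos = (yt == 1)
        tprs.append(yp[pos].mean() if pos.any() else np.nan)  # TPR for group g
    base_rates, tprs = np.array(base_rates), np.array(tprs)
    keep = ~np.isnan(tprs)                                    # drop groups with no positives
    corr, _ = spearmanr(base_rates[keep], tprs[keep])
    return corr
```

Unlike a max-minus-min gap, a rank-based measure of this kind is sensitive to which groups sit where in the ordering, not only to the size of the largest gap.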
Between the lines
Machine learning fairness has begun to grapple with the complexities of intersectional identities, but there is still a long way to go. It's essential to consider the normative implications that technical decisions may carry. For example, extrapolating existing evaluation metrics to work for more than two groups may seem sensible once the technical details are worked out, but additional normative implications come into play when the data describes individuals from different demographic groups. In other words, beyond technically adapting an algorithm or metric to accommodate a larger number of groups, we need to recognize that, for example, historical injustices matter during evaluation, and there are ways we can try to encode this into the metric. In this work, we take a step toward this kind of thinking by bridging existing social science work on intersectionality into machine learning.