🔬 Research Summary by Rishi Balakrishnan, a student at UC Berkeley passionate about algorithmic fairness, privacy, and trustworthy AI more broadly.
[Original paper by Michelle Bao, Angela Zhou, Samantha Zottola, Brian Brubach, Sarah Desmarais, Aaron Horowitz, Kristian Lum, Suresh Venkatasubramanian]
Overview: Criminal justice (CJ) data is not neutral or objective: it emerges out of a messy process of noisy measurements, individual judgements, and location-dependent context. However, by ignoring the context around risk assessment instrument (RAI) datasets, computer science researchers risk both reinforcing upstream value judgements about what the data should say and neglecting the downstream effects of their models on the justice system. The authors argue that responsibly and meaningfully engaging with this data requires computer scientists to explicitly consider the context of and values within these datasets.
Introduction
The issue of fairness in algorithms was thrust into the public spotlight in 2016, when ProPublica published an exposé on Northpointe’s COMPAS tool used to predict recidivism. ProPublica’s article claimed that Northpointe’s tool was racially biased in that black defendants consistently received higher risk scores than white defendants, even when controlling for factors such as prior crimes, age, and gender. Since then, the field of algorithmic fairness has boomed, with more and more research devoted to the question of how to achieve fair outcomes with respect to some sensitive attribute (such as race). However, the field of algorithmic fairness and machine learning often looks at datasets such as the COMPAS dataset without considering the surrounding context, a practice that risks misinterpreting and misusing data. In this paper, the authors first survey the issues surrounding risk assessment instrument (RAI) datasets such as COMPAS and the disconnect between the algorithmic fairness literature and real-life fairness concerns. They then provide suggestions for CS researchers to responsibly engage with the data they use.
Data biases within RAI datasets
The authors first take a deep dive into RAI datasets, describing several ways in which bias can enter the data. They specifically look at pre-trial RAI datasets, whose purpose is to inform pre-trial detention decisions. First, the target variable is a proxy. The legal system usually only commits to pre-trial detention if the defendant is likely to flee or commit a violent crime, which is hard to measure directly, so datasets use failure to appear (FTA) as a stand-in even though FTA is hardly equivalent to fleeing from the justice system. Defendants often do not appear in court because of scheduling conflicts, work, childcare, and similar obligations, meaning that detention decisions based on predicted FTA probabilities are likely both too harsh and biased against low-income defendants. Second, the sensitive attribute (in most cases race), which is integral to existing work on algorithmic fairness, is itself noisy. For example, the authors cite a study finding that officer-reported racial designations were inconsistent for every racial group except “black” [1]. Third, the input data is itself biased: while datasets purport to measure crime, the most they can realistically capture is arrests, which skew along racial lines due to the over-policing of minority communities. The data processing done to create the dataset also hides value-laden questions, such as whether prior arrests for crimes that have since been decriminalized (such as marijuana possession) should be included in the data at all. The covariates, sensitive attribute, and target variable all carry significant levels of noise before a machine learning researcher ever touches the dataset.
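To make the second point concrete, here is a minimal sketch (not from the paper) of how noise in the recorded sensitive attribute can distort a measured disparity. The detention rates and the 10% mislabeling rate are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical "true" group membership and decisions with a real disparity.
group = rng.integers(0, 2, size=n)              # two groups, 0 and 1
detain_rate = np.where(group == 1, 0.40, 0.25)  # assumed detention rates
decision = rng.random(n) < detain_rate

# Officer-recorded group label: randomly flip 10% of group-0 records,
# mimicking inconsistent racial designations in administrative data.
recorded = group.copy()
flip = (group == 0) & (rng.random(n) < 0.10)
recorded[flip] = 1

def rate_gap(labels):
    """Difference in detention rates between recorded groups 1 and 0."""
    return decision[labels == 1].mean() - decision[labels == 0].mean()

print(f"gap measured with true labels:     {rate_gap(group):.3f}")
print(f"gap measured with recorded labels: {rate_gap(recorded):.3f}")
```

Even this mild, random mislabeling attenuates the measured gap; real recording errors need not be random and could distort an audit in either direction.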
Issues with algorithmic fairness in machine learning
But the problem doesn’t end there. Underlying the creation of many RAI models is the assumption that they can be “plugged in” to the relevant part of the CJ pipeline. Reality is much messier. The criminal justice system contains several individual points of discretion, one of the biggest being judicial. Judges can (and do) choose to ignore RAI recommendations, and the authors cite studies demonstrating that judges deviate from recommendations more often when the defendant is black than when the defendant is white [2]. Different jurisdictions also interpret risk scores differently: in some districts, a 40% chance of not appearing in court is considered “high-risk”, whereas in others the threshold is as low as 10% [3]. Thus, claims about “reforming” or “improving” the criminal justice system that rest solely on fairer model performance should be viewed skeptically. Such claims also implicitly treat the current justice system as worthy of reform, which can be a reasonable position, but one that papers hardly ever explicitly defend.
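The jurisdictional variation in thresholds can be made concrete with a short simulation (not from the paper): the same set of predicted risk scores yields very different shares of defendants flagged as “high-risk” depending on where the cutoff sits. The score distribution below is assumed, while the 0.10 and 0.40 cutoffs echo the range reported in [3].

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed distribution of predicted failure-to-appear (FTA) risk scores.
scores = rng.beta(2, 8, size=50_000)

# "High-risk" cutoffs at 10% and 40%, echoing the range reported in [3].
for cutoff in (0.10, 0.40):
    flagged = (scores >= cutoff).mean()
    print(f"cutoff {cutoff:.2f}: {flagged:.1%} of defendants flagged as high-risk")
```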
When incorporated into machine learning algorithms, fairness concerns are often just constraints on the underlying objective of reducing crime. This makes sense on its face. However, few datasets look at the long-term behavior of individuals, and the most efficient short-term way to reduce crime is simply to incarcerate, because detained individuals cannot commit crime.
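As a schematic (a generic formulation, not one taken from the paper), fairness-constrained learning is typically posed as minimizing a loss on the recorded outcome Y, itself a proxy such as FTA or re-arrest, subject to a bound on some disparity across groups defined by the recorded attribute A:

```latex
\min_{f} \; \mathbb{E}\big[\ell\big(f(X),\, Y\big)\big]
\quad \text{subject to} \quad
\Big|\Pr\big(f(X)=1 \mid A=a\big) - \Pr\big(f(X)=1 \mid A=b\big)\Big| \le \epsilon
```

Every term in this formulation (the features X, the proxy label Y, and the recorded attribute A) inherits the measurement problems described above, so satisfying the constraint on paper need not translate into fairer outcomes in practice.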
The work on algorithmic fairness is also situated within the larger machine learning community, but there is a mismatch between machine learning practices and the care that CJ data requires. Machine learning papers focus not on gleaning new insights from data but rather on new methods. Machine learning conferences also demand quick turnaround, with authors often asked to implement their method on a new dataset during a one-week rebuttal period. While suited to pure machine learning tasks, these practices inevitably decontextualize and subordinate the dataset itself. Collaborating with domain experts is much harder in a culture that requires quick turnaround, and the practice of benchmarking often ensures that ethical questions surrounding a dataset go unanswered. Once a seminal paper uses a dataset, that dataset quickly becomes the norm: other papers use it as a point of comparison for their method’s effectiveness, and omitting it risks rejection from a conference. As a result, flawed assumptions about a dataset quickly become baked into the literature.
Where do we go from here?
Machine learning is not the first field to work with RAI datasets. Psychology and criminology, among others, have over time found responsible ways to engage with CJ data. To lay out the path forward, the authors offer several suggestions. First, they suggest not using CJ datasets as generic real-world examples on which to test new algorithmic fairness methods. This also means avoiding broad conclusions about the CJ system drawn solely from these datasets: each dataset emerges from a rich, complex context that shapes any insights gleaned from it.
To make this context more explicit, the authors advocate for writing datasheets and model cards that lay out the context and limitations of both the datasets and the models built on them. This also includes identifying the assumptions underlying algorithmic fairness measures and studying how models behave when those assumptions are violated; machine learning already examines similar questions under the banner of robustness, which can help here. The authors also question the use of standard metrics like accuracy and AUC when working with these datasets, since such aggregate metrics neglect disparities in performance across individuals and groups. They close with a call for future work on making benchmarks more explicit about their ethical assumptions, and ask whether benchmarks for CJ data can exist at all.
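For instance, an aggregate metric can look healthy while hiding a large gap between groups. The sketch below uses simulated data (not from the paper) to compare an overall AUC against per-group AUCs, assuming scikit-learn is available.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 20_000

group = rng.integers(0, 2, size=n)   # two groups, 0 and 1
y = rng.integers(0, 2, size=n)       # binary outcome

# Scores are informative for group 0 but nearly uninformative for group 1.
noise_sd = np.where(group == 0, 0.5, 2.5)
scores = y + rng.normal(0.0, noise_sd, size=n)

print(f"overall AUC: {roc_auc_score(y, scores):.3f}")
for g in (0, 1):
    mask = group == g
    print(f"group {g} AUC: {roc_auc_score(y[mask], scores[mask]):.3f}")
```

The pooled AUC sits between the two group-level values, so reporting it alone would obscure the fact that the model is close to uninformative for one group.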
Between the lines
This paper situates two broader critiques of machine learning within the domain of CJ data: 1) Datasets are not objective and context-free. Real life is messy, and a set of input features, labels, and sensitive attributes simply cannot capture all of this complexity; choices about what to include in the data reduce that complexity in a way that inherently makes value judgements. 2) An obsessive focus on benchmarks and state-of-the-art performance leads to a disconnect from the real-world problems that gave rise to the datasets in the first place. With the hope that machine learning systems will be deployed in settings like the criminal justice system comes a responsibility to study the real-world effects of those algorithms. Machine learning researchers can no longer afford to be disconnected from the world they influence.
References
[1] Kristian Lum, Chesa Boudin, and Megan Price. The impact of overbooking on a pre-trial risk assessment tool. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* ’20, pages 482–491, New York, NY, USA, January 2020. Association for Computing Machinery. 

[2] Alex Albright. If you give a judge a risk score: Evidence from Kentucky bail decisions. The John M. Olin Center for Law, Economics, and Business Fellows’ Discussion Paper Series 85, 2019.

[3] Sandra G. Mayson. Dangerous defendants. Yale Law Journal, 127:490, 2017.