🔬 Research Summary by Rishi Balakrishnan, a student at UC Berkeley passionate about algorithmic fairness, privacy, and trustworthy AI more broadly.
[Original paper by Briana Vecchione, Solon Barocas, Karen Levy]
Overview: Audit studies have a long and rich history within the social sciences. In this paper, the authors draw on that history to see how it can inform algorithmic auditing, an increasingly popular way of examining discrimination in algorithms.
Introduction
In 2018, MIT graduate student Joy Buolamwini released a study showing the disparate accuracy of facial recognition across gender and skin type: error rates reached up to 34% for darker-skinned women but only 0.8% for lighter-skinned men. The study was followed by a documentary, Coded Bias, in which Buolamwini sits in front of a facial recognition system, unrecognized; only once she dons a white, featureless mask does the system register that a face is present. It’s a powerful scene, and it represents a growing trend of algorithmic audits that examine the workings of automated systems. Although algorithmic auditing is new to computer science, audit studies have a long history within the social sciences, and in this paper the authors draw on that history, and its limitations, to offer recommendations for algorithmic auditing.
History of audits in the social sciences
The earliest audits in social science emerged from activist research in the 1940s and 1950s – civil rights were not yet statutorily protected, so researchers used audit studies to raise public awareness of discrimination. These audits were a prime example of participatory action research because they involved and consulted the people affected by biased systems: communities organized and executed their own studies, with researchers providing scaffolding and support. Participatory action research soon lost traction, however, because it was poorly incentivized for participants and labor-intensive for researchers. As a result, focus shifted to correspondence studies, in which researchers interact with institutions (like employers) through fictional aliases meant to simulate people of different demographics. A landmark example is the study of hiring discrimination by Bertrand and Mullainathan, who found that resumes with white-sounding names received roughly 50% more callbacks than otherwise identical resumes with black-sounding names. They reached this conclusion by sending fake resumes in response to hundreds of job ads, with the resumes identical except for the name signaling the applicant’s race. Studying racial differences boiled down to swapping a white-sounding name for a black-sounding one, and the study demonstrated the appealing characteristics of correspondence studies: they are scalable, methodologically rigorous, and resource-light. No actual job candidates needed to be trained or consulted.
The limitations of audits
Correspondence studies, however, have several limitations that are instructive for designing algorithmic audits. Algorithmic audits are varied and broad – some reverse engineer an algorithm, others study accuracy across different subgroups, and others still try to find instances where the algorithm outright fails. The authors focus on audits that uncover “disparate impact” (i.e., a noticeable difference in outcomes across demographic groups) because of their similarity to social science audits. Much of the time, these audits study discrete moments in time, like the Bertrand and Mullainathan study, which examined callback rates for an interview. The parts of the process not amenable to easy quantification – like the interview itself, or even who sees job ads in the first place – are not considered, and the discrimination embedded in those stages goes unstudied. For algorithmic auditing, the related risk is studying only the fairness of a model’s outputs with respect to its inputs, which misses the fact that algorithmic systems are embedded within human processes. A biased judge may disregard an algorithm’s recommendations on pre-trial bail; an unfair hiring manager can ignore candidates suggested by a “fair” hiring algorithm. Saying an algorithm is fair means little if the processes around it are still biased, and such studies often can’t capture how unfairness propagates through the different steps of a process.
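To make the disparate-impact framing concrete, here is a minimal sketch, not drawn from the paper, of what an output-only audit typically measures: the rate of favorable outcomes per demographic group and the ratio between the lowest and highest rates. The column names, the synthetic data, and the 0.8 “four-fifths” reference point are illustrative assumptions, not anything the authors prescribe.

```python
# Minimal sketch of an output-only disparate-impact check (illustrative only).
# Column names ("group", "selected") and the synthetic data are assumptions.
import pandas as pd

def selection_rates(df: pd.DataFrame, group_col: str = "group", outcome_col: str = "selected") -> pd.Series:
    """Favorable-outcome rate per demographic group."""
    return df.groupby(group_col)[outcome_col].mean()

def disparate_impact_ratio(df: pd.DataFrame, group_col: str = "group", outcome_col: str = "selected") -> float:
    """Ratio of the lowest group's selection rate to the highest group's."""
    rates = selection_rates(df, group_col, outcome_col)
    return rates.min() / rates.max()

# Hypothetical audit data: recorded model decisions for two groups.
audit = pd.DataFrame({
    "group":    ["A"] * 100 + ["B"] * 100,
    "selected": [1] * 60 + [0] * 40 + [1] * 30 + [0] * 70,
})

print(selection_rates(audit))                                   # A: 0.60, B: 0.30
print(f"disparate impact ratio: {disparate_impact_ratio(audit):.2f}")  # 0.50, below the common 0.8 rule of thumb
```

Note that such a check only sees the model’s outputs at a single decision point; the human steps before and after it, which the authors emphasize, are invisible to this kind of metric.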
The disconnect between audit studies and the affected communities also has concrete consequences. The original audit studies carried great narrative power: their whole purpose was to raise awareness of discrimination and galvanize change. An overemphasis on methodological rigor sacrifices that storytelling, and with it the ability to communicate with wider audiences. These studies also risk inaccurately estimating discrimination. A resume or a training point can be modified to appear white or black, but in the real world, attributes like race and gender can’t flip values so easily; an individual’s race is not independent of their other characteristics – it is an embodied experience that shapes much about them. Even when audits explicitly use training data from people of different demographics, that provides no guarantee of fairness when the distribution of the training data differs from the real-life distribution of users, a common occurrence given the sampling biases in dataset curation.
Where to go from here
To suggest a way forward, the researchers point to several promising audit studies that integrate the methodological rigor of correspondence studies with the power of participatory action research. One example is community audits, in which citizens volunteer their own data to support research. This is possible under Europe’s GDPR, which lets individuals request the data companies hold on them and then submit it to researchers. Another example is The Markup’s Citizen Browser, through which a panel of users shares what Facebook shows them, giving researchers a window into the platform’s targeting and amplification. A further extension is the “algorithmic bug bounty,” in which community members are incentivized to report incidents of unfairness while using a platform; Twitter ran such a bounty on its image-cropping algorithm, encouraging users to demonstrate cases where it cropped people of color out of images. Participatory approaches can call attention to previously unknown instances of discrimination, which researchers can then study rigorously. The authors also highlight the need to study how systems are used in the real world, not just under ideal conditions. Automated systems are often used with their default settings, not their recommended ones: although Amazon recommends a 99% confidence threshold for its facial recognition service, an ACLU study showed that at the default 80% threshold police departments likely use, the system incorrectly matched 28 members of Congress to mugshots.
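As a toy illustration of that threshold point (this is not Amazon’s API; the function, the gallery names, and the similarity scores below are all made up), consider how many “matches” a face-matching system returns at the default versus the recommended setting:

```python
# Toy illustration of why the confidence threshold matters in a face-matching audit.
# Scores are invented; 0.80 and 0.99 mirror the default and recommended thresholds
# discussed above.
from typing import List, Tuple

def matches_above(scores: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Return candidate identities whose similarity score clears the threshold."""
    return [name for name, score in scores if score >= threshold]

# Hypothetical scores from comparing one probe photo against a mugshot gallery.
candidate_scores = [
    ("mugshot_017", 0.97),
    ("mugshot_102", 0.88),
    ("mugshot_233", 0.81),
    ("mugshot_410", 0.64),
]

print(matches_above(candidate_scores, 0.80))  # three "matches", including weak ones
print(matches_above(candidate_scores, 0.99))  # none -- the weak matches disappear
```

At 0.80 the weak candidates count as matches; at 0.99 they do not, which is why an audit run under real-world default settings can look very different from one run under the vendor’s recommended configuration.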
Between the lines
This paper is one of many recent works advocating for deeper engagement between fairness researchers and the communities they aim to protect. It’s important to remember the origin of audit studies: not merely to uncover discrimination but to remove it. Simply quantifying some statistical metric of bias rings hollow next to that original aim. As algorithmic decision-making becomes increasingly prevalent, audit studies must take on a more important role. Their ability to combine narrative power (like Joy Buolamwini donning a white mask) with methodological rigor (like assessing facial recognition’s accuracy on people of color) is necessary to make the public realize that abstract, seemingly objective systems like machine learning can have concrete, discriminatory impacts on real people.