
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning

December 6, 2022

šŸ”¬ Research Summary by Siobhan Mackenzie Hall, PhD student in the Oxford Neural Interfacing group at the University of Oxford. Siobhan is also a member of the Oxford Artificial Intelligence Society, along with the authors who contributed to the development of the work described below.

[Original paper by Hugo Berg, Siobhan Mackenzie Hall, Yash Bhalgat, Wonsuk Yang, Hannah Rose Kirk, Aleksandar Shtedritski, Max Bain]


Overview: Large-scale vision-language models are becoming more pervasive in society. This is concerning given the evidence of societal biases manifesting in these models and the potential for these biases to become ingrained in society’s perceptions for good, hindering the natural process by which norms are challenged and overturned. In this study, we developed cheap yet effective computational methods for debiasing these out-of-the-box models while maintaining performance. This shows promise for empowering individuals to combat these injustices.


Introduction

Our new study, conducted under the University of Oxford’s Artificial Intelligence Society, has revealed societal biases in AI models designed to match text captions to images, like those used in search engines. These ā€˜vision-language’ AI models, like OpenAI’s CLIP, are trained on hundreds of millions of captioned images from the internet. Training models at this scale can take several months, cost upwards of $1M, and incur a hefty carbon footprint. The process of uploading images to the internet and choosing words to caption them does not happen in a societal vacuum. If these large-scale, internet-scraped datasets embed societal biases or stereotypical patterns, then negative portrayals can be ā€˜frozen in time’ in the data. AI models built from that data can then reflect, entrench, or amplify unjust representations of marginalized groups in society. This has been clearly demonstrated, for example, in OpenAI’s own assessment of CLIP’s ingrained biases. This entrenchment is likely to become more evident and pervasive with the release of generative models such as Stable Diffusion, which leans heavily on CLIP for dataset filtering and certain model components.

This study confirms that biases exist in large multimodal AI models trained on web data, especially along the axes of gender and race, and presents a cheap, efficient method that developers can employ at home to debias out-of-the-box pretrained vision-language models.
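
To make the retrieval setting concrete, the sketch below shows how a CLIP-style model scores a set of images against a text query using cosine similarity between embeddings. It is purely illustrative and not part of the study’s pipeline; it assumes OpenAI’s open-source `clip` package, PyTorch, and a handful of placeholder image files.

```python
# Illustrative CLIP-style text-to-image retrieval (not the study's code).
# Assumes the `clip` package (https://github.com/openai/CLIP), PyTorch, and PIL.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical image paths standing in for a face-image retrieval set.
image_paths = ["face_001.jpg", "face_002.jpg", "face_003.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
text = clip.tokenize(["a photo of a doctor"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)   # (N, d)
    text_features = model.encode_text(text)       # (1, d)

# Cosine similarity ranks the images for the query; the top-K results are
# what a search engine would surface, and where demographic skew shows up.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).squeeze(1)
ranking = similarity.argsort(descending=True)
print([image_paths[i] for i in ranking])
```

The top-ranked images for a query like ā€œa photo of a doctorā€ are exactly the search-style outputs in which the study looks for demographic skew.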

Key Insights

[This study has been peer-reviewed and will be published at The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP)]

Findings and the implications thereof 

As discussed above, these societal biases are becoming entrenched, limiting the potential for societal norms to be challenged and changed. These lasting implications should be cause for alarm, and they become even more concerning when considering seemingly benign applications such as search engine queries. They are best explained in terms of representational and allocational harms. Representational harms refer to the over-representation of, for example, one gender when querying for a profession (e.g., ā€œnurseā€ versus ā€œdoctorā€) or one ethnicity in explicit and NSFW content (Birhane et al., 2021). These shape our perceptions of, and personal biases about, people in different contexts, which can lead to unfair treatment, manifesting as allocational harm. Allocational harms arise when an individual’s or group’s access to resources and opportunities is differentially impacted (Weidinger et al., 2021); for instance, if the ordering of images in search results shifts recruiters’ perceptions about the real-world suitability of different people for different jobs.
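
As a rough illustration of how such representational skew in retrieval can be quantified, the sketch below computes a simple MaxSkew-style statistic over the top-K results: the largest (log-)deviation of any group’s observed share from a desired share. The labels, desired proportions, and exact metric definition are illustrative assumptions, not necessarily the evaluation protocol used in the paper.

```python
# Illustrative skew-style bias metric over top-K retrieval results.
# Group labels and desired proportions are hypothetical placeholders.
import math
from collections import Counter

def max_skew_at_k(retrieved_labels, desired_proportions, k):
    """Largest absolute log-ratio between the observed and desired share
    of any demographic group among the top-k retrieved items."""
    top_k = retrieved_labels[:k]
    counts = Counter(top_k)
    skews = []
    for group, p_desired in desired_proportions.items():
        p_observed = counts.get(group, 0) / k
        # Small epsilon avoids log(0) when a group is entirely absent.
        skews.append(abs(math.log((p_observed + 1e-6) / (p_desired + 1e-6))))
    return max(skews)

# Example: gender labels of the top-10 images returned for "a photo of a doctor".
labels = ["male"] * 8 + ["female"] * 2
print(max_skew_at_k(labels, {"male": 0.5, "female": 0.5}, k=10))  # ~0.92
```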

How are these biases addressed without impacting the V+L model’s performance?

To tackle these harmful biases and preemptively counteract their harms, we propose a cheap yet effective method of debiasing off-the-shelf pretrained models (such as CLIP). In other words, we manipulate how the model links the faces of different people with captions to ensure a more equal variation in the returned images. This is achieved through a two-pronged approach of adversarial debiasing with joint training: a second model is introduced into the process whose goal is to prevent the first model from using sensitive demographic features when matching images to captions. By simultaneously training the model for demographic fairness and for its original task of image-caption matching (joint training), our proposed solution reduces bias and increases diversity in model outputs while maintaining performance. The contribution of this method lies in its ability to debias trained models post hoc. This is crucial, as few machine learning practitioners have access to the data and resources needed to recreate such models from scratch. Our methods ensure you don’t need a million dollars to make a difference.
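
The sketch below is a minimal, hypothetical version of this setup in PyTorch. Random tensors stand in for frozen CLIP embeddings, a learnable offset on the text embedding stands in (in heavily simplified form) for the paper’s learnable prompt array, and a small classifier plays the adversary that tries to recover a sensitive attribute from image-text similarity scores. Shapes, loss weights, and training details are illustrative assumptions; the authors’ actual implementation is available in their GitHub repository.

```python
# Minimal, hypothetical sketch of adversarial debiasing with joint training.
# Random tensors stand in for frozen CLIP image/text embeddings; only the
# learnable prompt offset and the adversary are trained. Shapes, loss
# weights, and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, batch = 512, 32                     # CLIP ViT-B/32 uses 512-dim embeddings

# Stand-ins for frozen encoder outputs: paired image/text embeddings plus a
# binary sensitive attribute (e.g. perceived gender) for each image.
image_emb = F.normalize(torch.randn(batch, d), dim=-1)
text_emb = F.normalize(torch.randn(batch, d), dim=-1)
attribute = torch.randint(0, 2, (batch,))

# A learnable offset on the text embedding (a simplification of the paper's
# prompt array) and an adversary that tries to recover the sensitive
# attribute from each image's row of similarity scores.
prompt = nn.Parameter(torch.zeros(d))
adversary = nn.Sequential(nn.Linear(batch, 64), nn.ReLU(), nn.Linear(64, 2))

opt_debias = torch.optim.Adam([prompt], lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
lam = 1.0                              # weight on the adversarial term

for step in range(200):
    debiased_text = F.normalize(text_emb + prompt, dim=-1)
    sims = image_emb @ debiased_text.T            # (batch, batch) similarities

    # 1) Adversary step: learn to predict the attribute from similarity scores.
    opt_adv.zero_grad()
    adv_loss = F.cross_entropy(adversary(sims.detach()), attribute)
    adv_loss.backward()
    opt_adv.step()

    # 2) Joint debiasing step: keep the original image-text matching
    #    (contrastive) objective while making the adversary fail.
    opt_debias.zero_grad()
    targets = torch.arange(batch)
    contrastive = (F.cross_entropy(sims, targets)
                   + F.cross_entropy(sims.T, targets)) / 2
    fool_adversary = -F.cross_entropy(adversary(sims), attribute)
    (contrastive + lam * fool_adversary).backward()
    opt_debias.step()
```

The key design choice is the joint objective in step 2: the debiasing parameters are pushed both to keep the contrastive image-caption matching loss low and to make the adversary’s predictions uninformative, which is what lets the method reduce bias without sacrificing downstream performance.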

Key takeaways

  1. Large-scale models trained on web data risk not only perpetuating societal biases but also entrenching them through data that becomes ā€˜frozen in time’. These issues are pervasive because such models are reused for tasks like dataset filtering, further entrenching biases on a larger, global scale.
  2. Because retraining large-scale V+L models is prohibitively expensive, we propose a debiasing method that also maintains model accuracy, so the models can perform downstream tasks to the same level.
  3. These methods use adversarial debiasing (i.e., adding a second model that tries to infer demographic features from the first model’s outputs and training the two models ā€œagainstā€ each other). All code is available on GitHub (https://github.com/oxai/debias-vision-lang).

A common technical barrier to debiasing methods is that they come at the expense of accuracy and performance on downstream tasks. The authors evaluated this trade-off and found that, with joint training (adding an image classification loss to the adversarial training), models can be both debiased and high-performing on downstream image-captioning tasks.
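
One common way to check such a trade-off is to re-run a downstream evaluation before and after debiasing. The sketch below uses zero-shot classification with the `clip` package purely as an illustration of that check; the class names, image paths, and labels are hypothetical placeholders, and the paper’s own downstream evaluations may differ.

```python
# Illustrative downstream check: zero-shot classification accuracy with
# frozen CLIP encoders. Paths, class names, and labels are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
samples = [("cat_01.jpg", 0), ("dog_01.jpg", 1)]   # (path, label) placeholders

correct = 0
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    for path, label in samples:
        img = preprocess(Image.open(path)).unsqueeze(0).to(device)
        img_feat = model.encode_image(img)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        pred = (img_feat @ text_feat.T).argmax(dim=-1).item()
        correct += int(pred == label)

print(f"zero-shot accuracy: {correct / len(samples):.2f}")
# Re-running the same evaluation with the debiased text embeddings (or the
# learned prompts) shows whether fairness gains came at the cost of accuracy.
```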

Between the lines

This study is important because it empowers individuals to tackle these biases rather than accept the out-of-the-box biases baked into large-scale pre-trained models. However, more must be done to improve the pipelines in which these models are trained, and data curation standards need to be called into question as well. To summarise, lead author Hugo Berg argues for a needed shift in priorities: ā€œwe, as individuals, researchers, and society, need to reconsider how we evaluate AI models moving forward because both accuracy and fairness are desirable qualities.ā€ While the AI research community has directed growing attention toward issues of bias, fairness, and transparency in recent years, more work remains in determining what non-harmful worldviews AI models and datasets should encode, and in finding practical methods to realize that vision.

