🔬 Research Summary by Isabel O. Gallegos, a Ph.D. student in Computer Science at Stanford University, researching algorithmic fairness to interrogate the role of artificial intelligence in equitable decision-making.
[Original paper by Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed]
Overview: Social biases in large language models (LLMs) have been well-documented, but how can we address them? This paper presents a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We consolidate, formalize, and expand notions of social bias and fairness in natural language processing, unify the literature with three intuitive taxonomies, and identify open problems and challenges for future work.
Introduction
Rapid advancements in large language models (LLMs) have enabled the understanding and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. This paper presents a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing. We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation, namely metrics and datasets, and one for mitigation. Our first taxonomy of metrics for bias evaluation organizes metrics by the different levels at which they operate in a model: embeddings, probabilities, and generated text. Our second taxonomy of datasets for bias evaluation categorizes datasets by their structure; we also release a consolidation of publicly available datasets for improved access. Our third taxonomy of techniques for bias mitigation classifies methods by their intervention during pre-processing, in-training, intra-processing, and post-processing. Finally, we identify open problems and challenges for future work. Synthesizing a wide range of recent research, we aim to provide a clear guide of the existing literature that empowers researchers and practitioners to better understand and prevent bias propagation in LLMs.
Key Insights
The Challenge of Bias in Large Language Models
The rise and rapid advancement of large language models (LLMs) have fundamentally changed language technologies. Able to generate human-like text and adapt to a wide array of natural language processing (NLP) tasks, these models have initiated a paradigm shift in how language models are developed. Instead of training task-specific models on relatively small task-specific datasets, researchers and practitioners can use LLMs as foundation models that are fine-tuned for particular functions. Even without fine-tuning, foundation models increasingly offer few- or zero-shot performance on tasks such as classification, question answering, logical reasoning, fact retrieval, and information extraction.
Lying behind these successes, however, is the potential to perpetuate harm. Typically trained at enormous scale on uncurated Internet-based data, LLMs inherit stereotypes, misrepresentations, derogatory and exclusionary language, and other denigrating behaviors that disproportionately affect already vulnerable and marginalized communities. These harms constitute “social bias,” a subjective and normative term we use broadly to refer to disparate treatment or outcomes between social groups arising from historical and structural power asymmetries. Though LLMs often reflect existing biases, they can also amplify them; in either case, the automated reproduction of injustice can reinforce systems of inequity.
Defining Bias and Fairness for NLP
Despite the growing emphasis on addressing these issues, bias and fairness research in LLMs often fails to precisely describe the harms of model behaviors: who is harmed, why the behavior is harmful, and how the harm reflects and reinforces social hierarchies. Consolidating literature from machine learning, NLP, and (socio)linguistics, we define several distinct facets of bias to disambiguate the types of social harms that may emerge from LLMs. We organize these harms in a taxonomy of social biases that researchers and practitioners can leverage to accurately describe bias evaluation and mitigation efforts. We shift fairness frameworks typically applied to machine learning classification problems towards NLP and introduce several fairness desiderata that begin to operationalize various fairness notions for LLMs.
Taxonomies for Bias Evaluation and Mitigation
With the growing recognition of the biases embedded in LLMs, an abundance of work has emerged proposing techniques to measure or remove social bias. We organize this research into three areas: (1) metrics for bias evaluation, (2) datasets for bias evaluation, and (3) techniques for bias mitigation, each of which we categorize, summarize, and discuss below.
Metrics for Bias Evaluation
We characterize the relationship between evaluation metrics and datasets, which are often conflated in the literature, and we categorize and discuss a wide range of metrics that can evaluate bias at different fundamental levels in a model: (1) embedding-based, which use hidden vector representations; (2) probability-based, which use model-assigned token probabilities; and (3) generated text-based, which use model-generated text continuations.
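To make the embedding-based class concrete, below is a minimal sketch of a WEAT-style association test, assuming embedding vectors have already been extracted from a model; the word sets and function names are illustrative rather than taken from the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """Mean similarity of vector w to attribute set A minus attribute set B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """WEAT-style effect size: how differently two target sets X and Y
    (e.g., terms for two social groups) associate with two attribute sets
    A and B (e.g., career vs. family words). All arguments are lists of
    embedding vectors."""
    x_assoc = [association(x, A, B) for x in X]
    y_assoc = [association(y, A, B) for y in Y]
    pooled_std = np.std(x_assoc + y_assoc, ddof=1)
    return (np.mean(x_assoc) - np.mean(y_assoc)) / pooled_std
```

An effect size near zero indicates that the two target sets associate with the attributes similarly; larger magnitudes indicate stronger differential association.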
We formalize the metrics mathematically with a unified notation that eases comparison between them. We also identify the limitations of each class of metrics in capturing biases that surface in downstream applications, highlighting areas for future research.
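As an illustration of the probability-based class, the toy comparison below checks how much more likely a small causal language model finds one member of a counterfactual sentence pair than the other; the model and sentence pair are arbitrary choices for this sketch, not a specific published metric.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative probability-based check: compare the log-likelihood a small
# causal LM assigns to sentences that differ only in the social group term.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_likelihood(sentence: str) -> float:
    """Total log-probability of the sentence under the model."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    # out.loss is the mean negative log-likelihood over the predicted tokens
    # (labels are shifted internally, so n - 1 tokens are predicted).
    n_predicted = inputs["input_ids"].shape[1] - 1
    return -out.loss.item() * n_predicted

pair = ("The nurse said she was tired.", "The nurse said he was tired.")
gap = log_likelihood(pair[0]) - log_likelihood(pair[1])
# A consistent gap in one direction across many such pairs suggests the model
# systematically prefers one group association over the other.
```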
Datasets for Bias Evaluation
We categorize datasets by their data structure: (1) counterfactual inputs, or pairs of sentences with perturbed social groups, and (2) prompts, or phrases used to condition text generation. With this classification, we leverage our taxonomy of metrics to highlight each dataset's compatibility with metrics beyond those originally proposed for it. We increase comparability between dataset contents by identifying the types of harm and the social groups targeted by each dataset. We highlight consistency, reliability, and validity challenges in existing evaluation datasets as areas for improvement. Finally, we consolidate and share publicly available datasets here: https://github.com/i-gallegos/Fair-LLM-Benchmark
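For intuition about the counterfactual-input structure, here is a minimal sketch of constructing a perturbed sentence pair by swapping social group terms; the term pairs and helper function are hypothetical, and real datasets rely on curated lexicons and careful handling of grammar and context.

```python
# Illustrative only: a tiny set of example term pairs, not a curated lexicon.
GROUP_TERM_PAIRS = [("he", "she"), ("man", "woman"), ("father", "mother")]

def make_counterfactual(sentence: str) -> str:
    """Return `sentence` with each listed group term swapped for its counterpart."""
    swap = {a: b for a, b in GROUP_TERM_PAIRS}
    swap.update({b: a for a, b in GROUP_TERM_PAIRS})
    out = []
    for token in sentence.split():
        stripped = token.rstrip(".,!?")
        key = stripped.lower()
        if key in swap:
            replacement = swap[key]
            if stripped[0].isupper():
                replacement = replacement.capitalize()
            # Re-attach any trailing punctuation removed above.
            out.append(replacement + token[len(stripped):])
        else:
            out.append(token)
    return " ".join(out)

original = "The doctor said he would arrive soon."
pair = (original, make_counterfactual(original))
# -> ("The doctor said he would arrive soon.",
#     "The doctor said she would arrive soon.")
```

Metrics from the taxonomy above, such as the probability-based comparison sketched earlier, can then be applied to both members of the pair and compared.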
Techniques for Bias Mitigation
We classify an extensive range of bias mitigation methods by their intervention stage: (1) pre-processing, which modifies model inputs; (2) in-training, which modifies model parameters via gradient-based updates; (3) intra-processing, which modifies inference behavior without further training; and (4) post-processing, which modifies model outputs. We construct granular subcategories at each mitigation stage to draw out similarities and trends between classes of methods, mathematically formalize several techniques with unified notation, and give representative examples of each class. We also draw attention to ways that bias may persist at each mitigation stage.
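As a toy illustration of the intra-processing stage, the sketch below equalizes the next-token scores of paired group terms at decoding time, leaving the trained parameters untouched; the token ids are hypothetical, and this simplification stands in for decoding-based mitigation generally rather than reproducing any specific technique from the survey.

```python
import numpy as np

def equalize_group_logits(logits: np.ndarray, paired_ids: list[tuple[int, int]]) -> np.ndarray:
    """Return a copy of the next-token logits in which each pair of group-term
    token ids receives the mean of the two original logits."""
    adjusted = logits.copy()
    for i, j in paired_ids:
        mean = (adjusted[i] + adjusted[j]) / 2.0
        adjusted[i] = mean
        adjusted[j] = mean
    return adjusted

# Hypothetical usage: suppose ids 11 and 42 correspond to "he" and "she"
# in the model's vocabulary; apply the adjustment before sampling each token.
vocab_logits = np.random.randn(50_000)
debiased_logits = equalize_group_logits(vocab_logits, paired_ids=[(11, 42)])
```

Pre-processing and in-training methods intervene earlier instead, for example by augmenting training data with counterfactual pairs like the one sketched above or by adding fairness-oriented terms to the training objective.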
Open Problems and Challenges
The work we survey makes important progress in understanding and reducing bias, but several challenges remain largely open. We challenge future research to address power imbalances in LLM development, conceptualize fairness more robustly for NLP, improve bias evaluation principles and standards, expand mitigation efforts, and explore theoretical limits for fairness guarantees.
Between the lines
As LLMs are increasingly deployed and adapted in various applications, bias evaluation and mitigation efforts remain critical research areas to ensure social harms are neither automated nor perpetuated by technical systems. However, the role of technical solutions must be contextualized within a broader understanding of historical, structural, and institutional power hierarchies. For instance, who holds power in developing and deploying LLM systems, who is excluded, and how does technical solutionism preserve, enable, and strengthen inequality? We hope our work improves understanding of technical efforts to measure and reduce the perpetuation of bias by LLMs while also challenging researchers to interrogate more deeply the social, cultural, historical, and political contexts that shape the underlying assumptions and values ingrained in technical solutions.