🔬 Research summary by Sarah Masud & Tanmoy Chakraborty.
Sarah is currently a 3rd-year doctoral student at the Laboratory for Computational Social Systems (LCS2) at IIIT-Delhi. Within the broad area of social computing, her work mainly revolves around modelling the detection and diffusion of hate speech on the web. Tanmoy Chakraborty is an Assistant Professor of Computer Science and a Ramanujan Fellow at IIIT-Delhi, where he leads the Laboratory for Computational Social Systems (LCS2) and heads the Infosys Centre for Artificial Intelligence. His broad research interests include Natural Language Processing and Social Computing, with a major focus on designing machine learning models for cyber-safety, trust, and social good.
[Original paper by Tanmay Garg, Sarah Masud, Tharun Suresh, Tanmoy Chakraborty]
Overview: When attempting to detect toxic speech*(footnote) in an automated manner, we do not want the model to modify its predictions based on the speaker’s race or gender. If the model displays such behaviour, it has acquired what is usually referred to as “unintended bias.” Adoption of such biased models in production may result in the marginalisation of the groups that they were designed to assist in the first place. The current survey puts together a systematic study of existing methods for evaluating and mitigating bias in toxicity detection.
Introduction
The subjects of bias in machine learning models and the methods used in toxicity detection have been extensively surveyed. The authors explore the niche area of bias detection, evaluation, and mitigation as applied to automated toxicity detection in the existing literature. To develop a systematic overview of the various unintended biases reported in the literature, the authors design a taxonomy of bias based on the source of harm or the downstream impact of harm. The source of harm determines where in the modelling pipeline the bias gets introduced (e.g., data collection, annotation, etc.). Meanwhile, the impact of harm captures which characteristic of the end user (race, gender, age, etc.) the biased model discriminates against. While not mutually exclusive and exhaustive, this taxonomy provides a precise overview of the existing literature. Based on this classification, the survey then dives deep into the methods used to detect, evaluate, and mitigate these biases. In addition to discussing the traditional demographic biases, the survey also touches on intersectional and cross-geographic biases, as well as discrimination based on psychographic preferences.
Apart from developing a taxonomy of the various biases, the authors also develop a taxonomy of the different evaluation metrics used to study biases in toxicity detection models. This second taxonomy maps each bias evaluation metric to one or more concepts of fairness that the metric is trying to improve upon.
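To make the notion of a bias evaluation metric concrete, here is a minimal sketch (our own illustration, not code from the survey) of two subgroup-based AUC metrics in the spirit of the nuanced-AUC family widely used for measuring unintended bias. It assumes NumPy arrays of gold toxicity labels, model scores, and a boolean mask marking the comments that mention a given identity group:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc(labels, scores, subgroup_mask):
    """AUC restricted to comments that mention the identity subgroup."""
    return roc_auc_score(labels[subgroup_mask], scores[subgroup_mask])

def bpsn_auc(labels, scores, subgroup_mask):
    """Background-Positive, Subgroup-Negative AUC: non-toxic subgroup
    comments vs. toxic background comments. A low value means the model
    falsely flags benign mentions of the identity (unintended bias)."""
    keep = (subgroup_mask & (labels == 0)) | (~subgroup_mask & (labels == 1))
    return roc_auc_score(labels[keep], scores[keep])

# Toy inputs: 0/1 gold labels, model probabilities, identity-mention mask.
labels = np.array([0, 1, 0, 1, 0, 1])
scores = np.array([0.2, 0.9, 0.7, 0.8, 0.1, 0.6])
mask = np.array([True, True, False, False, True, False])
print(subgroup_auc(labels, scores, mask), bpsn_auc(labels, scores, mask))
```

Mapping such a metric to a fairness concept then amounts to asking, for instance, that these per-group AUCs stay close to the overall AUC.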
Bias as a source of harm
When discussing bias as a source of harm, the authors examine how the chosen data sampling strategy introduces biases into toxicity datasets. Interestingly, they highlight that the topics and user sets captured in a dataset bias it more strongly than the sampling technique itself. Additionally, the issues of lexical and annotation bias are discussed, with a particular focus on reducing the model’s confusion between disclosure of an identity and an attack on that identity; a small probe of this confusion is sketched below.
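For instance, a practitioner could probe a trained classifier with paired identity-disclosure and identity-attack templates. The snippet below is a hypothetical sketch of such a probe, where `toxicity_score` stands in for any model that maps a string to a toxicity probability:

```python
IDENTITY_TERMS = ["gay", "muslim", "black", "jewish"]

# Disclosures of identity should score low; attacks on identity should score high.
DISCLOSURE_TEMPLATE = "I am a proud {} person."
ATTACK_TEMPLATE = "I hate all {} people."

def probe_identity_confusion(toxicity_score):
    """toxicity_score: hypothetical callable, text -> probability in [0, 1]."""
    for term in IDENTITY_TERMS:
        benign = toxicity_score(DISCLOSURE_TEMPLATE.format(term))
        hostile = toxicity_score(ATTACK_TEMPLATE.format(term))
        print(f"{term:8s}  disclosure={benign:.2f}  attack={hostile:.2f}")
```

A lexically biased model will assign high scores in both columns, because it has learned to key on the identity term rather than on the hostile context.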
Annotation and lexicon go hand in hand: a dataset in which the annotators’ inherent biases skew the labelling towards explicit terms will eventually develop spurious lexical associations. Note that no standard annotation guideline or inter-annotator agreement range exists in the area of toxic speech detection. Despite the best efforts of researchers and practitioners, what can be considered toxic is highly subjective, and there are no universally adopted benchmark datasets and annotations to compare against.
Bias as a target of harm
Unfortunately, biases in the datasets and modelling for toxicity detection end up impacting the very demographic groups the systems are meant to protect from toxicity. These broadly include the markers of race and gender. Owing to the limited availability of ground-level demographics to map against, the study of racial bias in toxicity detection has primarily focused on discrimination against African-American dialects. Meanwhile, the study of gender bias has focused on binary gender.
Although gender-pronoun swapping and transfer learning from less biased datasets can help mitigate gender bias, manual inspection of such augmentations and the extension of gender beyond the binary remain unexplored for toxicity detection.
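As a rough illustration of what gender-pronoun swapping looks like in practice (our sketch, not the survey’s code), the snippet below pairs each training comment with a pronoun-swapped copy. The whitespace tokenisation and pronoun map are deliberately naive, which is exactly why the manual inspection the survey calls for matters:

```python
# Naive pronoun map; a real pipeline would need POS tagging to disambiguate
# forms such as "her" (possessive vs. object) and "his" vs. "hers".
PRONOUN_SWAP = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "his": "hers", "hers": "his",
    "himself": "herself", "herself": "himself",
}

def swap_pronouns(text: str) -> str:
    swapped = []
    for tok in text.lower().split():
        core = tok.strip(".,!?")
        swapped.append(tok.replace(core, PRONOUN_SWAP[core]) if core in PRONOUN_SWAP else tok)
    return " ".join(swapped)

# Keep the original comment and add its swapped counterpart to the training set.
comments = ["He said she was rude to him."]
augmented = [(c, swap_pronouns(c)) for c in comments]
print(augmented)
```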
Within the scope of racial prejudice, the authors observe that priming the annotators with racial information can help reduce racial bias. Still, such priming is a double-edged sword, as it can also intensify the annotators’ inherent biases. The authors point out that regularising racial bias via statistical models which assume different dialects have the same conditional probability of being toxic is limited in scope and application: what is accepted and commonly used in one dialect may be frowned upon or rarely used in another.
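To see why such a regulariser is limited, consider a rough PyTorch-style sketch (our own assumption of how the constraint might be encoded, not the survey’s formulation) that penalises differences in the mean predicted toxicity probability across dialect groups:

```python
import torch
import torch.nn.functional as F

def loss_with_dialect_regulariser(logits, labels, dialect_ids, lam=0.1):
    """Cross-entropy plus a penalty encoding the assumption that
    P(toxic | dialect) is the same for every dialect group."""
    ce = F.cross_entropy(logits, labels)
    probs = torch.softmax(logits, dim=-1)[:, 1]       # P(toxic) per example
    overall = probs.mean()
    # Squared deviation of each dialect group's mean prediction from the overall mean.
    penalty = sum((probs[dialect_ids == g].mean() - overall) ** 2
                  for g in torch.unique(dialect_ids))
    return ce + lam * penalty
```

Because the penalty treats every dialect as interchangeable, terms that are reclaimed or commonplace in one dialect but offensive in another end up being averaged away, which is precisely the limitation the authors flag.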
Biases beyond demography
To initiate the conversation on bias beyond demographic markers like race and gender, the authors highlight the limited yet significant work in the areas of intersectional and psychographic biases. One form of intersection is the duality of race and gender; another is to look at race, gender, or both from a cross-geographic perspective. Both are in nascent stages of study in toxicity detection. Meanwhile, political ideologies and stances are being explored for their impact on toxicity modelling. Despite the best efforts of researchers and practitioners, markers such as age, religious affiliation, and socio-economic status remain underexplored. Depending on the geography, a combination of these could be critical for mitigating bias in toxicity detection.
Between the lines
Accounting for and mitigating biases within the broad area of toxicity detection is far from a solved problem. Our biggest takeaway is that practitioners need to incorporate bias mitigation at every step of the modelling pipeline rather than treating it as a one-stop solution. Having observed that biases can easily transform from one form into another, the authors explore the concept of “bias shift” in lexicon debiasing. In some cases, the presence of one bias can lead to the development of other forms of downstream harm; for example, lexical and racial biases in toxicity detection have been known to occur together owing to the stylistic variations of African-American dialects. Thus, an end-to-end pipeline will help practitioners better monitor the side effects of dealing with an existing bias. As pointed out in the survey, sadly, the majority of existing work on detecting toxic speech and mitigating bias in toxicity detection focuses on the English-speaking, binary-gendered population. Building more coherent and robust models that can help fight toxicity at scale will require us to look at the linguistic nuances that arise from regional geographies and more inclusive gender dynamics.
Footnote:
- Throughout the survey, the term “toxic speech” is used as an umbrella term to refer to any form of malicious content, including but not limited to hate speech, cyberbullying, abusive speech, misogyny, sexism, offence, and obscenity.