Using Pre-Trained Language Models for Producing Counter Narratives Against Hate Speech: a Comparative Study

🔬 Research Summary by Marco Guerini, a researcher in Computational Linguistics and head of the Language and Dialogue Technologies group at Fondazione Bruno Kessler (FBK).

[Original paper by Serra Sinem Tekiroglu, Helena Bonaldi, Margherita Fanton, and Marco Guerini]

Overview: Many international institutions and countries are taking action to monitor, restrict, and remove online hate content. Still, the results are not always satisfactory, and such measures are often accused of amounting to censorship. An alternative approach that has emerged in recent years is based on so-called Counter Narratives: de-escalating, fact-bound textual responses that refute hateful messages in a non-aggressive way. In this scenario, automation is needed to effectively address the sheer amount of hate produced daily. Therefore, we conducted a comparative study of several neural language models for automatic counter-narrative generation, intended as a companion tool for NGO operators tackling online hate.

Introduction

Hate Speech (HS) has found fertile ground on social media platforms. The actions such platforms take to tackle online hatred consist of identifying possible sources of hate and removing them through content deletion, account suspension, or shadow-banning. However, these actions are often interpreted and denounced as censorship by the affected users and political groups. For this reason, such restrictions can backfire and exacerbate the hostility of the haters. An alternative strategy that is gaining ground is based on Counter-Narratives (CNs), i.e., communicative actions that refute hate speech through thoughtful, cogent reasoning and true, fact-bound arguments. As a de-escalating measure, CNs have proven successful in diminishing hate while preserving freedom of speech. An example of an HS-CN pair follows:

HS: Jews have a secret plot to take over the world.

CN: This myth traces back to ‘The Protocols of the Learned Elders of Zion,’ which is used as proof of a Jewish conspiracy to rule the world. But the Protocols are a proven forgery, written by agents of the Russian czar in the late 19th century, that spread throughout the 20th century and continues to this day.

Key Insights

Because of their effectiveness, NGOs are employing CNs to counter online hate. Still, it is impossible to respond manually to every instance of hate. For this reason, a line of AI research based on Natural Language Processing (NLP) has recently emerged, focusing on systems that automatically generate CN suggestions. This study compares the most recent and advanced AI-based Language Models (LMs) to understand their pros and cons in generating CNs.

Effective CN Generation Experiments

In our experiments, we use various automatic metrics and manual evaluations based on expert judgments to assess several LMs that represent the main categories of currently available model architectures and decoding methods. We further test the robustness of the fine-tuned LMs in generating CNs for unseen targets of hate. For this study, we rely on a dataset that provides the target diversity and CN quality we aim for. The dataset was collected with a human-in-the-loop approach and features 5k HS-CN pairs covering several targets, including DISABLED, JEWS, LGBT+, MIGRANTS, MUSLIMS, POC, and WOMEN.
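As a rough illustration of what testing on an unseen target can look like, the sketch below holds out every HS-CN pair for one target and keeps the rest for fine-tuning. This is not the study's code: the record layout, the placeholder strings, and the function name `leave_one_target_out` are all assumptions made for the example.

```python
# Minimal sketch of an out-of-target split (assumed setup, not the study's code).
# Each record is a hypothetical HS-CN pair tagged with its target group.
pairs = [
    {"target": "JEWS", "hs": "hs-text-1", "cn": "cn-text-1"},
    {"target": "MUSLIMS", "hs": "hs-text-2", "cn": "cn-text-2"},
    {"target": "MUSLIMS", "hs": "hs-text-3", "cn": "cn-text-3"},
    {"target": "WOMEN", "hs": "hs-text-4", "cn": "cn-text-4"},
]


def leave_one_target_out(records, held_out):
    """Split records so the held-out target never appears in training."""
    train = [r for r in records if r["target"] != held_out]
    test = [r for r in records if r["target"] == held_out]
    return train, test


# Fine-tune on `train`, then generate CNs for the unseen target in `test`.
train, test = leave_one_target_out(pairs, "MUSLIMS")
print(len(train), "training pairs;", len(test), "held-out pairs")
```

Repeating the split once per target gives one robustness estimate per held-out group, which makes it possible to compare how well a model carries over to similar and dissimilar targets.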

Results show that autoregressive language models such as GPT-2 are, in general, better suited to the task. While stochastic decoding mechanisms can generate more novel, diverse, and informative outputs, deterministic decoding is useful in scenarios where more generic and less novel (yet ‘safer’) CNs are needed. Furthermore, in out-of-target experiments, we find that the similarity between targets (e.g., JEWS and MUSLIMS as religious groups) plays a crucial role in how well a model ports to new targets. Finally, we show a promising research direction: leveraging human corrections of LM outputs to build an additional automatic post-editing step that corrects errors made by LMs during generation.
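To make the contrast between deterministic and stochastic decoding concrete, here is a toy sketch over a single next-token distribution. The probabilities and tokens are invented for illustration, and top-k sampling is used only as one familiar stochastic method; this summary does not specify which decoding methods were compared, so none of this should be read as the study's configuration.

```python
import random

# Hypothetical next-token probabilities at one generation step.
next_token_probs = {"myth": 0.50, "forgery": 0.30, "claim": 0.15, "hoax": 0.05}


def greedy_pick(probs):
    """Deterministic decoding: always return the most probable token."""
    return max(probs, key=probs.get)


def top_k_sample(probs, k, rng):
    """Stochastic decoding: sample among the k most probable tokens,
    weighted by their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so runs are repeatable
greedy_outputs = {greedy_pick(next_token_probs) for _ in range(20)}
sampled_outputs = {top_k_sample(next_token_probs, 3, rng) for _ in range(200)}

print("greedy:", greedy_outputs)    # a single token, every time
print("sampled:", sampled_outputs)  # drawn from the three most probable tokens
```

Greedy selection repeats the same high-probability choice on every run, which mirrors the more generic but 'safer' behaviour described above, whereas sampling spreads its choices over several plausible tokens and so yields more varied output.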

Between the Lines

Automating CN generation can help increase the efficiency of online hate countering while preserving freedom of speech and promoting less aggressive and hostile debates. However, the AI-based generation models we tested are not meant to be used autonomously, since even the best model can still produce substandard CNs containing inappropriate or negative language. Instead, following a human-computer cooperation paradigm, we want to build models that help NGO operators by providing them with diverse and novel CN candidates for their hate-countering activities while giving them total control over the final output.