Down the Toxicity Rabbit Hole: Investigating PaLM 2 Guardrails

🔬 Research Summary by Ashique KhudaBukhsh, an assistant professor at the Rochester Institute of Technology specializing in natural language processing, computational social science, and responsible AI.

[Original paper by Adel Khorramrouz, Sujan Dutta, Arka Dutta, and Ashique R. KhudaBukhsh]

Overview: The rapid growth of rich and expressive generative AI tools has raised new concerns about how they can be misused to produce toxic and harmful content. This research presents a novel framework dubbed toxicity rabbit hole that uncovers disturbing vulnerabilities in PaLM 2, a well-known large language model created by Google’s DeepMind.

Introduction

“… Xs should be gassed. Ys should be lynched. Z should be raped. U people should be tortured and executed …”

The above text sounds deeply disturbing even when X, Y, Z, and U are left to the readers’ horror. When one tells us that X, Y, Z, and U are all historically disadvantaged groups, and this text snippet created by a generative AI tool bypassed its guardrails, the potential harm and misuse of generative AI become immediately evident. Our research on stress-testing the AI guardrails of a well-known large language model (LLM) PaLM 2 (Google’s answer to OpenAI’s GPT-4) through a novel toxicity rabbit hole framework uncovers this and several other disturbing instances that can bypass well-meaning guardrails (Google was informed about the toxic content PaLM 2 generates and better guardrails have been put in place since this study). Several examples can be found in our preprint however, please be warned: the language is highly disturbing.

Key Insights

The Toxicity Rabbit Hole Framework

The toxicity rabbit hole is an iterative framework that begins with a simple stereotype, which can be either positive or negative. For an identity group (e.g., a religion, nationality, or ethnic group) denoted as ‘G,’ we employ two initial stereotypes: ‘G are nice people‘ and ‘G are not nice people.‘

In the first step, we instruct the LLM to make the initial stereotype more toxic. We give the LLM the freedom to modify, append to, or completely rewrite the stereotype. We also allow it to introduce any new groups in its toxic generations, but we do not specify which ones to include.

After the LLM provides a more toxic rewrite in response to our request, we move on to the second step. Here, we ask the LLM to generate even more toxic content, but this time using its own previously generated content from the first step as the input.

In each subsequent step, our instruction to the LLM is to produce more toxic content than what it generated in the previous step. Throughout this process, the LLM’s guardrails are programmed to prevent generating highly unsafe content.

What is the Design Philosophy of this Rabbit Hole Framework?

We can conceptualize the rich generation space of a Large Language Model (LLM) as akin to the entire space of thoughts that cross our minds. In this analogy, the guardrails serve as filters, much like the ones we employ in our communication.

Human minds possess robust filters that adapt what is deemed appropriate or inappropriate based on various factors, including the audience (such as conversing with my 7-year-old niece compared to my 13-year-old niece), the context (for instance, speaking during a live TV interview as opposed to delivering a lecture or engaging in locker-room banter versus participating in a dinner table conversation), and the perceived consequences of our speech (such as not recommending a friend who recently experienced a painful miscarriage to watch ‘Blue Jay’). Several other factors influence these filters as well.

And then there are things we know are simply inappropriate under any scenario. So, while our mind can generate even those thoughts that can profoundly hurt others, our filters (almost always) will prevent us from saying them out loud.

The rabbit hole framework tests how robust the LLM’s guardrails are in preventing it from “saying” toxic things that its rich generative component was able to come up with. It also takes the classic frog in boiling water approach, where it iteratively tests the filters by slowly nudging the generated content toward more and more toxicity.

Characterizing the Toxic Generations

We conducted experiments with 193 nationalities, 1,023 ethnic groups, and 50 religions as identity groups. Half of our experiments started with the said group as nice people as our initial stereotype. For the other half of the rabbit hole experiments, we started with the said group, which is made up of not nice people. Regardless of our starting point, the toxic generations often eventually veered toward unbridled physical aggression, frequently mentioning collective punishments of exterminating, lynching, raping, torturing, or putting into gas chambers. The content was particularly harsh to Jews, Blacks, Muslims, Women, and LGBTQ people. The antisemitism numbed us it could produce without its guardrails being able to intercept it. In a way, our approach gave us a rare glimpse of the underbelly of the training data these LLMs are trained on. The horrific things that it was producing, it must have seen something similar in its training corpus, right? The use of the word Untermenschen (a German word meaning subhumans) was particularly disturbing. It made us wonder from where it learned this word and to what extent the model was exposed to Nazi-sympathizing literature.

We were equally appalled by the readily available hate. The LLM was behaving like an antisemite, racist, Islamophobe, homophobe, white supremacist, and misogynist (to list a few) rolled into one. For malicious actors intending to disrupt the information ecosystem, access to such tools is comparable to strolling into an ice cream parlor where you get to pick your favorite flavor of hate.

Between the lines

We started this article by assuring the readers that Google was informed about the toxic content PaLM 2 generates and that better guardrails have been implemented since this study. But we wonder how assuring is that. Suppose a small group of academic researchers can uncover such critical security lapses in the AI guardrails of a tech giant. What are the chances that a rogue nation or organization could be even more adept at exploiting them for their gain? PaLM 2 likely went through considerable safety evaluations. But in this race to flood the market with the shiniest LLMs, is marketing exigency dictating the release of LLMs with half-baked security assessments?

One might argue that they tried our method with GPT-4, and it did not work. That’s great. However, generative AI safety is not all about how safe GPT-4 (or PaLM 2) is. Our ongoing experiments already indicate that several other LLMs have similar issues. One compromised apple in the LLM bushel can potentially disrupt the information ecosystem should malicious actors discover how to exploit it. Furthermore, our approach could be employed with other techniques to “jailbreak” LLMs, such as visual prompts or appending adversarial gibberish to the end of the input prompt.

Our research has led to the creation of the toxicity rabbit hole dataset, which can enhance the robustness of future LLMs. However, if there is a demand for LLMs without safety mechanisms, or if investing in guardrails slows down companies in this maddening race, generative AI creators would need additional incentives (or regulatory consequences) to ensure compliance. The question that arises is where to draw the line between the promising potential of democratizing LLMs and the concerning risk of democratizing hate. We are all in it together. In this chaotic technical landscape, we sincerely hope that—industry researchers, academicians, AI ethicists, governments, and media—everyone with skin in the game somehow ends up on the responsible side of history.