🔬 Research Summary by Zheng-Xin Yong, a Computer Science Ph.D. candidate at Brown University, focusing on inclusive and responsible AI by building multilingual large language models and making them more representative and safer.
[Original paper by Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach]
Overview: Can GPT-4’s safety guardrails successfully defend against unsafe inputs in low-resource languages? This work says no. The authors show that translating unsafe English inputs into low-resource languages renders GPT-4’s guardrails ineffective. This cross-lingual vulnerability poses risks to all LLM users. Therefore, we need more holistic and robust multilingual safeguards.
Sorry, but I can’t assist with that.
This is the default response from GPT-4 when prompted with requests that violate safety guidelines or ethical constraints. AI safety guardrails are designed to prevent harmful content generation, such as misinformation and violence promotion.
However, we can bypass GPT-4’s safety guardrails easily with translation. By translating unsafe English inputs, such as “how to build explosive devices using household materials,” into low-resource languages such as Zulu, we can obtain responses that advance our malicious goals nearly 80% of the time when we are free to choose among several such languages.
This cross-lingual vulnerability arises because safety research focuses on high-resource languages like English. Previously, this linguistic inequality in AI development mainly affected low-resource language speakers. Still, it poses safety risks for all users because anyone can exploit LLMs’ cross-lingual safety vulnerabilities with publicly available translation services. Our work emphasizes the pressing need to embrace more holistic and inclusive safety research.
Background: AI Safety and Jailbreaking
In generative AI safety, jailbreaking (a term borrowed from the practice of removing manufacturers’ software restrictions on computing devices) means circumventing an AI system’s safety mechanisms to generate harmful responses, and it is usually carried out by users. It is a form of adversarial attack that makes large language models (LLMs) return information that would otherwise be withheld.
Companies like OpenAI and Anthropic first use reinforcement learning from human feedback (RLHF) to align LLMs with human preferences for helpful and safe outputs, which discourages users from jailbreaking and abusing the models. They then perform red-teaming, in which in-house teams are tasked with bypassing the safeguards, so that vulnerabilities can be fixed preemptively and safety failure modes understood before the LLMs are released to the public.
Method: Translation-based Jailbreaking
We investigate a translation-based jailbreaking attack to evaluate the robustness of GPT-4’s safety measures across languages. Given an input, we translate it from English into another language, feed it into GPT-4, and subsequently translate the response back into English. We then perform human annotation of whether GPT-4’s responses are harmful and whether we successfully bypassed the safeguards.
We carry out our attacks on a recent version of GPT-4, gpt-4-0613, because it is a stable release and among the safest of the stable releases. We translate unsafe English inputs from AdvBench into twelve languages, categorized by their data availability into low-resource (LRL), mid-resource (MRL), and high-resource (HRL) languages. We use the publicly available Google Translate Basic service API for translation.
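The attack pipeline described above can be sketched as follows. This is a minimal illustration, not the authors’ actual code: `translate` and `query_gpt4` are hypothetical stand-ins for calls to the Google Translate Basic API and to gpt-4-0613 via the OpenAI API, stubbed out here so the sketch is self-contained.

```python
def translate(text: str, source: str, target: str) -> str:
    # Hypothetical stand-in for the Google Translate Basic API.
    # For illustration it returns the text unchanged.
    return text

def query_gpt4(prompt: str) -> str:
    # Hypothetical stand-in for a call to gpt-4-0613 via the OpenAI API.
    return f"[model response to: {prompt}]"

def translation_attack(english_prompt: str, lang: str) -> str:
    """Translate an English prompt into `lang`, query the model,
    and translate the response back into English for annotation."""
    translated_prompt = translate(english_prompt, source="en", target=lang)
    response = query_gpt4(translated_prompt)
    return translate(response, source=lang, target="en")

# Example: attacking with a Zulu ("zu") translation of the prompt.
print(translation_attack("example unsafe prompt", "zu"))
```

In the real pipeline, the returned English text would then be handed to human annotators to judge harmfulness and whether the safeguard was bypassed.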
We also consider an adaptive adversary who can iterate and choose the language to attack based on the input prompt. In this case, instead of studying the attack success rate of a single language, we consider the attack success rate of the combined languages in LRL/MRL/HRL settings.
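Under this adaptive setting, an input counts as a successful attack if any language in the chosen resource group elicits a harmful response. A minimal sketch of that aggregation, using fabricated outcome flags rather than the paper’s annotations:

```python
def combined_success_rate(outcomes: dict[str, list[bool]]) -> float:
    """`outcomes` maps a language code to per-prompt success flags
    (True = the translated prompt elicited a harmful response).
    A prompt is a combined success if at least one language succeeds."""
    per_prompt = zip(*outcomes.values())  # regroup flags by prompt
    successes = [any(flags) for flags in per_prompt]
    return sum(successes) / len(successes)

# Illustrative (made-up) annotations for three prompts in two LRLs:
lrl_outcomes = {
    "zu": [True, False, False],   # Zulu
    "gd": [False, True, False],   # Scottish Gaelic
}
# 2 of the 3 prompts succeed in at least one language.
print(combined_success_rate(lrl_outcomes))
```

Combining languages this way can only raise the measured success rate, which is why the adaptive adversary is strictly stronger than one fixed on a single language.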
Results: Alarmingly High Attack Success Rate with Low-Resource Languages
By translating unsafe inputs into low-resource languages like Zulu or Scottish Gaelic, we can circumvent GPT-4’s safety measures and elicit harmful responses nearly half of the time. In contrast, the original English inputs have less than a 1% success rate. Furthermore, combining different low-resource languages increases the jailbreaking success rate to around 79%.
We further break down the topics of the unsafe inputs. The three topics with the highest jailbreaking success rates under low-resource-language translation are (1) terrorism, such as making bombs or planning terrorist attacks; (2) financial manipulation, such as performing insider trading or distributing counterfeit money; and (3) misinformation, such as promoting conspiracy theories or writing misleading reviews.
Linguistic inequality endangers AI safety and all users
The discovery of cross-lingual vulnerabilities reveals the harms of the unequal valuation of languages in safety research. For instance, the existing safety alignment of LLMs primarily focuses on the English language. Toxicity and bias detection benchmarks are also curated for high-resource languages such as English, Arabic, Italian, and Chinese. The intersection of safety and low-resource languages is still an underexplored research area.
Before, this linguistic inequality mainly meant reduced utility and accessibility for low-resource language users. Now, it creates safety risks that affect all LLM users. First, low-resource language speakers, who number nearly 1.2 billion people worldwide, can interact with LLMs under limited safety and content-moderation filters. Second, bad actors from high-resource language communities can use publicly available translation tools to breach the safeguards.
Between the lines
LLMs already power multilingual applications
Large language models such as GPT-4 are already powering multilingual services and applications such as translation, personalized language education, and even language preservation efforts for low-resource languages. Therefore, we must close the gap between safety development and real-world use cases of LLMs.
Addressing the Illusion of Safety
Progress in English-centric safety research merely creates an illusion of safety when the safety mechanisms remain susceptible to unsafe inputs in low-resource languages. As translation services already cover many low-resource languages, we urge AI researchers to develop robust multilingual safeguards and report red-teaming evaluation results beyond English.