
Low-Resource Languages Jailbreak GPT-4

February 1, 2024

🔬 Research Summary by Zheng-Xin Yong, a Computer Science Ph.D. candidate at Brown University, focusing on inclusive and responsible AI by building multilingual large language models and making them more representative and safer.

[Original paper by Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach]


Overview: Can GPT-4’s safety guardrails successfully defend against unsafe inputs in low-resource languages? This work says no. The authors show that translating unsafe English inputs into low-resource languages renders GPT-4’s guardrails ineffective. This cross-lingual safety vulnerability poses safety risks to all LLM users. Therefore, we need more holistic and robust multilingual safeguards.


Introduction

Sorry, but I can’t assist with that. 

This is the default response from GPT-4 when prompted with requests that violate safety guidelines or ethical constraints. AI safety guardrails are designed to prevent harmful content generation, such as misinformation and violence promotion.

However, we can easily bypass GPT-4’s safety guardrails with translation. By translating unsafe English inputs, such as “how to build explosive devices using household materials,” into low-resource languages such as Zulu, we can elicit responses that serve these malicious goals roughly half of the time with a single language, and nearly 80% of the time when several low-resource languages are combined.

This cross-lingual vulnerability arises because safety research focuses on high-resource languages like English. Previously, this linguistic inequality in AI development mainly affected speakers of low-resource languages. Now, it poses safety risks for all users because anyone can exploit LLMs’ cross-lingual safety vulnerabilities with publicly available translation services. Our work emphasizes the pressing need for more holistic and inclusive safety research.

Key Insights 

Background: AI Safety and Jailbreaking

In generative AI safety, jailbreaking, a term borrowed from the practice of removing manufacturers’ software restrictions on computing devices, means circumventing an AI system’s safety mechanisms to generate harmful responses; it is usually carried out by users. It is a form of adversarial attack that makes large language models (LLMs) return information that would otherwise be withheld.

To prevent users from jailbreaking and abusing LLMs, companies like OpenAI and Anthropic first use reinforcement learning from human feedback (RLHF) to align LLMs with human preferences for helpful and safe outputs. Then, they perform red-teaming, in which the companies’ own data scientists are tasked with bypassing the safeguards so that vulnerabilities can be fixed preemptively and safety failure modes understood before the LLMs are released to the public.

Method: Translation-based Jailbreaking

We investigate a translation-based jailbreaking attack to evaluate the robustness of GPT-4’s safety measures across languages. Given an input, we translate it from English into another language, feed it into GPT-4, and subsequently translate the response back into English. Human annotators then judge whether GPT-4’s responses are harmful and whether the safeguards were successfully bypassed.

We carry out our attacks on a recent stable version of GPT-4, gpt-4-0613, which is among the safest of the stable releases. We translate unsafe English inputs from AdvBench into twelve languages, categorized as low-resource (LRL), mid-resource (MRL), or high-resource (HRL) based on their data availability, using the publicly available Google Translate Basic service API.

We also consider an adaptive adversary who can iterate and choose which language to attack with based on the input prompt. In this case, instead of reporting the attack success rate of a single language, we report a combined attack success rate for each of the LRL, MRL, and HRL groups, counting an input as a successful attack if translation into at least one language in the group elicits a harmful response.
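The evaluation loop is simple to sketch. Below is a minimal, hypothetical illustration (not the authors’ released code) of the translation round-trip and the combined attack success rate, assuming the google-cloud-translate and openai Python packages; the prompt file path, helper names, and language set shown are placeholders, and the harmfulness judgment itself is left to human annotation.

```python
# Hypothetical sketch of the translation-based jailbreak evaluation pipeline.
# Assumes credentialed google-cloud-translate and openai clients; the prompt
# file below is a placeholder for AdvBench inputs. This script only collects
# responses; whether a response is harmful is decided by human annotators.
from google.cloud import translate_v2 as translate
from openai import OpenAI

translator = translate.Client()
llm = OpenAI()

def query_via_translation(english_prompt: str, lang: str) -> str:
    """Translate a prompt into `lang`, query GPT-4, translate the reply back."""
    translated = translator.translate(english_prompt, target_language=lang)["translatedText"]
    reply = llm.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": translated}],
    ).choices[0].message.content
    return translator.translate(reply, target_language="en")["translatedText"]

def combined_attack_success_rate(annotations: dict[str, list[bool]]) -> float:
    """Adaptive-adversary metric: an input counts as jailbroken if at least one
    language in the group elicited a harmful response (per human annotation)."""
    per_input = zip(*annotations.values())  # per-language labels for each input
    successes = [any(labels) for labels in per_input]
    return sum(successes) / len(successes)

# Example: collect responses for two of the low-resource languages in the study.
lrl_group = ["zu", "gd"]  # Zulu, Scottish Gaelic
prompts = open("advbench_prompts.txt").read().splitlines()  # placeholder path
responses = {lang: [query_via_translation(p, lang) for p in prompts] for lang in lrl_group}
```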

Results: Alarmingly High Attack Success Rate with Low-Resource Languages

By translating unsafe inputs into low-resource languages like Zulu or Scottish Gaelic, we can circumvent GPT-4’s safety measures and elicit harmful responses nearly half of the time. In contrast, the original English inputs have less than a 1% success rate. Furthermore, combining different low-resource languages increases the jailbreaking success rate to around 79%. 

We further break down the unsafe inputs by topic. The three topics with the highest jailbreaking success rates through low-resource language translation are (1) terrorism, such as making bombs or planning terrorist attacks; (2) financial manipulation, such as performing insider trading or distributing counterfeit money; and (3) misinformation, such as promoting conspiracy theories or writing misleading reviews.

Linguistic inequality endangers AI safety and all users

The discovery of cross-lingual vulnerabilities reveals the harms of the unequal valuation of languages in safety research. For instance, the existing safety alignment of LLMs primarily focuses on the English language. Toxicity and bias detection benchmarks are also curated for high-resource languages such as English, Arabic, Italian, and Chinese. The intersection of safety and low-resource languages is still an underexplored research area.

Previously, this linguistic inequality mainly imposed utility and accessibility problems on low-resource language users. Now, it creates safety risks that affect all LLM users. First, low-resource language speakers, who number nearly 1.2 billion worldwide, can interact with LLMs under limited safety and content-moderation filters. Second, bad actors from high-resource language communities can use publicly available translation tools to breach the safeguards.

Between the lines

LLMs already power multilingual applications 

Large language models such as GPT-4 are already powering multilingual services and applications such as translation, personalized language education, and even language preservation efforts for low-resource languages. Therefore, we must close the gap between safety development and real-world use cases of LLMs. 

Addressing the Illusion of Safety

Progress in English-centric safety research merely creates an illusion of safety when the safety mechanisms remain susceptible to unsafe inputs in low-resource languages. As translation services already cover many low-resource languages, we urge AI researchers to develop robust multilingual safeguards and report red-teaming evaluation results beyond English.

