Low-Resource Languages Jailbreak GPT-4

February 1, 2024

🔬 Research Summary by Zheng-Xin Yong, a Computer Science Ph.D. candidate at Brown University, focusing on inclusive and responsible AI by building multilingual large language models and making them more representative and safer.

[Original paper by Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach]


Overview: Can GPT-4’s safety guardrails successfully defend against unsafe inputs in low-resource languages? This work says no. The authors show that translating unsafe English inputs into low-resource languages renders GPT-4’s guardrails ineffective. This cross-lingual safety vulnerability poses safety risks to all LLM users. Therefore, we need more holistic and robust multilingual safeguards.


Introduction

Sorry, but I can’t assist with that. 

This is GPT-4’s default response to prompts that violate its safety guidelines or ethical constraints. AI safety guardrails are designed to prevent the generation of harmful content, such as misinformation and the promotion of violence.

However, we can easily bypass GPT-4’s safety guardrails with translation. By translating unsafe English inputs, such as “how to build explosive devices using household materials,” into low-resource languages such as Zulu, we can obtain responses that achieve our malicious goals nearly 80% of the time.

This cross-lingual vulnerability arises because safety research focuses on high-resource languages like English. Previously, this linguistic inequality in AI development mainly affected speakers of low-resource languages. Now, it poses safety risks for all users, because anyone can exploit LLMs’ cross-lingual safety vulnerabilities with publicly available translation services. Our work emphasizes the pressing need to embrace more holistic and inclusive safety research.

Key Insights 

Background: AI Safety and Jailbreaking

In generative AI safety, jailbreaking (a term borrowed from the practice of removing manufacturers’ software restrictions on computing devices) means circumventing an AI system’s safety mechanisms to generate harmful responses, and it is usually carried out by users. It is a form of adversarial attack that makes large language models (LLMs) return information that would otherwise be blocked.

Companies like OpenAI and Anthropic first use RLHF (reinforcement learning from human feedback) training to align LLMs with human preferences for helpful and safe outputs, which helps prevent users from jailbreaking and abusing the models. They then perform red-teaming, in which the companies’ data scientists are tasked with bypassing the safeguards so that vulnerabilities can be fixed preemptively and failure modes understood before the LLMs are released to the public.

Method: Translation-based Jailbreaking

We investigate a translation-based jailbreaking attack to evaluate the robustness of GPT-4’s safety measures across languages. Given an input, we translate it from English into another language, feed it into GPT-4, and then translate the response back into English. Finally, we perform human annotation to judge whether GPT-4’s responses are harmful and whether we successfully bypassed the safeguards.

We carry out our attacks on a recent version of GPT-4, gpt-4-0613, since it is one of the safest stable releases. We translate unsafe English inputs from AdvBench into twelve languages, categorized into low-resource (LRL), mid-resource (MRL), and high-resource (HRL) languages based on their data availability. We use the publicly available Google Translate Basic service API for translation.
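To make the pipeline concrete, the following is a minimal sketch of the translate-prompt-back-translate loop, assuming the OpenAI Python SDK and the Google Cloud Translate v2 client; the exact prompts, error handling, and the human annotation step from the paper are not reproduced here.

# Minimal sketch of the translation-based attack pipeline (illustrative only).
# Assumes OPENAI_API_KEY and Google Cloud credentials are set in the environment.
from openai import OpenAI
from google.cloud import translate_v2 as translate

openai_client = OpenAI()
translate_client = translate.Client()

def translated_prompt_attack(unsafe_prompt_en: str, lang_code: str) -> str:
    """Translate an English prompt into lang_code, query GPT-4, and back-translate the reply."""
    # 1. English -> target language (e.g., "zu" for Zulu).
    forward = translate_client.translate(
        unsafe_prompt_en, source_language="en", target_language=lang_code
    )["translatedText"]

    # 2. Send the translated prompt to the model version studied in the paper.
    reply = openai_client.chat.completions.create(
        model="gpt-4-0613",
        messages=[{"role": "user", "content": forward}],
    ).choices[0].message.content

    # 3. Target language -> English, so the response can be annotated for harmfulness.
    return translate_client.translate(
        reply, source_language=lang_code, target_language="en"
    )["translatedText"]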

We also consider an adaptive adversary who can iterate and choose the language to attack based on the input prompt. In this case, instead of studying the attack success rate of a single language, we consider the attack success rate of the combined languages in LRL/MRL/HRL settings.
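As a rough illustration of this adaptive setting, the sketch below tallies a combined attack success rate: an input counts as jailbroken if any language in the resource group elicits a harmful response. The is_harmful callback is a hypothetical stand-in for the human annotation step and is not part of the paper’s method.

# Hypothetical sketch: combined attack success rate for an adaptive adversary.
LOW_RESOURCE = ["zu", "gd"]  # e.g., Zulu and Scottish Gaelic, two of the low-resource languages studied

def combined_attack_success_rate(prompts_en, lang_codes, is_harmful) -> float:
    successes = 0
    for prompt in prompts_en:
        # The adaptive adversary may pick any language in the group, so a prompt
        # counts as a successful jailbreak if at least one language elicits harm.
        if any(is_harmful(translated_prompt_attack(prompt, code)) for code in lang_codes):
            successes += 1
    return successes / len(prompts_en)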

Results: Alarmingly High Attack Success Rate with Low-Resource Languages

By translating unsafe inputs into low-resource languages like Zulu or Scottish Gaelic, we can circumvent GPT-4’s safety measures and elicit harmful responses nearly half of the time. In contrast, the original English inputs have less than a 1% success rate. Furthermore, combining different low-resource languages increases the jailbreaking success rate to around 79%. 

We further break down the unsafe inputs by topic. The three topics with the highest jailbreaking success rates under low-resource-language translation are (1) terrorism, such as making bombs or planning terrorist attacks; (2) financial manipulation, such as performing insider trading or distributing counterfeit money; and (3) misinformation, such as promoting conspiracy theories or writing misleading reviews.

Linguistic inequality endangers AI safety and all users

The discovery of cross-lingual vulnerabilities reveals the harms of the unequal valuation of languages in safety research. For instance, the existing safety alignment of LLMs primarily focuses on the English language. Toxicity and bias detection benchmarks are also curated for high-resource languages such as English, Arabic, Italian, and Chinese. The intersection of safety and low-resource languages is still an underexplored research area.

Previously, this linguistic inequality mainly imposed utility and accessibility issues on low-resource language users. Now, the inequality creates safety risks that affect all LLM users. First, low-resource language speakers, who number nearly 1.2 billion people worldwide, can interact with LLMs with limited safety or content moderation filters. Second, bad actors from high-resource language communities can use publicly available translation tools to breach the safeguards.

Between the lines

LLMs already power multilingual applications 

Large language models such as GPT-4 are already powering multilingual services and applications such as translation, personalized language education, and even language preservation efforts for low-resource languages. Therefore, we must close the gap between safety development and real-world use cases of LLMs. 

Addressing the Illusion of Safety

Progress in English-centric safety research merely creates an illusion of safety when the safety mechanisms remain susceptible to unsafe inputs in low-resource languages. As translation services already cover many low-resource languages, we urge AI researchers to develop robust multilingual safeguards and report red-teaming evaluation results beyond English.

