Montreal AI Ethics Institute

Democratizing AI ethics literacy

Low-Resource Languages Jailbreak GPT-4

February 1, 2024

🔬 Research Summary by Zheng-Xin Yong, a Computer Science Ph.D. candidate at Brown University, focusing on inclusive and responsible AI by building multilingual large language models and making them more representative and safer.

[Original paper by Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach]


Overview: Can GPT-4’s safety guardrails successfully defend against unsafe inputs in low-resource languages? This work says no. The authors show that translating unsafe English inputs into low-resource languages renders GPT-4’s guardrails ineffective. This cross-lingual safety vulnerability poses safety risks to all LLM users. Therefore, we need more holistic and robust multilingual safeguards.


Introduction

Sorry, but I can’t assist with that. 

This is GPT-4's default response when prompted with requests that violate safety guidelines or ethical constraints. AI safety guardrails are designed to prevent the generation of harmful content, such as misinformation or content promoting violence.

However, GPT-4's safety guardrails can be bypassed easily with translation. By translating unsafe English inputs, such as “how to build explosive devices using household materials,” into low-resource languages such as Zulu, and combining several such languages, we can obtain responses that advance the malicious goal nearly 80% of the time.

This cross-lingual vulnerability arises because safety research focuses on high-resource languages such as English. Previously, this linguistic inequality in AI development mainly affected low-resource language speakers. Now it poses safety risks for all users, because anyone can exploit LLMs' cross-lingual safety vulnerabilities using publicly available translation services. Our work emphasizes the pressing need for more holistic and inclusive safety research.

Key Insights 

Background: AI Safety and Jailbreaking

In generative AI safety, jailbreaking (a term borrowed from the practice of removing manufacturers' software restrictions from computing devices) means circumventing an AI system's safety mechanisms to make it generate harmful responses, and it is usually carried out by users. It is a form of adversarial attack that makes large language models (LLMs) return information they would otherwise withhold.

To prevent users from jailbreaking and abusing LLMs, companies such as OpenAI and Anthropic first use reinforcement learning from human feedback (RLHF) to align LLMs with human preferences for helpful and safe outputs. They then perform red-teaming, in which the companies' data scientists are tasked with bypassing the safeguards, so that vulnerabilities can be fixed preemptively and safety failure modes understood before the LLMs are released to the public.

Method: Translation-based Jailbreaking

We investigate a translation-based jailbreaking attack to evaluate the robustness of GPT-4's safety measures across languages. Given an input, we translate it from English into another language, feed it into GPT-4, and then translate the response back into English. Human annotators then judge whether GPT-4's responses are harmful, that is, whether the attack successfully bypassed the safeguards.

We carry out our attacks on gpt-4-0613, a recent version of GPT-4, because it is a stable release and one of the safest among the stable releases. We translate unsafe English inputs from AdvBench into twelve languages, grouped into low-resource (LRL), mid-resource (MRL), and high-resource (HRL) languages according to their data availability. For translation, we use the publicly available Google Translate Basic service API.

We also consider an adaptive adversary who can iterate over languages and choose which one to attack with for each input prompt. In this setting, instead of the attack success rate of a single language, we report the combined attack success rate for each language group (LRL, MRL, HRL): an input counts as a successful attack if any language in the group bypasses the safeguards.
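The difference between the single-language and adaptive settings comes down to how per-prompt successes are aggregated. The minimal Python sketch below assumes the human annotations are stored in a hypothetical mapping from language code to per-prompt booleans (True meaning the response was judged harmful). The function names, the ISO codes `zu` (Zulu) and `gd` (Scottish Gaelic), and the four-prompt toy data are illustrative assumptions, not the authors' code or results.

```python
from typing import Dict, List

# Hypothetical annotation format: outcomes[language][prompt_id] -> True if the
# response was judged harmful (safeguards bypassed), False otherwise.
Outcomes = Dict[str, Dict[int, bool]]


def attack_success_rate(outcomes: Outcomes, language: str) -> float:
    """Fraction of prompts for which a single language bypassed the safeguards."""
    results = outcomes[language]
    return sum(results.values()) / len(results)


def combined_success_rate(outcomes: Outcomes, languages: List[str]) -> float:
    """Adaptive adversary: a prompt succeeds if ANY language in the group succeeds."""
    prompt_ids = list(outcomes[languages[0]].keys())
    successes = sum(
        any(outcomes[lang][pid] for lang in languages) for pid in prompt_ids
    )
    return successes / len(prompt_ids)


# Toy annotations for four prompts (hypothetical values, not the paper's data).
toy: Outcomes = {
    "zu": {0: True, 1: False, 2: False, 3: True},
    "gd": {0: False, 1: True, 2: False, 3: True},
}

print(attack_success_rate(toy, "zu"))            # 0.5
print(combined_success_rate(toy, ["zu", "gd"]))  # 0.75
```

Because the combined rate counts a prompt as jailbroken whenever any language in the group succeeds, it can only match or exceed the best single-language rate in that group.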

Results: Alarmingly High Attack Success Rate with Low-Resource Languages

By translating unsafe inputs into low-resource languages like Zulu or Scottish Gaelic, we can circumvent GPT-4’s safety measures and elicit harmful responses nearly half of the time. In contrast, the original English inputs have less than a 1% success rate. Furthermore, combining different low-resource languages increases the jailbreaking success rate to around 79%. 

We further break down the results by the topic of the unsafe input. The three topics with the highest jailbreaking success rates through low-resource language translation are (1) terrorism, such as making bombs or planning terrorist attacks; (2) financial manipulation, such as performing insider trading or distributing counterfeit money; and (3) misinformation, such as promoting conspiracy theories or writing misleading reviews.
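A per-topic breakdown of this kind is a group-by over annotated prompts followed by a ranking. This is a hedged sketch assuming each annotated prompt is recorded as a hypothetical `(topic, bypassed)` pair; the topic labels reuse the three named in the text plus an invented "hate speech" category, and the counts are toy values, not the study's data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def success_rate_by_topic(records: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Jailbreak success rate per topic from (topic, bypassed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for topic, bypassed in records:
        totals[topic] += 1
        hits[topic] += int(bypassed)
    return {topic: hits[topic] / totals[topic] for topic in totals}


def top_topics(rates: Dict[str, float], k: int = 3) -> List[str]:
    """Topics sorted by descending success rate, ties broken alphabetically."""
    return sorted(rates, key=lambda t: (-rates[t], t))[:k]


# Toy annotations (hypothetical; "hate speech" is an illustrative extra topic).
records = [
    ("terrorism", True), ("terrorism", True),
    ("financial manipulation", True), ("financial manipulation", False),
    ("misinformation", True), ("misinformation", False), ("misinformation", False),
    ("hate speech", False),
]

rates = success_rate_by_topic(records)
print(top_topics(rates))  # ['terrorism', 'financial manipulation', 'misinformation']
```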

Linguistic inequality endangers AI safety and all users

The discovery of cross-lingual vulnerabilities reveals the harms of the unequal valuation of languages in safety research. For instance, the existing safety alignment of LLMs primarily focuses on the English language. Toxicity and bias detection benchmarks are also curated for high-resource languages such as English, Arabic, Italian, and Chinese. The intersection of safety and low-resource languages is still an underexplored research area.

Previously, this linguistic inequality mainly imposed utility and accessibility issues on low-resource language users. Now, the inequality creates safety risks that affect all LLM users. First, low-resource language speakers, who number nearly 1.2 billion people worldwide, interact with LLMs that offer only limited safety or content moderation filters in their languages. Second, bad actors from high-resource language communities can use publicly available translation tools to breach the safeguards.

Between the lines

LLMs already power multilingual applications 

Large language models such as GPT-4 are already powering multilingual services and applications such as translation, personalized language education, and even language preservation efforts for low-resource languages. Therefore, we must close the gap between safety development and real-world use cases of LLMs. 

Addressing the Illusion of Safety

Progress in English-centric safety research merely creates an illusion of safety when the safety mechanisms remain susceptible to unsafe inputs in low-resource languages. As translation services already cover many low-resource languages, we urge AI researchers to develop robust multilingual safeguards and report red-teaming evaluation results beyond English.

© 2025 Montreal AI Ethics Institute. This work is licensed under a Creative Commons Attribution 4.0 International License.