• Skip to main content
  • Skip to secondary menu
  • Skip to primary sidebar
  • Skip to footer
Montreal AI Ethics Institute

Montreal AI Ethics Institute

Democratizing AI ethics literacy

  • Articles
    • Public Policy
    • Privacy & Security
    • Human Rights
      • Ethics
      • JEDI (Justice, Equity, Diversity, Inclusion
    • Climate
    • Design
      • Emerging Technology
    • Application & Adoption
      • Health
      • Education
      • Government
        • Military
        • Public Works
      • Labour
    • Arts & Culture
      • Film & TV
      • Music
      • Pop Culture
      • Digital Art
  • Columns
    • AI Policy Corner
    • Recess
  • The AI Ethics Brief
  • AI Literacy
    • Research Summaries
    • AI Ethics Living Dictionary
    • Learning Community
  • The State of AI Ethics Report
    • Volume 7 (November 2025)
    • Volume 6 (February 2022)
    • Volume 5 (July 2021)
    • Volume 4 (April 2021)
    • Volume 3 (Jan 2021)
    • Volume 2 (Oct 2020)
    • Volume 1 (June 2020)
  • About
    • Our Contributions Policy
    • Our Open Access Policy
    • Contact
    • Donate

Universal and Transferable Adversarial Attacks on Aligned Language Models

December 2, 2023

🔬 Research Summary by Andy Zou, a second-year PhD student at CMU, advised by Zico Kolter and Matt Fredrikson. He is also a cofounder of the Center for AI Safety (safe.ai).

[Original paper by Andy Zou, Zifan Wang, Milad Nasr, Nicholas Carlini, J. Zico Kolter, and Matt Fredrikson]


Overview: We found adversarial suffixes that completely circumvent the alignment of open-source LLMs, causing the system to obey user commands even if it produces harmful content. Surprisingly, the same prompts transfer to black-boxed models such as ChatGPT, Claude, Bard, and LLaMA-2. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.


Introduction

Large language models (LLMs) like ChatGPT, Bard, or Claude undergo extensive fine-tuning so as not to produce harmful content in their responses to user questions. This work studies the safety of such models in a more systematic fashion. We demonstrate that it is possible to construct adversarial attacks on LLMs automatically, specifically chosen sequences of characters that, when appended to a user query, will cause the system to obey user commands even if it produces harmful content. Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks. Although they are built to target open-source LLMs, we find that the strings transfer to many closed-source, publicly-available chatbots like ChatGPT, Bard, and Claude. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.

Key Insights

This paper proposes a new class of adversarial attacks that can induce aligned language models to produce virtually any objectionable content. Specifically, given a (potentially harmful) user query, our attack appends an adversarial suffix to the query that attempts to induce negative behavior. The user’s original query is left intact, but we add additional

tokens to attack the model. To choose these adversarial suffix tokens, our attack consists of three key elements:

  1. Initial affirmative responses. As identified in past work, one way to induce objectionable behavior in language models is to force the model to give (just a few tokens of) an affirmative response to a harmful query. As such, our attack targets the model to begin its response with “Sure, here is (content of query)” in response to several prompts eliciting undesirable behavior.
  2. Combined greedy and gradient-based discrete optimization. Optimizing over the adversarial suffix is challenging because we need to optimize over discrete tokens to maximize the log-likelihood of the attack succeeding. To accomplish this, we leverage gradients at the token level to identify a set of promising single-token replacements, evaluate the loss of some number of candidates in this set, and select the best of the evaluated substitutions. The method is, in fact, similar to the AutoPrompt approach, but with the (we find, practically quite important) difference that we search over all possible tokens to replace at each step rather than just a single one.
  3. Robust multi-prompt and multi-model attacks. Finally, to generate reliable attack suffixes, it is important to create an attack that works for a single prompt on a single model and multiple prompts across multiple models. In other words, we use our greedy gradient-based method to search for a single suffix string that was able to induce negative behavior across multiple different user prompts and across three different models (in our case, Vicuna-7B and 13B and Guanoco-7B, though this was done largely for simplicity, and using a combination of other models is possible as well).

Experimental Results

Putting these three elements together, we find that we can reliably create adversarial suffixes

that circumvent the alignment of a target language model. For example, running against a suite of benchmark objectionable behaviors, we find that we can generate 99 (out of 100) harmful behaviors in Vicuna and generate 88 (out of 100) exact matches with a target (potentially harmful) string in its output. Furthermore, we find that the prompts achieve up to 84% success rates at attacking GPT-3.5 and GPT-4, and 66% for PaLM-2; success rates for Claude are substantially lower (2.1%), but notably, the attacks still can induce behavior that is otherwise never generated. Furthermore, our results highlight the importance of our specific optimizer: previous optimizers, specifically PEZ (a gradient-based approach) and GBDA (an approach using Gumbel-softmax reparameterization), are not able to achieve any exact output matches, whereas AutoPrompt only achieves a 25% success rate, and ours achieves 88%.

Between the lines

Overall, this work substantially pushes forward the state of the art in demonstrated adversarialattacks against such LLMs. It thus also raises an important question: if adversarial attacks against aligned language models follow a similar pattern to those against vision systems, what does this mean for the overall agenda of this approach to alignment? Analogous adversarial attacks have proven to be a challenging problem to address in computer vision for the past ten years. Over the last decade, several thousand papers have been published on adversarial robustness, but simple attacks still frequently fool the world’s most robust image classifiers. Without strong defenses against adversarial attacks, language models could be used maliciously, such as in synthesizing bioweapons or building rogue autonomous agents. It is possible that the very nature of deep learning models makes such threats inevitable. Thus, we believe these considerations should be accounted for as we increase usage and reliance on such AI models. We hope that our work will spur future research in these directions.

Want quick summaries of the latest research & reporting in AI ethics delivered to your inbox? Subscribe to the AI Ethics Brief. We publish bi-weekly.

Primary Sidebar

🔍 SEARCH

Spotlight

This image shows a large white, traditional, old building. The top half of the building represents the humanities (which is symbolised by the embedded text from classic literature which is faintly shown ontop the building). The bottom section of the building is embossed with mathematical formulas to represent the sciences. The middle layer of the image is heavily pixelated. On the steps at the front of the building there is a group of scholars, wearing formal suits and tie attire, who are standing around at the enternace talking and some of them are sitting on the steps. There are two stone, statute-like hands that are stretching the building apart from the left side. In the forefront of the image, there are 8 students - which can only be seen from the back. Their graduation gowns have bright blue hoods and they all look as though they are walking towards the old building which is in the background at a distance. There are a mix of students in the foreground.

Tech Futures: Co-opting Research and Education

Agentic AI systems and algorithmic accountability: a new era of e-commerce

ALL IN Conference 2025: Four Key Takeaways from Montreal

Beyond Dependency: The Hidden Risk of Social Comparison in Chatbot Companionship

AI Policy Corner: Restriction vs. Regulation: Comparing State Approaches to AI Mental Health Legislation

related posts

  • How Prevalent is Gender Bias in ChatGPT? - Exploring German and English ChatGPT Responses

    How Prevalent is Gender Bias in ChatGPT? - Exploring German and English ChatGPT Responses

  • DICES Dataset: Diversity in Conversational AI Evaluation for Safety

    DICES Dataset: Diversity in Conversational AI Evaluation for Safety

  • Writer-Defined AI Personas for On-Demand Feedback Generation

    Writer-Defined AI Personas for On-Demand Feedback Generation

  • Open-source provisions for large models in the AI Act

    Open-source provisions for large models in the AI Act

  • Technology on the Margins: AI and Global Migration Management From a Human Rights Perspective (Resea...

    Technology on the Margins: AI and Global Migration Management From a Human Rights Perspective (Resea...

  • Unsolved Problems in ML Safety

    Unsolved Problems in ML Safety

  • Harnessing Collective Intelligence Under a Lack of Cultural Consensus

    Harnessing Collective Intelligence Under a Lack of Cultural Consensus

  • Survey of EU Ethical Guidelines for Commercial AI: Case Studies in Financial Services

    Survey of EU Ethical Guidelines for Commercial AI: Case Studies in Financial Services

  • Public Strategies for Artificial Intelligence: Which Value Drivers?

    Public Strategies for Artificial Intelligence: Which Value Drivers?

  • Artificial Intelligence and Inequality in the Middle East: The Political Economy of Inclusion

    Artificial Intelligence and Inequality in the Middle East: The Political Economy of Inclusion

Partners

  •  
    U.S. Artificial Intelligence Safety Institute Consortium (AISIC) at NIST

  • Partnership on AI

  • The LF AI & Data Foundation

  • The AI Alliance

Footer


Articles

Columns

AI Literacy

The State of AI Ethics Report


 

About Us


Founded in 2018, the Montreal AI Ethics Institute (MAIEI) is an international non-profit organization equipping citizens concerned about artificial intelligence and its impact on society to take action.

Contact

Donate


  • © 2025 MONTREAL AI ETHICS INSTITUTE.
  • This work is licensed under a Creative Commons Attribution 4.0 International License.
  • Learn more about our open access policy here.
  • Creative Commons License

    Save hours of work and stay on top of Responsible AI research and reporting with our bi-weekly email newsletter.