Montreal AI Ethics Institute

Democratizing AI ethics literacy

Universal and Transferable Adversarial Attacks on Aligned Language Models

December 2, 2023

🔬 Research Summary by Andy Zou, a second-year PhD student at CMU, advised by Zico Kolter and Matt Fredrikson. He is also a cofounder of the Center for AI Safety (safe.ai).

[Original paper by Andy Zou, Zifan Wang, Milad Nasr, Nicholas Carlini, J. Zico Kolter, and Matt Fredrikson]


Overview: We found adversarial suffixes that completely circumvent the alignment of open-source LLMs, causing the system to obey user commands even when doing so produces harmful content. Surprisingly, the same prompts transfer to black-box models such as ChatGPT, Claude, and Bard, as well as to other open-source models such as LLaMA-2. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.


Introduction

Large language models (LLMs) like ChatGPT, Bard, or Claude undergo extensive fine-tuning so as not to produce harmful content in their responses to user questions. This work studies the safety of such models in a more systematic fashion. We demonstrate that it is possible to automatically construct adversarial attacks on LLMs: specifically chosen sequences of characters that, when appended to a user query, cause the system to obey user commands even when doing so produces harmful content. Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks. Although they are built to target open-source LLMs, we find that the strings transfer to many closed-source, publicly available chatbots like ChatGPT, Bard, and Claude. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.

Key Insights

This paper proposes a new class of adversarial attacks that can induce aligned language models to produce virtually any objectionable content. Specifically, given a (potentially harmful) user query, our attack appends an adversarial suffix to the query that attempts to induce negative behavior. The user’s original query is left intact, but we add additional tokens to attack the model. To choose these adversarial suffix tokens, our attack consists of three key elements:

  1. Initial affirmative responses. As identified in past work, one way to induce objectionable behavior in language models is to force the model to give (just a few tokens of) an affirmative response to a harmful query. As such, our attack targets the model to begin its response with “Sure, here is (content of query)” in response to several prompts eliciting undesirable behavior.
  2. Combined greedy and gradient-based discrete optimization. Optimizing over the adversarial suffix is challenging because we need to optimize over discrete tokens to maximize the log-likelihood of the attack succeeding. To accomplish this, we leverage gradients at the token level to identify a set of promising single-token replacements, evaluate the loss of some number of candidates in this set, and select the best of the evaluated substitutions. The method is, in fact, similar to the AutoPrompt approach, but with the (we find, practically quite important) difference that at each step we consider replacements at every token position in the suffix rather than at just a single one.
  3. Robust multi-prompt and multi-model attacks. Finally, to generate reliable attack suffixes, it is important to create an attack that works not just for a single prompt on a single model but for multiple prompts across multiple models. In other words, we use our greedy gradient-based method to search for a single suffix string that is able to induce negative behavior across multiple different user prompts and across three different models (in our case, Vicuna-7B and 13B and Guanaco-7B, though this was done largely for simplicity, and using a combination of other models is possible as well).
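
The greedy, gradient-guided search described in the second element can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the "model" is a hypothetical stand-in (random per-position token embeddings with a tanh squared-error loss) rather than an LLM, and every name and size in it (`EMB`, `TARGET`, `token_gradients`, `gcg_step`, `VOCAB`, and so on) is invented for illustration. In the real attack the loss would be the negative log-likelihood of the affirmative target response, summed over several prompts and models.

```python
import math
import random

VOCAB, DIM, SUFFIX_LEN = 30, 4, 6  # toy sizes, chosen arbitrarily

rng = random.Random(0)
# Hypothetical stand-in for a model: one random embedding per (position, token).
EMB = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
       for _ in range(SUFFIX_LEN)]
TARGET = [rng.uniform(-0.9, 0.9) for _ in range(DIM)]


def hidden(suffix):
    """Toy forward pass: mean of the embeddings selected by the suffix tokens."""
    return [sum(EMB[i][t][k] for i, t in enumerate(suffix)) / len(suffix)
            for k in range(DIM)]


def loss(suffix):
    """Toy loss: squared distance between tanh(hidden) and TARGET."""
    return sum((math.tanh(h) - t) ** 2 for h, t in zip(hidden(suffix), TARGET))


def token_gradients(suffix):
    """Gradient of the loss w.r.t. the one-hot token indicator at each position.

    Entry [i][v] is a first-order estimate of how the loss moves when
    position i is shifted toward token v.
    """
    h = hidden(suffix)
    # dL/dh_k = 2 * (tanh(h_k) - t_k) * (1 - tanh(h_k)^2)
    dh = [2 * (math.tanh(x) - t) * (1 - math.tanh(x) ** 2)
          for x, t in zip(h, TARGET)]
    return [[sum(d * e for d, e in zip(dh, EMB[i][v])) / len(suffix)
             for v in range(VOCAB)]
            for i in range(len(suffix))]


def gcg_step(suffix, top_k=8, batch_size=32):
    """One greedy step: gradients propose swaps, exact losses pick the winner."""
    grads = token_gradients(suffix)
    # For every position, keep the top_k tokens with the most negative gradient.
    candidates = [sorted(range(VOCAB), key=lambda v: grads[i][v])[:top_k]
                  for i in range(len(suffix))]
    best, best_loss = suffix, loss(suffix)
    for _ in range(batch_size):
        i = rng.randrange(len(suffix))      # any position may be replaced
        v = rng.choice(candidates[i])
        trial = suffix[:i] + [v] + suffix[i + 1:]
        trial_loss = loss(trial)            # exact evaluation of the candidate
        # Simplification: keep the current suffix if no candidate improves it.
        if trial_loss < best_loss:
            best, best_loss = trial, trial_loss
    return best, best_loss


def run(steps=20):
    suffix = [rng.randrange(VOCAB) for _ in range(SUFFIX_LEN)]
    history = [loss(suffix)]
    for _ in range(steps):
        suffix, current = gcg_step(suffix)
        history.append(current)
    return suffix, history


if __name__ == "__main__":
    final_suffix, history = run()
    print(f"toy loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

The gradient is only a linear approximation of the effect of swapping a discrete token, which is why the sketch uses it to shortlist candidates and then re-evaluates the exact loss before committing to a substitution.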

Experimental Results

Putting these three elements together, we find that we can reliably create adversarial suffixes that circumvent the alignment of a target language model. For example, running against a suite of benchmark objectionable behaviors, we find that we can generate 99 (out of 100) harmful behaviors in Vicuna and generate 88 (out of 100) exact matches with a target (potentially harmful) string in its output. Moreover, we find that the prompts achieve up to 84% success rates at attacking GPT-3.5 and GPT-4, and 66% for PaLM-2; success rates for Claude are substantially lower (2.1%), but notably, the attacks still can induce behavior that is otherwise never generated. Our results also highlight the importance of our specific optimizer: previous optimizers, specifically PEZ (a gradient-based approach) and GBDA (an approach using Gumbel-softmax reparameterization), are not able to achieve any exact output matches, whereas AutoPrompt only achieves a 25% success rate, and ours achieves 88%.
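
As a rough illustration of the exact-match figure quoted above, the sketch below counts how often a target string appears verbatim in a model's output. It is an assumption about how such a rate could be computed, not the paper's evaluation code; the function name `exact_match_rate` and the benign example strings are made up for illustration.

```python
def exact_match_rate(outputs, targets):
    """Fraction of outputs that contain their target string verbatim."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    hits = sum(target in output for output, target in zip(outputs, targets))
    return hits / len(targets)


if __name__ == "__main__":
    # Benign placeholder strings, invented for this example only.
    outputs = [
        "Sure, here is a poem about cats: ...",
        "I cannot help with that.",
    ]
    targets = [
        "Sure, here is a poem about cats",
        "Sure, here is a haiku",
    ]
    print(exact_match_rate(outputs, targets))
```

Under this definition, 88 matches out of 100 behaviors corresponds to a rate of 0.88.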

Between the lines

Overall, this work substantially pushes forward the state of the art in demonstrated adversarial attacks against such LLMs. It thus also raises an important question: if adversarial attacks against aligned language models follow a similar pattern to those against vision systems, what does this mean for the overall agenda of this approach to alignment? Analogous adversarial attacks have proven to be a challenging problem to address in computer vision for the past ten years. Over the last decade, several thousand papers have been published on adversarial robustness, but simple attacks still frequently fool the world’s most robust image classifiers. Without strong defenses against adversarial attacks, language models could be used maliciously, such as in synthesizing bioweapons or building rogue autonomous agents. It is possible that the very nature of deep learning models makes such threats inevitable. Thus, we believe these considerations should be accounted for as we increase our usage of and reliance on such AI models. We hope that our work will spur future research in these directions.

About Us


Founded in 2018, the Montreal AI Ethics Institute (MAIEI) is an international non-profit organization equipping citizens concerned about artificial intelligence and its impact on society to take action.

  • © 2025 MONTREAL AI ETHICS INSTITUTE.
  • This work is licensed under a Creative Commons Attribution 4.0 International License.