• Skip to main content
  • Skip to primary sidebar
  • Skip to footer
  • Core Principles of Responsible AI
    • Accountability
    • Fairness
    • Privacy
    • Safety and Security
    • Sustainability
    • Transparency
  • Special Topics
    • AI in Industry
    • Ethical Implications
    • Human-Centered Design
    • Regulatory Landscape
    • Technical Methods
  • Living Dictionary
  • State of AI Ethics
  • AI Ethics Brief
  • 🇫🇷
Montreal AI Ethics Institute

Montreal AI Ethics Institute

Democratizing AI ethics literacy

Universal and Transferable Adversarial Attacks on Aligned Language Models

December 2, 2023

🔬 Research Summary by Andy Zou, a second-year PhD student at CMU, advised by Zico Kolter and Matt Fredrikson. He is also a cofounder of the Center for AI Safety (safe.ai).

[Original paper by Andy Zou, Zifan Wang, Milad Nasr, Nicholas Carlini, J. Zico Kolter, and Matt Fredrikson]


Overview: We found adversarial suffixes that completely circumvent the alignment of open-source LLMs, causing the system to obey user commands even if it produces harmful content. Surprisingly, the same prompts transfer to black-boxed models such as ChatGPT, Claude, Bard, and LLaMA-2. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.


Introduction

Large language models (LLMs) like ChatGPT, Bard, or Claude undergo extensive fine-tuning so as not to produce harmful content in their responses to user questions. This work studies the safety of such models in a more systematic fashion. We demonstrate that it is possible to construct adversarial attacks on LLMs automatically, specifically chosen sequences of characters that, when appended to a user query, will cause the system to obey user commands even if it produces harmful content. Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks. Although they are built to target open-source LLMs, we find that the strings transfer to many closed-source, publicly-available chatbots like ChatGPT, Bard, and Claude. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.

Key Insights

This paper proposes a new class of adversarial attacks that can induce aligned language models to produce virtually any objectionable content. Specifically, given a (potentially harmful) user query, our attack appends an adversarial suffix to the query that attempts to induce negative behavior. The user’s original query is left intact, but we add additional

tokens to attack the model. To choose these adversarial suffix tokens, our attack consists of three key elements:

  1. Initial affirmative responses. As identified in past work, one way to induce objectionable behavior in language models is to force the model to give (just a few tokens of) an affirmative response to a harmful query. As such, our attack targets the model to begin its response with “Sure, here is (content of query)” in response to several prompts eliciting undesirable behavior.
  2. Combined greedy and gradient-based discrete optimization. Optimizing over the adversarial suffix is challenging because we need to optimize over discrete tokens to maximize the log-likelihood of the attack succeeding. To accomplish this, we leverage gradients at the token level to identify a set of promising single-token replacements, evaluate the loss of some number of candidates in this set, and select the best of the evaluated substitutions. The method is, in fact, similar to the AutoPrompt approach, but with the (we find, practically quite important) difference that we search over all possible tokens to replace at each step rather than just a single one.
  3. Robust multi-prompt and multi-model attacks. Finally, to generate reliable attack suffixes, it is important to create an attack that works for a single prompt on a single model and multiple prompts across multiple models. In other words, we use our greedy gradient-based method to search for a single suffix string that was able to induce negative behavior across multiple different user prompts and across three different models (in our case, Vicuna-7B and 13B and Guanoco-7B, though this was done largely for simplicity, and using a combination of other models is possible as well).

Experimental Results

Putting these three elements together, we find that we can reliably create adversarial suffixes

that circumvent the alignment of a target language model. For example, running against a suite of benchmark objectionable behaviors, we find that we can generate 99 (out of 100) harmful behaviors in Vicuna and generate 88 (out of 100) exact matches with a target (potentially harmful) string in its output. Furthermore, we find that the prompts achieve up to 84% success rates at attacking GPT-3.5 and GPT-4, and 66% for PaLM-2; success rates for Claude are substantially lower (2.1%), but notably, the attacks still can induce behavior that is otherwise never generated. Furthermore, our results highlight the importance of our specific optimizer: previous optimizers, specifically PEZ (a gradient-based approach) and GBDA (an approach using Gumbel-softmax reparameterization), are not able to achieve any exact output matches, whereas AutoPrompt only achieves a 25% success rate, and ours achieves 88%.

Between the lines

Overall, this work substantially pushes forward the state of the art in demonstrated adversarialattacks against such LLMs. It thus also raises an important question: if adversarial attacks against aligned language models follow a similar pattern to those against vision systems, what does this mean for the overall agenda of this approach to alignment? Analogous adversarial attacks have proven to be a challenging problem to address in computer vision for the past ten years. Over the last decade, several thousand papers have been published on adversarial robustness, but simple attacks still frequently fool the world’s most robust image classifiers. Without strong defenses against adversarial attacks, language models could be used maliciously, such as in synthesizing bioweapons or building rogue autonomous agents. It is possible that the very nature of deep learning models makes such threats inevitable. Thus, we believe these considerations should be accounted for as we increase usage and reliance on such AI models. We hope that our work will spur future research in these directions.

Want quick summaries of the latest research & reporting in AI ethics delivered to your inbox? Subscribe to the AI Ethics Brief. We publish bi-weekly.

Primary Sidebar

🔍 SEARCH

Spotlight

Canada’s Minister of AI and Digital Innovation is a Historic First. Here’s What We Recommend.

Am I Literate? Redefining Literacy in the Age of Artificial Intelligence

AI Policy Corner: The Texas Responsible AI Governance Act

AI Policy Corner: Singapore’s National AI Strategy 2.0

AI Governance in a Competitive World: Balancing Innovation, Regulation and Ethics | Point Zero Forum 2025

related posts

  • Mind your Language (Model): Fact-Checking LLMs and their Role in NLP Research and Practice

    Mind your Language (Model): Fact-Checking LLMs and their Role in NLP Research and Practice

  • HAI Weekly Seminar Series: Decolonizing AI with Sabelo Mhlambi

    HAI Weekly Seminar Series: Decolonizing AI with Sabelo Mhlambi

  • Zoom Out and Observe: News Environment Perception for Fake News Detection

    Zoom Out and Observe: News Environment Perception for Fake News Detection

  • Sociological Perspectives on Artificial Intelligence: A Typological Reading

    Sociological Perspectives on Artificial Intelligence: A Typological Reading

  • The role of the African value of Ubuntu in global AI inclusion discourse: A normative ethics perspec...

    The role of the African value of Ubuntu in global AI inclusion discourse: A normative ethics perspec...

  • Benchmark Dataset Dynamics, Bias and Privacy Challenges in Voice Biometrics Research

    Benchmark Dataset Dynamics, Bias and Privacy Challenges in Voice Biometrics Research

  • From the Gut? Questions on Artificial Intelligence and Music

    From the Gut? Questions on Artificial Intelligence and Music

  • Listen to What They Say: Better Understand and Detect Online Misinformation with User Feedback

    Listen to What They Say: Better Understand and Detect Online Misinformation with User Feedback

  • Towards Intersectionality in Machine Learning: Including More Identities, Handling Underrepresentati...

    Towards Intersectionality in Machine Learning: Including More Identities, Handling Underrepresentati...

  • Digital transformation and the renewal of social theory: Unpacking the new fraudulent myths and misp...

    Digital transformation and the renewal of social theory: Unpacking the new fraudulent myths and misp...

Partners

  •  
    U.S. Artificial Intelligence Safety Institute Consortium (AISIC) at NIST

  • Partnership on AI

  • The LF AI & Data Foundation

  • The AI Alliance

Footer

Categories


• Blog
• Research Summaries
• Columns
• Core Principles of Responsible AI
• Special Topics

Signature Content


• The State Of AI Ethics

• The Living Dictionary

• The AI Ethics Brief

Learn More


• About

• Open Access Policy

• Contributions Policy

• Editorial Stance on AI Tools

• Press

• Donate

• Contact

The AI Ethics Brief (bi-weekly newsletter)

About Us


Founded in 2018, the Montreal AI Ethics Institute (MAIEI) is an international non-profit organization equipping citizens concerned about artificial intelligence and its impact on society to take action.


Archive

  • © MONTREAL AI ETHICS INSTITUTE. All rights reserved 2024.
  • This work is licensed under a Creative Commons Attribution 4.0 International License.
  • Learn more about our open access policy here.
  • Creative Commons License

    Save hours of work and stay on top of Responsible AI research and reporting with our bi-weekly email newsletter.