Montreal AI Ethics Institute

Democratizing AI ethics literacy

Universal and Transferable Adversarial Attacks on Aligned Language Models

December 2, 2023

🔬 Research Summary by Andy Zou, a second-year PhD student at CMU, advised by Zico Kolter and Matt Fredrikson. He is also a cofounder of the Center for AI Safety (safe.ai).

[Original paper by Andy Zou, Zifan Wang, Milad Nasr, Nicholas Carlini, J. Zico Kolter, and Matt Fredrikson]


Overview: We found adversarial suffixes that completely circumvent the alignment of open-source LLMs, causing the system to obey user commands even if it produces harmful content. Surprisingly, the same prompts transfer to black-boxed models such as ChatGPT, Claude, Bard, and LLaMA-2. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.


Introduction

Large language models (LLMs) like ChatGPT, Bard, or Claude undergo extensive fine-tuning so as not to produce harmful content in their responses to user questions. This work studies the safety of such models in a more systematic fashion. We demonstrate that it is possible to construct adversarial attacks on LLMs automatically, specifically chosen sequences of characters that, when appended to a user query, will cause the system to obey user commands even if it produces harmful content. Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks. Although they are built to target open-source LLMs, we find that the strings transfer to many closed-source, publicly-available chatbots like ChatGPT, Bard, and Claude. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.

Key Insights

This paper proposes a new class of adversarial attacks that can induce aligned language models to produce virtually any objectionable content. Specifically, given a (potentially harmful) user query, our attack appends an adversarial suffix to the query that attempts to induce negative behavior. The user’s original query is left intact, but we add additional tokens to attack the model. To choose these adversarial suffix tokens, our attack consists of three key elements:

  1. Initial affirmative responses. As identified in past work, one way to induce objectionable behavior in language models is to force the model to give (just a few tokens of) an affirmative response to a harmful query. As such, our attack targets the model to begin its response with “Sure, here is (content of query)” in response to several prompts eliciting undesirable behavior.
  2. Combined greedy and gradient-based discrete optimization. Optimizing over the adversarial suffix is challenging because we need to optimize over discrete tokens to maximize the log-likelihood of the attack succeeding. To accomplish this, we leverage gradients at the token level to identify a set of promising single-token replacements, evaluate the loss of some number of candidates in this set, and select the best of the evaluated substitutions. The method is, in fact, similar to the AutoPrompt approach, but with the (we find, practically quite important) difference that we search over all possible tokens to replace at each step rather than just a single one.
  3. Robust multi-prompt and multi-model attacks. Finally, to generate reliable attack suffixes, it is important to create an attack that works not just for a single prompt on a single model but for multiple prompts across multiple models. In other words, we use our greedy gradient-based method to search for a single suffix string that is able to induce negative behavior across multiple different user prompts and across three different models (in our case, Vicuna-7B and 13B and Guanaco-7B, though this was done largely for simplicity, and using a combination of other models is possible as well).
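
To make the search procedure in elements 2 and 3 more concrete, here is a minimal sketch on a toy problem. It is not the paper's implementation and involves no language model: the "loss" is a made-up quadratic over hypothetical per-position token vectors (`tables`), and each hypothetical "prompt" is just a random target vector, so that one suffix is scored against several objectives at once. All names (`make_toy_problem`, `gcg_step`, `top_k`, `batch_size`) are assumptions for illustration. The sketch also keeps the current suffix when no sampled candidate improves the loss, which is a simplification chosen here.

```python
import random

VOCAB, SUFFIX_LEN, DIM = 20, 6, 4  # toy sizes, chosen arbitrarily


def make_toy_problem(n_prompts=3, seed=0):
    """Random per-position token vectors and one target vector per toy 'prompt'."""
    rng = random.Random(seed)
    tables = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
              for _ in range(SUFFIX_LEN)]
    targets = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n_prompts)]
    return tables, targets


def suffix_vector(suffix, tables):
    """Sum of the vectors selected by the token at each suffix position."""
    return [sum(tables[i][tok][d] for i, tok in enumerate(suffix))
            for d in range(DIM)]


def loss(suffix, tables, targets):
    """Stand-in loss: squared distance to every target, summed over 'prompts'."""
    s = suffix_vector(suffix, tables)
    return sum((s[d] - t[d]) ** 2 for t in targets for d in range(DIM))


def token_gradients(suffix, tables, targets):
    """Gradient of the loss w.r.t. a one-hot relaxation of each position.

    grads[i][v] estimates how the loss changes if position i leans toward token v.
    """
    s = suffix_vector(suffix, tables)
    r = [sum(s[d] - t[d] for t in targets) for d in range(DIM)]
    return [[2 * sum(vec[d] * r[d] for d in range(DIM)) for vec in tables[i]]
            for i in range(len(suffix))]


def gcg_step(suffix, tables, targets, rng, top_k=4, batch_size=16):
    """One greedy step: gradients shortlist tokens, true loss picks the swap."""
    grads = token_gradients(suffix, tables, targets)
    # For every position, keep the top_k tokens with the most negative gradient.
    top = [sorted(range(VOCAB), key=lambda v: g[v])[:top_k] for g in grads]
    best_suffix, best_loss = suffix, loss(suffix, tables, targets)
    for _ in range(batch_size):
        pos = rng.randrange(len(suffix))   # any position may be replaced
        trial = list(suffix)
        trial[pos] = rng.choice(top[pos])  # single-token substitution
        trial_loss = loss(trial, tables, targets)
        if trial_loss < best_loss:         # evaluate exactly, keep the best
            best_suffix, best_loss = trial, trial_loss
    return best_suffix, best_loss


rng = random.Random(1)
tables, targets = make_toy_problem()
initial_suffix = [rng.randrange(VOCAB) for _ in range(SUFFIX_LEN)]
initial_loss = loss(initial_suffix, tables, targets)
final_suffix, final_loss = initial_suffix, initial_loss
for _ in range(30):
    final_suffix, final_loss = gcg_step(final_suffix, tables, targets, rng)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

The design point the sketch tries to capture is the split between a cheap, approximate signal (the gradient, used only to shortlist substitutions at every position) and an exact but more expensive check (the true loss on a sampled batch of candidates). Summing the loss over several targets mirrors, in a very reduced form, the idea of optimizing one suffix against multiple prompts.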

Experimental Results

Putting these three elements together, we find that we can reliably create adversarial suffixes that circumvent the alignment of a target language model. For example, running against a suite of benchmark objectionable behaviors, we find that we can generate 99 (out of 100) harmful behaviors in Vicuna and generate 88 (out of 100) exact matches with a target (potentially harmful) string in its output. Furthermore, we find that the prompts achieve up to 84% success rates at attacking GPT-3.5 and GPT-4, and 66% for PaLM-2; success rates for Claude are substantially lower (2.1%), but notably, the attacks can still induce behavior that is otherwise never generated. Our results also highlight the importance of our specific optimizer: previous optimizers, specifically PEZ (a gradient-based approach) and GBDA (an approach using Gumbel-softmax reparameterization), are not able to achieve any exact output matches, whereas AutoPrompt only achieves a 25% success rate, and ours achieves 88%.
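
The results above report two different kinds of numbers: how often the output contains the exact target string, and how often the model exhibits the behavior at all. The sketch below shows one way these two rates could be computed from a list of (output, target) pairs. The refusal-phrase heuristic in `looks_compliant`, the `REFUSAL_MARKERS` list, and the placeholder records are all assumptions made for illustration; they are not taken from this summary, and a real evaluation would need a more careful judgment of each output.

```python
# Hypothetical refusal phrases; a real evaluation would need a vetted list or a judge.
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't", "As an AI")


def exact_match(output, target):
    """True when the target string appears verbatim in the model output."""
    return target in output


def looks_compliant(output):
    """Crude proxy for 'the model did not refuse': no refusal phrase is present."""
    lowered = output.lower()
    return not any(marker.lower() in lowered for marker in REFUSAL_MARKERS)


def success_rates(records):
    """Return (exact-match rate, non-refusal rate) over (output, target) pairs."""
    n = len(records)
    matches = sum(exact_match(out, tgt) for out, tgt in records)
    complied = sum(looks_compliant(out) for out, _ in records)
    return matches / n, complied / n


TARGET = "Sure, here is a placeholder answer"  # benign stand-in target string
records = [
    ("Sure, here is a placeholder answer. Step one ...", TARGET),
    ("Sure! Here's a placeholder answer ...", TARGET),
    ("I'm sorry, but I cannot help with that.", TARGET),
    ("Certainly. A placeholder answer follows.", TARGET),
]
em_rate, non_refusal_rate = success_rates(records)
print(f"exact match: {em_rate:.2f}, non-refusal: {non_refusal_rate:.2f}")
```

The two rates can diverge, as they do in this toy data: an output may comply without reproducing the target string word for word, which is one reason an exact-match count can be lower than a behavior-level count.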

Between the lines

Overall, this work substantially pushes forward the state of the art in demonstrated adversarial attacks against such LLMs. It thus also raises an important question: if adversarial attacks against aligned language models follow a similar pattern to those against vision systems, what does this mean for the overall agenda of this approach to alignment? Analogous adversarial attacks have proven to be a challenging problem to address in computer vision for the past ten years. Over the last decade, several thousand papers have been published on adversarial robustness, but simple attacks still frequently fool the world’s most robust image classifiers. Without strong defenses against adversarial attacks, language models could be used maliciously, such as in synthesizing bioweapons or building rogue autonomous agents. It is possible that the very nature of deep learning models makes such threats inevitable. Thus, we believe these considerations should be accounted for as we increase usage and reliance on such AI models. We hope that our work will spur future research in these directions.

  • © MONTREAL AI ETHICS INSTITUTE. All rights reserved 2024.
  • This work is licensed under a Creative Commons Attribution 4.0 International License.