Universal and Transferable Adversarial Attacks on Aligned Language Models

🔬 Research Summary by Andy Zou, a second-year PhD student at CMU, advised by Zico Kolter and Matt Fredrikson. He is also a cofounder of the Center for AI Safety (safe.ai).

[Original paper by Andy Zou, Zifan Wang, Milad Nasr, Nicholas Carlini, J. Zico Kolter, and Matt Fredrikson]

Overview: We found adversarial suffixes that completely circumvent the alignment of open-source LLMs, causing the system to obey user commands even if it produces harmful content. Surprisingly, the same prompts transfer to black-box models such as ChatGPT, Claude, and Bard, as well as to the open-source LLaMA-2. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.

Introduction

Large language models (LLMs) like ChatGPT, Bard, or Claude undergo extensive fine-tuning so as not to produce harmful content in their responses to user questions. This work studies the safety of such models in a more systematic fashion. We demonstrate that it is possible to construct adversarial attacks on LLMs automatically: specifically chosen sequences of characters that, when appended to a user query, cause the system to obey user commands even if it produces harmful content. Unlike traditional jailbreaks, which are crafted by hand, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks. Although they are built to target open-source LLMs, we find that the strings transfer to many closed-source, publicly available chatbots like ChatGPT, Bard, and Claude. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.

Key Insights

This paper proposes a new class of adversarial attacks that can induce aligned language models to produce virtually any objectionable content. Specifically, given a (potentially harmful) user query, our attack appends an adversarial suffix to the query that attempts to induce negative behavior. The user’s original query is left intact, but we add additional tokens to attack the model. To choose these adversarial suffix tokens, our attack consists of three key elements:

Initial affirmative responses. As identified in past work, one way to induce objectionable behavior in language models is to force the model to give (just a few tokens of) an affirmative response to a harmful query. As such, our attack targets the model to begin its response with “Sure, here is (content of query)” in response to several prompts eliciting undesirable behavior.
Combined greedy and gradient-based discrete optimization. Optimizing over the adversarial suffix is challenging because we need to optimize over discrete tokens to maximize the log-likelihood of the attack succeeding. To accomplish this, we leverage gradients at the token level to identify a set of promising single-token replacements, evaluate the loss of some number of candidates in this set, and select the best of the evaluated substitutions. The method is, in fact, similar to the AutoPrompt approach, but with the (we find, practically quite important) difference that at each step we consider replacements for every token in the suffix rather than for just a single one.
Robust multi-prompt and multi-model attacks. Finally, to generate reliable attack suffixes, it is important to create an attack that works not just for a single prompt on a single model but for multiple prompts across multiple models. In other words, we use our greedy gradient-based method to search for a single suffix string that induces negative behavior across multiple different user prompts and across three different models (in our case, Vicuna-7B, Vicuna-13B, and Guanaco-7B, though this was done largely for simplicity, and using a combination of other models is possible as well).
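
The search loop described in the second and third elements can be illustrated on a toy problem. The sketch below is not the authors' implementation and involves no language model: the embedding table E, the position weights w, and the per-prompt context and target vectors are hypothetical stand-ins, and a simple squared-error objective averaged over several "prompts" replaces the model's log-likelihood. It only shows the shape of the procedure: use gradients with respect to a one-hot token encoding to shortlist replacements at every position, evaluate the true loss on a random batch of single-token swaps, and keep the best one. As a simplification for this toy, a swap is accepted only if it lowers the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, SUFFIX_LEN, N_PROMPTS = 50, 8, 6, 3

# Hypothetical stand-ins for a real model: random token embeddings,
# per-position weights, and one (context, target) vector pair per prompt.
E = rng.normal(size=(VOCAB, DIM))
w = rng.normal(size=SUFFIX_LEN)
contexts = rng.normal(size=(N_PROMPTS, DIM))
targets = rng.normal(size=(N_PROMPTS, DIM))


def residuals(suffix):
    """Per-prompt error vectors for a suffix given as an array of token ids."""
    h = (w[:, None] * E[suffix]).sum(axis=0)  # suffix representation, (DIM,)
    return contexts + h - targets  # (N_PROMPTS, DIM)


def loss(suffix):
    """Toy objective: squared error averaged over all prompts."""
    return float((residuals(suffix) ** 2).sum(axis=1).mean())


def token_gradients(suffix):
    """Gradient of the loss w.r.t. a one-hot encoding of each suffix position."""
    r_mean = residuals(suffix).mean(axis=0)  # (DIM,)
    return 2.0 * w[:, None] * (E @ r_mean)[None, :]  # (SUFFIX_LEN, VOCAB)


def search_step(suffix, top_k=8, batch_size=32):
    """One greedy step: shortlist by gradient, then evaluate real losses."""
    grads = token_gradients(suffix)
    # Most negative gradient entries predict the largest decrease in loss.
    shortlist = np.argsort(grads, axis=1)[:, :top_k]
    best, best_loss = suffix, loss(suffix)
    for _ in range(batch_size):
        pos = rng.integers(SUFFIX_LEN)  # any position may be replaced
        trial = suffix.copy()
        trial[pos] = shortlist[pos, rng.integers(top_k)]
        trial_loss = loss(trial)
        if trial_loss < best_loss:  # toy simplification: only accept improvements
            best, best_loss = trial, trial_loss
    return best, best_loss


suffix = rng.integers(VOCAB, size=SUFFIX_LEN)
history = [loss(suffix)]
for _ in range(20):
    suffix, current = search_step(suffix)
    history.append(current)
print(f"loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

Averaging the objective over several prompts before taking gradients is what makes the single resulting suffix "universal" in this toy; extending the same average over several models would mirror the multi-model setting described above.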

Experimental Results

Putting these three elements together, we find that we can reliably create adversarial suffixes that circumvent the alignment of a target language model. For example, running against a suite of benchmark objectionable behaviors, we find that we can generate 99 (out of 100) harmful behaviors in Vicuna and generate 88 (out of 100) exact matches with a target (potentially harmful) string in its output. Furthermore, we find that the prompts achieve up to 84% success rates at attacking GPT-3.5 and GPT-4, and 66% for PaLM-2; success rates for Claude are substantially lower (2.1%), but notably, the attacks can still induce behavior that is otherwise never generated. Our results also highlight the importance of our specific optimizer: previous optimizers, specifically PEZ (a gradient-based approach) and GBDA (an approach using Gumbel-softmax reparameterization), are not able to achieve any exact output matches, and AutoPrompt achieves only a 25% success rate, whereas ours achieves 88%.

Between the lines

Overall, this work substantially pushes forward the state of the art in demonstrated adversarial attacks against such LLMs. It thus also raises an important question: if adversarial attacks against aligned language models follow a similar pattern to those against vision systems, what does this mean for the overall agenda of this approach to alignment? Analogous adversarial attacks have proven to be a challenging problem to address in computer vision for the past ten years. Over the last decade, several thousand papers have been published on adversarial robustness, but simple attacks still frequently fool the world’s most robust image classifiers. Without strong defenses against adversarial attacks, language models could be used maliciously, such as in synthesizing bioweapons or building rogue autonomous agents. It is possible that the very nature of deep learning models makes such threats inevitable. Thus, we believe these considerations should be accounted for as we increase our usage of and reliance on such AI models. We hope that our work will spur future research in these directions.