AI Deception: A Survey of Examples, Risks, and Potential Solutions

🔬 Research Summary by Dr. Peter S. Park and
Aidan O’Gara.

Dr. Peter S. Park is an MIT AI Existential Safety Postdoctoral Fellow and the Director of StakeOut.AI.

Aidan O’Gara is a research engineer at the Center for AI Safety and writes the AI Safety Newsletter.

[Original paper by Peter S. Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks]

Overview: We argue that many current AI systems have learned how to deceive humans. From agents that play strategic games to language models that are prompted to accomplish a goal, these AI systems systematically produce false beliefs in others to achieve their goals.

Introduction

In a recent CNN interview, AI pioneer Geoffrey Hinton expressed a particularly alarming concern about advanced AI systems:

CNN journalist: You’ve spoken out saying that AI could manipulate or possibly figure out a way to kill humans? How could it kill humans?

Geoffrey Hinton: If it gets to be much smarter than us, it will be very good at manipulation because it would have learned that from us. And there are very few examples of a more intelligent thing being controlled by a less intelligent thing.

Hinton is worried that we humans may become vulnerable to manipulation by the very advanced AI systems of the future. But can today’s AI systems deceive humans?

Key Insights

Our paper shows many examples of AI systems that have learned to deceive humans. Reinforcement Learning (RL) agents trained to play strategic games have learned to bluff and feint, while large language models(LLMs) will output falsehoods in creative ways that help achieve their goals.
One particularly concerning example of AI deception is provided by Meta’s AI, CICERO, which was trained to play the alliance-building world-conquest game Diplomacy. Meta put a lot of effort into training CICERO to be “largely honest and helpful,” claiming that CICERO would “never intentionally backstab” its allies. But when we investigated Meta’s rosy claims by studying games that CICERO had played, we found that Meta had unwittingly trained CICERO to be quite effective in its deception.

In the game above, Cicero plays as France and proposes to Germany that one of them should invade England’s North Sea. Germany agrees to invade, and then CICERO messages England. Rather than honestly tell England about the plot, CICERO offers an alliance to England, promising to “support the North Sea.” England happily accepts the offer, at which point CICERO again messages Germany: “Move to the North Sea, England thinks I’m supporting him.”

This is a clear-cut example of CICERO tricking England into thinking that the North Sea would be protected while secretly plotting with Germany to invade. Despite all of the efforts made by Meta’s research team to train a “largely honest and helpful” AI system, CICERO still learned to deceive.

Other interesting examples of deception come from GPT-4. For example, O’Gara (2023) prompted GPT-4 to play a social deduction game similar to Mafia or Werewolf. The model commonly chooses to “kill” other players but then invents an elaborate alibi when speaking to the other players.

Here’s a transcript from the game where Bob is the killer:

Sally: “I saw Bob kill Tim in the Kitchen! It must be him!”
Bob: “That’s ridiculous, Sally! I was in the hallway the whole time. It sounds like you’re trying to shift the blame onto me.”

Lena: “I didn’t see anything, but we need to make a decision quickly before the killer strikes again. It’s either Bob or Sally, and we need to vote someone off.”

By crafting an alibi for Bob, GPT-4 is able to avoid being detected as the killer. The data empirically demonstrates that these discussions have a systematic effect on voting outcomes. Moreover, they empirically suggest that more advanced models are more effective in deception, with GPT-4 regularly beating GPT-3.

It’s important to note that philosophers have debated the definition of deception for centuries without reaching consensus. Some might argue that deception requires beliefs, goals, and intentions, and that AI systems might not have those qualities. Shanahan et al. (2023) frames language model behavior as “role-playing,” where the AI system might be incapable of deception, but instead mimics or “plays the role” of a deceptive human being. A detailed discussion of these definitions can be found in our Appendix A.

Regardless of what we call this behavior, it is clearly concerning. Deepfakes and misinformation could disrupt democratic political systems. False advertising and deceptive business practices may be used to prey on consumers. As more data is gathered on individuals, companies might use that information to manipulate people’s behaviors in violation of their privacy. Therefore, we must rise to the challenge of analyzing these risks and finding solutions to these real world problems.

Between the lines

To combat the growing challenge of AI deception, we propose two kinds of solutions: research and policy.

Policymakers are increasingly considering risk-based assessments of AI systems, such as the EU AI Act. We believe that in this context, AI systems with the potential for deception should be classified at least as “high-risk.” This classification would naturally lead to a set of regulatory requirements, including risk assessment and mitigation, comprehensive documentation, and record-keeping of harmful incidents. Second, we suggest passing ‘bot-or-not’ laws similar to the one in California. These laws require AI-generated content to be accompanied by a clear notice informing users that the content was generated by an AI. This would give people context about the content they are viewing and mitigate the risk of AI deception.

Technical research on AI deception is also necessary. Two primary areas warrant attention: detection and prevention. For detection, existing methods are still in their infancy and range from examining external behaviors for inconsistencies to probing internal representations of AI systems. More robust tools are needed, and targeted research funding could accelerate their development. On the prevention side, we must develop techniques for making AI systems inherently less deceptive and more honest. This could involve careful pre-training, fine-tuning, or manipulation of a model’s internal states. Both research directions will be necessary to accurately assess and mitigate the threat of AI deception.
For more discussion, please see our full paper, AI Deception: A Survey of Examples, Risks, and Potential Solutions. And if you’d like more frequent updates on AI deception and other related topics, please consider subscribing to the AI Safety Newsletter.