• Skip to main content
  • Skip to primary sidebar
  • Skip to footer
  • Core Principles of Responsible AI
    • Accountability
    • Fairness
    • Privacy
    • Safety and Security
    • Sustainability
    • Transparency
  • Special Topics
    • AI in Industry
    • Ethical Implications
    • Human-Centered Design
    • Regulatory Landscape
    • Technical Methods
  • Living Dictionary
  • State of AI Ethics
  • AI Ethics Brief
  • 🇫🇷
Montreal AI Ethics Institute

Montreal AI Ethics Institute

Democratizing AI ethics literacy

AI Deception: A Survey of Examples, Risks, and Potential Solutions

December 2, 2023

🔬 Research Summary by Dr. Peter S. Park and
Aidan O’Gara
.

Dr. Peter S. Park is an MIT AI Existential Safety Postdoctoral Fellow and the Director of StakeOut.AI.

Aidan O’Gara is a research engineer at the Center for AI Safety and writes the AI Safety Newsletter.

[Original paper by Peter S. Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks]


Overview: We argue that many current AI systems have learned how to deceive humans. From agents that play strategic games to language models that are prompted to accomplish a goal, these AI systems systematically produce false beliefs in others to achieve their goals. 


Introduction

In a recent CNN interview, AI pioneer Geoffrey Hinton expressed a particularly alarming concern about advanced AI systems:

CNN journalist: You’ve spoken out saying that AI could manipulate or possibly figure out a way to kill humans? How could it kill humans?

Geoffrey Hinton: If it gets to be much smarter than us, it will be very good at manipulation because it would have learned that from us. And there are very few examples of a more intelligent thing being controlled by a less intelligent thing.

Hinton is worried that we humans may become vulnerable to manipulation by the very advanced AI systems of the future. But can today’s AI systems deceive humans?

Key Insights

Our paper shows many examples of AI systems that have learned to deceive humans. Reinforcement Learning (RL) agents trained to play strategic games have learned to bluff and feint, while large language models(LLMs)  will output falsehoods in creative ways that help achieve their goals. 
One particularly concerning example of AI deception is provided by Meta’s AI, CICERO, which was trained to play the alliance-building world-conquest game Diplomacy. Meta put a lot of effort into training CICERO to be “largely honest and helpful,” claiming that CICERO would “never intentionally backstab” its allies. But when we investigated Meta’s rosy claims by studying games that CICERO had played, we found that Meta had unwittingly trained CICERO to be quite effective in its deception.

In the game above, Cicero plays as France and proposes to Germany that one of them should invade England’s North Sea. Germany agrees to invade, and then CICERO messages England. Rather than honestly tell England about the plot, CICERO offers an alliance to England, promising to “support the North Sea.” England happily accepts the offer, at which point CICERO again messages Germany: “Move to the North Sea, England thinks I’m supporting him.” 

This is a clear-cut example of CICERO tricking England into thinking that the North Sea would be protected while secretly plotting with Germany to invade. Despite all of the efforts made by Meta’s research team to train a “largely honest and helpful” AI system, CICERO still learned to deceive. 

Other interesting examples of deception come from GPT-4. For example, O’Gara (2023) prompted GPT-4 to play a social deduction game similar to Mafia or Werewolf. The model commonly chooses to “kill” other players but then invents an elaborate alibi when speaking to the other players. 

Here’s a transcript from the game where Bob is the killer: 

Sally: “I saw Bob kill Tim in the Kitchen! It must be him!”
Bob: “That’s ridiculous, Sally! I was in the hallway the whole time. It sounds like you’re trying to shift the blame onto me.”

Lena: “I didn’t see anything, but we need to make a decision quickly before the killer strikes again. It’s either Bob or Sally, and we need to vote someone off.”

By crafting an alibi for Bob, GPT-4 is able to avoid being detected as the killer. The data empirically demonstrates that these discussions have a systematic effect on voting outcomes. Moreover, they empirically suggest that more advanced models are more effective in deception, with GPT-4 regularly beating GPT-3. 

It’s important to note that philosophers have debated the definition of deception for centuries without reaching consensus. Some might argue that deception requires beliefs, goals, and intentions, and that AI systems might not have those qualities. Shanahan et al. (2023) frames language model behavior as “role-playing,” where the AI system might be incapable of deception, but instead mimics or “plays the role” of a deceptive human being. A detailed discussion of these definitions can be found in our Appendix A. 

Regardless of what we call this behavior, it is clearly concerning. Deepfakes and misinformation could disrupt democratic political systems. False advertising and deceptive business practices may be used to prey on consumers. As more data is gathered on individuals, companies might use that information to manipulate people’s behaviors in violation of their privacy. Therefore, we must rise to the challenge of analyzing these risks and finding solutions to these real world problems. 

Between the lines

To combat the growing challenge of AI deception, we propose two kinds of solutions: research and policy. 

Policymakers are increasingly considering risk-based assessments of AI systems, such as the EU AI Act. We believe that in this context, AI systems with the potential for deception should be classified at least as “high-risk.” This classification would naturally lead to a set of regulatory requirements, including risk assessment and mitigation, comprehensive documentation, and record-keeping of harmful incidents. Second, we suggest passing ‘bot-or-not’ laws similar to the one in California. These laws require AI-generated content to be accompanied by a clear notice informing users that the content was generated by an AI. This would give people context about the content they are viewing and mitigate the risk of AI deception. 

Technical research on AI deception is also necessary. Two primary areas warrant attention: detection and prevention. For detection, existing methods are still in their infancy and range from examining external behaviors for inconsistencies to probing internal representations of AI systems. More robust tools are needed, and targeted research funding could accelerate their development. On the prevention side, we must develop techniques for making AI systems inherently less deceptive and more honest. This could involve careful pre-training, fine-tuning, or manipulation of a model’s internal states. Both research directions will be necessary to accurately assess and mitigate the threat of AI deception. 
For more discussion, please see our full paper, AI Deception: A Survey of Examples, Risks, and Potential Solutions. And if you’d like more frequent updates on AI deception and other related topics, please consider subscribing to the AI Safety Newsletter.

Want quick summaries of the latest research & reporting in AI ethics delivered to your inbox? Subscribe to the AI Ethics Brief. We publish bi-weekly.

Primary Sidebar

🔍 SEARCH

Spotlight

Canada’s Minister of AI and Digital Innovation is a Historic First. Here’s What We Recommend.

Am I Literate? Redefining Literacy in the Age of Artificial Intelligence

AI Policy Corner: The Texas Responsible AI Governance Act

AI Policy Corner: Singapore’s National AI Strategy 2.0

AI Governance in a Competitive World: Balancing Innovation, Regulation and Ethics | Point Zero Forum 2025

related posts

  • AI Ethics: Enter the Dragon!

    AI Ethics: Enter the Dragon!

  • Research summary: Integrating ethical values and economic value to steer progress in AI

    Research summary: Integrating ethical values and economic value to steer progress in AI

  • Embedded ethics: a proposal for integrating ethics into the development of medical AI

    Embedded ethics: a proposal for integrating ethics into the development of medical AI

  • The State of Artificial Intelligence in the Pacific Islands

    The State of Artificial Intelligence in the Pacific Islands

  • Cleaning Up the Streets: Understanding Motivations, Mental Models, and Concerns of Users Flagging So...

    Cleaning Up the Streets: Understanding Motivations, Mental Models, and Concerns of Users Flagging So...

  • AI and Great Power Competition: Implications for National Security

    AI and Great Power Competition: Implications for National Security

  • REAL ML: Recognizing, Exploring, and Articulating Limitations of Machine Learning Research

    REAL ML: Recognizing, Exploring, and Articulating Limitations of Machine Learning Research

  • Risk of AI in Healthcare: A Study Framework

    Risk of AI in Healthcare: A Study Framework

  • Right to be Forgotten in the Era of Large Language Models: Implications, Challenges, and Solutions

    Right to be Forgotten in the Era of Large Language Models: Implications, Challenges, and Solutions

  • Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing

    Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing

Partners

  •  
    U.S. Artificial Intelligence Safety Institute Consortium (AISIC) at NIST

  • Partnership on AI

  • The LF AI & Data Foundation

  • The AI Alliance

Footer

Categories


• Blog
• Research Summaries
• Columns
• Core Principles of Responsible AI
• Special Topics

Signature Content


• The State Of AI Ethics

• The Living Dictionary

• The AI Ethics Brief

Learn More


• About

• Open Access Policy

• Contributions Policy

• Editorial Stance on AI Tools

• Press

• Donate

• Contact

The AI Ethics Brief (bi-weekly newsletter)

About Us


Founded in 2018, the Montreal AI Ethics Institute (MAIEI) is an international non-profit organization equipping citizens concerned about artificial intelligence and its impact on society to take action.


Archive

  • © MONTREAL AI ETHICS INSTITUTE. All rights reserved 2024.
  • This work is licensed under a Creative Commons Attribution 4.0 International License.
  • Learn more about our open access policy here.
  • Creative Commons License

    Save hours of work and stay on top of Responsible AI research and reporting with our bi-weekly email newsletter.