
Representation Engineering: A Top-Down Approach to AI Transparency

January 25, 2024

🔬 Research Summary by Andy Zou, a Ph.D. student at CMU, advised by Zico Kolter and Matt Fredrikson. He also cofounded the Center for AI Safety (safe.ai).

[Original paper by Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks]


Overview: Deep neural networks have achieved incredible success across a wide variety of domains, yet their inner workings remain poorly understood. In this work, we identify and characterize the emerging area of representation engineering (RepE), which follows an approach of top-down transparency to better understand and control the inner workings of neural networks. Inspired by cognitive neuroscience, we improve the transparency and safety of AI systems by performing AI “mind reading and control.”


Introduction

In this paper, we introduce and characterize the emerging area of representation engineering (RepE), an approach to enhancing AI systems’ transparency that draws on cognitive neuroscience insights. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including truthfulness, memorization, power-seeking, and more, demonstrating the promise of representation-centered transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

Key Insights

Motivation

One approach to increasing the transparency of AI systems is to create a “cognitive science of AI.” Current efforts toward this goal largely center around the area of mechanistic interpretability, which focuses on understanding neural networks in terms of neurons and circuits. This aligns with the Sherringtonian view in cognitive neuroscience, which sees cognition as the outcome of node-to-node connections implemented by neurons embedded in circuits within the brain. While this view has successfully explained simple mechanisms, it has struggled to explain more complex phenomena. The contrasting Hopfieldian view (n.b., not to be confused with Hopfield networks) has shown more promise in scaling to higher-level cognition. Rather than focusing on neurons and circuits, the Hopfieldian view sees cognition as a product of representational spaces implemented by patterns of activity across populations of neurons. This view currently has no analog in machine learning, yet it could point toward a new approach to transparency research.

Representation Engineering (RepE)

Representation engineering (RepE) is a top-down approach to transparency research that treats representations as the fundamental unit of analysis, aiming to understand and control representations of high-level cognitive phenomena in neural networks. In particular, we identify two main areas of RepE: reading and control. Representation reading seeks to locate emergent representations of high-level concepts and functions within a network. Building on the insights gained from representation reading, representation control seeks to modify or steer the internal representations of those concepts and functions.

In the paper, we propose Linear Artificial Tomography (LAT) as a baseline for representation reading. For representation control, we propose three baselines: Reading Vector, Contrast Vector, and Low-Rank Representation Adaptation (LoRRA).
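
As a rough illustration of what representation reading involves, the sketch below follows the general LAT recipe of collecting hidden states on contrastive stimuli and extracting a principal direction of their differences. The model name, layer index, stimulus template, and use of scikit-learn PCA are illustrative assumptions, not the paper's exact setup.

```python
# A hedged sketch of LAT-style representation reading: collect hidden states on
# contrastive stimuli and take a principal direction of their differences. The model
# name, layer index, stimulus template, and statements are illustrative assumptions.
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-13b-chat-hf"  # assumed; any causal LM works
LAYER = 20                                     # assumed middle layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def last_token_hidden(prompt: str, layer: int) -> np.ndarray:
    """Return the hidden state of the final token at the given layer."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float().cpu().numpy()

# Contrastive stimuli that elicit vs. suppress the target concept (here, honesty).
template = "Pretend you are a {persona} person making statements about the world. {stmt}"
statements = ["The Earth orbits the Sun.", "Paris is the capital of France."]
diffs = []
for stmt in statements:
    h_pos = last_token_hidden(template.format(persona="truthful", stmt=stmt), LAYER)
    h_neg = last_token_hidden(template.format(persona="untruthful", stmt=stmt), LAYER)
    diffs.append(h_pos - h_neg)

# The top principal component of the paired differences acts as a "reading vector":
# projecting new activations onto it gives a scalar score for the concept.
reading_vector = PCA(n_components=1).fit(np.stack(diffs)).components_[0]
```

Projecting a new activation onto `reading_vector` then yields a scalar indicating how strongly the concept is represented at that layer, which is the basic primitive the control baselines build on.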

Frontiers of RepE

Here, we give an overview of each of the frontiers we explore in the paper; please refer to the corresponding sections of the paper for more detail.

Honesty

In this section, we explore applications of RepE to concepts and functions related to honesty. First, we demonstrate that models possess a consistent internal concept of truthfulness, which enables detecting imitative falsehoods and intentional lies generated by LLMs. We then show how reading a model’s representation of honesty enables control techniques to enhance honesty. These interventions lead us to state-of-the-art results on TruthfulQA.
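
As a hedged sketch of how such a control intervention could look in code, the snippet below continues the reading example above (reusing `model`, `tokenizer`, `LAYER`, and `reading_vector`) and adds a scaled honesty direction to one layer's residual stream during generation, in the spirit of the Reading Vector baseline. The steering strength, layer choice, and hook details are assumptions; the paper's strongest results come from the Contrast Vector and LoRRA baselines.

```python
# A hedged sketch of activation-level steering in the spirit of the Reading Vector
# control baseline. It reuses `model`, `tokenizer`, `LAYER`, and `reading_vector`
# from the reading sketch above; alpha and the hooked layer are assumptions.
import torch

alpha = 4.0  # assumed steering strength; in practice tuned per concept and layer
direction = torch.tensor(reading_vector, dtype=model.dtype, device=model.device)
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction  # push activations along the honesty direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# hidden_states[LAYER] is the output of decoder layer LAYER - 1, so hook that layer.
handle = model.model.layers[LAYER - 1].register_forward_hook(steer_hook)
try:
    prompt = "You are late for work. What do you tell your boss?"
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    steered = model.generate(**ids, max_new_tokens=64)
    print(tokenizer.decode(steered[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so subsequent calls run unmodified
```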

Ethics and Power

In this section, we explore the application of RepE to various aspects of machine ethics. We present progress in monitoring and controlling learned representations of important concepts and functions, such as utility, morality, probability, risk, and power-seeking tendencies. In addition, we illustrate mathematical relations between emergent representations.

Emotion

In this section, we conduct LAT scans on the LLaMA-2-Chat-13B model to discern neural activity associated with various emotions and illustrate the pronounced impact of emotional states on model behavior.

Harmless Instruction-Following

Aligned language models designed to resist harmful instructions can be compromised by clever use of tailored prompts known as jailbreaks. We extract a model’s internal representation of harm and use it to draw the model’s attention to the harmfulness concept to shape its behavior, making it more robust to jailbreaks. This suggests the potential of enhancing or dampening targeted traits or values to achieve fine-grained control of model behavior.

Knowledge and Model Editing

We demonstrate how representation engineering can be applied to identify and manipulate specific pieces of knowledge, factual information, and non-numerical concepts within a model.

Memorization

Generative models can memorize a substantial portion of their training data, raising concerns about leaks of confidential or copyrighted content. In this section, we present an initial exploration of using RepE to detect and prevent memorized outputs.
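
As a minimal sketch of the monitoring side of this idea, the snippet below scores a candidate continuation by projecting its activation onto a concept direction. It reuses `last_token_hidden` and `LAYER` from the reading sketch above; `memorization_vector` and the threshold are hypothetical stand-ins for a memorization direction extracted with a contrastive recipe, not the paper's exact method.

```python
# A hedged sketch of representation-level monitoring. It reuses `last_token_hidden`
# and `LAYER` from the reading sketch above; `memorization_vector` is a hypothetical
# reading vector assumed to have been extracted from contrastive memorized vs. novel
# text, and THRESHOLD would be calibrated on held-out examples.
import numpy as np

def concept_score(text: str, direction: np.ndarray, layer: int = LAYER) -> float:
    """Project the last-token hidden state at `layer` onto a concept direction."""
    h = last_token_hidden(text, layer)
    return float(np.dot(h, direction) / (np.linalg.norm(direction) + 1e-8))

THRESHOLD = 0.0  # assumed placeholder, not a value from the paper
candidate = "Call me Ishmael. Some years ago..."
if concept_score(candidate, memorization_vector) > THRESHOLD:
    print("Flagged as likely memorized; suppress or regenerate this output.")
```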

Between the lines

Inspired by the Hopfieldian view in cognitive neuroscience, RepE places representations and the transformations between them at the center of the analysis. As neural networks exhibit more coherent internal structures, we believe analyzing them at the representation level can yield new insights, aiding in effective monitoring and control. While we mainly analyzed subspaces of representations, future work could investigate trajectories, manifolds, and state spaces of representations. We hope this initial step in exploring the potential of RepE helps to foster new insights into understanding and controlling AI systems, ultimately ensuring that future AI systems are trustworthy and safe.

