• Skip to main content
  • Skip to secondary menu
  • Skip to primary sidebar
  • Skip to footer
Montreal AI Ethics Institute

Montreal AI Ethics Institute

Democratizing AI ethics literacy

  • Articles
    • Public Policy
    • Privacy & Security
    • Human Rights
      • Ethics
      • JEDI (Justice, Equity, Diversity, Inclusion
    • Climate
    • Design
      • Emerging Technology
    • Application & Adoption
      • Health
      • Education
      • Government
        • Military
        • Public Works
      • Labour
    • Arts & Culture
      • Film & TV
      • Music
      • Pop Culture
      • Digital Art
  • Columns
    • AI Policy Corner
    • Recess
    • Tech Futures
  • The AI Ethics Brief
  • AI Literacy
    • Research Summaries
    • AI Ethics Living Dictionary
    • Learning Community
  • The State of AI Ethics Report
    • Volume 7 (November 2025)
    • Volume 6 (February 2022)
    • Volume 5 (July 2021)
    • Volume 4 (April 2021)
    • Volume 3 (Jan 2021)
    • Volume 2 (Oct 2020)
    • Volume 1 (June 2020)
  • About
    • Our Contributions Policy
    • Our Open Access Policy
    • Contact
    • Donate

Representation Engineering: A Top-Down Approach to AI Transparency

January 25, 2024

🔬 Research Summary by Andy Zou, a Ph.D. student at CMU, advised by Zico Kolter and Matt Fredrikson. He also cofounded the Center for AI Safety (safe.ai).

[Original paper by Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks]


Overview: Deep neural networks have achieved incredible success across a wide variety of domains, yet their inner workings remain poorly understood. In this work, we identify and characterize the emerging area of representation engineering (RepE), which follows an approach of top-down transparency to better understand and control the inner workings of neural networks. Inspired by cognitive neuroscience, we improve the transparency and safety of AI systems by performing AI “mind reading and control.”


Introduction

In this paper, we introduce and characterize the emerging area of representation engineering (RepE), an approach to enhancing AI systems’ transparency that draws on cognitive neuroscience insights. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including truthfulness, memorization, power-seeking, and more, demonstrating the promise of representation-centered transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

Key Insights

Motivation

One approach to increasing the transparency of AI systems is to create a “cognitive science of AI.” Current efforts toward this goal largely center around the area of mechanistic interpretability, which focuses on understanding neural networks in terms of neurons and circuits. This aligns with the Sherringtonian view in cognitive neuroscience, which sees cognition as the outcome of node-to-node connections implemented by neurons embedded in circuits within the brain. While this view has successfully explained simple mechanisms, it has struggled to explain more complex phenomena. The contrasting Hopfieldian view (n.b., not to be confused with Hopfield networks) has shown more promise in scaling to higher-level cognition. Rather than focusing on neurons and circuits, the Hopfieldian view sees cognition as a product of representational spaces implemented by patterns of activity across populations of neurons. This view currently has no analog in machine learning, yet it could point toward a new approach to transparency research.

Representation Engineering (RepE)

Representation engineering (RepE) is a top-down approach to transparency research that treats representations as the fundamental unit of analysis, aiming to understand and control representations of high-level cognitive phenomena in neural networks. In particular, we identify two main areas of RepE: Reading and Control. Representation reading seeks to locate emergent representations for high-level concepts and functions within a network. Building on the insights gained from Representation Reading, Representation Control seeks to modify or control the internal representations of concepts and functions.

In the paper, we propose Linear Artificial Tomography (LAT) as a baseline for Rep Reading. For Rep Control, we propose three different baselines: Reading Vector, Contrast Vector, and Low-Rank Representation Adaptation (LoRRA).

Frontiers of RepE

Here, we will give an overview of each of the frontiers we explore in the paper. Please refer to the more detailed section in the paper.

Honesty

In this section, we explore applications of RepE to concepts and functions related to honesty. First, we demonstrate that models possess a consistent internal concept of truthfulness, which enables detecting imitative falsehoods and intentional lies generated by LLMs. We then show how reading a model’s representation of honesty enables control techniques to enhance honesty. These interventions lead us to state-of-the-art results on TruthfulQA.

Ethics and Power

In this section, we explore the application of RepE to various aspects of machine ethics. We present progress in monitoring and controlling learned representations of important concepts and functions, such as utility, morality, probability, risk, and power-seeking tendencies. In addition, we illustrate mathematical relations between emergent representations.

Emotion

In this section, we attempt to conduct LAT scans on the LLaMA-2-Chat-13B model to discern neural activity associated with various emotions and illustrate the profound impact of emotions on model behavior.

Harmless Instruction-Following

Aligned language models designed to resist harmful instructions can be compromised by clever use of tailored prompts known as jailbreaks. We extract a model’s internal representation of harm and use it to draw the model’s attention to the harmfulness concept to shape its behavior, making it more robust to jailbreaks. This suggests the potential of enhancing or dampening targeted traits or values to achieve fine-grained control of model behavior.

Knowledge and Model Editing

We demonstrate how to apply Representation Engineering to identify and manipulate precise knowledge, factual information, and non-numerical concepts.

Memorization

Generative models can memorize a substantial portion of their training data, raising concerns about potential confidential or copyrighted content breaches. In the following section, we present an initial exploration in the area of model memorization with RepE to detect and prevent memorized outputs.

Between the lines

Inspired by the Hopfieldian view in cognitive neuroscience, RepE places representations and the transformations between them at the center of the analysis. As neural networks exhibit more coherent internal structures, we believe analyzing them at the representation level can yield new insights, aiding in effective monitoring and control. While we mainly analyzed subspaces of representations, future work could investigate trajectories, manifolds, and state spaces of representations. We hope this initial step in exploring the potential of RepE helps to foster new insights into understanding and controlling AI systems, ultimately ensuring that future AI systems are trustworthy and safe.

Want quick summaries of the latest research & reporting in AI ethics delivered to your inbox? Subscribe to the AI Ethics Brief. We publish bi-weekly.

Primary Sidebar

🔍 SEARCH

Spotlight

A network diagram with lots of little emojis, organised in clusters.

Tech Futures: AI For and Against Knowledge

A brightly coloured illustration which can be viewed in any direction. It has many elements to it working together: men in suits around a table, someone in a data centre, big hands controlling the scenes and holding a phone, people in a production line. Motifs such as network diagrams and melting emojis are placed throughout the busy vignettes.

Tech Futures: The Fossil Fuels Playbook for Big Tech: Part II

A rock embedded with intricate circuit board patterns, held delicately by pale hands drawn in a ghostly style. The contrast between the rough, metallic mineral and the sleek, artificial circuit board illustrates the relationship between raw natural resources and modern technological development. The hands evoke human involvement in the extraction and manufacturing processes.

Tech Futures: The Fossil Fuels Playbook for Big Tech: Part I

Close-up of a cat sleeping on a computer keyboard

Tech Futures: The threat of AI-generated code to the world’s digital infrastructure

The undying sun hangs in the sky, as people gather around signal towers, working through their digital devices.

Dreams and Realities in Modi’s AI Impact Summit

related posts

  • Against Interpretability: a Critical Examination

    Against Interpretability: a Critical Examination

  • Exploring Clusters of Research in Three Areas of AI Safety

    Exploring Clusters of Research in Three Areas of AI Safety

  • On Measuring Fairness in Generative Modelling (NeurIPS 2023)

    On Measuring Fairness in Generative Modelling (NeurIPS 2023)

  • Animism, Rinri, Modernization; the Base of Japanese Robotics

    Animism, Rinri, Modernization; the Base of Japanese Robotics

  • FaiRIR: Mitigating Exposure Bias from Related Item Recommendations in Two-Sided Platforms

    FaiRIR: Mitigating Exposure Bias from Related Item Recommendations in Two-Sided Platforms

  • A technical study on the feasibility of using proxy methods for algorithmic bias monitoring in a pri...

    A technical study on the feasibility of using proxy methods for algorithmic bias monitoring in a pri...

  • Research Summary: The cognitive science of fake news

    Research Summary: The cognitive science of fake news

  • Making Kin with the Machines

    Making Kin with the Machines

  • Blending Brushstrokes with Bytes: My Artistic Odyssey from Analog to AI

    Blending Brushstrokes with Bytes: My Artistic Odyssey from Analog to AI

  • Research summary: Detecting Misinformation on WhatsApp without Breaking Encryption

    Research summary: Detecting Misinformation on WhatsApp without Breaking Encryption

Partners

  •  
    U.S. Artificial Intelligence Safety Institute Consortium (AISIC) at NIST

  • Partnership on AI

  • The LF AI & Data Foundation

  • The AI Alliance

Footer


Articles

Columns

AI Literacy

The State of AI Ethics Report


 

About Us


Founded in 2018, the Montreal AI Ethics Institute (MAIEI) is an international non-profit organization equipping citizens concerned about artificial intelligence and its impact on society to take action.

Contact

Donate


  • © 2025 MONTREAL AI ETHICS INSTITUTE.
  • This work is licensed under a Creative Commons Attribution 4.0 International License.
  • Learn more about our open access policy here.
  • Creative Commons License

    Save hours of work and stay on top of Responsible AI research and reporting with our bi-weekly email newsletter.