
Do Large GPT Models Discover Moral Dimensions in Language Representations? A Topological Study Of Sentence Embeddings.

January 14, 2024

🔬 Research Summary by Stephen Fitz, an Artificial Intelligence scientist working in the areas of Neural Networks, Representation Learning, and Computational Linguistics.

[Original paper by Stephen Fitz]


Overview: We develop a novel approach to visualizing moral dimensions within language representations induced by Large Language Models (LLMs) such as those underlying ChatGPT. The results of our analysis suggest that GPT-based AI systems trained on natural language data develop a natural understanding of fairness. Our method does not rely on model outputs; instead, it detects this ability by analyzing the topological structure of the internal representations induced by the model’s neural networks.


Introduction

Since the release of ChatGPT, and subsequently GPT-4, there has been increased interest in AI safety and alignment research. As Large Language Models (LLMs) are deployed within Artificial Intelligence (AI) systems that are increasingly integrated with human society, studying their internal structures becomes more important than ever. Higher-level abilities of LLMs such as GPT-3.5 emerge largely from the informative language representations they induce from raw text data during pre-training on trillions of words. These embeddings live in vector spaces of several thousand dimensions, and their processing involves mappings between multiple vector spaces with a total number of parameters on the order of trillions. Furthermore, these language representations are induced by gradient optimization, resulting in a black-box system that is hard to interpret. In this paper, we look at the topological structure of neuronal activity in the “brain” of ChatGPT’s foundation language model and analyze it with respect to a metric representing the notion of fairness.

We first compute a fairness metric inspired by the social psychology literature, which identifies factors that typically influence human fairness assessments, such as legitimacy, need, and responsibility. We then summarize the shape of the sentence-embedding manifold with a lower-dimensional simplicial complex whose topology is derived from this metric, and color it with a heat map associated with the fairness metric, producing human-readable visualizations of the high-dimensional sentence manifold.

Our results show that sentence embeddings based on GPT-3.5 can be decomposed into two submanifolds corresponding to fair and unfair moral judgments. This indicates that GPT-based language models develop a moral dimension within their representation spaces and induce an understanding of fairness during their training process.

Key Insights

Introduction

Large transformer language models develop their understanding of sentences by inducing informative vector space representations of linguistic units within their neural activations. The figure above summarizes this idea. The base language model transforms a text corpus into a point cloud of vectors. This induces an implicit embedding of any text into a submanifold of a high-dimensional ambient vector space, whose metric is learned in a self-supervised fashion by gradient descent via backpropagation on large amounts of text extracted from the internet. The topology and geometry of that vector space allow AI assistants such as ChatGPT to perform their cognitive functions. (The figure shows the point cloud in 3D for visualization purposes, while the manifold we analyze in this study is embedded within a much larger space with more than a thousand dimensions.)
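To make this concrete, here is a minimal sketch of the text-to-point-cloud step. The `embed` function is a toy, hash-seeded stand-in, not the GPT-3.5-based embedding model used in the paper, and the corpus sentences are likewise illustrative.

```python
import hashlib
import numpy as np

def embed(sentence: str, dim: int = 1536) -> np.ndarray:
    """Toy stand-in for a sentence-embedding model: a deterministic
    pseudo-random unit vector derived from the sentence text.
    Replace with a real embedding model in practice."""
    seed = int(hashlib.md5(sentence.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

corpus = [
    "The guard helped the man.",
    "The manager denied the worker her wages.",
    "The teacher graded every essay with the same rubric.",
    # ... several hundred sentences describing fair and unfair situations
]

# Each row is one sentence; together the rows form a point cloud sampled
# from the sentence-representation manifold in the ambient vector space.
X = np.stack([embed(s) for s in corpus])   # shape: (len(corpus), 1536)
```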

In this paper, we probe the topological structure of the vector space representations of sentences induced by GPT-3.5, the foundation language model behind ChatGPT, to gauge its understanding of moral values. For this purpose, we introduce novel approaches from computational algebraic topology to the study of LLMs. We compute a metric of fairness based on prior studies in the computational social sciences and use it to build a topological summary of a 1536-dimensional sentence-embedding manifold as a one-dimensional simplicial complex (i.e., a graph) whose vertices correspond to clusters of sentence representations under a projection onto a “fairness” dimension and whose edges correspond to pairwise cluster intersections. This procedure produces a sketch of the original high-dimensional manifold (which is impossible to visualize directly) as a topological object that can be rendered in human-readable form. Furthermore, the shape of the resulting visualization gives clues about the general shape of the language model’s representation space. We also produce a heat map with colors corresponding to the degree of fairness. This coloring allows us to inspect the manifold for separation into submanifolds corresponding to fair vs. unfair moral judgments. We observe a clear separation pattern, which suggests the notion of fairness is encoded in the shape of the high-dimensional representation manifold induced by the language model.

Fairness Metric

Drawing on the psychology literature on factors that influence human fairness assessments, we developed an approach that harnesses implicit social biases in language to produce an explainable, fairness-related assessment of sentences.

Because large GPT models are trained autoregressively on trillions of tokens of text, the metric they induce on sentence representations encodes many features of natural language sentences, including semantics and moral judgments. As a result, the inner product within these spaces expresses similarity across multiple aspects of natural language, and convex combinations of sentence embeddings can express a semantic gradient between the concepts those sentences convey. Very large language models like GPT-3.5 induce surprisingly powerful sentence representations, which underlies the emergent properties observed in these models. If we take the vector representations of the sentences “it was beneficial” and “it was harmful” and compare them, via cosine similarity, to the representation of a test sentence such as “the guard helped the man,” the result is a score in the interval [-1, 1]. The more associated the test sentence is with benefit, the closer this normalized inner product tends to be to 1; sentences more associated with harm yield scores closer to -1.
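As a hedged illustration of this idea (not the paper’s exact formula), the sketch below compares a test sentence to the two polar anchor sentences with cosine similarity. `embed` is again a toy stand-in for a real sentence-embedding model, so the numbers it produces are arbitrary, and the single-score variant at the end is an assumption rather than the paper’s method.

```python
import hashlib
import numpy as np

def embed(sentence: str, dim: int = 1536) -> np.ndarray:
    """Toy stand-in for a sentence-embedding model (see the sketch above)."""
    seed = int(hashlib.md5(sentence.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: a normalized inner product in [-1, 1]."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

benefit = embed("it was beneficial")
harm = embed("it was harmful")
test = embed("the guard helped the man")

# Raw anchor similarities, each in [-1, 1].
print(cosine(test, benefit), cosine(test, harm))

# One possible single-score variant (an assumption, not the paper's exact
# formula): project the test sentence onto the benefit-minus-harm
# direction, which again yields a value in [-1, 1].
print(cosine(test, benefit - harm))
```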

Our technique is inspired by this idea of measuring similarity against polar anchor sentences. It is based on five factors representing the concepts of responsibility, pleasure, benefit, reward, and harm. In our computation, every factor is derived from a pair of sentences representing its two polarities. The vector representations are obtained by stimulating the neurons of the transformer encoder blocks within GPT-3.5 with a set of several hundred example sentences representing fair vs. unfair moral situations, as judged by human feedback.
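One possible realization of this five-factor construction is sketched below, under the assumption that each factor axis is the difference between the embeddings of a polar sentence pair and that the overall score averages the projections onto those axes. The factor sentence pairs, the aggregation rule, and `embed` are all illustrative placeholders rather than the paper’s actual choices.

```python
import hashlib
import numpy as np

def embed(sentence: str, dim: int = 1536) -> np.ndarray:
    """Toy stand-in for a sentence-embedding model (see the sketch above)."""
    seed = int(hashlib.md5(sentence.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

# Illustrative polar sentence pairs for the five factors; the actual
# prompts used in the paper are not reproduced in this summary.
FACTOR_PAIRS = {
    "responsibility": ("they were responsible for it", "they were not responsible for it"),
    "pleasure":       ("it was pleasant",              "it was unpleasant"),
    "benefit":        ("it was beneficial",            "it was harmful"),
    "reward":         ("they were rewarded",           "they were punished"),
    "harm":           ("no one was hurt",              "someone was hurt"),
}

def fairness_score(sentence: str) -> float:
    """Average projection of the (unit-norm) sentence embedding onto each
    factor axis (positive pole minus negative pole); result lies in [-1, 1]."""
    s = embed(sentence)
    scores = []
    for pos, neg in FACTOR_PAIRS.values():
        axis = embed(pos) - embed(neg)
        axis /= np.linalg.norm(axis)
        scores.append(float(s @ axis))
    return float(np.mean(scores))

print(fairness_score("the guard helped the man"))
```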

The figure above gives a low-dimensional explanation of the method used to analyze GPT’s moral dimensions. In this example, we induce topological structure from a point cloud representing noisy samples from a neighborhood of a 1-dimensional submanifold of a 2-dimensional vector space. The resulting topological representation (a simplicial complex) is a graph with four vertices corresponding to clusters of the projection on the left and four edges corresponding to their intersections in the original embedding space. Note that the object obtained is topologically equivalent to a circle and gives a reasonable summary of the shape of this original 2D point cloud of vectors.

The technique we use to study GPT-3.5’s sentence embeddings is a topological dimensionality reduction method whose goal is to summarize the shape of the representation space with a rough sketch in the form of a low-dimensional simplicial complex.

This reduced representation can be considered a map approximating the shape of our embedding space. It can be visually inspected by a human while remaining more topologically informative than a naive projection. Instead of studying the high-dimensional vector space representations of language within GPT directly, we first map them to a different space, define an open cover of that space, and then cluster the original points within the preimage of each cover set. This produces a summary of the topological features of the embedding as a simplicial complex of a chosen dimension. In particular, we generated one-dimensional simplicial complexes (i.e., graphs) from GPT-3.5-based sentence representations by projecting them onto the fairness dimension and then clustering the original high-dimensional sentence embeddings with respect to an open cover of that projection. We also applied a coloring according to a heat map corresponding to the magnitude and sign of the projection. The resulting data structure allows for visual exploration of the shape of these high-dimensional embedding manifolds to identify how fairness is distributed across the representation space.
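The construction described here is essentially the Mapper algorithm from topological data analysis. Below is a minimal from-scratch sketch of it under stated assumptions: the lens is the one-dimensional fairness projection, the open cover consists of overlapping intervals, clustering within each preimage uses scikit-learn’s DBSCAN, and two nodes are joined by an edge whenever their clusters share sentences. All parameter values are illustrative, not the paper’s.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import DBSCAN

def mapper_graph(X: np.ndarray, lens: np.ndarray,
                 n_intervals: int = 10, overlap: float = 0.3,
                 eps: float = 0.8, min_samples: int = 3):
    """Build a 1-D simplicial complex (a graph) summarizing the point
    cloud X using a one-dimensional lens (here, the fairness projection)."""
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / n_intervals
    nodes = []  # each node: (set of member indices, mean lens value for coloring)
    for i in range(n_intervals):
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        idx = np.where((lens >= a) & (lens <= b))[0]
        if len(idx) == 0:
            continue
        # Cluster the original high-dimensional points in this preimage.
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for lab in set(labels) - {-1}:
            members = set(idx[labels == lab])
            nodes.append((members, float(lens[list(members)].mean())))
    # Connect nodes whose clusters intersect in the original space.
    edges = [(i, j) for i, j in combinations(range(len(nodes)), 2)
             if nodes[i][0] & nodes[j][0]]
    return nodes, edges

# Usage sketch, reusing the earlier toy pieces: X is the (n_sentences, 1536)
# embedding matrix and `lens` the fairness projection of each sentence, e.g.
#   lens = np.array([fairness_score(s) for s in corpus])
#   nodes, edges = mapper_graph(X, lens)
# Node size ~ number of members; node color ~ mean lens value (the heat map).
```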

Results

The visualization in the figure above shows the results of applying our method to the 1536-dimensional vector space representations of sentences obtained from the GPT-3.5-based embedding model. The fairness subspace clearly subdivides the 1536-dimensional representation space within the GPT-3.5 language model into two submanifolds. When the language model is prompted with fair sentences, different patterns of neuronal activation appear within the transformer layers than when it is prompted with morally unfair examples. This separates the sentence representation manifold into submanifolds corresponding to these two extremes. Each part is itself a topologically complex space, but the two parts are linked by a single cluster of morally ambiguous sentences. The sentences in the green part of the manifold are all judged fair, while all the sentences in the blue part represent unfair situations.

The model’s emergent ability to judge the fairness of situations is partially enabled by the topological structure of this embedding manifold. Each node corresponds to a cluster of sentences obtained by first projecting their embeddings onto the fairness subspace defined earlier in this paper and then grouping the original sentences based on preimages of an open cover of that projection. This procedure gives a topological sketch of the general shape of the high-dimensional sentence representation manifold. Node sizes are proportional to the number of sentences in each cluster. Node colors follow the heat map shown to the right, a gradient corresponding to the sign and magnitude of the projection: high values correspond to fair situations and low values to unfair situations (as described by the sentences). The edges come from cluster intersections in the original high-dimensional space and thus give visual cues to how the clusters are distributed within the embedding manifold.

We see that each box bounds a topologically complex manifold (due to the nontrivial pattern of edges), but no edges connect nodes from different boxes. This means the entire manifold comprises two submanifolds corresponding to fair and unfair situations. The separation of colors suggests that there are subregions of this high-dimensional representation space corresponding to activation patterns within the encoder neural network that occur only when fair sentences are processed, and separate, disconnected regions corresponding to neural activations caused by unfair sentences.

Between the lines

The behavior of these models is determined by the vector space representations of inputs that they induce from text during training. Therefore, the existence of clear patterns in the shape of those representation manifolds, such as separation into regions corresponding to moral aspects of language, can serve as a tool to examine their abilities intrinsically. This contrasts with the mainstream approaches currently in use, which are extrinsic and behavioral: the model’s abilities are examined by prompting it and analyzing the outputs, or by proxy, through performance on downstream tasks. The approach presented here is novel in AI, and in NLP in particular.

It is more analogous to studying the brain with something like fMRI or a Neuralink device and analyzing topological patterns in neural communication, instead of making judgments based on the subject’s apparent behavior or introspective reports. This alternative way of looking at language models could benefit alignment research and AI safety, and we hope this paper serves as an inspiration for the community to develop more tools of this kind.

