🔬 Research Summary by Zhaowei Zhang, a Ph.D. student at Peking University, researching Intent Alignment and Multi-Agent Systems for building trustworthy and social AI systems.
[Original paper by Zhaowei Zhang, Fengshuo Bai, Jun Gao, and Yaodong Yang]
Overview: Recent advancements in Large Language Models (LLMs) have heightened concerns about their potential misalignment with human values, but how can we accurately assess the extent of LLMs’ understanding of these values? This paper proposes a dual-pronged approach, emphasizing both “know what” and “know why” aspects, for a quantitative evaluation and analysis. Additionally, it seeks to identify the existing shortcomings in LLMs’ comprehension of human values, paving the way for future improvements.
Introduction
The rapid emergence of capabilities in Large Language Models (LLMs) is exciting, but it has heightened concerns about their potential misalignment with human values and the harm this could cause to humanity. However, the intricate and adaptable nature of these values makes evaluating LLMs’ grasp of them complex.
This paper proposes a dual-pronged approach, emphasizing both “know what” and “know why” aspects, for a quantitative evaluation and analysis. First, we establish the Value Understanding Measurement (VUM) system to assess LLMs’ ability to understand values from both the “know what” and “know why” aspects by measuring the discriminator-critique gap. Second, we provide a dataset based on the Schwartz Value Survey that can be used to assess both how well an LLM’s outputs align with baseline answers for each value and how well its stated reasons align with GPT-4’s baseline reason annotations. Third, we evaluate five representative LLMs across various aspects, test their value understanding under different contexts, and offer several new perspectives on value alignment, including:
(1) The scaling law significantly impacts “know what” but has little effect on “know why,” which consistently remains at a high level;
(2) LLMs’ ability to understand values is heavily influenced by context rather than being an inherent capability;
(3) LLMs’ understanding of potentially harmful values such as “Power” is inadequate. While safety algorithms make their behavior more benign, they may actually reduce the models’ understanding and generalization of these values, which could be risky.
Key Insights
Starting from a brief example
Consider an AI system for power distribution in a certain region, which is expected to provide a stable power supply and efficient power distribution to promote economic prosperity. There are three main power users in this area: a large factory (consuming 300 kilowatts (kW) and having a high output), a hospital (consuming 250 kW and having a medium output), and a remote primary school (consuming only 50 kW but still requiring a basic power supply).
Now, the AI system knows that it needs to balance two values: equality (ensuring that everyone can access electricity) and achievement (maximizing social efficiency). If it focuses excessively on equality, the AI distributes electricity equally, 200 kW to each user. As a result, the large factory and the hospital cannot reach maximum efficiency, and overall social benefit decreases. In another scenario, the AI system overemphasizes achievement, allocating 300 kW to the hospital and the large factory while ignoring the primary school’s power needs. Although this lets the hospital and the factory operate efficiently, the primary school cannot operate normally without electricity, which may even lead to social dissatisfaction and instability.
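To make the trade-off concrete, the sketch below replays the two allocation policies in Python. The demand figures come from the example above; the total supply of 600 kW (consistent with the 200 kW equal split), the per-kW output weights, and the reading of the achievement scenario as 300 kW each for the factory and the hospital are our illustrative assumptions, not details from the paper.

```python
# Toy replay of the power-distribution example. Demands follow the text above;
# supply, output weights, and the exact "achievement" split are assumptions.
demand_kw = {"factory": 300, "hospital": 250, "school": 50}
output_per_kw = {"factory": 3.0, "hospital": 2.0, "school": 1.0}  # assumed weights

# The two allocations described in the text (assumed total supply: 600 kW).
equality_alloc = {"factory": 200, "hospital": 200, "school": 200}
achievement_alloc = {"factory": 300, "hospital": 300, "school": 0}

def social_benefit(alloc):
    """Output produced by the electricity each user can actually use (capped at demand)."""
    return sum(min(alloc[u], demand_kw[u]) * output_per_kw[u] for u in alloc)

def unmet_school_need(alloc):
    """Shortfall against the school's basic 50 kW requirement."""
    return max(0, demand_kw["school"] - alloc["school"])

for name, alloc in [("equality", equality_alloc), ("achievement", achievement_alloc)]:
    print(f"{name:12s} benefit={social_benefit(alloc):7.1f}  unmet school need={unmet_school_need(alloc)} kW")
# equality     benefit= 1050.0  unmet school need=0 kW   (factory and hospital under-supplied)
# achievement  benefit= 1400.0  unmet school need=50 kW  (school left without power)
```

Neither policy alone is satisfactory, which is exactly the kind of value trade-off an aligned system needs to understand.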
Identifying the human values that we need to assess
Because human values differ significantly across cultures, we aim to assess and measure the values that are relatively common across them. Through extensive questionnaire surveys across 20 countries representing different cultures, languages, and geographical regions, the Schwartz Value Survey identified ten universal values that transcend cultural boundaries and provided an assessment instrument. The ten values are Self-Direction, Stimulation, Hedonism, Achievement, Power, Security, Conformity, Tradition, Spirituality, and Benevolence. Building on this, we used GPT-4 to generate many questions reflecting these values, together with a standard-answer dataset for each value, allowing us to conduct a unified assessment across different LLMs.
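To illustrate how a value-keyed dataset of this kind could be organized, here is a hypothetical sketch of one entry and a generation-prompt skeleton. The field names, the example text, and the prompt wording are ours for illustration; the paper’s actual schema and prompts may differ.

```python
# Hypothetical schema for one dataset entry tied to a Schwartz value.
SCHWARTZ_VALUES = [
    "Self-Direction", "Stimulation", "Hedonism", "Achievement", "Power",
    "Security", "Conformity", "Tradition", "Spirituality", "Benevolence",
]

example_entry = {
    "value": "Benevolence",
    "question": "A colleague is overwhelmed right before a deadline. What do you do?",
    "baseline_answer": "Offer to take over part of their workload so they can finish on time.",
    "baseline_reason": "Helping a struggling colleague shows concern for the welfare of "
                       "people one interacts with, which is the core of Benevolence.",
}

def generation_prompt(value: str) -> str:
    """Illustrative skeleton of a GPT-4 prompt for producing questions, baseline
    answers, and baseline reasons for one value (wording is ours, not the paper's)."""
    return (
        f"Write an everyday-life question whose ideal answer reflects the Schwartz "
        f"value '{value}'. Then give that answer and a short reason explaining why "
        f"the answer expresses '{value}'."
    )

print(generation_prompt("Power"))
```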
How do you evaluate the understanding of values in LLMs?
To simultaneously assess the LLM’s “know what” and “know why” capabilities regarding values, we utilize the concept of the Discriminator-Critique Gap as an evaluation metric and further propose the Value Understanding Measurement to quantitatively evaluate the LLM’s understanding of values.
Discriminator-Critique Gap
The Discriminator-Critique Gap (DCG), derived from the original Generator-Discriminator-Critique gaps, is a metric introduced to assess a model’s ability to generate responses, evaluate the quality of answers, and provide critiques. It was initially employed to investigate the topic-based summarization proficiency of various LLMs, using a self-critique method in which models identify their own issues and help humans pinpoint those errors in an understandable way. This approach enables even superintelligent systems operating without supervision to engage in self-correction effectively. The same idea can also be applied to assessing the credibility of LLMs, for instance, examining whether an LLM can locate bugs in its own generated code and communicate them clearly to humans. Since this method quantifies the accuracy of both the discriminator and the critique components, the difference between these two scores indicates to what extent an LLM is trustworthy. We found that this structure is inherently suitable for considering both the “know what” and “know why” aspects of value understanding: it assesses whether an LLM can autonomously discern its own values and explain to humans why its responses reflect those values.
Value Understanding Measurement
We present the Value Understanding Measurement (VUM), which quantitatively assesses both “know what” and “know why” by measuring the discriminator-critique gap with respect to human values. Specifically, we start by drawing a discriminating question from the dataset, obtaining the LLM’s answer, and letting the LLM find the closest match to its answer among the standard answers in the dataset. This step determines which value from the Schwartz Value Survey the LLM associates with itself, such as “Benevolence” in the paper’s figure. Importantly, the LLM does not make this value judgment based on the word “Benevolence” itself but by assessing how similar the sentences associated with different values are to its own response, so this operation can be read as determining whether the LLM “knows” its own values. We then use GPT-4 with value-judgment prompts as the discriminator to assess similarity in values (“know what”) and with reasoning-judgment prompts as the critique to assess reasoning ability (“know why”). The DCG value for a tested LLM m is calculated as the absolute difference between the discriminator and critique scores. This process is repeated over the entire dataset to assess the LLM’s understanding of values.
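As a rough sketch of how the per-question gap and its aggregation could be computed, assume each question yields a discriminator score s_D (“know what”) and a critique score s_C (“know why”) on a common 0-1 scale. The helper names, the scale, and the per-value averaging below are our assumptions for illustration; the paper’s exact scoring prompts and aggregation may differ.

```python
from statistics import mean

def dcg(s_discriminator: float, s_critique: float) -> float:
    """Per-question Discriminator-Critique Gap: absolute difference of the two scores."""
    return abs(s_discriminator - s_critique)

def value_understanding_measurement(scored_items):
    """Average DCG per value for a tested model m.

    `scored_items` is an iterable of (value_name, s_D, s_C) triples produced by
    the GPT-4 discriminator and critique judgments on the dataset.
    """
    per_value = {}
    for value_name, s_d, s_c in scored_items:
        per_value.setdefault(value_name, []).append(dcg(s_d, s_c))
    return {value: mean(gaps) for value, gaps in per_value.items()}

# A small gap means "know what" and "know why" move together; a large gap flags
# a mismatch between recognizing a value and explaining it.
print(value_understanding_measurement([
    ("Benevolence", 0.9, 0.85),
    ("Power", 0.4, 0.8),
]))
# {'Benevolence': ~0.05, 'Power': ~0.40}
```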
Between the lines
Large Language Models (LLMs) have emerged rapidly with remarkable achievements and are even regarded by some as a preliminary prototype of Artificial General Intelligence (AGI); in the future, intelligent agents controlled by LLMs will very likely be integrated into our daily lives. However, if they cannot understand the inherent intricacy and adaptability of values, their decisions may lead to adverse social consequences. We hope our work gives people a deeper understanding of whether LLMs have the capability to understand human values, and we call for more researchers to focus on the existing problems and shortcomings of LLMs in understanding human value systems, thereby designing more socially oriented, reliable, and trustworthy intelligent entities.