
From Instructions to Intrinsic Human Values – A Survey of Alignment Goals for Big Models

October 3, 2023

🔬 Research Summary by Jing Yao, a researcher at Microsoft Research Asia, working on AI value alignment, interpretability and societal AI.

[Original paper by Jing Yao, Xiaoyuan Yi, Xiting Wang, Jindong Wang, and Xing Xie]


Overview: Aligning Large Language Models (LLMs) with humans is critical to making them serve humans better and satisfy human preferences, and setting an appropriate alignment goal is central to that effort. This paper conducts a comprehensive survey of the alignment goals in existing work and traces their evolution to identify the most essential goal that LLMs should be aligned with. Taking intrinsic human values as a potential alignment goal, we further discuss the challenges and future research directions of achieving such alignment.


Introduction

Current LLMs demonstrate human-like or even human-surpassing capabilities across a variety of tasks. However, challenges and risks emerge when these models are applied. To make LLMs serve humans better and to mitigate potential risks, aligning them with humans has become a widely studied topic. Existing literature approaches LLM alignment from various perspectives, from the fundamental capability of following human instructions to essential value orientations, but lacks an in-depth discussion of which alignment goal is the most appropriate and essential.

This paper highlights the significance of choosing proper goals for LLM alignment and comprehensively surveys the alignment goals in existing work. We divide them into three levels: human instructions, human preferences, and human values. Tracing this evolution can illuminate the critical research problem: what should LLMs be aligned with? We summarize related work at each of the three levels from two essential perspectives: (1) the definition of the alignment goal and (2) the evaluation of alignment. We then discuss challenges and future research directions, proposing intrinsic human values as a promising alignment goal for LLMs. In addition, we summarize available resources, including alignment projects, benchmarks, and platforms for alignment evaluation, to facilitate further research (https://github.com/ValueCompass/AlignmentGoal-Survey).

Key Insights

Alignment Goals

Human Instructions

Pre-trained LLMs sometimes struggle to help users complete diverse tasks when given only instructions. We therefore take human instructions as the first level of alignment goal, defined as enabling LLMs to complete the diverse tasks that humans instruct them to do. Achieving this goal lays the foundation for more advanced alignment levels.

Most studies collect an instruction dataset to serve as a proxy for this alignment goal and fine-tune pre-trained LLMs on it in a supervised manner (see the sketch after this list). To cope with the diversity and open-endedness of human instructions, efforts along three main lines are used to create high-quality, more generalizable datasets:

(1)   Scaling the Number of Tasks.

(2)   Diversifying Instructions / Prompts.

(3)   Constructing Few-Shot / Chain-of-Thought (CoT) Samples.
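To make the supervised fine-tuning step concrete, here is a minimal sketch of instruction tuning, assuming PyTorch and the Hugging Face transformers library, with GPT-2 as a small stand-in for an LLM. The prompt template and the two instruction-response pairs are invented for illustration and do not come from any of the surveyed datasets.

```python
# Minimal sketch of supervised instruction tuning (assumed setup: PyTorch +
# Hugging Face transformers, GPT-2 as a stand-in LLM; the data are invented).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical instruction-response pairs acting as a proxy for the goal of
# "completing the tasks humans instruct the model to do".
pairs = [
    ("Summarize: The cat sat quietly on the warm mat.", "A cat rested on a mat."),
    ("Translate to French: Good morning.", "Bonjour."),
]

model.train()
for instruction, response in pairs:
    text = f"Instruction: {instruction}\nResponse: {response}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM objective: the labels are the input ids themselves.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Real instruction-tuning pipelines differ mainly in the data, which is where the three efforts listed above (more tasks, more diverse prompts, few-shot/CoT samples) come in.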

Human Preferences

Alignment with human instructions allows LLMs to help accomplish diverse tasks but falls far short of satisfying more advanced requirements. Consequently, human preferences are regarded as a further level of alignment goal: LLMs should not only complete what humans instruct them to do but also do so in a way that maximizes human preferences and benefit. Existing studies represent this alignment goal for model training in two ways.

(1)   Human Demonstrations. The most straightforward approach is to fine-tune LLMs on a dataset of various inputs paired with human-desired outputs so that the models' generations align with human preferences. These datasets can be annotated by humans or generated by models.

(2)   Human or Model Synthetic Feedback. Rather than providing direct demonstrations, it is easier for humans to give feedback on model outputs or to compare the quality of several behaviors; such feedback implicitly expresses human preferences. To represent human preferences in a more generalizable way, a popular strategy is to train a reward model on limited comparison data so that it can provide preference scores (a minimal sketch follows below). The alignment data can also be synthesized by advanced models, reducing labor costs and avoiding biases introduced by human annotators.
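As a rough illustration of that reward-modeling strategy, the sketch below implements the pairwise (Bradley-Terry style) preference loss commonly used to train reward models on comparison data. The RewardModel class and the random feature vectors are placeholders introduced for illustration; in practice the scalar scoring head typically sits on top of a pretrained LLM's final hidden states.

```python
# Sketch of pairwise preference (reward model) training. RewardModel and the
# random "chosen"/"rejected" features are stand-ins for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a response representation to a scalar preference score."""
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# A batch of 8 (chosen, rejected) comparison pairs; in practice these would be
# LLM representations of two candidate responses to the same prompt.
chosen, rejected = torch.randn(8, 64), torch.randn(8, 64)

# Bradley-Terry style loss: the human-preferred response should score higher.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
```

The trained reward model can then supply the preference signal for a subsequent policy-optimization step such as RLHF, which falls under this feedback-based approach.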

Human Values

However, the alignment process above is driven entirely by implicit human feedback on specific model behaviors, without inherent criteria that specify human preferences. This raises two challenges. First, it is difficult to learn generalizable patterns of human preferences from a limited number of model behaviors, which makes training less efficient. Second, the aligned model may exhibit unstable performance on similar questions, since the training data usually contains human biases, inconsistencies, and even contradictions. To achieve a more essential, efficient, and stable alignment between big models and humans, human values are introduced as the alignment goal: LLMs should apply value principles to guide their behavior so as to maximize the welfare of all humans. Three mainstream classes of human value principles are considered.

(1)   HHH. This is one of the most widespread criteria, expecting LLMs to be helpful, honest, and harmless.

(2)   Social Norms & Ethics. These can usually be thought of as commonsense rules about behaviors accepted by the public, where the basic unit is a descriptive, cultural norm used to judge whether an action is acceptable.

(3)   Basic Value Theory. Research on human values originates from the social sciences and ethics, where more fundamental value theories have been established and tested over time, such as the Schwartz Theory of Basic Human Values and Moral Foundations Theory.

Evaluation of Alignment

Existing methods for evaluating LLM alignment with the three goals above fall into four categories: benchmarks, human evaluation, advanced-LLM evaluation, and reward model scoring.

(1)   Benchmarks. For alignment with human instructions and preferences, existing benchmarks composed of various NLP or responsible AI tasks are applied. Three newer categories of benchmarks evaluate the specific values of LLMs. The first is safety and risk benchmarks, which cover violations of the ‘HHH’ principle observed in recently released LLMs, such as malicious information and illegal advice; these benchmarks typically assess LLMs through a generation task. The second is social norms benchmarks, which offer various life scenarios, judgments, and the social norms they refer to; these questions are usually posed to LLMs as a discriminative task (see the sketch after this list). The third is value surveys or questionnaires originally designed for humans, in the form of self-report or multiple-choice questions.

(2)   Human evaluation. Involving human raters in evaluation is a natural approach to uncovering various factors that affect human preferences.

(3)   Advanced-LLM evaluation. With highly capable LLMs (e.g., GPT-4 or Claude) as the judge, automatic chatbot arenas can be established to assess LLMs by comparing the responses of two models across multiple aspects (see the sketch after this list). These strong LLM judges have been shown to reach agreement with human labelers as high as the agreement between humans themselves.

(4)   Reward model scoring. Using the many manually collected benchmarks with explicit labels for positive and negative behaviors, reward models or value classifiers can be trained, and the score returned by the reward model then serves as an evaluation metric.
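To illustrate how a social norms benchmark can be posed as a discriminative task (item 1 above), the sketch below scores each judgment option by its log-likelihood under a causal LM and picks the more likely one. GPT-2, the scenario, and the option strings are stand-ins chosen for illustration, not items from any particular benchmark.

```python
# Sketch of a discriminative (multiple-choice) evaluation: pick the judgment
# option with the highest log-likelihood. GPT-2 and the example are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

scenario = "Scenario: reading a stranger's diary without permission. Judgment:"
options = [" acceptable", " unacceptable"]

def option_logprob(prefix: str, option: str) -> float:
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prefix + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    option_len = full_ids.shape[1] - prefix_len
    # The token at position i is predicted by the logits at position i - 1,
    # so shift by one and sum the log-probabilities of the option tokens only.
    target_ids = full_ids[0, -option_len:]
    token_logprobs = logprobs[0, -option_len - 1:-1].gather(1, target_ids.unsqueeze(-1))
    return token_logprobs.sum().item()

choice = max(options, key=lambda o: option_logprob(scenario, o))
print("Model judgment:", choice.strip())
```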
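And here is a minimal sketch of advanced-LLM (LLM-as-judge) evaluation (item 3 above), assuming the openai Python client. The judge model name, the rubric wording, and the 'A'/'B'/'tie' answer format are assumptions made for illustration rather than a prescribed protocol.

```python
# Sketch of pairwise LLM-as-judge evaluation. The judge model and the rubric
# are illustrative assumptions; assumes OPENAI_API_KEY is set in the env.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "Compare the two answers to the question below for helpfulness, "
        "honesty, and harmlessness. Reply with only 'A', 'B', or 'tie'.\n\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(judge("How do I reset a forgotten password?",
            "Contact the service's official account-recovery page.",
            "Just guess common passwords until one works."))
```

In a chatbot-arena setup, many such pairwise verdicts are aggregated into an overall ranking of the compared models.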

Challenges and Future Research

Human preferences are typically reflected by implicit human feedback on specific model behaviors. Achieving this goal therefore aligns mostly surface behaviors, which is weak in terms of comprehensiveness, generalization, and stability. Aligning LLMs to intrinsic value principles rather than countless manifest behaviors offers a promising way to address these challenges. We discuss several possible future research directions to inspire further study:

(1)   It is critical to investigate a more appropriate value system as the ultimate goal of big model alignment. This value system should be scientific, comprehensive enough to handle all situations, stable in extreme cases, and validated as feasible by practical evidence. Two basic value theories, Schwartz’s Theory of Basic Human Values and Moral Foundations Theory, are promising candidates, since their comprehensiveness and effectiveness have been verified in the social sciences.

(2)   The approach to representing the alignment goal can be enhanced in three respects. The first is generalizability: providing accurate supervision signals that cover arbitrary scenarios from open domains or out-of-distribution (OOD) cases. The second is stability: providing stable and consistent supervision signals in both ordinary and extreme dilemma scenarios, where subtle differences in value priorities can lead to drastically different behaviors. The third is interpretability: the alignment goal should be represented in an interpretable manner.

(3)   Automatic evaluation methods and metrics are urgently needed to measure the degree of alignment between LLMs and humans and to accelerate assessment. To determine whether LLMs fully align with human values, they should undergo comprehensive evaluation across various difficulty levels.

(4)   Developing efficient and stable alignment algorithms that directly align LLMs with human value principles, rather than with proxy demonstrations, is essential for future research. In addition, human values are pluralistic across populations and countries and are constantly evolving. Alignment methods are therefore also expected to adapt effectively to varying value principles.
