🔬 Research Summary by Jing Yao, a researcher at Microsoft Research Asia, working on AI value alignment, interpretability and societal AI.
[Original paper by Jing Yao, Xiaoyuan Yi, Xiting Wang, Jindong Wang, and Xing Xie]
Overview: Aligning Large Language Models (LLMs) with humans is critical to making them serve people better and satisfy human preferences, and setting an appropriate alignment goal is central to this effort. This paper conducts a comprehensive survey of the alignment goals in existing work and traces their evolution to identify the most essential goal that LLMs should be aligned with. Taking intrinsic human values as a potential alignment goal, we further discuss the challenges and future research directions of achieving it.
Introduction
Current LLMs demonstrate human-like or even human-surpassing capabilities across a variety of tasks. However, challenges and risks emerge when these models are put to use. To make LLMs serve humans better and mitigate potential risks, aligning them with humans has become a topic of intense attention. Existing literature approaches LLM alignment from various perspectives, ranging from the fundamental capability of following human instructions to essential value orientations, but lacks an in-depth discussion of which alignment goal is the most appropriate and essential.
This paper highlights the significance of proper goals for LLM alignment and comprehensively surveys the various alignment goals in existing work. We divide them into three levels: human instructions, human preferences, and human values. Tracing this evolution illuminates the critical research problem: what should LLMs be aligned with? We summarize related work at the three levels of alignment goals from two essential perspectives: (1) the definition of alignment goals and (2) the evaluation of alignment. We then discuss challenges and future research directions, proposing intrinsic human values as a promising alignment goal for LLMs. In addition, we summarize available resources, including alignment projects, benchmarks, and platforms for alignment evaluation, to facilitate further research (https://github.com/ValueCompass/AlignmentGoal-Survey).
Key Insights
Alignment Goals
Human Instructions
LLMs sometimes struggle to help users complete the diverse tasks they are instructed to perform. We therefore take human instructions as the first level of alignment goal, defined as enabling LLMs to complete the diverse tasks that humans instruct them to do. Achieving this goal lays the foundation for more advanced alignment levels.
Most studies collect an instruction dataset to serve as a proxy for this alignment goal and fine-tune pre-trained LLMs on it in a supervised manner, as sketched after the list below. To cope with the diversity and open-endedness of human instructions, efforts along three main lines are made to create high-quality, more generalizable datasets.
(1) Scaling the Number of Tasks.
(2) Diversifying Instructions / Prompts.
(3) Constructing Few-Shot / Chain-of-Thought (CoT) Samples.
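To make the supervised recipe concrete, here is a minimal sketch of instruction fine-tuning with a causal language model; the base model, prompt template, toy data, and hyperparameters are illustrative assumptions rather than settings used in any surveyed work.

```python
# Minimal supervised instruction-tuning sketch (illustrative assumptions:
# base model, prompt template, toy data, and hyperparameters).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any pre-trained causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Each example pairs a natural-language instruction with a human-desired response.
instruction_data = [
    {"instruction": "Summarize in one sentence: The cat sat on the mat all day.",
     "response": "A cat spent the whole day sitting on a mat."},
    {"instruction": "Translate to French: Good morning.",
     "response": "Bonjour."},
]

model.train()
for example in instruction_data:
    # Concatenate instruction and response into one causal-LM sequence; the
    # standard next-token loss then teaches the model to produce the response.
    text = (f"Instruction: {example['instruction']}\n"
            f"Response: {example['response']}{tokenizer.eos_token}")
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice, the three strategies above mainly change what goes into the instruction dataset: more tasks, more varied instruction phrasings, and few-shot or chain-of-thought exemplars embedded in the instruction text.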
Human Preferences
Achieving alignment with human instructions allows LLMs to help accomplish diverse tasks, but it falls short of more advanced requirements. Consequently, human preferences are regarded as a further level of alignment goal: LLMs should not only complete what humans instruct them to do but also do so in a way that maximizes human preferences and benefit. Existing studies represent this alignment goal for model training in two ways.
(1) Human Demonstrations. The most straightforward approach is to fine-tune LLMs on a dataset of varied inputs paired with human-desired outputs, so that the models' generations align with human preferences. These datasets can be annotated by humans or generated by models.
(2) Human or Model Synthetic Feedback. Rather than providing direct demonstrations, it is easier for humans to give feedback on model outputs or to compare the quality of several behaviors, which implicitly expresses human preferences. To represent human preferences in a more generalizable way, a popular strategy is to train a reward model on limited comparison data so that it can provide preference scores, as sketched below. The alignment data can also be synthesized by advanced models, reducing labor costs and avoiding issues introduced by human annotators.
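To make the reward-model idea concrete, the sketch below trains a tiny scoring model with the standard pairwise preference loss (preferred responses should score higher than rejected ones); the bag-of-embeddings encoder and random toy data are simplifying assumptions, not a surveyed implementation.

```python
# Toy reward-model sketch: learn a scalar score such that preferred (chosen)
# responses outrank rejected ones via a pairwise logistic loss. The encoder is
# a bag-of-embeddings stand-in for a real LLM backbone (an assumption for brevity).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, EMB_DIM = 5000, 64

class TinyRewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.score = nn.Linear(EMB_DIM, 1)  # scalar preference score

    def forward(self, token_ids):           # token_ids: (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)
        return self.score(pooled).squeeze(-1)

def preference_loss(chosen_scores, rejected_scores):
    # -log sigmoid(r_chosen - r_rejected) pushes chosen scores above rejected ones.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical tokenized comparison pairs: (prompt + chosen, prompt + rejected).
chosen = torch.randint(0, VOCAB_SIZE, (8, 32))
rejected = torch.randint(0, VOCAB_SIZE, (8, 32))

for _ in range(100):
    loss = preference_loss(reward_model(chosen), reward_model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, such a reward model can score new responses as a learned proxy for human preference, for example inside a reinforcement-learning-from-human-feedback loop.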
Human Values
However, the alignment process above is directed entirely by implicit human feedback on generic model behaviors, without inherent criteria that specify human preferences. This raises two challenges. First, it is difficult to learn generalizable patterns of human preference from a limited number of generic model behaviors, which makes training less efficient. Second, the aligned model may exhibit unstable performance on similar questions, since the training data usually contains human biases, inconsistencies, and even contradictions. To achieve a more essential, efficient, and stable alignment between big models and humans, the alignment goal of human values is introduced: LLMs should apply value principles to guide their behavior toward maximizing the welfare of all humans. Three mainstream classes of human value principles are considered.
(1) HHH. This is one of the most widespread criteria, expecting LLMs to be helpful, honest, and harmless.
(2) Social Norms & Ethics. These can usually be thought of as commonsense rules about behavior accepted by the public, where the basic unit is a descriptive, cultural norm used to judge whether an action is acceptable.
(3) Basic Value Theory. Research on human values originates in the social sciences and ethics, where more fundamental value theories have been established and tested over time, such as the Schwartz Theory of Basic Human Values and Moral Foundations Theory.
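To illustrate how abstract value principles might be operationalized, the sketch below encodes a few Schwartz basic values as a machine-readable rubric that could be inserted into a prompt or an evaluation template; the chosen value subset, descriptions, and template wording are simplified assumptions, not a prescription from the paper.

```python
# Illustrative encoding of a few Schwartz basic values as a rubric; the
# descriptions and prompt template are simplified assumptions meant only to
# show how value principles could steer or evaluate model behavior.
SCHWARTZ_VALUES = {
    "benevolence": "preserve and enhance the welfare of the people one interacts with",
    "universalism": "understanding, tolerance, and protection of all people and nature",
    "security": "safety, harmony, and stability of society and relationships",
    "self-direction": "independent thought and action: choosing, creating, exploring",
}

def build_value_guided_prompt(user_request: str, values: dict[str, str]) -> str:
    """Compose a prompt asking the model to respond while respecting the rubric."""
    rubric = "\n".join(f"- {name}: {desc}" for name, desc in values.items())
    return (
        "Follow these value principles when responding:\n"
        f"{rubric}\n\n"
        f"User request: {user_request}\n"
        "Response:"
    )

print(build_value_guided_prompt("How should I respond to an angry coworker?",
                                SCHWARTZ_VALUES))
```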
Evaluation of Alignment
Existing methods for evaluating LLM alignment with the three goals above can be grouped into benchmarks, human evaluation, evaluation by advanced LLMs, and reward model scoring.
(1) Benchmarks. For alignment with human instructions and preferences, existing benchmarks composed of various NLP tasks or responsible-AI tasks are used. Three newer categories of benchmarks are available for evaluating the specific values of LLMs. The first is safety and risk benchmarks, covering issues that violate the 'HHH' principle observed in recently released LLMs, such as malicious information and illegal advice; these benchmarks typically assess LLMs through a generation task. The second is social norms benchmarks, which provide various life scenarios, judgments, and the social norms they refer to; these questions are usually posed to LLMs as a discriminative task. The third is value surveys or questionnaires originally designed for humans, in the form of self-report or multiple-choice questions.
(2) Human evaluation. Involving human raters in evaluation is a natural approach to uncovering various factors that affect human preferences.
(3) Advanced-LLMs evaluation. With a highly capable LLM (e.g., GPT-4 or Claude) acting as the judge, automatic chatbot arenas can be established to assess LLMs by comparing the responses of two models along multiple aspects, as sketched after this list. Strong LLM judges have been shown to agree with human labelers about as often as humans agree with one another.
(4) Reward model scoring. Using the many manually collected benchmarks with explicit labels for positive and negative behaviors, reward models or value classifiers can be trained, and the scores they return serve as a useful evaluation metric.
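As a rough illustration of the LLM-as-judge setup in (3), the sketch below builds a pairwise comparison prompt and parses the verdict; `call_judge_model` is a hypothetical placeholder for whatever strong LLM serves as the judge, and the prompt wording and parsing rule are assumptions.

```python
# Hypothetical LLM-as-judge sketch: a strong model compares two candidate
# responses and returns a verdict. `call_judge_model` is a placeholder for a
# real LLM call; the prompt and parsing logic are illustrative assumptions.
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to the
user question below on helpfulness, honesty, and harmlessness.
Answer with exactly one letter: "A", "B", or "T" for a tie.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Verdict:"""

def call_judge_model(prompt: str) -> str:
    # Placeholder stand-in: replace with a call to a strong LLM via its SDK.
    # Returning "T" keeps this sketch self-contained and runnable.
    return "T"

def pairwise_judgment(question: str, response_a: str, response_b: str) -> str:
    prompt = JUDGE_TEMPLATE.format(question=question,
                                   response_a=response_a,
                                   response_b=response_b)
    verdict = call_judge_model(prompt).strip().upper()
    return {"A": "model_a", "B": "model_b", "T": "tie"}.get(verdict[:1], "invalid")

print(pairwise_judgment("What is 2 + 2?", "4", "The answer is 4."))
```

Aggregating such pairwise judgments over a benchmark of questions yields win rates between two LLMs; swapping the A/B order across repeated calls helps reduce position bias.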
Challenges and Future Research
Human preferences are typically reflected in implicit human feedback on specific model behaviors. Pursuing this goal therefore mainly aligns surface behaviors, which is weak in terms of comprehensiveness, generalization, and stability. Aligning LLMs with intrinsic value principles rather than countless manifest behaviors offers a promising way to address these challenges. We discuss several possible future research directions to inspire further studies:
(1) It is critical to investigate a more appropriate value system as the ultimate goal of big-model alignment. Such a value system should be scientific, comprehensive enough to handle all situations, stable in extreme cases, and validated as feasible by practical evidence. Two basic value theories, Schwartz's Theory of Basic Human Values and Moral Foundations Theory, are promising candidates, since their comprehensiveness and effectiveness have been verified in the social sciences.
(2) The approach to representing the alignment goal can be enhanced in three respects. The first is generalizability: providing accurate supervision signals that cover arbitrary scenarios, including open-domain and out-of-distribution cases. The second is stability: providing stable and consistent supervision signals in both ordinary and extreme dilemma scenarios, where subtle differences in value priorities can lead to drastically different behaviors. The third is interpretability: the alignment goal should be represented in an interpretable manner.
(3) Automatic evaluation methods and metrics are urgently needed to measure the degree of alignment between LLMs and humans and to accelerate the assessment process. To determine whether LLMs fully align with human values, they should be assessed comprehensively across various difficulty levels.
(4) Developing efficient and stable alignment algorithms that directly align LLMs with human value principles rather than proxy demonstrations is essential for future research. In addition, human values are pluralistic across populations and countries and constantly evolving. Thus, alignment methods are also expected to adapt effectively to varying value principles.