🔬 Research Summary by Zeqiu Wu and Yushi Hu
Zeqiu Wu is a final-year PhD student at University of Washington, where she works on language models that converse with and learn from information-seeking humans.
Yushi Hu is a 2nd-year PhD student at University of Washington, where he works on large multimodal models.
[Original paper by Zeqiu Wu*, Yushi Hu*, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi]
Overview: This paper explores a new framework called Fine-Grained RLHF that improves how LLMs are trained using human feedback. Instead of just asking people which LLM output they prefer overall, the researchers had annotators label specific parts of outputs by the type of error (e.g., sentence 2 is not truthful). Experiments show this more detailed “fine-grained” feedback allows the LLM to better learn what kinds of outputs people want.
Introduction
Reinforcement learning from human feedback (RLHF) is a crucial training strategy behind state-of-the-art language models (LMs) like GPT-4. Specifically, humans are presented with two or more outputs and asked to select one or rank them, and this signal is then used to train a reward model that provides rewards for training the LM with RL. However, it can be challenging to compare the overall quality of model outputs when they contain a mixture of diverse undesired behaviors, which can lead to unreliable training rewards.
We propose to improve rewards for RLHF-based LM training by using more fine-grained human feedback on LM outputs, where each category of undesired behavior (e.g., false or irrelevant generations) is associated with text spans at a particular density (e.g., the sentence or sub-sentence level). We introduce the fine-grained RLHF framework, which first uses the collected human feedback to train fine-grained reward models, each focusing on one category and providing dense rewards at the density associated with that category. We then integrate these reward models into Proximal Policy Optimization (PPO), a commonly used RL algorithm for training LMs with preference-based human feedback.
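As a rough illustration of the framework, the sketch below shows one way dense rewards from multiple fine-grained reward models could be combined and handed to PPO. The helper names (`combine_segment_rewards`, `token_rewards_for_ppo`), the placement of each reward on a segment's last token, and the KL-penalty coefficient are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, List, Sequence

# A "reward model" here is any function mapping a list of text segments
# (e.g., sentences or sub-sentences) to one score per segment.
RewardModel = Callable[[Sequence[str]], List[float]]

def combine_segment_rewards(
    segments: Sequence[str],
    reward_models: Sequence[RewardModel],
    weights: Sequence[float],
) -> List[float]:
    """Weighted sum of the per-segment scores from each fine-grained reward model."""
    totals = [0.0] * len(segments)
    for model, weight in zip(reward_models, weights):
        for i, score in enumerate(model(segments)):
            totals[i] += weight * score
    return totals

def token_rewards_for_ppo(
    segment_rewards: Sequence[float],
    segment_end_tokens: Sequence[int],
    kl_per_token: Sequence[float],
    beta: float = 0.1,  # illustrative KL coefficient, not the paper's value
) -> List[float]:
    """Place each segment's reward on its final token and subtract a per-token
    KL penalty against the initial policy, as is common in PPO-based RLHF."""
    rewards = [-beta * kl for kl in kl_per_token]
    for reward, token_index in zip(segment_rewards, segment_end_tokens):
        rewards[token_index] += reward
    return rewards
```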
We conduct experiments on two language generation tasks—detoxification and long-form question answering (QA). We empirically show the efficacy and data efficiency of training models with fine-grained rewards compared to a holistic sequence-level reward. We also show that having multiple reward models allows us to combine reward models with different weights, thus controlling the model training process toward a customized combination of desired behaviors.
Key Insights
Efficient LM detoxification with dense rewards
The task of detoxification aims to reduce toxicity in model generations. We use the Perspective API from Google to measure toxicity; it returns a toxicity score between 0 (not toxic) and 1 (toxic). We compare training with a holistic reward versus dense, sentence-level rewards for (non-)toxicity: we query the API for a toxicity score either for the whole generated sequence or for each individual sentence. Fine-grained RLHF with sentence-level rewards attains much lower toxicity than holistic RLHF, while also achieving lower perplexity (used to approximate generation fluency; lower is better). We also show that learning from denser, fine-grained rewards is more sample efficient than learning from a holistic reward. One explanation is that a fine-grained reward pinpoints where the toxic content is, which is a stronger training signal than a single scalar reward for the whole generation.
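The contrast between the two reward schemes can be sketched as follows, assuming a hypothetical `perspective_toxicity(text)` helper that wraps the Perspective API call and a simple sentence splitter; this is a simplification for illustration, not the exact reward definition used in the experiments.

```python
from typing import Callable, List

def holistic_reward(
    generation: str,
    perspective_toxicity: Callable[[str], float],  # hypothetical API wrapper
) -> float:
    """One scalar reward for the entire generation: 1 - toxicity of the full text."""
    return 1.0 - perspective_toxicity(generation)

def sentence_level_rewards(
    generation: str,
    perspective_toxicity: Callable[[str], float],
    split_sentences: Callable[[str], List[str]],  # hypothetical sentence splitter
) -> List[float]:
    """One reward per sentence, so training can localize where toxic content appears."""
    return [1.0 - perspective_toxicity(s) for s in split_sentences(generation)]
```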
Fine-grained feedback and reward models for long-form QA
We collect QA-Feedback, a long-form question answering dataset with human preferences and fine-grained feedback. QA-Feedback is based on ASQA, a dataset that focuses on answering ambiguous factoid questions. There are three types of fine-grained human feedback, and we train a fine-grained reward model for each of them (a sketch of how they produce rewards follows the list):
C1: irrelevance, repetition, and incoherence (rel.). This reward model operates at the sub-sentence level, i.e., it returns a score for each sub-sentence. If the sub-sentence is irrelevant, repetitive, or incoherent, the reward is -1; otherwise, the reward is +1.
C2: incorrect or unverifiable facts (fact.). This reward model operates at the sentence level, i.e., it returns a score for each sentence. If the sentence contains any factual error, the reward is -1; otherwise, the reward is +1.
C3: incomplete information (comp.). This reward model checks whether the response is complete, i.e., covers all the information in the reference passages that is related to the question. It gives a single reward for the whole response.
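Below is a minimal sketch of rewards at these three densities. The classifier and splitter functions (`is_relevant`, `is_factual`, `completeness_score`, `split_sub_sentences`, `split_sentences`) stand in for the trained reward models and preprocessing described above; the names are hypothetical, not the paper's code.

```python
from typing import Callable, List

def relevance_rewards(
    response: str,
    split_sub_sentences: Callable[[str], List[str]],
    is_relevant: Callable[[str], bool],  # stands in for the C1 reward model
) -> List[float]:
    """C1 (rel.): +1 or -1 for each sub-sentence."""
    return [1.0 if is_relevant(s) else -1.0 for s in split_sub_sentences(response)]

def factuality_rewards(
    response: str,
    split_sentences: Callable[[str], List[str]],
    is_factual: Callable[[str], bool],  # stands in for the C2 reward model
) -> List[float]:
    """C2 (fact.): +1 or -1 for each sentence."""
    return [1.0 if is_factual(s) else -1.0 for s in split_sentences(response)]

def completeness_reward(
    response: str,
    reference_passages: List[str],
    completeness_score: Callable[[str, List[str]], float],  # stands in for C3
) -> float:
    """C3 (comp.): a single scalar reward for the whole response."""
    return completeness_score(response, reference_passages)
```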
Effective fine-grained RLHF for long-form QA
We compare fine-grained RLHF with the initial LM supervised fine-tuned on 1K examples before RL training (SFT), RLHF that uses a human preference-based reward (Preference RLHF), and an LM supervised fine-tuned on all training examples (SFT-Full). Human evaluation shows that fine-grained RLHF outperforms SFT and Preference RLHF on all three error types. We also find that RLHF (both preference-based and fine-grained) is particularly effective at reducing factual errors.
Customization of LM behaviors with multiple reward models
Changing the weights of the reward models during RL training leads to different LM behaviors, which enables LM behavior customization. For example, we show in the paper that by changing the weight of the relevance reward model (see C1 above) while keeping the weights of the other two reward models fixed, we can customize how detailed and lengthy the LM responses are.
We also find a trade-off between the three reward models. The relevance reward model prefers more concise responses, while the information completeness reward model prefers longer, more informative responses. These two rewards therefore compete against each other during training and eventually reach a balance. Meanwhile, the factuality reward model continuously improves the factual correctness of the response. Finally, removing any one of the reward models degrades performance.
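In practice, customization then amounts to re-running RL training with a different weight on the relevance reward. The weight values below are made-up placeholders that illustrate the idea, not configurations reported in the paper.

```python
# Hypothetical (w_rel, w_fact, w_comp) settings: only the relevance weight changes.
# A higher relevance weight pushes toward shorter, more focused responses; a lower
# one lets the completeness reward favor longer, more detailed answers.
weight_configs = {
    "concise":  {"w_rel": 0.6, "w_fact": 0.3, "w_comp": 0.3},
    "balanced": {"w_rel": 0.4, "w_fact": 0.3, "w_comp": 0.3},
    "detailed": {"w_rel": 0.2, "w_fact": 0.3, "w_comp": 0.3},
}
```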
Between the lines
The fine-grained RLHF framework can be applied to any text generation task to enhance LM performance by offering more nuanced training rewards. It allows LMs to be trained against any combination of reward models as desired; for example, future work involving fact-checking, sentiment classification, and toxicity detection, among others, can all be incorporated within this framework. As discussed above, fine-grained RLHF also allows LM behavior customization, a benefit that is particularly valuable for applications like educational tools where model personalization is crucial.
Interesting future work includes defining or designing a framework to obtain fine-grained rewards that can be generalized to various generation tasks. In addition, we carefully control the quality of annotated feedback, while in practice, end users of a deployed model don’t always give clean feedback. Therefore, how to extract effective learning signals from noisy human feedback or even natural language feedback in the wild still needs further investigation. Finally, the emerging capability of large language models to do self-reflection can potentially reduce feedback annotation costs by having models generate feedback for themselves.