Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

July 29, 2023

🔬 Research Summary by Zeqiu Wu and Yushi Hu

Zeqiu Wu is a final-year PhD student at the University of Washington, where she works on language models that converse with and learn from information-seeking humans.

Yushi Hu is a second-year PhD student at the University of Washington, where he works on large multimodal models.

[Original paper by Zeqiu Wu*, Yushi Hu*, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi]


Overview: This paper explores a new framework called Fine-Grained RLHF that improves how LLMs are trained using human feedback. Instead of just asking people which LLM output they prefer overall, the researchers had annotators label specific parts of outputs by the type of error (e.g., sentence 2 is not truthful). Experiments show this more detailed “fine-grained” feedback allows the LLM to better learn what kinds of outputs people want.


Introduction

Reinforcement learning from human feedback (RLHF) is a crucial training strategy behind state-of-the-art language models (LMs) like GPT-4. Specifically, humans are presented with two or more outputs and asked to select one or rank them, and this signal is then used to train a reward model that provides rewards for training the LM with RL. However, it can be challenging to compare the overall quality of model outputs when they contain a mixture of diverse undesired behaviors, which can lead to unreliable training rewards.

We propose to improve rewards for LM training via RLHF by using more fine-grained human feedback on LM outputs: each category of undesired behavior (e.g., false or irrelevant generations) is associated with a text span at a particular density (e.g., sentence or sub-sentence level). We introduce the Fine-Grained RLHF framework, which first uses the collected human feedback to train fine-grained reward models, each focusing on one category and providing dense rewards at the density associated with that category. We then integrate these reward models into Proximal Policy Optimization (PPO), a commonly used RL algorithm for training LMs with preference-based human feedback.
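To make the contrast with holistic RLHF concrete, here is a minimal illustrative sketch (hypothetical function names, not the paper's released code) of how a single sequence-level reward versus dense, span-level rewards might be laid out over a generated token sequence before being passed to PPO:

```python
# Illustrative sketch only: how a holistic reward vs. dense span-level rewards
# could be arranged over a generated token sequence for RL training.

def holistic_rewards(num_tokens: int, sequence_score: float) -> list[float]:
    """A single scalar reward, credited at the final token of the generation."""
    rewards = [0.0] * num_tokens
    rewards[-1] = sequence_score
    return rewards

def fine_grained_rewards(num_tokens: int,
                         segment_ends: list[int],
                         segment_scores: list[float]) -> list[float]:
    """Dense rewards: each segment (e.g., sentence or sub-sentence) gets its
    own score, credited at the token index that ends that segment."""
    rewards = [0.0] * num_tokens
    for end, score in zip(segment_ends, segment_scores):
        rewards[end] += score
    return rewards

# Example: a 12-token generation whose two sentences end at tokens 5 and 11.
print(holistic_rewards(12, sequence_score=0.3))
print(fine_grained_rewards(12, segment_ends=[5, 11], segment_scores=[1.0, -1.0]))
```

The dense variant tells the policy which part of the output earned (or lost) reward, rather than spreading credit for a single score over the whole sequence.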

We conduct experiments on two language generation tasks—detoxification and long-form question answering (QA). We empirically show the efficacy and data efficiency of training models with fine-grained rewards compared to a holistic sequence-level reward. We also show that having multiple reward models allows us to combine reward models with different weights, thus controlling the model training process toward a customized combination of desired behaviors.

Key Insights

Efficient LM detoxification with dense rewards

The task of detoxification aims to reduce toxicity in model generations. We use Google's Perspective API to measure toxicity; it returns a toxicity score between 0 (not toxic) and 1 (toxic). We compare training with a holistic reward versus dense sentence-level rewards for (non-)toxicity. In other words, we query the API for a toxicity score either for the whole generated sequence or for each individual sentence. Fine-grained RLHF with sentence-level rewards attains much lower toxicity than holistic RLHF, while also achieving lower perplexity (a proxy for generation fluency; lower is better). We also show that learning from denser fine-grained rewards is more sample efficient than learning from a holistic reward. One explanation is that a fine-grained reward locates where the toxic content is, which is a stronger training signal than a single scalar reward for the whole generation.
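As an illustration of this setup (the wrapper below is a placeholder, not an actual Perspective API client), a sketch of the two reward schemes:

```python
# Sketch of holistic vs. sentence-level (non-)toxicity rewards.
# `perspective_toxicity` is a stand-in for a call to Google's Perspective API,
# which returns a toxicity score in [0, 1]; all names here are hypothetical.

def perspective_toxicity(text: str) -> float:
    """Placeholder for a Perspective API request (0 = not toxic, 1 = toxic)."""
    return 0.0  # dummy value so the sketch runs; replace with a real API call

def holistic_detox_reward(generation: str) -> float:
    # A single reward for the entire generation: higher means less toxic.
    return 1.0 - perspective_toxicity(generation)

def sentence_level_detox_rewards(sentences: list[str]) -> list[float]:
    # One reward per sentence, which tells the policy *where* toxicity occurs.
    return [1.0 - perspective_toxicity(s) for s in sentences]

print(holistic_detox_reward("Sentence one. Sentence two."))
print(sentence_level_detox_rewards(["Sentence one.", "Sentence two."]))
```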

Fine-grained feedback and reward models for long-form QA

We collect QA-Feedback, a long-form question answering dataset with both human preferences and fine-grained feedback. QA-Feedback is based on ASQA, a dataset focused on answering ambiguous factoid questions. There are three types of fine-grained human feedback, and we train a fine-grained reward model for each of them:

C1: irrelevance, repetition, and incoherence (rel.). This reward model operates at the sub-sentence level, i.e., it returns a score for each sub-sentence. If a sub-sentence is irrelevant, repetitive, or incoherent, the reward is -1; otherwise, the reward is +1.

C2: incorrect or unverifiable facts (fact.). This reward model operates at the sentence level, i.e., it returns a score for each sentence. If a sentence contains any factual error, the reward is -1; otherwise, the reward is +1.

C3: incomplete information (comp.). This reward model checks whether the response is complete, i.e., covers all the information in the reference passages that is relevant to the question. It gives one reward for the whole response.
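A minimal sketch of how these three reward types attach scores at their respective densities (the functions and inputs below are illustrative placeholders; in the paper the reward models are learned from the annotated feedback and predict these labels themselves):

```python
# Illustrative only: the trained reward models predict error labels from text;
# here the labels are taken as given to show the three reward densities.

def relevance_rewards(subsentence_has_error: list[bool]) -> list[float]:
    """C1 (rel.): one reward per sub-sentence; -1 if irrelevant, repetitive,
    or incoherent, +1 otherwise."""
    return [-1.0 if err else 1.0 for err in subsentence_has_error]

def factuality_rewards(sentence_has_error: list[bool]) -> list[float]:
    """C2 (fact.): one reward per sentence; -1 if it contains any factual
    error, +1 otherwise."""
    return [-1.0 if err else 1.0 for err in sentence_has_error]

def completeness_reward(coverage_score: float) -> float:
    """C3 (comp.): a single reward for the whole response, reflecting how much
    question-relevant information from the reference passages it covers."""
    return coverage_score

# Example: a 3-sentence response split into 5 sub-sentences.
print(relevance_rewards([False, False, True, False, False]))  # [1, 1, -1, 1, 1]
print(factuality_rewards([False, True, False]))               # [1, -1, 1]
print(completeness_reward(0.8))                               # one score per response
```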

Effective fine-grained RLHF for long-form QA

We compare fine-grained RLHF with three baselines: the initial LM supervised fine-tuned on 1K examples before RL training (SFT), RLHF with a human preference-based reward (Preference RLHF), and the LM supervised fine-tuned on all training examples (SFT-Full). Human evaluation shows that fine-grained RLHF outperforms SFT and Preference RLHF on all three error types. In addition, RLHF (both preference-based and fine-grained) is particularly effective at reducing factual errors.

Customization of LM behaviors with multiple reward models

We claim that changing the weights of the reward models during RL training can lead to different LM behaviors, enabling customization of LM behavior. For example, we show in the paper that by changing the weight of the relevance reward model (see C1 above) while keeping the weights of the other two reward models fixed, we can control how detailed and lengthy the LM's responses are.

We also find a trade-off between the three reward models. The relevance reward model prefers more concise responses, while the information completeness reward model prefers longer, more informative responses; these two rewards compete during training and eventually reach a balance. Meanwhile, the factuality reward model continuously improves the factual correctness of the response. Finally, removing any one of the reward models degrades performance.
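Mechanically, this customization comes down to a weighted combination of the reward models' outputs before they are fed to PPO. A rough sketch (the weight values and names below are illustrative, not the paper's configurations):

```python
# Illustrative only: combining the three reward models' outputs with tunable
# weights. The weight values are made up, not taken from the paper.

def combined_reward(r_rel: float, r_fact: float, r_comp: float,
                    w_rel: float, w_fact: float, w_comp: float) -> float:
    """Weighted sum of relevance, factuality, and completeness rewards.
    Raising w_rel favors concise responses; raising w_comp favors longer,
    more complete ones."""
    return w_rel * r_rel + w_fact * r_fact + w_comp * r_comp

# Two hypothetical configurations applied to the same segment-level rewards.
concise_cfg  = dict(w_rel=0.6, w_fact=0.5, w_comp=0.2)
detailed_cfg = dict(w_rel=0.2, w_fact=0.5, w_comp=0.6)
print(combined_reward(1.0, 1.0, 0.4, **concise_cfg))
print(combined_reward(1.0, 1.0, 0.4, **detailed_cfg))
```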

Between the lines

The fine-grained RLHF framework can be applied to any text generation task to enhance LM performance by offering more nuanced training rewards, and it allows LMs to be trained against any set of reward models. For example, future work on fact-checking, sentiment classification, and toxicity detection, among others, could all be incorporated within this framework. As discussed above, fine-grained RLHF also allows LM behavior customization, which is particularly valuable for applications like educational tools where model personalization is crucial.

Interesting future work includes designing a framework for obtaining fine-grained rewards that generalizes across generation tasks. In addition, we carefully control the quality of annotated feedback, whereas in practice, end users of a deployed model don't always give clean feedback. How to extract effective learning signals from noisy human feedback, or even from natural language feedback in the wild, therefore needs further investigation. Finally, the emerging ability of large language models to self-reflect could reduce feedback annotation costs by having models generate feedback for themselves.

