
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

September 15, 2023

🔬 Research Summary by Stephen Casper, an MIT PhD student working on AI interpretability, diagnostics, and safety.

[Original paper by Stephen Casper,* Xander Davies,* Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell]


Overview: Reinforcement Learning from Human Feedback (RLHF) has emerged as the central alignment technique for finetuning state-of-the-art AI systems such as GPT-4, Claude, Bard, and Llama-2. Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems.


Introduction

Reinforcement Learning from Human Feedback (RLHF) is the key technique to train today’s most advanced language models, such as GPT-4, Claude, Bard, and Llama-2. In a matter of months, applications built on these systems have gained user bases well into the hundreds of millions. Given RLHF’s status as the default industry alignment technique, we should carefully evaluate its limitations. In a survey of over 250 papers, we review open challenges and fundamental limitations of RLHF with a focus on applications in large language models. 
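
To ground the discussion, here is a minimal sketch of the two learned components in a standard RLHF pipeline: a reward model fit to pairwise human preferences, and a policy optimized against that reward under a KL penalty toward the pretrained reference model. The function names, tensor shapes, and the beta coefficient are illustrative assumptions, not details drawn from the paper.

    import torch
    import torch.nn.functional as F

    # Reward modeling: fit scalar scores so the human-preferred completion
    # scores higher than the rejected one (a Bradley-Terry style pairwise loss).
    def reward_model_loss(score_chosen: torch.Tensor,
                          score_rejected: torch.Tensor) -> torch.Tensor:
        return -F.logsigmoid(score_chosen - score_rejected).mean()

    # Policy optimization: maximize the learned reward while a KL penalty
    # keeps the policy close to the pretrained reference model.
    # reward: (batch,) scalar per sequence; logprobs: (batch, seq) per token.
    def shaped_return(reward: torch.Tensor,
                      logprob_policy: torch.Tensor,
                      logprob_ref: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
        per_token_kl = logprob_policy - logprob_ref  # log-ratio, a per-token KL estimate
        return (reward - beta * per_token_kl.sum(dim=-1)).mean()

In practice, the penalized return above would be maximized with an RL algorithm such as PPO. Many of the challenges surveyed in the paper live in exactly these two steps: whose preferences the reward model encodes, and what the policy learns to exploit in it.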

Key Insights

Contributions

  1. Concrete challenges with RLHF: We taxonomize and survey problems with RLHF, dividing them into three primary categories: challenges with feedback, challenges with the reward model, and challenges with the policy. We also distinguish between challenges that are relatively tractable versus ones that are more fundamental limitations of alignment with RLHF.
  2. Incorporating RLHF into a broader technical safety framework: We discuss how RLHF is not a complete framework for developing safe AI and highlight additional approaches that can help to better understand, improve, and complement it. 
  3. Governance and transparency: We consider the challenge of improving industry norms and regulations affecting models trained with RLHF. Specifically, we discuss how companies’ disclosure of certain details about how they use RLHF to train AI systems can improve accountability and auditing.

Some Problems with RLHF are Fundamental

On some of these problems, technical progress is tractable, and this should be seen as a cause for concerted work and optimism. However, other problems with RLHF cannot fully be solved and must instead be avoided or compensated for with other approaches. Hence, we emphasize the importance of two strategies: (1) evaluating technical progress in light of the fundamental limitations of RLHF and other methods, and (2) addressing the sociotechnical challenges of aligning AI with human values. This will require committing to defense-in-depth safety measures and more openly sharing research findings with the wider scientific community.

RLHF = Rehashing Lessons from Historical Failures?

RLHF offers new capabilities but faces many old problems. For decades, researchers in the safety, ethics, and human-computer interaction fields have demonstrated technical and fundamental challenges with the components of RLHF. In 2023, Paul Christiano, first author of the 2017 paper that prototyped RLHF (Christiano et al., 2017), described it as a “basic solution” intended to make it easier to “productively work on more challenging alignment problems” such as debate and recursive reward modeling.

Instead of serving as a stepping stone toward more robust techniques, RLHF’s most prominent impacts have involved advancing AI capabilities. We do not argue that this is inherently good or bad; it brings both benefits and concerns for AI alignment. However, we emphasize that the successes of RLHF should not obfuscate its limitations or the gaps between the framework under which it is studied and its real-world applications. An approach to AI alignment in high-stakes settings that relies on RLHF without additional safeguards risks doubling down on outdated approaches to AI alignment.

Transparency

A sustained commitment to transparency (e.g., to the public or auditors) would improve safety and accountability. First, disclosing some details behind large RLHF training runs would clarify an organization’s norms for model scrutiny and safety checks. Second, increased transparency about efforts to mitigate risks would improve safety incentives and give external stakeholders ways to hold companies accountable. Third, transparency would improve the AI community’s understanding of RLHF and support the ability to track technical progress on its challenges. Concrete policy prescriptions are beyond the paper’s scope, but we discuss the types of details that, if disclosed, could indicate risks. These should be accounted for when auditing AI systems developed using RLHF.
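
As a purely illustrative sketch, a disclosure record for an RLHF training run might cover items like the following; the field names and descriptions are assumptions for illustration, not the specific schema proposed in the paper.

    # Hypothetical disclosure record for an RLHF training run; the fields are
    # illustrative assumptions, not the specific items enumerated in the paper.
    rlhf_disclosure = {
        "pretraining": "description of the base model and its training data",
        "human_feedback": {
            "annotator_pool": "who provided feedback, how they were selected and compensated",
            "instructions": "what annotators were asked to prefer or penalize",
        },
        "reward_model": "architecture, training data, and evaluation of the learned reward",
        "policy_training": "RL algorithm, KL penalty, and other optimization details",
        "internal_evaluations": ["red-teaming results", "bias and safety audits"],
    }

A record along these lines would give auditors concrete artifacts to check against the risks the paper discusses, rather than relying on high-level claims about safety practices.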

Future Work

Much more basic research and applied work can be done to improve RLHF and integrate it into a more complete agenda for safer AI. We discuss frameworks that can be used to better understand RLHF, techniques that can help solve its challenges, and other alignment strategies that will be important for compensating for its shortcomings. RLHF is still a developing and imperfectly understood technique, so further work to study and improve it will be valuable.

Between the lines

The longer I study RLHF, the more uncertain I feel about whether its use for finetuning state-of-the-art deployed LLMs is a sign of health in a safety-focused field. RLHF suffers from many challenges, and deployed LLMs trained with it have exhibited many failures (e.g., revealing private information, hallucination, encoding biases, sycophancy, expressing undesirable preferences, jailbreaking, and adversarial vulnerabilities).

On one hand, these failures are a good thing because identifying them can teach us important lessons about developing more trustworthy AI. On the other hand, many of these failures were not foreseen; they escaped internal safety evaluations by the companies releasing these systems. This suggests major limitations in our ability to guarantee that state-of-the-art AI is reliable. Companies might not be exercising enough caution. When chatbots fail, the stakes are relatively low. But when more advanced AI systems are deployed in high-stakes settings, we should hope they do not exhibit as many failures as we have seen from RLHF-trained LLMs.

