• Skip to main content
  • Skip to secondary menu
  • Skip to primary sidebar
  • Skip to footer
Montreal AI Ethics Institute

Montreal AI Ethics Institute

Democratizing AI ethics literacy

  • Articles
    • Public Policy
    • Privacy & Security
    • Human Rights
      • Ethics
      • JEDI (Justice, Equity, Diversity, Inclusion
    • Climate
    • Design
      • Emerging Technology
    • Application & Adoption
      • Health
      • Education
      • Government
        • Military
        • Public Works
      • Labour
    • Arts & Culture
      • Film & TV
      • Music
      • Pop Culture
      • Digital Art
  • Columns
    • AI Policy Corner
    • Recess
    • Tech Futures
  • The AI Ethics Brief
  • AI Literacy
    • Research Summaries
    • AI Ethics Living Dictionary
    • Learning Community
  • The State of AI Ethics Report
    • State of AI Ethics Report Volume 8 (2026): Call for Contributors
    • Volume 7 (November 2025)
    • Volume 6 (February 2022)
    • Volume 5 (July 2021)
    • Volume 4 (April 2021)
    • Volume 3 (Jan 2021)
    • Volume 2 (Oct 2020)
    • Volume 1 (June 2020)
  • About
    • Our Contributions Policy
    • Our Open Access Policy
    • Contact
    • Donate

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

September 15, 2023

🔬 Research Summary by Stephen Casper, an MIT PhD student working on AI interpretability, diagnostics, and safety.

[Original paper by Stephen Casper,* Xander Davies,* Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell]


Overview: Reinforcement Learning from Human Feedback (RLHF) has emerged as the central alignment technique for finetuning state-of-the-art AI systems such as GPT-4, Claude, Bard, and Llama-2. Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems.


Introduction

Reinforcement Learning from Human Feedback (RLHF) is the key technique to train today’s most advanced language models, such as GPT-4, Claude, Bard, and Llama-2. In a matter of months, applications built on these systems have gained user bases well into the hundreds of millions. Given RLHF’s status as the default industry alignment technique, we should carefully evaluate its limitations. In a survey of over 250 papers, we review open challenges and fundamental limitations of RLHF with a focus on applications in large language models. 

Key Insights

Contributions

  1. Concrete challenges with RLHF: We taxonomize and survey problems with RLHF, dividing them into three primary categories: challenges with feedback, challenges with the reward model, and challenges with the policy. We also distinguish between challenges that are relatively tractable versus ones that are more fundamental limitations of alignment with RLHF.
  2. Incorporating RLHF into a broader technical safety framework: We discuss how RLHF is not a complete framework for developing safe AI and highlight additional approaches that can help to better understand, improve, and complement it. 
  3. Governance and transparency: We consider the challenge of improving industry norms and regulations affecting models trained with RLHF. Specifically, we discuss how companies’ disclosure of certain details using RLHF to train AI systems can improve accountability and auditing.

Some Problems with RLHF are Fundamental

In some respects, technical progress is tractable, and this should be seen as a cause for concerted work and optimism. However, other problems with RLHF cannot fully be solved and instead must be avoided or compensated for with other approaches. Hence, we emphasize the importance of two strategies: (1) evaluating technical progress in light of the fundamental limitations of RLHF and other methods and (2) addressing the sociotechnical challenges of aligning with human values. This will require committing to defense-in-depth safety measures and more openly sharing research findings with the wider scientific community.

RLHF = Rehashing Lessons from Historical Failures?

RLHF offers new capabilities but faces many old problems. Researchers in the safety, ethics, and human-computer interaction fields have demonstrated technical and fundamental challenges with the components of RLHF for decades. In 2023, Paul Christiano (the first author of the 2017 paper, Christiano et al. (2017), prototyping RLHF) described it as a “basic solution” intended to make it easier to “productively work on more challenging alignment problems” such as debate, recursive reward modeling, etc. 

Instead of being used as a stepping stone toward more robust techniques, RLHF’s most prominent impacts have involved advancing AI capabilities. We do not argue that this is good or bad – this raises benefits and concerns for AI alignment. However, we emphasize that the successes of RLHF should not obfuscate its limitations or gaps between the framework under which it is studied and real-world applications. An approach to AI alignment in high-stakes settings that relies on RLHF without additional safeguards risks doubling down on outdated approaches to AI alignment.

Transparency

A sustained commitment to transparency (e.g., to the public or auditors) would improve safety and accountability. First, disclosing some details behind large RLHF training runs would clarify an organization’s norms for model scrutiny and safety checks. Second, increased transparency about efforts to mitigate risks would improve safety incentives and suggest methods for external stakeholders to hold companies accountable. Third, transparency would improve the AI community’s understanding of RLHF and support the ability to track technical progress on its challenges. Specific policy prescriptions are beyond the paper’s scope, but we discuss specific types of details that, if disclosed, could indicate risks. These should be accounted for when auditing AI systems developed using RLHF. 

Future Work

Much more basic research and applied work can be done to improve RLHF and integrate it into a more complete agenda for safer AI. We discuss frameworks that can be used to better understand RLHF, techniques that can help solve challenges and other alignment strategies that will be important to compensate for its shortcomings. RLHF is still a developing and imperfectly understood technique, so more work to study and improve will be valuable. 

Between the lines

The longer I study RLHF, the more uncertain I feel about whether or not its use for finetuning state-of-the-art deployed LLMs is a sign of health from a safety-focused field. RLHF suffers many challenges, and deployed LLMs trained with it have exhibited many failures (e.g., revealing private information, hallucination, encoding biases, sycophancy, expressing undesirable preferences, jailbreaking, and adversarial vulnerabilities). 

On one hand, these failures are a good thing because identifying them can teach us important lessons about developing more trustworthy AI. On the other hand, many of these failures were not foreseen – they escaped internal safety evaluations by the companies releasing these systems. This suggests major limitations in guaranteeing that state-of-the-art AI is reliable. Companies might not be exercising enough caution. When chatbots fail, the stakes are relatively low. But when more advanced AI systems are deployed in high-stakes settings, hopefully, they will not have as many failures as we have seen with RLHF-ed LLMs.

Want quick summaries of the latest research & reporting in AI ethics delivered to your inbox? Subscribe to the AI Ethics Brief. We publish bi-weekly.

Primary Sidebar

SAIER Volume 8 (2026)

SAIER Volume 8 (2026) Call for Contributors

🔍 SEARCH

Spotlight

Vertically- and horizontally-placed chess boards and chess pieces

Tech Futures: At the Frontier of Fear, Uncertainty and Doubt

Tech Futures: Introducing the Resist List

An abstract spiral of dark circles appears at the centre, resembling a tornado. Several vintage magazine covers and advertisements are being drawn toward the spiral. The artworks that have already been pulled into it are becoming distorted and replaced with clusters of numbers representing their numerical embeddings.

Tech Futures: Better Imagination for Better Tech Futures

This image is a collage with a colourful Japanese vintage landscape showing a mountain, hills, flowers and other plants and a small stream. There are 3 large black data servers placed in the bottom half of the image, with a cloud of black smoke emitting from them, partly obscuring the scenery.

Tech Futures: Crafting Participatory Tech Futures

A network diagram with lots of little emojis, organised in clusters.

Tech Futures: AI For and Against Knowledge

related posts

  • Exploring the Subtleties of Privacy Protection in Machine Learning Research in QuĂ©bec 

    Exploring the Subtleties of Privacy Protection in Machine Learning Research in Québec 

  • The State of AI Ethics Report (Oct 2020)

    The State of AI Ethics Report (Oct 2020)

  • Regulating computer vision & the ongoing relevance of AI ethics

    Regulating computer vision & the ongoing relevance of AI ethics

  • Understanding the Effect of Counterfactual Explanations on Trust and Reliance on AI for Human-AI Col...

    Understanding the Effect of Counterfactual Explanations on Trust and Reliance on AI for Human-AI Col...

  • The E.U.’s Artificial Intelligence Act: An Ordoliberal Assessment

    The E.U.’s Artificial Intelligence Act: An Ordoliberal Assessment

  • In Consideration of Indigenous Data Sovereignty: Data Mining as a Colonial Practice

    In Consideration of Indigenous Data Sovereignty: Data Mining as a Colonial Practice

  • Confidence-Building Measures for Artificial Intelligence

    Confidence-Building Measures for Artificial Intelligence

  • Knowing Your Annotator: Rapidly Testing the Reliability of Affect Annotation

    Knowing Your Annotator: Rapidly Testing the Reliability of Affect Annotation

  • Predatory Medicine: Exploring and Measuring the Vulnerability of Medical AI to Predatory Science

    Predatory Medicine: Exploring and Measuring the Vulnerability of Medical AI to Predatory Science

  • Why AI Ethics Is a Critical Theory

    Why AI Ethics Is a Critical Theory

Partners

  •  
    U.S. Artificial Intelligence Safety Institute Consortium (AISIC) at NIST

  • Partnership on AI

  • The LF AI & Data Foundation

  • The AI Alliance

Footer


Articles

Columns

AI Literacy

The State of AI Ethics Report


 

About Us


Founded in 2018, the Montreal AI Ethics Institute (MAIEI) is an international non-profit organization equipping citizens concerned about artificial intelligence and its impact on society to take action.

Contact

Donate


  • © 2025 MONTREAL AI ETHICS INSTITUTE.
  • This work is licensed under a Creative Commons Attribution 4.0 International License.
  • Learn more about our open access policy here.
  • Creative Commons License

    Save hours of work and stay on top of Responsible AI research and reporting with our bi-weekly email newsletter.