
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

September 15, 2023

🔬 Research Summary by Stephen Casper, an MIT PhD student working on AI interpretability, diagnostics, and safety.

[Original paper by Stephen Casper,* Xander Davies,* Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell]


Overview: Reinforcement Learning from Human Feedback (RLHF) has emerged as the central alignment technique for finetuning state-of-the-art AI systems such as GPT-4, Claude, Bard, and Llama-2. Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems.


Introduction

Reinforcement Learning from Human Feedback (RLHF) is the key technique to train today’s most advanced language models, such as GPT-4, Claude, Bard, and Llama-2. In a matter of months, applications built on these systems have gained user bases well into the hundreds of millions. Given RLHF’s status as the default industry alignment technique, we should carefully evaluate its limitations. In a survey of over 250 papers, we review open challenges and fundamental limitations of RLHF with a focus on applications in large language models. 

Key Insights

Contributions

  1. Concrete challenges with RLHF: We taxonomize and survey problems with RLHF, dividing them into three primary categories: challenges with feedback, challenges with the reward model, and challenges with the policy (see the sketch after this list). We also distinguish between challenges that are relatively tractable versus ones that are more fundamental limitations of alignment with RLHF.
  2. Incorporating RLHF into a broader technical safety framework: We discuss how RLHF is not a complete framework for developing safe AI and highlight additional approaches that can help to better understand, improve, and complement it. 
  3. Governance and transparency: We consider the challenge of improving industry norms and regulations affecting models trained with RLHF. Specifically, we discuss how companies’ disclosure of certain details about how they use RLHF to train AI systems can improve accountability and auditing.
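
To make the three problem categories concrete, the sketch below shows where each one enters the standard RLHF pipeline. It is a minimal illustration in PyTorch with toy tensors, not code from the paper; the feature dimensions, model sizes, and the 0.1 KL coefficient are assumptions chosen only for readability.

```python
# Minimal sketch of the standard RLHF pipeline that the three problem categories
# map onto. Toy tensors stand in for real completions; dimensions, names, and the
# 0.1 KL coefficient are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
reward_model = torch.nn.Linear(4, 1)   # stands in for a learned reward model

# (1) Feedback: pairwise human preferences over completions (here, random features).
#     Problems at this stage include noisy, biased, or unrepresentative labels.
preferred = torch.randn(8, 4)
rejected = torch.randn(8, 4)

# (2) Reward model: fit to the preferences with a Bradley-Terry (pairwise logistic)
#     loss. Any misspecification in the labels propagates into the learned reward.
rm_loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()

# (3) Policy: maximize the learned reward minus a KL penalty to a reference model,
#     as in PPO-style fine-tuning. Reward-model errors can be exploited here
#     ("reward hacking"), which the KL penalty only partially mitigates.
policy_logits = torch.randn(8, 16, requires_grad=True)
reference_logits = policy_logits.detach() + 0.1 * torch.randn(8, 16)
kl = F.kl_div(
    F.log_softmax(reference_logits, dim=-1),   # input: log-probs of the reference
    F.log_softmax(policy_logits, dim=-1),      # target: log-probs of the policy
    log_target=True,
    reduction="batchmean",
)                                              # = KL(policy || reference)
policy_loss = -reward_model(preferred).mean() + 0.1 * kl  # stand-in for policy samples

print(f"reward-model loss: {rm_loss.item():.3f}, policy loss: {policy_loss.item():.3f}")
```

The point of the sketch is that errors introduced at stage (1) are baked into the reward model at stage (2) and can then be amplified by optimization pressure at stage (3).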

Some Problems with RLHF are Fundamental

On some of these problems, technical progress is tractable, and this should be a cause for concerted work and optimism. However, other problems with RLHF cannot be fully solved and must instead be avoided or compensated for with other approaches. Hence, we emphasize the importance of two strategies: (1) evaluating technical progress in light of the fundamental limitations of RLHF and other methods, and (2) addressing the sociotechnical challenges of aligning AI with human values. This will require committing to defense-in-depth safety measures and more openly sharing research findings with the wider scientific community.

RLHF = Rehashing Lessons from Historical Failures?

RLHF offers new capabilities but faces many old problems. Researchers in the safety, ethics, and human-computer interaction fields have demonstrated technical and fundamental challenges with the components of RLHF for decades. In 2023, Paul Christiano, first author of Christiano et al. (2017), the paper that prototyped RLHF, described it as a “basic solution” intended to make it easier to “productively work on more challenging alignment problems” such as debate, recursive reward modeling, etc.

Instead of being used as a stepping stone toward more robust techniques, RLHF’s most prominent impacts have involved advancing AI capabilities. We do not argue that this is inherently good or bad; it raises both benefits and concerns for AI alignment. However, we emphasize that the successes of RLHF should not obscure its limitations or the gaps between the framework under which it is studied and its real-world applications. An approach to AI alignment in high-stakes settings that relies on RLHF without additional safeguards risks doubling down on outdated approaches to AI alignment.

Transparency

A sustained commitment to transparency (e.g., to the public or auditors) would improve safety and accountability. First, disclosing some details behind large RLHF training runs would clarify an organization’s norms for model scrutiny and safety checks. Second, increased transparency about efforts to mitigate risks would improve safety incentives and suggest methods for external stakeholders to hold companies accountable. Third, transparency would improve the AI community’s understanding of RLHF and support the ability to track technical progress on its challenges. Specific policy prescriptions are beyond the paper’s scope, but we discuss the types of details that, if disclosed, could indicate risks; these should be accounted for when auditing AI systems developed using RLHF.
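
As a purely illustrative example of what such disclosures might cover, the checklist below organizes hypothetical audit fields around the same feedback / reward model / policy breakdown used earlier. The field names and structure are assumptions for illustration, not the paper’s proposed standard.

```python
# Purely illustrative: a hypothetical disclosure checklist for an RLHF training
# run, organized around the feedback / reward model / policy stages. Field names
# are assumptions for illustration, not the paper's proposed standard.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class RLHFDisclosure:
    # Feedback stage: who provided preferences and under what instructions.
    annotator_population: str = "undisclosed"
    annotation_instructions_published: bool = False
    # Reward model stage: how the reward model was trained and evaluated.
    reward_model_eval_results_published: bool = False
    # Policy stage: optimization details and post-training evaluations.
    kl_regularization_described: bool = False
    red_teaming_results_published: bool = False

    def undisclosed_items(self) -> list[str]:
        """Return the fields an auditor would flag as missing or undisclosed."""
        return [name for name, value in vars(self).items()
                if value is False or value == "undisclosed"]

# Example: check which details a hypothetical system card omits.
print(RLHFDisclosure(annotation_instructions_published=True).undisclosed_items())
```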

Future Work

Much more basic research and applied work can be done to improve RLHF and integrate it into a more complete agenda for safer AI. We discuss frameworks that can be used to better understand RLHF, techniques that can help solve its challenges, and other alignment strategies that will be important to compensate for its shortcomings. RLHF is still a developing and imperfectly understood technique, so more work to study and improve it will be valuable.

Between the lines

The longer I study RLHF, the more uncertain I feel about whether or not its use for finetuning state-of-the-art deployed LLMs is a sign of health for a safety-focused field. RLHF suffers from many challenges, and deployed LLMs trained with it have exhibited many failures (e.g., revealing private information, hallucination, encoding biases, sycophancy, expressing undesirable preferences, jailbreaking, and adversarial vulnerabilities).

On one hand, these failures are a good thing because identifying them can teach us important lessons about developing more trustworthy AI. On the other hand, many of these failures were not foreseen; they escaped the internal safety evaluations of the companies releasing these systems. This suggests major limitations in our ability to guarantee that state-of-the-art AI is reliable. Companies might not be exercising enough caution. When chatbots fail, the stakes are relatively low. But when more advanced AI systems are deployed in high-stakes settings, hopefully they will not exhibit as many failures as we have seen with RLHF-trained LLMs.

