🔬 Research Summary by Stephen Casper, an MIT PhD student working on AI interpretability, diagnostics, and safety.
[Original paper by Stephen Casper,* Xander Davies,* Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell]
Overview: Reinforcement Learning from Human Feedback (RLHF) has emerged as the central alignment technique for finetuning state-of-the-art AI systems such as GPT-4, Claude, Bard, and Llama-2. Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems.
Introduction
Reinforcement Learning from Human Feedback (RLHF) is the key technique used to train today’s most advanced language models, such as GPT-4, Claude, Bard, and Llama-2. In a matter of months, applications built on these systems have gained user bases well into the hundreds of millions. Given RLHF’s status as the default industry alignment technique, we should carefully evaluate its limitations. In a survey of over 250 papers, we review open challenges and fundamental limitations of RLHF with a focus on applications in large language models.
Key Insights
Contributions
- Concrete challenges with RLHF: We taxonomize and survey problems with RLHF, dividing them into three primary categories: challenges with feedback, challenges with the reward model, and challenges with the policy (the three stages of the RLHF pipeline, sketched after this list). We also distinguish between challenges that are relatively tractable and ones that are more fundamental limitations of alignment with RLHF.
- Incorporating RLHF into a broader technical safety framework: We discuss how RLHF is not a complete framework for developing safe AI and highlight additional approaches that can help to better understand, improve, and complement it.
- Governance and transparency: We consider the challenge of improving industry norms and regulations affecting models trained with RLHF. Specifically, we discuss how companies’ disclosure of certain details about their use of RLHF to train AI systems can improve accountability and auditing.
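The three categories above map onto the three stages of the RLHF pipeline, and a minimal sketch of that pipeline may help make them concrete: collect pairwise human feedback, fit a reward model to it with a Bradley-Terry style pairwise loss, then use the learned reward to steer the policy. Everything below is an illustrative stand-in rather than the paper’s implementation or any real library’s API: toy_policy mimics an LLM, the “labeler” is a brevity-preferring heuristic, and best-of-n reranking takes the place of the PPO step used in practice.

```python
# Toy sketch of the three RLHF stages referenced above. All names here
# (toy_policy, collect_preferences, fit_reward_model, improved_policy)
# are illustrative placeholders, not the paper's method or a real API.
import math
import random
from dataclasses import dataclass

random.seed(0)

@dataclass
class Comparison:
    prompt: str
    chosen: str    # response the labeler preferred
    rejected: str  # response the labeler rejected

def toy_policy(prompt: str) -> str:
    """Stand-in for an LLM: returns a response of random verbosity."""
    return prompt + " " + "word " * random.randint(1, 20)

# Stage 1 -- human feedback: labelers compare pairs of sampled responses.
def collect_preferences(prompts):
    data = []
    for p in prompts:
        a, b = toy_policy(p), toy_policy(p)
        # A human would judge these; here the "labeler" is a heuristic
        # that always prefers the shorter response.
        chosen, rejected = (a, b) if len(a) < len(b) else (b, a)
        data.append(Comparison(p, chosen, rejected))
    return data

# Stage 2 -- reward model: fit a scalar reward so preferred responses
# score higher, using the Bradley-Terry pairwise logistic loss.
def fit_reward_model(data, steps=2000, lr=0.05):
    w = 0.0  # reward(response) = w * len(response): one toy feature
    for _ in range(steps):
        c = random.choice(data)
        margin = w * (len(c.chosen) - len(c.rejected))
        # Gradient of -log(sigmoid(margin)) with respect to the margin
        grad = -1.0 / (1.0 + math.exp(margin))
        w -= lr * grad * (len(c.chosen) - len(c.rejected))
    return lambda response: w * len(response)

# Stage 3 -- policy: steer generation toward high learned reward.
# Real pipelines fine-tune with PPO plus a KL penalty to a reference
# model; best-of-n reranking is the simplest stand-in.
def improved_policy(prompt, reward, n=8):
    candidates = [toy_policy(prompt) for _ in range(n)]
    return max(candidates, key=reward)

if __name__ == "__main__":
    prefs = collect_preferences(["Explain RLHF briefly."] * 50)
    reward = fit_reward_model(prefs)
    print(improved_policy("Explain RLHF briefly.", reward))
```

Best-of-n reranking is used here only to keep the sketch self-contained and runnable; production RLHF instead fine-tunes the policy with PPO-style RL against the learned reward, typically with a KL penalty toward a reference model. Failures at each of these stages correspond to the paper’s categories of challenges with feedback, the reward model, and the policy.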
Some Problems with RLHF are Fundamental
On some of these problems, technical progress is tractable, and this should be seen as a cause for concerted work and optimism. However, other problems with RLHF cannot be fully solved and instead must be avoided or compensated for with other approaches. Hence, we emphasize the importance of two strategies: (1) evaluating technical progress in light of the fundamental limitations of RLHF and other methods and (2) addressing the sociotechnical challenges of aligning AI with human values. This will require committing to defense-in-depth safety measures and more openly sharing research findings with the wider scientific community.
RLHF = Rehashing Lessons from Historical Failures?
RLHF offers new capabilities but faces many old problems. Researchers in the safety, ethics, and human-computer interaction fields have demonstrated technical and fundamental challenges with the components of RLHF for decades. In 2023, Paul Christiano (the first author of Christiano et al. (2017), the paper that prototyped RLHF) described it as a “basic solution” intended to make it easier to “productively work on more challenging alignment problems” such as debate and recursive reward modeling.
Instead of serving as a stepping stone toward more robust techniques, RLHF’s most prominent impacts have involved advancing AI capabilities. We do not argue that this is inherently good or bad; it raises both benefits and concerns for AI alignment. However, we emphasize that the successes of RLHF should not obscure its limitations or the gaps between the framework under which it is studied and its real-world applications. Relying on RLHF without additional safeguards to align AI in high-stakes settings risks doubling down on an outdated approach.
Transparency
A sustained commitment to transparency (e.g., to the public or auditors) would improve safety and accountability. First, disclosing some details behind large RLHF training runs would clarify an organization’s norms for model scrutiny and safety checks. Second, increased transparency about efforts to mitigate risks would improve safety incentives and suggest ways for external stakeholders to hold companies accountable. Third, transparency would improve the AI community’s understanding of RLHF and support the ability to track technical progress on its challenges. Specific policy prescriptions are beyond the paper’s scope, but we discuss the types of details that, if disclosed, could indicate risks. These should be accounted for when auditing AI systems developed using RLHF.
Future Work
Much more basic research and applied work can be done to improve RLHF and integrate it into a more complete agenda for safer AI. We discuss frameworks that can be used to better understand RLHF, techniques that can help address its challenges, and other alignment strategies that will be important for compensating for its shortcomings. RLHF is still a developing and imperfectly understood technique, so more work to study and improve it will be valuable.
Between the lines
The longer I study RLHF, the more uncertain I feel about whether its use for finetuning state-of-the-art deployed LLMs is a healthy sign for a safety-focused field. RLHF suffers from many challenges, and deployed LLMs trained with it have exhibited many failures (e.g., revealing private information, hallucination, encoding biases, sycophancy, expressing undesirable preferences, jailbreaking, and adversarial vulnerabilities).
On one hand, these failures are a good thing because identifying them can teach us important lessons about developing more trustworthy AI. On the other hand, many of these failures were not foreseen; they escaped the internal safety evaluations of the companies releasing these systems. This suggests major limitations in guaranteeing that state-of-the-art AI is reliable. Companies might not be exercising enough caution. When chatbots fail, the stakes are relatively low. But when more advanced AI systems are deployed in high-stakes settings, hopefully they will not exhibit as many failures as we have seen with RLHF-ed LLMs.