Montreal AI Ethics Institute


Unsolved Problems in ML Safety

May 28, 2023

🔬 Research Summary by Dan Hendrycks, who received his PhD from UC Berkeley, where he was advised by Dawn Song and Jacob Steinhardt. He is now the director of the Center for AI Safety.

[Original paper by Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt]


Overview: As ML systems become more capable and integrated into society, the safety of such systems becomes increasingly important. This paper presents four broad areas in ML Safety: Robustness, Monitoring, Alignment, and Systemic Safety. We explore each area’s motivations and provide concrete research directions.


Introduction

Within a span of five months, two Boeing 737 MAX aircraft crashed, killing 346 people. It was later determined that Boeing had made unsafe design choices and pressured inspectors to bring the plane to market more quickly.

Often, it takes a disaster like this for people to pay attention to safety concerns. As AI systems are rapidly improved and applied to new domains, failures will only become more consequential. It is, therefore, important for the ML research community to proactively design systems with safety in mind. As the adage goes, “An ounce of prevention is worth a pound of cure.”

How can we reduce the probability of high-consequence failures of AI systems? Our goal in writing “Unsolved Problems in ML Safety” is to draw attention to this question and list some research directions that address it. We would love to see the ML Safety community grow and hope that our paper can help guide this area of research and document its motivations.

Key Insights

We describe four research problems:

  1. Robustness: how can we make systems reliable in the face of adversaries and highly unusual situations?
  2. Monitoring: how can we detect anomalies and malicious use, and discover unintended model functionality?
  3. Alignment: how can we build models that represent and safely optimize difficult-to-specify human values?
  4. Systemic Safety: how can we use ML to address broader risks related to how ML systems are handled? Examples include ML for cybersecurity and for improving policy decision-making.

This work is not the first to consider any of these areas. To view related literature, please refer to the original paper.

Robustness

Motivation

Current machine learning systems are not robust enough to handle real-world complexity and long-tail events. For example, failing to recognize a stop sign that is tilted, occluded, or displayed on an LED matrix could cause loss of life.

Additionally, adversaries can easily exploit vulnerabilities in ML systems and cause them to make mistakes. For example, adversaries may bypass the neural networks used to detect intruders or malware.
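
To make the adversarial threat model concrete, the sketch below shows a minimal FGSM-style attack in PyTorch that perturbs inputs within a small ℓ∞ ball; the classifier, batch, and epsilon value are illustrative placeholders rather than choices from the paper.

    import torch
    import torch.nn as nn

    def fgsm_perturb(model, x, y, epsilon=8 / 255):
        # Return an adversarial copy of x inside an l-infinity ball of radius epsilon.
        x_adv = x.clone().detach().requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        loss.backward()
        # Step in the direction that increases the loss, then clip to the valid pixel range.
        x_adv = x_adv.detach() + epsilon * x_adv.grad.sign()
        return x_adv.clamp(0.0, 1.0)

    # Hypothetical usage with any image classifier `model` and a labeled batch (x, y):
    # x_adv = fgsm_perturb(model, x, y)
    # robust_acc = (model(x_adv).argmax(dim=-1) == y).float().mean()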

Some example directions

  • Create robustness benchmarks that incorporate large distribution shifts and long-tail events (a toy evaluation sketch follows this list).
  • Prevent competent errors, where agents generalize incorrectly and competently execute the wrong routine.
  • Improve system abilities to adapt and learn from novel scenarios. 
  • Explore defenses against adversarial attacks with unknown specifications, beyond the typical ℓp-ball setting.
  • Develop adversarial defenses that can adapt at test time. 
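
As a toy illustration of the benchmark direction above, one minimal check is to compare a classifier's clean accuracy against its accuracy under a synthetic shift; the Gaussian-noise corruption and placeholder model below are assumptions for the sketch, not the paper's benchmark.

    import torch

    def accuracy(model, x, y):
        with torch.no_grad():
            return (model(x).argmax(dim=-1) == y).float().mean().item()

    def shifted_accuracy(model, x, y, noise_std=0.1):
        # Additive Gaussian noise is only a stand-in for real-world shifts
        # (weather, sensor degradation, unusual object poses, ...).
        x_shifted = (x + noise_std * torch.randn_like(x)).clamp(0.0, 1.0)
        return accuracy(model, x_shifted, y)

    # A large gap between accuracy(model, x, y) and shifted_accuracy(model, x, y)
    # signals brittleness under even a mild distribution shift.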

Monitoring

Motivation

When AI systems are deployed in high-stakes settings, it will be important for human operators to be alerted when there is an anomaly, an attack, or if the model is uncertain so that they can intervene. Also, capabilities have been known to emerge unexpectedly in AI systems. Human operators should understand how models function and what actions they can take to avoid unwanted surprises.
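
As a rough sketch of how an operator might be alerted, a simple baseline is to flag inputs on which the model's maximum softmax probability falls below a threshold; the model and the threshold of 0.5 are illustrative assumptions.

    import torch

    def flag_anomalies(model, x, threshold=0.5):
        # Return a boolean mask marking inputs the model is least confident about,
        # so they can be routed to a human operator for review.
        with torch.no_grad():
            probs = torch.softmax(model(x), dim=-1)
        confidence, _ = probs.max(dim=-1)
        return confidence < threshold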

Some example directions

  • Improve model calibration (the appropriateness of output probabilities) and extend expressions of uncertainty to natural language (see the calibration sketch after this list).
  • Train models to more accurately report the knowledge available to them.
  • Detect when data has been poisoned or backdoors have been inserted into models.
  • Develop a testbed to screen for potentially hazardous capabilities, such as the ability to execute malicious user-supplied code, generate illegal or unethical forms of content, etc.
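
To make the calibration direction concrete, the sketch below computes expected calibration error (ECE) from held-out predictions; the equal-width binning and bin count are one common choice, assumed here for illustration.

    import numpy as np

    def expected_calibration_error(probs, labels, n_bins=15):
        # probs: softmax outputs with shape (N, num_classes); labels: shape (N,).
        confidences = probs.max(axis=1)
        accuracies = (probs.argmax(axis=1) == labels).astype(float)
        bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                # Weight the confidence/accuracy gap by the fraction of samples in the bin.
                ece += in_bin.mean() * abs(confidences[in_bin].mean() - accuracies[in_bin].mean())
        return ece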

Alignment

Motivation

While most technologies do not have goals and are simply tools, future machine learning systems may act to optimize objectives. Aligning objective functions with human values requires overcoming societal and technical challenges.

Some example directions

  • Align specific technologies, such as recommender systems, with well-being rather than engagement.
  • Detect when ethical decisions are clear-cut or contentious.
  • Train models to learn difficult-to-specify goals in interactive environments (a preference-learning sketch follows this list).
  • Improve the robustness of reward models.
  • Design minimally invasive agents that prefer easily reversible actions over irreversible ones.
  • Teach ML systems to abide by rules and constraints specified in natural language.
  • Mitigate and detect unintended instrumental goals such as self-preservation or power-seeking.
  • Have agents balance and optimize many values since there is no agreement about the best set.
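
As one concrete instance of learning difficult-to-specify goals, reward models are often fit to pairwise human preferences; the sketch below shows a Bradley-Terry-style preference loss in PyTorch, with a placeholder network and hypothetical trajectory features.

    import torch
    import torch.nn as nn

    # A tiny reward network over 16-dimensional trajectory features (placeholder size).
    reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
    optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

    def preference_loss(preferred, rejected):
        # Encourage the model to score the human-preferred trajectory higher (Bradley-Terry).
        r_pref = reward_model(preferred).squeeze(-1)
        r_rej = reward_model(rejected).squeeze(-1)
        return -nn.functional.logsigmoid(r_pref - r_rej).mean()

    # Hypothetical training step on a batch of (preferred, rejected) feature tensors:
    # loss = preference_loss(preferred_batch, rejected_batch)
    # optimizer.zero_grad(); loss.backward(); optimizer.step()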

Systemic Safety

ML systems are more likely to fail or be misdirected if the larger context in which they operate is insecure or turbulent. One research direction that can help combat this is ML for cybersecurity. Attackers may have strong incentives to steal ML models, which could be misused or may themselves be inherently dangerous and unfit for proliferation. ML could also be used to develop better defensive systems that reduce the risk of such attacks.
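
As a toy illustration of ML-assisted defense (not a proposal from the paper), an unsupervised detector can flag unusual connection patterns for analyst review; the feature vectors below are synthetic stand-ins.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    normal_traffic = rng.normal(size=(1000, 4))    # stand-in for connection-level features
    detector = IsolationForest(contamination=0.01, random_state=0).fit(normal_traffic)

    new_connections = rng.normal(size=(10, 4))
    flags = detector.predict(new_connections)      # -1 marks suspected anomalies
    suspicious = new_connections[flags == -1]      # route these to a human analyst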

Another research direction in this category is ML for informed decision-making. Even if ML systems are safe in and of themselves, they must still be used safely. During the Cold War, misunderstanding and political turbulence exposed humanity to several close calls that brought us to the brink of catastrophe, demonstrating that systemic issues can make technologies unsafe. Using ML to help institutions make more informed decisions may help to combat these risks.

Between the lines

Ultimately, our goal as researchers should not just be to produce interesting work but to help steer the world in a better direction. We hope to highlight some safety problems that may be under-emphasized. This list is far from comprehensive, and we would be enthusiastic about further research into reducing high-consequence risks that may arise in the future.
