Unsolved Problems in ML Safety

May 28, 2023

šŸ”¬ Research Summary by Dan Hendrycks, who received his PhD from UC Berkeley, where he was advised by Dawn Song and Jacob Steinhardt. He is now the director of the Center for AI Safety.

[Original paper by Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt]


Overview: As ML systems become more capable and integrated into society, the safety of such systems becomes increasingly important. This paper presents four broad areas in ML Safety: Robustness, Monitoring, Alignment, and Systemic Safety. We explore each area’s motivations and provide concrete research directions.


Introduction

Within five months, two Boeing 737 MAX aircraft crashed, killing 346 people. It was later determined that Boeing had made unsafe design choices and pressured inspectors to bring the plane to market more quickly.

Often, it takes a disaster like this for people to pay attention to safety concerns. As AI systems are rapidly improved and applied to new domains, failures will only become more consequential. It is, therefore, important for the ML research community to proactively design systems with safety in mind. As the adage goes, ā€œAn ounce of prevention is worth a pound of cure.ā€

How can we reduce the probability of high-consequence failures of AI systems? Our goal in writing ā€œUnsolved Problems in ML Safetyā€ is to draw attention to this question and to list some research directions that address it. We would love to see the ML Safety community grow, and we hope that our paper can help guide this area of research and document its motivations.

Key Insights

We describe four research problems:

  1. Robustness: how can we make systems reliable in the face of adversaries and highly unusual situations?
  2. Monitoring: how can we detect anomalies and malicious use, and discover unintended model functionality?
  3. Alignment: how can we build models that represent and safely optimize difficult-to-specify human values?
  4. Systemic Safety: how can we use ML to address broader risks related to how ML systems are handled? Examples include ML for cybersecurity and for improving policy decision-making.

This work is not the first to consider these areas; for related literature, please refer to the original paper.

Robustness

Motivation

Current machine learning systems are not robust enough to handle real-world complexity and long-tail events. For example, failing to recognize a stop sign that is tilted, occluded, or displayed on an LED matrix could cause loss of life.

Additionally, adversaries can exploit vulnerabilities in ML systems and cause them to make mistakes. For example, adversaries may bypass neural networks used to detect intruders or malware.
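To make this concrete, the canonical fast gradient sign method (FGSM) perturbs an input in the direction that increases the model's loss. Below is a minimal PyTorch sketch, assuming model is any differentiable classifier and inputs are scaled to [0, 1]; the epsilon value is illustrative, not a recommendation:

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, x, y, epsilon=0.03):
        """Fast gradient sign method (Goodfellow et al., 2015).

        Returns a perturbed copy of x that stays within an
        L-infinity ball of radius epsilon around the original
        but tends to change the model's prediction.
        """
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        # Step in the direction that increases the loss, then clip
        # back to the valid pixel range.
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        return x_adv.clamp(0.0, 1.0).detach()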

Some example directions

  • Create robustness benchmarks that incorporate large distribution shifts and long-tail events (see the sketch after this list).
  • Prevent competent errors, where agents generalize incorrectly and capably execute the wrong routine.
  • Improve systems’ ability to adapt to and learn from novel scenarios.
  • Explore defenses against adversarial attacks with unknown specifications (beyond the typical ā„“p-ball setting).
  • Develop adversarial defenses that can adapt at test time.
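As a concrete instance of the first direction above, a robustness benchmark can be as simple as measuring how accuracy degrades as inputs shift away from the training distribution. Below is a minimal PyTorch sketch in the spirit of corruption benchmarks such as ImageNet-C; using additive Gaussian noise as the shift is an illustrative assumption:

    import torch

    @torch.no_grad()
    def accuracy_under_shift(model, loader, severities=(0.0, 0.1, 0.2, 0.4)):
        """Measure accuracy as additive Gaussian noise (a stand-in
        for a real distribution shift) grows in severity."""
        results = {}
        for sigma in severities:
            correct = total = 0
            for x, y in loader:
                x_shifted = (x + sigma * torch.randn_like(x)).clamp(0.0, 1.0)
                pred = model(x_shifted).argmax(dim=1)
                correct += (pred == y).sum().item()
                total += y.numel()
            results[sigma] = correct / total
        return results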

Monitoring

Motivation

When AI systems are deployed in high-stakes settings, human operators need to be alerted when there is an anomaly or an attack, or when the model is uncertain, so that they can intervene. Capabilities have also been known to emerge unexpectedly in AI systems; human operators should understand how models function and what actions they can take to avoid unwanted surprises.
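A simple baseline for the anomaly-detection part of this problem is the maximum softmax probability (MSP) detector of Hendrycks and Gimpel (2017): inputs on which the model's top softmax score is low get flagged for human review. Below is a minimal PyTorch sketch; the threshold is an assumption, meant to be tuned on validation data:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def flag_anomalies(model, x, threshold=0.8):
        """Maximum-softmax-probability anomaly detector.

        Returns a boolean mask over the batch: True means the model's
        top-class confidence is below the threshold, so the input
        should be routed to a human operator.
        """
        probs = F.softmax(model(x), dim=1)
        confidence, _ = probs.max(dim=1)
        return confidence < threshold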

Some example directions

  • Improve model calibration (the appropriateness of output probabilities) and extend expressions of uncertainty to natural language (see the sketch after this list).
  • Train models to report the knowledge available to them more accurately.
  • Detect when data has been poisoned or backdoors have been inserted into models.
  • Develop testbeds to screen for potentially hazardous capabilities, such as the ability to execute malicious user-supplied code or generate illegal or unethical content.
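To illustrate the first direction above, calibration is often quantified with the expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between average confidence and accuracy is averaged across bins, weighted by bin size. Below is a minimal NumPy sketch; the bin count is an assumption:

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """ECE: the gap between how sure the model is and how often
        it is right, averaged over confidence bins.

        confidences: array of top-class probabilities in [0, 1]
        correct:     boolean array, True where the prediction was right
        """
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
                ece += in_bin.mean() * gap
        return ece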

Alignment

Motivation

While most technologies do not have goals and are simply tools, future machine learning systems may act to optimize objectives. Aligning objective functions with human values requires overcoming societal and technical challenges.

Some example directions

  • Align specific technologies, such as recommender systems, with well-being rather than engagement.
  • Detect when ethical decisions are clear-cut versus contentious.
  • Train models to learn difficult-to-specify goals in interactive environments.
  • Improve the robustness of reward models (see the sketch after this list).
  • Design minimally invasive agents that prefer easily reversible actions over irreversible ones.
  • Teach ML systems to abide by rules and constraints specified in natural language.
  • Detect and mitigate unintended instrumental goals such as self-preservation or power-seeking.
  • Have agents balance and optimize many values, since there is no agreement on a single best set.
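To make the reward-modeling direction concrete, one common recipe learns a scalar reward from pairwise human preferences: for trajectory pairs where A was preferred over B, the model is trained to score A above B under a Bradley-Terry likelihood. Below is a minimal PyTorch sketch, assuming trajectories are summarized as fixed-size feature vectors; the architecture is illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        """Maps a trajectory (here, a fixed-size feature vector)
        to a scalar reward."""
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                                     nn.Linear(64, 1))

        def forward(self, x):
            return self.net(x).squeeze(-1)

    def preference_loss(model, preferred, rejected):
        """Bradley-Terry loss: push r(preferred) above r(rejected)."""
        logits = model(preferred) - model(rejected)
        return F.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))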

Systemic Safety

ML systems are more likely to fail or be misdirected if the larger context in which they operate is insecure or turbulent. One research direction that can help combat this is ML for cybersecurity. Attackers may have strong incentives to steal ML models, which could be misused in dangerous ways or may be inherently dangerous and unfit for proliferation. ML could be used to build better defensive systems that reduce the risk of such attacks.
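As one illustration of ML for cyber defense, unsupervised anomaly detection over network-flow features can surface suspicious connections for analyst review. Below is a minimal sketch using scikit-learn's IsolationForest; the synthetic flow features and the contamination rate are assumptions:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Hypothetical network-flow features: bytes sent, bytes received,
    # duration (s), and port number, one row per connection.
    rng = np.random.default_rng(0)
    flows = rng.normal(loc=[500, 800, 2.0, 443],
                       scale=[100, 150, 0.5, 1], size=(1000, 4))

    detector = IsolationForest(contamination=0.01, random_state=0)
    detector.fit(flows)

    # predict() returns -1 for flows the forest considers anomalous.
    suspicious = detector.predict(flows) == -1
    print(f"{suspicious.sum()} flows flagged for analyst review")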

Another research direction in this category is ML for informed decision-making. Even if ML systems are safe in and of themselves, they must still be used safely. During the Cold War, misunderstanding and political turbulence exposed humanity to several close calls and brought us to the brink of catastrophe, demonstrating that systemic issues can make technologies unsafe. Using ML to help institutions make more informed decisions may help combat these risks.

Between the lines

Ultimately, our goal as researchers should not just be to produce interesting work but to help steer the world in a better direction. We hope to highlight some safety problems that may be under-emphasized. This list is far from comprehensive, and we would be enthusiastic about further research into reducing the high-consequence risks that may arise in the future.

