Montreal AI Ethics Institute

Exploring Clusters of Research in Three Areas of AI Safety

May 18, 2022

🔬 Research summary by Ashwin Acharya, an AI Governance and Strategy Researcher at Rethink Priorities.

[Original paper by Helen Toner and Ashwin Acharya]


Overview: Problems of AI safety are the subject of increasing interest for engineers and policymakers alike. This brief uses the CSET Map of Science to investigate how research into three areas of AI safety — robustness, interpretability and reward learning — is progressing. It identifies eight research clusters that contain a significant amount of research relating to these three areas and describes trends and key papers for each of them. 


Introduction

Applying today’s machine learning systems in complex, high-stakes situations is a risky prospect. Safety-critical industries like aviation have a culture of caution when developing and deploying new automation. A growing list of AI accidents and incidents demonstrates the failure modes of machine learning systems, even as the upward trends on AI benchmarks demonstrate machine learning’s promise.

AI safety research aims to identify the causes of unintended behavior in machine learning systems and develop tools to ensure these systems work safely and reliably. The field is commonly divided into three categories:

  • Robustness aims for guarantees that a system will continue to operate within safe limits even in unfamiliar settings;
  • Interpretability seeks to establish that a system can be analyzed and understood easily by human operators;
  • Reward learning is concerned with ensuring that a system’s behavior aligns with its designer’s intentions.

By identifying and analyzing AI safety-related clusters of research in the CSET Map of Science — comprising more than 15,000 papers — this data brief investigates how research into these challenges is progressing in practice. It finds that research into these areas has grown significantly over time, and is led by researchers from the United States.

The analysis draws on CSET’s Map of Science, a resource constructed by grouping research publications into citation-based “research clusters,” then placing those clusters on a two-dimensional map based on the strength of citation connections between them.
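To make "strength of citation connections" concrete, here is a minimal sketch that counts citations running between clusters. It assumes papers have already been assigned to clusters; the paper IDs, cluster labels, and citation pairs are hypothetical, and the summary does not describe the actual clustering or layout algorithms behind the Map of Science.

```python
from collections import Counter

# Hypothetical data: paper -> cluster assignment and a list of (citing, cited) pairs.
paper_cluster = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}
citations = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p4", "p1"), ("p5", "p4")]


def cluster_link_strength(paper_cluster, citations):
    """Count citations running between each unordered pair of distinct clusters."""
    strength = Counter()
    for citing, cited in citations:
        a, b = paper_cluster[citing], paper_cluster[cited]
        if a != b:  # citations inside a single cluster do not link two clusters
            strength[tuple(sorted((a, b)))] += 1
    return strength


links = cluster_link_strength(paper_cluster, citations)
print(links.most_common())
```

In a layout step, pairs of clusters with higher counts would be placed closer together; how that placement is done is outside the scope of this sketch.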

Operational definitions of AI safety categories

Robustness research focuses on AI systems that seem to work well in general but fail in certain circumstances. This includes identifying and defending against deliberate attacks, such as the use of adversarial examples (giving the system inputs intentionally designed to cause it to fail), data poisoning (manipulating training data in order to cause an AI model to learn the wrong thing), and other techniques. It also includes making models more robust to incidental (i.e., not deliberate) failures, such as a model being used in a different setting from what it was trained for (known as being “out of distribution”). Robustness research can seek to identify failure modes, find ways to prevent them, or develop tools to show that a given system will work robustly under a given set of assumptions. 
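As a toy illustration of an adversarial example, the sketch below nudges each input feature by a small amount in the direction that lowers a classifier's score until the prediction flips. It assumes a hypothetical linear classifier with made-up weights and uses a gradient-sign step (in the style of the fast gradient sign method, which the summary itself does not name); real attacks target far larger models.

```python
# Toy illustration of an adversarial example on a hypothetical linear classifier.
# Weights, input, and epsilon are made-up values chosen for illustration.

def score(weights, bias, x):
    """Linear score; the classifier predicts the positive class when score > 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def sign(v):
    """Return 1, 0, or -1 according to the sign of v."""
    return (v > 0) - (v < 0)


def perturb(weights, x, epsilon):
    """Shift each feature by at most epsilon in the direction that lowers the score.

    For a linear model the gradient of the score with respect to x is the
    weight vector, so stepping against sign(weights) is the gradient-sign step.
    """
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]


weights = [2.0, -1.0, 0.5]
bias = -0.2
x = [0.3, 0.1, 0.2]

x_adv = perturb(weights, x, epsilon=0.15)

original = score(weights, bias, x)         # positive: classified as positive
adversarial = score(weights, bias, x_adv)  # negative: the prediction has flipped
print(original > 0, adversarial > 0)
```

No feature moves by more than 0.15, yet the predicted class changes, which is the kind of failure robustness research tries to detect and prevent.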

Interpretability research aims to help us understand the inner workings of machine learning models. Modern machine learning techniques (especially deep learning, also known as deep neural networks) are famous for being “black boxes” whose functioning is opaque to us. In reality, it’s easy enough to look inside the black box, but what we find there—rows upon rows of numbers (“parameters”)—is extremely difficult to make sense of. Interpretability research aims to build tools and approaches that convert the millions or billions of parameters in a machine learning model into forms that allow humans to grasp what’s going on.

Reward learning research seeks to expand the toolbox for how we tell machine learning systems what we want them to do. The standard approach to training a model involves specifying an objective function (or reward function): typically, to maximize something, such as accuracy in labeling examples from a training dataset. This approach works well in settings where we can identify metrics and training data that closely track what we want, but can lead to problems in more complex situations. Reward learning is one set of approaches that tries to mitigate these problems. Instead of directly specifying an objective, these approaches work by setting up the machine learning model to learn not only how to meet its objective but what its objective should be.
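One way to picture "learning what the objective should be" is to fit a reward function to pairwise human preferences. The sketch below is an assumption-laden illustration rather than a method named in the brief: it uses a Bradley-Terry-style model with a linear reward, and the features, preference pairs, learning rate, and step count are all hypothetical.

```python
import math

# Hypothetical sketch: learn a reward function from pairwise preferences.
# Features, preferences, and hyperparameters are made up for illustration.

def reward(theta, features):
    """Linear reward: a weighted sum of behavior features."""
    return sum(t * f for t, f in zip(theta, features))


def fit_reward(preferences, n_features, lr=0.5, steps=200):
    """Fit theta so that preferred behaviors receive higher reward.

    Each preference is a (preferred, rejected) pair of feature vectors.
    The model assumes P(preferred beats rejected) = sigmoid(r(preferred) - r(rejected))
    and performs gradient ascent on the log-likelihood of the observed preferences.
    """
    theta = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for preferred, rejected in preferences:
            diff = reward(theta, preferred) - reward(theta, rejected)
            p = 1.0 / (1.0 + math.exp(-diff))
            for i in range(n_features):
                grad[i] += (1.0 - p) * (preferred[i] - rejected[i])
        theta = [t + lr * g / len(preferences) for t, g in zip(theta, grad)]
    return theta


# Features: (task progress, damage caused).
# The hypothetical human prefers progress, but not at the cost of damage.
preferences = [
    ((1.0, 0.0), (1.0, 1.0)),
    ((0.8, 0.0), (0.2, 0.0)),
    ((0.5, 0.1), (0.9, 1.0)),
]

theta = fit_reward(preferences, n_features=2)
print(theta)
```

The learned weights end up rewarding progress and penalizing damage even though no one wrote that objective down explicitly; it was inferred from the comparisons.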

Methodology

The authors identified clusters related to each topic using a combination of a keyword search for relevant terms and manual filtering based on a sample of papers in the cluster. They found three clusters related to robustness research, two related to interpretability, and two related to reward learning.
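The keyword-screening step can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the keyword list, the 25 percent threshold, the cluster IDs, and the paper titles are all hypothetical, and the manual review the authors applied afterward is not modeled.

```python
# Hypothetical sketch: shortlist clusters in which many paper titles match safety keywords.
# Keywords, threshold, and sample data are illustrative assumptions.

ROBUSTNESS_KEYWORDS = ("adversarial example", "data poisoning", "out-of-distribution")


def keyword_share(titles, keywords):
    """Return the fraction of titles containing at least one keyword."""
    if not titles:
        return 0.0
    hits = sum(any(k in t.lower() for k in keywords) for t in titles)
    return hits / len(titles)


def candidate_clusters(clusters, keywords, min_share=0.25):
    """Return IDs of clusters whose keyword share meets the threshold, for later manual review."""
    return sorted(
        cid for cid, titles in clusters.items()
        if keyword_share(titles, keywords) >= min_share
    )


clusters = {
    "cluster_a": [
        "Adversarial examples in image classifiers",
        "Defending against data poisoning attacks",
        "A survey of convolutional architectures",
    ],
    "cluster_b": [
        "Crop yield forecasting with satellite data",
        "Soil moisture estimation",
    ],
}

shortlist = candidate_clusters(clusters, ROBUSTNESS_KEYWORDS)
print(shortlist)
```

A shortlist like this would only be a starting point; a person would still read a sample of papers from each candidate cluster before deciding whether it belongs in the analysis.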

Trendlines

Research on robustness and interpretability has grown much more rapidly since roughly 2017 than it did in earlier years. The United States leads in publication output for both categories. Chinese researchers are increasingly interested in robustness research, while the European Union has long been active in interpretability research.

Reward learning has also grown in recent years, but less explosively. The authors note that some work they grouped under “reward learning” applies to robotics work unrelated to deep learning, and that reward learning may generally be a fuzzier category that fits less neatly under the “AI safety” label.

Conclusions

Based on the research clusters identified here, it appears that work in all these areas is growing at a significant pace worldwide. The United States appears to lead in every area, with China showing substantial growth in robustness research and the EU producing a large amount of interpretability work. 

Notably, despite the significant growth of the AI safety–related clusters analyzed in this paper, they still represent only a tiny fraction of total worldwide research on AI in general. The authors identified eight AI safety clusters containing a little over 15,000 papers, but when searching for AI literature in general, they found nearly 2,000 clusters containing over 1.9 million papers. These numbers imply that safety research may make up less than 1 percent of AI research overall.
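The "less than 1 percent" estimate follows directly from the figures above. A quick check, using rounded numbers (15,000 safety papers, 1.9 million AI papers, 8 of roughly 2,000 clusters) as stand-ins for the exact counts in the brief:

```python
# Rounded figures from the summary; the exact counts in the brief differ slightly.
safety_papers = 15_000
ai_papers = 1_900_000
safety_clusters = 8
ai_clusters = 2_000

paper_share = safety_papers / ai_papers        # fraction of AI papers in safety clusters
cluster_share = safety_clusters / ai_clusters  # fraction of AI clusters that are safety clusters

print(f"Share of papers:   {paper_share:.2%}")
print(f"Share of clusters: {cluster_share:.2%}")
```

With these rounded inputs, the safety clusters account for roughly 0.8 percent of papers and 0.4 percent of clusters, both under the 1 percent mark.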

The safety clusters identified show some promising progress on safety-related problems and may also point to some gaps. For instance, one could see the heavy emphasis on adversarial examples in robustness clusters in a negative light; perhaps the perceived prestige of that topic has led related problems to be relatively neglected. 

Between the lines

These findings show that AI safety is becoming an increasingly active area of research, and is led by researchers from the United States. A particularly interesting trend is the growth of Chinese research into AI robustness (and, to a lesser extent, interpretability). Do Chinese researchers conceptualize these areas, and AI safety in general, in similar terms to their Western counterparts? Where do their views diverge?

The growth of AI safety work raises two important questions: what caused this growth, and how meaningful is it? The authors note some foundational papers for each AI safety field, but were those papers the key driver, or was progress driven by events such as the AI Asilomar Conference? And while AI safety work is growing rapidly, so too is work on AI in general. What do leading researchers and AI labs think of AI safety, and to what extent do they prioritize it when developing and deploying new technologies?

About Us


Founded in 2018, the Montreal AI Ethics Institute (MAIEI) is an international non-profit organization equipping citizens concerned about artificial intelligence and its impact on society to take action.

  • © 2025 MONTREAL AI ETHICS INSTITUTE.
  • This work is licensed under a Creative Commons Attribution 4.0 International License.