Montreal AI Ethics Institute

Democratizing AI ethics literacy


Exploring Clusters of Research in Three Areas of AI Safety

May 18, 2022

🔬 Research summary by Ashwin Acharya, an AI Governance and Strategy Researcher at Rethink Priorities.

[Original paper by Helen Toner and Ashwin Acharya]


Overview: Problems of AI safety are the subject of increasing interest for engineers and policymakers alike. This brief uses the CSET Map of Science to investigate how research into three areas of AI safety — robustness, interpretability and reward learning — is progressing. It identifies eight research clusters that contain a significant amount of research relating to these three areas and describes trends and key papers for each of them. 


Introduction

Applying today’s machine learning systems in complex, high-stakes situations is a risky prospect. Safety-critical industries like aviation have a culture of caution when developing and deploying new automation. A growing list of AI accidents and incidents demonstrates the failure modes of machine learning systems, even as upward trends on AI benchmarks highlight machine learning’s promise.

AI safety research aims to identify the causes of unintended behavior in machine learning systems and develop tools to ensure these systems work safely and reliably. The field is commonly divided into three categories:

  • Robustness aims for guarantees that a system will continue to operate within safe limits even in unfamiliar settings;
  • Interpretability seeks to establish that a system can be analyzed and understood easily by human operators;
  • Reward learning is concerned with ensuring that a system’s behavior aligns with its designer’s intentions.

By identifying and analyzing AI safety-related clusters of research in the CSET Map of Science — comprising more than 15,000 papers — this data brief investigates how research into these challenges is progressing in practice. It finds that research into these areas has grown significantly over time, and is led by researchers from the United States.

The Map of Science was constructed by grouping research publications into citation-based “research clusters,” then placing those clusters on a two-dimensional map based on the strength of the citation connections between them.
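
To make this concrete, here is a toy sketch of the general approach behind a citation-based map of science: cluster papers by their citation links, then lay them out in two dimensions so that strongly connected work lands close together. This is an illustration only, not CSET’s actual pipeline; the citation edges are hypothetical and the sketch assumes the networkx library.

```python
# Toy illustration of citation-based clustering and 2D layout.
# Not CSET's pipeline; the papers and citation edges are made up.
import networkx as nx
from networkx.algorithms import community

# Hypothetical citation edges: (citing paper, cited paper).
citations = [
    ("A", "B"), ("A", "C"), ("B", "C"),   # one tightly linked group
    ("D", "E"), ("E", "F"), ("D", "F"),   # a second group
    ("C", "D"),                           # a weak link between the two
]
G = nx.Graph(citations)

# Group papers into citation-based "research clusters".
clusters = community.greedy_modularity_communities(G)

# Place papers on a 2D map where strongly connected nodes sit close together.
positions = nx.spring_layout(G, seed=0)

for i, cluster in enumerate(clusters):
    print(f"cluster {i}: {sorted(cluster)}")
print({paper: pos.round(2).tolist() for paper, pos in positions.items()})
```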

Operational definitions of AI safety categories

Robustness research focuses on AI systems that seem to work well in general but fail in certain circumstances. This includes identifying and defending against deliberate attacks, such as the use of adversarial examples (giving the system inputs intentionally designed to cause it to fail), data poisoning (manipulating training data in order to cause an AI model to learn the wrong thing), and other techniques. It also includes making models more robust to incidental (i.e., not deliberate) failures, such as a model being used in a different setting from what it was trained for (known as being “out of distribution”). Robustness research can seek to identify failure modes, find ways to prevent them, or develop tools to show that a given system will work robustly under a given set of assumptions. 
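
As an illustration of the first failure mode described above, the sketch below constructs a fast-gradient-sign-method (FGSM) style adversarial example. It assumes PyTorch, and the classifier and input image are toy stand-ins rather than anything from the paper.

```python
# Minimal adversarial-example (FGSM-style) sketch in PyTorch.
# The classifier and input are toy stand-ins for illustration only.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in classifier
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(1, 1, 28, 28, requires_grad=True)  # stand-in input image
y = torch.tensor([3])                             # its correct label
epsilon = 0.1                                     # perturbation budget

# Compute the gradient of the loss with respect to the input pixels ...
loss = loss_fn(model(x), y)
loss.backward()

# ... then nudge every pixel slightly in the direction that increases the loss.
x_adv = (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()

# A robust model should give the same answer for x and x_adv; a brittle one may not.
print(model(x).argmax(dim=1).item(), model(x_adv).argmax(dim=1).item())
```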

Interpretability research aims to help us understand the inner workings of machine learning models. Modern machine learning techniques (especially deep learning, also known as deep neural networks) are famous for being “black boxes” whose functioning is opaque to us. In reality, it’s easy enough to look inside the black box, but what we find there—rows upon rows of numbers (“parameters”)—is extremely difficult to make sense of. Interpretability research aims to build tools and approaches that convert the millions or billions of parameters in a machine learning model into forms that allow humans to grasp what’s going on.
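
One simple way to peek inside the black box, offered here only as an illustrative sketch (again assuming PyTorch, with a stand-in model), is a gradient-based saliency map: the gradient of the predicted class’s score with respect to each input pixel indicates which pixels most influence the prediction.

```python
# Minimal gradient-saliency sketch: which input pixels most affect the output?
# The model and image are toy stand-ins for illustration only.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in classifier
image = torch.rand(1, 1, 28, 28, requires_grad=True)         # stand-in input

logits = model(image)
top_class_score = logits[0, logits.argmax()]

# The gradient of the top class's score with respect to each pixel is a crude
# measure of that pixel's importance to the prediction.
top_class_score.backward()
saliency = image.grad.abs().squeeze()

print(saliency.shape)  # one importance value per pixel (28 x 28)
```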

Reward learning research seeks to expand the toolbox for how we tell machine learning systems what we want them to do. The standard approach to training a model involves specifying an objective function (or reward function): typically, to maximize something, such as accuracy in labeling examples from a training dataset. This approach works well in settings where we can identify metrics and training data that closely track what we want, but can lead to problems in more complex situations. Reward learning is one set of approaches that tries to mitigate these problems. Instead of directly specifying an objective, these approaches work by setting up the machine learning model to learn not only how to meet its objective but what its objective should be.
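
As a heavily simplified sketch of one reward-learning approach, learning a reward model from pairwise human preferences, the snippet below assumes PyTorch and uses toy feature vectors in place of real behaviour data; none of it comes from the paper itself.

```python
# Minimal reward-learning sketch: fit a reward model from pairwise preferences
# (Bradley-Terry style) instead of hand-specifying the objective.
# All data here is a toy stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(4, 1)  # maps a (toy) behaviour summary to a scalar reward
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Hypothetical preference data: in each pair, a human preferred the first
# behaviour (summarised as a feature vector) over the second.
preferred = torch.rand(32, 4)
rejected = torch.rand(32, 4)

for _ in range(200):
    # The preferred behaviour should receive the higher predicted reward.
    margin = reward_model(preferred) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The learned reward model can then serve as the training objective for a policy.
print(f"final preference loss: {loss.item():.3f}")
```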

Methodology

The authors identified clusters related to each topic using a combination of a keyword search for relevant terms and manual filtering based on a sample of papers in the cluster. They found three clusters related to robustness research, two related to interpretability, and two related to reward learning.
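
The paper describes this process only at a high level; as a rough sketch of what the keyword-search half of it might look like (the keyword lists and toy clusters below are hypothetical, not the authors’):

```python
# Rough sketch of keyword-based candidate selection; flagged clusters would
# then be manually reviewed using a sample of their papers.
# The keyword lists and toy clusters are hypothetical.
TOPIC_KEYWORDS = {
    "robustness": ["adversarial example", "data poisoning", "out-of-distribution"],
    "interpretability": ["interpretability", "explainable", "saliency"],
    "reward learning": ["reward learning", "inverse reinforcement", "preference"],
}

def candidate_clusters(clusters: dict[int, list[str]], topic: str) -> list[int]:
    """Return IDs of clusters whose paper titles mention any keyword for the topic."""
    hits = []
    for cluster_id, titles in clusters.items():
        text = " ".join(t.lower() for t in titles)
        if any(keyword in text for keyword in TOPIC_KEYWORDS[topic]):
            hits.append(cluster_id)
    return hits

toy_clusters = {1: ["Adversarial examples for image classifiers"], 2: ["Graph databases"]}
print(candidate_clusters(toy_clusters, "robustness"))  # -> [1]
```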

Trendlines

Robustness and interpretability research have grown at a much more rapid pace since roughly 2017. The United States leads in publication output for both categories. Chinese researchers are increasingly interested in robustness research, while the European Union has long been active in interpretability research.

Reward learning has also grown in recent years, but less explosively. The authors note that some of the work they grouped under “reward learning” is robotics research unrelated to deep learning, and that reward learning may generally be a fuzzier category that fits less neatly under the “AI safety” label.

Conclusions

Based on the research clusters identified here, it appears that work in all these areas is growing at a significant pace worldwide. The United States appears to lead in every area, with China showing substantial growth in robustness research and the EU producing a large amount of interpretability work. 

Notably, despite the significant growth of the AI safety–related clusters analyzed in this paper, they still represent only a tiny fraction of total worldwide research on AI in general. The authors identified eight AI safety clusters containing a little over 15,000 papers, whereas a search for AI literature in general turns up nearly 2,000 clusters containing over 1.9 million papers. These numbers imply that safety research may make up less than 1 percent of AI research overall (roughly 15,000 of 1.9 million papers, or about 0.8 percent).

The safety clusters identified show some promising progress on safety-related problems, and they may also point to some gaps. For instance, the heavy emphasis on adversarial examples in the robustness clusters could be seen in a negative light: perhaps the perceived prestige of that topic has led to the relative neglect of related problems.

Between the lines

These findings show that AI safety is becoming an increasingly active area of research, led by researchers from the United States. A particularly interesting trend is the growth of Chinese research into AI robustness (and, to a lesser extent, interpretability). Do Chinese researchers conceptualize these areas, and AI safety in general, in terms similar to those of their Western counterparts? Where do their views diverge?

The growth of AI safety work raises two important questions: what caused this growth, and how meaningful is it? The authors note some foundational papers for each AI safety field, but were those papers the key driver, or was progress driven by events such as the 2017 Asilomar Conference on Beneficial AI? And while AI safety work is growing rapidly, so too is work on AI in general. What do leading researchers and AI labs think of AI safety, and to what extent do they prioritize it when developing and deploying new technologies?
