Montreal AI Ethics Institute

Rethink reporting of evaluation results in AI

July 6, 2023

🔬 Research Summary by Ryan Burnell, a Senior Research Associate at the Alan Turing Institute in London working to apply theories and paradigms from cognitive science to improve the evaluation of AI systems.

[Original paper by Ryan Burnell, Wout Schellaert, John Burden, Tomer D. Ullman, Fernando Martinez-Plumed, Joshua B. Tenenbaum, Danaja Rutar, Lucy G. Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, Douwe Kiela, Murray Shanahan, Ellen M. Voorhees, Anthony G. Cohn, Joel Z. Leibo, and Jose Hernandez-Orallo]


Overview: To make informed decisions about where AI systems are safe and useful to deploy, we need to understand the capabilities and limitations of these systems. Yet current approaches to AI evaluation make it exceedingly difficult to build this understanding. This paper details several key problems with common AI evaluation methods and suggests a broad range of solutions to address them.


Introduction

AI is becoming integral to every aspect of modern life—just look at ChatGPT, which recently became the fastest-adopted internet application of all time. Take a wider look, and you’ll see that AI is being rolled out in various high-stakes contexts, including systems built for autonomous driving and medical diagnosis. In these contexts, the consequences of system failures could be devastating, so it’s important to ensure these systems are safe to use. Yet to make informed decisions about the safety and utility of AI systems, researchers and policymakers need a full understanding of their capabilities and limitations.

Unfortunately, current approaches to AI evaluation make it exceedingly difficult to build such an understanding for two key reasons. First, aggregate metrics make it hard to predict a system’s performance in specific situations. Second, the instance-by-instance evaluation results that could be used to unpack these aggregate metrics are rarely available. Together, these two problems threaten public understanding of AI capabilities. To address these problems, we propose a path forward in which results are presented in more nuanced ways and full evaluation results are made publicly available.

Key Insights

AI evaluation

Across most areas of AI, system evaluations follow a similar structure. A system is first built or trained to perform a particular set of functions. Then its performance is tested on a set of tasks relevant to its desired functionality, often known as “benchmarks.” For each task, the system is tested on several example “instances” of the task. For example, an image classification system might be shown a series of images, and for each image instance it would be given a score based on its performance (e.g., 1 if it was correct, or 0 if it was incorrect). Finally, performance across these instances is typically aggregated into a small number of summary metrics, such as percentage accuracy.
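The pipeline described above can be condensed into a few lines of Python; the instance scores here are made up purely for illustration:

```python
# Per-instance binary scores from a hypothetical benchmark run:
# 1 if the system answered the instance correctly, 0 otherwise.
instance_scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

# The aggregation step collapses all instance-level detail
# into a single summary metric.
accuracy = sum(instance_scores) / len(instance_scores)
print(f"Aggregate accuracy: {accuracy:.0%}")  # 80%
```

Everything a reader of a typical paper sees is that final number; the per-instance scores that produced it are usually discarded.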

Problems with aggregation

This approach of reducing evaluation results to a small set of aggregate metrics is problematic because it limits our insight into how a system will perform in particular situations. Take, for example, a system trained to classify faces as male or female that achieved a classification accuracy of 90%. Based on this aggregate metric, the system appears highly competent. However, a subsequent breakdown of performance revealed a massive bias problem: the system misclassified darker-skinned women a staggering 34.7% of the time, while erring only 0.8% of the time for lighter-skinned men. This example demonstrates how aggregation can make it difficult to fully understand the behavior of AI systems.
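The gap that an aggregate metric can hide is easy to reproduce with toy data. In this Python sketch (the instance counts are illustrative, not the real system’s results), an overall accuracy of 82% conceals a 34-percentage-point difference between subgroups:

```python
# Hypothetical instance-by-instance results, each tagged with a subgroup.
results = (
    [{"group": "lighter-skinned men", "correct": True}] * 99
    + [{"group": "lighter-skinned men", "correct": False}] * 1
    + [{"group": "darker-skinned women", "correct": True}] * 65
    + [{"group": "darker-skinned women", "correct": False}] * 35
)

# Aggregate metric: looks respectable on its own.
overall = sum(r["correct"] for r in results) / len(results)
print(f"Aggregate accuracy: {overall:.1%}")  # 82.0%

# Subgroup breakdown: reveals a large performance disparity.
for group in ("lighter-skinned men", "darker-skinned women"):
    subset = [r for r in results if r["group"] == group]
    group_accuracy = sum(r["correct"] for r in subset) / len(subset)
    print(f"{group}: {group_accuracy:.1%}")  # 99.0% vs. 65.0%
```

The breakdown requires nothing more than the per-instance results plus a label for each instance, which is exactly the information that aggregate-only reporting throws away.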

These aggregation problems are not unique to AI. Still, they are exacerbated by a research culture centered around outdoing the current state-of-the-art performance, topping leaderboards, and winning competitions. This research culture emphasizes aggregate metrics and incentivizes fast publication of new findings at the expense of robust evaluation practices. 

Availability of evaluation results

A second key problem is the lack of public access to evaluation results. As the biased gender classification example shows, there are many situations in which researchers and policymakers might want to scrutinize system performance: to test for biases, to check for safety concerns, or simply to better understand how a system operates. But the aggregate metrics typically reported in AI papers are not sufficient for these kinds of investigations, which often require access to the full instance-by-instance evaluation results.

It is worrying, then, that researchers rarely make their full evaluation results public: one recent analysis found that only 4% of papers in top AI venues fully report their evaluation results. In some cases, researchers can recreate these results by conducting their own evaluations. However, as systems and benchmarks continue to grow in size and complexity, the costs of conducting evaluations are skyrocketing, and many companies are limiting access to their cutting-edge models. As a result, researchers and policymakers are increasingly unable to fully scrutinize the performance of these systems.

A path forward

To address these critical problems, we need to move beyond aggregate metrics and find ways to make evaluation results publicly available. 

Moving beyond aggregate metrics

It is important that in-depth performance breakdowns are presented instead of or alongside aggregate metrics. Breakdowns can be created by identifying problem space features that might be relevant to performance and using those features to analyze, visualize, and predict performance.

These changes to reporting must go hand in hand with changes to how benchmark tasks are constructed: a system’s performance cannot be fully characterized unless the benchmark comprehensively covers the problem space.

Making evaluation results available

A growing open-science movement has led to the creation of various platforms that could be used to share evaluation results, such as the Hugging Face Hub, GitHub, OpenML, Papers With Code, and the Open Science Framework. But until now, researchers have had few incentives to put in the extra work needed to clean, document, release, and maintain these results. 

We therefore need to start broader conversations about how best to incentivize the public release of evaluation results. For example, within academia, publishing venues and funding agencies could encourage or require the sharing of these results. Outside academia, we recommend that regulators and industry organizations create policies around results sharing. There are, of course, situations in which instance-by-instance results cannot be released (e.g., owing to privacy concerns or practical constraints), but in most cases it should be both possible and beneficial to do so.
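As a concrete (and purely hypothetical) sketch of what releasing instance-by-instance results could look like, the records below are written out as JSON Lines, a format that keeps each instance self-describing and easy to stream; the field names here are an assumption, not a community standard:

```python
import json

# Hypothetical per-instance records: an identifier, the problem-space
# features the instance covers, and the system's score on it.
records = [
    {"instance_id": "img_0001", "features": {"skin_type": "V", "gender": "female"}, "score": 0},
    {"instance_id": "img_0002", "features": {"skin_type": "II", "gender": "male"}, "score": 1},
]

# JSON Lines: one record per line, so results files can be appended to
# and filtered without loading everything into memory.
with open("evaluation_results.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

A file like this, published alongside a paper’s aggregate metrics, would let third parties compute their own breakdowns without re-running the (increasingly expensive) evaluation.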

Several initiatives give us confidence that these changes in broader research culture are possible. For example, researchers who developed the Holistic Evaluation of Language Models (HELM) benchmark made the full evaluation results available for a variety of models across the entire benchmark. If other fields, such as psychology and medicine, can make progress on these issues even in the face of considerable data privacy challenges, AI should be able to do the same.

Between the lines

The field of AI evaluation is reaching a critical moment. Progress in AI development is moving faster than ever before, with shiny new systems being released on almost a monthly basis. If we hope to keep up with this pace and maintain a grasp of how safe and capable these systems are, we need to think carefully about our evaluation practices.

In this paper, we focused on two key barriers to the community’s ability to understand system behavior, but there are many other pressing evaluation issues that need addressing. For example, we lack robust tests of the complex cognitive abilities that cutting-edge systems purport to have. Moreover, the speed at which benchmarks are becoming obsolete is rapidly increasing—partly because capabilities are quickly advancing beyond the limits of what our benchmarks are designed to test and partly due to data contamination as the benchmark data is incorporated into the training data for new systems. We don’t have all the answers, but we hope this paper will spark broader discussions about getting system evaluation back on track.

