🔬 Research Summary by Ryan Burnell, a Senior Research Associate at the Alan Turing Institute in London working to apply theories and paradigms from cognitive science to improve the evaluation of AI systems.
[Original paper by Ryan Burnell, Wout Schellaert, John Burden, Tomer D. Ullman, Fernando Martínez-Plumed, Joshua B. Tenenbaum, Danaja Rutar, Lucy G. Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, Douwe Kiela, Murray Shanahan, Ellen M. Voorhees, Anthony G. Cohn, Joel Z. Leibo, and José Hernández-Orallo]
Overview: In order to make informed decisions about where AI systems are safe and useful to deploy, we need to understand the capabilities and limitations of these systems. Yet current approaches to AI evaluation make it exceedingly difficult to build this understanding. This paper details several key problems with common AI evaluation methods and suggests a broad range of solutions to help address them.
Introduction
AI is becoming integral to every aspect of modern life—just look at ChatGPT, which recently became the fastest-adopted internet application of all time. Take a wider look, and you’ll see that AI is being rolled out in various high-stakes contexts, including systems built for autonomous driving and medical diagnosis. In these contexts, the consequences of system failures could be devastating, so it’s important to ensure these systems are safe to use. Yet to make informed decisions about the safety and utility of AI systems, researchers and policymakers need a full understanding of their capabilities and limitations.
Unfortunately, current approaches to AI evaluation make it exceedingly difficult to build such an understanding for two key reasons. First, aggregate metrics make it hard to predict a system’s performance in specific situations. Second, the instance-by-instance evaluation results that could be used to unpack these aggregate metrics are rarely available. Together, these two problems threaten public understanding of AI capabilities. To address these problems, we propose a path forward in which results are presented in more nuanced ways and full evaluation results are made publicly available.
Key Insights
AI evaluation
Across most areas of AI, system evaluations follow a similar structure. A system is first built or trained to perform a particular set of functions. Its performance is then tested on a set of tasks relevant to that functionality, often known as a “benchmark.” For each task, the system is tested on a number of example “instances” of the task. An image classification system, for example, might be shown a series of images and given a score for each image instance based on its performance (e.g., 1 if it classified the image correctly, 0 if it did not). Finally, performance across these instances is typically aggregated into a small number of summary metrics, such as percentage accuracy.
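As a rough sketch of this pipeline (the instances, labels, and scoring rule below are hypothetical, not drawn from any particular benchmark), instance-level scores might be collected and then collapsed into a single headline metric like this:

```python
# Minimal sketch of a typical evaluation pipeline (hypothetical benchmark and scoring rule).
# Each instance receives a binary score; the scores are then collapsed into one aggregate metric.

def score_instance(prediction: str, label: str) -> int:
    """Score a single benchmark instance: 1 if correct, 0 otherwise."""
    return int(prediction == label)

# Hypothetical instance-level results for an image classification benchmark.
instances = [
    {"id": "img_001", "prediction": "cat", "label": "cat"},
    {"id": "img_002", "prediction": "dog", "label": "cat"},
    {"id": "img_003", "prediction": "dog", "label": "dog"},
]

scores = [score_instance(x["prediction"], x["label"]) for x in instances]

# The aggregation step: often this single number is all that gets reported.
accuracy = 100 * sum(scores) / len(scores)
print(f"Accuracy: {accuracy:.1f}%")  # the per-instance detail behind it is discarded
```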
Problems with aggregation
This approach of reducing evaluation results to a small set of aggregate metrics is problematic because it limits our insight into how a system will perform in particular situations. Take, for example, a system trained to classify faces as male or female that achieved a classification accuracy of 90%. Based on this aggregate metric, the system appears highly competent. However, a subsequent breakdown of performance revealed a severe bias problem: the system misclassified women with darker skin types a staggering 34.7% of the time while erring only 0.8% of the time for men with lighter skin types. This example demonstrates how aggregation can obscure important aspects of a system’s behavior.
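To make the arithmetic concrete, here is a small illustrative sketch. The group sizes and error counts are hypothetical, chosen only to echo the pattern described above, not taken from the actual study:

```python
# Illustrative sketch of how an aggregate metric can mask subgroup disparities.
# The group sizes and error counts below are hypothetical and only mimic the pattern
# described in the text (high overall accuracy, very uneven errors across groups).

groups = {
    # group: (number of instances, number misclassified)
    "lighter-skinned men":   (1000, 8),
    "lighter-skinned women": (800, 40),
    "darker-skinned men":    (600, 45),
    "darker-skinned women":  (400, 139),
}

total = sum(n for n, _ in groups.values())
errors = sum(e for _, e in groups.values())
print(f"Overall accuracy: {100 * (1 - errors / total):.1f}%")  # looks impressive on its own

for name, (n, e) in groups.items():
    print(f"{name:>22}: error rate {100 * e / n:.1f}%")  # the breakdown tells a different story
```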
These aggregation problems are not unique to AI, but they are exacerbated by a research culture centered on outdoing the current state of the art, topping leaderboards, and winning competitions. This culture emphasizes aggregate metrics and incentivizes the rapid publication of new findings at the expense of robust evaluation practices.
Availability of evaluation results
A second key problem is the lack of public access to evaluation results. As the biased gender classification example shows, there are many situations in which researchers and policymakers might want to scrutinize system performance, whether to test for biases, to check for safety concerns, or simply to better understand how a system operates. But the aggregate metrics typically reported in AI papers are not sufficient for these kinds of investigations, which often require access to the full instance-by-instance evaluation results.
It is worrying, then, that researchers rarely make their full evaluation results public: one recent analysis found that only 4% of papers in top AI venues fully report their evaluation results. In some cases, researchers can recreate these results by conducting their own evaluations. However, as systems and benchmarks continue to grow in size and complexity, the costs of conducting evaluations are skyrocketing, and many companies are limiting access to their models. As a result, researchers and policymakers are increasingly unable to fully scrutinize the performance of cutting-edge systems.
A path forward
To address these critical problems, we need to move beyond aggregate metrics and find ways to make evaluation results publicly available.
Moving beyond aggregate metrics
In-depth performance breakdowns should be presented alongside, or instead of, aggregate metrics. Such breakdowns can be created by identifying features of the problem space that might be relevant to performance and then using those features to analyze, visualize, and predict performance.
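As a minimal sketch of what such a breakdown could look like (the features, instances, and pandas-based approach here are illustrative assumptions, not a prescribed method), instance-level records can be grouped by candidate features of the problem space:

```python
# Sketch of a feature-based performance breakdown (hypothetical features and data).
# Instead of a single accuracy number, group instance-level scores by features of the
# problem space that might plausibly drive performance.
import pandas as pd

results = pd.DataFrame([
    # Instance-level records: one row per benchmark instance.
    {"id": "q1", "correct": 1, "input_length": "short", "domain": "medical"},
    {"id": "q2", "correct": 0, "input_length": "long",  "domain": "medical"},
    {"id": "q3", "correct": 1, "input_length": "short", "domain": "legal"},
    {"id": "q4", "correct": 0, "input_length": "long",  "domain": "legal"},
])

print(f"Aggregate accuracy: {results['correct'].mean():.0%}")

# Breakdown by each candidate feature: this is what reveals where the system fails.
for feature in ["input_length", "domain"]:
    print(results.groupby(feature)["correct"].mean())
```

The same instance-level records could also be used to fit a simple model that predicts, from an instance’s features, whether the system is likely to succeed on it.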
These changes to reporting must go hand in hand with changes to how benchmark tasks are constructed. A system’s performance cannot be evaluated unless the benchmark comprehensively covers the problem space.
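One simple way to check that coverage, sketched below with a hypothetical two-feature problem space, is to count how many benchmark instances fall into each combination of features and flag the combinations that go untested:

```python
# Sketch of a coverage check over a hypothetical problem space.
# Count how many benchmark instances fall into each combination of features;
# empty cells mean that region of the problem space is not being tested at all.
from collections import Counter
from itertools import product

# Hypothetical feature space and benchmark metadata.
feature_space = {
    "lighting": ["bright", "dim"],
    "skin_type": ["lighter", "darker"],
}
benchmark = [
    {"lighting": "bright", "skin_type": "lighter"},
    {"lighting": "bright", "skin_type": "lighter"},
    {"lighting": "bright", "skin_type": "darker"},
]

counts = Counter(tuple(inst[f] for f in feature_space) for inst in benchmark)
for cell in product(*feature_space.values()):
    n = counts.get(cell, 0)
    flag = "  <-- uncovered" if n == 0 else ""
    print(f"{cell}: {n} instances{flag}")
```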
Making evaluation results available
A growing open-science movement has led to the creation of various platforms that could be used to share evaluation results, such as the Hugging Face Hub, GitHub, OpenML, Papers With Code, and the Open Science Framework. To date, however, researchers have had few incentives to put in the extra work needed to clean, document, release, and maintain these results.
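As a rough illustration of what releasing such results could involve (the record schema and file name here are hypothetical, not a proposed standard), instance-level results can be written to a simple machine-readable file and then uploaded to one of these platforms:

```python
# Sketch of releasing instance-by-instance results in a simple, shareable format.
# The record schema below is hypothetical; the point is that each row carries enough
# detail (instance ID, relevant metadata, model output, score) for others to re-analyze it.
import json

records = [
    {"instance_id": "img_001", "subgroup": "darker-skinned women",
     "prediction": "male", "label": "female", "score": 0},
    {"instance_id": "img_002", "subgroup": "lighter-skinned men",
     "prediction": "male", "label": "male", "score": 1},
]

with open("instance_level_results.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# The resulting file could then be shared on any of the platforms mentioned above
# (e.g., a Hugging Face dataset repo, GitHub, OpenML, or the Open Science Framework).
```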
We therefore need to start broader conversations about how best to incentivize the public release of evaluation results. Within academia, for example, publishing venues and funding agencies could encourage or require the sharing of these results. Outside of academia, we recommend that regulators and industry organizations create policies around results sharing. There are, of course, situations in which instance-by-instance results cannot be released (e.g., owing to privacy concerns or practical constraints), but in most cases it should be possible, and beneficial, to do so.
Several initiatives give us confidence that these changes in broader research culture are possible. For example, researchers who developed the Holistic Evaluation of Language Models (HELM) benchmark made the full evaluation results available for a variety of models across the entire benchmark. If other fields, such as psychology and medicine, can make progress on these issues even in the face of considerable data privacy challenges, AI should be able to do the same.
Between the lines
The field of AI evaluation is reaching a critical moment. Progress in AI development is moving faster than ever before, with shiny new systems being released on almost a monthly basis. If we hope to keep up with this pace and maintain a grasp of how safe and capable these systems are, we need to think carefully about our evaluation practices.
In this paper, we focused on two key barriers to the community’s ability to understand system behavior, but there are many other pressing evaluation issues that need addressing. For example, we lack robust tests of the complex cognitive abilities that cutting-edge systems purport to have. Moreover, the speed at which benchmarks are becoming obsolete is rapidly increasing—partly because capabilities are quickly advancing beyond the limits of what our benchmarks are designed to test and partly due to data contamination as the benchmark data is incorporated into the training data for new systems. We don’t have all the answers, but we hope this paper will spark broader discussions about getting system evaluation back on track.