🔬 Research Summary by Kate Donahue, a Computer Science PhD student at Cornell who studies the societal impacts of AI.
[Original paper by Kate Donahue, Sreenivas Gollapudi, and Kostas Kollias]
Overview: This paper studies human-algorithm collaboration in a common setting – picking the best item from a set (e.g., picking the best job candidate from a large set of applicants). In many applications, the algorithm narrows down the set to a smaller subset, from which the human makes the final pick. Here, we demonstrate how the performance of the joint human-algorithm system is affected by features of this setting, like the size of the set the algorithm presents, relative accuracy rates of the human and algorithm, and human cognitive biases.
Introduction
Consider the following setting: you’re an overworked, overtired hiring manager. You need to fill a critical role, for which you’ve gotten 500 applications – but you only have the time to interview 10. You decide to try out a new AI tool to help you – it will read through all 500 applications and give you a shortlist of applicants it thinks are most relevant to the job. Of course, you, as the hiring manager, will still make the final pick after you interview candidates. However, filling this role with the best candidate is crucial – are you sure that using this AI tool will make that more likely?
Picking the best item from a set is an extremely common goal – consider picking a product, a driving route, or even a medical diagnosis. Often, algorithmic tools play a crucial role, narrowing down the (potentially intractable) total set to a much smaller set of candidates from which the human picks. Here, a natural question is when the human using the AI tool is more likely to pick the best item than the human alone or the algorithm alone.
In this paper, we explore exactly this human-algorithm collaboration setting, studying how factors such as the size of the set that the algorithm presents, the relative accuracy of the human and algorithm, and human cognitive biases influence overall performance. Specifically, we focus on anchoring: a well-documented effect in which humans tend to view items towards the top of a list as better, independent of their true qualities. While this is a theoretical paper, giving proofs and some simulations, we aim to give high-level insight into this ubiquitous and important real-life problem.
Key Insights
The importance of independent ordering
The first factor we consider is independence: how correlated are the mistakes that the human and algorithm make? There are various reasons why humans and algorithms could make similar mistakes – for example, if they are both relying on similar sources of information. In addition, there’s a human-specific source of dependence: anchoring. As human beings, our cognitive reasoning has all kinds of features and bugs – and one extremely well-documented pattern is our tendency to believe that items at the top of a list are better than the ones at the bottom. For example, when Google returns a list of links, we tend to start at the top and work our way down. This tendency means that we, as humans, may discount our own judgment in favor of the algorithm’s ordering.

One benchmark goal in human-algorithm collaboration is complementarity: when the joint human-algorithm system performs strictly better than either the human or the algorithm alone. In particular, we’re interested in how human anchoring influences whether or not the system can achieve complementarity.
In this paper, we study anchoring by modeling how the algorithm’s ordering influences the human’s ordering. With complete anchoring, we assume the human’s ordering is very strongly influenced by the algorithm’s, while with perfect independence, we assume that the two orderings are completely independent. These models sketch out two extremes of human behavior – and they result in sharply differing performance. With complete anchoring, unfortunately, we prove that complementarity is impossible – no matter how accurate the human is or how many items the algorithm presents. By contrast, with perfect independence, complementarity is possible. First, we focus on the case where the human and the algorithm have equal accuracy and show that the algorithm can always ensure complementarity by showing the human its top two items. This result is especially encouraging given that we view humans as bandwidth-limited: even the busiest human can probably take the time to consider exactly two items.
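To make these two extremes concrete, here is a minimal simulation sketch. It uses a toy noisy-score model that is our own illustrative assumption rather than the paper’s exact formalism: the best item has true value 1, all decoys have value 0, and each agent ranks items by value plus independent Gaussian noise (a lower noise level standing in for higher accuracy). The `run` helper and all parameter values below are hypothetical.

```python
import numpy as np

def run(n_items=20, k=2, human_sd=0.6, algo_sd=0.6, anchored=False,
        trials=50_000, seed=0):
    """Estimate how often the joint system picks the true best item (index 0)."""
    rng = np.random.default_rng(seed)
    values = np.zeros(n_items)
    values[0] = 1.0                                   # item 0 is the true best
    wins = 0
    for _ in range(trials):
        algo_scores = values + rng.normal(scale=algo_sd, size=n_items)
        shortlist = np.argsort(-algo_scores)[:k]      # algorithm presents its top-k items
        if anchored:
            pick = shortlist[0]                       # complete anchoring: defer to the algorithm's #1
        else:
            # Perfect independence: the human re-ranks the shortlist with their own noisy scores.
            human_scores = values + rng.normal(scale=human_sd, size=n_items)
            pick = shortlist[np.argmax(human_scores[shortlist])]
        wins += (pick == 0)
    return wins / trials

if __name__ == "__main__":
    # With equal noise levels, either agent "alone" is just a top-1 pick from its own
    # noisy ranking, i.e. run(k=1). With these illustrative parameters, the independent
    # joint system with k=2 tends to beat that baseline, while the completely anchored
    # system can never beat the algorithm alone (it always takes the algorithm's top item).
    print("either agent alone:      ", run(k=1))
    print("joint, anchored (k=2):   ", run(k=2, anchored=True))
    print("joint, independent (k=2):", run(k=2, anchored=False))
```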
The asymmetric influence of differing accuracy rates
In many cases, the human and algorithm might have different accuracy rates – effectively, how well each could perform the task by itself. How do these differences influence overall performance? Our previous results rule out complementarity in the completely anchored case – but what about the completely independent case? Here, we show an intriguing asymmetry between the human and the algorithm. Specifically, we show that human accuracy is more important than algorithm accuracy: whenever the two agents have differing levels of accuracy, the overall human-algorithm system performs best when the human is the more accurate one. This asymmetry in importance mirrors the asymmetric roles the human and algorithm play – the human has the “final say” in whether the best item is picked, while all the algorithm can do is ensure the best item is included in its subset recommendation.
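One way to see this structural asymmetry (a sketch in our own notation, under the perfect-independence assumption, rather than an equation quoted from the paper): the joint system finds the best item only if the algorithm includes it in the presented subset and the human then selects it, so the success probability factors as

```latex
\Pr[\text{joint system picks the best item}]
  = \Pr[\text{best item} \in \text{algorithm's top-}k]
    \times \Pr[\text{human picks the best item} \mid \text{it is presented}]
```

The algorithm’s accuracy enters only through the first (inclusion) factor, while the human’s accuracy gates the final pick itself: if the human misjudges the shortlist, the best item is lost no matter how reliably the algorithm surfaced it.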
This asymmetry has downstream implications for complementarity. Specifically, it means that complementarity is easier to achieve when the human is more accurate than the algorithm, and harder to achieve when the algorithm is the more accurate one. Unfortunately, in many of the settings where AI tools are currently in use, algorithms already outperform humans – which, our results indicate, means that complementarity will likely be extremely hard to achieve.
Between the lines
Our paper looks at a ubiquitous setting – that of an AI tool narrowing down items for a human to pick from. We give insights into which features influence the performance of this system, such as relative accuracy, human anchoring, and the size of the set that’s presented. Naturally, there are many other avenues for exploring this space. For example, we could consider cases where the human and algorithm are misaligned – that is, they fundamentally disagree on what the “best” candidate is. Separately, there has also been research into learning the best set to present to a human, which would allow the set size to change based on the relative accuracy of the human and algorithm. We could also consider objectives beyond simply finding the best item – for example, finding a “good” item with high probability. If you’re interested in any of these questions or have thoughts on our paper, please feel free to reach out!