🔬 Research Summary by Kate Donahue, a Computer Science PhD student at Cornell who studies the societal impacts of AI. (Acknowledgment to Michela Meister for help with an early draft.)
[Original paper by Kate Donahue, Alexandra Chouldechova, and Krishnaram Kenthapadi]
Overview: In many real-world scenarios, humans use AI tools as assistants, while ultimately making the final decision themselves. In this paper, we build a theoretical framework to analyze human-algorithm collaboration, showing when combined systems can have lower error and be more fair – and when they can’t.
Introduction
Suppose that you’re a patient sitting in a doctor’s office, nervously waiting to be seen by your doctor. Based on the results of her examination, you might need further treatment, or you might be able to go home without a worry. When your doctor comes in, she opens up your file and walks you through it. “When I interpreted your lab results and medical history, I used this new artificial intelligence assistant,” she explains. “It’s not perfect, but it sometimes catches things I miss.” The notion of your doctor working with an AI assistant surprises you. How good is this assistant – and is your doctor more effective when she uses it?
This scenario is far from hypothetical. In many settings, we have AI systems that are highly effective – in fact, they can outperform doctors themselves at certain tasks. However, these AI tools still make mistakes, sometimes ones that are glaringly obvious to humans. For these reasons, AIs are often used as assistants to humans, rather than as autonomous decision-making tools.
Because of this, a key question is how a joint system – human and algorithm – performs. Is it better than a human alone – and what do we mean by “better”? In our paper, we analyze how combined systems perform along two axes – reducing average error, and distributing that error more fairly.
Key Insights
Will adding an AI assistant make me better?
First, will the combined system reduce overall error? Bansal et al. coined the term “complementarity” to mean a combined human-algorithm system that has lower average error than either the human or the algorithm alone. Complementarity, when it’s achievable, is a valuable accomplishment – in the doctor and AI assistant example, it would mean that patients, on the whole, would get more accurate results. In our full paper, we give specific conditions for when complementarity can (and cannot) be achieved.
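To make the definition concrete, here’s a tiny check with made-up error rates (purely illustrative numbers, not results from the paper):

```python
# Complementarity: the combined human+AI system beats BOTH the human alone
# and the algorithm alone on average error. (Toy numbers of my own choosing.)
human_error, algorithm_error, combined_error = 0.10, 0.08, 0.06

complementarity = combined_error < min(human_error, algorithm_error)
print(complementarity)  # True: the team outperforms each of its parts
```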
However, average error isn’t the only metric we might care about. Activists and researchers have been thinking about fairness in machine learning for decades. Will adding humans into this ML pipeline make things less fair? In our work, we consider two definitions of fairness. The first is “fairness of benefit” – the requirement that every type of patient (every person being predicted upon) should see their error decrease when the AI is added. Intuitively, this just means that everyone benefits from the new system. Unfortunately, we show that complementarity (overall error decreasing) is in tension with fairness of benefit.

The second notion of fairness we consider has a more positive result. This notion is “error disparity” – the gap in error between types of patients. Even for a human (or algorithm) acting alone, there’s typically some variability in error rates, with some patients getting more accurate (lower-error) predictions than others. This error disparity is often undesirable – all else being equal, we’d want everyone to get the same (low) error rate. We show that adding in an AI can unfortunately increase the error disparity of the combined system – but we also give concrete, easily achievable conditions to ensure that this doesn’t happen.
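Here’s a toy illustration of the two fairness notions, again with invented per-regime error rates (the “regimes” are groups of patients; more on them in the modeling section below):

```python
# Toy per-regime error rates (hypothetical numbers, just to make the definitions concrete).
human_alone = {"regime_A": 0.10, "regime_B": 0.04}
combined    = {"regime_A": 0.05, "regime_B": 0.06}  # human using the AI assistant
# (Implicitly, the AI alone is weaker than the human on regime_B, which is why the
# combined system can slip there even though its error stays between its inputs'.)

# Fairness of benefit: EVERY regime's error should decrease once the AI is added.
fairness_of_benefit = all(combined[r] < human_alone[r] for r in human_alone)
print(fairness_of_benefit)  # False: regime_B's error went up

# Error disparity: the gap between the best- and worst-served regimes.
disparity_before = max(human_alone.values()) - min(human_alone.values())
disparity_after  = max(combined.values()) - min(combined.values())
print(round(disparity_before, 3), round(disparity_after, 3))  # 0.06 0.01
```

In this particular toy example, adding the AI shrinks the error disparity but violates fairness of benefit – the two notions can pull in different directions.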
Modeling the human/AI system
Next, let’s go a little more in depth on the model and how we achieve the results I just described. Our first contribution is a formal, tractable model of human-algorithm collaboration. In our model, we have three components: the unaided human (here, a doctor making predictions based on the data and her own expertise), the unaided algorithm (an AI assistant making predictions based on the data), and the combined system (a doctor making decisions based on the AI tool as well as her own expertise).

We view each of these components as having some distribution of error rates over discrete “regimes” in the input space. A “regime” could be an arbitrarily precise sub-portion of the input space – something like “women in their late 20s with diabetes whose medical scans were done by MRI machine 5”. We don’t assume that we know these regimes – just that they exist. Each of the unaided human, the unaided algorithm, and the combined system has specific errors across these regimes – for example, on some regime the unaided human might have error 0.075 and the unaided algorithm might have error 0.05 (or vice versa). Then, we view this system through the lens of combining error rates: the error of the combined system (the human using the algorithm) is a function of those two errors (0.075 and 0.05). We also assume that, on a particular regime, the combined system can’t do better than the better of its inputs or worse than the worse of them – here, that would mean its error on that regime falls between 0.05 and 0.075. Basically, this models how a doctor might weigh advice from different sources, hopefully relying more heavily on whichever is more likely to be right.
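Here’s a minimal sketch, in code, of the setup I just described. The regime probabilities, error rates, and per-regime weights below are illustrative choices of mine, not numbers (or notation) from the paper:

```python
import numpy as np

regime_prob = np.array([0.5, 0.3, 0.2])      # how often each regime occurs
human_err   = np.array([0.075, 0.02, 0.10])  # unaided human error, per regime
algo_err    = np.array([0.05, 0.09, 0.03])   # unaided algorithm error, per regime

# The combined system's error on each regime lies between the human's and the
# algorithm's. One simple way to model that: a per-regime weight in [0, 1] saying
# how heavily the doctor leans on the AI in that regime (a modeling choice of mine).
weight = np.array([0.8, 0.2, 0.9])
combined_err = weight * algo_err + (1 - weight) * human_err

def average_error(per_regime_err):
    """Average error = regime-probability-weighted mean of per-regime errors."""
    return float(np.dot(regime_prob, per_regime_err))

print(average_error(human_err), average_error(algo_err), average_error(combined_err))
# Complementarity: does the combined system beat both the human and the algorithm?
print(average_error(combined_err) < min(average_error(human_err), average_error(algo_err)))
```

The convex combination here is just one easy way to enforce the “between its inputs” assumption; the model itself only requires that the combined error stay in that range.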
Variability as the key factor
Based on this model, we’re able to derive some results about when we can and cannot achieve complementarity (the combined system having lower average error than either the unaided human or the unaided algorithm). At a high level, these results show that, all else being equal, complementarity is easier to achieve when human and algorithm error rates are highly variable. That is, when the error differs a lot across different regimes and humans and algorithms have different strengths and weaknesses, the human using the algorithm is more likely to achieve complementarity. This insight is what drives the tension with fairness – complementarity is easier with more variable error rates, which is exactly what most notions of fairness try to eliminate.
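To see why variability matters, here’s a toy comparison of my own construction, with two equally likely regimes. In both scenarios the human and the algorithm have the same average error; the difference is whether their strengths line up or vary across regimes:

```python
import numpy as np

# In both scenarios the human and the algorithm each average 7% error. The combined
# system leans 80% toward whichever of the two is better in each regime, so its
# per-regime error always stays between theirs.

def combined_error(human_err, algo_err, trust_in_better=0.8):
    better = np.minimum(human_err, algo_err)
    worse = np.maximum(human_err, algo_err)
    return trust_in_better * better + (1 - trust_in_better) * worse

# Scenario 1: flat error rates -- neither side has a regime where it is clearly stronger.
flat_human, flat_algo = np.array([0.07, 0.07]), np.array([0.07, 0.07])

# Scenario 2: variable error rates -- each side has a regime where it shines.
var_human, var_algo = np.array([0.02, 0.12]), np.array([0.12, 0.02])

for name, h, a in [("flat", flat_human, flat_algo), ("variable", var_human, var_algo)]:
    c = combined_error(h, a).mean()
    print(f"{name}: human {h.mean():.3f}, algorithm {a.mean():.3f}, combined {c:.3f}")
# flat:     human 0.070, algorithm 0.070, combined 0.070 -> no improvement over either
# variable: human 0.070, algorithm 0.070, combined 0.040 -> better than both (complementarity)
```

In the flat scenario there’s nothing for the combination to exploit; in the variable one, leaning on whichever side is stronger in each regime pulls the average error below both.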
Between the lines
As with all research papers, the choice of model and assumptions is crucial for the overall results. When we were writing this paper, we played with many, many models before deciding on the ones we ultimately used – they were the best fit for the problem we wanted to explore. However, it would be fascinating to explore extensions that relax some of our assumptions. For example, how do our results change if we allow the human-algorithm system to have lower error on a particular regime than either of its inputs? How do they change if we allow regimes to differ in ways besides their error rates? How could we use this model to guide AI research – or the training of humans who work with AIs? If you’re interested in any of these questions, feel free to reach out – I’d love to talk!