Summary contributed by Abhishek Gupta (@atg_abhishek), founder of the Montreal AI Ethics Institute.
*Authors of full paper & link at the bottom
Mini-summary: Human-machine collaborations have been shown to outperform purely human and purely machine systems time and again. This is because humans and machines have complementary strengths, which lets pieces of a task be distributed so that each side covers the other's weaknesses. In this paper, the authors explain how this can be done better in both discriminative and decision-theoretic settings. They advocate joint training approaches that keep the predictive task and the policy task of selecting when to query a human for support together, rather than training them separately.
Experimental results in the paper show that this approach is never worse, and often substantially better, than other approaches in the domain of human-machine complementarity. Specifically, where the costs of different errors are asymmetric, the authors find that their approach significantly outperforms existing methods. This is a welcome sign for the use of machine learning in contexts where high-stakes decisions are made, for example in medicine, where we want to minimize missed diagnoses. Ultimately, an approach like this allows us to build safer, more robust systems that leverage the relative strengths of humans and machines when making predictions.
Full summary:
Humans and machines working together has been studied in many settings, but this paper looks specifically at how a machine decides when to defer to human decision-making and how human and machine inputs are then combined.
The authors study both discriminative and decision-theoretic formulations of this approach, and compare against a baseline that first constructs a predictive model for the task and then, treating that model as fixed, learns a policy for choosing when to query a human.
The focus is on complementarity because humans and machines make asymmetric kinds of errors. This matters even more when the ML model has limited capacity and we must choose which parts of the task it should focus its efforts on. What distinguishes this work from others in the domain is the emphasis on joint training that explicitly accounts for the relative strengths and weaknesses of humans and machines.
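To make the idea of joint training concrete, here is a minimal illustrative sketch, not the authors' exact formulation: a single network produces both a class prediction and a probability of deferring to the human, and one loss couples the two so the classifier learns to spend its limited capacity where the human is weak. The network layout, the simulated human accuracy, and the query cost are all assumptions made for illustration.

# Hedged sketch of joint training of a predictor and a deferral policy.
# This is an illustration of the general idea, not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointModel(nn.Module):
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.backbone = nn.Linear(n_features, 32)
        self.classifier = nn.Linear(32, n_classes)   # machine prediction head
        self.defer_head = nn.Linear(32, 1)           # probability of querying the human

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        return self.classifier(h), torch.sigmoid(self.defer_head(h)).squeeze(-1)

def joint_loss(logits, defer_prob, y, human_correct, query_cost=0.1):
    # Expected cost if the machine answers: standard cross-entropy.
    machine_loss = F.cross_entropy(logits, y, reduction="none")
    # Expected cost if the human answers: 1 when the human is wrong, plus a query cost.
    human_loss = (1.0 - human_correct) + query_cost
    # The deferral probability mixes the two, so both heads are trained together.
    return ((1 - defer_prob) * machine_loss + defer_prob * human_loss).mean()

# Toy usage with random data and a simulated human who is right 80% of the time.
x = torch.randn(64, 10)
y = torch.randint(0, 2, (64,))
human_correct = (torch.rand(64) < 0.8).float()
model = JointModel(10, 2)
logits, defer_prob = model(x)
loss = joint_loss(logits, defer_prob, y, human_correct)
loss.backward()

Because the deferral probability weights the two cost terms in a single objective, gradients flow through both heads at once, which is what separates joint training from first fitting the classifier and only afterwards fitting a deferral rule on top of it.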
A discriminative approach maps features directly to outputs without building intermediate probabilistic representations of the system's components. Under the relaxed assumptions formulated in the paper, when the human is chosen to be queried, their decision is taken before the prediction is made. In the decision-theoretic approach, this first step can be followed by computing the expected value of information from querying the human. Starting from the fixed value of information (VOI) system allows the model to begin with well-founded probabilistic reasoning and then fine-tune for complementarity with the human.
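For the decision-theoretic side, the sketch below shows, under strong simplifying assumptions, how a fixed value-of-information rule might decide when to query the human: compare the model's confidence in its top class against an assumed estimate of human accuracy, and defer only when the expected gain exceeds the query cost. The human-accuracy estimate and cost values are hypothetical, and the paper's actual VOI computation is richer than this.

# Hedged sketch of a fixed VOI-style deferral rule, not the paper's computation.
import numpy as np

def expected_value_of_querying(p_y, human_accuracy, query_cost):
    # Expected accuracy if we trust the model's top class.
    machine_expected_accuracy = np.max(p_y)
    # Assumed expected accuracy if we defer to the human.
    human_expected_accuracy = human_accuracy
    return (human_expected_accuracy - machine_expected_accuracy) - query_cost

def decide(p_y, human_accuracy=0.85, query_cost=0.05):
    voi = expected_value_of_querying(np.asarray(p_y), human_accuracy, query_cost)
    return "query human" if voi > 0 else "predict with model"

print(decide([0.55, 0.45]))   # uncertain model: querying the human is worth the cost
print(decide([0.97, 0.03]))   # confident model: not worth querying

In the paper's framing, a rule like this built on a fixed predictive model serves as the well-founded starting point, which joint fine-tuning then adjusts for complementarity with the human.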
The authors' experiments make clear that the joint approach leads to better results than its fixed counterparts when those are optimized for complementarity. With deeper models, the joint approach matches its counterparts or makes modest improvements, but never performs worse. One insight the authors offer is that a lower-capacity model has high bias, which makes aligning the training procedure with the human all the more important, since some errors will be inevitable. Another experiment, on the CAMELYON16 dataset, showed that the gaps widen significantly under asymmetric costs, which bodes especially well for cases where particular kinds of errors should be avoided in practice, such as missed diagnoses in medicine.
Finally, the authors conclude that the distribution of errors shifts so that the model better complements the strengths and weaknesses of both humans and machines, something we should try to build into the design of our machine learning systems if we want them to be safer and more robust.
Original paper by Bryan Wilder, Eric Horvitz, Ece Kamar: https://arxiv.org/abs/2005.00582