🔬 Research summary by Rohan Alur, a second-year PhD student in Electrical Engineering and Computer Science at MIT.
[Original paper by Rohan Alur, Loren Laine, Darrick K. Li, Manish Raghavan, Devavrat Shah and Dennis Shung]
Overview: In this work, we develop a statistical framework to test whether an expert tasked with making predictions (e.g., a doctor making patient diagnoses) incorporates information unavailable to any competing predictive algorithm. This ‘information’ may be implicit; for example, experts often exercise judgment or rely on intuition that is difficult to model with an algorithmic prediction rule. A rejection by our test thus suggests that human experts may add value to any algorithm trained on the available data, which in turn implies that optimal performance on the given prediction task requires incorporating expert feedback.
Introduction
Trained human experts often handle high-stakes prediction tasks (e.g., patient diagnosis). A common source of concern about automation in these settings is that experts may exercise intuition that is difficult to model and/or have access to information (e.g., conversations with a patient) that is simply unavailable to a would-be algorithm. This raises the question of whether human experts add value that an algorithmic predictor could not capture. We develop a statistical framework to pose this question as a natural hypothesis test. Indeed, as our framework highlights, detecting human expertise is more subtle than simply comparing the accuracy of expert predictions to that of a particular learning algorithm. We apply our proposed test to admissions data collected from the emergency department of a large academic hospital system, where we show that physicians’ admit/discharge decisions for patients with acute gastrointestinal bleeding (AGIB) appear to incorporate information not captured by a standard algorithmic screening tool. Importantly, this is true even though the screening tool is arguably more accurate than physicians’ discretionary decisions, highlighting that accuracy alone is insufficient to justify algorithmic automation, even absent normative or legal concerns.
Key Insights
Experts may add value even when they underperform a competing algorithm
A natural first step when assessing the performance of human experts is to compare their predictive accuracy to that of a competing predictive algorithm. However, as we illustrate with toy examples and a real-world case study, this differs from testing whether experts add value to a given prediction task. In particular, it is straightforward to construct examples in which an algorithm handily beats an expert in predictive accuracy, yet the expert nonetheless incorporates intuition or unobserved information that is useful for improving predictions. For example, a doctor who underperforms a predictive algorithm trained only on electronic medical records may nonetheless glean valuable information from direct conversations with patients. This suggests that comparing human performance to an algorithm is insufficient for determining whether a given prediction task can or should be automated – even absent normative or legal concerns about automation in high-stakes settings.
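As a concrete (and purely hypothetical) illustration of this point, consider a simulated outcome that depends on an observed feature x and a signal u that only the expert sees. The setup below is our own toy construction, not taken from the paper: the algorithm uses x alone and is more accurate than the noisy expert, yet blending the expert’s forecast into the algorithm’s prediction improves on both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy data-generating process: the outcome depends on a feature x that the
# algorithm observes and a signal u that only the expert observes.
x = rng.normal(size=n)
u = rng.normal(size=n)
y = x + u + rng.normal(scale=0.5, size=n)

algorithm = x                        # best prediction given x alone
expert = u + rng.normal(size=n)      # sees u, but noisily, and ignores x
combined = algorithm + 0.5 * expert  # algorithm augmented with the expert's forecast

mse = lambda pred: np.mean((y - pred) ** 2)
print(f"algorithm MSE: {mse(algorithm):.2f}")  # ~1.25
print(f"expert MSE:    {mse(expert):.2f}")     # ~2.25 (worse than the algorithm)
print(f"combined MSE:  {mse(combined):.2f}")   # ~0.75 (better than either alone)
```

The 0.5 weight on the expert’s forecast is simply the best linear weight for this particular setup; the point is only that the expert’s predictions carry information about the outcome beyond x, even though the expert is the less accurate forecaster.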
ExpertTest: intuition and algorithm
We propose the ExpertTest algorithm, which, given the set of inputs that might be available to an algorithm (‘features’) and a history of human predictions and realized outcomes, tests whether the expert is using information that could not be captured by any algorithm trained on the available features. This procedure is simple and takes the form of a conditional independence test specialized to our setting. That is, conditional on the features, we test whether the human predictions are informative about the outcome in a way that improves predictive accuracy. Intuitively, one can think of our algorithm as testing whether experts can reliably distinguish two observations with identical (or nearly identical) features. If they can, it must be that the expert is relying on additional information which is not present in the data (including, perhaps, their intuition or judgment) to make these distinctions.
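To make this concrete, here is a minimal sketch of how such a conditional independence test could be implemented in Python. This is our own illustrative code, not the paper’s reference implementation: it groups observations with identical feature values, shuffles the expert’s predictions within each group, and asks whether the expert’s actual accuracy beats the shuffled versions.

```python
import numpy as np

def expert_test_sketch(features, expert_preds, outcomes, n_permutations=1000, seed=0):
    """Permutation-style conditional independence test (illustrative sketch).

    Groups observations by identical feature values, then checks whether the
    expert's predictions track outcomes better than random shuffles of those
    predictions among observations the algorithm cannot tell apart.
    Returns a one-sided p-value.
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features)
    expert_preds = np.asarray(expert_preds, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)

    # Index observations by their (identical) feature values.
    groups = {}
    for i, row in enumerate(features.reshape(len(features), -1)):
        groups.setdefault(tuple(row), []).append(i)

    # Test statistic: mean squared error of the expert's predictions.
    def loss(preds):
        return np.mean((preds - outcomes) ** 2)

    observed = loss(expert_preds)

    # Null distribution: shuffle the expert's predictions within each group.
    null_losses = np.empty(n_permutations)
    for b in range(n_permutations):
        permuted = expert_preds.copy()
        for idx in groups.values():
            idx = np.asarray(idx)
            permuted[idx] = expert_preds[rng.permutation(idx)]
        null_losses[b] = loss(permuted)

    # Small p-value: the expert is reliably more accurate than within-group
    # shuffles, i.e., their predictions carry information beyond the features.
    return (1 + np.sum(null_losses <= observed)) / (1 + n_permutations)
```

With continuous features one would bin or coarsen them so that ‘nearly identical’ observations fall in the same group, and the squared-error statistic could be swapped for whatever loss matters for the task (e.g., false negatives in a triage setting).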
Case study
We apply ExpertTest in a case study of emergency room triage decisions for patients who present with acute gastrointestinal bleeding (AGIB). The goal of triage in this setting is to hospitalize patients who require some form of acute care and to discharge those who do not.
We assess whether emergency room physicians make these hospitalization decisions by incorporating information not captured by the Glasgow-Blatchford Score (GBS), a standard algorithmic screening tool known to be a highly sensitive measure of risk for patients with AGIB. We corroborate this in data collected from the emergency department of a large academic health system, where we show that making hospitalization decisions based solely on the GBS can modestly outperform physicians’ discretionary decisions. In particular, the GBS can achieve substantially better accuracy at comparable sensitivity levels, i.e., it discharges very few patients who, in retrospect, should have been hospitalized, while discharging many more patients who do not require hospitalization. Nonetheless, we find strong evidence that physicians reliably distinguish between patients who present with identical Glasgow-Blatchford scores, indicating that physicians incorporate information not captured by this screening tool. These findings highlight that making accurate triage decisions still requires expert physician input, even when highly accurate screening tools are available and even if the goal is merely to maximize predictive accuracy (or minimize false negatives, as is often the case in high-stakes settings).
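To give a rough sense of how the earlier sketch would be applied in this setting (with entirely synthetic stand-in data, since the hospital records are not public), one could use the GBS as the only feature, physicians’ admit/discharge decisions as the expert predictions, and eventual need for hospitalization as the outcome:

```python
import numpy as np

# Entirely synthetic stand-in data; illustrative only.
rng = np.random.default_rng(1)
n = 2_000
gbs = rng.integers(0, 15, size=n)                     # hypothetical Glasgow-Blatchford scores
severity = gbs / 15 + rng.normal(scale=0.2, size=n)   # latent severity, partly visible to physicians
needs_admission = (severity + rng.normal(scale=0.1, size=n) > 0.5).astype(float)
physician_admits = (severity + rng.normal(scale=0.15, size=n) > 0.5).astype(float)

p_value = expert_test_sketch(gbs, physician_admits, needs_admission)
print(f"p-value: {p_value:.4f}")
# A small p-value indicates physicians reliably distinguish patients with the
# same GBS, mirroring the paper's qualitative finding.
```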
Between the lines
Summary
In this work, we provide a simple test to detect whether a human forecaster is incorporating unobserved information into their predictions, and we illustrate its utility in a case study of hospitalization decisions made by emergency room physicians. A key insight is that this requires more care than simply testing whether the forecaster outperforms an algorithm trained on observable data; a large body of prior work suggests this is rarely the case. Nonetheless, there are many settings in which we might expect an expert to use information or intuition that is difficult to replicate with a predictive model.
Limitations
An important limitation of our approach is that we do not consider the possibility that expert forecasts might inform decisions that causally affect the outcome of interest, as is often the case in practice. We also do not address the possibility that the objective of interest is not merely accuracy but perhaps some more sophisticated measure of utility (e.g., one which also values fairness or simplicity). We caution more generally that there are often normative reasons to prefer human decision-makers, and our test captures merely one possible notion of expertise.
Future Directions
Our work draws a clean separation between the ‘upstream’ inferential goal of detecting whether a forecaster incorporates unobserved information and the ‘downstream’ algorithmic task of designing tools that complement or otherwise incorporate human expertise. However, these problems share a very similar underlying structure, and we conjecture that – as observed in other supervised learning settings – there is a tight connection between these auditing and learning problems.