
Auditing for Human Expertise

June 24, 2023

🔬 Research summary by Rohan Alur, a second-year PhD student in Electrical Engineering and Computer Science at MIT.

[Original paper by Rohan Alur, Loren Laine, Darrick K. Li, Manish Raghavan, Devavrat Shah and Dennis Shung]


Overview: In this work, we develop a statistical framework to test whether an expert tasked with making predictions (e.g., a doctor making patient diagnoses) incorporates information unavailable to any competing predictive algorithm. This ‘information’ may be implicit; for example, experts often exercise judgment or rely on intuition that is difficult to model with an algorithmic prediction rule. A rejection of our test thus suggests that human experts may add value to any algorithm trained on the available data. This implies that optimal performance for the given prediction task will require incorporating expert feedback.


Introduction

Trained human experts often handle high-stakes prediction tasks (e.g., patient diagnosis). A common source of concern about automation in these settings is that experts may exercise intuition that is difficult to model and/or have access to information (e.g., conversations with a patient) that is simply unavailable to a would-be algorithm. This raises the question of whether human experts add value that an algorithmic predictor could not capture. We develop a statistical framework to pose this question as a natural hypothesis test. Indeed, as our framework highlights, detecting human expertise is more subtle than simply comparing the accuracy of expert predictions to those made by a particular learning algorithm. We apply our proposed test to admissions data collected from the emergency department of a large academic hospital system, where we show that physicians’ admit/discharge decisions for patients with acute gastrointestinal bleeding (AGIB) appear to be incorporating information not captured by a standard algorithmic screening tool. Importantly, this is true even though the screening tool is arguably more accurate than physicians’ discretionary decisions, highlighting that accuracy is insufficient to justify algorithmic automation, even absent normative or legal concerns. 
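Concretely, letting X denote the features available to an algorithm, Ŷ the expert’s prediction, and Y the realized outcome (our notation here, which need not match the paper’s exact formalism), the null hypothesis of ‘no added expertise’ can be stated as a conditional independence relation:

```latex
% Null hypothesis: conditional on the features X, the expert's prediction
% \hat{Y} carries no additional information about the outcome Y.
H_0 :\; Y \perp\!\!\!\perp \hat{Y} \mid X
```

Rejecting this null suggests that the expert’s predictions reflect information beyond what any prediction rule trained on X alone could exploit.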

Key Insights

Experts may add value even when they underperform a competing algorithm

A natural first step when assessing the performance of human experts is to compare their predictive accuracy to that of a competing predictive algorithm. However, as we illustrate with toy examples and a real-world case study, this differs from testing whether experts add value to a given prediction task. In particular, it is straightforward to construct examples in which an algorithm handily beats an expert in predictive accuracy, yet the expert nonetheless incorporates intuition or unobserved information that is useful for improving predictions. For example, a doctor who underperforms a predictive algorithm trained only on electronic medical records may nonetheless glean valuable information from direct conversations with patients. This suggests that comparing human performance to an algorithm is insufficient for determining whether a given prediction task can or should be automated, even absent normative or legal concerns about automation in high-stakes settings.

ExpertTest: intuition and algorithm

We propose the ExpertTest algorithm, which, given the set of inputs that might be available to an algorithm (‘features’) and a history of human predictions and realized outcomes, tests whether the expert is using information that could not be captured by any algorithm trained on the available features. This procedure is simple and takes the form of a conditional independence test specialized to our setting. That is, conditional on the features, we test whether the human predictions are informative about the outcome in a way that improves predictive accuracy. Intuitively, one can think of our algorithm as testing whether experts can reliably distinguish two observations with identical (or nearly identical) features. If they can, it must be that the expert is relying on additional information which is not present in the data (including, perhaps, their intuition or judgment) to make these distinctions.  
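To make the idea concrete, here is a minimal sketch in this spirit (our own illustration, not the authors’ exact ExpertTest procedure, and all names are hypothetical): group observations that share identical feature values, then run a within-group permutation test to check whether the expert’s predictions agree with the realized outcomes more often than chance alone would explain.

```python
# Minimal sketch (our illustration, not the paper's implementation) of a
# within-group permutation test for "does the expert add information beyond
# the features?". Assumes discrete predictions/outcomes (e.g., admit/discharge).
import numpy as np
from collections import defaultdict

def expert_value_test(features, expert_preds, outcomes, n_permutations=1000, seed=0):
    """Approximate p-value for the null that, conditional on the features,
    expert predictions carry no information about the outcomes."""
    rng = np.random.default_rng(seed)
    expert_preds = np.asarray(expert_preds)
    outcomes = np.asarray(outcomes)

    # Group observations by identical feature vectors; a real analysis would
    # instead match on nearly identical features.
    groups = defaultdict(list)
    for i, x in enumerate(np.asarray(features)):
        groups[tuple(x)].append(i)
    # Only groups with more than one observation can separate expert signal
    # from what the features already explain.
    idx_groups = [np.array(g) for g in groups.values() if len(g) > 1]

    def agreement(preds):
        # Test statistic: total within-group agreement between predictions and outcomes.
        return sum(int(np.sum(preds[g] == outcomes[g])) for g in idx_groups)

    observed = agreement(expert_preds)

    # Null distribution: shuffle expert predictions *within* each group, which
    # preserves the feature information but breaks any extra signal.
    exceed = 0
    for _ in range(n_permutations):
        permuted = expert_preds.copy()
        for g in idx_groups:
            permuted[g] = rng.permutation(permuted[g])
        if agreement(permuted) >= observed:
            exceed += 1

    return (1 + exceed) / (1 + n_permutations)
```

A small p-value indicates that the expert reliably distinguishes observations the features cannot tell apart, which is precisely the sense in which a rejection signals human expertise.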

Case study

We apply ExpertTest in a case study of emergency room triage decisions for patients who present with acute gastrointestinal bleeding (AGIB). The goal of triage in this setting is to hospitalize patients who require some form of acute care and to discharge those who do not. 

We assess whether emergency room physicians incorporate information not captured by the Glasgow-Blatchford Score (GBS) when making these hospitalization decisions. This standard algorithmic screening tool is known to be a highly sensitive measure of risk for patients with AGIB. We corroborate this finding in data collected from the emergency department of a large academic health system, where we show that making hospitalization decisions based solely on the GBS can modestly outperform physicians’ discretionary decisions. In particular, the GBS can achieve substantially better accuracy at comparable sensitivity levels, i.e., it discharges very few patients who, in retrospect, should have been hospitalized, while discharging many more patients who do not require hospitalization. Nonetheless, we find strong evidence that physicians reliably distinguish between patients who present with identical Glasgow-Blatchford scores, indicating that physicians are incorporating information not captured by this screening tool. These findings highlight that making accurate triage decisions still requires expert physician input, even when highly accurate screening tools are available and even if the goal is merely to maximize predictive accuracy (or minimize false negatives, as is often the case in high-stakes settings).
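As a rough illustration of the kind of comparison involved (not the paper’s analysis code; the inputs and the admission threshold are hypothetical), one could compute sensitivity and overall accuracy for a GBS threshold rule and for physicians’ discretionary decisions on the same patients:

```python
# Illustrative comparison of a GBS threshold rule against physicians'
# admit/discharge decisions; inputs and threshold are hypothetical.
import numpy as np

def sensitivity_and_accuracy(admit, needed_care):
    """admit, needed_care: boolean arrays (True = admitted / truly needed acute care)."""
    admit, needed_care = np.asarray(admit), np.asarray(needed_care)
    sensitivity = np.sum(admit & needed_care) / max(np.sum(needed_care), 1)
    accuracy = np.mean(admit == needed_care)
    return sensitivity, accuracy

def compare_rules(gbs_scores, physician_admits, needed_care, threshold=1):
    """Admit whenever the GBS meets the threshold; compare against physicians."""
    gbs_admits = np.asarray(gbs_scores) >= threshold
    return {
        "GBS rule": sensitivity_and_accuracy(gbs_admits, needed_care),
        "physicians": sensitivity_and_accuracy(physician_admits, needed_care),
    }
```

The point of the case study is that the GBS rule can look better on metrics like these while physicians still reliably separate patients with identical scores, which is exactly the signal ExpertTest is designed to detect.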

Between the lines

Summary

In this work, we provide a simple test to detect whether a human forecaster is incorporating unobserved information into their predictions and illustrate its utility in a case study of hospitalization decisions made by emergency room physicians. A key insight is that this requires more care than simply testing whether the forecaster outperforms an algorithm trained on observable data; a large body of prior work suggests this is rarely the case. Nonetheless, there are many settings in which we might expect an expert to draw on information or intuition that is difficult to replicate with a predictive model.

Limitations

An important limitation of our approach is that we do not consider the possibility that expert forecasts might inform decisions that causally affect the outcome of interest, as is often the case in practice. We also do not address the possibility that the objective of interest is not merely accuracy but perhaps some more sophisticated measure of utility (e.g., one which also values fairness or simplicity). We caution more generally that there are often normative reasons to prefer human decision-makers, and our test captures merely one possible notion of expertise. 

Future Directions

Our work draws a clean separation between the ‘upstream’ inferential goal of detecting whether a forecaster incorporates unobserved information and the ‘downstream’ algorithmic task of designing tools that complement or otherwise incorporate human expertise. However, these problems share a very similar underlying structure, and we conjecture that – as observed in other supervised learning settings – there is a tight connection between these auditing and learning problems.
