
Montreal AI Ethics Institute

Democratizing AI ethics literacy


Auditing for Human Expertise

June 24, 2023

🔬 Research summary by Rohan Alur, a second-year PhD student in Electrical Engineering and Computer Science at MIT.

[Original paper by Rohan Alur, Loren Laine, Darrick K. Li, Manish Raghavan, Devavrat Shah and Dennis Shung]


Overview: In this work, we develop a statistical framework to test whether an expert tasked with making predictions (e.g., a doctor making patient diagnoses) incorporates information unavailable to any competing predictive algorithm. This ‘information’ may be implicit; for example, experts often exercise judgment or rely on intuition which is difficult to model with an algorithmic prediction rule. A rejection of our test thus suggests that human experts may add value to any algorithm trained on the available data. This implies that optimal performance for the given prediction task will require incorporating expert feedback.


Introduction

Trained human experts often handle high-stakes prediction tasks (e.g., patient diagnosis). A common source of concern about automation in these settings is that experts may exercise intuition that is difficult to model and/or have access to information (e.g., conversations with a patient) that is simply unavailable to a would-be algorithm. This raises the question of whether human experts add value that an algorithmic predictor could not capture. We develop a statistical framework to pose this question as a natural hypothesis test. Indeed, as our framework highlights, detecting human expertise is more subtle than simply comparing the accuracy of expert predictions to that of a particular learning algorithm. We apply our proposed test to admissions data collected from the emergency department of a large academic hospital system, where we show that physicians’ admit/discharge decisions for patients with acute gastrointestinal bleeding (AGIB) appear to incorporate information not captured by a standard algorithmic screening tool. Importantly, this is true even though the screening tool is arguably more accurate than physicians’ discretionary decisions, highlighting that accuracy alone is insufficient to justify algorithmic automation, even absent normative or legal concerns.

Key Insights

Experts may add value even when they underperform a competing algorithm

A natural first step when assessing the performance of human experts is to compare their predictive accuracy to that of a competing predictive algorithm. However, as we illustrate with toy examples and a real-world case study, this differs from testing whether experts add value to a given prediction task. In particular, it is straightforward to construct examples in which an algorithm handily beats an expert in predictive accuracy, yet the expert nonetheless incorporates intuition or unobserved information that is useful for improving predictions. For example, a doctor who underperforms a predictive algorithm trained only on electronic medical records may nonetheless glean valuable information from direct conversations with patients. This suggests that comparing human performance to an algorithm is insufficient for determining whether a given prediction task can or should be automated – even absent normative or legal concerns about automation in high-stakes settings.
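
As a concrete (and entirely hypothetical) illustration of this point, the short simulation below constructs an expert who reads the algorithm's feature only noisily but also observes a private signal the algorithm never sees: the expert's raw accuracy is lower than the algorithm's, yet a simple model that combines the algorithm's feature with the expert's prediction beats both. All names and parameters here are ours, not the paper's.

```python
# Toy simulation (ours, not the paper's): an expert can trail an algorithm in
# accuracy while still adding information the algorithm cannot access.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200_000

x = rng.standard_normal(n)   # feature available to the algorithm
u = rng.standard_normal(n)   # private signal seen only by the expert
y = (x + u > 0).astype(int)  # outcome depends on both

algo_pred = (x > 0).astype(int)  # the best rule given x alone

# The expert observes x only noisily, but also sees u.
expert_pred = (x + 2.0 * rng.standard_normal(n) + u > 0).astype(int)

# Combine the algorithm's feature with the expert's prediction.
design = np.column_stack([x, expert_pred])
combined_pred = LogisticRegression().fit(design, y).predict(design)

for name, pred in [("algorithm", algo_pred),
                   ("expert", expert_pred),
                   ("combined", combined_pred)]:
    print(f"{name:9} accuracy: {(pred == y).mean():.3f}")
# Expected pattern: the expert is less accurate than the algorithm,
# yet the combination is more accurate than either alone.
```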

ExpertTest: intuition and algorithm

We propose the ExpertTest algorithm, which, given the set of inputs that might be available to an algorithm (‘features’) and a history of human predictions and realized outcomes, tests whether the expert is using information that could not be captured by any algorithm trained on the available features. This procedure is simple and takes the form of a conditional independence test specialized to our setting. That is, conditional on the features, we test whether the human predictions are informative about the outcome in a way that improves predictive accuracy. Intuitively, one can think of our algorithm as testing whether experts can reliably distinguish two observations with identical (or nearly identical) features. If they can, it must be that the expert is relying on additional information which is not present in the data (including, perhaps, their intuition or judgment) to make these distinctions.  
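
To make this intuition concrete, here is a minimal sketch, written by us rather than taken from the paper, of a within-group permutation test in this spirit: match observations on identical features, shuffle the expert's predictions within each matched group, and check whether the expert's actual agreement with the outcomes exceeds what chance would produce. The function name, interface, and choice of test statistic are all assumptions for illustration; the paper's procedure is more general.

```python
# Illustrative sketch of the intuition behind ExpertTest (not the authors'
# reference implementation). Observations are grouped by identical feature
# values; within each group the expert's predictions are shuffled, and we ask
# whether the expert's actual agreement with realized outcomes is higher than
# chance would allow.
from collections import defaultdict

import numpy as np


def expert_test_sketch(features, expert_preds, outcomes,
                       n_permutations=10_000, seed=0):
    """Return an approximate p-value for the null hypothesis that, conditional
    on the features, the expert's predictions carry no information about the
    outcome. Assumes discrete features (coarsen continuous ones first) and
    binary predictions/outcomes."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features)
    expert_preds = np.asarray(expert_preds)
    outcomes = np.asarray(outcomes)

    # Group observation indices by identical feature values.
    buckets = defaultdict(list)
    for i, x in enumerate(features):
        buckets[tuple(np.atleast_1d(x))].append(i)
    groups = [np.array(idx) for idx in buckets.values() if len(idx) > 1]

    # Test statistic: how often the expert's prediction matches the outcome,
    # summed over matched groups (any accuracy-like statistic would do).
    def statistic(preds):
        return sum((preds[g] == outcomes[g]).sum() for g in groups)

    observed = statistic(expert_preds)

    # Null distribution: shuffle predictions *within* each feature group, which
    # preserves any dependence of the predictions on the features themselves.
    exceed = 0
    for _ in range(n_permutations):
        permuted = expert_preds.copy()
        for g in groups:
            permuted[g] = rng.permutation(permuted[g])
        if statistic(permuted) >= observed:
            exceed += 1

    return (1 + exceed) / (1 + n_permutations)
```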

Case study

We apply ExpertTest in a case study of emergency room triage decisions for patients who present with acute gastrointestinal bleeding (AGIB). The goal of triage in this setting is to hospitalize patients who require some form of acute care and to discharge those who do not. 

We assess whether emergency room physicians make these hospitalization decisions by incorporating information not captured by the Glasgow-Blatchford Score (GBS), a standard algorithmic screening tool known to be a highly sensitive measure of risk for patients with AGIB. We corroborate this in data collected from the emergency department of a large academic health system, where we show that making hospitalization decisions based solely on the GBS can modestly outperform physicians’ discretionary decisions. In particular, the GBS can achieve substantially better accuracy at comparable sensitivity levels, i.e., discharging very few patients who, in retrospect, should have been hospitalized, while discharging many more patients who do not require hospitalization. Nonetheless, we find strong evidence that physicians reliably distinguish between patients who present with identical Glasgow-Blatchford scores, indicating that physicians are incorporating information not captured by this screening tool. These findings highlight that making accurate triage decisions still requires expert physician input, even when highly accurate screening tools are available and even if the goal is merely to maximize predictive accuracy (or to minimize false negatives, as is often the case in high-stakes settings).
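
To show how the case study maps onto the test, a hypothetical invocation of the expert_test_sketch function above might treat the GBS as the only feature, the physician's admit/discharge call as the expert prediction, and whether the patient in retrospect required acute care as the outcome. The file and column names below are invented for the example and are not the study's actual data schema.

```python
import pandas as pd

# Hypothetical dataset; columns are illustrative, not the study's real schema.
df = pd.read_csv("agib_triage.csv")

p_value = expert_test_sketch(
    features=df["gbs"].to_numpy(),           # Glasgow-Blatchford Score (integer risk score)
    expert_preds=df["admitted"].to_numpy(),  # physician decision: 1 = hospitalize, 0 = discharge
    outcomes=df["needed_care"].to_numpy(),   # 1 if the patient turned out to need acute care
)
print(f"p-value: {p_value:.4f}")
# A small p-value suggests physicians distinguish between patients with the
# same score, i.e., they use information the GBS does not capture.
```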

Between the lines

Summary

In this work, we provide a simple test to detect whether a human forecaster is incorporating unobserved information into their predictions and illustrate its utility in a case study of hospitalization decisions made by emergency room physicians. A key insight is that this requires more care than simply testing whether the forecaster outperforms an algorithm trained on observable data; a large body of prior work suggests this is rarely the case. Nonetheless, there are many settings in which we might expect an expert to use information or intuition that is difficult to replicate with a predictive model.

Limitations

An important limitation of our approach is that we do not consider the possibility that expert forecasts might inform decisions that causally affect the outcome of interest, as is often the case in practice. We also do not address the possibility that the objective of interest is not merely accuracy but perhaps some more sophisticated measure of utility (e.g., one which also values fairness or simplicity). We caution more generally that there are often normative reasons to prefer human decision-makers, and our test captures merely one possible notion of expertise. 

Future Directions

Our work draws a clean separation between the ‘upstream’ inferential goal of detecting whether a forecaster incorporates unobserved information and the ‘downstream’ algorithmic task of designing tools that complement or otherwise incorporate human expertise. However, these problems share a very similar underlying structure, and we conjecture that – as observed in other supervised learning settings – there is a tight connection between these auditing and learning problems.

Want quick summaries of the latest research & reporting in AI ethics delivered to your inbox? Subscribe to the AI Ethics Brief. We publish bi-weekly.
