The path toward equal performance in medical machine learning

September 6, 2023

🔬 Research Summary by Eike Petersen, a postdoctoral researcher at the Technical University of Denmark (DTU), working on fair, responsible, and robust machine learning for medicine.

[Original paper by Eike Petersen, Sune Holm, Melanie Ganz, and Aasa Feragen]


Overview: Medical machine learning models are often better at predicting outcomes or diagnosing diseases in some patient groups than others. This paper asks why such performance differences occur and what it would take to build models that perform equally well for all patients.


Introduction

Disparities in model performance across patient groups are common in medical machine learning. While under-representation of patient groups has received significant attention as one potential explanation, there are further, less discussed reasons for performance discrepancies. These include differences in prediction task difficulty between groups, caused, for example, by noisier measurements in one group, biased outcome measurements, and selection biases. Narrow algorithmic fairness solutions cannot address these issues. The authors conclude that leveling up model performance may require not only more data from underperforming groups but also better data.

Key Insights

Models often perform poorly on patient groups that are underrepresented in the training set; this is not always true, however. Cases have been reported in which group representation did not strongly affect model performance on that group, or even in which non-represented groups outperformed highly represented ones. The authors propose two distinct mechanisms to explain these seemingly contradictory observations:

1) A model may perform worse than theoretically achievable on a given group due to a combination of under-representation, the nature of the differences between the groups, and technical modeling and optimization choices.

2) The optimal level of achievable performance may differ between groups due to differences in the intrinsic difficulty of the prediction task.

The relationship between group representation and model performance

Whether and how strongly under-representation of a group affects model performance for that group depends on the nature of the differences between the groups. If the mapping from model inputs (medical measurements) to outputs (clinical prediction targets) is similar across groups, under-representation does not have to cause under-performance. Moreover, even if there are significant group differences (say, between male and female chest x-ray recordings), modern machine learning models are sufficiently expressive to still learn an optimal input-output mapping for all groups. In practice, however, this often does not happen, since inductive biases, explicit regularization schemes, and the use of local optimization methods will strongly bias the model toward optimizing performance on the majority group. In this situation, standard algorithmic fairness approaches can play a role: they can help counter these majority biases and recover per-group optimal achievable performance, as sketched below.
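One standard approach of this kind is reweighting: giving under-represented groups more weight in the training loss so the optimizer does not cater primarily to the majority group. The following is a minimal sketch on synthetic data, not the authors' method; the data-generating process, group labels, and variable names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def group_balanced_weights(groups):
    """Per-sample weights so that each group contributes equally to the training loss."""
    groups = np.asarray(groups)
    unique, counts = np.unique(groups, return_counts=True)
    per_group = dict(zip(unique, len(groups) / (len(unique) * counts)))
    return np.array([per_group[g] for g in groups])

# Hypothetical synthetic cohort: group "B" is heavily under-represented.
rng = np.random.default_rng(0)
n_a, n_b = 900, 100
X = np.vstack([rng.normal(0.0, 1.0, (n_a, 5)),      # group A features
               rng.normal(0.5, 1.0, (n_b, 5))])     # group B features, shifted
y = (X[:, 0] + rng.normal(0.0, 0.5, n_a + n_b) > 0).astype(int)
groups = np.array(["A"] * n_a + ["B"] * n_b)

# The unweighted fit is dominated by group A; reweighting counteracts that imbalance.
unweighted = LogisticRegression().fit(X, y)
reweighted = LogisticRegression().fit(X, y, sample_weight=group_balanced_weights(groups))

for name, model in [("unweighted", unweighted), ("reweighted", reweighted)]:
    for g in ("A", "B"):
        mask = groups == g
        print(name, g, round(model.score(X[mask], y[mask]), 3))
```

Reweighting is only one option; group-specific decision thresholds or constrained optimization serve the same purpose of countering majority bias. Crucially, all of these can only recover the performance that is achievable for each group on the given data, which is the subject of the next point.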

Differences in task difficulty

Separate from issues related to under-representation and the choice of particular optimization algorithms, the maximum achievable performance level (given a specific dataset) may also differ between groups. Notably, this inherently limits what can be achieved using standard algorithmic fairness approaches without resorting to leveling down, i.e., reducing performance for all groups to the lowest level achievable in all groups. One reason for such differences in task difficulty is more strongly distorted input measurements in one group, as in abdominal ultrasound or electromyographic recordings of obese patients. Another reason can be a lack of information about confounders that matter more in one group than in another; for example, hormone levels are often more predictive of clinical outcomes in females than in males.
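This effect can be illustrated with a small, purely hypothetical simulation: both groups share the same relationship between the true physiological signal and the outcome, and both are equally well represented, yet noisier measurements in one group cap the accuracy any model can reach for it. The numbers and the data-generating process below are assumptions for illustration, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def simulate_group(n, noise_sd):
    """Same outcome-generating rule in every group; only the measurement noise differs."""
    signal = rng.normal(0.0, 1.0, n)              # true physiological quantity
    y = (signal > 0).astype(int)                  # outcome depends only on the signal
    x = signal + rng.normal(0.0, noise_sd, n)     # what the model actually observes
    return x.reshape(-1, 1), y

# Equal sample sizes: under-representation plays no role here.
X_a, y_a = simulate_group(50_000, noise_sd=0.2)   # cleanly measured group
X_b, y_b = simulate_group(50_000, noise_sd=1.5)   # strongly distorted measurements

for name, X, y in [("group A", X_a, y_a), ("group B", X_b, y_b)]:
    acc = LogisticRegression().fit(X, y).score(X, y)
    print(name, round(acc, 3))   # group B stays well below group A: higher irreducible error
```

In this sketch, no amount of additional data from group B closes the gap; only better measurements (a lower noise level) would.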

Misleading performance estimates

Separate from the two issues outlined above, model performance metrics may be misleading. A model may predict outcomes correctly, yet because test samples are mislabeled (either by a medical expert or an automatic labeling mechanism), performance estimates may misleadingly indicate poor performance. Conversely, if the model has learned to “correctly” predict these labeling errors, performance estimates may misleadingly indicate good performance even though the model often makes wrong predictions. If the characteristics of such label errors differ between patient groups – as they often do – performance metrics can wrongly indicate the presence or absence of performance disparities.
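A back-of-the-envelope simulation makes this concrete. Suppose, hypothetically, a model is correct about the true outcome 90% of the time in both groups, but the reference labels used for evaluation are wrong 2% of the time in one group and 15% of the time in the other; the measured accuracies then diverge even though true performance is identical. The sketch assumes label errors are independent of model errors; all rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def measured_accuracy(true_acc, label_error_rate, n=200_000):
    """Accuracy as measured against imperfect reference labels.

    Assumes model errors and label errors are independent: the metric rewards
    agreement with the reference label, whether or not that label is correct.
    """
    model_correct = rng.random(n) < true_acc
    label_correct = rng.random(n) < 1.0 - label_error_rate
    return float(np.mean(model_correct == label_correct))

# Identical true performance (90%) in both groups, but noisier labels for group B.
print("group A:", round(measured_accuracy(0.90, label_error_rate=0.02), 3))  # ≈ 0.88
print("group B:", round(measured_accuracy(0.90, label_error_rate=0.15), 3))  # ≈ 0.78
```

The reverse distortion is equally possible: a model that has learned the annotators' systematic mistakes looks artificially good in the group where those mistakes are most common.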

A path forward

Given this more detailed understanding of how performance differences can arise, what does the path toward equal performance look like? It exists, but it may be long and winding. As a crucial first step, label and selection biases must be ruled out or addressed, as these hamper any meaningful investigation into the presence or absence of performance differences (and will also lead to severely biased model predictions if left unaddressed).

Secondly, the root causes of any observed model performance differences must be identified: Are they due to under-representation, technical design choices, or differences in task difficulty? The authors make some proposals for potential root-cause identification methods, but it is here that they perceive the largest gap in the literature: the principled identification of the root causes of observed performance differences is currently a highly challenging and application-specific endeavor.

Thirdly, and finally, an appropriate bias mitigation strategy can be devised. This may involve the gathering (or even design) of alternative or additional measurements, the targeted collection of more data from underrepresented groups, the use of algorithmic bias mitigation approaches, or changes in the model structure and training approach.

Between the lines

The algorithmic fairness literature often focuses on fairness-accuracy trade-offs and achieving fairness given a specific dataset. However, there is a priori no reason why it should be impossible to achieve equality of predictive performance by leveling up performance. This may require, however, reconsidering the setup of the estimation task and the data collection procedure. Improving performance in underperforming groups may require more data, as well as different and better data. The precise and principled identification of the most efficient mitigation approach in a given application remains an important open problem.

