Montreal AI Ethics Institute

Democratizing AI ethics literacy

The path toward equal performance in medical machine learning

September 6, 2023

🔬 Research Summary by Eike Petersen, a postdoctoral researcher at the Technical University of Denmark (DTU), working on fair, responsible, and robust machine learning for medicine.

[Original paper by Eike Petersen, Sune Holm, Melanie Ganz, and Aasa Feragen]


Overview: Medical machine learning models are often better at predicting outcomes or diagnosing diseases in some patient groups than others. This paper asks why such performance differences occur and what it would take to build models that perform equally well for all patients.


Introduction

Disparities in model performance across patient groups are common in medical machine learning. While the under-representation of patient groups has received significant attention as one potential explanation, other reasons for performance discrepancies are discussed far less often. These include differences in prediction-task difficulty between groups (due, for example, to noisier measurements in one group), biased outcome measurements, and selection biases. Narrow algorithmic fairness solutions cannot address these issues. The authors conclude that leveling up model performance may require not only more data from underperforming groups but also better data.

Key Insights

Models often perform poorly on patient groups that are underrepresented in the training set; however, this is not always the case. Cases have been reported in which group representation did not strongly affect model performance on that group, or in which non-represented groups even outperformed highly represented ones. The authors propose two distinct mechanisms to explain these seemingly contradictory observations:

1) A model may perform worse than theoretically achievable on a given group due to a combination of under-representation, the nature of the differences between the groups, and technical modeling and optimization choices.

2) The optimal level of achievable performance may differ between groups due to differences in the intrinsic difficulty of the prediction task.

The relationship between group representation and model performance

Whether and how strongly under-representation of a group affects model performance for that group depends on the nature of the differences between the groups. If the mapping from model inputs (medical measurements) to outputs (clinical prediction targets) is similar across groups, under-representation does not have to cause under-performance. Moreover, even if there are significant group differences (say, between male and female chest x-ray recordings), modern machine learning models are sufficiently expressive to still learn an optimal input-output mapping for all groups. In practice, however, this often does not happen, since inductive biases, explicit regularization schemes, and the use of local optimization methods will strongly bias the model toward optimizing performance on the majority group. In this situation, standard algorithmic fairness approaches can play a role: they can help counter these majority biases and recover per-group optimal achievable performance.
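
To make this concrete, here is a minimal sketch (not from the paper) of one common countermeasure, group-balanced reweighting, on toy synthetic data: each group is made to contribute equally to the training loss instead of in proportion to its sample count. The data, the group sizes, and the choice of logistic regression are all hypothetical and purely illustrative.

```python
# Minimal sketch: counter majority-group bias by reweighting the training loss
# so that each group contributes equally rather than in proportion to its size.
# All data and modeling choices below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy setup: a large majority group and a small minority group whose
# input-output mapping depends on a different feature.
n_maj, n_min = 9000, 1000
X_maj = rng.normal(size=(n_maj, 5))
X_min = rng.normal(size=(n_min, 5))
y_maj = (X_maj[:, 0] + 0.1 * rng.normal(size=n_maj) > 0).astype(int)
y_min = (X_min[:, 1] + 0.1 * rng.normal(size=n_min) > 0).astype(int)

X = np.vstack([X_maj, X_min])
y = np.concatenate([y_maj, y_min])
group = np.concatenate([np.zeros(n_maj, dtype=int), np.ones(n_min, dtype=int)])

# Inverse-frequency weights: every group gets the same total weight.
counts = np.bincount(group)
weights = (len(group) / (len(counts) * counts))[group]

plain = LogisticRegression().fit(X, y)                            # majority-dominated
balanced = LogisticRegression().fit(X, y, sample_weight=weights)  # group-balanced

for name, model in [("plain", plain), ("balanced", balanced)]:
    for g in (0, 1):
        mask = group == g
        auc = roc_auc_score(y[mask], model.predict_proba(X[mask])[:, 1])
        print(f"{name:9s} group {g}: AUC = {auc:.3f}")
```

With this deliberately simple linear model, reweighting improves the minority group at some cost to the majority; as noted above, a sufficiently expressive model could in principle fit both mappings, in which case the reweighting mainly counteracts the optimization's majority bias rather than imposing a trade-off.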

Differences in task difficulty

Separate from issues related to under-representation and the choice of particular optimization algorithms, the maximum achievable performance level (given a specific dataset) may also differ between groups. Notably, this inherently limits what can be achieved using standard algorithmic fairness approaches without resorting to leveling down, i.e., reducing performance for all groups to the level achievable in the worst-off group. One reason for such differences in task difficulty is more strongly distorted input measurements in one group, as in abdominal ultrasound or electromyographic recordings in obese patients. Another reason can be missing information about confounders that are more important in one group than in another; for example, hormone levels are often more predictive of clinical outcomes in females than in males.
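
As a hypothetical illustration of this point (not an experiment from the paper), the sketch below simulates two equally represented groups that share the same underlying relationship between a physiological signal and the outcome, but where one group's measurement of that signal is much noisier. Even with ample data, a flexible model, and the group indicator available as an input, the noisier group's achievable test performance remains lower.

```python
# Sketch: group-dependent measurement noise caps achievable performance for one
# group, independent of representation or model choice. Purely illustrative;
# all noise levels and data are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 20000

# Same true signal-outcome relationship in both (equally sized) groups ...
signal = rng.normal(size=n)
y = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)
group = rng.integers(0, 2, size=n)

# ... but group 1's measurement of the signal is far noisier
# (cf. distorted ultrasound or electromyographic recordings).
noise_sd = np.where(group == 0, 0.2, 1.5)
x_measured = signal + noise_sd * rng.normal(size=n)
X = np.column_stack([x_measured, group])  # the model even sees the group label

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, group, test_size=0.5, random_state=0)

model = GradientBoostingClassifier().fit(X_tr, y_tr)
for g in (0, 1):
    mask = g_te == g
    auc = roc_auc_score(y_te[mask], model.predict_proba(X_te[mask])[:, 1])
    print(f"group {g}: test AUC = {auc:.3f}")  # group 1 stays well below group 0
```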

Misleading performance estimates

Separate from the two issues outlined above, model performance metrics may themselves be misleading. For example, a model may often predict the correct outputs, yet because samples are mislabeled (either by a medical expert or an automatic labeling mechanism), performance estimates may misleadingly indicate poor performance. Conversely, if the model has learned to “correctly” predict these labeling errors, performance estimates may misleadingly indicate good performance even though the model often makes wrong predictions. If the characteristics of such label errors differ between patient groups – as they often do – performance metrics can wrongly indicate the presence or absence of performance disparities.
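
The following toy calculation (with hypothetical error rates, not taken from the paper) illustrates that closing point: a fixed model agrees with the true condition equally often in both groups, but because the recorded labels contain more errors in one group, the measured accuracy suggests a performance disparity that does not actually exist.

```python
# Sketch: group-dependent label errors distort performance estimates.
# The model's agreement with the *true* condition is identical in both groups,
# yet accuracy measured against the noisy recorded labels diverges.
# Error rates below are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
group = rng.integers(0, 2, size=n)
y_true = rng.integers(0, 2, size=n)            # true (unobserved) condition

# A fixed model that is right about the true condition 85% of the time in BOTH groups.
model_correct = rng.random(n) < 0.85
y_pred = np.where(model_correct, y_true, 1 - y_true)

# Recorded labels: 2% label errors in group 0, 20% in group 1.
label_error_rate = np.where(group == 0, 0.02, 0.20)
label_flipped = rng.random(n) < label_error_rate
y_label = np.where(label_flipped, 1 - y_true, y_true)

for g in (0, 1):
    m = group == g
    true_acc = (y_pred[m] == y_true[m]).mean()       # equal across groups
    measured_acc = (y_pred[m] == y_label[m]).mean()  # spuriously unequal
    print(f"group {g}: true accuracy {true_acc:.3f}, measured accuracy {measured_acc:.3f}")
```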

A path forward

Given this more detailed understanding of how performance differences can arise, what does the path toward equal performance look like? It exists, but it may be long and winding. As a crucial first step, label and selection biases must be ruled out or addressed, as these hamper any meaningful investigation into the presence or absence of performance differences (and, if left unaddressed, will also lead to severely biased model predictions).

Secondly, the root causes of any observed model performance differences must be identified: Are they due to under-representation, technical design choices, or differences in task difficulty? The authors make some proposals for potential root-cause identification methods, but it is here that they perceive the largest gap in the literature: the principled identification of the root causes of observed performance differences is currently a highly challenging and application-specific endeavor.
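
As one hypothetical heuristic in this spirit (not a method proposed by the authors), per-group learning curves can help separate the first two causes from the third: if the underperforming group's test performance keeps rising as more of its samples are added to training, under-representation is a plausible culprit; if it plateaus below the other group's level, differences in task difficulty or missing information become more likely. The sketch below assumes arrays named X, y, and group plus train/test splits analogous to the earlier sketches.

```python
# Hypothetical diagnostic sketch (not from the paper): per-group learning curve.
# Train on the full data from the other group plus an increasing number of
# samples from the group of interest, and track that group's test performance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def group_learning_curve(X, y, group, target_group, sizes,
                         X_te, y_te, g_te, seed=0):
    """AUC on target_group's test data for each training-subset size in `sizes`."""
    rng = np.random.default_rng(seed)
    other_idx = np.flatnonzero(group != target_group)
    target_idx = np.flatnonzero(group == target_group)
    test_mask = g_te == target_group
    curve = []
    for n_target in sizes:  # each size must not exceed len(target_idx)
        keep = rng.choice(target_idx, size=n_target, replace=False)
        train_idx = np.concatenate([other_idx, keep])
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        auc = roc_auc_score(y_te[test_mask],
                            model.predict_proba(X_te[test_mask])[:, 1])
        curve.append(auc)
    return curve

# Example usage (hypothetical splits):
# aucs = group_learning_curve(X_tr, y_tr, g_tr, target_group=1,
#                             sizes=[100, 300, 1000, 3000],
#                             X_te=X_te, y_te=y_te, g_te=g_te)
```

A curve that is still climbing at the largest available size suggests collecting more data from that group; a flat curve shifts attention toward the data itself (measurement quality, missing confounders, label errors).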

Thirdly, and finally, an appropriate bias mitigation strategy can be devised. This may involve the gathering (or even design) of alternative or additional measurements, the targeted collection of more data from underrepresented groups, the use of algorithmic bias mitigation approaches, or changes in the model structure and training approach.

Between the lines

The algorithmic fairness literature often focuses on fairness-accuracy trade-offs and on achieving fairness given a specific dataset. However, there is a priori no reason why it should be impossible to achieve equality of predictive performance by leveling up performance. Doing so may require reconsidering the setup of the estimation task and the data collection procedure: improving performance in underperforming groups may call not only for more data but also for different and better data. The precise and principled identification of the most efficient mitigation approach in a given application remains an important open problem.
