Montreal AI Ethics Institute

Democratizing AI ethics literacy

The path toward equal performance in medical machine learning

September 6, 2023

🔬 Research Summary by Eike Petersen, a postdoctoral researcher at the Technical University of Denmark (DTU), working on fair, responsible, and robust machine learning for medicine.

[Original paper by Eike Petersen, Sune Holm, Melanie Ganz, and Aasa Feragen]


Overview: Medical machine learning models are often better at predicting outcomes or diagnosing diseases in some patient groups than others. This paper asks why such performance differences occur and what it would take to build models that perform equally well for all patients.


Introduction

Disparities in model performance across patient groups are common in medical machine learning. While the under-representation of patient groups has received significant attention as one potential explanation, there are further, less discussed reasons for performance discrepancies. These include differences in prediction task difficulty between groups caused, for example, by noisier measurements in one group, biased outcome measurements, and selection biases. Narrow algorithmic fairness solutions cannot address these issues. The authors conclude that leveling up model performance may require not only more data from underperforming groups but also better data.

Key Insights

Models often perform poorly on patient groups that are underrepresented in the training set; however, this is not always the case. Cases have been reported in which group representation did not strongly affect model performance on that group, or even in which non-represented groups outperformed highly represented ones. The authors propose two distinct mechanisms to explain these seemingly contradictory observations:

1) A model may perform worse than theoretically achievable on a given group due to a combination of under-representation, the nature of the differences between the groups, and technical modeling and optimization choices.

2) The optimal level of achievable performance may differ between groups due to differences in the intrinsic difficulty of the prediction task.

The relationship between group representation and model performance

Whether and how strongly under-representation affects model performance for a given group depends on the nature of the differences between the groups. If the mapping from model inputs (medical measurements) to outputs (clinical prediction targets) is similar across groups, under-representation does not have to cause under-performance. Moreover, even if there are significant group differences (say, between male and female chest x-ray recordings), modern machine learning models are sufficiently expressive to still learn an optimal input-output mapping for all groups. In practice, however, this often does not happen since inductive biases, explicit regularization schemes, and the use of local optimization methods will strongly bias the model toward optimizing performance on the majority group. In this situation, standard algorithmic fairness approaches can play a role: they can help counter these majority biases and recover per-group optimal achievable performance.
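The interplay of under-representation, majority-biased optimization, and a simple fairness intervention can be illustrated with a toy simulation (not from the paper; the groups, decision rules, and sample sizes are invented, and a deliberately simple linear model stands in for the capacity-limited case). Two groups share the same input space but follow different true decision rules, and group B supplies only 5% of the training data. A plain logistic regression fit on the pooled data tracks the majority group; reweighting each group to contribute equally to the loss, one basic algorithmic fairness approach, substantially recovers group B's accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
w_a = np.array([1.0, 0.0])   # group A's true decision direction
w_b = np.array([0.0, 1.0])   # group B's true rule differs from A's
n_a, n_b = 950, 50           # group B is heavily under-represented

X_a = rng.normal(size=(n_a, d))
X_b = rng.normal(size=(n_b, d))
y_a = (X_a @ w_a > 0).astype(float)
y_b = (X_b @ w_b > 0).astype(float)
X = np.vstack([X_a, X_b])
y = np.concatenate([y_a, y_b])

def fit_logreg(sample_weight):
    """Logistic regression fit by gradient descent on a weighted loss."""
    w = np.zeros(d)
    for _ in range(3000):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= 0.1 * X.T @ (sample_weight * (p - y)) / sample_weight.sum()
    return w

plain = fit_logreg(np.ones(n_a + n_b))
# reweight so that each group contributes equally to the training loss
balanced = fit_logreg(np.concatenate([np.full(n_a, 1.0 / n_a),
                                      np.full(n_b, 1.0 / n_b)]))

def accuracy_on_b(w, n_test=20_000):
    X_t = rng.normal(size=(n_test, d))
    return np.mean((X_t @ w > 0) == (X_t @ w_b > 0))
```

With the plain fit, group B's accuracy stays close to chance because the learned direction follows the majority group; the group-balanced fit lands between the two rules and lifts group B considerably, at some cost to group A, reflecting that this intentionally simple model cannot represent both rules exactly.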

Differences in task difficulty

Separate from issues related to under-representation and the choice of particular optimization algorithms, the maximum achievable performance level (given a specific dataset) may also differ between groups. Notably, this inherently limits what can be achieved using standard algorithmic fairness approaches without resorting to leveling down, i.e., reducing performance for all groups to the lowest level achievable across groups. One reason for such differences in task difficulty is more strongly distorted input measurements in one group, as in abdominal ultrasound or electromyographic recordings in obese patients. Another reason can be a lack of information about confounders that matter more in one group than in another; for example, hormone levels are often more predictive of clinical outcomes in females than in males.
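This notion of group-dependent, irreducible task difficulty can be made concrete with a small simulation (invented for illustration, not from the paper). Both groups' outcomes are driven by the same underlying signal, but one group's measurement of that signal is far noisier, so even the optimal decision rule on the measurement achieves lower accuracy for that group:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
signal = rng.normal(size=n)      # true underlying physiological quantity
y = signal > 0                   # outcome fully determined by the signal
# group A: precise measurement; group B: strongly distorted measurement
meas_a = signal + 0.1 * rng.normal(size=n)
meas_b = signal + 2.0 * rng.normal(size=n)
# the Bayes-optimal rule given the measurement is the same threshold for both
acc_a = np.mean((meas_a > 0) == y)
acc_b = np.mean((meas_b > 0) == y)
```

Here acc_a comes out near 0.97 while acc_b stays around 0.65. No intervention on the model can close this gap: equalizing performance would mean either degrading group A (leveling down) or obtaining better measurements for group B.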

Misleading performance estimates

Separate from the two issues outlined above, model performance metrics may be misleading. For example, a model may often predict the true outcomes correctly; still, if the evaluation samples are mislabeled (either by a medical expert or an automatic labeling mechanism), performance estimates may misleadingly indicate poor performance. Conversely, if the model has learned to “correctly” predict these labeling errors, performance estimates may misleadingly indicate good performance even though the model often makes wrong predictions. If the characteristics of such label errors differ between patient groups – as they often do – performance metrics can wrongly indicate the presence or absence of performance disparities.
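How group-dependent label errors distort performance estimates can be seen in a deliberately extreme sketch (the error rates are invented). A model that is in fact perfect is evaluated against recorded labels that contain far more annotation errors in one group:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
y_true = rng.integers(0, 2, size=n).astype(bool)
pred = y_true.copy()             # the model is actually always right
# recorded labels are flipped at group-dependent rates during annotation
recorded_a = y_true ^ (rng.random(n) < 0.02)   # 2% label noise in group A
recorded_b = y_true ^ (rng.random(n) < 0.20)   # 20% label noise in group B
measured_a = np.mean(pred == recorded_a)
measured_b = np.mean(pred == recorded_b)
```

The measured accuracies (about 0.98 versus 0.80) suggest a large performance disparity that does not actually exist; conversely, a model that had learned to reproduce the annotators' errors would look spuriously good on the same evaluation.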

A path forward

Given this more detailed understanding of how performance differences can arise, what does the path toward equal performance look like? It exists, but it may be long and winding. As a crucial first step, label and selection biases must be ruled out or addressed, as these hamper any meaningful investigation into the presence or absence of performance differences (and will also lead to severely biased model predictions if left unaddressed).

Secondly, the root causes of any observed model performance differences must be identified: Are they due to under-representation, technical design choices, or differences in task difficulty? The authors make some proposals for potential root cause identification methods. Still, it is here that they perceive the largest gap in the literature: the principled identification of the root causes of observed performance differences is currently a highly challenging and application-specific endeavor. 

Thirdly, and finally, an appropriate bias mitigation strategy can be devised. This may involve the gathering (or even design) of alternative or additional measurements, the targeted collection of more data from underrepresented groups, the use of algorithmic bias mitigation approaches, or changes in the model structure and training approach.
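One of these strategies, the targeted collection of more data from an underrepresented group, can be sketched in a stylized two-group setup (invented for illustration; the decision rules, model, and sample sizes are not from the paper). As group B's share of the training data grows, a plainly trained logistic regression drifts away from the majority group's rule and group B's accuracy rises:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 2
w_a = np.array([1.0, 0.0])       # group A's true decision direction
w_b = np.array([0.0, 1.0])       # group B follows a different rule

def group_b_accuracy(n_b, n_a=950):
    """Train plainly on n_a majority and n_b minority samples; test on group B."""
    X_a = rng.normal(size=(n_a, d))
    X_b = rng.normal(size=(n_b, d))
    X = np.vstack([X_a, X_b])
    y = np.concatenate([(X_a @ w_a > 0), (X_b @ w_b > 0)]).astype(float)
    w = np.zeros(d)
    for _ in range(3000):        # unweighted logistic regression
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= 0.1 * X.T @ (p - y) / len(y)
    X_t = rng.normal(size=(20_000, d))
    return np.mean((X_t @ w > 0) == (X_t @ w_b > 0))

accs = [group_b_accuracy(n_b) for n_b in (50, 250, 950)]
```

Because the two groups' rules differ, simply collecting more group B data shifts the compromise the model strikes; which mitigation is most efficient in a real application depends on the root cause identified in the previous step.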

Between the lines

The algorithmic fairness literature often focuses on fairness-accuracy trade-offs and achieving fairness given a specific dataset. However, there is a priori no reason why it should be impossible to achieve equality of predictive performance by leveling up performance. This may require, however, reconsidering the setup of the estimation task and the data collection procedure. More data may be needed to improve performance in underperforming groups, as well as different and better data. The precise and principled identification of the most efficient mitigation approach in a given application remains an important open problem.
