Montreal AI Ethics Institute

Democratizing AI ethics literacy

Maintaining fairness across distribution shift: do we have viable solutions for real-world applications?

March 11, 2022

🔬 Research summary by Jessica Schrouff, a Senior Research Scientist at Google Research working on trustworthy machine learning for healthcare.

[Original paper by Jessica Schrouff, Natalie Harris, Oluwasanmi Koyejo, Ibrahim Alabdulmohsin, Eva Schnider, Krista Opsahl-Ong, Alex Brown, Subhrajit Roy, Diana Mincu, Christina Chen, Awa Dieng, Yuan Liu, Vivek Natarajan, Alan Karthikesalingam, Katherine Heller, Silvia Chiappa, Alexander D’Amour]


Overview: If my model respects a desired fairness property in Hospital A, will it also be fair in Hospital B? In this work, we use examples from the healthcare domain to show how fairness properties can fail to transfer between the development and deployment environments. In these real-world settings, we show that this transfer is far more complex than the situations considered in current algorithmic fairness research, suggesting a need for remedies beyond purely algorithmic interventions. 


Introduction

Machine learning models are developed using a “snapshot” of an environment: data is selected from a specific population, time period, and/or geographical area, among other factors. A core concern for the safety of such models is how they behave in new environments (e.g. another country). Fairness—a model’s tendency to behave equally across subgroups of the population—is one such property that we hope will transfer to new environments. However, recent research has shown that a model’s fairness properties can be affected when the environment changes.
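To make the idea of a fairness property and its (lack of) transfer concrete, here is a minimal sketch, not taken from the paper: it evaluates a simple group fairness metric (the largest per-subgroup accuracy gap) in a development environment and in a shifted deployment environment. The synthetic data, the two subgroups, and the logistic regression model are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): a fairness property measured in a
# development environment may not hold in a shifted deployment environment.
# The data generator, subgroups, and model below are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_environment(n, noise_by_group):
    """Synthetic binary task whose label noise differs by subgroup,
    standing in for systematic differences between hospitals."""
    X = rng.normal(size=(n, 5))
    group = rng.integers(0, 2, size=n)              # e.g. two age groups
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    flip = rng.random(n) < np.take(noise_by_group, group)
    return X, np.where(flip, 1 - y, y), group

def accuracy_gap(model, X, y, group):
    """Largest difference in accuracy between any two subgroups."""
    accs = [accuracy_score(y[group == g], model.predict(X[group == g]))
            for g in np.unique(group)]
    return max(accs) - min(accs)

# "Hospital A": subgroups behave similarly; "Hospital B": one subgroup is noisier.
X_a, y_a, g_a = make_environment(5000, noise_by_group=[0.05, 0.05])
X_b, y_b, g_b = make_environment(5000, noise_by_group=[0.05, 0.30])

model = LogisticRegression().fit(X_a, y_a)
print("accuracy gap in development environment:", round(accuracy_gap(model, X_a, y_a, g_a), 3))
print("accuracy gap in deployment environment: ", round(accuracy_gap(model, X_b, y_b, g_b), 3))
```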

In this work, we consider how this problem appears in healthcare applications, where we would be concerned that a model that satisfies fairness criteria when developed in “Hospital A” may not satisfy them when deployed in “Hospital B”. Specifically, using applications in dermatology and in electronic health records, we show that changes in the environment can lead to significant differences in the fairness properties of a model. In addition, we highlight the lack of algorithmic solutions for this problem in complex, real-world applications and discuss potential remedies along all steps of the machine learning pipeline.

Changes in the environment are common in healthcare applications, and this can present challenges for machine learning

In the healthcare domain, the data available to develop a machine learning model is typically scarce. Therefore, it is common to develop a model on one or multiple datasets that were collected in specific environments and consider the model for deployment in other environments. Unfortunately, there are often systematic differences between environments that induce “shifts” in patterns that appear in the data. For example, different hospitals may serve different patient populations, or use different imaging equipment to perform the same procedures. These data shifts can cause models that appear to behave well in one environment to behave poorly in another. 

In this work, we investigated two healthcare applications with plausible shifts:

  • Dermatology: we predict 27 categories of skin conditions based on one or more pictures of the skin pathology, together with the patient’s sex and age. The model is developed on a dataset from a teledermatology service serving multiple clinics in the USA, and we assess its behavior on a dataset from skin cancer clinics in Australia and on a teledermatology dataset from Colombia.
  • Electronic Health Records (EHR): we predict prolonged length of stay in the Intensive Care Unit (ICU) 24 hours after admission. Within a single hospital system, we select general ICUs for the model development, and test the model on specialist cardiac ICUs.

Fairness properties are affected by changes in the environment

Previous works have explored how engineered or simulated shifts affect fairness properties. Here we show that changes in fairness properties are significant in real-world applications. In the dermatology application, the model performs similarly across age ranges (maximum gap: ~1%) when assessed in the development environment. This gap increases to ~16-21% when we test the model in other environments: the model maintains high performance on younger patients in the new environments, but performance degrades with increasing age.

In the EHR application, the model displays unequal performance across age subgroups in the development environment. We therefore apply a correction to its predictions so that it satisfies common fairness criteria. While the correction has the desired effect in the development environment, it does not improve the fairness properties in the new environment, and even slightly worsens them.
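As a hedged illustration of why such a post-hoc correction can fail to transfer, the sketch below (not the paper’s actual method or data) fits per-subgroup decision thresholds in a development environment so that true positive rates are roughly equal, then applies the same thresholds in a shifted environment where one subgroup’s score distribution has degraded.

```python
# Illustrative sketch (not the paper's method): a per-subgroup threshold
# correction fitted in the development environment, re-evaluated after a shift.
# All data, subgroups, and the 0.8 target TPR are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(1)

def tpr(y_true, y_pred):
    """True positive rate of binary predictions."""
    pos = y_true == 1
    return (y_pred[pos] == 1).mean()

def fit_group_thresholds(scores, y, group, target_tpr=0.8):
    """Pick, per subgroup, the threshold whose TPR is closest to target_tpr."""
    thresholds = {}
    candidates = np.linspace(0.05, 0.95, 19)
    for g in np.unique(group):
        s, t = scores[group == g], y[group == g]
        tprs = np.array([tpr(t, (s >= c).astype(int)) for c in candidates])
        thresholds[g] = candidates[np.argmin(np.abs(tprs - target_tpr))]
    return thresholds

def tpr_gap(scores, y, group, thresholds):
    """Largest TPR difference between subgroups under the given thresholds."""
    tprs = [tpr(y[group == g], (scores[group == g] >= thresholds[g]).astype(int))
            for g in np.unique(group)]
    return max(tprs) - min(tprs)

def make_env(n, shift):
    """Synthetic scores: group 1's score separation degrades as `shift` grows."""
    group = rng.integers(0, 2, size=n)
    y = rng.integers(0, 2, size=n)
    sep = np.where(group == 0, 1.0, 0.6 - shift)
    scores = 1 / (1 + np.exp(-(sep * (2 * y - 1) + rng.normal(size=n))))
    return scores, y, group

scores_dev, y_dev, g_dev = make_env(20000, shift=0.0)
scores_new, y_new, g_new = make_env(20000, shift=0.4)

th = fit_group_thresholds(scores_dev, y_dev, g_dev)
print("TPR gap, development environment:", round(tpr_gap(scores_dev, y_dev, g_dev, th), 3))
print("TPR gap, new environment:       ", round(tpr_gap(scores_new, y_new, g_new, th), 3))
```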

Overall, our results show that maintaining fairness properties under changing environments is an important practical consideration for real-world applications.

Algorithmic mitigation strategies are not applicable

The instability of fairness metrics across environments would not be a major issue if mitigation strategies could be applied. Indeed, various previous works aim to provide models that are “robust” to such changes in the environment, and recent research has considered the impact of those changes on fairness properties. However, models that are “robust and fair” under changes in the environment can only be obtained when strong assumptions are made about the nature of the shift. For instance, the method of Singh et al. (2021) requires that there is no shift in the prevalence of the outcome (here, the skin condition or length of stay) between environments.

We assess whether such assumptions would hold in the scenarios considered. In both applications, our results show that all aspects of the data are affected by the change in environment. In the case of dermatology, these include the demographics of the population (in terms of age and sex), the prevalence of the 27 skin condition categories, and the photographic characteristics of the pathology images. None of the algorithmic methods that provide fair and robust models have been designed for this setting.
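This kind of assumption check can be run before any model training. The sketch below is a minimal, hypothetical example (the DataFrames, column names, and file paths are placeholders, not the paper’s data) that compares outcome prevalence and subgroup composition between a development and a deployment dataset.

```python
# Illustrative assumption check (placeholder data and column names): compare
# outcome prevalence and demographic composition across two environments.
import pandas as pd

def shift_report(dev: pd.DataFrame, new: pd.DataFrame, outcome: str, attributes: list):
    """Print per-environment outcome prevalence and subgroup proportions."""
    print(f"{outcome} prevalence: dev={dev[outcome].mean():.3f}, new={new[outcome].mean():.3f}")
    for attr in attributes:
        dev_p = dev[attr].value_counts(normalize=True)
        new_p = new[attr].value_counts(normalize=True)
        comparison = pd.concat([dev_p, new_p], axis=1, keys=["dev", "new"]).fillna(0.0)
        print(f"\nComposition of '{attr}':\n{comparison.round(3)}")

# Usage (hypothetical file names and columns):
# dev = pd.read_csv("hospital_a.csv"); new = pd.read_csv("hospital_b.csv")
# shift_report(dev, new, outcome="condition_label", attributes=["age_group", "sex"])
```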

Between the lines

Our work highlights important “gaps” that prevent the development of robust and fair machine learning models:

  • Technical gap. Methods that guarantee fairness properties across environments a priori have limited applicability in real-world settings due to their strong assumptions.
  • Practical gap. Further work is required to adapt current techniques to more complex predictive tasks (e.g. time series predictions).

We hope that our demonstration will promote work at the intersection of fairness and robustness that accounts for the complexity of real-world applications. In the meantime, assessing which assumptions would hold in potential deployment environments is an important step in model development, as is continuous monitoring after deployment. Alternatively, techniques that provide fair “transfer” of models could be considered (e.g. Zhao et al., 2020; Slack et al., 2020), at the cost of maintaining one tuned model per environment. In addition, non-algorithmic remedies can be envisaged, such as the selection of lower-risk tasks or prospective observational integrations. Finally, multiple questions remain open, including how to handle limited availability of demographic data in different environments, or different fairness definitions across environments.
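As a small illustration of the continuous monitoring mentioned above, a deployment pipeline could recompute a per-subgroup performance gap on each batch of labelled deployment data and raise an alert when it drifts beyond a chosen tolerance from the development baseline. The sketch below is our own assumption, not part of the paper; the batch format and the tolerance value are placeholders.

```python
# Illustrative monitoring sketch (not from the paper): flag deployment batches
# whose per-subgroup accuracy gap drifts beyond a tolerance above the gap
# measured at development time. Batch format and tolerance are assumptions.
import numpy as np

def per_group_gap(y_true, y_pred, group):
    """Largest difference in accuracy between any two subgroups in a batch."""
    accs = [np.mean(y_true[group == g] == y_pred[group == g]) for g in np.unique(group)]
    return max(accs) - min(accs)

def fairness_alerts(batches, baseline_gap, tolerance=0.05):
    """Yield (batch_index, gap) for every batch exceeding baseline_gap + tolerance."""
    for i, (y_true, y_pred, group) in enumerate(batches):
        gap = per_group_gap(y_true, y_pred, group)
        if gap > baseline_gap + tolerance:
            yield i, gap

# Usage (hypothetical): `batches` is an iterable of (labels, predictions, subgroup)
# arrays collected after deployment; `baseline_gap` is the development-time gap.
# for idx, gap in fairness_alerts(batches, baseline_gap=0.01):
#     print(f"batch {idx}: per-subgroup gap {gap:.2f} exceeds tolerance")
```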
