🔬 Research summary by Jessica Schrouff, a Senior Research Scientist at Google Research working on trustworthy machine learning for healthcare.
[Original paper by Jessica Schrouff, Natalie Harris, Oluwasanmi Koyejo, Ibrahim Alabdulmohsin, Eva Schnider, Krista Opsahl-Ong, Alex Brown, Subhrajit Roy, Diana Mincu, Christina Chen, Awa Dieng, Yuan Liu, Vivek Natarajan, Alan Karthikesalingam, Katherine Heller, Silvia Chiappa, Alexander D’Amour]
Overview: If my model respects a desired fairness property in Hospital A, will it also be fair in Hospital B? In this work, we use examples from the healthcare domain to show how fairness properties can fail to transfer between the development and deployment environments. In these real-world settings, we show that this transfer is far more complex than the situations considered in current algorithmic fairness research, suggesting a need for remedies beyond purely algorithmic interventions.
Introduction
Machine learning models are developed using a “snapshot” of an environment: data is selected from a specific population, time period and/or geographical area, among others. A core concern for the safety of such models is how they behave in new environments (e.g. another country). Fairness—a model’s tendency to behave equally across subgroups of the population—is one such property that we hope will transfer to new environments. However, recent research has shown that the fairness properties of a model can be affected when the environment changes.
In this work, we consider how this problem appears in healthcare applications, where a model that satisfies fairness criteria when developed in “Hospital A” may not satisfy them when deployed in “Hospital B”. Specifically, using applications in dermatology and in electronic health records, we show that changes in the environment can lead to significant differences in the fairness properties of a model. We also highlight the lack of algorithmic solutions for this problem in complex, real-world applications and discuss potential remedies along all steps of the machine learning pipeline.
Changes in the environment are common in healthcare applications, and this can present challenges for machine learning
In the healthcare domain, the data available to develop a machine learning model is typically scarce. Therefore, it is common to develop a model on one or multiple datasets that were collected in specific environments and consider the model for deployment in other environments. Unfortunately, there are often systematic differences between environments that induce “shifts” in patterns that appear in the data. For example, different hospitals may serve different patient populations, or use different imaging equipment to perform the same procedures. These data shifts can cause models that appear to behave well in one environment to behave poorly in another.
In this work, we investigated two healthcare applications with plausible shifts:
- Dermatology: we predict 27 categories of skin conditions from one or more photographs of the skin pathology, together with the patient’s sex and age. The model is developed on a dataset from a teledermatology service serving multiple clinics in the USA, and we assess its behavior on a dataset from skin cancer clinics in Australia and on a teledermatology dataset from Colombia.
- Electronic Health Records (EHR): we predict prolonged length of stay in the Intensive Care Unit (ICU) 24 hours after admission (a sketch of the label construction follows this list). Within a single hospital system, we select general ICUs for model development and test the model on specialist cardiac ICUs.
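To make the EHR task concrete, the sketch below shows one way such a binary “prolonged length of stay” label and a 24-hour prediction time could be derived from admission and discharge timestamps. The column names and the 7-day cutoff are illustrative assumptions, not the exact specification used in the paper.

```python
import pandas as pd

# Illustrative sketch only: the column names and the 7-day cutoff are
# assumptions, not the exact definition used in the paper.
PROLONGED_LOS_DAYS = 7      # hypothetical cutoff for a "prolonged" ICU stay
PREDICTION_TIME_HOURS = 24  # predictions are made 24 hours after admission

def build_prolonged_los_task(stays: pd.DataFrame) -> pd.DataFrame:
    """Derive a binary label from ICU admission/discharge timestamps.

    `stays` is assumed to contain datetime columns `icu_admission_time`
    and `icu_discharge_time`, plus whatever features are available at
    the 24-hour prediction time.
    """
    los_hours = (
        stays["icu_discharge_time"] - stays["icu_admission_time"]
    ).dt.total_seconds() / 3600.0

    stays = stays.copy()
    stays["prolonged_los"] = (los_hours > PROLONGED_LOS_DAYS * 24).astype(int)
    # Keep only stays that are still ongoing at prediction time, so that
    # the label is not already determined when the prediction is made.
    return stays[los_hours > PREDICTION_TIME_HOURS]
```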
Fairness properties are affected by changes in the environment
Previous works have explored how engineered or simulated shifts could affect fairness properties. Here we show that changes in fairness properties are significant in real-world applications. In the dermatology application, the model performs similarly across age ranges (maximum gap: ~1%) when assessed in the development environment. This gap increases to ~16-21% when we test the model in the other environments: the model maintains high performance on younger patients in the new environments, but its performance degrades with increasing age. In the EHR application, the model displays unequal performance across age subgroups in the development environment. We therefore apply a correction to its predictions so that it satisfies common fairness properties. While the correction has the desired effect in the development environment, it does not improve the fairness properties in the new environment, and even slightly worsens them.
Overall, our results show that maintaining fairness properties under changing environments is an important practical consideration for real-world applications.
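As an illustration of the kind of measurement and correction discussed above, here is a minimal sketch, assuming a binary task with per-example scores and a categorical age-group attribute in which every group contains positive examples. The per-group threshold adjustment is a generic post-processing stand-in for corrections of this type, not necessarily the exact method applied in the paper.

```python
import numpy as np

def tpr(y_true, y_pred):
    """True positive rate (sensitivity) of binary predictions."""
    return float(y_pred[y_true == 1].mean())

def subgroup_gap(y_true, y_pred, groups, metric=tpr):
    """Largest difference in `metric` between any two subgroups."""
    scores = [metric(y_true[groups == g], y_pred[groups == g])
              for g in np.unique(groups)]
    return max(scores) - min(scores)

def fit_group_thresholds(y_true, y_score, groups, target_tpr=0.8):
    """Choose one decision threshold per group so that every group reaches
    (approximately) the same true positive rate on this data."""
    thresholds = {}
    for g in np.unique(groups):
        pos_scores = np.sort(y_score[(groups == g) & (y_true == 1)])
        idx = int(np.floor((1.0 - target_tpr) * (len(pos_scores) - 1)))
        thresholds[g] = pos_scores[idx]
    return thresholds

def apply_group_thresholds(y_score, groups, thresholds):
    """Binarize scores using the threshold of each example's group."""
    return np.array([int(s >= thresholds[g])
                     for s, g in zip(y_score, groups)])
```

Thresholds fit on development (Hospital A) data equalize the chosen metric there by construction; re-computing `subgroup_gap` with those same thresholds on data from the new environment is the kind of check that, in our experiments, shows the correction failing to transfer.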
Algorithmic mitigation strategies are not applicable
The instability of fairness metrics across environments would not be a major issue if mitigation strategies could be applied. Indeed, various previous works aim to provide models that are “robust” to such changes in the environment, with recent research considering the impact of those changes on fairness properties. However, models that are “robust and fair” under changes in the environment can only be obtained under strong assumptions about the nature of the shift. For instance, the method of Singh et al., 2021 requires that there is no shift in the prevalence of the outcome (here, skin condition or length of stay) between environments.
We assess whether such assumptions would hold in the scenarios considered. In both applications, our results show that all aspects of the data are affected by the change in environment. In the case of dermatology, these include the demographics of the population (in terms of age and sex), the prevalence of the 27 skin condition categories, and the characteristics of the photographs themselves. None of the existing algorithmic methods for fair and robust models has been designed for this setting.
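A simple first step for such an assessment is to compare outcome and demographic distributions directly between the development and deployment datasets. The sketch below assumes arrays of categorical labels or attributes for each environment; it is only a coarse diagnostic and does not capture, for example, shifts in how conditions are photographed.

```python
import pandas as pd

def distribution_shift(dev_values, new_values):
    """Per-category prevalence in each environment and its absolute change.

    Works for outcome labels (e.g. the 27 skin condition categories) as well
    as demographic attributes (e.g. age bucket or sex). Large differences in
    outcome prevalence already violate the no-label-shift assumption required
    by methods such as Singh et al., 2021.
    """
    dev = pd.Series(dev_values).value_counts(normalize=True)
    new = pd.Series(new_values).value_counts(normalize=True)
    table = pd.DataFrame({"development": dev, "deployment": new}).fillna(0.0)
    table["abs_diff"] = (table["development"] - table["deployment"]).abs()
    return table.sort_values("abs_diff", ascending=False)
```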
Between the lines
Our work highlights important “gaps” that prevent the development of robust and fair machine learning models:
- Technical gap. Methods that guarantee fairness properties across environments a priori have limited applicability in real-world settings due to their strong assumptions.
- Practical gap. Further work is required to adapt current techniques to more complex predictive tasks (e.g. time series predictions).
We hope that our demonstration will promote work at the intersection of fairness and robustness that accounts for the complexity of real-world applications. In the meantime, assessing which assumptions would hold in potential deployment environments is an important step in model development, as is continuous monitoring after deployment. Alternatively, techniques that provide fair “transfer” of models could be considered (e.g. Zhao et al., 2020; Slack et al., 2020), at the cost of maintaining one tuned model per environment. In addition, non-algorithmic remedies can be envisaged, such as the selection of lower-risk tasks or prospective observational integrations. Finally, multiple questions remain open, including how to handle limited availability of demographic data, or differing fairness definitions, across environments.