🔬 Research Summary by Eike Petersen, a postdoctoral researcher at the Technical University of Denmark (DTU), working on fair, responsible, and robust machine learning for medicine.
[Original paper by Eike Petersen, Sune Holm, Melanie Ganz, and Aasa Feragen]
Overview: Medical machine learning models are often better at predicting outcomes or diagnosing diseases in some patient groups than others. This paper asks why such performance differences occur and what it would take to build models that perform equally well for all patients.
Introduction
Disparities in model performance across patient groups are common in medical machine learning. While under-representation of patient groups has received significant attention as one potential explanation, there are further, less-discussed reasons for performance discrepancies. These include differences in prediction task difficulty between groups (due, for example, to noisier measurements in one group), biased outcome measurements, and selection biases. Narrow algorithmic fairness solutions cannot address these issues. The authors conclude that leveling up model performance may require not only more data from underperforming groups, but also better data.
Key Insights
Models often perform poorly on patient groups that are under-represented in the training set; however, this is not always the case. Cases have been reported in which group representation did not strongly affect model performance on that group, or in which non-represented groups even outperformed highly represented ones. The authors propose two distinct mechanisms to explain these seemingly contradictory observations:
1) A model may perform worse than theoretically achievable on a given group due to a combination of under-representation, the nature of the differences between the groups, and technical modeling and optimization choices.
2) The optimal level of achievable performance may differ between groups due to differences in the intrinsic difficulty of the prediction task.
The relationship between group representation and model performance
Whether and how strongly under-representation of a group affects model performance for that group depends on the nature of the differences between the groups. If the mapping from model inputs (medical measurements) to outputs (clinical prediction targets) is similar across groups, under-representation does not have to cause under-performance. Moreover, even if there are significant group differences (say, between male and female chest x-ray recordings), modern machine learning models are sufficiently expressive to still learn an optimal input-output mapping for all groups. In practice, however, this often does not happen, since inductive biases, explicit regularization schemes, and the use of local optimization methods strongly bias the model toward optimizing performance on the majority group. In this situation, standard algorithmic fairness approaches can play a role: they can help counter these majority biases and recover the optimal achievable performance for each group.
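As a rough illustration of this point, the following sketch (hypothetical simulated data, not from the paper) trains a deliberately limited linear model on two groups whose input-output mappings partly differ: plain training is dominated by the majority group, while simple group-balanced sample weights, a basic stand-in for the reweighting-style fairness approaches mentioned above, shift some performance back to the minority group. All names and parameters are illustrative.

```python
# Hypothetical simulation (not from the paper): a limited-capacity model
# trained on pooled data is biased toward the majority group when the
# input-output mappings differ; group-balanced reweighting counters this.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_group(n, coefs):
    """Simulate one group with its own linear log-odds mapping."""
    X = rng.normal(size=(n, 5))
    p = 1.0 / (1.0 + np.exp(-(X @ coefs)))
    return X, rng.binomial(1, p)

coef_maj = np.array([1.0, -0.5, 0.8, 0.0, 0.3])   # majority-group mapping
coef_min = np.array([-1.0, 0.5, 0.8, 0.0, 0.3])   # minority mapping, partly reversed

X_maj, y_maj = make_group(5000, coef_maj)          # well-represented group
X_min, y_min = make_group(250, coef_min)           # under-represented group
X, y = np.vstack([X_maj, X_min]), np.concatenate([y_maj, y_min])
group = np.r_[np.zeros(5000), np.ones(250)]

# Group-balanced weights: each group contributes equal total weight to the loss.
w_balanced = np.where(group == 1, len(y) / (2 * 250), len(y) / (2 * 5000))

X_te_maj, y_te_maj = make_group(2000, coef_maj)
X_te_min, y_te_min = make_group(2000, coef_min)

for name, w in [("unweighted", None), ("group-balanced", w_balanced)]:
    model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
    auc_maj = roc_auc_score(y_te_maj, model.predict_proba(X_te_maj)[:, 1])
    auc_min = roc_auc_score(y_te_min, model.predict_proba(X_te_min)[:, 1])
    print(f"{name:>14}: majority AUC={auc_maj:.3f}, minority AUC={auc_min:.3f}")
```

With identical mappings across groups, the same setup would show little under-performance for the minority group despite its small size, consistent with the first observation above.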
Differences in task difficulty
Separate from issues related to under-representation and the choice of particular optimization algorithms, the maximum achievable performance level (given a specific dataset) may also differ between groups. Notably, this inherently limits what can be achieved using standard algorithmic fairness approaches without resorting to leveling down, i.e., reducing performance for all groups to the level of the worst-off group. One reason for such differences in task difficulty is more strongly distorted input measurements in one group, as in abdominal ultrasound or electromyographic recordings in obese patients. Another reason can be missing information about confounders that matter more in one group than in another; for example, hormone levels are often more predictive of clinical outcomes in females than in males.
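A minimal sketch of this second mechanism, again on hypothetical simulated data: both groups are equally represented and modeled with the correct model class, yet the group whose measurements are more strongly distorted ends up with a lower achievable AUC, a gap that no reweighting scheme can close.

```python
# Hypothetical simulation (not from the paper): equal representation and a
# well-specified model per group, but noisier measurements in one group
# lower the best achievable performance for that group.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
coefs = np.array([1.0, -0.5, 0.8, 0.3, 0.2])       # same latent mapping for both groups

def make_group(n, noise_scale):
    """Labels depend on latent features; observed measurements are distorted."""
    X_latent = rng.normal(size=(n, 5))
    p = 1.0 / (1.0 + np.exp(-(X_latent @ coefs)))
    y = rng.binomial(1, p)
    X_observed = X_latent + rng.normal(scale=noise_scale, size=X_latent.shape)
    return X_observed, y

for name, noise in [("low-noise group", 0.2), ("high-noise group", 2.0)]:
    X_tr, y_tr = make_group(20000, noise)           # ample data for each group
    X_te, y_te = make_group(5000, noise)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC={auc:.3f}")
```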
Misleading performance estimates
Separate from the two issues outlined above, model performance metrics may themselves be misleading. For example, a model may predict outcomes correctly, yet because test samples are mislabeled (either by a medical expert or an automatic labeling mechanism), performance estimates may misleadingly indicate poor performance. Conversely, if the model has learned to “correctly” predict these labeling errors, performance estimates may misleadingly indicate good performance even though the model often makes wrong predictions. If the characteristics of such label errors differ between patient groups, as they often do, performance metrics can wrongly indicate the presence or absence of performance disparities.
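To make this concrete, here is a small hypothetical example (numbers are illustrative, not from the paper): the model's accuracy with respect to the true outcomes is identical in both groups, but because one group's test labels contain more errors, the measured accuracy suggests a disparity that is not really there.

```python
# Hypothetical example (not from the paper): differing label-error rates in
# the test data create an apparent performance disparity that does not exist
# with respect to the true outcomes.
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
n = 10000
y_true = rng.binomial(1, 0.3, size=n)                         # true outcomes
correct = rng.random(n) < 0.9                                 # model is 90% accurate for everyone
y_pred = np.where(correct, y_true, 1 - y_true)

for name, error_rate in [("group A (careful labels)", 0.02),
                         ("group B (noisy labels)", 0.20)]:
    flipped = rng.random(n) < error_rate
    y_recorded = np.where(flipped, 1 - y_true, y_true)        # what the test set contains
    print(f"{name}: accuracy vs. true labels={accuracy_score(y_true, y_pred):.3f}, "
          f"accuracy vs. recorded labels={accuracy_score(y_recorded, y_pred):.3f}")
```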
A path forward
Given this more detailed understanding of how performance differences can arise, what does the path toward equal performance look like? It exists, but it may be long and winding. As a crucial first step, label and selection biases must be ruled out or addressed, as these hamper any meaningful investigation into the presence or absence of performance differences (and will also lead to severely biased model predictions if left unaddressed).
Secondly, the root causes of any observed model performance differences must be identified: Are they due to under-representation, technical design choices, or differences in task difficulty? The authors make some proposals for potential root cause identification methods. Still, it is here that they perceive the largest gap in the literature: the principled identification of the root causes of observed performance differences is currently a highly challenging and application-specific endeavor.
Thirdly, and finally, an appropriate bias mitigation strategy can be devised. This may involve the gathering (or even design) of alternative or additional measurements, the targeted collection of more data from underrepresented groups, the use of algorithmic bias mitigation approaches, or changes in the model structure and training approach.
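As a hypothetical illustration of the second step, one generic (and deliberately simplified) diagnostic is a per-group learning curve: grow the amount of training data from the underperforming group and track its test performance. If performance keeps improving, under-representation is a plausible root cause; if it plateaus below the other group despite abundant data, differences in task difficulty or design choices are more likely. The sketch below runs this on simulated data; it is not one of the authors' specific proposals.

```python
# Hypothetical per-group learning-curve diagnostic on simulated data
# (an illustration, not one of the authors' proposed methods).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
coef_maj = np.array([1.0, -0.5, 0.8, 0.3, 0.2])
coef_min = np.array([-1.0, 0.5, 0.8, 0.3, 0.2])    # minority mapping differs

def make_group(n, coefs):
    X = rng.normal(size=(n, 5))
    return X, rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ coefs))))

X_maj, y_maj = make_group(5000, coef_maj)          # majority data stays fixed
X_te_min, y_te_min = make_group(3000, coef_min)    # minority test set

for n_min in [50, 200, 1000, 5000]:                # grow the minority training set
    X_min, y_min = make_group(n_min, coef_min)
    X = np.vstack([X_maj, X_min])
    y = np.concatenate([y_maj, y_min])
    model = LogisticRegression(max_iter=1000).fit(X, y)
    auc = roc_auc_score(y_te_min, model.predict_proba(X_te_min)[:, 1])
    print(f"minority training samples={n_min:>5}: minority test AUC={auc:.3f}")
```

In this simulated case the curve keeps rising, pointing toward under-representation; in the task-difficulty scenario sketched earlier, it would plateau instead.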
Between the lines
The algorithmic fairness literature often focuses on fairness-accuracy trade-offs and on achieving fairness given a specific dataset. However, there is no a priori reason why equality of predictive performance could not be achieved by leveling up. Doing so may require reconsidering the setup of the estimation task and the data collection procedure: improving performance in underperforming groups may require not only more data, but also different and better data. The precise and principled identification of the most efficient mitigation approach in a given application remains an important open problem.