🔬 Research Summary by Eran Tal, Canada Research Chair in Data Ethics and Associate Professor of Philosophy at McGill University. He studies the epistemology and ethics of data collection and data use in scientific research, healthcare, and policy.
[Original paper by Eran Tal]
Overview: This paper exposes a hidden and widespread type of bias in healthcare decision-support tools based on supervised ML: target specification bias. This bias stems from the fact that decision-makers, e.g., physicians, typically specify their desired target of prediction differently from the way algorithm designers operationalize it. This type of bias cannot be resolved by improvements to the data or the model alone. Instead, tackling target specification bias requires a fundamental shift of approach to how model accuracy is evaluated and reported to decision-makers.
Introduction
Sometimes, machine learning models become good at predicting a variable, but that variable differs from what users care about predicting. A well-known example dates back to the mid-1990s. A neural net trained on health records from Pittsburgh-area hospitals learned to associate asthma with a low risk of death from pneumonia. The association was real. Asthmatics who presented with pneumonia received aggressive care that lowered their mortality risk below the general population’s. And yet, for physicians who need to allocate hospital beds, the association reflected a confounder (treatment intensity) and would have dangerously de-prioritized asthmatics had it not been caught in time. Physicians needed the model to predict mortality risk with all else held equal, not mortality risk in the real, messy world.
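To make the confounding concrete, here is a minimal simulation sketch of the pneumonia example (the numbers and variable names are illustrative assumptions of mine, not taken from the original study): asthmatics are at higher risk all else being equal, but because they reliably receive aggressive care, their observed mortality ends up lower than everyone else’s.

```python
# Toy simulation of the pneumonia example; all numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
asthma = rng.random(n) < 0.10                 # 10% of patients are asthmatic
severity = rng.random(n)                      # pneumonia severity, 0 (mild) to 1 (severe)
aggressive = asthma | (rng.random(n) < 0.20)  # asthmatics always receive aggressive care

# Risk rises with severity and with asthma; aggressive care cuts it sharply.
risk_without_care = 0.05 + 0.15 * severity + 0.05 * asthma
risk = np.where(aggressive, 0.4 * risk_without_care, risk_without_care)
died = rng.random(n) < risk

# The actual, messy world: asthma appears protective.
print("observed mortality, asthma:   ", round(died[asthma].mean(), 3))
print("observed mortality, no asthma:", round(died[~asthma].mean(), 3))

# The bed-allocating physician's question: risk under the same (aggressive) care for all.
equal_care_risk = 0.4 * risk_without_care
print("equal-care risk, asthma:   ", round(equal_care_risk[asthma].mean(), 3))
print("equal-care risk, no asthma:", round(equal_care_risk[~asthma].mean(), 3))
```

On the observed data, asthma looks protective; under the counterfactual of uniform care, it is a risk factor.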
This is an example of target specification bias: a mismatch between the specification of a target variable and its operationalization by an ML model. Commonly mistaken for a transparency problem, target specification bias is an accuracy problem that affects opaque and transparent models alike. It remains widely overlooked in evaluations of model accuracy, largely because of the overly simplistic, ‘label-matching’ conception of accuracy currently prevalent in the ML community. This paper characterizes target specification bias, distinguishes it from other prevalent types of bias in ML, explains how it contributes to inaccuracy, and offers ways of mitigating it.
Key Insights
What is target specification bias?
Target specification bias is a mismatch between the way decision-makers specify the variable they need to predict and the way this variable is operationalized by the designers and developers of a decision-support tool. The mismatch is often subtle and stems from the fact that decision-makers are typically interested in predicting counterfactual outcomes rather than actual scenarios.
For example, physicians who make treatment decisions are interested in predicting patients’ health outcomes under a counterfactual scenario where all patients receive the same treatment. By contrast, the model is necessarily trained on data that reflect actual scenarios, in which different populations receive different treatments based on need and availability, as the pneumonia example above shows.
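One common way to state the mismatch formally, borrowing do-notation from causal inference (my formalization, not the paper’s own), is to contrast the observational quantity the model estimates from actual-world data with the interventional quantity the physician needs:

```latex
P(Y = 1 \mid X = x)
\quad \text{vs.} \quad
P\bigl(Y = 1 \mid \mathrm{do}(T = t),\, X = x\bigr)
% left: risk as it occurs under actual treatment assignment
% right: risk if every patient with features x received treatment t
```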
The same holds true for physicians who decide which patients to refer for a diagnostic test. Whether and when a condition is diagnosed partially depends on the judgment of clinicians and on the availability and cost of diagnostic services. For the referring physician, these factors are all confounders. The physician is interested in predicting a patient’s diagnosis in a counterfactual world, where all patients have access to timely and accurate diagnostic tests.
Although the conceptual distinction between actual and counterfactual variable specification is subtle, the practical consequences of ignoring it can be severe. When left uncorrected, target specification bias leads to overestimating predictive accuracy, inefficient utilization of medical resources, and suboptimal decisions that can harm patients.
How does target specification bias arise?
There are several misconceptions about how target specification bias arises. Cases of target specification bias are sometimes mistakenly classified as transparency problems. While increased transparency can reveal the presence of target specification bias, this bias affects opaque and intelligible (or ‘explainable’) models alike. Target specification bias also does not result from insufficient, unreliable, incomplete, or unrepresentative data. On the contrary, this type of bias tends to become more pronounced as data quality is improved. This is because the confounding effects that make up this bias are real and not merely data artifacts. For example, the reduced risk for asthmatics of dying from pneumonia in Pittsburgh hospitals is a real effect and not an artifact of data acquisition or data analysis.
The source of target specification bias is the fact that labels in datasets acquired from the actual world are, at best, imperfect operationalizations of the counterfactually specified variables that decision-makers care about. Because they are counterfactual, the values of such variables are not directly accessible from datasets obtained from the actual world. Rather, they must be inferred from data using domain-specific background knowledge.
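As one illustration of what that inference involves (standard causal-modeling machinery, not the paper’s own formalism): when treatment assignment T depends on patient characteristics, the counterfactual risk under a fixed treatment can be recovered from actual-world data only through an adjustment whose validity rests on background knowledge of which covariates X capture the confounding, for example the backdoor adjustment

```latex
P\bigl(Y = 1 \mid \mathrm{do}(T = t)\bigr)
  \;=\; \sum_{x} P(Y = 1 \mid T = t, X = x)\, P(X = x)
```

The equality holds only if X blocks all confounding paths between T and Y, and whether it does cannot be read off the labels themselves; it is exactly the kind of domain-specific background knowledge at issue here.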
Target specification bias persists undetected largely due to an overly technical and simplistic conception of accuracy currently prevalent in supervised ML. This ‘label-matching’ conception takes accuracy to strictly track matches (or distances) between predictions and labels. It underlies all commonly used accuracy measures in supervised ML, such as precision, recall, area under the curve, F1 score, and mean squared error. Such metrics neglect the fact that labels, even reliable and representative ones, can be poor benchmarks for assessing the accuracy of counterfactual predictions. Yet counterfactual predictions are the kinds of predictions decision-makers typically care about. The upshot is that model accuracy is often overestimated and reported as being higher than the model’s performance in the use cases for which decision-makers will employ it.
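A short sketch of how this overestimation can play out, continuing the toy pneumonia simulation above (the modelling choices are mine and purely illustrative): a standard classifier scored against the observed mortality labels looks reasonable, yet it learns a negative coefficient for asthma and ranks patients worse for the equal-care scenario the physician actually cares about.

```python
# Continuing the toy simulation above (asthma, severity, died, equal_care_risk, rng
# are assumed to be in scope); requires scikit-learn. Evaluation is in-sample for brevity.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X = np.column_stack([asthma.astype(float), severity])
model = LogisticRegression().fit(X, died)
pred = model.predict_proba(X)[:, 1]

# Outcomes in the counterfactual, equal-care scenario the physician cares about.
died_equal_care = rng.random(len(died)) < equal_care_risk

print("asthma coefficient:", round(float(model.coef_[0][0]), 2))   # negative: asthma looks 'protective'
print("AUC vs observed labels:    ", round(roc_auc_score(died, pred), 3))
print("AUC vs equal-care outcomes:", round(roc_auc_score(died_equal_care, pred), 3))  # typically lower
```

The label-matching score is the one that would normally be reported to users, even though the second, typically lower number is closer to the model’s value for the intended use case.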
How can target specification bias be mitigated?
There is good news: much can be done to mitigate target specification bias. On a conceptual level, alternative conceptions of accuracy are available. Specifically, metrology, the science of measurement, has a long-standing tradition of thinking about accuracy in counterfactual terms. Metrology employs idealized models of instruments, such as clocks and thermometers, and uses these models to evaluate their accuracy. Such models appeal to background causal knowledge to predict an instrument’s indications in the absence of extrinsic influences. The successful standardization and reproducible measurement of physical quantities such as time, length, and mass are largely due to this counterfactual way of thinking about accuracy.
On a practical level, the paper identifies several lessons supervised ML can learn from metrology about evaluating model accuracy and mitigating target specification bias. These insights can be combined with existing methods of causal modeling that reveal counterfactual probabilities in the data and with methods for extracting counterfactual information from ML models and presenting this information to users.
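To give a flavor of what such causal-modeling methods can look like in practice (a minimal g-computation sketch of my own, continuing the toy simulation above, assuming the treatment actually received is recorded and there is no unmeasured confounding; this is not the paper’s own method):

```python
# Fit an outcome model that conditions on the treatment actually received,
# then predict under the counterfactual the physician cares about (aggressive care for all).
import numpy as np
from sklearn.linear_model import LogisticRegression

X_full = np.column_stack([asthma.astype(float), severity, aggressive.astype(float)])
outcome_model = LogisticRegression().fit(X_full, died)

X_do = X_full.copy()
X_do[:, 2] = 1.0                                   # set everyone to aggressive care
counterfactual_pred = outcome_model.predict_proba(X_do)[:, 1]

print("estimated equal-care risk, asthma:   ", round(counterfactual_pred[asthma].mean(), 3))
print("estimated equal-care risk, no asthma:", round(counterfactual_pred[~asthma].mean(), 3))
# Compare with the true equal-care risks in the simulation (about 0.07 and 0.05);
# a linear-logistic outcome model is only an approximation here.
```

With the treatment included and held fixed at prediction time, the ordering of the two groups flips back to the one that matters for the physician’s equal-care question.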
Between the lines
Thinking about model accuracy in more sophisticated and user-oriented ways can be helpful beyond the realm of medicine. Such a shift would mark an important step in the maturity of the ML discipline as a whole, from its current exploratory stage toward producing a body of reliable, reproducible evidence for science and policy. A precondition for this sort of shift is a clearer distinction between the internal and external validity of supervised ML models. Internal validation procedures, such as testing models for under- or over-fitting of the training data, are currently treated as sufficient for evaluating a model’s performance. The degree of fit between predictions and data is called ‘accuracy’ and is reported to users as such. This practice neglects external validation procedures, which are required to test model performance in the ‘wild,’ in light of the tool’s intended purpose, typical use cases, typical input data, and its reception by stakeholders.
Accuracy, when reported to users as a measure of overall model performance, is an external validity criterion. It needs to be evaluated relative to users’ specifications and meet reproducibility requirements. Further work is needed to develop a framework for evaluating ML model accuracy in the ‘wild’ and reporting it to users in a manner most relevant to their values and goals.