🔬 Research Summary by Aparna Balagopalan, a PhD student in the EECS department at the Massachusetts Institute of Technology, whose research aims to develop fair, interpretable, and robust models by carefully re-evaluating and surfacing the assumptions of machine learning-based measurements in socially relevant contexts.
[Original paper by Aparna Balagopalan, Abigail Z. Jacobs, and Asia Biega]
Overview: Online platforms generate rankings and allocate search exposure, thus mediating access to economic opportunity. Fairness measures proposed to avoid unfair outcomes of such rankings often aim to (re-)distribute exposure in proportion to ranking worthiness or merit. In practice, relevance is frequently used as a proxy for worthiness. In our paper, we probe this choice and propose desiderata for relevance as a valid and reliable proxy for worthiness from an interdisciplinary perspective. Then, using a case-study approach, we assess whether these desired properties are empirically met in practice. Our analyses and results surface the pressing need for novel approaches to collect worthiness scores at scale for fair ranking.
Introduction
Online search and ranking systems mediate access to opportunity across various safety-critical settings such as housing, e-commerce, and hiring. Rankings are often generated by sorting the items in decreasing order of a relevance score.
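As a minimal sketch of this sorting step (item names and scores below are hypothetical, not from the paper):

```python
# Minimal sketch of score-based ranking: items are sorted in decreasing
# order of a predicted relevance score (names and scores are hypothetical).
scores = {"candidate_a": 0.92, "candidate_b": 0.75, "candidate_c": 0.88}

ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['candidate_a', 'candidate_c', 'candidate_b']
```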
Popular media has surfaced several examples of unfairness in ranking systems: for example, male-dominated search results in response to queries pertinent to candidate hiring. Various fairness measures and interventions have been proposed to prevent harmful outcomes in these rankings. Many existing fairness definitions propose explicitly distributing exposure or search attention in rankings as a function of worthiness or merit, e.g., how qualified a candidate is for a job. However, in the fair-ranking literature, the concept of worthiness has remained under-defined and has instead been represented by ranking relevance. In our paper, we probe relevance as a proxy for worthiness and establish desiderata for relevance to be a valid and reliable proxy for worthiness through the interdisciplinary lens of information retrieval (IR), measurement theory, and machine learning. With a case study of relevance inferred from biased user click data, we empirically show that not all of these criteria may be met in practice. Further, we test the impact of these violations on estimated system fairness and analyze whether existing fairness interventions can mitigate the identified issues within the case study.
Our work highlights the need to critically assess the limitations of relevance as a proxy guiding fair exposure allocation, as well as the need for novel approaches to generate and collect worthiness scores at scale for fair ranking across application domains.
Key Insights
Exposure or attention to items ranked by online platforms may translate into real-world opportunities, making fairness measures and interventions that reduce potential harms highly important. Our work focuses on exposure-guided fairness measures, which intervene in exposure distributions at the group or individual level.
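To make "exposure" concrete, the sketch below computes per-position exposure under a logarithmically decaying position-bias model. This decay curve is a common assumption in fairness-of-exposure work, but it is our illustrative choice, not one prescribed by the paper.

```python
import math

# Illustrative position-bias model: exposure decays logarithmically with
# rank. The decay curve is an assumption, not prescribed by the paper.
def exposure(rank: int) -> float:
    return 1.0 / math.log2(rank + 1)  # rank is 1-indexed

for rank in range(1, 6):
    print(f"rank {rank}: exposure {exposure(rank):.3f}")
# rank 1 gets exposure 1.000, rank 5 only ~0.387: items placed lower
# receive a much smaller share of attention, and thus of opportunity.
```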
The Role of Relevance in Fair Ranking
As emphasized in recent literature, all ranking systems and operationalizations of fairness express a normative goal. In fair ranking, different interpretations of fairness thus build from differing normative theories of discrimination underlying each framework. Fair-ranking interventions implemented in practice are realizations of these interpretations: for example, for a goal of equal opportunity, under the view that similar items (or groups) should attain equal attention, one might allocate exposure in proportion to the notion of an item’s (or group’s) attention-deservedness or “worthiness” at ranking time.
Examining existing fairness definitions, we observe that relevance serves as the target for exposure or attention at both the group and individual levels. For example, group-level relevance ratios are used as targets for expected group-level exposure ratios (group fairness), or the distribution of exposure across rankings for individual items is matched to the corresponding distribution of relevance (individual fairness).
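The sketch below illustrates the group-level comparison with hypothetical relevance scores and group labels: the ratio of exposure allocated to two groups is compared against the corresponding ratio of aggregate relevance. This is a simplified rendering of such metrics, not the paper's exact formulation.

```python
import math

# Hypothetical single ranking, sorted by decreasing relevance; both the
# scores and the group labels are made up for illustration.
relevance = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
groups    = ["A", "A", "B", "A", "B", "B"]
exposure  = [1.0 / math.log2(rank + 1) for rank in range(1, len(relevance) + 1)]

def group_sum(values: list[float], g: str) -> float:
    return sum(v for v, gi in zip(values, groups) if gi == g)

# Group fairness (in this simplified rendering): the exposure ratio
# between groups should track the corresponding relevance ratio.
exposure_ratio = group_sum(exposure, "A") / group_sum(exposure, "B")
relevance_ratio = group_sum(relevance, "A") / group_sum(relevance, "B")
print(f"exposure A/B: {exposure_ratio:.2f}  relevance A/B: {relevance_ratio:.2f}")
# Prints roughly 1.66 vs. 1.44: group A is over-exposed relative to its
# aggregate relevance, the kind of gap group-fairness metrics flag.
```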
Probing Relevance as a Proxy for Worthiness
First, we highlight the definitional nuances inherent to "worthiness" in fair ranking. We define worthiness as the underlying construct that fair rankings aim to operationalize, or the value that allocating attention to different items or groups brings about. Yet a key question is: value for whom? Job seekers might receive value from being allocated attention from searchers likely to hire them. Searchers, on the other hand, might receive value from being exposed to job seekers who are qualified and likely to stay with an employer long-term. Worthiness scores based on the value for different stakeholders might thus diverge.
Relevance, on the other hand, has several definitions across domains and practices in IR. The construct of relevance is operationalized and measured through measurement models of observable properties thought to be related to it; examples include crowdsourced judgments and click-model-based measurements. While the limitations of such scores have been studied in past literature, the construct of relevance and its limitations have primarily been investigated in the context of accurate rankings, not as a worthiness score guiding fair exposure allocation.
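As a toy illustration of why the choice of measurement model matters, two common operationalizations of the same relevance construct, averaged crowdsourced grades and raw click-through rates, can disagree about which item is more relevant. All documents and counts here are hypothetical.

```python
# Toy data: two measurement models of the same relevance construct.
# Documents, grades, and click counts are all hypothetical.
crowd_grades = {"doc_1": [3, 2, 3], "doc_2": [1, 2, 1]}  # graded judgments (0-3)
clicks      = {"doc_1": 12, "doc_2": 30}
impressions = {"doc_1": 100, "doc_2": 100}

judged_relevance = {d: sum(g) / len(g) for d, g in crowd_grades.items()}
click_relevance  = {d: clicks[d] / impressions[d] for d in clicks}

print(judged_relevance)  # doc_1 judged more relevant by annotators...
print(click_relevance)   # ...but doc_2 "looks" more relevant from clicks
```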
Using relevance as a proxy for worthiness involves implicit assumptions about their relationship: for example, that the relative ordering of items by worthiness is consistent with their ordering by relevance. If relevance is to be a valid and reliable candidate for approximating worthiness, it is essential to elucidate and test the assumptions about how the two concepts relate.
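The ordering-consistency assumption can be tested directly. The sketch below checks agreement between hypothetical worthiness and relevance scores with a rank correlation (Kendall's tau); the scores are invented for illustration.

```python
from scipy.stats import kendalltau

# Hypothetical paired scores for the same five items. If relevance is a
# valid proxy, the two orderings should agree (rank correlation near 1).
worthiness = [0.9, 0.7, 0.5, 0.3, 0.1]
relevance  = [0.8, 0.6, 0.55, 0.2, 0.15]

tau, p_value = kendalltau(worthiness, relevance)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")  # tau = 1.00 here
```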
Desiderata of Relevance in Fair Ranking
Recent literature adapting the framework for measurement modeling from the social sciences to algorithmic fairness emphasizes that the properties of proxy scores should match their theoretical ideal across a range of qualitative dimensions. Drawing on domain knowledge about relevance-based ranking systems and the assumptions within, we describe five desired properties for inferred relevance as a valid and reliable proxy measurement for worthiness.
First, credibility: the rank orderings of pairs and groups of items under relevance scores should match the corresponding orderings under worthiness scores. Second, consistency: inferred relevance scores should converge in the limit of sufficient data. Third, stability concerns test-retest reliability: similar inputs or models should yield similar relevance inferences. Fourth, comparability concerns the proportionality of inferred relevance and worthiness scores for individual items or groups. Fifth, availability requires that the distribution of inferred relevance be identical to that of the true worthiness scores.
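As a hedged sketch of how two of these checks might be instantiated on synthetic scores (the data, tests, and thresholds are ours, not the paper's): availability can be probed with a two-sample distribution test, and comparability by examining how stable the relevance-to-worthiness ratio is across items.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic "true" worthiness scores and noisy inferred relevance scores.
worthiness = rng.beta(2, 5, size=500)
relevance = np.clip(worthiness + rng.normal(0, 0.05, size=500), 0, 1)

# Availability: are the two score distributions identical?
stat, p = ks_2samp(relevance, worthiness)
print(f"KS statistic = {stat:.3f}, p = {p:.3f}")

# Comparability: is inferred relevance (roughly) proportional to worthiness?
ratio = relevance / np.maximum(worthiness, 1e-8)
print(f"relevance/worthiness ratio: mean = {ratio.mean():.2f}, std = {ratio.std():.2f}")
```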
Case study: Relevance Inferred from Clicks
We take a case-study approach and empirically test whether the five proposed properties are met when relevance is inferred from biased click data and then used to guide fair exposure allocation. We focus on single-query setups, where rankings are repeatedly served in response to the same query and clicks are simulated using predefined models of user behavior. We benchmark results on both synthetic and real-world datasets. We assume that the relevance labels provided in each dataset (used to simulate clicks) are identical to worthiness scores, and assess whether relevance scores inferred from clicks satisfy the proposed desiderata with respect to them.
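The sketch below mirrors this setup under simplified assumptions: a position-based click model generates clicks, and relevance is then inferred either naively (raw click-through rates) or with inverse-propensity weighting, a standard debiasing technique. The click model, propensities, and estimators are illustrative stand-ins for the paper's experimental pipeline, not its exact implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical single-query setup: the dataset's relevance labels double
# as worthiness scores, and clicks follow a position-based model in which
# an item is clicked only if it is both examined and relevant.
true_relevance = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
n_items = len(true_relevance)
examine_prob = 1.0 / np.log2(np.arange(2, n_items + 2))  # position bias

n_sessions = 20_000
examined = rng.random((n_sessions, n_items)) < examine_prob
relevant = rng.random((n_sessions, n_items)) < true_relevance
clicks = (examined & relevant).sum(axis=0)

naive = clicks / n_sessions                 # biased by rank position
ipw = clicks / (n_sessions * examine_prob)  # inverse-propensity corrected
print("true :", true_relevance)
print("naive:", naive.round(2))  # deflated for low-ranked items
print("ipw  :", ipw.round(2))    # approximately recovers true relevance
```

The naive estimate systematically deflates the relevance of low-ranked items; this is the kind of measurement bias that can then distort downstream fairness estimates.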
We find that some properties, such as credibility, are satisfied across datasets, while others, such as availability, are not. We observe small but significant differences between using inferred relevance and using worthiness scores to measure group and individual fairness. We also observe that fairness interventions and class imbalance may be mitigating factors, though results vary across intervention types and datasets. The impact of violating the desiderata is thus often dataset-dependent and, hence, application-dependent.
Between the lines
In light of our observations, our work highlights several important directions for IR research.
First, the limitations of relevance as a proxy for worthiness must be critically assessed. Relevance can be justified as a proxy for worthiness, but only by meaningfully establishing its validity and reliability in a given setting. The goal of fair ranking requires showing that the properties of the relevance scores, and of the resulting rankings, are aligned with the intended fairness goals. We emphasize that additional properties beyond those proposed in our work may still need to be identified and tested for various use cases. Interestingly, in concurrent work with an alternate view, researchers derived desired properties for group fairness metrics and highlighted similar nuances of metrics that rely on relevance, further validating our findings.

Second, defining and redefining "worthiness" for different ranking application domains remains important. We believe that connecting different conceptualizations of worthiness to how they are operationalized in fair ranking systems will aid researchers, developers, and auditors in enhancing system equity.

Third, developing new methods for obtaining worthiness scores at scale is important: for example, calibrating continuous relevance predictions from browsing models, or devising new ways of accounting for annotator biases during crowdsourced judgment collection.
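As one hypothetical instance of the calibration direction, a monotone (isotonic) regression could remap continuous relevance predictions onto a worthiness scale using a small set of labeled examples. This is our illustrative sketch of one possible approach, not a method proposed in the paper.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)

# Hypothetical data: continuous relevance predictions from a browsing
# model, plus a small set of ground-truth worthiness labels.
predicted = rng.uniform(0, 1, size=200)
worthiness = np.clip(predicted**2 + rng.normal(0, 0.05, size=200), 0, 1)

# Monotone (isotonic) calibration preserves the ranking while remapping
# scores onto the worthiness scale; one possible approach, not the paper's.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrated = calibrator.fit_transform(predicted, worthiness)
print(calibrated[:5].round(2))
```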
Together, these directions reveal several open problems in relevance measurement at scale for fair exposure allocation, and we underscore the need for interdisciplinary collaborations to study and address them.