Research Summary by Tomo Lazovich (they/them), a Senior Machine Learning Researcher on Twitter's ML Ethics, Transparency, and Accountability (META) team.
[Original paper by Tomo Lazovich, Luca Belli, Aaron Gonzales, Amanda Bower, Uthaipon Tantipongpipat, Kristian Lum, Ferenc Huszar, Rumman Chowdhury]
Overview: Some popular ML fairness metrics are hard to operationalize in practice, largely due to the absence of demographic data in industry settings. This paper proposes a complementary set of metrics, originally used in economics to measure income inequality, as a way to capture disparities in outcomes of large-scale ML systems.
Introduction
In recent years, many examples of the potential harms caused by machine learning systems have come to the forefront (see a collection of them in the awful-ai repository). Practitioners in the field of algorithmic bias and fairness have developed a suite of metrics to capture one aspect of these harms: namely, differences in performance between demographic groups and, in particular, worse performance for marginalized communities. Take, for example, the now-iconic Gender Shades paper, which found that commercial gender recognition systems performed significantly worse for darker-skinned women. Despite great progress in this area, one open question has become particularly prominent for industry practitioners: how do you capture such disparities if you don't have reliable demographic data, or choose not to collect it due to privacy concerns?
This paper approaches the problem by adapting income inequality metrics from economics. You've probably heard statistics like "the top 1% of people own X% of all wealth," and this work applies those notions to levels of engagement on Twitter. If you imagine Twitter as an economy, with "impressions" being distributed instead of dollars, then the "rich" are those users who have gotten many impressions in the past, now have many followers, and are therefore more likely to get even more impressions in the future. By applying inequality metrics to the distributions of engagements, we seek to understand whether our algorithmic systems are reinforcing or worsening "rich get richer" dynamics on the platform. It has been shown that a majority of Twitter users feel that only a few people, or no one at all, see their Tweets, a phenomenon that has been referred to as "Tweeting into the void." This work uses inequality metrics at the distribution level to understand exactly how skewed engagements are on Twitter, and digs deeper to isolate some of the algorithms that may be driving that skew. It also evaluates a number of metrics against desirable criteria. Overall, it finds that inequality metrics are a useful complement to demographic-based fairness metrics and faithfully capture skews in outcomes.
Key Insights
The work benchmarks a total of seven different inequality metrics on two Twitter-related case studies. Each metric is evaluated against a number of criteria, including both desirable mathematical properties and subjective criteria like interpretability. Going through all of the metrics is out of scope for this summary, but two you may have heard in the news are the Gini coefficient and the top 1% share. The Gini coefficient ranges between zero and one: zero corresponds to perfect equality (everyone holds the same amount of wealth), and one to maximal inequality (a single person holds all of it). The top 1% share is simply the percentage of total wealth held by people in the top 1% of the distribution. In this work, instead of measuring inequality in "income" or "wealth," the metrics measure the skew, or "top-heaviness," in how many impressions and other engagements (likes, retweets, etc.) authors get on Twitter.
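As a concrete illustration, here is a minimal sketch of how these two metrics can be computed over an array of per-author engagement counts. This is not the paper's code: the `engagements` array and the Pareto draw below are purely hypothetical stand-ins for real impression counts.

```python
import numpy as np

def gini(x: np.ndarray) -> float:
    """Gini coefficient of a distribution of non-negative values.
    0 = everyone holds an equal amount; 1 = one person holds everything."""
    x = np.sort(x.astype(float))
    n = x.size
    cum = np.cumsum(x)
    # Standard identity for the Gini coefficient of sorted values.
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def top_share(x: np.ndarray, fraction: float = 0.01) -> float:
    """Share of the total held by the top `fraction` of the population."""
    x = np.sort(x)[::-1]
    k = max(1, int(np.ceil(fraction * x.size)))
    return x[:k].sum() / x.sum()

# Hypothetical per-author engagement counts (e.g., impressions per author),
# drawn from a heavy-tailed distribution for illustration only.
rng = np.random.default_rng(0)
engagements = rng.pareto(a=1.2, size=100_000)
print(f"Gini: {gini(engagements):.3f}")
print(f"Top 1% share: {top_share(engagements):.2%}")
```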
Result 1: Inequality metrics meaningfully capture differences in skew between different engagement types
The first case study tries to answer a simple question: do these metrics actually work on real data from Twitter? That is, when we apply them to distributions that we know have different levels of skew, do we see meaningful differences in the metrics? The results show clearly that, yes, these metrics can distinguish the distributions. In particular, the skew of an engagement type scales roughly with the level of effort that engagement requires. Impressions (having someone look at your Tweet) have a lower level of skew, while quote Tweets (sharing someone's Tweet to your timeline and adding your own Tweet on top of it) have the highest. To quantify: the top 1% of users get almost 80% of all impressions and almost 90% of all quote Tweets. Not only are there clear differences between distributions, but the distributions on Twitter are in general very highly skewed.
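To build intuition for the "skew scales with effort" pattern, here is a purely synthetic illustration (again, not the paper's data): heavier-tailed Pareto draws stand in for higher-effort engagement types, and the metrics from the sketch above order them as expected. The shape parameters are invented for demonstration.

```python
import numpy as np

# Reuses gini() and top_share() from the sketch above.
# Smaller Pareto shape parameter => heavier tail => more skew.
rng = np.random.default_rng(1)
for name, shape in [("impressions", 1.5), ("retweets", 1.2), ("quote Tweets", 1.05)]:
    x = rng.pareto(a=shape, size=100_000)
    print(f"{name:>12}: Gini={gini(x):.3f}, top 1% share={top_share(x):.2%}")
```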
This leads to the next question the paper addresses: can we use these metrics to identify potential algorithmic drivers of this high level of skew?
Result 2: Inequality metrics identify out-of-network suggestions as potential drivers of skew
In the second case study, the work focuses on the impression distribution specifically. It breaks impressions down by which algorithm placed the Tweet on the reader's timeline. Some of these are "in-network" (IN) suggestions, meaning they were ranked highly and come from an author the reader follows. Others are "out-of-network" (OON) suggestions: Tweets from authors whom the reader does not directly follow. One example of an OON suggestion is a Tweet that was liked by someone the reader follows, even though the reader does not follow the Tweet's author.
When impressions are broken down by source, into IN and different types of OON suggestions, OON suggestions in general show much higher levels of skew. The top 1% of users get around 77% of impressions from IN Tweets, but close to 99% of impressions from certain kinds of OON Tweets. Additionally, when you break down by number of followers, the gap in skew between IN and OON Tweets is much larger for authors with low follower counts. All of this serves as evidence that the structure of the social graph itself may be driving some of this inequality, with certain algorithms exacerbating the effect more than others.
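A sketch of what such a per-source breakdown might look like in code, reusing `top_share()` from above. The record format and source labels ("IN", "OON_like") are invented for illustration; real impression logs would look different.

```python
import numpy as np

def per_source_top_share(author_ids, sources, fraction=0.01):
    """Top-`fraction` impression share per ranking source.

    author_ids: array with the author shown for each impression.
    sources:    array with the algorithm that surfaced each impression.
    """
    shares = {}
    for src in np.unique(sources):
        mask = sources == src
        # Count impressions per author within this source, then measure skew.
        _, counts = np.unique(author_ids[mask], return_counts=True)
        shares[src] = top_share(counts.astype(float), fraction)
    return shares

# Toy usage: a skewed author distribution split across two invented sources.
# The random split is independent of authorship, so this will NOT reproduce
# the paper's IN/OON gap; it only demonstrates the aggregation.
rng = np.random.default_rng(2)
n = 200_000
authors = rng.zipf(2.0, size=n)
sources = rng.choice(np.array(["IN", "OON_like"]), size=n)
print(per_source_top_share(authors, sources))
```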
Between the lines
Moving forward, one of the most interesting lines of research will be to better understand how the structure of Twitter's social graph feeds the inequality observed in this paper. Ideally, we can develop methods to decouple algorithmic behavior from the graph structure. Additionally, since this work found that inequality metrics are a useful complement to demographic-based metrics, future work can focus on incorporating these metrics into automated testing, feature review processes, and other internal procedures that are part of the ML evaluation cycle. We are currently exploring ways to implement these metrics in practice at Twitter, so that product owners can have better visibility into the impacts of their models. Overall, these findings are a promising step in the real-world operationalization of ML fairness and bias metrics.