Top-level summary: When we think about fairness in ML systems, we tend to focus heavily on data and much less on the other pieces of the pipeline. This talk offers illustrative examples from Google's Fairness in ML team on how to make ML systems fairer by looking at design, data, and measurement and modeling. It is motivated by commonplace examples of how skewed underlying data can lead to significant harm: for instance, female crash-test dummies were not used until 2011, and before then women suffered higher rates of injury in car crashes because their body types were not represented in automotive crash testing. Several Google services, including Jigsaw's Perspective API, which scores the toxicity of a piece of text, had flaws the team did not catch initially but that emerged after deployment, when users pointed out that results reflected common stereotypes. A telling design example was that band-aids were, until quite recently, made in a single color and served people with darker skin tones poorly. The importance of measurement and modeling was made clear through examples of how creating and tracking fairness metrics helps monitor a system over time and tells teams where they can do better. The lessons the team drew from various deployments are grouped into three categories: fairness by data, fairness by design, and fairness by measurement and modeling. They are a mix of aspirational and actionable steps that help development and design teams on the ground grapple with translating abstract principles into concrete practice, and they provide a neat framework for doing so. The talk was followed at the end of 2019 by the launch of the Fairness Indicators tool suite, which integrates with TFX and other frameworks and can be used to carry out some of the actions described in the lessons.
Google announced its AI principles for building systems that are ethical, safe, and inclusive, yet, as with so many high-level principles, they are hard to put into practice without more granular, actionable steps derived from them. Here are the principles:
- Be socially beneficial
- Avoid creating or reinforcing unfair bias
- Be built and tested for safety
- Be accountable to people
- Incorporate privacy design principles
- Uphold high standards of scientific excellence
- Be made available for uses that accord with these principles
This talk focused on the second principle and did exactly that, providing concrete guidance on how to translate it into everyday practice for design and development teams.
Humans have a history of making product design decisions that don't serve everyone's needs. The crash-dummy and band-aid examples mentioned above give some insight into the challenges users face even when the designers and developers of products and services have no ill intent. Products and services shouldn't be designed such that they perform poorly for people because of aspects of themselves that they cannot change.
For example, in the Open Images dataset, searching for images labeled "wedding" surfaces stereotypical Western weddings, while weddings from other cultures and parts of the world are not tagged as such. From a data perspective, the need for more diverse sources of data is evident, and the Google team addressed this by building an extension to the Open Images dataset, inviting users from across the world to snap pictures of their surroundings that capture diversity in many areas of everyday life. This helped mitigate the geographic skew that many open image datasets suffer from.
Biases can enter at any stage of the ML development pipeline and solutions need to address them at different stages to get the desired results. Additionally, the teams working on these solutions need to come from a diversity of backgrounds including UX design, ML, public policy, social sciences and more.
Fairness by data is one of the first steps in the ML product lifecycle, and it plays a significant role in the rest of the lifecycle as well, since data is used both to train and to evaluate a system. Google Clips was a camera designed to automatically find and capture interesting moments, but in practice it worked well only for a certain type of family, under particular lighting conditions and poses. This was a clear bias, and the team moved to collect more data that better represented the variety of families in the product's target audience. Quick, Draw! was a fun game that asked users to supply quick hand-drawn sketches of commonplace items like shoes. The hope was that, being open to the world and having a game element, it would be used by people from a diversity of backgrounds, and the data collected would be rich enough to capture the world. On analysis, the team saw that most users had a very particular concept of a shoe in mind, the sneaker, and very few women's shoes were submitted. The example highlights that data collection, especially when trying to get diverse samples, requires a conscious effort to account for the actual distribution the system will encounter in the world and a best-effort attempt to capture its nuances. Users don't use systems exactly the way we intend, so reflect on who you are and aren't able to reach with your system, check for blind spots, monitor how the data changes over time, and use these insights to build automated tests for fairness in data; a minimal sketch of such a test follows below.
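As a concrete illustration of what an automated data-fairness test could look like, here is a minimal sketch in Python. It assumes the data can be sliced by some attribute of interest; the "region" slices and the thresholds below are illustrative assumptions, not details from the talk. The check flags slices that are missing or underrepresented, the kind of test that can run every time the dataset is refreshed.

```python
# Minimal sketch of an automated data-fairness check: flag slices of a
# dataset whose representation is missing or falls below a chosen threshold,
# so the check can run on each data refresh and alert when new data drifts.
from collections import Counter
from typing import Dict, Iterable, List


def slice_shares(values: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of examples falling into each slice."""
    counts = Counter(values)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def underrepresented_slices(
    values: Iterable[str],
    expected_slices: List[str],
    min_share: float = 0.05,
) -> List[str]:
    """List expected slices that are missing or below min_share of the data."""
    shares = slice_shares(values)
    return [s for s in expected_slices if shares.get(s, 0.0) < min_share]


if __name__ == "__main__":
    # Toy example: a geographically skewed image dataset.
    regions = ["north_america"] * 80 + ["europe"] * 15 + ["south_asia"] * 5
    expected = ["north_america", "europe", "south_asia", "east_asia", "africa"]
    print(underrepresented_slices(regions, expected, min_share=0.10))
    # -> ['south_asia', 'east_asia', 'africa']
```

Run against each new snapshot of the data, a check like this surfaces the kind of geographic skew described above before a model is ever trained on it.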
The second approach that can help with fairness in ML systems is measurement and modeling. The benefit of measurement is that it can be tracked over time, and you can test for fairness at scale, for both individuals and groups. Different fairness concerns require different metrics, even within the same product; the primary categories of concern are disproportionate harms and representational harms. Jigsaw's Perspective API is a tool that takes a piece of text and returns its level of toxicity. An earlier version of the system rated sentences of the form "I am straight" as not toxic while rating "I am gay" as toxic. The team needed a way to see what was causing this and how it could be addressed. By removing the identity token and monitoring how the prediction changed, they could measure where the data might be biased and how to fix it. One approach is to use block lists and removal of such tokens so that neutral sentences are perceived as such, without imposing stereotypes learned from large text corpora; these steps prevent the model from accessing information that can lead to skewed outcomes. But in some situations we might want to flag even the first kind of sentence as toxic, if it is used in a derogatory manner against an individual, and that decision requires capturing context and nuance. Google undertook Project Respect to collect positive identity associations from around the world as a way of improving data collection, and coupled it with active sampling (an algorithmic approach that samples more from the parts of the training distribution where the model is underperforming) to improve the system's outputs. Another approach is to create synthetic data that mimics the problematic cases and renders them in a neutral context. Adversarial training and updated loss functions, where the model's loss is modified to minimize the difference in performance between groups of individuals, can also be used to get better results. In their updates to the toxicity model they have seen improvements, but these were based on synthetic data of short sentences, and it remains an area of improvement. A sketch of the counterfactual identity-term check and a simple group-gap metric follows the list below. Some of the lessons the team learned from these experiments:
- Test early and test often
- Develop multiple metrics for measuring the scale of each problem (quantitative and qualitative measures, along with user testing, are part of this)
- It is possible to take proactive steps in modeling that are aware of production constraints
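To make the measurement ideas above concrete, here is a minimal sketch of a counterfactual identity-term check (swap the identity term in a template sentence and compare the model's toxicity scores) and a per-group false-positive-rate gap, the kind of metric that can be tracked over time. The `score_toxicity` callable, the identity terms, and the template are hypothetical placeholders, not the Perspective API's actual test suite.

```python
# Minimal sketch: counterfactual identity-term testing and a group-gap metric.
# `score_toxicity` stands in for whatever model is under test; the identity
# terms below are illustrative assumptions, not an official list.
from itertools import permutations
from typing import Callable, Dict, List, Tuple

IDENTITY_TERMS = ["straight", "gay", "christian", "muslim"]


def counterfactual_gaps(
    template: str,
    score_toxicity: Callable[[str], float],
) -> List[Tuple[str, str, float]]:
    """Fill a template (e.g. 'I am {}') with each identity term and report
    score differences between term pairs, largest first."""
    scores = {t: score_toxicity(template.format(t)) for t in IDENTITY_TERMS}
    gaps = [
        (a, b, abs(scores[a] - scores[b]))
        for a, b in permutations(IDENTITY_TERMS, 2)
        if a < b  # keep each unordered pair once
    ]
    return sorted(gaps, key=lambda g: g[2], reverse=True)


def false_positive_rate_by_group(
    labels: List[int],       # 1 = actually toxic, 0 = not toxic
    predictions: List[int],  # model decisions after thresholding
    groups: List[str],       # identity group mentioned in each example
) -> Dict[str, float]:
    """Per-group false-positive rate; the spread across groups is the gap to track."""
    fpr: Dict[str, float] = {}
    for g in set(groups):
        negatives = [i for i, grp in enumerate(groups) if grp == g and labels[i] == 0]
        if negatives:
            fpr[g] = sum(predictions[i] for i in negatives) / len(negatives)
    return fpr
```

Tracking numbers like these per release, rather than a single aggregate accuracy, is what lets a team notice regressions like the "I am gay" example before deployment.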
From a design perspective, think about fairness more holistically and build lines of communication between the user and the product. As an example, Turkish is a gender-neutral language, but when translating to English, sentences take on gender along stereotypical lines, rendering "nurse" as female and "doctor" as male. Say we have the sentence "Casey is my friend": given no other information, we can't infer Casey's gender, so from a design perspective it is better to present that choice to the user, who has the context and background to make the best decision (a sketch of this pattern follows the list of lessons below). Without that, no matter how well the model is trained to output fair predictions, its output will be wrong whenever it lacks context that only the user has. Lessons learned from these experiments include:
- Context is key
- Get information from the user that the model doesn’t have and share information with the user that the model has and they don’t
- How do you design so the user can communicate effectively, and have transparency so that you can get the right feedback?
- Get feedback from a diversity of users
- Recognize the different ways in which users provide feedback; not every user can offer feedback in the same way
- Identify ways to enable multiple experiences
- We need more than a theoretical and technical toolkit; the experience needs to be rich and context-dependent
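Here is a minimal sketch of the "present the choice to the user" pattern from the translation example above, assuming a hypothetical `translate_with_gender` backend; it illustrates the design idea rather than Google Translate's actual implementation.

```python
# Minimal sketch of the "surface the choice" design pattern: when the source
# sentence is gender-ambiguous, return both renderings and let the user pick,
# instead of silently choosing a stereotyped default.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TranslationOption:
    text: str
    gender: Optional[str]  # "feminine", "masculine", or None when unambiguous


def translate_with_gender(source: str, gender: Optional[str]) -> str:
    """Toy stand-in for a translation backend that accepts a gender hint."""
    # A real system would call a translation model here; this just tags the hint.
    if gender is None:
        return f"[translation of: {source}]"
    return f"[{gender} translation of: {source}]"


def translate(source: str, is_gender_ambiguous: bool) -> List[TranslationOption]:
    # Detecting ambiguity is its own subproblem; here the caller supplies it.
    if not is_gender_ambiguous:
        return [TranslationOption(translate_with_gender(source, None), None)]
    # Ambiguous source: surface both renderings so the user, who has the
    # context the model lacks, makes the final call.
    return [
        TranslationOption(translate_with_gender(source, "feminine"), "feminine"),
        TranslationOption(translate_with_gender(source, "masculine"), "masculine"),
    ]


if __name__ == "__main__":
    for option in translate("Casey is my friend", is_gender_ambiguous=True):
        print(option.gender, "->", option.text)
```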
Putting these lessons into practice, what's important is consistent and transparent communication; layering on approaches like datasheets for datasets and model cards for model reporting helps highlight appropriate uses of the system and where it has been tested, and warns of potential misuses and of the places where it hasn't been tested.
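As an illustration of the model-card idea, here is a minimal sketch of the kind of record a team could fill in and publish alongside a model; the fields and values are illustrative rather than a standard schema (the "Model Cards for Model Reporting" paper and Google's Model Card Toolkit give fuller templates).

```python
# Minimal sketch of a model-card-like record a team could publish with a model.
# Field names and values here are illustrative, not an official schema.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelCard:
    model_name: str
    intended_uses: List[str]
    out_of_scope_uses: List[str]         # warn readers what the model is NOT for
    evaluation_slices: Dict[str, float]  # per-slice metrics the team tracks
    known_limitations: List[str] = field(default_factory=list)


card = ModelCard(
    model_name="toxicity-classifier-demo",
    intended_uses=["Assist human moderators in ranking comments for review"],
    out_of_scope_uses=["Fully automated removal of user speech"],
    evaluation_slices={"overall_auc": 0.0, "identity_term_subset_auc": 0.0},  # placeholders
    known_limitations=["Evaluated mainly on short English sentences"],
)
```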
Full video of Jacqueline Pan and Tulsee Doshi’s talk at Google I/O 2019: https://youtu.be/6CwzDoE8J4M