🔬 Research summary by Julienne LaChance, PhD (SonyAI, AI Ethics) and Alessandro Fabris (University of Padua). Dr. Julienne LaChance is an AI Research Scientist on the AI Ethics Team at SonyAI and a founding member of Princeton AI4ALL. Alessandro Fabris is a PhD student at the University of Padua, specializing in algorithmic fairness, auditing, and information access systems.
[Original paper by Alessandro Fabris, Stefano Messina, Gianmaria Silvello, and Gian Antonio Susto]
Overview: Access to well-documented, high-quality datasets is crucial to effective algorithmic fairness research, yet in many sub-fields of AI/ML, dataset documentation is insufficient and scattered. Fabris et al. survey and clearly document over 200 datasets employed in algorithmic fairness research from 2014 to mid-2021. Here, we highlight the 28 computer vision datasets from this survey.
Introduction
Despite the growing urgency for algorithmic fairness assessments, both in industry and academia, comprehensive fairness evaluations are frequently hindered by a lack of suitable data. In practice, this often forces fairness evaluators to choose between (1) using datasets which have become popular in the literature despite their limitations as fairness benchmarks (e.g. contrived prediction tasks, noisy data, severe coding mistakes, sheer age), (2) using related but inappropriate datasets, which may include unethically sourced data or rely on flawed annotation methods, or (3) hand-crafting their own smaller test sets, which may be costly, time-consuming, and/or ultimately yield insufficient data for a meaningful evaluation. By thoroughly examining the datasets used in fairness research across 9 diverse domains (computer vision, linguistics, etc.), this paper provides a handy reference for 200+ current fairness datasets and clearly presents strategies for future improvement.
How was the dataset list compiled? The authors surveyed, from 2014 to early May 2021, the following publication sources: every article published in domain-specific conferences (e.g. FAccT, AIES); every article published in proceedings of well-known machine learning and data mining conferences (e.g. CVPR, NeurIPS); and every article available from “Past Network Events” and “Older Workshops and Events” of the FAccT network. The results were filtered by keyword strings (e.g. *fair*, *bias*, *parit*) and manually cleaned by the authors. An updated version including more recent articles and datasets is due in 2022!
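To make the keyword-filtering step concrete, here is a minimal Python sketch of how such matching might look. Only the keyword stems (*fair*, *bias*, *parit*) come from the survey; the regex-based approach, the helper function, and the example article titles are illustrative assumptions, since the authors’ actual tooling is not described.

```python
import re

# Keyword stems reported by the survey (matched with wildcards on both sides),
# compiled as case-insensitive regexes. Everything below the stems is an
# illustrative assumption, not the authors' actual pipeline.
KEYWORD_STEMS = [re.compile(stem, re.IGNORECASE) for stem in ("fair", "bias", "parit")]

def matches_fairness_keywords(text):
    """Return True if any keyword stem occurs anywhere in the text."""
    return any(pattern.search(text) for pattern in KEYWORD_STEMS)

# Hypothetical scraped titles standing in for full conference proceedings.
articles = [
    "Fairness-aware ranking of job candidates",
    "Demographic parity constraints for classifiers",
    "A faster convolutional backbone for detection",
]

# Keep the shortlist that would then be manually cleaned by the authors.
candidates = [title for title in articles if matches_fairness_keywords(title)]
print(candidates)  # the first two titles match; the third does not
```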
Key Insights
As promised, here are the 28 computer vision datasets used in fairness research from 2014 to May 2021:
Athletes and health professionals
Benchmarking Attribution Methods (BAM)
Diversity in Faces (DiF)
IARPA Janus Benchmark A (IJB-A)
Image Embedding Association Test (iEAT)
Labeled Faces in the Wild (LFW)
MS-Celeb-1M
Multi-task Facial Landmark (MTFL)
Pilot Parliaments Benchmark (PPB)
Racial Faces in the Wild (RFW)
Visual Question Answering (VQA)
Readers can refer to the article for the full 200+ dataset list. Computer vision researchers may notice the absence of datasets such as the Chicago Face Database (CFD), which was not used in AI/ML fairness research until after May 2021. We aren’t providing hyperlinks here to two datasets whose hosting websites have been taken down: IBM’s Diversity in Faces (DiF), removed following a class-action lawsuit, and Microsoft’s MS-Celeb-1M.
Just from a quick skim of this list, some limitations of the computer vision datasets employed in algorithmic fairness research thus far become immediately apparent. Those assessing the fairness of a model with respect to sensitive attributes like race and gender won’t find much use in generic ML datasets (e.g. MNIST, Fashion MNIST); in highly domain-specific datasets containing geometric shapes/objects (e.g. dSprites, shapes3D) or office supplies (Office31); or in datasets with questionable image sources or annotation schemes (e.g. RFW and BUPT Faces, which use the Face++ API to apply “race” annotations in the categories “Caucasian”, “Indian”, “Asian”, and “African” to MS-Celeb-1M images). Once we narrow our focus to specific sub-tasks in computer vision, the outlook becomes grimmer: researchers studying fairness in pose estimation, for instance, have a single data source (MS-COCO).
Let’s dive in and take a quick look at just those datasets containing images of people.
Human-centric fairness datasets in computer vision: A closer look
For brevity, we will skip the datasets which have been taken down (DiF, MS-Celeb-1M, the persons category of ImageNet), as well as the MS-Celeb-1M-based RFW and BUPT Faces, even though retracted datasets and their derivatives continue to be used. This leaves:
Adience: ~30K in-the-wild smartphone images of ~2K people sourced from Flickr. Manually annotated for age, gender, and identity; some Fitzpatrick skin type annotations were later added for the Gender Shades analysis.
Athletes and health professionals: ~500 images of nurses/doctors, manually collected to study race and gender bias, plus ~800 images of athletes for studying gender and jersey-color bias. Subgroups are roughly balanced at 200 individuals each.
CelebFaces Attributes (CelebA): ~200K faces of ~10K individuals, augmented with landmark locations and manually annotated binary attributes. Annotations can be highly subjective (e.g. “attractive”, “big nose”) or offensive (“double chin”). Gender and age labels are included.
FairFace: Race, age, gender, and skin tone annotations for ~100K face images from Yahoo’s YFCC100M. Sensitive attributes such as race were annotated by Mechanical Turk workers, cross-checked against an associated model, and re-verified.
IARPA Janus Benchmark A (IJB-A): ~6K images of ~500 subjects. Gender and skin color annotations on manually selected in-the-wild images of people with broad geographic representation; the original annotation methodology is unspecified. Gender and Fitzpatrick skin type were labeled by one author of the Gender Shades study.
iEAT: A smaller dataset (~200 images) designed for testing biased associations between social concepts (e.g. “Old”, “Young”) and attributes (e.g. “Pleasant”, “Unpleasant”) in images.
Labeled Faces in the Wild (LFW): ~13K faces of ~6K individuals. Gender, age, and race annotations for images of people in unconstrained settings (from the news). Images skew mostly white, male, and below age 60. Look for extensions like LFWA+.
MS-COCO: For object recognition. ~300K images from Flickr labeled according to whether or not they contain objects from 91 object types. Segmentation, key-point detection, and captioning data are provided; gender labels can be inferred from the captions (see the sketch after this list).
Multi-Task Facial Landmark (MTFL): ~10K images. Builds upon another dataset of outdoor face images; annotations include assumed gender, pose, and whether subjects are smiling.
Pilot Parliaments Benchmark (PPB): ~1K images of ~1K parliamentary representatives from three African countries (Rwanda, Senegal, South Africa) and three European countries (Iceland, Finland, Sweden) chosen for distinctions/balance in skin tone/gender. A certified surgical dermatologist provided Fitzpatrick skin type labels.
UTK Face: ~20K face images sourced from two existing datasets (Morph and CACD); age, gender, and race were estimated by an algorithm and validated by human annotators. Additional images were crawled from major search engines to increase diversity.
Visual Question Answering (VQA): ~1M questions over ~300K images. Contains real images from MS-COCO and also abstract scenes with human figures. Questions and answers compiled by Mechanical Turk workers; these can refer to gender.
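The MS-COCO entry above notes that gender labels can be inferred from captions; here is a minimal sketch of how such keyword-based inference is commonly done. The word lists, the function name, and the rule of discarding images with mixed gendered terms are illustrative assumptions, not the exact procedure used by the survey or any particular paper.

```python
# Minimal sketch of caption-based gender-label inference for COCO-style data.
# The word lists and the "discard if mixed" rule below are assumptions for
# illustration; fairness papers differ in the exact lists and rules they use.
MASCULINE = {"man", "men", "boy", "boys", "male", "he", "his", "him"}
FEMININE = {"woman", "women", "girl", "girls", "female", "she", "her", "hers"}

def infer_gender_label(captions):
    """Return 'man', 'woman', or None when captions are mixed or contain no gendered words."""
    words = {word.strip(".,!?").lower() for caption in captions for word in caption.split()}
    has_m, has_f = bool(words & MASCULINE), bool(words & FEMININE)
    if has_m and not has_f:
        return "man"
    if has_f and not has_m:
        return "woman"
    return None  # ambiguous, or no person is mentioned

print(infer_gender_label(["A woman riding a horse on the beach."]))  # -> woman
print(infer_gender_label(["A man and a woman playing tennis."]))     # -> None
```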
These datasets highlight issues exposed by the survey: data opacity and data sparsity.
Between the lines
On the one hand, this paper provides a useful list of existing datasets for fairness researchers who need to perform model evaluations and situate their work in the scope of current practices. On the other, the survey provides the algorithmic fairness community with best practices to produce new resources: given the existing fairness datasets and their limitations, what can we do to curate novel, improved datasets?
Moreover, the study opens the possibility for future explorations into how fairness evaluators in specific domains respond when appropriate datasets are unavailable. Are inappropriate datasets augmented and misused? To what extent are retracted, problematic datasets utilized anyway? Do many independent researchers create many closed-source, smaller test sets? Some of these questions have been explored in prior work. Yet, as the authors note, in order for fairness evaluations to become standard practice in AI/ML, we must tackle data opacity (the lack of information on specific resources) and data sparsity (the scattered nature of available information) to fully address our collective data documentation debt.