The 28 Computer Vision Datasets Used in Algorithmic Fairness Research

May 18, 2022

🔬 Research summary by Julienne LaChance, PhD (Sony AI, AI Ethics) and Alessandro Fabris (University of Padua). Dr. Julienne LaChance is an AI Research Scientist on the AI Ethics Team at Sony AI and a founding member of Princeton AI4ALL. Alessandro is a PhD student at the University of Padua, specializing in algorithmic fairness, auditing, and information access systems.

[Original paper by Alessandro Fabris, Stefano Messina, Gianmaria Silvello, and Gian Antonio Susto]


Overview: Access to well-documented, high-quality datasets is crucial to effective algorithmic fairness research, yet in many sub-fields of AI/ML, dataset documentation is insufficient and scattered. Fabris et al. survey and clearly document over 200 datasets employed in algorithmic fairness research from 2014 to mid-2021. Here, we highlight the 28 computer vision datasets from this survey.


Introduction

Despite the growing urgency of algorithmic fairness assessments in both industry and academia, comprehensive fairness evaluations are frequently hindered by a lack of suitable data. In practice, this often forces fairness evaluators to choose between (1) using datasets that have become popular in the literature despite their limitations as fairness benchmarks (e.g. contrived prediction tasks, noisy data, severe coding mistakes, and their age), (2) using related but inappropriate datasets, which may include unethically sourced data or rely on flawed annotation methods, or (3) hand-crafting their own smaller test sets, which may be costly, time-consuming, and ultimately yield insufficient data for a meaningful evaluation. By thoroughly examining the datasets used in fairness research across nine diverse domains (computer vision, linguistics, etc.), this paper provides a handy reference for 200+ current fairness datasets and clearly presents strategies for future improvement.

How was the dataset list compiled? The authors surveyed, from 2014 to early May 2021, the following publication sources: every article published in domain-specific conferences (e.g. FAccT, AIES); every article published in proceedings of well-known machine learning and data mining conferences (e.g. CVPR, NeurIPS); and every article available from "Past Network Events" and "Older Workshops and Events" of the FAccT network. The results were filtered by keyword strings (e.g. *fair*, *bias*, *parit*) and manually cleaned by the authors. An updated version including more recent articles and datasets is due in 2022!
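As a rough sketch of the keyword-filtering step described above (the actual pipeline, sources, and patterns are the authors'; the article titles below are invented for illustration):

```python
import re

# Hypothetical candidate article titles; the real survey drew on FAccT, AIES,
# and major ML/data mining proceedings plus the FAccT network archives.
titles = [
    "Fairness Constraints for Face Verification",
    "Measuring Bias in Contextual Word Embeddings",
    "Efficient Sparse Attention for Long Sequences",
    "Demographic Parity in Ranking Systems",
]

# Wildcard keyword strings from the paper (*fair*, *bias*, *parit*),
# expressed here as case-insensitive substring searches.
patterns = [re.compile(p, re.IGNORECASE) for p in ("fair", "bias", "parit")]

# Keep any title matching at least one pattern; the authors then cleaned
# the matches manually, which this sketch does not attempt to reproduce.
matches = [t for t in titles if any(p.search(t) for p in patterns)]
print(matches)  # the "Fairness", "Bias", and "Parity" titles survive the filter
```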

Key Insights

As promised, here are the 28 computer vision datasets used in fairness research from 2014 to May 2021:

Adience

Athletes and health professionals

Benchmarking Attribution Methods (BAM)

BUPT Faces

Cars3D

CelebA

CIFAR

Diversity in Faces (DiF)

dSprites

FairFace

Fashion MNIST

IARPA Janus Benchmark A (IJB-A)

Image Embedding Association Test (iEAT)

ImageNet

Labeled Faces in the Wild (LFW)

MNIST

MS-Celeb-1M

MS-COCO

Multi-task Facial Landmark (MTFL)

Office31

Omniglot

Pilot Parliaments Benchmark (PPB)

Racial Faces in the Wild (RFW)

shapes3D

SmallNORB

UTK Face

Visual Question Answering (VQA)

Waterbirds

Readers can refer to the article for the full 200+ dataset list. Computer vision researchers may notice the absence of datasets such as the Chicago Faces Database (CFD), which was not used in AI/ML fairness research until after May 2021. We are not providing hyperlinks to two datasets whose hosting websites have been taken down: IBM's Diversity in Faces (DiF), removed following a class-action lawsuit, and Microsoft's MS-Celeb-1M.

Just from a quick skim of this list, some limitations of the computer vision datasets employed in algorithmic fairness research thus far become immediately apparent. For example, those assessing the fairness of a model with respect to sensitive attributes like race and gender won't find much use in generic ML datasets (e.g. MNIST, Fashion MNIST); in highly domain-specific datasets such as those containing geometric shapes/objects (e.g. dSprites, shapes3D) or office supplies (Office31); or in datasets with questionable image sources or annotation schemes (e.g. RFW and BUPT Faces, which use the Face++ API to apply "race" annotations in the categories "Caucasian", "Indian", "Asian", and "African" to MS-Celeb-1M images). Once we narrow our focus to specific sub-tasks in computer vision, the outlook becomes grimmer: are researchers studying fairness in pose estimation left with a single data source (MS-COCO)?
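To make concrete what such an assessment typically requires of a dataset, here is a minimal, hypothetical sketch of a per-group accuracy comparison; the labels, predictions, and group names below are invented for illustration and do not come from any of the datasets above.

```python
import numpy as np

# Hypothetical model outputs and annotations: y_true are ground-truth labels,
# y_pred are model predictions, and group holds a sensitive-attribute
# annotation (e.g. a gender or skin-type label) for each example.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 1, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

# Per-group accuracy: the basic quantity behind disparity reports such as
# the accuracy gaps in Gender Shades-style audits.
per_group = {
    g: float((y_pred[group == g] == y_true[group == g]).mean())
    for g in np.unique(group)
}
gap = max(per_group.values()) - min(per_group.values())
print(per_group, "gap:", gap)  # group "a": 0.75, group "b": 0.5, gap: 0.25
```

Without a per-example sensitive attribute like `group`, this comparison simply cannot be made, which is why generic or object-centric datasets are of little use here.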

Let’s dive in to take a quick look at just those datasets containing images of people. 

Human-centric fairness datasets in computer vision: A closer look 

For brevity, we will skip those datasets which have been taken down (DiF, MS-Celeb-1M, the persons category of ImageNet), as well as the MS-Celeb-1M-based RFW and BUPT Faces, even though retracted datasets and their derivatives continue to be used. This leaves:

Adience: ~30K in-the-wild smartphone images of ~2K people sourced from Flickr. Manually annotated for age, gender, and identity; some Fitzpatrick skin color annotations were added for the Gender Shades analysis.

Athletes and health professionals: ~500 images of nurses/doctors manually collected to identify bias in race/gender, plus ~800 images of athletes for gender/jersey color bias. Subgroups roughly balanced at 200 individuals. 

CelebFaces Attributes (CelebA): ~200K faces of ~10K individuals, augmented with landmark locations and manually annotated binary attributes. Annotations can be highly subjective (e.g. "attractive", "big nose") or offensive ("double chin"). Gender and age labels exist (see the loading sketch after this list).

FairFace: Race, age, gender, and skin tone annotations for ~100K face images from Yahoo's YFCC100M. Sensitive attributes like race were annotated by Mechanical Turk workers with the help of an associated model, followed by re-verification.

IARPA Janus Benchmark A (IJB-A): ~6K images of ~500 subjects. Gender and skin color annotations on manually selected in-the-wild images of people with broad geographic representation; the original annotation methodology is unspecified. Gender and Fitzpatrick skin type were labeled by one author of the Gender Shades study.

iEAT: A smaller dataset (~200 images) designed for testing biased associations between social concepts and attributes in images (e.g. "Old", "Young" vs. "Pleasant", "Unpleasant").

Labeled Faces in the Wild (LFW): ~13K faces of ~6K individuals. Gender, age, and race annotations for images of people in unconstrained settings (from the news). Images skew mostly white, male, and below age 60. Look for extensions like LFWA+.

MS-COCO: For object recognition. ~300K images from Flickr labeled according to whether or not they contain objects from 91 object types. Segmentation, key-point detection, and captioning data provided; gender labels can be inferred from captions. 

Multi-Task Facial Landmark (MTFL): ~10K images. Builds upon another dataset of outdoor face images; annotations include assumed gender, pose, and if subjects are smiling.

Pilot Parliaments Benchmark (PPB): ~1K images of ~1K parliamentary representatives from three African countries (Rwanda, Senegal, South Africa) and three European countries (Iceland, Finland, Sweden) chosen for distinctions/balance in skin tone/gender. A certified surgical dermatologist provided Fitzpatrick skin type labels.

UTK Face: ~20K face images sourced from two existing datasets (Morph and CACD); age, gender, and race estimated by an algorithm and human-validated. Additional images were crawled from major search engines to increase diversity.

Visual Question Answering (VQA): ~1M questions over ~300K images. Contains real images from MS-COCO and also abstract scenes with human figures. Questions and answers compiled by Mechanical Turk workers; these can refer to gender. 
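As a concrete example of working with the binary attribute annotations described for CelebA above, here is a minimal loading sketch assuming the torchvision copy of the dataset; the root path is a placeholder, and the built-in downloader can fail if the hosting quota is exceeded, in which case the files must be fetched manually.

```python
from torchvision.datasets import CelebA

# CelebA ships 40 binary attributes per image, including the subjective
# labels noted above ("Attractive", "Big_Nose", "Double_Chin") alongside
# "Male" and "Young", which are commonly used as gender and age proxies.
# The "./data" root is a placeholder path.
celeba = CelebA(root="./data", split="valid", target_type="attr", download=True)

attr_names = celeba.attr_names   # the 40 attribute names
image, attrs = celeba[0]         # a PIL image and a 0/1 tensor of length 40

for name in ("Male", "Young", "Attractive", "Big_Nose", "Double_Chin"):
    print(name, int(attrs[attr_names.index(name)]))
```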

These datasets highlight issues exposed by the survey: data opacity and data sparsity.

Between the lines

On the one hand, this paper provides a useful list of existing datasets for fairness researchers who need to perform model evaluations and situate their work in the scope of current practices. On the other, the survey provides the algorithmic fairness community with best practices to produce new resources: given the existing fairness datasets and their limitations, what can we do to curate novel, improved datasets? 

Moreover, the study opens the possibility for future explorations into how fairness evaluators in specific domains respond when appropriate datasets are unavailable. Are inappropriate datasets augmented and misused? To what extent are retracted, problematic datasets utilized anyway? Do independent researchers each create their own closed-source, smaller test sets? Some of these questions have been explored in prior works. Yet, as the authors note, in order for fairness evaluations to become standard practice in AI/ML, we must tackle data opacity (the lack of information on specific resources) and data sparsity (the scattered nature of available information) to fully address our collective data documentation debt.

Want quick summaries of the latest research & reporting in AI ethics delivered to your inbox? Subscribe to the AI Ethics Brief. We publish bi-weekly.
