Montreal AI Ethics Institute

Democratizing AI ethics literacy


DICES Dataset: Diversity in Conversational AI Evaluation for Safety

January 22, 2024

🔬 Research Summary by Ding Wang, a senior researcher from the Responsible AI Group in Google Research, specializing in responsible data practices with a specific focus on accounting for the human experience and perspective in data production.

[Original paper by Lora Aroyo, Alex S. Taylor, Mark Diaz, Christopher M. Homan, Alicia Parrish, Greg Serapio-Garcia, Vinodkumar Prabhakaran, and Ding Wang]


Overview: Machine learning often relies on datasets with crisp positive/negative labels, which oversimplifies inherently subjective tasks. Preserving rater diversity is expensive, yet it is critical for conversational AI safety. The DICES dataset offers fine-grained demographic information, high rater replication, and per-item rating distributions, enabling exploration of aggregation strategies. It serves as a shared resource for bringing diverse perspectives into safety evaluations.


Introduction

The importance of safety in conversational AI systems, particularly those based on large language models (LLMs), has grown significantly. These systems have the potential to generate harmful content, spread misinformation, and violate social norms. Despite advances in AI technology, ensuring safety requires robust evaluation data and fine-tuning to align with social norms and responsible tech practices; such curated data is essential for safe AI deployment. While previous research has focused on fine-tuning language models with safety-annotated datasets, little attention has been given to capturing different user groups’ diverse perspectives on safety. This paper introduces the DICES dataset, designed to represent and analyze safety perceptions across diverse user populations, including demographic factors such as age, gender, and ethnicity. It provides a valuable resource for evaluating safety in language models, especially with respect to population diversity.

This paper introduces the DICES dataset to address the need for nuanced safety approaches in language modeling. It focuses on the following contributions:

  1. Rater Diversity: Instead of solely mitigating bias, the paper emphasizes characterizing the impact of raters’ backgrounds on dataset annotations. DICES intentionally accounts for diversity with a balanced demographic distribution among raters.
  2. Expanded Safety: DICES assesses a broader notion of safety, encompassing five safety categories—harm, bias, misinformation, politics, and safety policy violations—to evaluate conversational AI systems comprehensively.
  3. Dataset Size: DICES comprises two sets of annotated AI chatbot conversations with exceptionally large rater replication rates, enabling robust observations about demographic diversity’s impact on safety opinions.
  4. Metrics: The paper demonstrates how DICES can be used to develop metrics for examining safety and diversity in conversational AI systems. It provides insights into inter-rater reliability and demographic subgroup agreement.

Overall, DICES is a valuable resource for understanding and evaluating safety and diversity in language models.

Key Insights

Developing DICES aimed to establish a benchmark dataset capable of systematically encompassing the diversity in safety assessments, enabling comparisons across rater groups defined by demographics. This was accomplished through a five-step process, which included creating the corpus, curating samples, recruiting a diverse pool of raters, conducting safety annotations, and expert assessments. The methodology was crafted to bolster statistical robustness by ensuring a wide representation of demographics among raters, enhance confidence in comparisons among subpopulations by having all raters assess every conversation, and evaluate variations in rater opinions by sampling data with established safety standards.

For safety annotation in DICES-990, 173 diverse raters from the US and India were recruited, providing 60–70 unique ratings per conversation across 24 safety criteria; each rater annotated a subset of the dataset. In contrast, DICES-350 was annotated by 123 unique US-based raters, each evaluating all 350 conversations using 16 safety criteria. Low-quality raters (13 in DICES-990 and 19 in DICES-350) were identified, and their annotations were removed. Due to space constraints, this paper focuses on DICES-350, chosen for its balanced demographics.

The dataset aimed for 120 US raters with equal representation across 12 demographic groups, defined by age group (GenZ, Millennial, GenX+) and race/ethnicity (Asian, Black, Latine/x, White). The results are based on a unique rater pool of 104 individuals of diverse genders, ages, and races/ethnicities. All raters annotated all conversations and signed a consent form agreeing to the collection of detailed demographics for this task. The demographic survey allowed raters to select “Prefer not to answer” for each question, and all demographics were self-reported after the annotation task was completed.

Inter-rater reliability (IRR) shows that Latine raters agree significantly more than raters of other races. Negentropy (i.e., the negative of entropy) and plurality size (i.e., the fraction of raters who choose the most popular response) show that White raters agree significantly more, and Multiracial raters significantly less, than other races (Figure 1).
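The two per-conversation agreement measures used here, negentropy and plurality size, follow directly from their definitions. The sketch below is an illustrative reimplementation from those definitions, not code from the DICES release:

```python
from collections import Counter
from math import log2

def agreement_metrics(ratings):
    """Compute two agreement measures over one conversation's ratings:
    negentropy (negative Shannon entropy of the vote distribution; closer
    to 0 means more agreement) and plurality size (fraction of raters
    choosing the most popular response)."""
    counts = Counter(ratings)
    n = len(ratings)
    probs = [c / n for c in counts.values()]
    negentropy = sum(p * log2(p) for p in probs)  # equals -H(p), so <= 0
    plurality = max(counts.values()) / n
    return negentropy, plurality

# Example: 10 raters, 8 answer "No" (safe), 1 "Yes", 1 "Unsure"
votes = ["No"] * 8 + ["Yes"] + ["Unsure"]
ne, pl = agreement_metrics(votes)
# plurality size is 0.8; negentropy is near 0 (high agreement)
```

A fully unanimous conversation yields a negentropy of 0 and a plurality size of 1, which makes the two measures easy to compare across demographic subgroups.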

The annotation task included the following six sets of questions: 

  • Q1 evaluated the conversation’s overall quality, including language, comprehensibility, topic familiarity, and more.
  • Q2 assessed the presence of harmful content with the potential for immediate harm to individuals, groups, or animals.
  • Q3 examined whether the conversation contained unfair bias or incited hatred against individuals or groups.
  • Q4 determined if the conversation featured misinformation, such as demonstrably false or outdated theories.
  • Q5 investigated any expression of political affiliations or downplaying of controversial topics in the conversation.
  • Q6 checked for policy violations regarding polarizing topics and endorsements.

All questions offered an “other” option to accommodate safety reasons beyond the predefined categories, and responses were categorized as “No” (safe), “Yes” (unsafe), or “Unsure,” with a “Yes” answer indicating the conversation was considered unsafe.
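Since DICES deliberately preserves the full vote distribution rather than a single label, any collapse to one verdict is a downstream choice. The sketch below shows one such choice, simple majority vote; the handling of “Unsure” responses is an illustrative assumption, not a rule from the paper:

```python
from collections import Counter

def aggregate_safety(votes, unsure_counts_as_unsafe=False):
    """Majority-vote aggregation of per-rater safety responses for one
    conversation. Responses are "No" (safe), "Yes" (unsafe), "Unsure".
    How "Unsure" is folded in is a design choice left open by the dataset;
    here it optionally counts toward "Yes" under a conservative policy."""
    mapped = ["Yes" if (v == "Unsure" and unsure_counts_as_unsafe) else v
              for v in votes]
    counts = Counter(mapped)
    # most_common(1) returns the single most frequent (label, count) pair
    return counts.most_common(1)[0][0]

votes = ["No", "No", "Yes", "Unsure", "No"]
label = aggregate_safety(votes)  # majority of raters answered "No"
```

Because the released data keeps every rater’s answer, alternative aggregation strategies (e.g., weighting by demographic subgroup, or treating any “Yes” as unsafe) can be compared against this baseline on the same conversations.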

All conversations in DICES-350 and a sample of 400 conversations in DICES-990 underwent assessment by in-house experts to evaluate their level of harm and discussion topics. About 22% of the conversations covered racial topics, with political topics at 14%, gendered topics at 10%, and 7% each for misinformation and medical topics. Roughly 40% of the conversations were rated as benign, with most of these labeled as banter; the remainder were evenly divided among debatable, moderate, and extreme harm levels. In DICES-350, all conversations also received gold ratings for safety from trust and safety experts, while DICES-990 lacked gold ratings, with only a random sample of 400 conversations rated for topic and degree of harm (see Figure 2).

The paper and dataset contain more details than what’s shown in the summary, such as aggregated ratings, which were generated from all granular safety ratings. They include a single aggregated overall safety rating (“Q_overall”) and aggregated ratings for the three safety categories that the 16 more granular safety ratings correspond to: “Harmful content” (“Q2_harmful_content_overall”), “Unfair bias” (“Q3_bias_overall”) and “Safety policy violations” (“Q6_policy_guidelines_overall”). It also contains granular safety ratings—Raters’ answers to the 16 (DICES-350) or 24 (DICES-990) safety questions spread across five categories of safety: “harmful content” (Q2.1–Q2.9), “unfair bias” (Q3.1–Q3.5), “misinformation” (Q4), “political affiliation” (Q5) and “safety policy violations” (Q6.1–Q6.3).
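The relationship between the granular questions and the category-level fields described above can be sketched as a roll-up over question groups. The question names follow the paper’s grouping, but the any-“Yes” roll-up rule used here is an illustrative assumption, not the dataset’s documented aggregation:

```python
# Granular-question groups per safety category (names follow the paper's
# DICES-350 grouping; misinformation and political affiliation are single
# questions, so only multi-question categories are rolled up here).
CATEGORIES = {
    "Q2_harmful_content_overall": [f"Q2.{i}" for i in range(1, 10)],
    "Q3_bias_overall": [f"Q3.{i}" for i in range(1, 6)],
    "Q6_policy_guidelines_overall": [f"Q6.{i}" for i in range(1, 4)],
}

def rollup(granular):
    """Derive category-level ratings from one rater's granular answers:
    a category is "Yes" (unsafe) if any of its questions is "Yes",
    "Unsure" if none is "Yes" but some is "Unsure", and "No" otherwise.
    Unanswered questions default to "No" (an assumption for this sketch)."""
    out = {}
    for overall, questions in CATEGORIES.items():
        answers = [granular.get(q, "No") for q in questions]
        if "Yes" in answers:
            out[overall] = "Yes"
        elif "Unsure" in answers:
            out[overall] = "Unsure"
        else:
            out[overall] = "No"
    out["Q_overall"] = ("Yes" if "Yes" in out.values()
                        else "Unsure" if "Unsure" in out.values() else "No")
    return out

# A single "Yes" on a harmful-content question flags the whole conversation
example = rollup({"Q2.3": "Yes", "Q3.2": "Unsure"})
```

This mirrors how a single flagged harm can dominate an overall safety verdict, while the preserved granular answers still show which category drove it.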

We refer to the data from the aggregated overall safety ratings in this paper for both brevity and illustrative purposes. However, it is worth emphasizing that the DICES dataset provides further opportunity for extensive, detailed analysis of specific safety-related categories and specific rated conversations.

Between the lines

The DICES dataset facilitates the evaluation of conversational AI system safety with a focus on diverse, subjective safety opinions from around 300 raters. This vast dataset, comprising over 2.5 million ratings, enables comprehensive exploration of safety evaluation themes, including ambiguity in safety assessments, rater disagreements among different groups, and fine-tuning strategies considering diverse safety perspectives. However, the dataset has limitations, such as a relatively small number of conversations and limited demographic categories. Addressing these issues and understanding and managing disagreements are subjects for future research. The dataset’s unique aspect is its collaboration between raters and experts to define “truth” in real-world scenarios, offering a valuable resource for the research community.

Want quick summaries of the latest research & reporting in AI ethics delivered to your inbox? Subscribe to the AI Ethics Brief. We publish bi-weekly.

  • © 2025 MONTREAL AI ETHICS INSTITUTE.
  • This work is licensed under a Creative Commons Attribution 4.0 International License.