• Skip to main content
  • Skip to secondary menu
  • Skip to primary sidebar
  • Skip to footer
Montreal AI Ethics Institute

Montreal AI Ethics Institute

Democratizing AI ethics literacy

  • Articles
    • Public Policy
    • Privacy & Security
    • Human Rights
      • Ethics
      • JEDI (Justice, Equity, Diversity, Inclusion
    • Climate
    • Design
      • Emerging Technology
    • Application & Adoption
      • Health
      • Education
      • Government
        • Military
        • Public Works
      • Labour
    • Arts & Culture
      • Film & TV
      • Music
      • Pop Culture
      • Digital Art
  • Columns
    • AI Policy Corner
    • Recess
    • Tech Futures
  • The AI Ethics Brief
  • AI Literacy
    • Research Summaries
    • AI Ethics Living Dictionary
    • Learning Community
  • The State of AI Ethics Report
    • Volume 7 (November 2025)
    • Volume 6 (February 2022)
    • Volume 5 (July 2021)
    • Volume 4 (April 2021)
    • Volume 3 (Jan 2021)
    • Volume 2 (Oct 2020)
    • Volume 1 (June 2020)
  • About
    • Our Contributions Policy
    • Our Open Access Policy
    • Contact
    • Donate

Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection

November 4, 2023

🔬 Research Summary by Oana Inel, a Postdoctoral Researcher at the University of Zurich, where she is working on responsible and reliable use of data and investigating the use of explanations to provide transparency for decision-support systems and foster reflective thinking in people.

[Original paper by Oana Inel, Tim Draws, and Lora Aroyo]


Overview: Recent research has shown that typical one-off data collection practices, dataset reuse, poor dataset quality, or representativeness could lead to unfair, biased, or inaccurate outcomes. Data collection for AI should be performed responsibly, where the data quality is thoroughly scrutinized and measured through a systematic set of appropriate metrics. In our paper, we propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics for an iterative, in-depth analysis of the factors influencing the quality and reliability of the generated data.


Introduction

Despite developing numerous toolkits and checklists to assess the quality of AI models and human-generated datasets, the research landscape still needs a unified framework for cross-dataset comparison and measurement of dataset stability for repeated data collections. Our approach complements existing research by proposing an iterative metrics-based methodology that enables a comprehensive analysis of data collections by systematically applying reliability and reproducibility measurements. 

The reliability metrics are applied to a single data collection and focus on understanding the raters. We propose that data collection campaigns be repeated under similar or different conditions. This allows us to study in-depth the reproducibility of the datasets and their stability under various conditions using a set of reproducibility metrics. The overall methodology is designed to integrate responsible AI practices into data collection for AI and allow data practitioners to explore factors influencing reliability and quality, ensuring transparent and responsible data collection practices. We found that our systematic set of metrics allows us to draw insights into the human and task-dependent factors that influence the quality of AI datasets. They also provide the necessary input for a dataset scorecard, allowing for thorough and systematic evaluation of data collection experiments. 

Key Insights

Reliability and Reproducibility Metrics for Responsible Data Collection

Our proposed methodology brings together, systematically, a set of measurements typically performed ad hoc. However, observing their interaction allows data practitioners to provide a holistic picture of the data quality produced by these studies. The chosen metrics provide input for a scorecard, allowing for thorough and systematic evaluation and comparison of different data collection experiments.

We address the reliability of human annotations by looking at raters’ agreement (i.e., measuring their inter-rater reliability), raters’ variability (i.e., measuring the variability in raters’ answers distribution), and power analysis (i.e., determining the sufficient number of raters for each task). These analyses equip us with fundamental observations and findings for characterizing annotations’ quality and reliability. 

To investigate how rater populations influence the reliability of the annotation results, we propose repeating the annotations at different time intervals and in different settings, thus identifying the factors influencing their reliability. However, we need to use the aforementioned reliability measures to perform a proper comparison of the collected annotations. For instance, high raters agreement values in several repetitions indicate highly homogeneous raters’ populations within each repetition, but it does not necessarily mean that the experiments are highly reproducible. For this, we perform two additional measurements: 1) stability analysis (i.e., measuring the degree of association of the aggregated raters’ scores across two data collection repetitions) and 2) replicability similarity analysis (i.e., measuring the degree of agreement between two rater pools, making two data annotation tasks comparable) to understand how much variability the raters bring and how much we can generalize the results. 

In sum, our proposed methodology provides a step-wise approach as a guide for practitioners to explore factors that influence or impact the reliability and quality of their collected data. Ultimately, the proposed reliability and reproducibility scorecards and analyses allow for more transparent and responsible data collection practices. This leads to identifying factors influencing quality and reliability, thoroughly measuring dataset stability over time or in different conditions, and allowing for dataset comparison.  

What are the factors influencing the quality of data collection?

We validated our methodology on nine existing data collections repeated at different time intervals with similar or different rater qualifications. The annotation tasks span different degrees of subjectivity, data modalities (text and videos), and data sources (Twitter, search results, product reviews, YouTube videos). By following our proposed methodology, we were able to identify the following factors that influence the overall quality of data collection:

  • Intrinsic task subjectivity: This is the case of tasks with low observed inter-rater reliability in each repetition but high stability and high replicability similarity across repetitions. Such scorecard interpretations indicate that raters are similarly consistent within each repetition and across repetitions and that the disagreement indicated by the low IRR scores is, in fact, intrinsic to the subjective nature of the task.
  • Region-specific and time-sensitive annotations: High variability for certain annotated items across different repetitions of a data collection indicates that data collection practices are affected by temporal, familiarity, and regional aspects. In such cases, our analysis shows consistently low stability and replicability similarity. This has serious implications for when data collections are reused, as certain annotations may become obsolete or change in interpretation over time. Furthermore, we posit that diverse raters should not be expected to produce a coherent view of the annotations. We advise repeating the data collection by creating dedicated pools of raters with similar demographic characteristics and comparing their results.
  • Ambiguity of annotation categories: High variability for certain annotated items and power analysis indicating that even a very high number of raters (around 90) can exhibit high levels of consistent disagreement is typically caused by the subjectivity of the task. In this case, we recommend optimizing the task design to decrease additional ambiguity in the annotation categories. Lower inter-rater reliability (IRR) values for certain annotation categories indicate that some categories may not be as clear as others or are only seldom applicable. This suggests that careful attention should be given to the annotation task’s design, instructions, and possible answer categories. Furthermore, the high number of raters needed to obtain stable results indicates that the task might benefit from a more thorough selection of raters and training sessions. 

Between the lines

Our proposed methodology for responsible data collection does not pose any requirements on how data is structured or formatted. What we propose does, however, affect the current practice and assumes a significant adaptation to using reliability and reproducibility metrics. We recommend the following:

  • Systematic piloting: The proposed methodology is primarily suitable as an investigative pilot of data annotation studies. Pilot experiments could identify factors that influence the data collection and be ultimately mitigated for large-scale data collection. 
  • Capture raters, task, and dataset characteristics: Borrow guidelines for reporting human-centric studies from psychology, medicine, and HCI, where human stances, opinions, and other meaningful characteristics are thoroughly recorded. This would facilitate informed decisions on the proper process of collecting raters’ annotations and possible reuse.
  • Cognitive biases assessment: We recommend using existing checklists to identify and subsequently measure, mitigate, and document cognitive biases that may present an issue in the data collection tasks between each data collection iteration. This allows for proper mitigation of biases.
  • Provenance for data collection: Data documentation and maintenance approaches should thoroughly record provenance, including quality scorecards. This would alleviate issues regarding data handling, reuse or modifications of annotation tasks, and platform selection.
Want quick summaries of the latest research & reporting in AI ethics delivered to your inbox? Subscribe to the AI Ethics Brief. We publish bi-weekly.

Primary Sidebar

🔍 SEARCH

Spotlight

A brightly coloured illustration which can be viewed in any direction. It has many elements to it working together: men in suits around a table, someone in a data centre, big hands controlling the scenes and holding a phone, people in a production line. Motifs such as network diagrams and melting emojis are placed throughout the busy vignettes.

Tech Futures: The Fossil Fuels Playbook for Big Tech: Part II

A rock embedded with intricate circuit board patterns, held delicately by pale hands drawn in a ghostly style. The contrast between the rough, metallic mineral and the sleek, artificial circuit board illustrates the relationship between raw natural resources and modern technological development. The hands evoke human involvement in the extraction and manufacturing processes.

Tech Futures: The Fossil Fuels Playbook for Big Tech: Part I

Close-up of a cat sleeping on a computer keyboard

Tech Futures: The threat of AI-generated code to the world’s digital infrastructure

The undying sun hangs in the sky, as people gather around signal towers, working through their digital devices.

Dreams and Realities in Modi’s AI Impact Summit

Illustration of a coral reef ecosystem

Tech Futures: Diversity of Thought and Experience: The UN’s Scientific Panel on AI

related posts

  • Problematic Machine Behavior: A Systematic Literature Review of Algorithm Audits

    Problematic Machine Behavior: A Systematic Literature Review of Algorithm Audits

  • The Bias of Harmful Label Associations in Vision-Language Models

    The Bias of Harmful Label Associations in Vision-Language Models

  • Anthropomorphization of AI: Opportunities and Risks

    Anthropomorphization of AI: Opportunities and Risks

  • The Challenge of Understanding What Users Want: Inconsistent Preferences and Engagement Optimization

    The Challenge of Understanding What Users Want: Inconsistent Preferences and Engagement Optimization

  • Research summary: Apps Gone Rogue: Maintaining Personal Privacy in an Epidemic

    Research summary: Apps Gone Rogue: Maintaining Personal Privacy in an Epidemic

  • Levels of AGI: Operationalizing Progress on the Path to AGI

    Levels of AGI: Operationalizing Progress on the Path to AGI

  • Eticas Foundation external audits VioGĂ©n: Spain’s algorithm designed to protect victims of gender vi...

    Eticas Foundation external audits VioGén: Spain’s algorithm designed to protect victims of gender vi...

  • A hunt for the Snark: Annotator Diversity in Data Practices

    A hunt for the Snark: Annotator Diversity in Data Practices

  • Intersectional Inquiry, on the Ground and in the Algorithm

    Intersectional Inquiry, on the Ground and in the Algorithm

  • Promises and Challenges of Causality for Ethical Machine Learning

    Promises and Challenges of Causality for Ethical Machine Learning

Partners

  •  
    U.S. Artificial Intelligence Safety Institute Consortium (AISIC) at NIST

  • Partnership on AI

  • The LF AI & Data Foundation

  • The AI Alliance

Footer


Articles

Columns

AI Literacy

The State of AI Ethics Report


 

About Us


Founded in 2018, the Montreal AI Ethics Institute (MAIEI) is an international non-profit organization equipping citizens concerned about artificial intelligence and its impact on society to take action.

Contact

Donate


  • © 2025 MONTREAL AI ETHICS INSTITUTE.
  • This work is licensed under a Creative Commons Attribution 4.0 International License.
  • Learn more about our open access policy here.
  • Creative Commons License

    Save hours of work and stay on top of Responsible AI research and reporting with our bi-weekly email newsletter.