Knowing Your Annotator: Rapidly Testing the Reliability of Affect Annotation

December 2, 2023

🔬 Research Summary by Matthew Barthet, a Ph.D. student researching artificial intelligence and games at the Institute of Digital Games, University of Malta.

[Original paper by Matthew Barthet, Chintan Trivedi, Kosmas Pinitas, Emmanouil Xylakis, Konstantinos Makantasis, Antonios Liapis, and Georgios N. Yannakakis]


Overview: Collecting human emotion data reliably is notoriously challenging, especially at a large scale. Our latest work proposes a new quality assurance tool that uses straightforward audio and visual tests to assess annotators’ reliability before data collection. The tool filters out unreliable annotators with 80% accuracy, enabling more efficient collection of cleaner data.


Introduction

Reliably measuring human emotions has always been a pivotal challenge in affective computing and human-computer interaction. At its core, the issue is that emotions are highly subjective and loosely defined responses to stimuli, which makes them difficult to annotate consistently. To make matters worse, collecting such data is expensive, time-consuming, and cognitively demanding for annotators, so gathering labels at a large scale is often impractical.

To address these challenges, we designed a novel annotator Quality Assurance (QA) tool that swiftly and efficiently assesses the reliability of participants before data collection. We tested the tool on a challenging task in which participants annotated their engagement in real-time while watching videos of 30 different first-person shooter games. The results highlighted two main findings. The first was that, unsurprisingly, trained annotators in a controlled environment were consistently more reliable than crowdworkers. The second and more crucial finding was that our QA tool achieved 80% accuracy in predicting annotator reliability in subsequent affect tasks, validating its ability to filter out unreliable participants and making data collection more efficient while yielding higher-quality data.

Key Insights

Knowing Your Annotator

Rapidly testing the reliability of subjective annotators.

Our Quality Assurance (QA) tool allows researchers to better understand the reliability and characteristics of their annotators. Because human emotions are subjective, there is ultimately no correct answer as to how one is expected to feel at any given moment. It is therefore vital, when processing your data, to understand whether annotators simply feel differently from one another or are annotating their emotions unreliably.

With this challenge in mind, we developed two QA tests that quickly assess an annotator’s reliability using basic stimuli for which the correct answer is known. They test how well a person can respond to audio and visual stimuli in a timely manner and ensure they are using the correct hardware setup for the experiment. The tool is built upon the PAGAN annotation platform and the RankTrace protocol, in which participants watch a video and annotate how they feel in real time by scrolling up or down on a mouse wheel.
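
To make that setup concrete, here is a minimal sketch (an illustration only, not the actual PAGAN/RankTrace code) of how raw mouse-wheel events could be turned into the kind of unbounded, continuous annotation trace such tools record:

```python
# Minimal sketch: turning raw mouse-wheel events into an unbounded,
# continuous annotation trace sampled at fixed intervals.
from bisect import bisect_right

def events_to_trace(events, duration_s, hz=4):
    """events: list of (timestamp_s, wheel_delta); returns a list of
    (time_s, cumulative_value) samples covering the whole video."""
    events = sorted(events)
    times = [t for t, _ in events]
    # The cumulative sum of wheel deltas is the unbounded annotation value.
    cumulative, total = [], 0.0
    for _, delta in events:
        total += delta
        cumulative.append(total)

    trace = []
    n_samples = int(duration_s * hz) + 1
    for i in range(n_samples):
        t = i / hz
        k = bisect_right(times, t)            # events up to time t
        value = cumulative[k - 1] if k else 0.0
        trace.append((t, value))
    return trace

# Example: a participant scrolls up twice, then down once, over a 3-second clip.
print(events_to_trace([(0.5, 1), (1.2, 1), (2.4, -1)], duration_s=3, hz=2))
```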

Visual QA Test

Drawing inspiration from previous work, our first QA test requires annotators to mark changes in the brightness of a green screen. The tool presents the user with a 1-minute video of a green screen whose brightness fluctuates slowly. Participants use the mouse wheel to annotate the changes in brightness they observe. The goal is for the participant to accurately replicate the signal driving the changes on screen; the more closely they replicate it, the higher their reliability score.
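
The summary does not specify the exact agreement metric used for scoring, so the sketch below simply correlates the first differences of a participant’s trace with those of the known brightness signal; treat the metric and helper names as illustrative assumptions.

```python
# Illustrative reliability score for the visual QA test (the metric choice is
# an assumption, not necessarily the authors' exact measure).
import numpy as np

def resample(signal, n):
    """Linearly resample a 1-D signal to n points."""
    signal = np.asarray(signal, dtype=float)
    return np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(signal)), signal)

def visual_qa_score(annotation_trace, brightness_signal, n=120):
    """Higher is better: how closely the unbounded annotation trace tracks the
    ground-truth brightness of the green screen."""
    a = resample(annotation_trace, n)
    b = resample(brightness_signal, n)
    # Correlate first differences: RankTrace-style traces are unbounded, so
    # relative changes matter more than absolute values.
    return float(np.corrcoef(np.diff(a), np.diff(b))[0, 1])

# Example with a synthetic slowly fluctuating brightness signal.
t = np.linspace(0, 2 * np.pi, 240)
brightness = 0.5 + 0.5 * np.sin(t)            # known stimulus signal
good_annotator = np.cumsum(np.diff(brightness, prepend=brightness[0]))
print(round(visual_qa_score(good_annotator, brightness), 2))   # close to 1.0
```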

Audio QA Test

Expanding on the above, the second QA test focuses on auditory stimuli. This time, participants are shown a blank screen while a 1-minute sound clip plays whose frequency (or pitch) changes slowly over time. Once again, the participant must annotate the changes in frequency they perceive by scrolling the mouse wheel; the more closely they replicate the frequency signal in the audio clip, the higher their reliability score. This second test was included because our subsequent data collection focused on annotating audiovisual content, so testing participants on both modalities is vital. It also ensures that participants outside a controlled setting use the correct setup, such as a mouse wheel (not a trackpad) and headphones to pick up subtle audio cues.
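
As an illustration of what such a stimulus might look like (an assumption about the setup, not the authors’ implementation), the snippet below synthesizes a 60-second tone whose pitch drifts slowly; the driving frequency signal doubles as the ground truth that annotation traces are compared against.

```python
# Hypothetical audio QA stimulus: a 60-second tone with slowly drifting pitch,
# written to a WAV file. The frequency array is kept as the ground-truth signal.
import wave
import numpy as np

SR = 44100                     # sample rate (Hz)
DURATION = 60                  # seconds
t = np.linspace(0, DURATION, SR * DURATION, endpoint=False)

# Frequency drifts smoothly between roughly 220 Hz and 660 Hz.
freq = 440 + 220 * np.sin(2 * np.pi * t / 20.0)
phase = 2 * np.pi * np.cumsum(freq) / SR      # integrate frequency to get phase
tone = 0.3 * np.sin(phase)

with wave.open("audio_qa_stimulus.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)          # 16-bit samples
    f.setframerate(SR)
    f.writeframes((tone * 32767).astype(np.int16).tobytes())
```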

Case Study: Annotating Engagement in Games

To validate the predictive nature of our tool, we conducted a case study on annotating viewer engagement in first-person shooter (FPS) games. We assembled a dataset of video clips from 30 different FPS games, selected to cover different graphical styles (e.g., cartoon, realistic) and levels of detail. After completing the QA tests, annotators labeled their engagement across 30 minutes of footage (1 minute per game) presented in random order.

Since we do not have a ground-truth signal for emotions to compare against, we approximate the annotators’ reliability by measuring how well they agree with one another as a group. We tested two groups of annotators. The first is a group of 10 “Experts”: trained students and researchers who were brought into our research lab and provided with the necessary hardware to complete the study. The second group comprises 10 “Crowdworkers” recruited via Amazon’s Mechanical Turk platform. The goal is to compare the reliability of the two groups and observe how well the tool can filter out unreliable annotators using their average reliability in the QA tests.
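
One common way to operationalize this group agreement (the exact measure is not given in this summary, so the leave-one-out scheme below is an illustrative assumption) is to correlate each annotator’s trace with the mean trace of the remaining annotators:

```python
# Illustrative leave-one-out agreement: correlate each annotator's trace with
# the mean of everyone else's traces for the same clip.
import numpy as np

def leave_one_out_agreement(traces):
    """traces: array of shape (n_annotators, n_samples), one row per annotator
    for the same video clip. Returns one agreement score per annotator."""
    traces = np.asarray(traces, dtype=float)
    scores = []
    for i in range(len(traces)):
        others_mean = np.delete(traces, i, axis=0).mean(axis=0)
        scores.append(np.corrcoef(traces[i], others_mean)[0, 1])
    return np.array(scores)

# Toy example: three annotators, the third one scrolling almost at random.
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 6, 60))
traces = [base + 0.1 * rng.normal(size=60),
          base + 0.1 * rng.normal(size=60),
          rng.normal(size=60)]
print(leave_one_out_agreement(traces).round(2))
```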

The results highlighted two main findings. The first was that, unsurprisingly, expert annotators in our research lab were consistently more reliable than crowdworkers. More crucially, our QA tool achieved 80% accuracy in predicting annotator reliability in our engagement study, validating its ability to filter out unreliable participants and allowing cleaner data to be collected more efficiently.
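
In spirit, this filtering step amounts to a simple threshold classifier on the QA scores. The sketch below, using made-up scores and an arbitrary threshold, shows how such a prediction could be evaluated against annotators’ actual reliability in the engagement task:

```python
# Hypothetical evaluation of QA-based filtering: predict "reliable" when the
# average QA score clears a threshold, then compare against actual task
# reliability. Scores and threshold are invented for illustration.
import numpy as np

def qa_filter_accuracy(qa_scores, task_reliable, threshold=0.5):
    """qa_scores: average visual+audio QA score per annotator.
    task_reliable: booleans, whether they were reliable in the real task."""
    predicted_reliable = np.asarray(qa_scores) >= threshold
    return float(np.mean(predicted_reliable == np.asarray(task_reliable)))

qa_scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.85, 0.4, 0.75, 0.1]
task_reliable = [True, True, True, False, True, False, True, True, True, False]
print(qa_filter_accuracy(qa_scores, task_reliable))   # 0.9 on this toy data
```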

The QA tests used in this study, the dataset of FPS videos, and the annotated engagement labels (collection is ongoing) are available in our GitHub repository: https://github.com/institutedigitalgames/pagan_qa_tool_sources. We are also working on an automated version of this tool for the public, which will be added to PAGAN in the near future.

Between the lines

This work takes important initial steps toward making large-scale affect data collection more feasible and understandable for researchers. It enables researchers to save time and resources by allowing only the most reliable annotators to complete their study, and it ensures the data collected is of the highest possible quality. This is a critical and long-standing goal in affective computing and will help foster the creation of better and larger corpora.

There are many directions for future development here. We are testing our QA tool on more participants and gathering more engagement data to validate our findings further and build upon our dataset. We are also looking into different annotation protocols for maximizing annotators’ reliability, different approaches for measuring reliability, and ways of filtering unreliable annotators (e.g., by using machine learning).

