Montreal AI Ethics Institute

Democratizing AI ethics literacy


Augmented Datasheets for Speech Datasets and Ethical Decision-Making

July 6, 2023

🔬 Research Summary by Orestis Papakyriakopoulos, a research scientist in AI ethics at Sony AI. His research provides ideas, frameworks, and practical solutions towards just, inclusive and participatory socio-algorithmic ecosystems.

[Original paper by Orestis Papakyriakopoulos, Anna Seo Gyeong Choi, William Thong, Dora Zhao, Jerone Andrews, Rebecca Bourke, Alice Xiang, and Allison Koenecke]


Overview: The lack of diversity in datasets can lead to serious limitations in building equitable and robust Speech-language Technologies (SLT), especially along dimensions of language, accent, dialect, variety, and speech impairment. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets, which can be used in addition to “Datasheets for Datasets.”


Introduction

Many SLTs are used in high-impact scenarios: to support individuals with disabilities (e.g., the speech synthesizer used by Stephen Hawking), to make driving safer, to help medical doctors transcribe patient notes, and to transcribe courtroom events.

Prior research studies have shown that such SLT applications can have disparate impacts on different populations, which can largely be attributed to the training data used. For example, a study in 2020 showed that speech-to-text error rates of top commercial speech assistants were twice as high for Black speakers relative to white speakers in the US.

Such biases often emerge because speech data selection & collection is a time- and resource-intensive task, especially considering the many varieties of speech. For example, dataset creators and model developers must answer crucial questions about who should be included in the dataset with respect to age, accent, dialect, gender & atypical speech. They should also consider data subjects’ rights & privacy while keeping data collection feasible and useful for the target SLT application.

Towards this end, we created a datasheet template that provides explicit questions speech dataset creators should answer, contributing to more ethical SLTs. Adoption of the template by the research community can lead to more transparent & inclusive dataset creation and sharing.

Key insights

Our primary aim was to develop a set of considerations that can be useful to speech dataset creators and users when developing speech datasets and training machine learning models. Towards this, we performed a large-scale literature review & focused dataset evaluation to uncover existing limitations and identify properties that ethical & inclusive datasets should have. Based on the detected limitations and properties, we supplemented the existing datasheets for datasets with speech-specific questions that can guide researchers and practitioners in developing more robust and trustworthy SLTs.

Limitations of existing datasets & documentation 

During our large-scale literature and dataset review, we found serious limitations in demographic diversity, numerous ethical questions left unaddressed, and many dataset properties undisclosed, leading to confusion about dataset utility. For example, fewer than a handful of datasets contained explicit recordings of non-binary individuals, and age diversity was very limited. Similarly, we found systematic under-documentation of the speaking style of individuals in the recordings and of the privacy and compensation practices applied to data subjects during data collection.

Considerations

Based on the detected limitations and issues, we formulated questions that can inform transparent & ethical dataset development at different stages of the process. An overview of these questions for each stage follows:

(1) Motivation

How does a dataset creator determine which linguistic subpopulations (age, gender, dialect, nationality, etc.) are the focus of the dataset, as well as for which purpose is the dataset going to be used (e.g., read or spontaneous speech)?

(2) Composition

How much data does a dataset creator collect for each subpopulation? What topics are included in the recordings, and is it ensured that the content does not conflict with the values of data subjects and users (religious, cultural, political values)?

(3) Collection Process

Are speech recordings collected in a way that respects data subjects’ privacy and their well-being? Do dataset creators have the appropriate rights to collect recordings? Under what conditions (e.g., noise) are they collecting the data?

(4) Preprocessing/cleaning/labeling

How are annotators trained to perform the labeling of the data? How familiar are they with the types of speech (e.g., dialect, accent) collected, and how are they resolving differences across annotators? What is the transcription convention used, and how are recordings standardized?

(5) Uses / Distribution / Maintenance

How is sensitive information redacted from the recordings and transcriptions? Is the dataset a subsample of the data collected? What other documentation is available to understand further the data collection process (e.g., agreements signed with data subjects and research methodology)?
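To make the five stages above concrete, one way a dataset creator might operationalize such a datasheet is as a machine-readable checklist that flags undocumented fields before release. This is a minimal illustrative sketch only: the class and field names below are assumptions for the sake of example, not the actual wording or structure of the augmented datasheet template from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class AugmentedSpeechDatasheet:
    """Illustrative checklist loosely mirroring the five documentation stages.

    Field names are hypothetical, not the paper's template wording.
    """

    # (1) Motivation: which subpopulations and use cases are targeted
    target_subpopulations: list[str] = field(default_factory=list)
    intended_speech_style: str = ""  # e.g., "read" or "spontaneous"

    # (2) Composition: how much data per group, and what content
    hours_per_subpopulation: dict[str, float] = field(default_factory=dict)
    recording_topics: list[str] = field(default_factory=list)

    # (3) Collection process: consent, rights, and recording conditions
    consent_obtained: bool = False
    recording_conditions: str = ""  # e.g., noise environment

    # (4) Preprocessing/cleaning/labeling: annotator and transcription practices
    annotator_training: str = ""
    transcription_convention: str = ""

    # (5) Uses/distribution/maintenance: redaction and sampling disclosure
    pii_redaction_method: str = ""
    is_subsample: bool = False

    def undocumented_fields(self) -> list[str]:
        """Return names of fields still at their empty defaults."""
        return [
            name
            for name, value in vars(self).items()
            if value in ("", False, [], {})
        ]
```

A creator could run `undocumented_fields()` before publishing a dataset to see which stages of the datasheet still lack answers, turning the qualitative questions above into a simple release gate.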

Between the lines

Our augmented datasheets provide a blueprint for considerations practitioners should examine and address.

For dataset creators, the datasheets ensure consistency in documenting datasets, increase transparency regarding dataset contents, clarify the motivations and methods behind data collection, and encourage explicit consideration of linguistic subgroups and socioeconomic/demographic categories that are often overlooked.

For dataset users, the benefits include a more comprehensive understanding of dataset usefulness and easier decision-making when selecting data for more robust and inclusive SLTs. 

Regardless of whether the datasheet questions are answered by users or creators, engaging with each question can provide valuable reflexive knowledge on ethical machine learning development. We recommend that dataset creators release a completed augmented datasheet alongside their speech dataset to inform the wider research community about its capabilities and limitations. This release also sets an example for adopting more ethical, inclusive, and transparent machine learning practices.

Want quick summaries of the latest research & reporting in AI ethics delivered to your inbox? Subscribe to the AI Ethics Brief. We publish bi-weekly.
