Montreal AI Ethics Institute

Democratizing AI ethics literacy


Tiny, Always-on and Fragile: Bias Propagation through Design Choices in On-device Machine Learning Workflows

July 17, 2023

🔬 Research summary by Wiebke Hutiri, a PhD candidate at Delft University of Technology, where she studies and develops responsible design practices for trustworthy AI.

[Original paper by Wiebke (Toussaint) Hutiri, Aaron Yi Ding, Fahim Kawsar, Akhil Mathur]


Overview: On-device machine learning (ML) is used by billions of resource-constrained Internet of Things (IoT) devices – think smart watches, mobile phones, smart speakers, emergency response, and health-tracking devices. This paper investigates how design choices during model training and optimization can lead to unequal predictive performance across gender groups and languages, leading to reliability bias in device performance.


Introduction

Imagine emergency response systems, activated by voice recognition technology, that consistently ignore the high-pitched voices of women in distress while flawlessly responding to the lower-pitched commands of men. This scenario is a real concern in on-device machine learning (ML), where the resource-constrained settings of IoT devices result in design trade-offs to balance predictive accuracy, power consumption, and compute requirements. During development, engineers must make many decisions to address hardware limitations and meet specific operational requirements while managing the diversity of devices, users, and operating environments. Navigating these challenges successfully requires expertise in hardware, software engineering, and data processing techniques, as well as a deep understanding of the application context.

This research studies performance disparities in on-device ML workflows, exploring how design choices during model training and optimization can perpetuate unequal performance across gender groups and languages. Such disparities can lead to biased device reliability. Through a series of empirical experiments on a keyword spotting task, the study uncovers how complex technical decisions related to the data sample rate, pre-processing parameters, model architecture, and pruning can amplify and propagate reliability bias. The findings highlight the importance of studying bias beyond cloud-based settings. The paper also offers low-effort strategies for engineers to mitigate such biases.

Key Insights

Overview of Design Choices in On-device Keyword Spotting

An audio keyword spotting system takes a raw speech signal as input and outputs the keyword(s) present in the signal from a set of predefined keywords. During inference, the speech signal is sampled and divided into overlapping frames using a sliding window approach, with parameters such as frame length, frame step, and window function specified for pre-processing. These frames are transformed into either log Mel spectrograms or Mel Frequency Cepstral Coefficients (MFCCs). The resulting frame-level features are concatenated, mean-normalized, and used to train a deep neural network classifier. Additionally, the trained neural network can be optimized using model compression techniques, such as weight pruning, as optional steps in the process.
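The framing and feature-extraction stage described above can be sketched in plain numpy. This is a minimal, illustrative pipeline, not the paper's implementation: the frame length (25 ms), frame step (10 ms), Hann window, and 40-filter mel bank are common defaults chosen here for concreteness, and the MFCC/DCT step is omitted for brevity.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, centre):
            fb[i - 1, j] = (j - left) / max(centre - left, 1)
        for j in range(centre, right):
            fb[i - 1, j] = (right - j) / max(right - centre, 1)
    return fb

def log_mel_features(signal, sr=16000, frame_len=400, frame_step=160, n_mels=40):
    # Slide a Hann window over the raw signal (25 ms frames, 10 ms step at 16 kHz)
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * frame_step : i * frame_step + frame_len] * window
                       for i in range(n_frames)])
    # Power spectrum per frame, then project onto the mel filterbank and take logs
    power = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2
    fb = mel_filterbank(n_mels, frame_len, sr)
    return np.log(power @ fb.T + 1e-10)  # shape: (n_frames, n_mels)
```

Each of the knobs in this sketch (sample rate, frame length, frame step, number of mel filters) corresponds to one of the pre-processing design choices whose bias effects the paper measures.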

Reliability Bias Assumptions and Definition

We consider an on-device ML model a reliable device component for a user group if the group’s predictive performance equals the model’s overall predictive performance across all groups. If a model performs better or worse than average for a group, we consider it biased, meaning it favors or is prejudiced against that group. Both favoritism and prejudice increase reliability bias. We operationalize reliability bias with a measure that captures these definitions and penalizes favoritism and prejudice equally. Additionally, the measure scores models as being more or less biased while considering positive and negative prediction outcomes.
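The summary does not reproduce the paper's exact formula, but one illustrative way to operationalize a measure with the stated properties (favoritism and prejudice penalized equally, scored relative to overall performance) is an absolute log-ratio between each group's score and the overall score:

```python
import numpy as np

def reliability_bias(group_scores, overall_score):
    # Illustrative measure: the absolute log-ratio penalizes a group scoring
    # above the overall average (favoritism) and below it (prejudice) equally.
    # A perfectly reliable model (all groups equal to overall) scores 0.
    return sum(abs(np.log(s / overall_score)) for s in group_scores)
```

Under this sketch, a model that is 10% better than average for one group contributes the same bias as a model that is 10% worse, matching the symmetric treatment of favoritism and prejudice described above.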

Based on these assumptions and definitions, the study conducts empirical experiments on spoken keyword corpora in English, French, German, and Kinyarwanda. It considers performance disparities for male and female speakers across these languages. The results are presented next.

Bias Due to Design Choices During Model Training

Impact of Model Architecture Size and Sample Rate

Model accuracy is lower at lower sample rates and for lightweight architectures. The median and interquartile range (IQR) of reliability bias tends to be greater at lower sample rates and for lightweight architectures. The direction of bias is strongly influenced by the training dataset. Overall, male speakers are favored by models. An exception to this is models trained on a Kinyarwanda language dataset, which have considerably lower accuracy and favor female speakers.

Impact of Pre-processing Parameters

Feature type and dimensions impact keyword spotting (KWS) accuracy and reliability bias. Their effect is further influenced by the training dataset. In general, MFCC-type features perform better than log Mel spectrograms. However, they can also increase reliability bias, prejudicing models against females and favoring males. For MFCC features, fewer dimensions (i.e., cepstral coefficients and Mel filter banks) can reduce computational demands with a negligible impact on accuracy and reliability bias.

Bias Due to Design Choices During Model Optimization

Impact of Pruning Hyperparameters

Polynomial decay is a more robust pruning schedule than constant sparsity, and a larger pruning learning rate, like 0.001, reduces the likelihood of unintended bias and unexpected accuracy degradation. These design choices are particularly important when pruning models to sparsities greater than 50%, beyond which accuracy and reliability bias can deteriorate dramatically. The increase in reliability bias due to pruning is greater for smaller architectures and at lower sample rates. This trend is stronger at smaller learning rates. Training, validation, and test datasets must be large enough and representative across groups of users to ensure robust results and avoid bias.
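The polynomial decay schedule referenced here is the standard one from Zhu & Gupta (2017), in which sparsity ramps from an initial to a final value over a training window. A minimal sketch of that schedule (the parameter names and defaults are illustrative, not the paper's configuration):

```python
def polynomial_decay_sparsity(step, begin, end, initial=0.0, final=0.8, power=3):
    # Sparsity ramps from `initial` at step `begin` to `final` at step `end`,
    # following a cubic polynomial by default (Zhu & Gupta, 2017).
    if step < begin:
        return initial
    if step >= end:
        return final
    frac = (step - begin) / (end - begin)
    return final + (initial - final) * (1.0 - frac) ** power
```

Because most of the pruning happens early while the learning rate can still compensate, this schedule degrades more gracefully than pruning at a constant sparsity from the start, which is consistent with the robustness finding above.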

Strategies to Mitigate Reliability Bias

Model Selection after Training and Optimization

Engineers should use a multi-objective criterion that considers accuracy and reliability bias to select models with high accuracy and low bias after training or pruning. We propose that engineers set a tolerance that controls the drop in accuracy from the maximum value, thus using accuracy as a satisficing metric while minimizing reliability bias. The tolerance value should be determined from application requirements. If model training is followed by pruning, a few top models should be selected for pruning using both the high-accuracy and the low-bias + high-accuracy selection strategies.
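The satisficing selection rule described above can be sketched in a few lines. This is a minimal illustration (the tuple layout and tolerance value are assumptions for the example, not the paper's API): keep every model whose accuracy is within the tolerance of the best, then pick the one with the lowest reliability bias.

```python
def select_model(results, tolerance=0.02):
    # results: list of (name, accuracy, reliability_bias) tuples
    best_acc = max(acc for _, acc, _ in results)
    # Accuracy as a satisficing metric: keep models within `tolerance` of the best
    candidates = [r for r in results if r[1] >= best_acc - tolerance]
    # Among the satisficing candidates, minimize reliability bias
    return min(candidates, key=lambda r: r[2])
```

For example, with a 2% tolerance a model that gives up 1% of accuracy but has a fraction of the bias would be selected over the most accurate model.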

Supporting Design Decisions with Targeted Experimentation

Iterating over pre-processing and pruning parameters during model training and optimization can achieve high accuracy and low bias. Still, it comes at a cost of computational resources, time, and energy. To mitigate bias while considering computational costs, we propose a reduced set of design choice values based on our experiments that engineers should iterate over during model development. By targeting specific values, engineers can train fewer models and conduct limited training and pruning experiments. This data-driven approach to targeted experimentation provides a feasible strategy to select on-device ML models with good accuracy and low bias.

Between the lines

The prevalence of on-device ML applications in everyday consumer devices and the potential consequences of biased systems make it necessary to study bias in on-device ML. Where cloud-based studies of bias in ML abstract away resource considerations, hardware, compute, and power limitations pose real constraints in on-device ML settings. These constraints affect predictive performance and, as this paper shows, can lead to systematic performance variation across user groups, which the paper terms reliability bias. The findings emphasize that careful design choices can mitigate reliability bias, underscoring the importance of responsible design practice and engineers' role in building fairer systems. While the study focuses on predictive accuracy, system efficiency should also be considered a variable across which systematic performance differences can occur. System efficiency impacts power consumption and device battery life, and reliability bias can lead to disparities along these dimensions. Future research should extend the study to different data modalities and investigate reliability bias in other learning tasks.



© 2025 Montreal AI Ethics Institute. This work is licensed under a Creative Commons Attribution 4.0 International License.