🔬 Research summary by Wiebke Hutiri, a PhD candidate at Delft University of Technology, where she studies and develops responsible design practices for trustworthy AI
[Original paper by Wiebke (Toussaint) Hutiri, Aaron Yi Ding, Fahim Kawsar, Akhil Mathur]
Overview: On-device machine learning (ML) is used by billions of resource-constrained Internet of Things (IoT) devices – think smartwatches, mobile phones, smart speakers, emergency response, and health-tracking devices. This paper investigates how design choices during model training and optimization can lead to unequal predictive performance across gender groups and languages, resulting in reliability bias in device performance.
Introduction
Imagine emergency response systems, activated by voice recognition technology, that consistently ignore the high-pitched voices of women in distress while flawlessly responding to the lower-pitched commands of men. This scenario is a real concern in on-device machine learning (ML), where the resource-constrained settings of IoT devices result in design trade-offs to balance predictive accuracy, power consumption, and compute requirements. During development, engineers must make many decisions to address hardware limitations and meet specific operational requirements while managing the diversity of devices, users, and operating environments. Navigating these challenges successfully requires expertise in hardware, software engineering, and data processing techniques, as well as a deep understanding of the application context.
This research studies performance disparities in on-device ML workflows, exploring how design choices during model training and optimization can perpetuate unequal performance across gender groups and languages. Such disparities result in reliability bias: devices that are systematically more or less dependable for some user groups than for others. Through a series of empirical experiments on a keyword spotting task, the study uncovers how complex technical decisions related to the data sample rate, pre-processing parameters, model architecture, and pruning can amplify and propagate reliability bias. The findings highlight the importance of studying bias beyond cloud-based settings. The paper also offers low-effort strategies for engineers to mitigate such biases.
Key Insights
Overview of Design Choices in On-device Keyword Spotting
An audio keyword spotting system takes a raw speech signal as input and outputs the keyword(s) present in the signal from a set of predefined keywords. The speech signal is sampled and divided into overlapping frames using a sliding window approach, with parameters such as frame length, frame step, and window function specified for pre-processing. These frames are transformed into either log Mel spectrograms or Mel Frequency Cepstral Coefficients (MFCCs). The resulting frame-level features are concatenated, mean-normalized, and used to train a deep neural network classifier; the same pre-processing is applied to incoming audio at inference time. Optionally, the trained network can then be optimized with model compression techniques such as weight pruning.
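To make the pipeline concrete, below is a minimal sketch of this front-end using the librosa library. The frame length (25 ms), frame step (10 ms), and feature dimensions are illustrative defaults rather than the paper's exact settings, and the function name is our own.

```python
# Minimal sketch of the keyword spotting front-end described above (librosa).
# Frame length, frame step, and feature dimensions are illustrative defaults.
import numpy as np
import librosa

def extract_features(signal, sample_rate=16000, feature_type="mfcc",
                     frame_length_ms=25, frame_step_ms=10,
                     n_mels=40, n_mfcc=10):
    n_fft = int(sample_rate * frame_length_ms / 1000)       # frame length in samples
    hop_length = int(sample_rate * frame_step_ms / 1000)    # frame step in samples

    # Log Mel spectrogram: Mel-filtered power spectrum on a dB scale.
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sample_rate, n_fft=n_fft,
        hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)

    if feature_type == "log_mel":
        features = log_mel
    else:
        # MFCCs: discrete cosine transform of the log Mel spectrogram,
        # keeping the first n_mfcc cepstral coefficients.
        features = librosa.feature.mfcc(S=log_mel, sr=sample_rate, n_mfcc=n_mfcc)

    # Mean-normalize each coefficient across frames before the classifier.
    return features - features.mean(axis=1, keepdims=True)
```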
Reliability Bias Assumptions and Definition
We consider an on-device ML model a reliable device component for a user group if the group’s predictive performance equals the model’s overall predictive performance across all groups. If a model performs better or worse than average for a group, we consider it biased, meaning it favors or is prejudiced against that group. Both favoritism and prejudice increase reliability bias. We operationalize reliability bias with a measure that captures these definitions and penalizes favoritism and prejudice equally. Additionally, the measure scores models as being more or less biased while considering positive and negative prediction outcomes.
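As an illustration, one way to operationalize such a measure is to compare each group's performance with the overall performance on a log-ratio scale, so that favoritism and prejudice contribute symmetrically. The sketch below uses plain accuracy (which counts both positive and negative prediction outcomes) and an absolute log-ratio; the paper's exact formulation may differ, and the function name is hypothetical.

```python
# Illustrative (not the paper's exact) reliability bias score: the further a
# group's accuracy lies above or below overall accuracy, the larger the score.
import numpy as np

def reliability_bias(y_true, y_pred, groups):
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    overall = np.mean(y_true == y_pred)          # accuracy across all groups
    bias = 0.0
    for g in np.unique(groups):
        mask = groups == g
        group_acc = np.mean(y_true[mask] == y_pred[mask])
        # |log ratio| penalizes favoritism and prejudice equally.
        bias += abs(np.log(group_acc / overall))
    return bias
```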
Based on these assumptions and definitions, the study conducts empirical experiments on spoken keyword corpora in English, French, German, and Kinyarwanda. It considers performance disparities for male and female speakers across these languages. The results are presented next.
Bias Due to Design Choices During Model Training
Impact of Model Architecture Size and Sample Rate
Model accuracy is lower at lower sample rates and for lightweight architectures. The median and interquartile range (IQR) of reliability bias also tend to be greater at lower sample rates and for lightweight architectures. The direction of bias is strongly influenced by the training dataset. Overall, models favor male speakers. An exception is models trained on a Kinyarwanda language dataset, which have considerably lower accuracy and favor female speakers.
Impact of Pre-processing Parameters
Feature type and dimensions affect keyword spotting accuracy and reliability bias, and their effect is further influenced by the training dataset. In general, MFCC-type features perform better than log Mel spectrograms. However, they can also increase reliability bias, prejudicing models against female speakers and favoring male speakers. For MFCC features, fewer dimensions (i.e., cepstral coefficients and Mel filter banks) can reduce computational demands with a negligible impact on accuracy and reliability bias.
Bias Due to Design Choices During Model Optimization
Impact of Pruning Hyperparameters
Polynomial decay is a more robust pruning schedule than constant sparsity, and a larger pruning learning rate, like 0.001, reduces the likelihood of unintended bias and unexpected accuracy degradation. These design choices are particularly important when pruning models to sparsities greater than 50%, beyond which accuracy and reliability bias can deteriorate dramatically. The increase in reliability bias due to pruning is greater for smaller architectures and at lower sample rates. This trend is stronger at smaller learning rates. Training, validation, and test datasets must be large enough and representative across groups of users to ensure robust results and avoid bias.
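For reference, the sketch below shows how these recommendations might translate into a pruning setup with the TensorFlow Model Optimization toolkit. The step counts, target sparsity, loss, and the function name are illustrative assumptions; the paper's guidance is only to prefer a polynomial-decay schedule over constant sparsity, use a larger pruning learning rate (e.g., 0.001), and be cautious beyond 50% sparsity.

```python
# Sketch of magnitude pruning with a polynomial-decay schedule (tfmot).
import tensorflow as tf
import tensorflow_model_optimization as tfmot

def prune_keyword_spotter(model, train_ds, val_ds, end_step,
                          final_sparsity=0.5, learning_rate=1e-3):
    # Polynomial decay ramps sparsity gradually, which the study found more
    # robust than a constant-sparsity schedule.
    schedule = tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,
        final_sparsity=final_sparsity,   # sparsities > 0.5 risk sharp degradation
        begin_step=0,
        end_step=end_step)

    pruned = tfmot.sparsity.keras.prune_low_magnitude(
        model, pruning_schedule=schedule)

    pruned.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])

    pruned.fit(
        train_ds,
        validation_data=val_ds,
        callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

    # Remove pruning wrappers before deployment.
    return tfmot.sparsity.keras.strip_pruning(pruned)
```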
Strategies to Mitigate Reliability Bias
Model Selection after Training and Optimization
Engineers should use a multi-objective criterion that considers accuracy and reliability bias to select models with high accuracy and low bias after training or pruning. We propose that engineers set a tolerance that controls the acceptable drop in accuracy from the maximum value, thus using accuracy as a satisficing metric while minimizing reliability bias. The tolerance value should be determined from application requirements. If model training is followed by pruning, a few top models should be carried forward to pruning using both selection strategies: high accuracy alone, and low bias subject to high accuracy.
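A minimal sketch of this selection rule, assuming each candidate model is summarized by its accuracy and reliability bias score; the 0.02 tolerance and the field names are placeholders for application-specific values.

```python
# Satisficing selection: keep models within a tolerance of the best accuracy,
# then choose the one with the lowest reliability bias.
def select_model(candidates, tolerance=0.02):
    """candidates: list of dicts with keys 'model', 'accuracy', 'reliability_bias'."""
    best_acc = max(c["accuracy"] for c in candidates)
    satisficing = [c for c in candidates if c["accuracy"] >= best_acc - tolerance]
    return min(satisficing, key=lambda c: c["reliability_bias"])
```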
Supporting Design Decisions with Targeted Experimentation
Iterating over pre-processing and pruning parameters during model training and optimization can achieve high accuracy and low bias. Still, it comes at a cost of computational resources, time, and energy. To mitigate bias while considering computational costs, we propose a reduced set of design choice values based on our experiments that engineers should iterate over during model development. By targeting specific values, engineers can train fewer models and conduct limited training and pruning experiments. This data-driven approach to targeted experimentation provides a feasible strategy to select on-device ML models with good accuracy and low bias.
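As a toy illustration of targeted experimentation, the grid below constrains each design choice to a few values; the specific values are hypothetical placeholders loosely aligned with the findings above, not the paper's recommended set.

```python
# Reduced design space: constraining each choice to a few values keeps the
# number of trained models small. All values below are hypothetical examples.
from itertools import product

design_space = {
    "feature_type": ["mfcc"],                  # MFCCs generally outperformed log Mels
    "n_mfcc": [10, 13],                        # hypothetical reduced dimension choices
    "pruning_schedule": ["polynomial_decay"],  # preferred over constant sparsity
    "pruning_lr": [1e-3],                      # larger learning rate reduced bias risk
    "final_sparsity": [0.3, 0.5],              # stay at or below 50% sparsity
}

configs = [dict(zip(design_space, values))
           for values in product(*design_space.values())]
print(f"{len(configs)} configurations to train")  # 4 in this toy grid
```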
Between the lines
The prevalence of on-device ML applications in everyday consumer devices and the potential consequences of biased systems make it necessary to study bias in on-device ML. Where cloud-based studies of bias in ML abstract away resource considerations, hardware, compute, and power limitations pose real constraints in on-device ML settings. These constraints affect predictive performance and, as this paper shows, can lead to systematic performance variation across user groups, which the paper terms reliability bias. The findings show that careful design choices can mitigate reliability bias, underscoring the importance of responsible design practice and engineers' role in building fairer systems. While the study focuses on predictive accuracy, system efficiency should also be considered a variable across which systematic performance differences can occur. System efficiency impacts power consumption and device battery life, and reliability bias can lead to disparities along these dimensions. Future research should extend the study to different data modalities and investigate reliability bias in other learning tasks.