• Skip to main content
  • Skip to secondary menu
  • Skip to primary sidebar
  • Skip to footer
Montreal AI Ethics Institute

Montreal AI Ethics Institute

Democratizing AI ethics literacy

  • Articles
    • Public Policy
    • Privacy & Security
    • Human Rights
      • Ethics
      • JEDI (Justice, Equity, Diversity, Inclusion
    • Climate
    • Design
      • Emerging Technology
    • Application & Adoption
      • Health
      • Education
      • Government
        • Military
        • Public Works
      • Labour
    • Arts & Culture
      • Film & TV
      • Music
      • Pop Culture
      • Digital Art
  • Columns
    • AI Policy Corner
    • Recess
    • Tech Futures
  • The AI Ethics Brief
  • AI Literacy
    • Research Summaries
    • AI Ethics Living Dictionary
    • Learning Community
  • The State of AI Ethics Report
    • Volume 7 (November 2025)
    • Volume 6 (February 2022)
    • Volume 5 (July 2021)
    • Volume 4 (April 2021)
    • Volume 3 (Jan 2021)
    • Volume 2 (Oct 2020)
    • Volume 1 (June 2020)
  • About
    • Our Contributions Policy
    • Our Open Access Policy
    • Contact
    • Donate

Breaking Fair Binary Classification with Optimal Flipping Attacks

November 7, 2022

🔬 Research Summary by Changhun Jo, Jy-yong Sohn, Kangwook Lee.

Changhun Jo is a PhD candidate at University of Wisconsin-Madison, working on algorithmic fairness, social recommender systems, and machine learning.
Jy-yong Sohn is a post-doctoral researcher at University of Wisconsin-Madison, and he is interested in the intersection of machine learning, information theory, and distributed algorithms.
Kangwook Lee is an Assistant Professor at University of Wisconsin-Madison, and his research interests lie in machine learning, deep learning, information theory, and coding.

[Original paper by Changhun Jo, Jy-yong Sohn, Kangwook Lee]


Overview: Fair classification has been widely studied since current machine learning (ML) systems tend to make biased decisions across different groups, e.g., race, gender, and age. We especially focus on fair classification under corrupted data, motivated by the fact that the web-scale data we use for training models is potentially poisoned by random or adversarial noise. This paper studies the minimum amount of data corruption required for a successful attack against a fairness-aware learner.


Introduction

It has been observed that current ML systems are making unfair decisions for different groups of people depending on their race, gender, age, etc. For example, according to the experimental results reported by AlgorithmWatch (a non-profit research organization) in 2020, a thermometer held by a dark-skinned hand was labeled as a gun by Google Vision Cloud, while the same thermometer held by a light-skinned hand was labeled as a monocular. This motivated various attempts to develop fair classifiers in recent years. Meanwhile, various ML pipelines use crowdsourced training data, potentially containing random or adversarial noise. Thus, it is important to study the effect of data poisoning on ML frameworks. 

In this work, we ask the following question: “What is the minimum amount of data poisoning needed to let a fairness-aware learner output a predefined target classifier?” Focusing on the setting where the attacker can manipulate the labels and sensitive attributes (race, gender, etc.) of data samples, we provide information-theoretic analysis on the minimum amount of data poisoning required for the classifier learned on the poisoned data distribution, becoming a target classifier. Inspired by this theoretic result, we also propose a computationally efficient data poisoning attack algorithm that can compromise the performance of a fairness-aware learner.

Key Insights

Problem setup

Fairness-aware learner

We consider a learner that finds a classifier that achieves the minimum risk among fair classifiers. We use a popular fairness notion called equal opportunity, which aims at equalizing the true positive rates for different groups.

Attacker

The attacker’s goal is to make the learner output the target classifier, by using a minimum amount of data perturbation. The attacker can change (or flip, since we consider a binary setup) labels and sensitive attributes. This attack is dubbed a “flipping attack” in our paper.

Information-Theoretic Results on Minimum Perturbation

We find the lower/upper bounds of the minimum amount of data perturbation required for a successful flipping attack. The lower bound can be obtained by finding the minimum amount of data perturbation required for making the target classifier look perfectly fair on the poisoned distribution. The upper bound can be obtained by constructing an explicit successful data poisoning scheme via a two-stage attack algorithm. In the first stage, the attacker flips labels to get an intermediate distribution on which the target classifier achieves the minimum risk. In the second stage, the attacker flips sensitive attributes to find a distribution on which the target classifier looks perfectly fair. Interestingly, these bounds are tight when the target classifier is the risk minimizer. 

Sensitive Attribute Flipping Attack Algorithm

Inspired by our theoretical finding, we propose a computationally efficient data poisoning attack algorithm. Specifically, we set the empirical risk minimizer, which is generally unfair, as a target classifier and apply the empirical counterpart of the second stage of our two-stage attack algorithm. Note that this attack algorithm flips only sensitive attributes without perturbing labels. We prove that this attack algorithm makes the empirical risk minimizer look almost fair on the poisoned distribution. Therefore, our attack algorithm increases the chance of the empirical risk minimizer being found by the fair learner, thereby degrading the fairness of the learning algorithm.

Experimental Results

For the convenience of presentation, we use Y and Z to denote the label and sensitive attribute, respectively. We compare our sensitive attribute flipping algorithm with four data poisoning attack algorithms:
(1) random Y-flip attack chooses random samples from the training set and flips Y values
(2) random Z-flip attack chooses random samples and flips Z values
(3) random Y&Z-flip attack chooses random samples and flips both Y and Z values
(4) adversarial sampling (AS) attack chooses adversarial samples from the feasible attack set using the online gradient descent algorithm proposed in [1] and adds them to the training set. 

We evaluate these attacks against three fair learning algorithms: 

(1) fair classification using fairness constraints (FC) [2]
(2) fair training against adversarial perturbations (Err-Tol) [3]
(3) fair and robust training (FR-Train) [4] given a clean validation set.

[Table 1. Comparison with other baseline attack algorithms]

Shown in Table 1 is the performance of attack algorithms against fair learning algorithms on a synthetic dataset. When the learner runs fair learning algorithms on the uncorrupted dataset, the unfairness significantly decreases at the cost of degraded accuracy, exhibiting a well-known tradeoff between accuracy and fairness. One can observe that by only manipulating 3.2% of data, our attack makes the output significantly unfair, outperforming (or achieving comparable attack performances with) other attack baselines. Interestingly, our attack successfully degrades the fairness of robust, fair training algorithms; Err-Tol and FR-Train.

Between the lines

The theoretical results in our paper provide insights into how much fairness-aware learners are robust against data poisoning attacks and what is the optimal way of fooling fairness-aware learners. According to our empirical results, manipulating only 3% of data is enough to let a fairness-aware learner output a significantly unfair model. This implies that the system designer should be careful about acquiring clean data if we want to guarantee that the learner outputs a fair classifier.

Our paper opens up interesting research questions: (1) tightening the lower/upper bounds for general target classifier, (2) extending the result to cases when labels and sensitive attributes are non-binary, (3) extending our attack algorithm into the federated learning setup where the effect of data poisoning in one adversarial device is degraded by the process of aggregating models from multiple honest devices. 

References

[1] Hongyan Chang, Ta Duy Nguyen, Sasi Kumar Murakonda, Ehsan Kazemi, and Reza Shokri. On adversarial bias and the robustness of fair machine learning. arXiv:2006.08669, 2020.

[2] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rogriguez, and Krishna P. Gummadi. Fairness Constraints: Mechanisms for Fair Classification. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[3] L. Elisa Celis, Anay Mehrotra, and Nisheeth K Vishnoih. Fair classification with adversarial perturbations. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[4] Yuji Roh, Kangwook Lee, Steven Whang, and Changho Suh. FR-train: A mutual information-based approach to fair and robust training. In International Conference on Machine Learning (ICML), 2020.

Want quick summaries of the latest research & reporting in AI ethics delivered to your inbox? Subscribe to the AI Ethics Brief. We publish bi-weekly.

Primary Sidebar

🔍 SEARCH

Spotlight

A network diagram with lots of little emojis, organised in clusters.

Tech Futures: AI For and Against Knowledge

A brightly coloured illustration which can be viewed in any direction. It has many elements to it working together: men in suits around a table, someone in a data centre, big hands controlling the scenes and holding a phone, people in a production line. Motifs such as network diagrams and melting emojis are placed throughout the busy vignettes.

Tech Futures: The Fossil Fuels Playbook for Big Tech: Part II

A rock embedded with intricate circuit board patterns, held delicately by pale hands drawn in a ghostly style. The contrast between the rough, metallic mineral and the sleek, artificial circuit board illustrates the relationship between raw natural resources and modern technological development. The hands evoke human involvement in the extraction and manufacturing processes.

Tech Futures: The Fossil Fuels Playbook for Big Tech: Part I

Close-up of a cat sleeping on a computer keyboard

Tech Futures: The threat of AI-generated code to the world’s digital infrastructure

The undying sun hangs in the sky, as people gather around signal towers, working through their digital devices.

Dreams and Realities in Modi’s AI Impact Summit

related posts

  • Meet the inaugural cohort of the MAIEI Summer Research Internship!

    Meet the inaugural cohort of the MAIEI Summer Research Internship!

  • Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models

  • Visions of Artificial Intelligence and Robots in Science Fiction: a computational analysis

    Visions of Artificial Intelligence and Robots in Science Fiction: a computational analysis

  • Computer vision and sustainability

    Computer vision and sustainability

  • Research summary: Beyond a Human Rights Based Approach To AI Governance: Promise, Pitfalls and Plea

    Research summary: Beyond a Human Rights Based Approach To AI Governance: Promise, Pitfalls and Plea

  • Using Pre-Trained Language Models for Producing Counter Narratives Against Hate Speech: a Comparativ...

    Using Pre-Trained Language Models for Producing Counter Narratives Against Hate Speech: a Comparativ...

  • Sex Trouble: Sex/Gender Slippage, Sex Confusion, and Sex Obsession in Machine Learning Using Electro...

    Sex Trouble: Sex/Gender Slippage, Sex Confusion, and Sex Obsession in Machine Learning Using Electro...

  • Design Principles for User Interfaces in AI-Based Decision Support Systems: The Case of Explainable ...

    Design Principles for User Interfaces in AI-Based Decision Support Systems: The Case of Explainable ...

  • Should you make your decisions on a WhIM? Data-driven decision-making using a What-If Machine for Ev...

    Should you make your decisions on a WhIM? Data-driven decision-making using a What-If Machine for Ev...

  • Public Strategies for Artificial Intelligence: Which Value Drivers?

    Public Strategies for Artificial Intelligence: Which Value Drivers?

Partners

  •  
    U.S. Artificial Intelligence Safety Institute Consortium (AISIC) at NIST

  • Partnership on AI

  • The LF AI & Data Foundation

  • The AI Alliance

Footer


Articles

Columns

AI Literacy

The State of AI Ethics Report


 

About Us


Founded in 2018, the Montreal AI Ethics Institute (MAIEI) is an international non-profit organization equipping citizens concerned about artificial intelligence and its impact on society to take action.

Contact

Donate


  • © 2025 MONTREAL AI ETHICS INSTITUTE.
  • This work is licensed under a Creative Commons Attribution 4.0 International License.
  • Learn more about our open access policy here.
  • Creative Commons License

    Save hours of work and stay on top of Responsible AI research and reporting with our bi-weekly email newsletter.