
Fairness implications of encoding protected categorical attributes

June 17, 2022

🔬 Research Summary by Carlos Mougan, a Ph.D. candidate at the University of Southampton within the Marie Skłodowska-Curie ITN NoBIAS.

[Original paper by Carlos Mougan, Jose M. Alvarez, Gourab K. Patro, Salvatore Ruggieri, and Steffen Staab]


Overview: Protected attributes (such as gender, religion, or race) are often presented as categorical features that must be encoded before being fed into an ML algorithm. How these attributes are encoded matters, because the encoding determines how the algorithm learns from the data and has a direct impact on both model performance and fairness. In this work, we investigate the accuracy and fairness implications of the two most well-known encoders: one-hot encoding and target encoding.


Introduction

Sensitive attributes are central to fairness, and so is their handling throughout the machine learning pipeline. Many machine learning algorithms require categorical attributes to be suitably encoded as numerical data before they can be used.

What are the implications of encoding categorical protected attributes?

Previous fairness work assumes the presence of sensitive attributes and, implicitly, a particular feature encoding for them. Given the range of available categorical encoding methods and the fact that they often must deal with sensitive attributes, we believe this first study on the subject to be highly relevant to the fair machine learning community.

What encoding method is best in terms of fairness? Can we improve fairness with encoding hyperparameters? Does having a fair model imply having a less performant ML model?

Key Insights

Types of induced bias

Two types of bias are induced when encoding categorical protected attributes. To illustrate them, we use the well-known Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) dataset:

  • Irreducible bias: the core unfairness problem that stems from the use of protected attributes themselves. It refers to (direct) group discrimination arising from the categorization of individuals into group labels: more data about the compared groups does not reduce this type of bias. In the COMPAS dataset, ethnicity was paramount in determining recidivism scores; the numerical encoding of large ethnicity groups such as African-American or Caucasian may lead to discrimination, an unfair effect stemming from irreducible bias.
  • Reducible bias: arises from the sampling variance incurred when encoding groups with small statistical representation, sometimes only a handful of instances. Reducible bias is introduced, for example, when encoding the ethnicity category Arabic, which is rarely represented in the data; the large sampling variance results in an almost random, unrealistic encoding (see the sketch after this list).
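
To make the sampling-variance mechanism behind reducible bias concrete, here is a minimal, self-contained sketch. The numbers are made up (not the COMPAS data): two groups share the same true outcome rate, yet the target-encoded value of the small group swings widely from sample to sample.

```python
# Sketch (illustrative data): why rare categories get unstable target encodings.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: both groups have the same true outcome rate,
# so any gap between their encoded values is pure sampling noise.
true_rate = 0.45
n_large, n_small = 3000, 8   # a well-represented group vs. a rarely seen one

encodings_small = []
for _ in range(1000):
    # Target encoding replaces a category by the mean outcome of its members.
    y_small = rng.binomial(1, true_rate, size=n_small)
    encodings_small.append(y_small.mean())

y_large = rng.binomial(1, true_rate, size=n_large)
print("large group encoding :", round(y_large.mean(), 3))
print("small group encodings: mean={:.3f}, std={:.3f}".format(
    np.mean(encodings_small), np.std(encodings_small)))
# The small group's encoded value fluctuates heavily across samples;
# this sampling variance is what drives reducible bias.
```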

Encoding methods

Handling categorical features is a common problem in machine learning, given that many algorithms need to be fed numerical data. We review two of the most well-known traditional methods:

Figure: An illustrative example of a categorical feature that needs to be encoded.

  • One-hot encoding: the most established encoding method for categorical features and the default method within the fairness literature. It constructs orthogonal, equidistant binary vectors, one per category (see the short sketch after the figure below).

Figure: One-hot encoding applied to the illustrative example.
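
As a quick illustration of what one-hot encoding produces, here is a minimal sketch using pandas. The column name and category values are illustrative, not taken from the paper's experiments.

```python
# Sketch (illustrative column): one-hot encoding turns each category
# into its own binary column, making all categories orthogonal and equidistant.
import pandas as pd

X = pd.DataFrame({"ethnicity": ["African-American", "Caucasian",
                                "Arabic", "Caucasian"]})

print(pd.get_dummies(X, columns=["ethnicity"]))
```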

  • Target encoding: each categorical value is replaced with the mean target value of its category. This technique handles high-cardinality categorical data and yields an ordering of the categories. Its main drawback appears when categories with few samples are replaced by values close to the desired target: the model over-trusts the target-encoded feature and becomes prone to overfitting and reducible bias.

Furthermore, this type of encoding admits a regularization hyperparameter: Gaussian noise can be added to the encoded value of each category (see the sketch after the figure below).

Figure: Target encoding applied to the illustrative example.
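
The following is a minimal, hand-rolled sketch of target encoding with Gaussian-noise regularization; the data, column names, and the `target_encode` helper are illustrative (libraries such as category_encoders provide production implementations).

```python
# Sketch (illustrative data): target encoding with optional Gaussian-noise
# regularization controlled by `sigma`.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

df = pd.DataFrame({
    "ethnicity": ["Caucasian", "African-American", "Caucasian",
                  "Arabic", "African-American", "Caucasian"],
    "recidivism": [0, 1, 1, 1, 0, 0],   # hypothetical binary target
})

def target_encode(frame, column, target, sigma=0.0):
    """Replace each category by its mean target value, optionally
    perturbed by Gaussian noise with standard deviation `sigma`."""
    means = frame.groupby(column)[target].mean()
    encoded = frame[column].map(means)
    if sigma > 0:
        encoded = encoded + rng.normal(0.0, sigma, size=len(encoded))
    return encoded

df["ethnicity_te"] = target_encode(df, "ethnicity", "recidivism")
df["ethnicity_te_reg"] = target_encode(df, "ethnicity", "recidivism", sigma=0.05)
print(df)
# Larger sigma blurs the per-category means, reducing over-trust in rare
# categories at the cost of some predictive signal.
```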

Figure: Comparing one-hot encoding and target-encoding regularization (Gaussian noise) for logistic regression on the COMPAS dataset. The protected group is African-American; the reference group is Caucasian. Red dots correspond to different regularization parameters (the darker the red, the higher the regularization); the blue dot corresponds to one-hot encoding.

In the above experiment, we show that the most-used categorical encoding method in the fair machine learning literature, one-hot encoding, discriminates more in terms of equal opportunity fairness than target encoding, which shows promising results. Target encoding with Gaussian regularization improves fairness in the presence of both types of bias, with the risk of a noticeable loss of model performance when the regularization is too strong.
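
For readers who want to reproduce this kind of comparison, the fairness notion above can be computed as the gap in true-positive rates between the protected and reference groups. Below is a minimal sketch of that equal-opportunity gap on toy predictions; the data, labels, and helper functions are illustrative, not the paper's experiment.

```python
# Sketch (illustrative data): equal-opportunity gap = TPR(protected) - TPR(reference).
import numpy as np

def true_positive_rate(y_true, y_pred):
    # Fraction of actual positives that the model predicts as positive.
    positives = y_true == 1
    return (y_pred[positives] == 1).mean()

def equal_opportunity_gap(y_true, y_pred, group, protected, reference):
    prot, ref = group == protected, group == reference
    return (true_positive_rate(y_true[prot], y_pred[prot])
            - true_positive_rate(y_true[ref], y_pred[ref]))

# Toy predictions from some classifier trained on encoded features.
group  = np.array(["African-American"] * 4 + ["Caucasian"] * 4)
y_true = np.array([1, 1, 0, 1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 1])

print(equal_opportunity_gap(y_true, y_pred, group,
                            protected="African-American", reference="Caucasian"))
```

A gap close to zero means both groups' true positives are detected at similar rates; the experiment above compares how far each encoder pushes this gap from zero.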

Between the lines

In recent years we have seen algorithmic methods aiming to improve fairness in data-driven systems from many perspectives: data collection, pre-processing, in-processing, and post-processing steps. In this work, we have focused on how the encoding of categorical attributes (a common pre-processing step) can reconcile model quality and fairness. 

A common underpinning of much of the work in fair ML is the assumption that trade-offs between equity and accuracy may necessitate complex methods or difficult policy choices [Rodolfa et al.].

Since target encoding with regularization is easy to perform and does not require significant changes to the machine learning model, it could be explored in the future as a suitable complement to in-processing methods in fair machine learning.

Acknowledgments

This work has received funding from the European Union's Horizon 2020 research and innovation program under Marie Skłodowska-Curie Actions (grant agreement number 860630) for the project "NoBIAS – Artificial Intelligence without Bias".

Disclaimer

This work reflects only the authors' views, and the European Research Executive Agency (REA) is not responsible for any use that may be made of the information it contains.

