🔬 Research Summary by Christopher Teo, PhD, Singapore University of Technology and Design (SUTD).
[Original paper by Christopher T.H Teo, Milad Abdollahzadeh, Xinda Ma, Ngai-man Cheung]
Note: This paper, FairQueue: Rethinking Prompt Learning for Fair Text-to-Image Generation, will be presented at NeurIPS 2024 in Vancouver, Canada. It explores advancements in fair text-to-image diffusion models and contributes to the growing body of research on Fair Gen AI in computer vision.
Read: On Measuring Fairness in Generative Modelling (NeurIPS 2023)
Overview: This paper introduces FairQueue, a novel framework for achieving high-quality and fair text-to-image (T2I) generation. Existing T2I models, such as Stable Diffusion, are biased, and the state-of-the-art approach to mitigating this bias suffers from quality degradation. We propose FairQueue, which incorporates two key strategies, Prompt Queuing and Attention Amplification, to address these issues, achieving outstanding image quality, semantic preservation, and competitive fairness.
Introduction
Generative AI models, especially text-to-image (T2I) systems, have reshaped industries, enabling applications from creative arts to personalized content. However, traditional hard prompts—like “a headshot of a smiling person”—often fail to achieve balanced sensitive attribute (SA) distributions, such as gender or ethnicity, due to linguistic ambiguity.
The current state-of-the-art (SOTA) method, ITI-GEN, introduced a novel prompt learning approach to address these shortcomings. Instead of relying solely on hard prompts, ITI-GEN leverages reference images to learn inclusive prompts tailored to specific SA categories. By aligning the embeddings of prompts and reference images, ITI-GEN seeks to ensure fair representation. However, this method has its limitations: learned prompts often distort generated outputs, resulting in reduced image quality and semantic inconsistencies.
This paper introduces FairQueue, a framework designed to overcome these issues while maintaining competitive fairness. By stabilizing early denoising steps with Prompt Queuing and enhancing SA representation through Attention Amplification, FairQueue significantly improves image quality, semantic preservation, and fairness consistency. Extensive experiments on diverse datasets highlight its advantages over ITI-GEN, marking a step forward for fair and high-quality generative AI systems.
Fairness in Generative Models
Fairness in generative AI requires outputs to represent sensitive attributes (SAs), such as gender, race, and age, in a balanced way. For instance, when generating images from prompts like “a person,” the outputs should not disproportionately depict one gender or ethnic group over another. Fair representation ensures inclusivity and avoids reinforcing societal biases.
Limitations of Hard Prompts
Hard prompts, such as appending SA-related phrases (“with pale skin”) to a base prompt (“a person”), are an intuitive method for achieving fairness. However, these prompts often fail to generate balanced outputs because of linguistic ambiguity inherent to T2I models. For example, the model does not easily differentiate terms like “smiling” and “not smiling,” resulting in biased generated samples.
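As a quick illustration of this ambiguity, the following is a minimal sketch (our own illustration, not an experiment from the paper) that measures how close the CLIP text embeddings of two opposing attribute prompts are; the model checkpoint and prompts are illustrative assumptions:

```python
# Minimal sketch: negated attribute phrases can map to nearly identical
# CLIP text embeddings, hinting at why hard prompts stay biased.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a headshot of a person smiling",
           "a headshot of a person not smiling"]
inputs = processor(text=prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)

# A cosine similarity near 1.0 suggests the text encoder barely separates
# the two attribute categories.
print((emb[0] @ emb[1]).item())
```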
Analyzing the Existing State-of-the-Art Prompt Learning Approach: ITI-GEN
ITI-GEN sought to address these limitations by introducing a prompt learning approach guided by reference images. Specifically, instead of relying solely on textual descriptions, ITI-GEN aligns the embeddings of reference images and a learned inclusive prompt (learnable tokens combined with the original base prompt) in a shared CLIP space. This directional alignment aims to capture nuanced SA representations and improve fairness.
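The directional-alignment idea can be sketched as a simple loss; the following is a simplified illustration rather than ITI-GEN's actual implementation, and the embedding tensors are assumed to come from a CLIP text and image encoder:

```python
# Hedged sketch of directional alignment between prompt pairs and their
# reference-image sets (not the authors' code).
import torch
import torch.nn.functional as F

def directional_alignment_loss(prompt_emb_a, prompt_emb_b,
                               img_emb_a, img_emb_b):
    """Align the text-space direction between two learned inclusive prompts
    with the image-space direction between their reference-image sets.
    img_emb_* have shape (num_references, dim)."""
    text_dir = F.normalize(prompt_emb_a - prompt_emb_b, dim=-1)
    img_dir = F.normalize(img_emb_a.mean(0) - img_emb_b.mean(0), dim=-1)
    return 1.0 - (text_dir * img_dir).sum(-1)  # cosine distance
```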
While ITI-GEN showed significant progress, it has notable drawbacks. Specifically, our extensive analysis found that the reference images introduce unrelated concepts into the learned prompts. This degrades the generated samples, producing distorted faces and irrelevant elements, e.g., cartoonish styles.
Our further analysis of the cross-attention maps (which we term H2I and I2H) reveals that this is because the learned tokens are distorted in the early denoising steps, leading to incomplete or inconsistent global structures in the generated images.
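For readers who want to run this kind of probe themselves, below is a minimal sketch, assuming diffusers' attention-processor API, of how cross-attention probabilities can be logged during denoising; it mirrors diffusers' default processor and is not our exact analysis code:

```python
# Hedged sketch: a custom attention processor that records cross-attention
# probabilities (image patches vs. text tokens) at every call.
import torch

class CrossAttnLogger:
    def __init__(self, store):
        self.store = store  # list filled with per-call attention maps

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        context = encoder_hidden_states if is_cross else hidden_states
        q = attn.head_to_batch_dim(attn.to_q(hidden_states))
        k = attn.head_to_batch_dim(attn.to_k(context))
        v = attn.head_to_batch_dim(attn.to_v(context))
        probs = attn.get_attention_scores(q, k, attention_mask)
        if is_cross:
            # Shape: (batch * heads, image_patches, text_tokens)
            self.store.append(probs.detach().cpu())
        out = attn.batch_to_head_dim(torch.bmm(probs, v))
        return attn.to_out[1](attn.to_out[0](out))  # linear, then dropout

# Usage (assumed): maps = []; pipe.unet.set_attn_processor(CrossAttnLogger(maps))
```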
Proposed Solution: FairQueue
FairQueue introduces two key strategies, sketched in code after this list:
- Prompt Queuing: To address early-stage degradation, FairQueue uses base prompts without SA-specific tokens in the initial denoising steps, allowing the model to form stable global structures. ITI-GEN prompts are then introduced in later stages to refine SA-specific details.
- Attention Amplification: By scaling the attention weights of SA tokens during the later denoising steps, FairQueue enhances SA expression without sacrificing image quality or semantic coherence.
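The following simplified sketch shows how the two strategies slot into a denoising loop. Here `unet`, `scheduler`, the prompt embeddings, and the `scale_token_attention` hook are hypothetical stand-ins (the hook would rescale cross-attention logits at the SA-token positions), not our released implementation; a real pipeline would also add classifier-free guidance and VAE decoding:

```python
# Hedged sketch of Prompt Queuing + Attention Amplification in a
# diffusers-style denoising loop.
import torch

@torch.no_grad()
def fairqueue_denoise(unet, scheduler, latents, base_emb, itigen_emb,
                      sa_token_ids, num_steps=50, switch_frac=0.3, amp=2.0):
    scheduler.set_timesteps(num_steps)
    switch_step = int(switch_frac * num_steps)
    for i, t in enumerate(scheduler.timesteps):
        if i < switch_step:
            # Prompt Queuing: condition on the base prompt first so the
            # global structure forms without distorted learned tokens.
            cond = base_emb
        else:
            # Later steps: the ITI-GEN prompt refines SA-specific details,
            # with Attention Amplification boosting SA-token attention.
            cond = itigen_emb
            scale_token_attention(unet, token_ids=sa_token_ids, scale=amp)
        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```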
Why These Innovations Matter
- Stabilizing Early Denoising: By deferring the use of learned prompts, FairQueue avoids the disruptions caused by distorted tokens in the critical early stages of image synthesis.
- Enhancing Fine-Grained Control: Attention Amplification ensures that SA-specific details are effectively incorporated, preserving fairness while improving image clarity and fidelity.
Between the lines
FairQueue’s innovations matter because they address a critical gap in T2I generation: balancing fairness with quality and semantic coherence. By identifying and tackling the root causes of ITI-GEN’s limitations, FairQueue demonstrates that fairness need not come at the expense of quality—a key consideration for ethical AI deployment.
However, challenges remain. Current approaches still rely on predefined sensitive attributes, limiting their applicability to real-world contexts where attributes are fluid or intersectional.
Read: On Measuring Fairness in Generative Modelling (NeurIPS 2023)