On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models

July 26, 2023

🔬 Research Summary by Yiting Qu, a Ph.D. student at CISPA Helmholtz Center for Information Security interested in AI Ethics and Safety.

[Original paper by Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang]


Overview: Text-to-Image models are revolutionizing the way people generate images. However, they also pose significant risks of generating unsafe images. This paper systematically evaluates the potential of these models to generate unsafe images and, in particular, hateful memes.


Introduction

Text-to-Image models are generative models developed to generate images from textual descriptions. Simply provide them with a piece of text (a prompt), and they will produce images illustrating that prompt within seconds. However, these models can also be used in unexpected ways. Take Unstable Diffusion as an example: a profit-seeking community that uses these models to automatically generate pornographic images. This raises the question: what if a malicious user misuses these models and scales up the generation of unsafe images? Editor’s note: check out this MAIEI article highlighting some ethical issues and challenges with Unstable Diffusion.

To answer this question, we systematically evaluate the susceptibility of four popular Text-to-Image models to unsafe image generation. We collect three harmful prompt sets and one harmless prompt set, which are then fed into the Text-to-Image models for image generation. We then assess the percentage of unsafe images among all generated images. The results show that 15.83% – 50.56% of the generated images are unsafe when the models are intentionally misled with harmful prompts. Even with harmless prompts, there is still a small possibility (0.5%) of generating unsafe content.
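
To make this headline metric concrete, here is a minimal sketch of how the unsafe-image rate can be computed once each generated image has been labeled by a safety classifier. The record fields and function names are illustrative and not taken from the paper’s code.

```python
# Minimal sketch: compute the fraction of unsafe images per (model, prompt set),
# given per-image safety labels produced by some classifier.
from collections import defaultdict

def unsafe_rates(records):
    """records: iterable of dicts like
    {"model": "stable-diffusion", "prompt_set": "4chan", "unsafe": True}"""
    counts = defaultdict(lambda: [0, 0])  # (model, prompt_set) -> [unsafe, total]
    for r in records:
        key = (r["model"], r["prompt_set"])
        counts[key][0] += int(r["unsafe"])
        counts[key][1] += 1
    return {key: unsafe / total for key, (unsafe, total) in counts.items()}

# Toy example: 3 unsafe images out of 20 -> a 15% unsafe rate for that pair.
demo = [{"model": "stable-diffusion", "prompt_set": "4chan", "unsafe": i < 3}
        for i in range(20)]
print(unsafe_rates(demo))  # {('stable-diffusion', '4chan'): 0.15}
```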

Next, we zoom into a specific type of unsafe image: hateful memes. We envision a scenario where a malicious user creates variants of hateful memes from existing hateful symbols to attack specific individuals or communities. We evaluate the potential of Stable Diffusion, combined with image editing techniques, to generate these hateful meme variants. We find that Stable Diffusion has a success rate of up to 24% in generating hateful meme variants that could be directly used to attack others, and the quality and connotations of these AI-generated memes are comparable to variants manually drawn in the real world.

Key Insights

The Susceptibility of Text-to-Image Models in Unsafe Image Generation

The scope of unsafe images is broad and ambiguous. To capture the prominent unsafe content in AI-generated images, we adopt a data-driven approach and identify five categories of unsafe images: sexually explicit, violent, disturbing, hateful, and political. Within this defined scope, we assess the susceptibility of four popular Text-to-Image models (Stable Diffusion, Latent Diffusion, DALL-E 2, and DALL-E mini) using diverse sets of prompts.

Three harmful prompt sets are utilized to test the worst-case scenario in which the models are intentionally exploited to generate unsafe content. These sets are derived from two main sources: 4chan and the Lexica website. We also include a clean prompt set of randomly sampled MS-COCO captions that describe normal objects. Using these prompts and the four Text-to-Image models, we generate 17K images. We then develop an image safety classifier that labels each generated image as safe or unsafe and, if unsafe, specifies the category of unsafe content.
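
A minimal sketch of this generate-then-classify pipeline is shown below, using the Hugging Face diffusers and transformers libraries. The model IDs and prompt are assumptions, and the zero-shot CLIP scoring is only an illustrative stand-in for the authors’ purpose-trained image safety classifier.

```python
# Sketch of the pipeline: generate an image from a prompt, then assign it a
# safety label. NOTE: the zero-shot CLIP scorer below is an illustrative
# stand-in, not the classifier trained by the authors.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

CATEGORIES = ["sexually explicit", "violent", "disturbing", "hateful", "political", "safe"]

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.safety_checker = None  # audit the raw model output, as the study does

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify(image):
    """Zero-shot scoring against the five unsafe categories plus 'safe'."""
    inputs = processor(text=[f"a {c} image" for c in CATEGORIES],
                       images=image, return_tensors="pt", padding=True)
    probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    return CATEGORIES[int(probs.argmax())]

prompts = ["a photograph of a dog sitting in a park"]  # stand-in for the prompt sets
for prompt in prompts:
    image = pipe(prompt, num_inference_steps=30).images[0]
    print(prompt, "->", classify(image))
```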

We have discovered that, on average, the four models have a 14.56% probability of generating unsafe images. Among them, Stable Diffusion is the most susceptible, with 18.92% of all generated images classified as unsafe; DALL-E 2 is the safest model, with only a 7.16% probability. However, this probability can increase significantly when intentionally provided with harmful prompts, ranging from 15.83% to 50.56% across the four models. What’s more concerning is that even when using a clean prompt set that describes normal objects, the models still have a 0.5% chance of generating unsafe images. Considering the large number of users and the frequent use of these models in daily life, these results indicate a significant risk of these models exacerbating the problem of unsafe image generation.

The Potential of Text-to-Image Models in Hateful Meme Generation

Hateful memes pose a serious threat to online safety. In the real world, if a malicious user aims to attack specific individuals or communities via memes, one possible way is to find a notorious hateful meme, such as Pepe the Frog, and manually merge it with their target using Photoshop. The merged meme is referred to as a hateful meme variant. What would be the consequences if a malicious user could scale up the generation of hateful meme variants against a list of targets?

We develop an evaluation process focusing on Stable Diffusion and three image editing methods: DreamBooth, Textual Inversion, and SDEdit. We use a real-world dataset containing hateful meme variants (manually drawn by humans) and their corresponding targets. The evaluation process starts with automatically generating prompts that describe how the targets are presented in the hateful meme variants within this dataset. Next, we apply the three image editing methods on top of Stable Diffusion and feed in the designed prompts to generate variants. Finally, we compare the quality of the generated and original variants using multiple assessment metrics.
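
Of the three editing methods, SDEdit is the simplest to illustrate: it adds noise to an existing image and then denoises it under a new prompt, which in the diffusers library corresponds to the image-to-image pipeline. A minimal sketch follows; the model ID, input file, and prompt are placeholders, and DreamBooth and Textual Inversion (which require fine-tuning or learned embeddings) are not shown.

```python
# Minimal sketch of SDEdit-style editing with Stable Diffusion: start from an
# existing image, add noise, then denoise it under a new prompt. File names and
# the prompt are placeholders.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("source_image.png").convert("RGB").resize((512, 512))

# `strength` controls how much noise is added: higher values follow the prompt
# more closely and preserve less of the original image.
edited = pipe(
    prompt="<description of how the edited image should look>",
    image=init_image,
    strength=0.6,
    guidance_scale=7.5,
).images[0]
edited.save("edited_image.png")
```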

Our research finds that 24% of the hateful meme variants generated by Stable Diffusion are successful and could be used to attack the targets directly. The generated variants, especially those produced with DreamBooth, have image quality comparable to real-world hateful meme variants. This is extremely concerning: malicious users could launch an online hate campaign by producing many such variants.

Between the lines

The risk of Text-to-Image models generating unsafe images should not be overlooked. Our research finds a significant likelihood of generating unsafe images, including sexually explicit, violent, disturbing, hateful, and political images. A particular concern is the automatic generation of hateful memes, which requires only a piece of text as input and thus greatly reduces the cost compared to manual drawing. Considering the role of hateful memes in historical hate campaigns, we are concerned that current open-source Text-to-Image models might be exploited for the massive production of hateful memes and serve as a powerful weapon in orchestrated hate campaigns.

To mitigate these risks, model developers can make improvements in three areas: 1) rigorously curating the training datasets of these models to minimize the proportion of unsafe images; 2) regulating user prompts and filtering out unsafe keywords for models deployed online; and 3) training more accurate image safety detectors that flag unsafe images, particularly hateful memes.
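
As a rough illustration of the second point, a deployed service could screen prompts before they ever reach the model. The sketch below uses a placeholder blocklist; real deployments would need normalization, fuzzy matching, and learned classifiers, since exact-keyword filters are easy to evade.

```python
# Minimal sketch of prompt screening: reject a prompt if it contains any
# blocklisted keyword. The blocklist terms are placeholders.
import re

BLOCKLIST = {"example_unsafe_term", "another_unsafe_term"}

def is_prompt_allowed(prompt: str) -> bool:
    """Return True if no blocklisted keyword appears in the prompt."""
    tokens = set(re.findall(r"[a-z0-9']+", prompt.lower()))
    return tokens.isdisjoint(BLOCKLIST)

def generate_safely(prompt: str, generate):
    """Call `generate(prompt)` only if the prompt passes the keyword screen."""
    if not is_prompt_allowed(prompt):
        raise ValueError("Prompt rejected by safety filter.")
    return generate(prompt)
```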
