Montreal AI Ethics Institute



Self-Improving Diffusion Models with Synthetic Data

February 3, 2025

🔬 Research Summary by Sina Alemohammad, a PhD candidate at Rice University with a focus on the interaction between generative models and synthetic data.

[Original paper by Sina Alemohammad, Ahmed Imtiaz Humayun, Shruti Agarwal, John Collomosse, Richard Baraniuk]


Overview: The increasing reliance on synthetic data to train generative models risks creating a feedback loop that degrades model performance and biases outputs. This paper introduces Self-IMproving diffusion models with Synthetic data (SIMS), a novel approach to utilize synthetic data effectively without incurring Model Autophagy Disorder (MAD) or model collapse, setting new performance benchmarks and addressing biases in data distributions.


Introduction

In a world where generative artificial intelligence (AI) is transforming industries, the availability of quality data for training these models is becoming a pressing issue. Many generative models now rely on synthetic data from previous iterations, which can lead to a self-consuming loop that results in Model Autophagy Disorder (MAD) or model collapse. Over time, this process amplifies errors, degrades performance, and increases bias, presenting significant challenges to fairness and accuracy in AI outputs.

This paper tackles two critical questions:

  1. How can we best exploit synthetic data in generative model training to improve real data modeling and synthesis?
  2. How can we exploit synthetic data in generative model training in a way that does not lead to MADness in the future?

In this paper, we develop Self-IMproving diffusion models with Synthetic data (SIMS), a new learning framework for diffusion models that addresses both of the above issues simultaneously. Our key insight is that, to most effectively exploit synthetic data in training a generative model, we need to change how we employ it. Instead of naïvely training a model on synthetic data as though it were real, SIMS guides the model toward better performance and away from the patterns that arise from training on synthetic data.

Figure 1: Self-IMproving diffusion models with Synthetic data (SIMS) simultaneously improves diffusion modeling and synthesis performance while acting as a prophylactic against Model Autophagy Disorder (MAD). First row: samples from a base diffusion model (EDM2-S; Kynkäänniemi et al., 2024) trained on 1.28M real images from the ImageNet-512 dataset (Karras et al., 2024a) (Fréchet inception distance, FID = 2.56). Second row: samples from the base model after fine-tuning on 1.5M images synthesized from the base model, which degrades synthesis performance and pushes the model toward MADness (Alemohammad et al., 2023; 2024) (FID = 6.07). Third row: samples from the base model after applying SIMS using the same self-generated synthetic data as in the second row (FID = 1.73).

We focus here on SIMS for diffusion models in the context of image generation because their robust guidance capabilities enable us to efficiently guide them away from their own generated synthetic data. The method involves using a base model’s synthetic data to calculate a synthetic score function, which provides negative guidance during generation, steering the process toward real data distributions. 

Key features include:

  • Negative Guidance: Synthetic data is used as a reference to counteract its own biases.
  • Iterative Refinement: An auxiliary model fine-tuned on synthetic data refines the score function to better align with real data distributions.
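In score terms, this negative guidance can be written analogously to classifier-free guidance, but with the auxiliary model's score pushed against rather than toward. The notation below is a paraphrase of the mechanism described above, not an equation taken verbatim from the paper:

```latex
\tilde{s}(x, t) \;=\; s_{\theta}(x, t) \;+\; w\,\bigl(s_{\theta}(x, t) - s_{\phi}(x, t)\bigr)
```

where $s_{\theta}$ is the base model's score (trained on real data), $s_{\phi}$ is the auxiliary model's score (fine-tuned on synthetic data), and $w \ge 0$ is the guidance scale; $w = 0$ recovers plain sampling from the base model.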

Algorithm Overview:

  1. Train a base diffusion model on real data, i.e., standard training.
  2. Generate new synthetic data using the base model.
  3. Fine-tune an auxiliary model on the synthetic data.
  4. Sample from the combined score, using the auxiliary model's score as negative guidance, to steer generation toward the real data distribution and improve synthesis quality.
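The four steps above can be sketched in a few lines. Everything here (the function names, the closed-form stand-in score functions, and the toy Euler sampler) is hypothetical scaffolding chosen to make the snippet runnable and to illustrate the negative-guidance combination; it is not the paper's implementation:

```python
def sims_guided_score(s_base, s_aux, w):
    """Negative guidance: follow the base score while pushing away
    from the auxiliary (synthetic-data) score. w=0 recovers plain
    sampling from the base model."""
    return s_base + w * (s_base - s_aux)

# Hypothetical closed-form stand-ins for the two score networks,
# used only so the sampling loop below actually runs.
def base_score(x, t):   # stand-in for the model trained on real data
    return -x / (1.0 + t)

def aux_score(x, t):    # stand-in for the model fine-tuned on synthetic data
    return -1.2 * x / (1.0 + t)

def euler_sample(x0, steps=20, w=1.5):
    """A toy Euler integration of the guided score field (a sketch of
    the idea, not the paper's sampler)."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt          # integrate time from 1 down toward 0
        x = x + dt * sims_guided_score(base_score(x, t), aux_score(x, t), w)
    return x

print(euler_sample(2.0))
```

With `w > 0`, the combined score amplifies the directions where the base and auxiliary models disagree, which is exactly the signature of synthetic-data artifacts that SIMS steers away from.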

Experimental Results:

  • Self-Improvement: SIMS delivers superior performance compared to standard training approaches. By integrating negative guidance from the auxiliary model, SIMS enhances results beyond those achieved using only the base model for data generation. It also sets new benchmarks in Fréchet Inception Distance (FID) with scores of 1.33 for CIFAR-10 and 0.92 for ImageNet-64.
  • MAD-prophylactic: Experiments demonstrated that SIMS effectively prevents MAD even after 100 iterations of self-consuming training cycles. In this setup, the first generation of models is trained on real data, while subsequent generations are trained on a combination of real and synthetic data from the previous generation. With standard training, model performance typically declines as the number of generations increases. However, with SIMS, the performance of later-generation models remains consistent with that of the first-generation model trained exclusively on real data.
  • Bias Mitigation and Distribution Control: SIMS has the capability to tailor synthetic data distributions to match any desired in-domain target. The process involves modifying the synthetic data distribution used to train the auxiliary model to act as a complement to the target distribution. With the auxiliary model providing negative guidance, the final model naturally shifts towards the desired distribution while simultaneously improving generation quality. For instance, in experiments on FFHQ-64, SIMS successfully adjusted gender representation in generated images while enhancing overall generation quality. This approach helps mitigate biases, promote fairness, and improve output quality. 

Between the lines

Synthetic data has become a key component in model development, offering advantages such as cost-efficiency, limitless availability, and reduced privacy concerns (e.g., in medical applications). Over time, datasets sourced online will increasingly include synthetic content.

SIMS addresses this by using synthetic data as negative guidance, steering models toward true data distributions. This unconventional approach not only improves generative model performance but also prevents MAD. By eliminating the first-mover advantage of early adopters who trained on purely real data, SIMS ensures future models remain competitive, breaking potential monopolies and fostering fairness in the AI industry.


© 2025 Montreal AI Ethics Institute. This work is licensed under a Creative Commons Attribution 4.0 International License.