🔬 Research Summary by Sina Alemohammad, a PhD candidate at Rice University with a focus on the interaction between generative models and synthetic data.
[Original paper by Sina Alemohammad, Ahmed Imtiaz Humayun, Shruti Agarwal, John Collomosse, Richard Baraniuk]
Overview: The increasing reliance on synthetic data to train generative models risks creating a feedback loop that degrades model performance and biases outputs. This paper introduces Self-IMproving diffusion models with Synthetic data (SIMS), a novel approach to utilize synthetic data effectively without incurring Model Autophagy Disorder (MAD) or model collapse, setting new performance benchmarks and addressing biases in data distributions.
Introduction
In a world where generative artificial intelligence (AI) is transforming industries, the availability of quality data for training these models is becoming a pressing issue. Many generative models now rely on synthetic data from previous iterations, which can lead to a self-consuming loop that results in Model Autophagy Disorder (MAD) or model collapse. Over time, this process amplifies errors, degrades performance, and increases bias, presenting significant challenges to fairness and accuracy in AI outputs.
This paper tackles two critical questions:
- How can we best exploit synthetic data in generative model training to improve real data modeling and synthesis?
- How can we exploit synthetic data in generative model training in a way that does not lead to MADness in the future?
In this paper, we develop Self-IMproving diffusion models with Synthetic data (SIMS), a new learning framework for diffusion models that addresses both of the above issues simultaneously. Our key insight is that exploiting synthetic data effectively requires changing how it enters training. Instead of naïvely training a model on synthetic data as though it were real, SIMS uses that data to guide the model toward better performance and away from the patterns that arise from training on synthetic data.
We focus here on SIMS for diffusion models in the context of image generation because their strong guidance capabilities let us efficiently guide them away from their own generated synthetic data. The method uses the base model’s synthetic data to estimate a synthetic score function, which provides negative guidance during generation, steering sampling toward the real data distribution.
Key features include:
- Negative Guidance: Synthetic data is used as a reference to counteract its own biases.
- Iterative Refinement: An auxiliary model fine-tuned on synthetic data refines the score function to better align with real data distributions.
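In the spirit of classifier-free guidance, the negative guidance can be written schematically as a correction to the base model’s score; the notation and weighting below are illustrative rather than copied verbatim from the paper:

```latex
% Illustrative form of the guided score (the paper's exact parameterization
% may differ). s_base is the model trained on real data, s_syn the auxiliary
% model fine-tuned on the base model's synthetic data, and w > 0 a guidance scale.
\[
  s_{\mathrm{SIMS}}(x_t, t)
    = s_{\mathrm{base}}(x_t, t)
    + w \bigl( s_{\mathrm{base}}(x_t, t) - s_{\mathrm{syn}}(x_t, t) \bigr)
\]
% Heuristically, this corresponds to sampling from a density proportional to
% p_base(x)^{1+w} / p_syn(x)^{w}, so regions that the synthetic model
% over-represents are down-weighted.
```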
Algorithm Overview (a rough code sketch follows this list):
- Train a base diffusion model on real data, i.e., standard training.
- Generate new synthetic data using the base model.
- Fine-tune an auxiliary model on the synthetic data.
- Combine the score functions of the base and auxiliary models with negative guidance to improve the quality of generated samples.
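The list above maps onto a short pipeline sketch. The helper callables (train_fn, sample_fn, finetune_fn, guided_sampler) and the .score interface are placeholders for whatever diffusion codebase is used, not the paper’s actual API:

```python
# Minimal sketch of the SIMS recipe; step numbers match the list above.
# All helpers passed in are hypothetical placeholders, not a real library API.

def sims(real_data, train_fn, sample_fn, finetune_fn, guided_sampler,
         guidance_scale=1.0, num_synthetic=50_000):
    base = train_fn(real_data)                  # 1. standard training on real data
    synthetic = sample_fn(base, num_synthetic)  # 2. generate synthetic data
    aux = finetune_fn(base, synthetic)          # 3. fine-tune auxiliary model on it

    def guided_score(x_t, t):                   # 4. negative guidance at sampling time
        s_base = base.score(x_t, t)
        s_syn = aux.score(x_t, t)
        # Push away from the synthetic-data score, toward the real distribution.
        return s_base + guidance_scale * (s_base - s_syn)

    return guided_sampler(guided_score)
```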
Experimental Results:
- Self-Improvement: SIMS delivers superior performance compared to standard training approaches. By integrating negative guidance from the auxiliary model, SIMS enhances results beyond those achieved using only the base model for data generation. It also sets new benchmarks in Fréchet Inception Distance (FID) with scores of 1.33 for CIFAR-10 and 0.92 for ImageNet-64.
- MAD-prophylactic: Experiments demonstrated that SIMS effectively prevents MAD even after 100 iterations of self-consuming training cycles. In this setup, the first generation of models is trained on real data, while subsequent generations are trained on a combination of real and synthetic data from the previous generation. With standard training, model performance typically declines as the number of generations increases. However, with SIMS, the performance of later-generation models remains consistent with that of the first-generation model trained exclusively on real data.
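For concreteness, the self-consuming setup can be sketched as below; the 50/50 real-to-synthetic mix and the helper names are illustrative assumptions, not the paper’s exact protocol:

```python
# Rough sketch of a self-consuming training loop; with standard training this
# degrades over generations (MAD), while SIMS-style training stays stable.
# train_fn and sample_fn are hypothetical placeholders.

def self_consuming_loop(real_data, train_fn, sample_fn,
                        generations=100, synth_fraction=0.5):
    model = train_fn(real_data)  # generation 1: real data only
    for _ in range(generations - 1):
        synthetic = sample_fn(model, int(synth_fraction * len(real_data)))
        mixed = list(real_data) + list(synthetic)  # real + previous generation's output
        model = train_fn(mixed)
    return model
```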
- Bias Mitigation and Distribution Control: SIMS can tailor the synthetic data distribution to match any desired in-domain target. The process involves modifying the synthetic data distribution used to train the auxiliary model so that it acts as a complement to the target distribution. With the auxiliary model providing negative guidance, the final model naturally shifts toward the desired distribution while simultaneously improving generation quality. For instance, in experiments on FFHQ-64, SIMS adjusted gender representation in generated images while enhancing overall generation quality. This approach helps mitigate biases, promote fairness, and improve output quality.
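As a loose illustration of this idea (not the paper’s exact recipe), one could resample the auxiliary training set so that it over-represents whatever attribute the base model over-generates; the negative guidance then pushes sampling back toward the target mix:

```python
import random

# Illustrative only: build an auxiliary training set skewed toward the
# over-generated attribute, so negative guidance steers the final model
# toward the desired balance. The attribute predicate and the chosen
# fraction are assumptions, not the paper's formula.

def build_auxiliary_set(synthetic_samples, has_attribute, aux_fraction, size, seed=0):
    rng = random.Random(seed)
    with_attr = [s for s in synthetic_samples if has_attribute(s)]
    without_attr = [s for s in synthetic_samples if not has_attribute(s)]
    n_with = int(aux_fraction * size)  # choose aux_fraction above the base model's own rate
    return rng.sample(with_attr, n_with) + rng.sample(without_attr, size - n_with)
```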
Between the lines
Synthetic data has become a key component in model development, offering advantages such as cost-efficiency, limitless availability, and reduced privacy concerns (e.g., in medical applications). Over time, datasets sourced online will increasingly include synthetic content.
SIMS addresses this by using synthetic data as negative guidance, steering models toward the true data distribution. This unconventional approach not only improves generative model performance but also prevents MAD. By eliminating the first-mover advantage of early adopters who could train on purely real data before synthetic content pervaded online datasets, SIMS helps ensure future models remain competitive, breaking potential monopolies and fostering fairness in the AI industry.