🔬 Research Summary by Josue Casco-Rodriguez and Sina Alemohammad.
Josue is a 2nd-year PhD student at Rice University. He is interested in illuminating the intersection of machine learning and neuroscience from first principles.
Sina is a 5th-year PhD student at Rice University. He is interested in deep learning theory.
[Original paper by Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G. Baraniuk]
Overview: Seismic advances in generative AI algorithms for imagery, text, and other data types have led to the temptation to use AI-synthesized data to train next-generation models; repeating this process creates a self-consuming loop. Across various models and datasets, this paper finds that without enough fresh (previously unseen) real data at each loop iteration, the quality (realism) or diversity (variety) of synthetic data inevitably decreases – even if the original training data is available at every iteration. Since the amount of synthetic data on the Internet is exploding, this raises serious concerns about the quality and diversity of future generative models and, in turn, of the entire Web.
Introduction
Thanks to rapid advances in generative models, synthetic data of many modalities is proliferating, especially on the Internet. Since the training datasets for generative models tend to be sourced from the Internet, today’s generative models are unwittingly being trained on increasing amounts of AI-synthesized data. Moreover, throwing caution to the wind, AI-synthesized data is increasingly used by choice in a wide range of applications for several reasons: (a) synthetic data is far easier to collect than real data, (b) synthetic data can complement (augment) existing real datasets to boost AI model performance, (c) synthetic data can protect privacy in sensitive applications like medical imaging, and (d) AI models are rapidly outgrowing the amount of available real data on the Internet. Using synthetic data to train generative models departs from standard AI training practice in one important respect: repeating the process for generation after generation of models forms a self-consuming loop. Whatever the makeup of the training set or the sampling method used, the potential ramifications of self-consuming loops are poorly understood. This paper carefully studies self-consuming loops from the perspective of generative image models.
Key Insights
We studied three types of realistic self-consuming loops, each of which feeds synthetic data (and possibly real data) back into training: “fully synthetic loops,” “synthetic augmentation loops” (where a fixed set of real data is also present), and “fresh data loops” (where new, previously unseen real data is also present). We also considered how generative models synthesize data during these loops. In particular, generative models have mechanisms that boost synthetic quality (realism) at the cost of synthetic diversity (variety); we call these mechanisms “sampling bias” or “cherry-picking.” Across models and datasets, we found that self-consuming loops suffer from Model Autophagy Disorder (MAD), wherein synthetic quality or diversity progressively decreases.
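To make the three loop structures concrete, here is a minimal toy sketch (not the paper’s actual experimental code): the “generative model” is just a 1-D Gaussian fit in NumPy, sampling bias is crudely modeled by shrinking the sampling spread, and the `fit`, `sample`, and `next_generation` helpers are hypothetical names introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(data):
    """Toy 'generative model': estimate the mean and std of 1-D data."""
    return data.mean(), data.std()

def sample(model, n, bias=0.0):
    """Draw n synthetic samples; bias in [0, 1) mimics cherry-picking by
    shrinking the sampling spread (higher realism, lower diversity)."""
    mu, sigma = model
    return rng.normal(mu, (1.0 - bias) * sigma, size=n)

def next_generation(loop, model, real, bias=0.0, n_synth=1_000):
    """One iteration of a self-consuming loop under the three regimes."""
    synthetic = sample(model, n_synth, bias=bias)
    if loop == "fully_synthetic":            # synthetic data only
        train_set = synthetic
    elif loop == "synthetic_augmentation":   # fixed, previously seen real data + synthetic
        train_set = np.concatenate([real, synthetic])
    else:                                    # "fresh_data": brand-new real data + synthetic
        fresh = rng.normal(0.0, 1.0, size=len(real))
        train_set = np.concatenate([fresh, synthetic])
    return fit(train_set)

real = rng.normal(0.0, 1.0, size=1_000)      # stand-in for the original real dataset
```

In this toy, drift of the fitted mean away from the true mean loosely stands in for lost quality, and shrinkage of the fitted std stands in for lost diversity; the paper’s actual experiments use deep generative image models rather than this Gaussian caricature.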
Training exclusively on synthetic data leads to MADness
Fully synthetic loops occur when generative models train solely on synthetic data from previous models, for example when a model is trained on its own cherry-picked outputs. In these scenarios, we found that the quality or diversity of synthetic data decreases; in other words, training solely on synthetic data produces MADness. Incorporating sampling biases (cherry-picking) can prevent or slow the decline in synthetic quality, but only at the cost of accelerating the loss of synthetic diversity.
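As a rough illustration of this trade-off, the toy sketch above can be run in the fully synthetic regime with and without sampling bias (again, a simplification under our toy assumptions, not the paper’s experiment):

```python
for bias in (0.0, 0.2):
    model = fit(real)                         # generation 1 trains on real data
    for _ in range(9):                        # generations 2-10 train on synthetic data only
        model = next_generation("fully_synthetic", model, real, bias=bias)
    print(f"bias={bias}: mean={model[0]:+.3f}, std={model[1]:.3f}")
# Without bias, the fitted parameters slowly drift away from the real values as
# estimation errors compound; with bias, the fitted std shrinks roughly like
# (1 - bias) per generation, a stand-in for collapsing diversity.
```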
For example, we trained two identical facial image generators in fully synthetic loops: one with and one without sampling biases. Without sampling biases, wave-like artifacts embedded themselves into the synthetic images, decreasing synthetic faces’ realism (quality). Meanwhile, with sampling biases in place, these artifacts were negligible, but the synthetic data instead became less and less diverse, eventually converging to just a few nearly identical faces.
Fixed real training data can delay but not prevent MADness
In practice, however, people usually have access to real training datasets. Incorporating this existing real data into a fully synthetic loop produces a synthetic augmentation loop: each generative model trains on synthetic data from previous models together with previously seen real data. In general, increasing the amount of training data improves AI model performance, but the presence of synthetic data introduces uncertainty, since synthetic data can deviate from reality. We found that incorporating fixed real training data only delays, and cannot prevent, the inevitable degradation of synthetic quality or diversity. In other words, previously seen real data cannot prevent MADness in self-consuming generative models.
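Re-using the toy helpers sketched earlier, the synthetic augmentation regime simply concatenates the fixed real dataset with each generation’s synthetic data. In the toy this visibly slows the shrinkage of the fitted std relative to the fully synthetic loop, though the toy is far too crude to reproduce the paper’s stronger finding that fixed real data only postpones, rather than prevents, the degradation:

```python
for loop in ("fully_synthetic", "synthetic_augmentation"):
    model = fit(real)
    for _ in range(9):                        # generations 2-10, with sampling bias
        model = next_generation(loop, model, real, bias=0.2)
    print(f"{loop}: std after 10 generations = {model[1]:.3f}")
# Mixing the fixed real data back in keeps the fitted std much closer to its true
# value of 1, i.e., the loss of diversity is slower than in the fully synthetic loop.
```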
Fresh real data can prevent MADness
But what if the real data in a self-consuming loop is new? What if each generative model has access to real data that no previous generative model has seen? We call these situations fresh data loops: each generative model trains on synthetic data from previous models plus a new set of previously unseen real data. At the broader scale of the Internet, fresh data loops are already happening: new data is constantly being uploaded, except that now a portion of it is AI-generated. To confirm this, one need only examine the popular LAION-5B dataset (the training data for Stable Diffusion): amidst the real images, there are already synthetic images from earlier generative models like StyleGAN and pix2pix.
In the fresh data loop, we found that the quality and diversity of synthetic data depend on the ratio of fresh (previously unseen) real data to synthetic data. If the ratio is large enough, quality and diversity do not decrease; if it is too small, MADness ensues. Take, for example, the Internet: our findings suggest that if synthetic data outgrows real data, then future generative models will suffer from MADness. Exactly what ratio of fresh to synthetic data is needed to avoid MADness depends on various factors, such as the complexity of the real and synthetic data, the type of generative model used, and the presence of sampling bias.
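Under the same toy assumptions, the effect of the ratio can be seen by varying how much synthetic data is mixed with a fixed-size batch of fresh real data each generation (the sizes below are arbitrary, chosen only for illustration):

```python
for n_synth in (500, 10_000):                 # fresh:synthetic ratios of 2:1 and 1:10
    model = fit(real)
    for _ in range(19):                       # generations 2-20, fresh real data each time
        model = next_generation("fresh_data", model, real, bias=0.2, n_synth=n_synth)
    print(f"{len(real)} fresh : {n_synth} synthetic -> std={model[1]:.3f}")
# With plenty of fresh real data, the fitted std settles near its true value of 1;
# when synthetic data dominates, it settles far below 1 (persistently reduced diversity).
```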
Between the lines
Generative models are becoming increasingly capable of generating whatever users can imagine, producing an explosion of synthetic content on the Internet. As people trust and use generative models more and more, it is important to acknowledge how our current interactions with generative models may negatively impact the future. Our results show that if synthetic data outgrows real data on the Internet, future generative models will be stuck in a self-consuming loop and thus doomed to MADness. Future synthetic data could contain notable artifacts (or, in the case of text, false statements) that become increasingly pronounced over time, or it could lack creativity and diversity. In a worst-case scenario, synthetic data could diverge from reality altogether.
AI industry leaders have recently pledged at the White House to implement measures like watermarking to make synthetic data distinguishable from real data. These efforts aim to mitigate the negative effects of synthetic data on the Internet. Regarding Model Autophagy Disorder (MAD), watermarking could be used to keep generative models from training on AI-generated data. However, whether such approaches are a viable solution to MADness remains an open question for future work.