Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models
Diffusion models produce images with high quality and customizability, enabling them to be used for commercial art and graphic design purposes.
In this work, we study image retrieval frameworks that enable us to compare generated images with training samples and detect when content has been replicated.
Applying our frameworks to diffusion models trained on multiple datasets, including Oxford Flowers, CelebA, ImageNet, and LAION, we discuss how factors such as training set size impact rates of content replication.
We also identify cases where diffusion models, including the popular Stable Diffusion model, blatantly copy from their training data.
Authors
Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, Tom Goldstein
Diffusion models have the potential to forge new generative tools for commercial art and graphic design, but they also bring with them a number of legal and ethical risks.
There is a risk that diffusion models might, without notice, reproduce data from the training set directly, or present a collage of multiple training images.
In principle, replicating partial or complete information from the training data has implications for the ethical and legal use of diffusion models in terms of attributions to artists and photographers.
We begin with a study of how to detect content replication, and we consider a range of image similarity metrics developed in the self-supervised learning and image retrieval communities.
We benchmark the performance of different image feature extractors using real and purpose-built synthetic datasets and show that state-of-the-art instance retrieval models work well for this task.
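As a rough illustration of this retrieval-based detection idea, the sketch below embeds generated and training images with a generic pretrained backbone and flags generations whose nearest training neighbor is highly similar. The ResNet-50 backbone, file paths, and 0.5 threshold are placeholder assumptions for illustration, not the specific feature extractors or settings benchmarked in the paper.

```python
# Minimal sketch (not the paper's exact pipeline): embed generated and
# training images with a pretrained backbone and flag generations whose
# nearest training image is highly similar. The ResNet-50 backbone and
# the 0.5 threshold are illustrative assumptions.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Any strong image feature extractor could be substituted here; the paper
# benchmarks several, including instance-retrieval models.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 2048-d pooled features
backbone.eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    """Return L2-normalized feature vectors for a list of image paths."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return F.normalize(backbone(batch.to(device)), dim=1)

@torch.no_grad()
def top_matches(gen_paths, train_paths, threshold=0.5):
    """Pair each generated image with its most similar training image."""
    sims = embed(gen_paths) @ embed(train_paths).T  # cosine similarity
    best_sim, best_idx = sims.max(dim=1)
    return [
        (gen_paths[i], train_paths[best_idx[i].item()],
         best_sim[i].item(), best_sim[i].item() >= threshold)
        for i in range(len(gen_paths))
    ]
```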
Armed with new and existing tools, we search for data replication behavior in a range of diffusion models with different dataset properties.
For small and medium dataset sizes, replication happens frequently, while for a model trained on the large and diverse ImageNet dataset, replication seems undetectable.
This latter finding may lead one to believe that replication is not a problem for large-scale models.
However, the even larger Stable Diffusion model exhibits clear replication in various forms.
Result
Diffusion models are capable of reproducing high-fidelity content from their training data, and we find that they do.
While typical images from large-scale models do not appear to contain copied content that was detectable using our feature extractors, copies do appear to occur often enough that their presence cannot be safely ignored; images with dataset similarity of at least 0.5 account for approximately 1.88% of our random generations.
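To make this kind of statistic concrete, a small helper (a sketch; the similarity scores and the 0.5 cutoff are assumptions standing in for the paper's dataset-similarity measurements) can turn per-generation nearest-neighbor similarities into a replication rate:

```python
# Sketch: fraction of random generations whose dataset similarity exceeds
# a cutoff. `similarities` holds, for each generation, the similarity to
# its nearest training image (e.g. the best_sim values from the retrieval
# sketch above); the 0.5 threshold is illustrative.
from typing import Sequence

def replication_rate(similarities: Sequence[float], threshold: float = 0.5) -> float:
    if not similarities:
        return 0.0
    return sum(s >= threshold for s in similarities) / len(similarities)

# e.g. replication_rate([0.31, 0.62, 0.18, 0.55]) == 0.5
```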