Ambient Diffusion Omni: Training Good Models with Bad Data
Authors: Giannis Daras*, Adrian Rodriguez-Munoz*, Adam Klivans, Antonio Torralba, Constantinos Daskalakis

Filter not
It’s well known that the secret sauce in training high-quality generative models is the data. By that, we typically mean curating a large-scale, high-quality, and diverse training set. But how does one do that?
The typical pipeline is to first form a big pool of candidate samples. These samples can be obtained from the Web or (if you are rich/lucky) from proprietary data sources. Once this large pool of data is formed, the filtering begins. We typically filter for a lot of things: duplicates, copyrighted content, NSFW samples, etc. But the type of filtering we are interested in here is quality filtering.
Why do we filter for quality? Because we don’t want low-quality images to leak into the generations of the model. But this leads to data (and hence information) loss: low-quality images still contain useful information about the world.
Key insight: In this work, we show that there is immense value in the low-quality data that is often discarded. We introduce Ambient-o, a framework that uses synthetic, low-quality, and out-of-distribution data to improve diffusion models.
Sample quality varies
We start by observing that image quality varies even within the same dataset. For example, let’s take a look at images from the all-time favorite ImageNet dataset:

The images on the left are of higher quality than the images on the right, yet they all belong to the same dataset and are currently treated as equals by the training algorithms of ImageNet generative models. To be formal, one would need to define what a “high-quality” image really is, and there have been many attempts in the Computer Vision literature to do so.
Here, we sidestep this hard task and instead show examples of what is and what isn’t a high-quality image. We would like a model that produces samples with the quality of the images on the left without sacrificing diversity.
Here is what Ambient-o does in comparison to the baseline:

Our model increases the quality of the generations without any changes to the architecture, the training/sampling hyperparameters, or the optimization algorithm. The benefit comes solely from using the available data cleverly. But how do we do that?
Learning in the high-noise regime
The first key idea of our work is that low-quality images can be used to train for high-noise diffusion times.
This is because distribution distances contract under noise: once enough noise is added, noisy low-quality images become statistically indistinguishable from noisy high-quality ones. We can train a classifier to find this merging point for each sample, like below:

Alternatively, we can select the merging point for each sample heuristically if we want to avoid the classifier training/annotation.
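To make the classifier-based annotation concrete, here is a minimal sketch of what the per-sample sweep could look like. The interface `quality_clf(x_noisy, sigma)`, the sigma grid, and the near-chance threshold are illustrative assumptions, not the exact procedure from the paper.

```python
# Minimal sketch of the annotation idea, not the paper's exact procedure.
# `quality_clf` is a hypothetical binary classifier trained to separate noisy
# high-quality images from noisy low-quality ones at a given noise level.
import torch

def annotate_merging_point(x, quality_clf, sigmas=None):
    """Return the smallest noise level at which `quality_clf` is near chance for x."""
    sigmas = torch.logspace(-2, 2, 50) if sigmas is None else sigmas
    for sigma in sigmas:                               # sweep noise levels low -> high
        x_noisy = x + sigma * torch.randn_like(x)
        p_low = torch.sigmoid(quality_clf(x_noisy[None], sigma)).item()
        if abs(p_low - 0.5) < 0.05:                    # can no longer tell the sets apart
            return float(sigma)
    return float(sigmas[-1])                           # never merges within the sweep
```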
Once we annotate our samples (a dataset preprocessing step), we can launch diffusion training in which different images are used for different subsets of the diffusion times, as shown below.

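Concretely, the resulting loss could look like the following minimal sketch (not the actual training code). `D(x_noisy, sigma)` stands in for an EDM-style denoiser, `sigma_min` is the per-sample annotation from the preprocessing step, and `sample_training_sigmas` is a placeholder for the usual noise-level sampler (e.g., a log-normal proposal).

```python
# Minimal sketch of the training-time masking: low-quality samples only
# contribute to the loss at noise levels above their annotated merging point.
import torch

def masked_diffusion_loss(D, x, sigma_min, sample_training_sigmas):
    sigma = sample_training_sigmas(x.shape[0])                  # one sigma per image
    noise = sigma.view(-1, 1, 1, 1) * torch.randn_like(x)
    pred = D(x + noise, sigma)                                  # predicted clean image
    per_sample = ((pred - x) ** 2).mean(dim=(1, 2, 3))          # MSE per image
    mask = (sigma >= sigma_min).float()                         # keep high-noise times only
    return (mask * per_sample).sum() / mask.sum().clamp(min=1)
```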
Synthetic data
Synthetic data can be viewed as “low-quality” versions of real samples. Hence, we can use synthetic data for certain (high-noise) diffusion times, borrowing the diversity of the synthetic generations without sacrificing image quality. This is what we do for our text-to-image results.
Learning in the low-noise regime
What about the low-noise regime of diffusion training? As we show in the paper, at low noise the diffusion model only needs to look at local crops to perform the denoising.
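One way to make this claim concrete is an empirical locality check. The sketch below assumes a pretrained, fully convolutional denoiser `D(x_noisy, sigma)` that accepts inputs of different spatial sizes; at low sigma the gap should be small, and it should grow as sigma increases.

```python
# Minimal sketch of a locality check: compare denoising a crop in isolation
# with the corresponding region of the full-image denoising.
import torch

@torch.no_grad()
def locality_gap(D, x, sigma, crop=64):
    noise = torch.randn_like(x)
    full = D(x + sigma * noise, sigma)                           # denoise full image
    xc = x[..., :crop, :crop]                                    # top-left crop
    nc = noise[..., :crop, :crop]
    local = D(xc + sigma * nc, sigma)                            # denoise crop alone
    return (full[..., :crop, :crop] - local).abs().mean().item()
```

In practice one would compare only interior pixels to avoid boundary effects, but the sketch conveys the idea.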
Out of distribution images
One nice consequence of this locality is that we can leverage crops from out-of-distribution images, as long as they locally look like they come from the distribution of interest.
For example, we can use crops of cats to make a generative model for dogs! Below we show the “dogness” probability for different patches of a cat image.

As shown, some of the crops from this cat are useful for learning dog features in the low-noise regime of the diffusion.
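Here is a minimal sketch of how such a selection could work. `patch_clf` is a hypothetical classifier that scores how much a crop locally resembles the target distribution (dogs); the crop size, stride, and threshold are illustrative.

```python
# Minimal sketch of selecting useful out-of-distribution crops.
import torch

def select_useful_crops(image, patch_clf, crop=64, stride=32, thresh=0.8):
    kept = []
    _, H, W = image.shape                                        # image: (C, H, W)
    for top in range(0, H - crop + 1, stride):
        for left in range(0, W - crop + 1, stride):
            patch = image[:, top:top + crop, left:left + crop]
            if torch.sigmoid(patch_clf(patch[None])).item() > thresh:
                kept.append(patch)                               # locally looks like a dog
    return kept
```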
High-quality crops
Images that are low-quality overall can still contain useful patches. These patches can be used to train the diffusion model in the low-noise part of the diffusion trajectory. Below, we show examples of low-quality and high-quality crops from ImageNet.

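For completeness, here is a minimal sketch of the low-noise counterpart of the earlier masked loss: crops (high-quality patches, or in-distribution-looking patches from out-of-distribution images) contribute only below a cutoff `sigma_low`, where denoising is local. The cutoff and the interfaces are again illustrative assumptions rather than the exact training code.

```python
# Minimal sketch: crops are only used at low-noise diffusion times.
import torch

def crop_diffusion_loss(D, crops, sample_training_sigmas, sigma_low=0.2):
    sigma = sample_training_sigmas(crops.shape[0])
    noise = sigma.view(-1, 1, 1, 1) * torch.randn_like(crops)
    pred = D(crops + noise, sigma)
    per_sample = ((pred - crops) ** 2).mean(dim=(1, 2, 3))
    mask = (sigma <= sigma_low).float()                          # keep low-noise times only
    return (mask * per_sample).sum() / mask.sum().clamp(min=1)
```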
Results
Our model performs well in both unconditional/class-conditional and text-conditional settings. For example, here are the FIDs we obtain on ImageNet-512:
| ImageNet-512 | Train FID ↓ (no CFG) | Train FID ↓ (w/ CFG) | Train FIDv2 ↓ (no CFG) | Train FIDv2 ↓ (w/ CFG) | Test FID ↓ (no CFG) | Test FID ↓ (w/ CFG) | Test FIDv2 ↓ (no CFG) | Test FIDv2 ↓ (w/ CFG) | Mparams | NFE |
|---|---|---|---|---|---|---|---|---|---|---|
| EDM2-XS | 3.57 | 2.91 | 103.39 | 79.94 | 3.77 | 3.68 | 115.16 | 93.86 | 125 | 63 |
| Ambient-o-XS 🔥 | 3.59 | 2.89 | 107.26 | 79.56 | 3.69 | 3.58 | 115.02 | 92.96 | 125 | 63 |
| EDM2-XXL | 1.91 (1.93) | 1.81 | 42.84 | 33.09 | 2.88 | 2.73 | 56.42 | 46.22 | 1523 | 63 |
| Ambient-o-XXL 🔥 | 1.99 | 1.87 | 43.38 | 33.34 | 2.81 | 2.68 | 56.40 | 46.02 | 1523 | 63 |
| Ambient-o-XXL+crops 🔥 | 1.91 | 1.80 | 42.84 | 32.63 | 2.78 | 2.53 | 56.39 | 45.78 | 1523 | 63 |
According to GPT-4o preference evaluations and other evals in the paper, our text-to-image model significantly surpasses the quality of a baseline model that treats all the data the same. This improvement comes solely from treating synthetic data as low-quality data.


Concluding Thoughts
Is it possible to get good generative models from bad data? Our framework extracts value from low-quality, synthetic, and out-of-distribution sources. At a time when the ever-growing data demands of GenAI are at odds with the need for quality control, Ambient-o lights a path for both to be achieved simultaneously.
Are we learning something from nothing? Not quite. High-quality images provide sharpness. Low-quality ones bring diversity. We squeeze as much value as possible from our data — instead of filtering it out or treating all samples equally.
Acknowledgements
We used the EDM codebase for our pixel diffusion experiments, the EDM-2 codebase for our ImageNet experiments, and the Microdiffusion codebase for our text-to-image results.
We thank the authors for making their work publicly available.