Diffusion models for ad creative production

Denoising diffusion probabilistic models (DDPMs), commonly referred to simply as diffusion models, represent an important area of frontier research in machine learning. Diffusion models are generative and have been used to produce high-quality samples of various media, including images and audio. Popular diffusion models that power consumer applications include DALL-E 2 and 3, Midjourney, and the Stable Diffusion family.

While other models, such as Generative Adversarial Networks (GANs) and variational autoencoders (VAEs) [1], can be used for image generation, diffusion models present several advantages. First, diffusion models tend to exhibit better mode coverage, meaning they sample from a broader range of the data distribution and capture a wider variety of its details. GANs, especially, are prone to mode collapse, in which a model generates samples from only a few modes (dense regions) of the distribution while ignoring all others.

Second, while both GANs and diffusion models are capable of generating high-fidelity, photorealistic images, diffusion models are more stable to train. This is largely due to the adversarial training procedure used in GANs, in which a generator and a discriminator are trained in tandem. That dynamic can make convergence difficult and makes artifacts or distorted features in generated images more likely.

The use of the word “mode” here, and the reference to distributions, can be confusing. In the context of a diffusion model, or any generative model, the goal of the inference process is to produce a sample in some modality (image, audio, text, etc.) as if it had been drawn from some underlying distribution. In the case of a diffusion model for producing images, the distribution can be thought of as spanning all possible combinations of pixels, with some underlying probability law defining how those pixels are grouped. In practice, the distribution is approximated by the empirical distribution of the training set (i.e., the images used for training, in the case of an image model).

In Denoising Diffusion Probabilistic Models [3], one of the seminal papers on the topic, Ho et al. define the model as a latent variable model of the form

\[ p_\theta(x_0) := \int p_\theta(x_{0:T}) \, dx_{1:T} \]

where x1, ..., xT are latent variables of the same dimensionality as the data x0 ∼ q(x0).

Here, the x0 ∼ q(x0) notation means that x0 is an image sampled from the data distribution, q. The rest of the paper describes the diffusion model, which is trained with two processes: the forward process and the reverse process.

In the forward process, an image (x0) is gradually corrupted with Gaussian noise across timesteps 1:T until it is indistinguishable from pure noise, as in this set of equations from Ho et al.:

\[ q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) := \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\big) \]

Here, βt is the variance of the Gaussian noise added at each timestep to the previous perturbation of the image, xt-1. The notation shows that q(x1:T|x0) is the product over t = 1:T of the conditional steps q(xt|xt-1), each an increasingly noised version of the previous one. In this way, the forward process is a Markov chain: each step depends only on the previous step, given the variance schedule βt. Ho et al. find that the forward process admits a closed form that allows xt to be sampled for any t during training without iterating through every timestep:

\[ q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big), \quad \text{where } \alpha_t := 1 - \beta_t \text{ and } \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s \]
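As a concrete illustration of this closed form, here is a minimal sketch that samples xt directly from x0 at an arbitrary timestep; the linear beta schedule, tensor shapes, and step count are illustrative assumptions rather than a prescribed configuration.

```python
import torch

# Sample x_t ~ q(x_t | x_0) directly via the closed form, without iterating
# through the intermediate timesteps. Schedule and shapes are illustrative.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t for t = 1..T (linear schedule)
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)     # alpha_bar_t = product of alpha_s up to t

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    sqrt_ab = alphas_bar[t].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_ab = (1.0 - alphas_bar[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise

# Usage: noise a batch of (stand-in) images at randomly drawn timesteps.
x0 = torch.randn(4, 3, 64, 64)
t = torch.randint(0, T, (4,))
noise = torch.randn_like(x0)
xt = q_sample(x0, t, noise)
```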

The reverse process trains a neural network to predict the noise applied to any given noisy sample, xt. The underlying analytical insight for the reverse process is based on work by Anderson [4] in a paper published in 1982: he found that the gradient of the log density appears in the reverse-time stochastic differential equation (SDE) as a drift term that guides a sample back to the underlying data distribution at any given time step [5].
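For reference, in the score-based formulation of Song et al. [5], the reverse-time SDE takes the following form, with the score (the gradient of the log density) appearing in the drift term; f and g are the drift and diffusion coefficients of the forward SDE, and w̄ is a Wiener process running backward in time:

\[ dx = \big[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \big]\, dt + g(t)\, d\bar{w} \]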

Training a DDPM minimizes the variational bound L (maximizes the ELBO), as presented in Equation (5) from Ho et al.:

\[ \mathbb{E}_q\Big[ \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t>1} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0 \mid x_1)}_{L_0} \Big] \]

This notion is captured in the diffusion modeling literature through the recognition that, for small values of βt, the reverse one-step distribution pθ(xt-1|xt) can itself be modeled as a Gaussian, using a fixed posterior variance derived from the forward process and a mean computed from xt and the neural network’s prediction of the noise at that timestep. Note that q(xt|xt-1) is Gaussian by design in the forward process. Per Ho et al. in Equation (5) above, training a diffusion model can be done by minimizing the KL divergence between these Gaussian distributions, which reduces to a mean squared error between the predicted and true noise, with a per-timestep loss that simplifies to:

\[ L_{t-1} - C = \mathbb{E}_{x_0, \epsilon}\left[ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2 \right] \]

Note that the weighting coefficient in front of the squared error doesn’t depend on the model parameters θ, and many model implementations drop it and train on the unweighted MSE (the simplified objective, L_simple, in Ho et al.).

Ho et al. outline the training and sampling processes in Algorithms 1 and 2 of their paper (note the dropped weighting term in the training objective); a sketch of both procedures follows below. Note that this describes an unconditional DDPM: the model learns to de-noise data at an arbitrary timestep, and the sampling process begins by drawing i.i.d. standard-normal noise for each element of the tensor. There is no condition on the training or sampling process, so sampling from pure noise generates a random variant of whatever data the model was trained on.
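The sketch below condenses both algorithms into plain PyTorch: a single gradient step on the simplified noise-prediction objective, and an ancestral sampling loop that starts from pure noise. The `model` argument stands in for any noise-prediction network εθ(xt, t), and the beta schedule and image shape are illustrative assumptions.

```python
import torch

# Sketch of Ho et al.'s Algorithm 1 (training) and Algorithm 2 (sampling) for an
# unconditional DDPM. `model` is any noise-prediction network eps_theta(x_t, t).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def training_step(model, x0, optimizer):
    """One gradient step on the simplified objective ||eps - eps_theta(x_t, t)||^2."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    xt = alphas_bar[t].sqrt().view(-1, 1, 1, 1) * x0 + \
         (1 - alphas_bar[t]).sqrt().view(-1, 1, 1, 1) * eps
    loss = torch.nn.functional.mse_loss(model(xt, t), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample(model, shape=(1, 3, 64, 64)):
    """Ancestral sampling: start from pure noise and de-noise step by step."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps_hat = model(x, torch.full((shape[0],), t))
        coef = (1 - alphas[t]) / (1 - alphas_bar[t]).sqrt()
        x = (x - coef * eps_hat) / alphas[t].sqrt() + betas[t].sqrt() * z
    return x
```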

Standard text-to-image models like Stable Diffusion are conditional, meaning that they utilize text inputs (prompt embeddings) in the reverse process to steer generation toward the content of the prompt. During training, an image-label pair (x0, y) is sampled from the distribution, with the forward process run on x0 (the image) and the reverse process predicting xt-1 conditioned on xt and y.

The labels (y) are encoded into embeddings, and Ho et al.’s training and sampling procedures are adapted to incorporate conditionality: the image-label pair is sampled together, and the encoded label is passed to the noise-prediction network at every step. The condition is also incorporated into the loss function:

\[ L_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, y, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t,\ y\right) \right\|^2 \right] \]

Stable Diffusion itself was trained using the text encoder from OpenAI’s CLIP neural network [10], which was trained on 400 million image-text pairs sourced from the internet. CLIP encodes both the text and the image into vectors in a shared embedding space and was trained using a contrastive loss, meaning that the model rewards correct text-image pairs and penalizes incorrect text-image pairs.
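To make the contrastive objective concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss; the embedding dimension, batch size, and temperature are illustrative, and `image_emb` / `text_emb` stand in for the outputs of the image and text encoders on a batch of matching pairs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matching image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)   # unit vectors, so dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N similarity matrix: entry (i, j) scores image i against text j.
    logits = image_emb @ text_emb.t() / temperature

    # The correct pairing for row i is column i: matched pairs are rewarded,
    # mismatched pairs are penalized, in both directions.
    targets = torch.arange(logits.shape[0])
    loss_images = F.cross_entropy(logits, targets)       # image -> text
    loss_texts = F.cross_entropy(logits.t(), targets)    # text -> image
    return (loss_images + loss_texts) / 2

# Usage with random stand-in embeddings for a batch of 8 pairs:
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```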

The conditional aspect of diffusion models for generative image production has obvious implications for advertising creative production. Creative testing often invokes the concept-variant sensibility that I discuss in Producing and deploying advertising creative at scale. An example of a concept might be a dog riding a skateboard; variants might be that same dog riding the same skateboard in different settings, such as in a city, on the beach, or in space.

The general logic behind introducing multiple variations of creative concepts is that variants may appeal with different degrees of resonance across audience segments. And since advertisers increasingly yield targeting decisions to ad platforms, supplying those platforms with sufficient creative volume to achieve those optimal pairings becomes an imperative.

An obvious and straightforward condition for an image generation diffusion model to serve the advertising use case is image background variation. This was one of the first use cases implemented by Meta for generative creative production, for instance. The process of creative production, testing, and iteration, broadly, can be seen as an exercise in conditional generation: promoting some product through ad creatives that vary in specific ways to find the most resonant messaging per audience segment. Conditional image generation with a DDPM model is a natural fit.

To showcase the application of DDPMs to ad creative generation, I fine-tuned a diffusion model for producing eCommerce advertising creative, with the image background as a condition supplied by the text prompt. You can find and copy the Google Colab notebook here.

The premise of this toy example is as follows. Imagine an eCommerce clothing retailer wants to portray its products in three specific settings: on the beach, in space, and in a city. The retailer aims to use a generative model to produce advertising creative variants with those three backgrounds, ensuring thematic consistency by matching specific aesthetic characteristics. So it fine-tunes an existing diffusion model with image-label pairs, where each product image contains a background that matches those characteristics.

For this example, I fine-tune the Stable Diffusion 1.5 model using the Fashionpedia dataset, which contains (1) images of various pieces of clothing and (2) associated annotations that provide categorical labels as well as mask coordinates for isolating just the clothing items. I only used the validation and test images, which can be found here. Since this is a toy model, I didn’t use the larger training set of images, and since it’s not being used for classification, there’s no need for separate training / validation / test data sets. Rather than using the entire dataset, I targeted 100 instances each for the ‘dress’, ‘pants’, ‘shirt’, and ‘sweater’ items, and created the three background variants (beach, space, and city) for each of them (note that only 30 instances of ‘sweater’ were found that were large enough to be included).
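For illustration, the sketch below shows one plausible way to filter the Fashionpedia annotations (which follow the COCO JSON format) down to sufficiently large instances of the target items; the file path, substring-based category matching, area threshold, and per-item cap are assumptions for illustration, not the notebook’s exact logic.

```python
import json
from collections import defaultdict

ANNOTATIONS_PATH = "instances_attributes_val2020.json"   # hypothetical local path
TARGET_ITEMS = ["dress", "pants", "shirt", "sweater"]
MIN_AREA = 50_000        # illustrative threshold: skip instances too small to use
MAX_PER_ITEM = 100

with open(ANNOTATIONS_PATH) as f:
    coco = json.load(f)

# Map category ids to target items via substring match, since Fashionpedia
# category names can be compound (e.g. "shirt, blouse").
cat_to_item = {}
for cat in coco["categories"]:
    for item in TARGET_ITEMS:
        if item in cat["name"]:
            cat_to_item[cat["id"]] = item

# Collect up to MAX_PER_ITEM sufficiently large instances per target item.
selected = defaultdict(list)
for ann in coco["annotations"]:
    item = cat_to_item.get(ann["category_id"])
    if item and ann["area"] >= MIN_AREA and len(selected[item]) < MAX_PER_ITEM:
        selected[item].append(ann)

for item, anns in selected.items():
    print(f"{item}: {len(anns)} instances selected")
```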

Note: to run the notebook in your own Google Colab instance, you’ll need to download the val_test2020.zip and instances_attributes_val2020.json files from the Fashionpedia GitHub, upload them to your Google Drive, and update the relevant path variables in the third cell.

In the notebook, Stable Diffusion 1.5 is loaded using the Hugging Face diffusers library. Because the model checkpoint is public, no authentication or HF key is needed. Once the image variants are created, Stable Diffusion 1.5 is fine-tuned using low-rank adaptation, or LoRA [6]. LoRA reduces the memory demands of full fine-tuning by learning a low-rank update while keeping the base weights frozen, which limits the number of trainable parameters. Pre-trained weight matrices are typically full-rank, and a full fine-tuning update to a weight matrix involves as many trainable parameters as the number of rows times the number of columns. LoRA reduces this by introducing a hyperparameter, r: the inner dimension of the update, which sets an upper bound on its rank and determines the number of trainable parameters, with the pre-trained weight matrix frozen (no gradients) during backpropagation:

\[ h = W_0 x + \Delta W x = W_0 x + B A x, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k) \]
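To make the mechanics concrete, below is a minimal sketch of the LoRA idea applied to a single linear layer: the pre-trained weight is frozen and only the low-rank factors A and B are trained. The dimensions, rank, and scaling are illustrative choices, not a specific library’s implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # freeze W0 (no gradients)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # r x k, trainable
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r, trainable; zero init
        self.scaling = alpha / r                         # update starts as a no-op since B = 0

    def forward(self, x):
        # h = W0 x + (B A) x * scaling
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scaling

# Only (d * r) + (r * k) parameters are trainable instead of d * k.
lora_layer = LoRALinear(nn.Linear(768, 768), r=4)
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
print(trainable)  # 768*4 + 4*768 = 6,144 trainable parameters vs. 589,824 in W0
```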

LoRA can dramatically reduce the number of trainable parameters, presenting significant efficiency benefits in terms of storage and memory usage, as optimizer states for the frozen parameters do not need to be stored. The LoRA paper notes that, with the hardware configuration used, fine-tuning GPT-3 175B with LoRA reduced VRAM consumption by 2/3, from 1.2TB to 350GB. This corresponded with a 25% increase in training speed.

To fine-tune Stable Diffusion 1.5 with the beach / space / city backgrounds dataset, I used the LoRA configuration shown in the notebook, which resulted in just under 800k trainable parameters; a representative configuration is sketched below.
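For reference, the sketch below shows a representative way to attach a rank-4 LoRA adapter to the Stable Diffusion 1.5 UNet’s attention projections using diffusers and peft, which yields on the order of 800k trainable parameters; the checkpoint identifier, rank, alpha, and target modules are representative choices and may differ from the notebook’s exact configuration.

```python
from diffusers import UNet2DConditionModel
from peft import LoraConfig

# Load the Stable Diffusion 1.5 UNet and attach a LoRA adapter to its attention
# projections. Checkpoint id and LoRA hyperparameters are representative choices.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.requires_grad_(False)  # base weights stay frozen

lora_config = LoraConfig(
    r=4,                                               # inner dimension of the update
    lora_alpha=4,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
unet.add_adapter(lora_config)

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")  # roughly 800k at rank 4
```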

The training loop encodes each image’s caption into a text embedding and encodes each image into a latent, a low-dimensional tensor produced by the VAE encoder. Noise is added to the latent at a randomly sampled timestep, the model predicts the noise that was added, and the loss is calculated from the difference between the predicted and actual noise. The trainable (LoRA) weights are then updated through backpropagation; a sketch of one training step follows below. This runs for 8,000 training steps (35 epochs). Training on Google Colab, using an A100 GPU, took roughly 30 minutes.
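Here is a condensed sketch of one such training step, assuming the VAE, text encoder, tokenizer, LoRA-equipped UNet, noise scheduler, and optimizer have been set up as in a standard diffusers text-to-image LoRA loop; variable names are illustrative rather than the notebook’s exact code.

```python
import torch
import torch.nn.functional as F

def training_step(batch, vae, text_encoder, tokenizer, unet, noise_scheduler, optimizer):
    # 1. Encode the caption into text embeddings.
    tokens = tokenizer(
        batch["caption"], padding="max_length",
        max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt",
    )
    text_embeds = text_encoder(tokens.input_ids)[0]

    # 2. Encode the image into a low-dimensional latent with the VAE.
    latents = vae.encode(batch["pixel_values"]).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

    # 3. Add noise to the latent at a randomly sampled timestep (forward process).
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],)
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # 4. Predict the noise, conditioned on the text embedding, and compute the MSE loss.
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeds).sample
    loss = F.mse_loss(noise_pred, noise)

    # 5. Backpropagate; only the LoRA parameters receive gradient updates.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```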

While imperfect, the output clearly showcases the model’s ability to generate specific clothing items against explicit backgrounds (e.g., “shirt on the beach”). Given more training data, both additional clothing item examples and more background image variants, the model could be improved in terms of the diversity and fidelity of its output. Obviously, this is a toy example, but its application to ad creative production is clear: the model can use conditionality within the prompt to specify the background for a new, generated image, unlocking near-instantaneous variant production (each generated image rendered in roughly one second).
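Generating a new variant at inference time is then a matter of loading the base pipeline, attaching the fine-tuned LoRA weights, and prompting for the desired item and background. The sketch below assumes the adapter was saved to a local directory; the checkpoint identifier, path, and prompt are illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the base Stable Diffusion 1.5 pipeline and attach the fine-tuned LoRA
# weights, then prompt for an item-background combination.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("./fashionpedia-lora")  # hypothetical directory for the saved adapter

image = pipe(
    "a shirt on the beach",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("shirt_on_the_beach.png")
```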

Of course, inescapable limitations to a model like this exist:

  1. Brand consistency may be jeopardized by the generated output, even if the model is fine-tuned entirely on brand-safe imagery.
  2. It may be difficult or prohibitively expensive to amass a sufficient volume of training data to deploy a model that produces advertising creative of an acceptable quality standard.
  3. Merely changing superficial and acute aspects of the creative variants (conditions) may not result in advertising performance gains.

These are not trivial challenges or impediments. But as large ad platforms increasingly promote their end-to-end automation suites, creative variation — and the ability to generate large volumes of diverse creative — becomes vital for advertisers. While many of those platforms will offer this capability natively, not all will, and advertisers may see value in developing their own proprietary DDPM models for producing and deploying creative. Hopefully, this piece and the associated notebook provide an informative overview of how that might be done.

References:

[1] Xiao, Z., Kreis, K., & Vahdat, A. (2022). Tackling the generative learning trilemma with denoising diffusion GANs. International Conference on Learning Representations (ICLR 2022). arXiv:2112.07804.

[2] Zhong, Y., Mo, S., Xiao, C., Chen, P.-Y., & Zheng, S. (2019). Rethinking generative mode coverage: A pointwise guaranteed approach. arXiv preprint arXiv:1902.04697.

[3] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS 2020). arXiv:2006.11239.

[4] Anderson, B. D. O. (1982). Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3), 313–326.

[5] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations (ICLR 2021). arXiv:2011.13456.

[6] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, L., Wang, Y., & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

[7] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning (ICML 2015). arXiv:1503.03585.

[8] Neal, R. M. (1998). Annealed importance sampling. arXiv preprint arXiv:physics/9803008. (Later published in Statistics and Computing, 11, 125–139, 2001.)

[9] Nichol, A. Q., & Dhariwal, P. (2021). Improved denoising diffusion probabilistic models. International Conference on Machine Learning (ICML 2021). arXiv:2102.09672.

[10] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., et al. (2021). Learning transferable visual models from natural language supervision. International Conference on Machine Learning (ICML 2021). arXiv:2103.00020.
