Faster customization of text-to-image models with DVAR

Diffusion models are currently the most powerful approach to generating images from a text description. One popular application is the adaptation of such models to new concepts: usually, it is quite difficult or even impossible to write a prompt that accurately describes the target object.
At the same time, existing methods are too slow for practical purposes. In “Is This Loss Informative? Faster Text-to-Image Customization by Tracking Objective Dynamics”, our paper accepted at NeurIPS 2023, we propose DVAR, a lightweight early stopping criterion for gradient-based adaptation methods. Our experiments with Stable Diffusion show up to an 8x speedup with only minor losses in identity preservation and with less overfitting. We release the source code of DVAR so that anyone can speed up their adaptation pipeline.

Task description

Overview of the task using DreamBooth pipeline as an example. In this case, "[V]" is the pseudo-token which can now be used in a prompt
A typical task of adapting text-to-image models is to teach the model to generate images of a new concept, which is presented in the form of 3–4 pictures (see figure above from the DreamBooth paper). For example, you want to draw your dog on the beach, but the model hasn’t seen it. At the same time, it is extremely difficult to describe exactly how it looks with a textual prompt. It's much easier to take a few photos and show them to the model.
There are many methods to solve this problem. Still, most of them are computationally expensive (taking up to two hours on a single GPU) due to a large number of optimization steps. We consider the three most common approaches: Textual Inversion [1], Custom Diffusion [2], and DreamBooth [3]. Each method creates a new pseudo-token in the text encoder vocabulary and uses gradient descent to find its vector representation.
Returning to the example of the dog, we can create a new word, "[V]," and write a prompt: "[V] dog on the beach" and pass it to the model.

Diffusion models

To explain our method, we need to recall how diffusion models work. Diffusion models generate samples by iterative denoising. The forward process $x_0 \rightarrow x_T$ is defined as a stepwise addition of noise $\epsilon_t$ to an image $x_t$. Our model $\epsilon_{\theta}$ simulates the reverse process $x_T \rightarrow x_0$ by predicting the difference between steps, and is trained with the objective:
$|| \epsilon_t - \epsilon_{\theta}(x_t, c, t)||^2_2$, 
where $c$ represents a condition (caption embedding in case of text-to-image generation). These processes are visualized in the picture below.
Visualization of the forward process in diffusion models. $q(x_t|x_{t-1})$ and $p_{\theta}(x_{t-1}|x_t)$ denote iterative transitions in the forward and reverse processes, respectively. The image is taken from [4].
The sampling process runs for a fixed number of timesteps $T$ and starts from $x_T$ drawn from standard normal noise.
In this work, we study Stable Diffusion, which is a latent diffusion model [5]. This model acts in the lower-dimensional latent space of an autoencoder to reduce the required computation. More specifically, it uses a variational autoencoder (VAE) [6], which is also not deterministic.
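To make the training objective from this section concrete, here is a minimal pure-Python sketch on toy vectors. The closed-form noising step with `alpha_bar` follows the common DDPM parameterization, which is an assumption here (the text only says that noise is added stepwise), and `toy_model` is a hypothetical stand-in for $\epsilon_{\theta}$, not Stable Diffusion.

```python
import math
import random

def noisy_sample(x0, eps, alpha_bar):
    # Assumed DDPM-style closed form: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps
    scale_x = math.sqrt(alpha_bar)
    scale_eps = math.sqrt(1.0 - alpha_bar)
    return [scale_x * x + scale_eps * e for x, e in zip(x0, eps)]

def denoising_loss(eps, eps_pred):
    # Squared L2 norm of the difference between true and predicted noise
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

def toy_model(x_t, c, t):
    # Hypothetical stand-in for eps_theta: ignores c and t, predicts zero noise
    return [0.0 for _ in x_t]

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]                    # toy "image"
eps = [rng.gauss(0.0, 1.0) for _ in x0]   # sampled Gaussian noise
x_t = noisy_sample(x0, eps, alpha_bar=0.7)
loss = denoising_loss(eps, toy_model(x_t, c=None, t=10))
```

With a model that always predicts zero, the loss reduces to the squared norm of the sampled noise, which already hints at how strongly the loss value depends on the random draw.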

Loss correction

An overview of the training process for Textual Inversion with an example concept.
Intuitively, early stopping seems like the easiest way to speed up learning, but we need an indicator that the quality of the model is good enough. Typically, such an indicator is the loss function value or the norm of gradients. In our task, both turned out to be uninformative: their behavior is shown in the picture above. On the other hand, pairwise image CLIP [7] similarity with the train set grows sharply only in the early stages, indicating that a predefined number of training steps might be excessive for these adaptation methods. However, using this metric for early stopping is too costly because it requires intermediate image sampling and CLIP evaluation.
After further investigation, we observed that the training loss contains a high amount of noise, which was not addressed in previous works. The reason for this is that the training and generation processes include multiple factors of randomness, namely:
  1. Sampling of input images
  2. Sampling of corresponding captions
  3. Stochastic autoencoder of image representations (VAE)
  4. Selection of the diffusion timestep
  5. Sampling of starting Gaussian noise
We have identified how each factor affects the training loss by training the model in the original setup and additionally evaluating semi-deterministic losses with resampling only a single factor. We found that captions and VAE encoder noise do not affect the loss value; on the other hand, sampled diffusion timesteps, noise, and images make the loss uninformative.
Loss behavior in the semi-deterministic setup: column names correspond to inputs that are resampled for each evaluation batch.
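One way to picture the semi-deterministic setup is to give each source of randomness its own random generator and to change the seed between evaluations only for the factor being studied. The sketch below is a hypothetical illustration of that idea in plain Python; the factor names and the `factor_rngs` helper are assumptions and are not taken from the released code.

```python
import random

# The five sources of randomness listed above (names are illustrative)
FACTORS = ("images", "captions", "vae_noise", "timesteps", "noise")

def factor_rngs(resampled, eval_index, base_seed=0):
    """One generator per factor; only `resampled` changes across evaluations."""
    rngs = {}
    for name in FACTORS:
        # Fixed factors always get offset 0, the studied factor gets a new one
        offset = eval_index + 1 if name == resampled else 0
        rngs[name] = random.Random(f"{base_seed}/{name}/{offset}")
    return rngs

first = factor_rngs("timesteps", eval_index=0)
second = factor_rngs("timesteps", eval_index=1)
draws_first = {name: rng.random() for name, rng in first.items()}
draws_second = {name: rng.random() for name, rng in second.items()}
# Factors whose random draws differ between the two evaluations
changed = [name for name in FACTORS if draws_first[name] != draws_second[name]]
```

Here only the `timesteps` stream differs between the two evaluations, so any change in the evaluated loss could be attributed to that factor alone.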

DVAR

Our method consists of two parts: a deterministic loss function $\mathcal{L}_{det}$ and an early stopping criterion. To obtain $\mathcal{L}_{det}$, it is enough to fix a single batch before starting training and conduct an additional forward pass of the model on this batch every few steps. Since we do not perform the backward pass, this procedure does not significantly slow down the learning process or affect it in any way. As you can see in the figure below, this loss becomes indicative.
Comparison of original and deterministic loss for a specific concept.
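The deterministic loss can be sketched as follows: sample one batch (images, captions, timesteps, noise) once before training and reuse it for every evaluation. This is a minimal pure-Python illustration under stated assumptions; `make_fixed_batch`, `deterministic_loss`, and `toy_loss` are hypothetical names, and in a real PyTorch pipeline the evaluation would run without gradient tracking and on VAE latents encoded once.

```python
import random

def make_fixed_batch(images, captions, num_timesteps, batch_size, seed=0):
    # Sample every random input once, before training starts
    rng = random.Random(seed)
    idx = [rng.randrange(len(images)) for _ in range(batch_size)]
    return {
        "images": [images[i] for i in idx],
        "captions": [captions[i] for i in idx],
        "timesteps": [rng.randrange(num_timesteps) for _ in idx],
        "noise": [[rng.gauss(0.0, 1.0) for _ in images[i]] for i in idx],
    }

def deterministic_loss(loss_fn, batch):
    # Forward pass only: the same inputs are reused at every evaluation
    return loss_fn(batch["images"], batch["captions"],
                   batch["timesteps"], batch["noise"])

def toy_loss(images, captions, timesteps, noise, weight=0.5):
    # Hypothetical stand-in for the model: "predicts" weight * pixel as noise
    total, count = 0.0, 0
    for img, eps in zip(images, noise):
        for x, e in zip(img, eps):
            total += (e - weight * x) ** 2
            count += 1
    return total / count

images = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
captions = ["a photo of [V]"] * 3
batch = make_fixed_batch(images, captions, num_timesteps=1000, batch_size=4)
first_eval = deterministic_loss(toy_loss, batch)
second_eval = deterministic_loss(toy_loss, batch)  # identical to first_eval
```

Because the batch never changes, any difference between two evaluations comes from the model parameters alone, which is what makes the curve readable.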
Based on this loss, we developed Deterministic VARiance Evaluation (DVAR), a simple variance-based early stopping criterion. It maintains a rolling variance estimate of $\mathcal{L}_{det}$ over the last N steps. Once this rolling variance falls below a fraction $\alpha$ of the global variance estimate, we stop training. This criterion is easy to implement and has just two hyperparameters that are easy to tune. A NumPy/PyTorch implementation is shown below; for a full example of using DVAR, see our paper or the GitHub repository.
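def DVAR(losses, window_size, threshold):
    # losses: array/tensor of deterministic loss values, oldest first
    # Variance of the deterministic loss over the most recent window
    running_var = losses[-window_size:].var()
    # Variance over the whole training history so far
    total_var = losses.var()
    # Stop once the recent variance is a small fraction of the global one
    ratio = running_var / total_var
    return ratio < threshold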
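As a usage illustration, the sketch below applies the same criterion to a synthetic loss curve that decays toward a plateau. It is written in pure Python with `statistics.pvariance`, which matches NumPy's default population variance (PyTorch's `var` applies Bessel's correction by default, so its ratio can differ slightly). The guards for a short history and for zero variance, the curve itself, and the hyperparameter values are assumptions added for this example.

```python
import math
import statistics

def dvar_should_stop(losses, window_size, threshold):
    # Assumed guard: wait until at least one full window has been observed
    if len(losses) < window_size:
        return False
    running_var = statistics.pvariance(losses[-window_size:])
    total_var = statistics.pvariance(losses)
    # Assumed guard: a constant history gives no variance signal
    if total_var == 0.0:
        return False
    return running_var / total_var < threshold

# Synthetic stand-in for the deterministic loss: decay toward a plateau at 0.2
losses = []
stop_step = None
for step in range(500):
    losses.append(0.2 + math.exp(-step / 30.0))
    if dvar_should_stop(losses, window_size=50, threshold=0.01):
        stop_step = step
        break
```

On this curve the criterion fires well before the 500-step budget, once the last 50 values have flattened out relative to the full history.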

Results

To evaluate the quality of customized models, we identified three characteristics: identity preservation, reconstruction ability, and unseen prompt alignment. To assess the first two, we compute CLIP image-to-image similarity between images generated from train/validation prompts and the reference images. We named these metrics Train CLIP img and Val CLIP img, respectively. To evaluate the third characteristic, we use CLIP image-to-text similarity between validation images and prompts; we called this metric Val CLIP txt.
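CLIP similarity is commonly computed as the cosine similarity between embedding vectors, averaged over all pairs. The sketch below assumes that the embeddings have already been produced by a CLIP encoder (not shown) and uses tiny hand-written vectors instead; the function names are illustrative, and the exact aggregation used in the paper may differ.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(embeddings_a, embeddings_b):
    # Average similarity over every (a, b) pair
    scores = [cosine_similarity(a, b)
              for a in embeddings_a for b in embeddings_b]
    return sum(scores) / len(scores)

# Toy 2-D "embeddings" standing in for CLIP features
generated = [[1.0, 0.0], [0.6, 0.8]]
references = [[1.0, 0.0]]
score = mean_pairwise_similarity(generated, references)  # about 0.8
```

The same function covers both metric types: pass two sets of image embeddings for the image-to-image metrics, or image embeddings and prompt embeddings for the image-to-text one.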
Our experiments were conducted on datasets provided by the authors of all three customization methods (48 concepts in total). Aggregated results for each approach are shown in the table below. We compare our approach with several baselines: the original training setup (Baseline), CLIP-based early stopping (CLIP-s), and a baseline with a fixed number of steps equal to average CLIP-s iterations (Few Iters). We observed that DVAR is more efficient than Baseline and CLIP-s in terms of overall runtime. An additional advantage of DVAR over Few Iters is its adaptability because not all concepts are equally easy to learn.

Textual Inversion

Method | Train CLIP img | Val CLIP img | Val CLIP txt | Iterations | Time (min)
Few Iters | 0.796 | 0.744 | 0.232 | 475 | 1.6

DreamBooth

Method | Train CLIP img | Val CLIP img | Val CLIP txt | Iterations | Time (min)
Few Iters | 0.855 | 0.806 | 0.219 | 367 | 1.9

Custom Diffusion

Method | Train CLIP img | Val CLIP img | Val CLIP txt | Iterations | Time (min)
Few Iters | 0.855 | 0.806 | 0.219 | 367 | 1.9
In addition to the automatic evaluation, we also conducted a human evaluation. We ran two surveys on a crowdsourcing platform, Toloka. The first survey compared the ability to recreate training images of a model trained with DVAR to the baseline. Annotators were asked to choose the image that was most similar to the reference object. The second survey compared samples created with new prompts to see how well the model could be customized. Participants had to decide which image matched the prompt better. Our findings are shown in the table below; as you can see, DVAR enables early stopping without compromising reconstruction quality for two out of three customization methods. Although applying DVAR to Textual Inversion slightly decreases reconstruction quality, this is likely due to overfitting of the original method. Other customization methods that use fewer iterations can avoid this problem.
Textual Inversion | 41.6 | 79.9
Custom Diffusion | 69.9 | 93.8
Another important advantage of DVAR is reduced overfitting. In the adaptation task, overfitting leads to a worse alignment with an unseen validation prompt. In other words, the model tends to generate outputs that resemble the training images, regardless of the input prompt, as shown in the figure below.
Side-by-side comparison of DVAR reconstruction quality and validation prompt alignment.

Conclusion

Rapid adaptation of diffusion models is an emerging area of research. In our work, we present DVAR, a lightweight and easy-to-implement early stopping criterion. Our approach simplifies the practical use of personalized models by reducing the computational cost of training them. We also believe that our analysis of the diffusion objective's behavior will lead to a deeper understanding of the entire training process. Since the proposed deterministic loss corresponds to model performance better than the original one, it could be used not only for the adaptation task but also for training diffusion models.