What are diffusion models?
The first diffusion model, introduced in 2015, was inspired by non-equilibrium statistical physics. It proposed a novel method for generating data by iteratively destroying structure in the data through a process of diffusion and then teaching a model to undo the destruction and reconstruct the data.
Diffusion models are inspired by the process of diffusion, which describes the spreading of particles from a region of high concentration to one of low concentration. Think about letting a drop of blue food coloring fall into a beaker of water. Before the drop reaches the water, its blue-colored molecules are packed into a highly concentrated little droplet. Once the droplet hits the water, the blue-colored molecules gradually spread out, turning the water blue.
The same thing can happen with images. Just as the food coloring disperses into the water, an image can be transformed into a noisy version where its pixels are spread out and scrambled up, making the original image unrecognizable. This is part of how diffusion models are pre-trained. Images in the training dataset undergo the process of diffusion until the pixels disperse into large arrays of random static that no longer resemble images. After diffusion occurs, the model essentially teaches itself to undo the diffusion and restore the images to their original state.
Diffusion models have become a popular choice for image generation over more traditional methods such as GANs (Generative Adversarial Networks) and VAEs (Variational Autoencoders) for several reasons, as we'll discuss later. But most importantly, diffusion models generate high-quality, photorealistic images that are becoming increasingly difficult for humans to distinguish from non-generated images. Understanding the model architecture will help demonstrate why diffusion models are such a good choice for image generation.
Pre-training diffusion models
Training diffusion models for image generation requires a large dataset of existing images. For instance, if the goal is to train a model that generates images of dogs, you would need thousands of images of different breeds of dogs for the model to learn from.
The pre-training process begins by taking a real image, such as a photo of a corgi, and feeding it to an image encoder, which transforms the high-dimensional input image into a lower-dimensional representation that captures the essential features of the image while discarding the noise. Once an image passes through the encoder, it's essentially not an image anymore, but rather a vector of numbers representing the image's features. The image is now embedded in the latent space, a compressed representation space in which each point corresponds to some combination of features of the data. In the case of images, different regions of the latent space could correspond to different breeds, colors, or other characteristics of dogs.
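To make this concrete, here's a toy sketch in PyTorch of what an image encoder does. The layer sizes, the 4x8x8 latent shape, and the name ToyImageEncoder are illustrative assumptions for this article, not Stable Diffusion's actual encoder (which produces a larger grid of latent features).

```python
import torch
import torch.nn as nn

class ToyImageEncoder(nn.Module):
    """Toy encoder: compresses a 3x64x64 image into a small 4x8x8 latent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Conv2d(64, 4, 4, stride=2, padding=1),               # 16x16 -> 8x8
        )

    def forward(self, images):
        return self.net(images)  # (batch, 4, 8, 8): a point in the latent space

# A batch of two stand-in "corgi photos" becomes two points in the latent space.
encoder = ToyImageEncoder()
images = torch.randn(2, 3, 64, 64)
latents = encoder(images)
print(latents.shape)  # torch.Size([2, 4, 8, 8])
```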
Next, the image undergoes diffusion, a process in which noise is iteratively added to the image with the intention of distorting it. In this context, "noise" refers to random variations or distortions added to the image, gradually obscuring the original details until the image is completely unrecognizable. The image undergoes a transformation until it becomes pure noise. This process, however, is not the end goal. Instead, it serves as a setup for the model to learn the reverse process.
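Here's a minimal sketch of that forward (noising) process, assuming the common formulation in which Gaussian noise is mixed in according to a fixed schedule. The number of steps and the linear schedule below are illustrative choices, not the exact values any production model uses.

```python
import torch

T = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule (assumed linear)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # how much original signal survives by step t

def add_noise(x0, t):
    """Jump straight to step t: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*noise."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return x_t, noise

x0 = torch.randn(1, 4, 8, 8)               # a clean latent from the encoder sketch
slightly_noisy, _ = add_noise(x0, t=10)    # early step: mostly still the original
pure_static, _ = add_noise(x0, t=T - 1)    # last step: essentially all noise
```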
Once the image is completely distorted, it undergoes the process of denoising, in which the model learns to map the noisy images to points in this latent space, capturing the essential features of the images and iteratively reversing the distortions added to the image in the initial diffusion process. The model architecture often used for this task is a UNet. For more information on the components of UNet and a more detailed explanation of diffusion, check out this article by Jay Alammar.
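The UNets used in real diffusion models are far larger and include attention layers and timestep embeddings. The toy sketch below only shows the defining idea, a downsampling path, a bottleneck, and an upsampling path joined by a skip connection, operating on the hypothetical 4x8x8 latents from the earlier sketch; every layer size here is an assumption.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal UNet-style denoiser: predicts the noise present in a noisy latent."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU())  # 8x8 -> 4x4
        self.mid = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())            # bottleneck
        self.up = nn.ConvTranspose2d(32, 4, 4, stride=2, padding=1)                     # 4x4 -> 8x8
        self.skip = nn.Conv2d(8, 4, 1)  # fuse upsampled features with the input (skip connection)

    def forward(self, noisy_latent):
        d = self.down(noisy_latent)
        u = self.up(self.mid(d))
        return self.skip(torch.cat([u, noisy_latent], dim=1))  # same shape as the input

denoiser = TinyUNet()
noisy_latent = torch.randn(1, 4, 8, 8)
predicted_noise = denoiser(noisy_latent)  # the model's guess at what noise was added
```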
After the vector is denoised, it's sent to the final image decoder, which converts it from a numerical representation back into an actual image, which serves as the final output.
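Continuing the toy example, a decoder can simply mirror the hypothetical encoder above, upsampling the 4x8x8 latent back into a 3x64x64 image. Again, the layers are illustrative assumptions rather than any real model's decoder.

```python
import torch
import torch.nn as nn

class ToyImageDecoder(nn.Module):
    """Toy decoder: maps a 4x8x8 latent back to a 3x64x64 image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8x8 -> 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 32x32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),   # 32x32 -> 64x64
        )

    def forward(self, latent):
        return self.net(latent)  # (batch, 3, 64, 64): pixel values in [-1, 1]

decoder = ToyImageDecoder()
image = decoder(torch.randn(1, 4, 8, 8))
print(image.shape)  # torch.Size([1, 3, 64, 64])
```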
The model is trained using a method that iteratively adjusts its parameters to better predict the original data from the noisy images. This way, the model learns to create new images by reversing the noise addition, turning a point in the latent space into a realistic image.
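Putting the pieces together, here's a sketch of one training step under the widely used noise-prediction objective: noise a clean example to a randomly chosen step, ask the model to predict that noise, and minimize the mean squared error. It reuses the hypothetical ToyImageEncoder, TinyUNet, T, and add_noise from the sketches above.

```python
import torch
import torch.nn.functional as F

image_encoder = ToyImageEncoder()
denoiser = TinyUNet()
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def training_step(images):
    with torch.no_grad():
        latents = image_encoder(images)             # embed the photos in latent space
    t = torch.randint(0, T, (1,)).item()            # pick a random diffusion step
    noisy_latents, true_noise = add_noise(latents, t)
    predicted_noise = denoiser(noisy_latents)       # the UNet guesses the added noise
    loss = F.mse_loss(predicted_noise, true_noise)  # how wrong was the guess?
    optimizer.zero_grad()
    loss.backward()                                 # adjust parameters to guess better next time
    optimizer.step()
    return loss.item()

batch = torch.randn(4, 3, 64, 64)  # stand-in for a batch of real dog photos
print(training_step(batch))
```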
Running inference on diffusion models
After the diffusion model has been successfully pre-trained or fine-tuned, it's ready to be used to generate novel images. Just as with a transformer, running inference on a diffusion model requires a text prompt, which in this case is an in-depth description of the image you'd like the model to generate.
First, the prompt is run through a text encoder, which transforms the textual description into a numerical representation, often a vector in a high-dimensional space. This representation, which captures the semantic meaning of the prompt, serves as the starting point in the latent space for the image generation process.
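A sketch of what the text encoder does is below. Real systems rely on a pretrained encoder (Stable Diffusion uses CLIP's text encoder); the hash-based tokenizer and the tiny transformer here are stand-ins meant only to show the shape of the operation: a prompt goes in, a single embedding vector comes out.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10_000, 256, 16

class ToyTextEncoder(nn.Module):
    """Toy text encoder: turns token IDs into a single prompt embedding."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        hidden = self.transformer(self.embed(token_ids))
        return hidden.mean(dim=1)  # one vector summarizing the whole prompt

def toy_tokenize(prompt):
    # Stand-in tokenizer: hash each word into a fixed vocabulary.
    ids = [hash(word) % VOCAB_SIZE for word in prompt.lower().split()][:MAX_LEN]
    return torch.tensor([ids])

text_encoder = ToyTextEncoder()
prompt_embedding = text_encoder(toy_tokenize("a corgi wearing a tiny wizard hat"))
print(prompt_embedding.shape)  # torch.Size([1, 256])
```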
The encoded vector then goes through the process of diffusion, in which a noisy version of the image is generated based on representations learned during the training phase. The model then begins the denoising process, which it learned during training. It progressively removes the noise from the vector, step by step. The resulting denoised vector is then fed to the image decoder, which converts it from a collection of numbers to an image, which serves as the final output from the model.
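Below is a heavily simplified sampling loop that ties the earlier sketches together. Real samplers (DDPM, DDIM, and others) use carefully derived update rules and feed the prompt embedding into the denoiser through cross-attention; this toy version only shows the overall flow: start from pure noise in the latent space, repeatedly remove predicted noise, and decode at the end.

```python
import torch

# Reuses T, betas, alphas, alpha_bars, TinyUNet, and ToyImageDecoder from the sketches above.
@torch.no_grad()
def generate(denoiser, image_decoder, prompt_embedding):
    x = torch.randn(1, 4, 8, 8)                  # start from pure noise in the latent space
    for t in reversed(range(T)):                 # walk the diffusion steps backwards
        eps = denoiser(x)                        # a real denoiser would also take t and
                                                 # prompt_embedding (via cross-attention)
        alpha, a_bar, beta = alphas[t], alpha_bars[t], betas[t]
        x = (x - beta / (1 - a_bar).sqrt() * eps) / alpha.sqrt()  # DDPM-style mean update
        if t > 0:
            x = x + beta.sqrt() * torch.randn_like(x)             # re-inject a little noise
    return image_decoder(x)                      # decode the clean latent into pixels

image = generate(TinyUNet(), ToyImageDecoder(), prompt_embedding)
print(image.shape)  # torch.Size([1, 3, 64, 64])
```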
Initially, diffusion models were mainly used for generating images from prompts. However, in 2023, the practical applications of diffusion models expanded significantly with the launch of Adobe's Firefly, a suite of generative AI tools integrated into Adobe's existing products. These tools allow users not only to create new images but also to edit and transform parts of existing images, or entire images, exclusively through prompting.
Advantages of diffusion models
Diffusion models are able to produce high-quality images while requiring fewer computational resources than other, more traditional image generation methods, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). These models experienced a transformative year in 2022, with the release of Midjourney, OpenAI's DALL·E 2, and Google's Imagen. Nevertheless, one model reigns supreme as the queen of diffusion models: Stability AI's Stable Diffusion.
At the time of its release in 2022, the Stable Diffusion model represented a significant leap forward for diffusion models on many fronts. Benefiting from a robust training dataset and a more advanced and efficient model architecture, Stable Diffusion is the current SOTA leader in diffusion models. If you're interested in learning more about the revolutionary nature of Stable Diffusion, I recommend checking out this article on Towards Data Science.
Let's take a look at the advantages of using Stable Diffusion, in particular:
- High-quality output: Output images tend to more accurately reflect their corresponding text prompts and include nuanced details that other models don't capture.
- Computational efficiency: Diffusion models cost less to pre-train (Stable Diffusion cost only ~$600,000) and to run inference on than competing models, thanks to tweaks in model architecture such as the introduction of FlashAttention in Stable Diffusion.
- Better scaling: Better handling of spatial data leads to high image quality during both compression and enlargement of images.

Diffusion models will likely become even more efficient and less expensive to pre-train in the future. In fact, in early 2023, MosaicML calculated that it would cost less than $160,000 to train Stable Diffusion from scratch using its own platform. However, it's critical to be aware of the pitfalls and risks of using diffusion models for image generation.
Limitations of diffusion models
- Difficulty handling high-dimensional data: A slight loss of image quality may occur on tasks that require high precision, such as working with super-resolution images.
- Quality control: The probabilistic nature of diffusion models means they produce varying results for each run, even with identical inputs, which creates challenges in maintaining consistent quality or achieving specific outcomes.
- Deepfakes: Highly realistic outputs exacerbate the risk of exploitation by bad actors who use the models to manipulate images in order to deceive or cause harm to others.
- The technology is new: Because the use of diffusion models for image generation is new, scientists still have a limited understanding of the technology compared to other methods.
- Model bias: Generative models tend to amplify the biases of their training data. Bloomberg journalists have already found that Stable Diffusion amplifies toxic stereotypes about race and gender.

Overall, diffusion models have come a long way in the relatively short time they have been used for image generation. In 2022, diffusion models reached a turning point, demonstrating considerable improvements in image generation quality and computational efficiency, with Stable Diffusion reigning supreme. Stable Diffusion's improved model architecture, cost efficiency, and high-quality outputs make it the ideal choice for most image generation tasks. As with transformers and all other generative machine learning models, diffusion models can produce harmful and misleading content and intensify the existing biases of their training data.