Stable Diffusion, Intuitively and Exhaustively Explained

January 30, 2024

22nd of August, 2022. Ever since the dawn of computers, we have fantasized about a human-like interface that would listen to us and create. In seconds!

The dream of visual creators everywhere had been realized: visual creativity had been unleashed.

Stable Diffusion came out: the 1st fully-fledged text-to-image AI model to be released openly, with weights anyone could download and run.

Naturally, the 1st thing everybody wanted to create was …

a flying cat in space

Image created with Stable Diffusion

or …

a pirate labrador with a walking stick

Image created with Stable Diffusion

But what sparked this revolution in text-to-image generation, and how can the model create such fine-grained images? ↓

Here’s What You’ll Learn:

  1. How the “Overnight” Revolution in Text-to-Image Occurred
  2. How Diffusion Models Work Intuitively
  3. Stable Diffusion Architecture
  4. What Lies Ahead for Diffusion Models in 2024

Who should read this?

Who is this blog post useful for? ML researchers, but also tech enthusiasts of any level, from beginners to experts.

How advanced is this post? Anybody previously acquainted with ML terms should be able to follow along.

Generate your own pirate dogs (and other stuff) with Stable Diffusion here: https://huggingface.co/spaces/stabilityai/stable-diffusion
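
Prefer to run it locally? Here is a minimal sketch using Hugging Face's diffusers library. This is a sketch only: the checkpoint name, step count, and GPU assumption are illustrative, not prescriptive.

```python
# Minimal text-to-image sketch with the diffusers library.
# Assumes `torch` and `diffusers` are installed and a CUDA GPU is available.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative Stable Diffusion v1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a pirate labrador with a walking stick",
    num_inference_steps=30,  # fewer steps = faster, slightly rougher images
).images[0]
image.save("pirate_labrador.png")
```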

Okay, It wasn’t really overnight …

AI models like Stable Diffusion are the result of more than 50 years of research in computer vision, the study of how machines can understand and recreate visual perception. Here’s what had to happen for Stable Diffusion to be possible:

  • 2015: alignDRAW from the University of Toronto introduced a pioneering text-to-image model, marking an early exploration into generating images from textual input [2].
  • 2016: Reed, Scott, et al. leveraged Generative Adversarial Networks (GANs), a breakthrough in neural network architecture, to enhance image quality.
  • 2021: OpenAI’s DALL-E brought forth a revolutionary use of the transformer architecture, a novel neural network model. This innovation significantly improved image generation by capturing intricate details and expanding the range of possible visual outputs.
  • 2022: Google Brain’s Imagen entered the scene as a competitor to DALL-E, introducing advancements that further pushed the boundaries of image quality. The rivalry spurred a race for refining image generation capabilities.
  • 2022: Stable Diffusion emerged as a notable improvement in diffusion models within the latent space. It facilitated open-source variations and fine-tuned models, contributing to enhanced image quality. The diffusion model allowed for smoother and more coherent transitions in generating images, addressing issues of clarity and realism.
  • 2023: The progression continued with new models and applications, extending beyond text-to-image to text-to-video or text-to-3D, showcasing the evolution and versatility of image generation technology.

Image created by Author

Diffusion Models, Intuitively Explained

In machine learning, the term diffusion originally described how information or influence spreads through a network or graph. Diffusion models borrow this idea: their goal is to learn the latent structure of a dataset by modeling the way in which data points diffuse through the latent space. In computer vision, this means a neural network is trained to denoise images corrupted with Gaussian noise by learning to reverse the diffusion process.

Image by Author
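
To make the forward (noising) half of the picture concrete, here is a toy sketch of mixing an image with Gaussian noise according to a schedule. The linear schedule and timestep values are illustrative assumptions, not Stable Diffusion's exact settings.

```python
# Toy forward-diffusion sketch: mix an image with Gaussian noise per a schedule.
import torch

# Illustrative linear beta schedule over 1000 timesteps
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Return x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

x0 = torch.rand(3, 64, 64)               # stand-in for a real image
slightly_noisy = add_noise(x0, 50)       # early timestep: image still recognizable
almost_pure_noise = add_noise(x0, 999)   # late timestep: essentially Gaussian noise
```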

Up to this point, we had been using GANs to create images: a well-trained game between a discriminator and a generator, in which the generator turns a random noise vector into an image and the discriminator tries to tell generated images from real ones.

Now, imagine taking a noisy picture and running it through a neural network that predicts the noise so it can be subtracted. This process, known as a diffusion model, aims to reveal the original image by iteratively predicting and removing a little noise at each step. It’s like peeling back layers until the true picture emerges from the chaos. The neural network refines the image with each iteration, converging towards its pristine form. In essence, diffusion models use this step-by-step denoising to restore clarity and unveil the hidden beauty in noisy images.

Whereas a GAN generates an image from noise in a single pass, diffusion models add noise to training images and learn to subtract it, step by step.

Image by SabrePC
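
The reverse direction can be sketched in a few lines. This is a deliberately simplified cartoon of the real update rule; noise_predictor is a hypothetical stand-in for a trained denoising network such as Stable Diffusion's U-Net.

```python
# Conceptual reverse-diffusion loop: start from pure noise and repeatedly
# subtract the noise a (hypothetical) trained network predicts is present.
import torch

def generate(noise_predictor, steps=50, shape=(3, 64, 64)):
    x = torch.randn(shape)                       # start from pure Gaussian noise
    for t in reversed(range(steps)):
        predicted_noise = noise_predictor(x, t)  # "how much noise is in x right now?"
        x = x - predicted_noise / steps          # peel away a little noise each step
    return x                                     # ideally, a clean image emerges
```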

How Stable Diffusion Works

Now, let’s talk about Stable Diffusion. Stable Diffusion is a latent diffusion model: instead of denoising full-resolution images directly in pixel space, it compresses each image into a smaller latent representation and runs the diffusion process there, which makes both training and generation dramatically cheaper.

Source: Stable Diffusion Paper

Intuitively, we can think of the forward process as progressively “blurring” an image with noise, similar to how an image is blurred to create a smoother, less detailed version of the original, and of the model as learning to undo that blurring one step at a time. Because the diffusion happens in a compact latent space rather than on raw pixels, the process is efficient and robust to noise and perturbations in the input data.

1. Latent Space Embedding

We use a pretrained autoencoder to compress the image data into lower-dimensional latent representations (for Stable Diffusion v1, a 512×512×3 image becomes a 64×64×4 latent); its decoder maps latents back to images.

  • To generate an image from a prompt, we first encode the prompt into a vector representation using a pre-trained text encoder network. This embedding conditions every step of the diffusion process.
  • We then start from a random noisy latent and iteratively apply the diffusion process to it, using a noise schedule (the diffusion step size) that controls how much noise is removed at each step.

Source: Stable Diffusion Paper
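
Here is a sketch of this compression step with the diffusers AutoencoderKL. The checkpoint name and the 0.18215 scaling factor follow the commonly used SD v1 setup; treat them as assumptions.

```python
# Encode an image into the latent space and decode it back with the VAE.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.rand(1, 3, 512, 512) * 2 - 1            # stand-in image, scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # shape: 1 x 4 x 64 x 64
    latents = latents * 0.18215                       # SD v1 latent scaling factor
    reconstruction = vae.decode(latents / 0.18215).sample  # back to 1 x 3 x 512 x 512
```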

2. Diffusion in Latent Space

  • Forward Diffusion = gradually adding noise to the latent image
  • Reverse Diffusion = gradually removing noise from the latent image

Source: Stable Diffusion Paper
  1. At each iteration, a denoising network looks at the current noisy latent and predicts the noise it contains, conditioned on the encoded prompt.
  2. During training, this prediction is compared to the noise that was actually added, using a similarity metric such as the L1 or L2 distance.
  3. At generation time, the predicted noise is subtracted according to the noise schedule, so the latent is gradually refined step by step; after the final iteration, a pre-trained decoder network turns the clean latent into the output image described by the prompt.
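
Putting this loop together with diffusers building blocks looks roughly like the sketch below. It assumes the checkpoint name is available and that text_emb comes from the text encoder shown in the next section.

```python
# Sketch of the latent-space denoising loop: the U-Net predicts noise, the scheduler removes it.
import torch
from diffusers import UNet2DConditionModel, DDIMScheduler

repo = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")
scheduler.set_timesteps(50)                       # 50 denoising iterations

latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma
text_emb = torch.zeros(1, 77, 768)                # placeholder for the CLIP text embedding

for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# `latents` can now be decoded into an image with the VAE decoder from step 1.
```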

3. Input Text Conditioning

The true power of the Stable Diffusion model is that it can generate images from text prompts. This is done by modifying the inner diffusion model (the U-Net) to accept conditioning inputs through cross-attention layers that attend to the encoded prompt.

Source: Stable Diffusion Paper
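
Concretely, the prompt is tokenized and encoded by CLIP's text encoder, and the resulting embedding is fed to the U-Net at every denoising step. A sketch follows; SD v1 uses OpenAI's CLIP ViT-L/14, and the exact usage shown here is illustrative.

```python
# Turn a text prompt into the conditioning tensor used by the U-Net.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a flying cat in space",
    padding="max_length", max_length=77, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids).last_hidden_state  # shape: 1 x 77 x 768

# `text_emb` is what the denoising loop above passes as `encoder_hidden_states`.
```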

By repeating this process for multiple iterations, we generate a sequence of images that gradually become more and more similar to the target image described by the prompt. This process can be further optimized using techniques such as annealed Langevin dynamics (borrowed from score-based generative models), which help to improve the stability and convergence of the diffusion process.

It’s 2024! What’s Next for Diffusion Models?

What is the future of text-to-image?

  1. ControlNets: conditioning diffusion model outputs with an additional image, such as an edge map or pose skeleton (see the sketch after this list).
  2. Spanning Across Business: You can expect diffusion models to have a variety of product applications across verticals in content, design, and e-commerce.
  3. Sharper Pics: Larger models, larger datasets, and higher accuracy.
  4. Real-Time Boost: Watch out for diffusion models doing their thing in real-time. It’s not just about pics; we’re talking quick fixes for all kinds of visuals, from selfies to medical scans.
  5. Mixing It Up: Imagine diffusion models teaming up with audio or text for mind-blowing content. Get ready for a wild ride in multimedia and virtual worlds.
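
As a taste of item 1, here is a sketch of ControlNet-style conditioning with diffusers. The checkpoint names are commonly used community models; treat the exact IDs and the input file name as assumptions.

```python
# Guide generation with an extra conditioning image (e.g. Canny edges) via ControlNet.
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)

edge_map = load_image("canny_edges.png")  # hypothetical pre-computed edge image
image = pipe("a pirate labrador with a walking stick", image=edge_map).images[0]
image.save("controlled_labrador.png")
```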

Hey, thanks for reading!

Thanks for getting to the end of this article. My name is Tim, and I love breaking down ML research papers and ML applications, with an emphasis on business use cases! Get in touch!