August 22nd, 2022. Ever since the dawn of computers, we have fantasized about a human-like interface that would listen to us and create. In seconds!
That day, the dream of visual creators everywhere was realized. Visual creativity had been unleashed.
Stable Diffusion came out: the first fully open-source, fully-fledged text-to-image AI model.
Naturally, the first thing everybody wanted to create was …
a flying cat in space
or …
a pirate labrador with a walking stick
But what sparked this revolution in text-to-image generation, and how can a model create such detailed images? ↓
Who is this blog post useful for? ML researchers, but also tech enthusiasts, whether beginners or experts.
How advanced is this post? Anybody previously acquainted with ML terms should be able to follow along.
Generate your own pirate dogs (and other stuff) with Stable Diffusion here: https://huggingface.co/spaces/stabilityai/stable-diffusion
AI models like Stable Diffusion are the result of more than 50 years of research in computer vision, the field that tries to understand and recreate our visual perception. Here's what had to happen for Stable Diffusion to be possible:
In machine learning, diffusion models learn the latent structure of a dataset by modeling the way data points diffuse through latent space. In computer vision, this means a neural network is trained to denoise images corrupted with Gaussian noise, learning to reverse the diffusion process step by step.
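To make this concrete, here is a minimal sketch of the forward (noising) process from the DDPM formulation. It is an illustration, not production code, and all the names are my own:

```python
import torch

# Forward diffusion q(x_t | x_0): blend the clean image with Gaussian noise.
# A minimal sketch assuming the standard DDPM linear beta schedule.
def add_noise(x0, t, betas):
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)[t]   # how much signal survives by step t
    noise = torch.randn_like(x0)                  # fresh Gaussian noise
    # Closed-form noising: scale the image down, scale the noise up.
    xt = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise
    return xt, noise

betas = torch.linspace(1e-4, 0.02, 1000)          # the schedule from the DDPM paper
x0 = torch.rand(1, 3, 64, 64) * 2 - 1             # a dummy image scaled to [-1, 1]
xt, target_noise = add_noise(x0, t=500, betas=betas)
```

During training, the network is shown `xt` and the timestep and asked to predict `target_noise`; that is the entire learning objective.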
Up to this point, we had been using GANs to create images: a carefully trained game between a generator, which turns a random noise vector into an image, and a discriminator, which tries to tell the generated images apart from real ones.
Now, imagine taking a noisy picture and running it through a neural network that predicts the noise in it. Subtract a little of that predicted noise and repeat: with each iteration the estimate gets cleaner, like peeling back layers until the true picture emerges from the chaos. This iterative denoising is the core of a diffusion model: starting from pure noise, the network gradually converges towards a clean image.
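Here is what that reverse direction looks like as code: a hand-rolled sketch of the DDPM denoising loop, assuming a trained noise-prediction network `eps_model(x, t)` (a hypothetical stand-in for the real U-Net):

```python
import torch

def denoise(eps_model, betas, shape):
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                  # start from pure noise
    for t in reversed(range(len(betas))):
        eps = eps_model(x, t)                               # predict the noise in x_t
        # DDPM posterior mean: subtract the predicted noise, then rescale.
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # re-inject a little noise
    return x
```

Each pass peels away a bit of noise, and the small amount re-injected at every step keeps the sampling from collapsing too early.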
Whereas a GAN tries to turn noise into an image in a single shot, a diffusion model adds noise to the training images and learns to subtract it again, step by step.
Now, let's talk about Stable Diffusion. Stable Diffusion is a latent diffusion model: instead of running the expensive denoising process directly on full-resolution pixels, it runs it in a compressed latent space that captures the essential content of the image.
Intuitively, we can think of this latent space as a "blurred", less detailed version of the original image, similar to how a downscaled photo keeps the scene but drops the fine texture. Denoising in this compact space is dramatically cheaper, and it makes the model more robust to noise and perturbations in the input data.
We use a pretrained autoencoder (in Stable Diffusion's case, a variational autoencoder, or VAE) to compress the image data into these lower-dimensional latent representations, and to decode the denoised latents back into pixels.
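With the Hugging Face diffusers library, that compression step looks roughly like this; the VAE checkpoint below is one commonly paired with Stable Diffusion, not the only option:

```python
import torch
from diffusers import AutoencoderKL

# Load a pretrained VAE of the kind Stable Diffusion uses.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

with torch.no_grad():
    image = torch.rand(1, 3, 512, 512) * 2 - 1        # dummy image in [-1, 1]
    latents = vae.encode(image).latent_dist.sample()  # -> (1, 4, 64, 64), 48x fewer values
    reconstruction = vae.decode(latents).sample       # back to pixel space
```

The diffusion process then runs entirely on those (1, 4, 64, 64) latents instead of the full 512x512 pixels, which is where the speed comes from.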
The true power of the Stable Diffusion model is that it can generate images from text prompts. This is done by modifying the inner diffusion model to accept conditioning inputs: the prompt is encoded by a CLIP text encoder, and the resulting embeddings are injected into the denoising U-Net through cross-attention layers.
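In practice you rarely wire the conditioning up by hand; the diffusers library ships the whole pipeline. A minimal usage sketch, assuming a CUDA GPU is available:

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the Stable Diffusion v1.5 weights and run in half precision on GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a pirate labrador with a walking stick").images[0]
image.save("pirate_labrador.png")
```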
By repeating this conditioned denoising step for many iterations, we generate a sequence of latents that become more and more consistent with the image described by the prompt. The sampling process can be further improved with techniques such as annealed Langevin dynamics, which help the stability and convergence of the diffusion process.
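One concrete steering trick worth knowing, and the one Stable Diffusion actually uses at sampling time, is classifier-free guidance. Here is a sketch of a single guided denoising step, assuming a diffusers-style U-Net; the variable names are illustrative:

```python
import torch

# Classifier-free guidance: contrast the prompt-conditioned noise prediction
# with an unconditional one, then exaggerate the difference.
def guided_noise(unet, latents, t, text_emb, empty_emb, guidance_scale=7.5):
    with torch.no_grad():
        eps_uncond = unet(latents, t, encoder_hidden_states=empty_emb).sample
        eps_text = unet(latents, t, encoder_hidden_states=text_emb).sample
    # guidance_scale ~7.5 is a typical default; higher values follow the prompt more literally.
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)
```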
What is the future of text-to-image?
Thanks for reading to the end of this article. My name is Tim, and I love breaking down ML research papers and ML applications with an emphasis on business use cases. Get in touch!