This article is aimed at those who want to understand exactly how diffusion models work, with no prior knowledge expected. I've tried to use illustrations wherever possible to provide visual intuitions on each part of these models. I've kept mathematical notation and equations to a minimum, and where they are necessary I've tried to define and explain them as they occur.
Intro
I've framed this article around three main questions:
- What exactly is it that diffusion models learn?
- How and why do diffusion models work?
- Once you've trained a model, how do you get useful stuff out of it?
The examples will be based on the glyffuser, a minimal text-to-image diffusion model that I previously implemented and wrote about. The architecture of this model is a standard text-to-image denoising diffusion model without any bells or whistles. It was trained to generate pictures of new "Chinese" glyphs from English definitions. Have a look at the picture below — even if you're not familiar with Chinese writing, I hope you'll agree that the generated glyphs look quite similar to the real ones!
What exactly is it that diffusion models learn?
Generative AI models are often said to take a big pile of data and "learn" it. For text-to-image diffusion models, the data takes the form of pairs of images and descriptive text. But what exactly is it that we want the model to learn? First, let's forget about the text for a moment and focus on what we are trying to generate: the images.
Probability distributions
Broadly, we can say that we want a generative AI model to learn the underlying probability distribution of the data. What does this mean? Consider the one-dimensional normal (Gaussian) distribution below, commonly written 𝒩(μ,σ²) and parameterized with mean μ = 0 and variance σ² = 1. The black curve below shows the probability density function. We can sample from it: drawing values such that over a large number of samples, the set of values reflects the underlying distribution. These days, we can simply write something like x = random.gauss(0, 1) in Python to sample from the standard normal distribution, although the computational sampling process itself is non-trivial!

We could think of a set of numbers sampled from the above normal distribution as a simple dataset, like that shown as the orange histogram above. In this particular case, we can calculate the parameters of the underlying distribution using maximum likelihood estimation, i.e. by working out the mean and variance. The normal distribution estimated from the samples is shown by the dotted line above. To take some liberties with terminology, you might consider this a simple example of "learning" an underlying probability distribution. We can also say that here we explicitly learned the distribution, in contrast with the implicit methods that diffusion models use.
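To make this concrete, here is a minimal sketch of that explicit "learning" step in Python: we draw samples, then recover the distribution's parameters by maximum likelihood (for a Gaussian, simply the sample mean and variance).

```python
import random
import statistics

# Draw 10,000 samples from the standard normal distribution N(0, 1)
samples = [random.gauss(0, 1) for _ in range(10_000)]

# Maximum likelihood estimates of the Gaussian's parameters are simply
# the sample mean and the sample variance
mu_hat = statistics.fmean(samples)
var_hat = statistics.fmean([(x - mu_hat) ** 2 for x in samples])

print(f"mean ≈ {mu_hat:.3f}, variance ≈ {var_hat:.3f}")  # close to 0 and 1
```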
Conceptually, this is all that generative AI is doing — learning a distribution, then sampling from that distribution!
Data representations
What, then, does the underlying probability distribution of a more complex dataset look like, such as that of the image dataset we want to use to train our diffusion model?
First, we need to know what the representation of the data is. Generally, a machine learning (ML) model requires data inputs with a consistent representation, i.e. format. For the example above, it was simply numbers (scalars). For images, this representation is commonly a fixed-length vector.
The image dataset used for the glyffuser model is ~21,000 pictures of Chinese glyphs. The images are all the same size, 128 × 128 = 16384 pixels, and greyscale (single-channel color). Thus an obvious choice for the representation is a vector x of length 16384, where each element corresponds to the color of one pixel: x = (x₁,x₂,…,x₁₆₃₈₄). We can call the space of all possible images for our dataset "pixel space".
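In code, this mapping between image and vector representations is just a reshape. A small sketch, with a random array standing in for a real glyph image:

```python
import numpy as np

# A toy stand-in for one 128 x 128 greyscale glyph image,
# with pixel values in [0, 1]
image = np.random.rand(128, 128)

# Flatten to the vector representation x = (x1, ..., x16384)
x = image.reshape(-1)
print(x.shape)  # (16384,)

# The reverse mapping recovers the image from the vector
image_again = x.reshape(128, 128)
```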

Dataset visualization
We make the assumption that our individual data samples, x, are actually sampled from an underlying probability distribution, q(x), in pixel space, much as the samples from our first example were sampled from an underlying normal distribution in 1-dimensional space. Note: the notation x ∼ q(x) is commonly used to mean: "the random variable x sampled from the probability distribution q(x)."
This distribution is clearly much more complex than a Gaussian and cannot be easily parameterized — we need to learn it with an ML model, which we'll discuss later. First, let's try to visualize the distribution to gain a better intuition.
As humans find it difficult to see in more than 3 dimensions, we need to reduce the dimensionality of our data. A small digression on why this works: the manifold hypothesis posits that natural datasets lie on lower-dimensional manifolds embedded in a higher-dimensional space — think of a line embedded in a 2-D plane, or a plane embedded in 3-D space. We can use a dimensionality reduction technique such as UMAP to project our dataset from 16384 to 2 dimensions. The 2-D projection retains a lot of structure, consistent with the idea that our data lie on a lower-dimensional manifold embedded in pixel space. In our UMAP, we see two large clusters corresponding to characters in which the components are arranged either horizontally (e.g. 明) or vertically (e.g. 草). An interactive version of the plot below with popups on each datapoint is linked here.

Let's now use this low-dimensional UMAP dataset as a visual shorthand for our high-dimensional dataset. Remember, we assume that these individual points have been sampled from a continuous underlying probability distribution q(x). To get a sense of what this distribution might look like, we can apply a KDE (kernel density estimation) over the UMAP dataset. (Note: this is just an approximation for visualization purposes.)
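For the curious, a sketch of this visualization pipeline might look like the following. The umap-learn and scipy libraries are assumed, and random data stands in for the real glyph dataset:

```python
import numpy as np
import umap                               # pip install umap-learn
from scipy.stats import gaussian_kde

# `data` holds the flattened dataset: ~21,000 rows of 16384 pixel values
data = np.random.rand(21000, 16384)       # placeholder for the real glyphs

# Project from pixel space down to 2 dimensions
reducer = umap.UMAP(n_components=2)
embedding = reducer.fit_transform(data)   # shape (21000, 2)

# Kernel density estimate over the 2-D projection;
# gaussian_kde expects variables in rows, so transpose
kde = gaussian_kde(embedding.T)

# Evaluate the approximate density on a grid for plotting
xmin, ymin = embedding.min(axis=0)
xmax, ymax = embedding.max(axis=0)
xs, ys = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)
```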

This gives a sense of what q(x) should look like: clusters of glyphs correspond to high-probability regions of the distribution. The true q(x) lies in 16384 dimensions — this is the distribution we want to learn with our diffusion model.
We showed that for a simple distribution such as the 1-D Gaussian, we could calculate the parameters (mean and variance) from our data. However, for complex distributions such as images, we need to call on ML methods. Moreover, what we will find is that diffusion models in practice, rather than parameterizing the distribution directly, learn it implicitly through the process of learning how to transform noise into data over many steps.
Takeaway
The goal of generative AI such as diffusion models is to learn the complex probability distributions underlying their training data and then sample from these distributions.
How and why do diffusion models work?
Diffusion models have recently come into the spotlight as a particularly effective method for learning these probability distributions. They generate convincing images by starting from pure noise and gradually refining it. To whet your interest, have a look at the animation below, which shows the denoising process generating 16 samples.

In this section we'll only talk about the mechanics of how these models work, but if you're interested in how they arose from the broader context of generative models, have a look at the further reading section below.
What is "noise"?
Let's first precisely define noise, since the term is thrown around a lot in the context of diffusion. In particular, we are talking about Gaussian noise: consider the samples we talked about in the section on probability distributions. You could think of each sample as an image of a single pixel of noise. An image that is "pure Gaussian noise", then, is one in which each pixel value is sampled from an independent standard Gaussian distribution, 𝒩(0,1). For a pure noise image in the space of our glyph dataset, this would be noise drawn from 16384 separate Gaussian distributions. You can see this in the previous animation. One thing to keep in mind is that we can choose the means of these noise distributions, i.e. center them, on particular values — the pixel values of an image, for instance.
For convenience, you will often find the noise distributions for image datasets written as a single multivariate distribution 𝒩(0,I), where I is the identity matrix, a covariance matrix with all diagonal entries equal to 1 and zeroes elsewhere. This is simply a compact notation for a set of multiple independent Gaussians — i.e. there are no correlations between the noise on different pixels. In the basic implementations of diffusion models, only uncorrelated (a.k.a. "isotropic") noise is used. This article contains an excellent interactive introduction on multivariate Gaussians.
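In code, generating such noise is straightforward. A small sketch using NumPy, with the 128 × 128 shape matching our glyph dataset:

```python
import numpy as np

rng = np.random.default_rng()

# One pixel of noise: a single draw from N(0, 1)
pixel = rng.standard_normal()

# A "pure Gaussian noise" image for our dataset: 16384 independent
# draws from N(0, 1), one per pixel -- i.e. a sample from N(0, I)
noise_image = rng.standard_normal((128, 128))

# Noise centered on an existing image's pixel values instead of 0
image = np.zeros((128, 128))              # placeholder for a real glyph
noisy = image + rng.standard_normal((128, 128))
```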
Diffusion process overview
Below is an adaptation of the somewhat-famous diagram from Ho et al.'s seminal paper "Denoising Diffusion Probabilistic Models", which gives an overview of the whole diffusion process:

I found that there was a lot to unpack in this diagram and that simply understanding what each component meant was very helpful, so let's go through it and define everything step by step.
We previously used x ∼ q(x) to refer to our data. Here, we've added a subscript, xₜ, to denote timestep t, indicating how many steps of "noising" have taken place. We refer to the samples noised to a given timestep as xₜ ∼ q(xₜ). x₀ is clean data and xₜ (t = T) ∼ 𝒩(0,I) is pure noise.
We define a forward diffusion process whereby we corrupt samples with noise. This process is described by the distribution q(xₜ|xₜ₋₁). If we could access the hypothetical reverse process q(xₜ₋₁|xₜ), we could generate samples from noise. As we cannot access it directly, because we would need to know x₀, we use ML to learn the parameters, θ, of a model of this process, pθ(xₜ₋₁∣xₜ).
In the following sections we go into detail on how the forward and reverse diffusion processes work.
Forward diffusion, or "noising"
Used as a verb, "noising" an image refers to applying a transformation that moves it towards pure noise by scaling its pixel values down toward 0 while adding proportional Gaussian noise. Mathematically, this transformation is a multivariate Gaussian distribution centered on the pixel values of the preceding image.
In the forward diffusion process, this noising distribution is written as q(xₜ|xₜ₋₁), where the vertical bar symbol "|" is read as "given" or "conditional on", to indicate that the pixel means are passed forward from q(xₜ₋₁). At t = T, where T is a large number (commonly 1000), we aim to end up with images of pure noise (which, somewhat confusingly, is also a Gaussian distribution, as discussed previously).
The marginal distributions q(xₜ) represent the distributions that have accumulated the effects of all the previous noising steps (marginalization refers to integration over all possible conditions, which recovers the unconditioned distribution).
Since the conditional distributions are Gaussian, what about their variances? They are determined by a variance schedule that maps timesteps to variance values. Initially, an empirically determined schedule of linearly increasing values from 0.0001 to 0.02 over 1000 steps was presented in Ho et al. Later research by Nichol & Dhariwal suggested an improved cosine schedule. They state that a schedule is most effective when the rate of information destruction through noising is relatively even per step throughout the whole noising process.
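Both schedules are easy to write down. A sketch (the cosine version follows the formulation in Nichol & Dhariwal, with their suggested offset s = 0.008 and clipping):

```python
import numpy as np

T = 1000

# Linear schedule from Ho et al.: variances increase from 1e-4 to 0.02
betas_linear = np.linspace(1e-4, 0.02, T)

# Cosine schedule from Nichol & Dhariwal, defined via the cumulative
# signal level alpha_bar(t) rather than the per-step variances
def cosine_betas(T, s=0.008):
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0, 0.999)

betas_cosine = cosine_betas(T)
```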
Forward diffusion intuition
As we encounter Gaussian distributions both as pure noise q(xₜ, t = T) and as the noising distribution q(xₜ|xₜ₋₁), I'll try to draw the distinction by giving a visual intuition of the distribution for a single noising step, q(x₁∣x₀), for some arbitrary, structured 2-dimensional data:

The distribution q(x₁∣x₀) is Gaussian, centered around each point in x₀, shown in blue. Several example points x₀⁽ⁱ⁾ are picked to illustrate this, with q(x₁∣x₀ = x₀⁽ⁱ⁾) shown in orange.
In practice, the main use of these distributions is to generate specific instances of noised samples for training (discussed further below). We can calculate the parameters of the noising distributions at any timestep t directly from the variance schedule, since the chain of Gaussians is itself also Gaussian. This is very convenient, as we don't need to perform noising sequentially — for any given starting data x₀⁽ⁱ⁾, we can calculate the noised sample xₜ⁽ⁱ⁾ by sampling from q(xₜ∣x₀ = x₀⁽ⁱ⁾) directly.
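Concretely, under the standard DDPM parameterization this closed form is q(xₜ∣x₀) = 𝒩(√ᾱₜ·x₀, (1−ᾱₜ)I), where ᾱₜ is the cumulative product of (1 − βₜ) over the schedule. A sketch:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear variance schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)            # cumulative product over steps

def noise_to_timestep(x0, t, rng=np.random.default_rng()):
    """Sample x_t ~ q(x_t | x_0) directly, without sequential noising."""
    eps = rng.standard_normal(x0.shape)   # fresh Gaussian noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return x_t, eps

x0 = np.zeros((128, 128))                 # placeholder for a real glyph
x500, eps = noise_to_timestep(x0, 500)    # noised halfway through
```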
Forward diffusion visualization
Let's now return to our glyph dataset (once again using the UMAP visualization as a visual shorthand). The top row of the figure below shows our dataset sampled from distributions noised to various timesteps: xₜ ∼ q(xₜ). As we increase the number of noising steps, you can see that the dataset begins to resemble pure Gaussian noise. The bottom row visualizes the underlying probability distribution q(xₜ).

Reverse diffusion overview
It follows that if we knew the reverse distributions q(xₜ₋₁∣xₜ), we could repeatedly subtract a small amount of noise, starting from a pure noise sample xₜ at t = T, to arrive at a data sample x₀ ∼ q(x₀). In practice, however, we cannot access these distributions without knowing x₀ beforehand. Intuitively, it's easy to make a known image much noisier, but given a very noisy image, it's much harder to guess what the original image was.
So what are we to do? Since we have a large amount of data, we can train an ML model to accurately guess the original image that any given noisy image came from. Specifically, we learn the parameters θ of an ML model that approximates the reverse noising distributions, pθ(xₜ₋₁∣xₜ) for t = 0, …, T. In practice, this is embodied in a single noise prediction model trained over many different samples and timesteps. This allows it to denoise any given input, as shown in the figure below.

Next, let's go over how this noise prediction model is implemented and trained in practice.
How the model is implemented
First, we define the ML model — generally a deep neural network of some sort — that will act as our noise prediction model. This is what does the heavy lifting! In practice, any ML model that inputs and outputs data of the correct size can be used; the U-net, an architecture particularly suited to learning images, is what we use here and is frequently chosen in practice. More recent models also use vision transformers.

Then we run the training loop depicted in the figure above:
- We take a random image from our dataset and noise it to a random timestep t. (In practice, we speed things up by doing many examples in parallel!)
- We feed the noised image into the ML model and train it to predict the (known to us) noise in the image. We also perform timestep conditioning by feeding the model a timestep embedding, a high-dimensional unique representation of the timestep, so that the model can distinguish between timesteps. This can be a vector the same size as our image directly added to the input (see here for a discussion of how this is implemented).
- The model "learns" by minimizing the value of a loss function, some measure of the difference between the predicted and actual noise. The mean squared error (the mean of the squares of the pixel-wise differences between the predicted and actual noise) is used in our case.
- Repeat until the model is well trained. (A minimal code sketch of this loop follows below.)
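Here is a minimal sketch of such a training loop in PyTorch. The model and dataloader objects are assumed to exist, and the alpha_bar schedule (as a torch tensor) comes from the forward-diffusion snippet earlier; this is illustrative rather than the glyffuser's actual training code:

```python
import torch
import torch.nn.functional as F

# Assumed to exist: `model` (a U-net taking a noisy batch and timesteps),
# `dataloader` yielding batches of clean images, and `alpha_bar` as a
# torch tensor of cumulative signal levels for T = 1000 steps.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
T = 1000

for batch in dataloader:                       # batch: (B, 1, 128, 128)
    t = torch.randint(0, T, (batch.shape[0],)) # a random timestep per image
    eps = torch.randn_like(batch)              # the noise we will predict
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * batch + (1 - ab).sqrt() * eps

    eps_pred = model(x_t, t)                   # timestep conditioning inside
    loss = F.mse_loss(eps_pred, eps)           # mean squared error

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```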
Note: a neural network is essentially a function with a huge number of parameters (on the order of 10⁶ for the glyffuser). Neural network ML models are trained by iteratively updating their parameters using backpropagation to minimize a given loss function over many training data examples. This is an excellent introduction. These parameters effectively store the network's "knowledge".
A noise prediction model trained in this way eventually sees many different combinations of timesteps and data examples. The glyffuser, for example, was trained over 100 epochs (runs through the whole data set), so it saw around 2 million data samples. Through this process, the model implicitly learns the reverse diffusion distributions over the entire dataset at all the different timesteps. This allows the model to sample the underlying distribution q(x₀) by stepwise denoising starting from pure noise. Put another way, given an image noised to any given level, the model can predict how to reduce the noise based on its guess of what the original image was. By doing this repeatedly, updating its guess of the original image each time, the model can transform any noise into a sample that lies in a high-probability region of the underlying data distribution.
Reverse diffusion in practice

We can now revisit this video of the glyffuser denoising process. Recall that a large number of steps from sample to noise, e.g. T = 1000, is used during training to make the noise-to-sample trajectory very easy for the model to learn, as changes between steps will be small. Does that mean we need to run 1000 denoising steps every time we want to generate a sample?
Luckily, this is not the case. Essentially, we can run the single-step noise prediction but then rescale it to any given step, although it might not be very good if the gap is too large! This allows us to approximate the full sampling trajectory with fewer steps. The video above uses 120 steps, for instance (most implementations will allow the user to set the number of sampling steps).
Recall that predicting the noise at a given step is equivalent to predicting the original image x₀, and that we can access the equation for any noised image deterministically using only the variance schedule and x₀. Thus, we can calculate xₜ₋ₖ based on any denoising step. The closer the steps are, the better the approximation will be.
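A sketch of this idea (essentially deterministic DDIM-style sampling, not necessarily the glyffuser's exact sampler): at each of a reduced set of timesteps, predict the noise, infer x₀, then re-noise to the next timestep in the sequence. As above, model and a torch-tensor alpha_bar are assumed.

```python
import torch

@torch.no_grad()
def sample(model, steps=120, T=1000, shape=(16, 1, 128, 128)):
    x = torch.randn(shape)                        # start from pure noise
    timesteps = torch.linspace(T - 1, 0, steps).long()
    for i, t in enumerate(timesteps):
        eps = model(x, t.expand(shape[0]))        # predict the noise
        ab_t = alpha_bar[t]
        x0_hat = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()  # implied x0
        if i + 1 < len(timesteps):                # jump to the next timestep
            ab_next = alpha_bar[timesteps[i + 1]]
            x = ab_next.sqrt() * x0_hat + (1 - ab_next).sqrt() * eps
        else:
            x = x0_hat                            # final clean estimate
    return x
```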
Too few steps, however, and the results become worse, as the steps become too large for the model to effectively approximate the denoising trajectory. If we only use 5 sampling steps, for example, the sampled characters don't look very convincing at all:

There is then a whole literature on more advanced sampling methods beyond what we've discussed so far, allowing effective sampling with far fewer steps. These often reframe the sampling as a differential equation to be solved deterministically, giving an eerie quality to the sampling videos — I've included one at the end if you're interested. In production-level models, these are usually preferred over the simple method discussed here, but the basic principle of deducing the noise-to-sample trajectory is the same. A full discussion is beyond the scope of this article, but see e.g. this paper and its corresponding implementation in the Hugging Face diffusers library for more information.
Alternative intuition from the score function
To me, it was still not 100% clear why training the model on noise prediction generalizes so well. I found that an alternative interpretation of diffusion models known as "score-based modeling" filled some of the gaps in intuition (for more information, refer to Yang Song's definitive article on the topic).

I try to give a visual intuition in the bottom row of the figure above: essentially, learning the noise in our diffusion model is equivalent (up to a constant factor) to learning the score function, which is the gradient of the log of the probability distribution: ∇ₓ log q(x). As a gradient, the score function represents a vector field with vectors pointing towards the regions of highest probability density. Subtracting the noise at each step is then equivalent to following the directions in this vector field towards regions of high probability density.
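A toy 1-D example may help make this concrete: for a standard Gaussian the score is known analytically, and repeatedly stepping along it moves a sample toward the high-density region, just as denoising steps do.

```python
# For a standard 1-D Gaussian the score has a closed form:
# grad_x log q(x) = -x, a vector field pointing toward the mode at 0.
def score(x):
    return -x

# Following the score field (plain gradient ascent on log q) moves an
# arbitrary starting point toward the high-density region
x = 4.0
for _ in range(100):
    x += 0.05 * score(x)
print(x)  # close to 0, the region of highest probability density

# For diffusion models the relation is eps(x_t, t) =
# -sqrt(1 - alpha_bar_t) * score(x_t, t), so predicting the noise
# amounts to predicting the (scaled) score.
```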
As long as there is some signal, the score function effectively guides sampling, but in regions of low probability it tends towards zero as there is little to no gradient to follow. Using many steps to cover different noise levels allows us to avoid this, as we smear out the gradient field at high noise levels, allowing sampling to converge even if we start from low-probability-density regions of the distribution. The figure shows that as the noise level is increased, more of the domain is covered by the score function vector field.
Summary
- The goal of diffusion models is to learn the underlying probability distribution of a dataset and then be able to sample from it. This requires forward and reverse diffusion (noising) processes.
- The forward noising process takes samples from our dataset and gradually adds Gaussian noise (pushes them off the data manifold). This forward process is computationally efficient because any level of noise can be added in closed form in a single step.
- The reverse noising process is challenging because we need to predict how to remove the noise at each step without knowing the original data point in advance. We train an ML model to do this by giving it many examples of data noised to different timesteps.
- Using very small steps in the forward noising process makes it easier for the model to learn to reverse these steps, as the changes are small.
- By applying the reverse noising process iteratively, the model refines noisy samples step by step, eventually producing a realistic data point (one that lies on the data manifold).
Takeaway
Diffusion models are a powerful framework for learning complex data distributions. The distributions are learned implicitly by modelling a sequential denoising process. This process can then be used to generate samples similar to those in the training distribution.
Once you've trained a model, how do you get useful stuff out of it?
Earlier uses of generative AI such as "This Person Does Not Exist" (ca. 2019) made waves simply because it was the first time most people had seen AI-generated photorealistic human faces. A generative adversarial network or "GAN" was used in that case, but the principle remains the same: the model implicitly learned an underlying data distribution — in that case, human faces — then sampled from it. So far, our glyffuser model does a similar thing: it samples randomly from the distribution of Chinese glyphs.
The question then arises: can we do something more useful than just sample randomly? You've likely already encountered text-to-image models such as Dall-E. They are able to incorporate additional meaning from text prompts into the diffusion process — this is known as conditioning. Likewise, diffusion models for scientific applications like protein (e.g. Chroma, RFdiffusion, AlphaFold3) or inorganic crystal structure generation (e.g. MatterGen) become much more useful if they can be conditioned to generate samples with desirable properties such as a specific symmetry, bulk modulus, or band gap.
Conditional distributions
We can consider conditioning as a way to guide the diffusion sampling process towards particular regions of our probability distribution. We mentioned conditional distributions in the context of forward diffusion. Below we show how conditioning can be thought of as reshaping a base distribution.

Consider the figure above. Think of p(x) as a distribution we want to sample from (i.e., the images) and p(y) as conditioning information (i.e., the text dataset). These are the marginal distributions of a joint distribution p(x, y). Integrating p(x, y) over y recovers p(x), and vice versa.
Sampling from p(x), we are equally likely to get x₁ or x₂. However, we can condition on p(y = y₁) to obtain p(x∣y = y₁). You can think of this as taking a slice through p(x, y) at a given value of y. In this conditioned distribution, we are much more likely to sample at x₁ than x₂.
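A tiny discrete example (with made-up probabilities) shows the mechanics of marginalizing and slicing:

```python
import numpy as np

# A toy discrete joint distribution p(x, y) on a 2 x 2 grid, with
# x in {x1, x2} (rows) and y in {y1, y2} (columns)
p_xy = np.array([[0.30, 0.20],    # p(x1, y1), p(x1, y2)
                 [0.10, 0.40]])   # p(x2, y1), p(x2, y2)

# Marginalizing (summing) over y recovers p(x): equally likely here
p_x = p_xy.sum(axis=1)            # [0.5, 0.5]

# Conditioning on y = y1 takes the y1 "slice" and renormalizes it
p_x_given_y1 = p_xy[:, 0] / p_xy[:, 0].sum()   # [0.75, 0.25]
# Under the conditioned distribution, x1 is now 3x as likely as x2
```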
In practice, in order to condition on a text dataset, we need to convert the text into a numerical form. We can do this using large language model (LLM) embeddings, which can be injected into the noise prediction model during training.
Embedding text with an LLM
In the glyffuser, our conditioning information is in the form of English text definitions. We have two requirements: 1) ML models prefer fixed-length vectors as input. 2) The numerical representation of our text must understand context — if we have the words "lithium" and "element" nearby, the meaning of "element" should be understood as "chemical element" rather than "heating element". Both of these requirements can be met by using a pre-trained LLM.
The diagram below shows how an LLM converts text into fixed-length vectors. The text is first tokenized (LLMs break text into tokens, small chunks of characters, as their basic unit of interaction). Each token is converted into a base embedding, which is a fixed-length vector of the size of the LLM input. These vectors are then passed through the pre-trained LLM (here we use the encoder portion of Google's T5 model), where they are imbued with additional contextual meaning. We end up with an array of n vectors of the same length d, i.e. an (n, d) sized tensor.
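A sketch of this step using the Hugging Face transformers library; the "t5-small" checkpoint and example text are placeholders, not necessarily what the glyffuser uses:

```python
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

text = "a chemical element with the symbol Au"
tokens = tokenizer(text, return_tensors="pt")        # tokenize the text

# The encoder output is an (n, d) tensor: n tokens, each a
# d-dimensional contextual embedding (d = 512 for t5-small)
embeddings = encoder(**tokens).last_hidden_state
print(embeddings.shape)                              # (1, n, 512)
```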

Note: in some models, notably Dall-E, additional image-text alignment is performed using contrastive pretraining. Imagen seems to show that we can get away without doing this.
Training the diffusion model with text conditioning
The exact method by which this embedding vector is injected into the model can vary. In Google's Imagen model, for example, the embedding tensor is pooled (combined into a single vector in the embedding dimension) and added into the data as it passes through the noise prediction model; it is also included in a different way using cross-attention (a method of learning contextual information between sequences of tokens, most famously used in the transformer models that form the basis of LLMs like ChatGPT).

In the glyffuser, we only use cross-attention to introduce this conditioning information. While a significant architectural change is required to introduce this additional information into the model, the loss function for our noise prediction model remains exactly the same.
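As a rough sketch of what such a cross-attention layer can look like (dimensions are illustrative, not the glyffuser's actual ones), PyTorch's built-in multi-head attention can attend from image features to text embeddings:

```python
import torch
import torch.nn as nn

# Cross-attention between image features (queries) and text embeddings
# (keys/values). Dimensions here are made up for illustration.
img_dim, txt_dim, n_heads = 256, 512, 8
cross_attn = nn.MultiheadAttention(
    embed_dim=img_dim, kdim=txt_dim, vdim=txt_dim,
    num_heads=n_heads, batch_first=True,
)

image_tokens = torch.randn(1, 64, img_dim)   # e.g. flattened U-net features
text_tokens = torch.randn(1, 12, txt_dim)    # T5 embeddings for 12 tokens

# Each image token attends to every text token, pulling in the
# conditioning information; the output keeps the image tokens' shape
out, _ = cross_attn(image_tokens, text_tokens, text_tokens)
print(out.shape)                              # (1, 64, 256)
```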
Testing the conditioned diffusion model
Let's do a simple test of the fully trained conditioned diffusion model. In the figure below, we try to denoise in a single step with the text prompt "Gold". As touched upon in our interactive UMAP, Chinese characters often contain components known as radicals which can convey sound (phonetic radicals) or meaning (semantic radicals). A common semantic radical is derived from the character meaning "gold", "金", and is used in characters that are in some broad sense associated with gold or metals.

The figure shows that even though a single step is insufficient to approximate the denoising trajectory very well, we have moved into a region of our probability distribution with the "金" radical. This indicates that the text prompt is effectively guiding our sampling towards a region of the glyph probability distribution related to the meaning of the prompt. The animation below shows a 120-step denoising sequence for the same prompt, "Gold". You can see that every generated glyph has either the 釒 or 钅 radical (the same radical in traditional and simplified Chinese, respectively).

Takeaway
Conditioning allows us to sample meaningful outputs from diffusion models.
Further remarks
I found that with the help of tutorials and existing libraries, it was possible to implement a working diffusion model despite not having a full understanding of what was going on under the hood. I think this is a good way to start learning, and I highly recommend Hugging Face's tutorial on training a simple diffusion model using their diffusers Python library (which now contains my small bugfix!).
I've omitted some topics that are important to how production-grade diffusion models function but are unnecessary for core understanding. One is the question of how to generate high-resolution images. In our example, we did everything in pixel space, but this becomes very computationally expensive for large images. The general approach is to perform diffusion in a smaller space, then upscale it in a separate step. Methods include latent diffusion (used in Stable Diffusion) and cascaded super-resolution models (used in Imagen). Another topic is classifier-free guidance, a very elegant method for enhancing the conditioning effect to give much better prompt adherence. I show the implementation in my previous post on the glyffuser and highly recommend this article if you want to learn more.
Further reading
A non-exhaustive list of materials I found very helpful:
Fun extras

Diffusion sampling using the DPMSolverSDEScheduler developed by Katherine Crowson and implemented in Hugging Face diffusers — note the smooth transition from noise to data.