Denoising Diffusion Probabilistic Models (DDPM)
Generate images by reversing a noising process. The forward process gradually destroys
a clean image with Gaussian noise over T steps; a neural network is
trained to undo one step at a time. Sample by starting from pure noise and denoising.
In short: the forward process turns a clean image x_0
into pure noise x_T ~ N(0, I). Train a neural net (almost always a
U-Net) to predict the noise eps that was added at
a randomly chosen step t. At inference, start from x_T ~ N(0, I)
and iteratively subtract predicted noise from t = T down to t = 1.
Stable Diffusion runs the same algorithm in a compressed VAE latent space and conditions
the U-Net on text via cross-attention. State of the art for image, audio, and protein
generation; USAAIO uses DDPMs for image generation under tight compute.
1. The intuition
A GAN learns to generate in one shot — push z ~ N(0, I) through a network
and out comes an image. That is hard: the network has to leap from pure noise to a
sharp natural image in a single function evaluation. GANs work, but training is
notoriously unstable.
Diffusion breaks the leap into many small denoising steps. Each step asks only: "given this slightly-noisy image, what was the noise I added?" That's a tiny, well-posed regression problem with clean MSE targets. Stack hundreds of such steps and the cumulative effect is a clean image. The training objective collapses to a single MSE loss; there is no adversary, no balance to tune, and the model rarely fails to train.
Two costs: (1) slow sampling — 50 to 1000 forward passes per image vs 1 for a GAN; (2) high memory for high-resolution images, which Stable Diffusion fixes by running diffusion in a compressed VAE latent space.
2. The math
Forward (noising) process
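Fixed Markov chain with variance schedule {beta_1, ..., beta_T} in
(0, 1), where each step mixes in a little Gaussian noise:
    q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)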
Let alpha_t = 1 - beta_t and alpha_bar_t = prod_{s=1..t} alpha_s.
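A key derivation: composing Gaussians gives a closed form for any step t
starting from x_0, so we never simulate the chain step by step at training
time:
    q(x_t | x_0) = N(x_t; sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,   eps ~ N(0, I)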
As t -> T, alpha_bar_t -> 0 and x_T -> N(0, I).
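The closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps can be checked numerically without any tensors. The sketch below is plain Python for illustration only. It assumes a linear schedule from 1e-4 to 0.02 with T = 1000, tracks just the scalar signal coefficient and noise variance through the chain one step at a time, and compares them with the closed form. All helper names are hypothetical.

```python
import math


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas (hypothetical helper, a plain-Python linspace)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def compose_chain(betas):
    """Track x_t = coef * x_0 + noise of variance var, one step at a time.

    One forward step maps x to sqrt(1 - beta) * x + sqrt(beta) * eps, so the
    signal coefficient is multiplied by sqrt(1 - beta) and the noise variance
    becomes (1 - beta) * var + beta.
    """
    coef, var = 1.0, 0.0
    history = []
    for beta in betas:
        coef *= math.sqrt(1.0 - beta)
        var = (1.0 - beta) * var + beta
        history.append((coef, var))
    return history


def alpha_bars(betas):
    """Cumulative product of alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


betas = linear_betas(1000)
chain = compose_chain(betas)
abars = alpha_bars(betas)

# Largest disagreement between the simulated chain and the closed form.
max_gap = max(
    max(abs(c - math.sqrt(ab)), abs(v - (1.0 - ab)))
    for (c, v), ab in zip(chain, abars)
)
print(f"largest mismatch: {max_gap:.2e}")
print(f"alpha_bar after step 1: {abars[0]:.4f}, after step T: {abars[-1]:.1e}")
```

The mismatch should sit at floating-point rounding level, which is why training can jump straight to any t instead of looping through the chain.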
The schedule is fixed, not learned (linear beta_t from 1e-4 to
0.02 in the original paper; cosine schedules are a common alternative).
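For comparison, here is a sketch of the cosine schedule as it is usually described (Nichol and Dhariwal, 2021). It defines alpha_bar_t directly through a squared cosine and derives the betas from consecutive ratios. The offset s = 0.008 and the 0.999 clip are the commonly quoted values and should be treated as assumptions here; the function names are hypothetical.

```python
import math


def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t for t = 0..T under the cosine schedule (offset s assumed)."""
    def f(t):
        return math.cos(((t / T + s) / (1.0 + s)) * math.pi / 2.0) ** 2

    f0 = f(0)
    return [f(t) / f0 for t in range(T + 1)]


def betas_from_alpha_bars(abars, max_beta=0.999):
    """Recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped for stability."""
    return [
        min(1.0 - abars[t] / abars[t - 1], max_beta)
        for t in range(1, len(abars))
    ]


T = 1000
abars = cosine_alpha_bars(T)
betas = betas_from_alpha_bars(abars)
print(len(betas), abars[0], round(abars[T // 2], 3))
```

Compared with the linear schedule, alpha_bar falls more slowly in the middle of the chain, so mid-range timesteps keep more of the image signal.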
Reverse (denoising) process
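Parameterise the reverse step as a Gaussian whose mean is predicted by a network:
    p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_theta(x_t, t), Sigma_t)
Applying Bayes' rule to the forward process (conditioned on x_0) shows that the optimal mean has the form:
    mu_theta(x_t, t) = (1 / sqrt(alpha_t)) * (x_t - (beta_t / sqrt(1 - alpha_bar_t)) * eps_theta(x_t, t))
where eps_theta is the network's prediction of the noise that was added.
The variance Sigma_t is usually fixed to beta_t * I (or to the
closed-form posterior variance); some variants learn it.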
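The two fixed-variance choices can be compared directly. The sketch below assumes the standard posterior variance beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t with alpha_bar_0 = 1, and a linear schedule with T = 1000. Helper names are hypothetical.

```python
def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas (hypothetical helper)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def posterior_variances(betas):
    """beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t.

    alpha_bar_0 is taken to be 1 (empty product), so the first entry is 0.
    """
    out = []
    abar_prev = 1.0
    for beta in betas:
        abar = abar_prev * (1.0 - beta)
        out.append((1.0 - abar_prev) / (1.0 - abar) * beta)
        abar_prev = abar
    return out


betas = linear_betas(1000)
beta_tilde = posterior_variances(betas)
for t in (1, 10, 100, 1000):
    print(t, f"beta={betas[t - 1]:.3e}", f"beta_tilde={beta_tilde[t - 1]:.3e}")
```

The posterior variance never exceeds beta_t, and the two differ mainly at small t, so the choice matters most for the last few denoising steps.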
Training loss (simplified)
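The full ELBO has many KL terms, but Ho et al. (2020) showed that a dramatically simpler surrogate works better in practice:
    L_simple = E_{x_0, t, eps} [ || eps - eps_theta(x_t, t) ||^2 ],   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps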
In English: pick a random training image x_0, pick a random timestep
t in {1..T}, draw fresh Gaussian noise eps, build the noisy
sample x_t via the closed form, and train the network to predict
eps from (x_t, t). Plain MSE.
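A toy version of this recipe fits in a few lines if the "image" is a single number. The sketch below is an illustration under strong assumptions: data x_0 ~ N(0, 1), one fixed alpha_bar instead of a random timestep, and a one-parameter "network" eps_hat = w * x_t fitted by least squares. For this Gaussian toy case the MSE-optimal slope is sqrt(1 - alpha_bar), which the fit should approach. The function name is hypothetical.

```python
import math
import random


def fit_noise_predictor(alpha_bar, n_samples=20000, seed=0):
    """Least-squares fit of eps ~ w * x_t for scalar data x_0 ~ N(0, 1)."""
    rng = random.Random(seed)
    sxx = sxe = 0.0
    for _ in range(n_samples):
        x0 = rng.gauss(0.0, 1.0)   # "training image" (a single number)
        eps = rng.gauss(0.0, 1.0)  # fresh Gaussian noise
        x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
        sxx += x_t * x_t
        sxe += x_t * eps
    return sxe / sxx  # minimiser of sum (eps - w * x_t)^2 over w


for alpha_bar in (0.9, 0.5, 0.1):
    w = fit_noise_predictor(alpha_bar)
    print(f"alpha_bar={alpha_bar}: fitted w={w:.3f}, "
          f"sqrt(1 - alpha_bar)={math.sqrt(1.0 - alpha_bar):.3f}")
```

A real denoiser replaces the single slope w with a U-Net and samples t at random, but the regression target is the same Gaussian draw.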
Stable Diffusion pointer
Stable Diffusion follows the same recipe with two changes: (1) diffusion runs in a 64x64x4 VAE latent rather than 512x512x3 pixel space, which is much cheaper; (2) the U-Net is conditioned on text embeddings via cross-attention at multiple resolution levels (Q from image features, K and V from CLIP text features). Classifier-free guidance extrapolates from the unconditional noise prediction toward the conditional one, eps = eps_uncond + w * (eps_cond - eps_uncond), and the scale w controls prompt adherence.
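Both points are easy to make concrete. The sketch below first counts the values per image in pixel space and in the latent, then applies one common parameterisation of classifier-free guidance, eps = eps_uncond + w * (eps_cond - eps_uncond), to toy three-element predictions. The function name and the toy numbers are illustrative assumptions, not Stable Diffusion's actual code.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions element-wise.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    conditional one, and values above 1 extrapolate past it.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


pixel_values = 512 * 512 * 3
latent_values = 64 * 64 * 4
print(pixel_values // latent_values)  # 48x fewer values per image

eps_u = [0.0, 1.0, -2.0]  # toy unconditional prediction
eps_c = [1.0, 1.0, 0.0]   # toy conditional prediction
print(classifier_free_guidance(eps_u, eps_c, 7.5))
```

Guidance needs two denoiser evaluations per sampling step (with and without the prompt), which is part of the inference cost.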
3. PyTorch reference implementation
import torch
import torch.nn as nn
import torch.nn.functional as F


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas, as in the original DDPM paper."""
    return torch.linspace(beta_start, beta_end, T)


class NoiseSchedule:
    """Precomputes betas / alphas / alpha_bars for fast lookup at step t."""

    def __init__(self, T=1000, device="cpu"):
        self.T = T
        self.betas = linear_beta_schedule(T).to(device)
        self.alphas = 1.0 - self.betas
        self.alpha_bars = torch.cumprod(self.alphas, dim=0)

    def q_sample(self, x0, t, noise):
        """Forward closed form: x_t = sqrt(ab) * x0 + sqrt(1-ab) * noise."""
        # t is a (B,) long tensor of 0-indexed steps; reshape for broadcasting.
        ab = self.alpha_bars[t].view(-1, 1, 1, 1)  # (B, 1, 1, 1)
        return torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * noise


class DummyDenoiser(nn.Module):
    """Stub stand-in for a U-Net; takes (x_t, t) and returns predicted noise."""

    def __init__(self, channels=1):
        super().__init__()
        # The timestep enters as one extra input channel (see forward), so the
        # first conv takes channels + 1 inputs.
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x_t, t):
        # Broadcast timestep as an extra channel (a real U-Net injects via FiLM).
        B, _, H, W = x_t.shape
        # Scale t to roughly [0, 1); 1000 matches the default T of NoiseSchedule.
        t_chan = (t.float() / 1000.0).view(B, 1, 1, 1).expand(B, 1, H, W)
        return self.net(torch.cat([x_t, t_chan], dim=1))

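
def training_step(model, sched, x0, optim):
    """One optimisation step on the simplified noise-prediction loss."""
    model.train()  # make sure we are back in training mode after sampling
    B = x0.size(0)
    # 0-indexed timesteps: index t here corresponds to step t + 1 in the math.
    t = torch.randint(0, sched.T, (B,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = sched.q_sample(x0, t, noise)
    eps_p = model(x_t, t)
    loss = F.mse_loss(eps_p, noise)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
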

@torch.no_grad()
def sample(model, sched, shape, device="cpu"):
    """Ancestral sampler: x_T ~ N(0, I), iteratively denoise to x_0."""
    model.train(False)  # equivalent to model.eval()
    x = torch.randn(shape, device=device)
    for t in reversed(range(sched.T)):
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_p = model(x, t_b)
        a = sched.alphas[t]
        ab = sched.alpha_bars[t]
        b = sched.betas[t]
        # Reverse-step mean computed from the predicted noise.
        mean = (1.0 / torch.sqrt(a)) * (x - (b / torch.sqrt(1.0 - ab)) * eps_p)
        if t > 0:
            # Fixed variance choice: sigma_t^2 = beta_t.
            x = mean + torch.sqrt(b) * torch.randn_like(x)
        else:
            x = mean  # last step: no noise added
    return x


if __name__ == "__main__":
    torch.manual_seed(0)
    sched = NoiseSchedule(T=100)
    model = DummyDenoiser(channels=1)
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    x0 = torch.randn(8, 1, 16, 16)  # fake batch standing in for real images
    for step in range(3):
        loss = training_step(model, sched, x0, optim)
        print(step, loss)
    out = sample(model, sched, (4, 1, 16, 16))
    print(out.shape)  # torch.Size([4, 1, 16, 16])

A production DDPM swaps DummyDenoiser for a real U-Net with attention and
a time-step embedding fed in via FiLM-style scale-shift at each block. The math above
is unchanged. model.train(False) is equivalent to model.eval(), the usual inference switch.
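To give a feel for what the stub leaves out, here is a minimal plain-Python sketch of the two missing pieces: a Transformer-style sinusoidal embedding of the timestep, and a FiLM-style scale-shift applied to a feature vector. In a real U-Net the scale and shift would come from a small MLP on the embedding and would act per channel on tensors. The (1 + scale) convention, the max_period value, and both function names are assumptions made for illustration.

```python
import math


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Transformer-style embedding of an integer timestep t (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def film(features, scale, shift):
    """FiLM-style modulation: scale and shift each feature channel."""
    return [h * (1.0 + s) + b for h, s, b in zip(features, scale, shift)]


emb = sinusoidal_embedding(t=250, dim=8)
print([round(v, 3) for v in emb])

# Toy modulation with hand-picked scale and shift values.
print(film([1.0, -2.0], scale=[0.5, 0.0], shift=[0.1, 0.1]))
```

Unlike the single normalized channel used by DummyDenoiser, the embedding gives every block a multi-frequency view of t, which helps the network behave differently at high and low noise levels.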
4. Common USAAIO / IOAI applications
- Conditional image generation — class-conditional MNIST/CIFAR for a contest CV task with a generation component.
- Data augmentation — generate extra in-distribution samples when you have a small labeled dataset.
- Inpainting / super-resolution — condition the denoiser on a masked or low-res input.
- Stable Diffusion fine-tuning (LoRA / DreamBooth) — when the contest allows pretrained weights, fine-tuning text-to-image models is the dominant play for generation prompts.
- Audio & protein generation — diffusion models such as DiffWave (audio) and RFdiffusion (protein backbones) extend the same math beyond images.
5. Drills
D1 · Closed-form forward step
With T = 1000, linear schedule, what is x_t at
t = 0?
Solution
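In the math's 1-indexed convention, alpha_bar_0 is the empty product, which equals 1, so
sqrt(1) * x_0 + 0 * eps = x_0: the identity, with no noise added at step 0. (In the 0-indexed
code above, alpha_bars[0] = 1 - 1e-4 = 0.9999, which is nearly the identity.) By t = T,
alpha_bar is about 4e-5 and x_T is essentially pure noise.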
D2 · Why predict noise, not the clean image?
The simplified loss predicts eps rather than x_0.
Why does that work better empirically?
Solution
At large t, x_t is almost pure noise; the network has
essentially no signal to predict x_0. But predicting the noise itself
is well-posed at every t — the target is just the Gaussian draw used
to build x_t. It also gives a more uniform loss scale across timesteps,
stabilising training.
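One way to see the link between the two targets is to invert the closed form: x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t). The two parameterisations carry the same information, but the conversion factor sqrt(1 - alpha_bar_t) / sqrt(alpha_bar_t) grows without bound as alpha_bar_t approaches 0, so a small error in eps at large t corresponds to a large error in the implied x_0. The scalar sketch below illustrates this; the function names and toy values are hypothetical.

```python
import math


def x0_from_eps(x_t, eps_hat, alpha_bar):
    """Invert x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps for x0."""
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps_hat) / math.sqrt(alpha_bar)


def x0_error_gain(alpha_bar):
    """Factor by which an error in eps_hat is magnified in the implied x0."""
    return math.sqrt(1.0 - alpha_bar) / math.sqrt(alpha_bar)


x0, eps = 0.8, -0.3
for alpha_bar in (0.99, 0.5, 0.01, 1e-4):
    x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
    recovered = x0_from_eps(x_t, eps, alpha_bar)
    print(f"alpha_bar={alpha_bar:g}: x0 recovered={recovered:.4f}, "
          f"error gain={x0_error_gain(alpha_bar):.2f}")
```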
D3 · Number of sampling steps
You trained with T = 1000 but inference is too slow. Options?
Solution
Use a faster sampler such as DDIM (deterministic when its noise parameter eta is 0) or DPM-Solver; these produce comparable quality in 20-50 steps. Alternatively, distill the model into a few-step student (progressive distillation, consistency models). The sampler, not the training-time T, sets the step budget at inference.
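A minimal scalar sketch of the deterministic DDIM update (eta = 0): estimate x_0 from the predicted noise, then re-noise that estimate to the lower noise level using the same predicted noise. Because the update depends only on alpha_bar at the two endpoints, it can skip over many training timesteps at once. The function names and the even stride are illustrative assumptions.

```python
import math


def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from alpha_bar_t to alpha_bar_prev."""
    x0_hat = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_bar_t)
    return math.sqrt(alpha_bar_prev) * x0_hat + math.sqrt(1.0 - alpha_bar_prev) * eps_hat


def strided_timesteps(T, n_steps):
    """Evenly spaced 0-indexed timesteps, highest first (assumes n_steps divides T)."""
    stride = T // n_steps
    return list(range(T - 1, -1, -stride))


steps = strided_timesteps(1000, 50)
print(len(steps), steps[:3], steps[-1])

# Sanity check with an oracle that returns the true noise: the step lands
# exactly on the forward-process sample at the lower noise level.
x0, eps = 0.8, -0.3
ab_t, ab_prev = 0.2, 0.6
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1.0 - ab_t) * eps
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
target = math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * eps
print(abs(x_prev - target) < 1e-12)
```

In a full sampler, eps_hat comes from the trained network at each visited timestep, and the final step uses alpha_bar_prev = 1 so that the output is the x_0 estimate itself.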
D4 · Why a U-Net for the denoiser?
One sentence.
Solution
The target eps lives at the same resolution as the input; U-Net's
skip connections preserve high-resolution detail while letting deep bottleneck
layers reason globally — exactly the symmetric mapping needed.
D5 · Debugging mode collapse vs blur
Your DDPM produces blurry, low-contrast samples. Diagnose.
Solution
Possible causes: (1) too short training — DDPMs need many epochs; (2) noise
schedule too aggressive (try cosine); (3) predicting x_0 instead of
eps; (4) too few sampling steps; (5) network capacity too small.
Note that diffusion models do not collapse the way GANs do — diversity is
usually fine; sharpness is the typical failure.
Next step
Loop back to U-Net for the architecture of the denoiser, to VAE for the latent compressor used by Stable Diffusion, and to Transformers for the cross-attention conditioning mechanism. Then drill ELBO and noise-schedule short answers in Round 2 theory.