Score Matching and Diffusion Models: Generating Data by Learning to Denoise
March 2025 · generative-models · diffusion · score-matching · sde
Stable Diffusion, DALL-E, and Sora share a common mathematical foundation that is surprisingly clean. At its core: you cannot learn a probability distribution directly, but you can learn its score — the gradient of its log-density. And learning the score turns out to be equivalent to learning to denoise a corrupted signal. This post derives that equivalence from scratch.
1. The Generative Modeling Problem
Goal. Given i.i.d. samples $\{x_i\}_{i=1}^n$ from an unknown distribution $p_\text{data}$ on $\mathbb{R}^d$, learn a model that can generate new samples from $p_\text{data}$.
The maximum likelihood approach. Fit a parametric model $p_\theta(x) = e^{-E_\theta(x)} / Z(\theta)$, where $E_\theta$ is an energy function and $Z(\theta) = \int e^{-E_\theta(x)}\,dx$ is the normalizing constant. Maximize:

$$\log p_\theta(x) = -E_\theta(x) - \log Z(\theta)$$
The problem: Z(θ) is a d-dimensional integral over Rd. For d=512×512×3≈786,000 (an image), this integral is completely intractable. Every gradient step on logpθ requires estimating Z(θ) — which requires samples from pθ — which requires knowing Z(θ). A circular dependency.
The score-based escape. Instead of learning pθ directly, learn its score function:
$$s(x) = \nabla_x \log p(x) = -\nabla_x E(x) - \underbrace{\nabla_x \log Z}_{=\,0}$$
The normalizing constant Z drops out when we differentiate — the score does not depend on Z. If we can learn s(x), we can generate samples without ever computing Z.
2. The Score Function and Langevin Dynamics
The score s(x)=∇xlogp(x) is a vector field on Rd. At any point x, it points in the direction of steepest increase of logp — toward regions of higher probability density.
Langevin dynamics uses the score to generate samples. Starting from any x0, iterate:
$$x_{t+1} = x_t + \frac{\alpha}{2}\, s(x_t) + \sqrt{\alpha}\,\varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, I)$$

Under mild conditions, as $t \to \infty$ and $\alpha \to 0$, the distribution of $x_t$ converges to $p$. The drift term $\frac{\alpha}{2} s(x_t)$ moves toward high-density regions; the noise $\sqrt{\alpha}\,\varepsilon_t$ prevents collapse to a single mode and allows exploration.
No Z appears anywhere. If we have the score, we can sample.
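To make the update rule concrete, here is a small sketch of mine (not from the post): Langevin dynamics on a 1-D Gaussian $\mathcal{N}(2, 0.5^2)$, whose score $s(x) = -(x - \mu)/s^2$ is known in closed form. The step size and iteration count are illustrative choices.

```python
import numpy as np

# Toy Langevin sampler: target N(mu, s^2), whose score is known exactly.
mu, s = 2.0, 0.5

def score(x):
    return -(x - mu) / s**2          # grad log p for a 1-D Gaussian

rng = np.random.default_rng(0)
alpha = 1e-3                         # step size (illustrative)
x = rng.standard_normal(10_000)      # 10k chains, arbitrary initialization
for _ in range(5_000):
    # x <- x + (alpha/2) s(x) + sqrt(alpha) * noise
    x = x + 0.5 * alpha * score(x) + np.sqrt(alpha) * rng.standard_normal(x.shape)

print(x.mean(), x.std())             # should approach 2.0 and 0.5
```

Note that the sampler never touches a normalizing constant: only the score enters the update.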
3. Hyvärinen's Score Matching Trick
We want to learn sθ(x)≈s(x)=∇xlogpdata(x) by minimizing the Fisher divergence:
$$J(\theta) = \frac{1}{2}\,\mathbb{E}_{p_\text{data}}\left[\left\|s_\theta(x) - \nabla_x \log p_\text{data}(x)\right\|^2\right]$$
The problem: ∇xlogpdata(x) is unknown (it's what we're trying to learn). We cannot minimize J(θ) directly.
Expanding the square gives three terms. The constant $\frac{1}{2}\mathbb{E}[\|\nabla_x \log p_\text{data}\|^2]$ doesn't depend on $\theta$, and $\frac{1}{2}\mathbb{E}[\|s_\theta(x)\|^2]$ is computable from samples. The problematic term is the cross-term $\mathbb{E}[s_\theta(x)^\top \nabla_x \log p_\text{data}(x)]$.
Apply integration by parts to the cross-term. Writing p=pdata for brevity:
$$\mathbb{E}_p[s_\theta(x)^\top \nabla_x \log p(x)] = \int s_\theta(x)^\top \nabla_x \log p(x)\, p(x)\, dx = \int s_\theta(x)^\top \frac{\nabla_x p(x)}{p(x)}\, p(x)\, dx = \int s_\theta(x)^\top \nabla_x p(x)\, dx$$
Integrating by parts component-wise (assuming $[s_\theta(x)]_j\, p(x) \to 0$ as $\|x\| \to \infty$, so boundary terms vanish):
$$\int [s_\theta(x)]_j\, \frac{\partial p(x)}{\partial x_j}\, dx = -\int \frac{\partial [s_\theta(x)]_j}{\partial x_j}\, p(x)\, dx$$
Summing over j:
$$\mathbb{E}_p[s_\theta(x)^\top \nabla_x \log p(x)] = -\mathbb{E}_p\left[\operatorname{tr}(\nabla_x s_\theta(x))\right]$$
Substituting back:
$$J(\theta) = \mathbb{E}_{p_\text{data}}\left[\operatorname{tr}(\nabla_x s_\theta(x)) + \frac{1}{2}\|s_\theta(x)\|^2\right] + \text{const}$$
This is the score matching objective. It depends only on sθ and its Jacobian ∇xsθ, both evaluated at data points x∼pdata. No pdata appears explicitly — only samples from it. We can minimize this with stochastic gradient descent.
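As a sanity check, here is a 1-D toy of mine (not from the post): for the Gaussian model score $s_\theta(x) = -(x - m)/v$, the trace term is $-1/v$, so the objective reduces to $J(m, v) = \mathbb{E}[-1/v + (x - m)^2/(2v^2)]$, and its empirical minimizer should recover the sample mean and variance.

```python
import numpy as np

# Score matching for a 1-D Gaussian model s_theta(x) = -(x - m)/v.
# Then tr(grad s_theta) = -1/v and J(m,v) = E[-1/v + (x-m)^2 / (2 v^2)].
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.5, size=100_000)
s1, s2 = data.mean(), (data**2).mean()      # sufficient statistics

ms = np.linspace(2.0, 4.0, 201)             # grid over the mean parameter m
vs = np.linspace(1.0, 4.0, 301)             # grid over the variance parameter v
M, V = np.meshgrid(ms, vs, indexing="ij")
# E[(x - m)^2] = s2 - 2*m*s1 + m^2, so the empirical objective is:
J = -1.0 / V + (s2 - 2.0 * M * s1 + M**2) / (2.0 * V**2)

i, j = np.unravel_index(J.argmin(), J.shape)
print(ms[i], vs[j])                         # ≈ 3.0 (mean) and 2.25 (variance)
```

Grid search stands in for gradient descent here; the point is only that the objective, which never touches $p_\text{data}$ itself, is minimized at the true score.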
The trace term $\operatorname{tr}(\nabla_x s_\theta(x))$ is the divergence of the score field. Computing it naively requires $d$ backpropagation passes (one per output dimension). In practice, Hutchinson's estimator is used: $\operatorname{tr}(\nabla_x s_\theta) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}\left[\varepsilon^\top \nabla_x s_\theta\, \varepsilon\right]$, and a single sample of $\varepsilon^\top \nabla_x s_\theta\, \varepsilon$ gives an unbiased estimate computable with one backpropagation pass.
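The estimator is easy to check numerically on a fixed matrix standing in for the Jacobian (a toy of mine, not from the post):

```python
import numpy as np

# Hutchinson's estimator: E[eps^T A eps] = tr(A) for eps ~ N(0, I).
# A is a fixed random matrix standing in for the Jacobian of s_theta.
rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((d, d))

eps = rng.standard_normal((20_000, d))              # 20k probe vectors
estimates = np.einsum("ni,ij,nj->n", eps, A, eps)   # eps^T A eps, per probe

print(estimates.mean(), np.trace(A))                # unbiased: means agree
```

Each probe costs one matrix-vector product (one vector-Jacobian product in the neural-network setting) instead of $d$ of them.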
4. Denoising Score Matching and Tweedie's Formula
Score matching as derived above has a problem: $p_\text{data}$ is typically concentrated on a low-dimensional manifold (natural images are not uniformly distributed in $\mathbb{R}^{786{,}000}$). The score $\nabla_x \log p_\text{data}$ is undefined off the manifold. Langevin dynamics starting away from the manifold gets no useful signal.
The fix: add noise. Instead of matching the score of $p_\text{data}$, match the score of the noisy distribution $p_\sigma(\tilde{x}) = \int p_\text{data}(x)\, \mathcal{N}(\tilde{x}; x, \sigma^2 I)\, dx$. This is $p_\text{data}$ convolved with Gaussian noise — it has full support on $\mathbb{R}^d$.
For the Gaussian noise kernel $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$, Tweedie's formula gives:

$$\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) = \frac{\mathbb{E}[x \mid \tilde{x}] - \tilde{x}}{\sigma^2}$$

This is the central identity of diffusion models. The score of the noisy distribution equals the expected clean signal minus the noisy signal, divided by $\sigma^2$. Equivalently:
$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$$
Learning the score ≡ learning to denoise. If we parameterize a denoiser Dθ(x~,σ)≈E[x∣x~], then the implied score estimate is:
$$s_\theta(\tilde{x}, \sigma) = \frac{D_\theta(\tilde{x}, \sigma) - \tilde{x}}{\sigma^2}$$
The denoising score matching objective — minimize over θ:
$$J_\text{DSM}(\theta) = \mathbb{E}_{x \sim p_\text{data},\, \varepsilon \sim \mathcal{N}(0, I)}\left[\left\|D_\theta(x + \sigma\varepsilon, \sigma) - x\right\|^2\right]$$
This is a standard supervised learning objective: given noisy input x~=x+σε, predict the clean x. No score or log-density appears explicitly.
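The denoiser-score connection can be checked end to end in the one case where everything is closed-form. In this sketch of mine (not from the post), data is $x \sim \mathcal{N}(0, 1)$, so the best denoiser is linear, $D(\tilde{x}) = a\tilde{x}$ with $a = 1/(1 + \sigma^2)$, and the noisy marginal's score is $-\tilde{x}/(1 + \sigma^2)$; fitting $a$ by least squares on the DSM objective should reproduce that score.

```python
import numpy as np

# DSM sanity check with Gaussian data: x ~ N(0,1), x_tilde = x + sigma*eps.
rng = np.random.default_rng(0)
sigma = 0.7
x = rng.standard_normal(200_000)
xt = x + sigma * rng.standard_normal(x.shape)

# Best linear denoiser D(xt) = a*xt, fit by least squares: argmin ||a*xt - x||^2
a = (xt @ x) / (xt @ xt)

# Implied score slope via s(xt) = (D(xt) - xt) / sigma^2, vs the exact score
# of the noisy marginal N(0, 1 + sigma^2), whose slope is -1/(1 + sigma^2).
implied_score_slope = (a - 1.0) / sigma**2
exact_score_slope = -1.0 / (1.0 + sigma**2)
print(implied_score_slope, exact_score_slope)   # should match closely
```

The regression never sees a score or a density, yet the fitted denoiser encodes the score exactly, as Tweedie's formula predicts.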
5. Diffusion Models: Many Noise Levels
A single noise level σ is not enough. At large σ, the noisy distribution has full support but is far from pdata. At small σ, it is close to pdata but has poor coverage. We need all scales simultaneously.
The Forward Process
Define a continuous-time noise schedule from t=0 (clean) to t=T (pure noise):
$$dx = f(x, t)\, dt + g(t)\, dW$$
where $W$ is a Wiener process (Brownian motion). For DDPM (Ho et al., 2020), the specific choice $f(x, t) = -\frac{1}{2}\beta(t)\, x$ and $g(t)^2 = \beta(t)$ gives a closed-form marginal:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t)\, I\right)$$

where $\bar{\alpha}_t = \exp\!\left(-\int_0^t \beta(s)\, ds\right)$ is the noise schedule. As $t \to T$, $\bar{\alpha}_T \approx 0$ and $x_T \approx \mathcal{N}(0, I)$ — pure noise.
At any intermediate time $t$, we can write $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$. Tweedie's formula applies at each $t$ with $\sigma_t^2 = 1 - \bar{\alpha}_t$:

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\varepsilon}{\sqrt{1 - \bar{\alpha}_t}}$$

Equivalently, learn to predict the noise $\varepsilon$ from the noisy image $x_t$: train $\varepsilon_\theta(x_t, t)$ to minimize $\mathbb{E}\left[\|\varepsilon_\theta(x_t, t) - \varepsilon\|^2\right]$. This is the denoising diffusion training objective.
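The closed-form marginal can be verified by simulating the discrete forward chain $x_k = \sqrt{1 - \beta_k}\, x_{k-1} + \sqrt{\beta_k}\, \varepsilon_k$, for which $\bar{\alpha}_K = \prod_k (1 - \beta_k)$ is the discrete analogue of $\exp(-\int \beta)$. A sketch of mine with an illustrative linear schedule:

```python
import numpy as np

# Verify the DDPM forward-process marginal against simulation.
rng = np.random.default_rng(0)
K = 200
betas = np.linspace(1e-4, 0.05, K)        # illustrative linear schedule
abar = np.prod(1.0 - betas)               # discrete abar_K = prod(1 - beta_k)

x0 = rng.normal(2.0, 0.3, size=100_000)   # arbitrary "data" distribution
x = x0.copy()
for b in betas:                           # x_k = sqrt(1-b)*x_{k-1} + sqrt(b)*eps
    x = np.sqrt(1.0 - b) * x + np.sqrt(b) * rng.standard_normal(x.shape)

# Closed form: x_K ~ sqrt(abar)*x0 + sqrt(1-abar)*eps, so
# mean = sqrt(abar)*E[x0]  and  var = abar*Var(x0) + (1 - abar).
print(x.mean(), np.sqrt(abar) * 2.0)
print(x.var(), abar * 0.3**2 + (1.0 - abar))
```

After 200 steps $\bar{\alpha}_K$ is nearly zero, so the simulated chain is statistically indistinguishable from pure noise, exactly as the closed form predicts.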
Sampling: Start from xT∼N(0,I). Integrate the reverse SDE from t=T to t=0 using the learned sθ, producing a sample from (approximately) pdata.
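The whole sampling pipeline can be run end to end when the score is analytic. In this toy of mine (not from the post), the data is Gaussian, so the marginal at every $t$ is Gaussian and its score is closed-form; Euler–Maruyama on the reverse SDE $dx = [f - g^2 s]\,dt + g\,d\bar{W}$ should then recover the data distribution from pure noise.

```python
import numpy as np

# Reverse-SDE sampling with an exact score. Data: N(1.0, 0.5^2).
# VP SDE: f = -0.5*beta(t)*x, g^2 = beta(t), beta linear in t on [0, 1].
rng = np.random.default_rng(0)
m0, s0 = 1.0, 0.5
bmin, bmax = 0.1, 20.0                    # illustrative VP schedule

def beta(t):
    return bmin + t * (bmax - bmin)

def abar(t):                              # exp(-int_0^t beta(s) ds)
    return np.exp(-(bmin * t + 0.5 * (bmax - bmin) * t**2))

def score(x, t):                          # marginal is N(sqrt(a)*m0, a*s0^2 + 1 - a)
    a = abar(t)
    var = a * s0**2 + 1.0 - a
    return -(x - np.sqrt(a) * m0) / var

N, dt = 1000, 1.0 / 1000
x = rng.standard_normal(50_000)           # x_T ~ N(0, I); abar(1) ≈ 4e-5
for i in range(N, 0, -1):                 # integrate from t = 1 down to t = 0
    t = i * dt
    b = beta(t)
    drift = -0.5 * b * x - b * score(x, t)   # f - g^2 * s
    x = x - drift * dt + np.sqrt(b * dt) * rng.standard_normal(x.shape)

print(x.mean(), x.std())                  # should approach 1.0 and 0.5
```

In a real diffusion model the only change is that `score` is replaced by the learned network $s_\theta(x_t, t)$; the integration loop is identical.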
6. Comparison to GANs
Both GANs and diffusion models aim to sample from pdata.
| | GANs | Diffusion Models |
|---|---|---|
| Objective | Minimax game $\min_G \max_D$ | Denoising regression (MSE) |
| Training stability | Unstable; mode collapse | Stable; standard gradient descent |
| Sampling speed | Fast (single pass) | Slow (T denoising steps) |
| Theoretical grounding | Wasserstein distance | Maximum likelihood via score |
| Scalability | Hard to scale | Scales well (U-Net + attention) |
GANs minimize a divergence between pθ and pdata via an adversarial game (Jensen–Shannon for the original GAN, Wasserstein distance for WGAN) — mathematically elegant but notoriously difficult to train. The generator and discriminator must stay in balance: if the discriminator dominates, gradients vanish; if the generator dominates, mode collapse occurs.
Diffusion models reduce generation to a sequence of denoising steps, each of which is a standard regression problem. No adversarial game, no mode collapse, no balancing act. The price is speed: generating one image requires T=1000 denoising steps. Recent work (consistency models, DDIM, flow matching) reduces this to 1–10 steps while preserving quality.
Score matching is the theoretical unification: both GANs and diffusion models are estimating properties of pdata, just through different mathematical lenses.
Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.