
Mathematics of Data

Score Matching and Diffusion Models: Generating Data by Learning to Denoise

March 2025 · generative-models · diffusion · score-matching · sde

Stable Diffusion, DALL-E, and Sora share a common mathematical foundation that is surprisingly clean. At its core: you cannot learn a probability distribution directly, but you can learn its score — the gradient of its log-density. And learning the score turns out to be equivalent to learning to denoise a corrupted signal. This post derives that equivalence from scratch.


1. The Generative Modeling Problem

Goal. Given i.i.d. samples $\{x_i\}_{i=1}^n$ from an unknown distribution $p_{\text{data}}$ on $\mathbb{R}^d$, learn a model that can generate new samples from $p_{\text{data}}$.

The maximum likelihood approach. Fit a parametric model $p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)}$, where $E_\theta$ is an energy function and $Z(\theta) = \int e^{-E_\theta(x)}\, dx$ is the normalizing constant. Maximize:

$$\log p_\theta(x) = -E_\theta(x) - \log Z(\theta)$$

The problem: $Z(\theta)$ is a $d$-dimensional integral over $\mathbb{R}^d$. For $d = 512 \times 512 \times 3 \approx 786{,}000$ (an image), this integral is completely intractable. Every gradient step on $\log p_\theta$ requires estimating $Z(\theta)$ — which requires samples from $p_\theta$ — which requires knowing $Z(\theta)$. A circular dependency.

The score-based escape. Instead of learning $p_\theta$ directly, learn its score function:

$$s(x) = \nabla_x \log p(x) = -\nabla_x E(x) - \underbrace{\nabla_x \log Z}_{=\,0}$$

The normalizing constant $Z$ drops out when we differentiate — the score does not depend on $Z$. If we can learn $s(x)$, we can generate samples without ever computing $Z$.
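To make the $Z$-independence concrete, here is a minimal numerical sketch, assuming a hypothetical 1D energy $E(x) = (x-2)^2/2$: scaling the unnormalized density by any constant leaves the finite-difference score unchanged.

```python
import numpy as np

# Hypothetical 1D energy E(x) = (x - 2)^2 / 2, so p(x) ∝ exp(-E(x)) and the
# true score is s(x) = -(x - 2). Scaling the density by any constant c
# shifts log p by log(c) but leaves its gradient -- the score -- unchanged.
def log_unnormalized(x, c):
    return np.log(c) - (x - 2.0) ** 2 / 2.0

def score_fd(x, c, h=1e-5):
    # central finite difference of the log-density
    return (log_unnormalized(x + h, c) - log_unnormalized(x - h, c)) / (2 * h)

x = 0.7
for c in (1.0, 17.3, 1e6):  # wildly different "normalizing constants"
    assert abs(score_fd(x, c) - (-(x - 2.0))) < 1e-6
```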


2. The Score Function and Langevin Dynamics

The score $s(x) = \nabla_x \log p(x)$ is a vector field on $\mathbb{R}^d$. At any point $x$, it points in the direction of steepest increase of $\log p$ — toward regions of higher probability density.

Langevin dynamics uses the score to generate samples. Starting from any $x_0$, iterate:

$$x_{t+1} = x_t + \frac{\alpha}{2} s(x_t) + \sqrt{\alpha}\, \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, I)$$

Under mild conditions, as $t \to \infty$ and $\alpha \to 0$, the distribution of $x_t$ converges to $p$. The step $\frac{\alpha}{2} s(x_t)$ drifts toward high-density regions; the noise $\sqrt{\alpha}\, \varepsilon_t$ prevents collapse to a single mode and allows exploration.

No $Z$ appears anywhere. If we have the score, we can sample.
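As a sanity check, here is a minimal Langevin sampler for a toy 1D target $p = \mathcal{N}(2, 1)$, whose score $s(x) = -(x-2)$ is known in closed form (the target, step size, and iteration counts are illustrative choices, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target p = N(2, 1); its score s(x) = -(x - 2) is known in closed form,
# and no normalizing constant is ever computed.
def score(x):
    return -(x - 2.0)

alpha = 0.1                      # step size: small but finite
x = rng.standard_normal(5000)    # 5000 independent chains, started near 0
for _ in range(2000):
    x = x + 0.5 * alpha * score(x) + np.sqrt(alpha) * rng.standard_normal(x.shape)

# For finite alpha the stationary variance is slightly inflated
# (1 / (1 - alpha/4) ~= 1.026 here); the bias vanishes as alpha -> 0.
assert abs(x.mean() - 2.0) < 0.1
assert abs(x.var() - 1.0) < 0.1
```

The finite-step bias visible in the comment is exactly why the convergence statement above requires $\alpha \to 0$.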


3. Hyvärinen's Score Matching Trick

We want to learn $s_\theta(x) \approx s(x) = \nabla_x \log p_{\text{data}}(x)$ by minimizing the Fisher divergence:

$$J(\theta) = \frac{1}{2}\mathbb{E}_{p_{\text{data}}}\!\left[\|s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\|^2\right]$$

The problem: $\nabla_x \log p_{\text{data}}(x)$ is unknown (it's what we're trying to learn). We cannot minimize $J(\theta)$ directly.

Hyvärinen's key insight (2005): Expand $J(\theta)$:

$$J(\theta) = \frac{1}{2}\mathbb{E}\!\left[\|s_\theta(x)\|^2\right] - \mathbb{E}\!\left[s_\theta(x)^T \nabla_x \log p_{\text{data}}(x)\right] + \text{const}$$

The constant $\frac{1}{2}\mathbb{E}[\|\nabla_x \log p_{\text{data}}\|^2]$ doesn't depend on $\theta$. The problematic term is the cross-term $\mathbb{E}[s_\theta(x)^T \nabla_x \log p_{\text{data}}(x)]$.

Apply integration by parts to the cross-term. Writing $p = p_{\text{data}}$ for brevity:

$$\mathbb{E}_p[s_\theta(x)^T \nabla_x \log p(x)] = \int s_\theta(x)^T \nabla_x \log p(x) \cdot p(x)\, dx = \int s_\theta(x)^T \nabla_x p(x)\, dx$$

Integrating by parts component-wise (assuming $s_\theta(x)\, p(x) \to 0$ as $\|x\| \to \infty$, so boundary terms vanish):

$$\int [s_\theta(x)]_j \frac{\partial p(x)}{\partial x_j}\, dx = -\int \frac{\partial [s_\theta(x)]_j}{\partial x_j}\, p(x)\, dx$$

Summing over jj:

$$\mathbb{E}_p[s_\theta(x)^T \nabla_x \log p(x)] = -\mathbb{E}_p[\operatorname{tr}(\nabla_x s_\theta(x))]$$

Substituting back:

$$\boxed{\,J(\theta) = \mathbb{E}_{p_{\text{data}}}\!\left[\operatorname{tr}(\nabla_x s_\theta(x)) + \frac{1}{2}\|s_\theta(x)\|^2\right] + \text{const}\,}$$

This is the score matching objective. It depends only on $s_\theta$ and its Jacobian $\nabla_x s_\theta$, both evaluated at data points $x \sim p_{\text{data}}$. The density $p_{\text{data}}$ never appears explicitly — only samples from it. We can minimize this with stochastic gradient descent.
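The integration-by-parts step underlying this objective can be spot-checked by Monte Carlo. A sketch in 1D with $p = \mathcal{N}(0,1)$ (so $\frac{d}{dx}\log p(x) = -x$) and an arbitrary illustrative test field $s(x) = \tanh(x)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# 1D check of the integration-by-parts identity with p = N(0, 1), whose exact
# score is d/dx log p(x) = -x, and an arbitrary test field s(x) = tanh(x):
#   E_p[s(x) * d/dx log p(x)]  should equal  -E_p[s'(x)]
x = rng.standard_normal(500_000)
lhs = np.mean(np.tanh(x) * (-x))
rhs = -np.mean(1.0 - np.tanh(x) ** 2)   # s'(x) = 1 - tanh(x)^2
assert abs(lhs - rhs) < 0.01
```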

The trace term $\operatorname{tr}(\nabla_x s_\theta(x))$ is the divergence of the score field. Computing it naively requires $d$ backpropagation passes (one per output dimension). In practice, Hutchinson's estimator is used: $\operatorname{tr}(\nabla_x s_\theta) \approx \varepsilon^T (\nabla_x s_\theta)\, \varepsilon$ for $\varepsilon \sim \mathcal{N}(0,I)$, reducing the cost to a single pass.
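A quick numeric illustration of Hutchinson's estimator, using a linear field $s(x) = Ax$ so the Jacobian and its exact trace are known (the matrix $A$ is an arbitrary stand-in for $\nabla_x s_\theta$):

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear field s(x) = A x has Jacobian A everywhere, so tr(grad s) = tr(A).
# A is an arbitrary stand-in for the score network's Jacobian.
d = 50
A = rng.standard_normal((d, d))

# Hutchinson: average eps^T A eps over Gaussian probes. In the network
# setting each probe costs one Jacobian-vector product, not d passes.
eps = rng.standard_normal((100_000, d))
quad = ((eps @ A) * eps).sum(axis=1)   # eps^T A eps, one value per probe
estimate = quad.mean()

assert abs(estimate - np.trace(A)) < 1.0
```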


4. Denoising Score Matching and Tweedie's Formula

Score matching as derived above has a problem: $p_{\text{data}}$ is typically concentrated on a low-dimensional manifold (natural images are not uniformly distributed in $\mathbb{R}^{786000}$). The score $\nabla_x \log p_{\text{data}}$ is undefined off the manifold. Langevin dynamics starting away from the manifold gets no useful signal.

The fix: add noise. Instead of matching the score of $p_{\text{data}}$, match the score of the noisy distribution $p_\sigma(\tilde{x}) = \int p_{\text{data}}(x)\, \mathcal{N}(\tilde{x}; x, \sigma^2 I)\, dx$. This is $p_{\text{data}}$ convolved with Gaussian noise — it has full support on $\mathbb{R}^d$.

For the Gaussian noise kernel $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$, Tweedie's formula gives:

$$\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) = \frac{\mathbb{E}[x \mid \tilde{x}] - \tilde{x}}{\sigma^2}$$

Proof. The noisy marginal is a mixture:

$$p_\sigma(\tilde{x}) = \int p_{\text{data}}(x)\, q_\sigma(\tilde{x} \mid x)\, dx$$

Differentiating:

$$\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) = \frac{\nabla_{\tilde{x}} p_\sigma(\tilde{x})}{p_\sigma(\tilde{x})} = \frac{\int p_{\text{data}}(x)\, \nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x)\, dx}{p_\sigma(\tilde{x})}$$

For Gaussian $q_\sigma$: $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = \frac{x - \tilde{x}}{\sigma^2}$, so $\nabla_{\tilde{x}} q_\sigma(\tilde{x} \mid x) = \frac{x - \tilde{x}}{\sigma^2}\, q_\sigma(\tilde{x} \mid x)$.

Therefore:

$$\nabla_{\tilde{x}} \log p_\sigma(\tilde{x}) = \frac{\int (x - \tilde{x})\, p_{\text{data}}(x)\, q_\sigma(\tilde{x} \mid x)\, dx}{\sigma^2\, p_\sigma(\tilde{x})} = \frac{\mathbb{E}[x \mid \tilde{x}] - \tilde{x}}{\sigma^2} \qquad \square$$

This is the central identity of diffusion models. The score of the noisy distribution equals the expected clean signal minus the noisy signal, divided by $\sigma^2$. Equivalently:

$$\mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log p_\sigma(\tilde{x})$$
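For a Gaussian prior the identity can be verified in closed form. A sketch: if $x \sim \mathcal{N}(0, \tau^2)$ and $\tilde{x} = x + \sigma\varepsilon$, then $p_\sigma = \mathcal{N}(0, \tau^2 + \sigma^2)$, and Tweedie reproduces the textbook posterior-mean shrinkage (the numbers below are illustrative):

```python
import numpy as np

# If x ~ N(0, tau^2) and x_tilde = x + sigma * eps, the noisy marginal is
# p_sigma = N(0, tau^2 + sigma^2), with exact score -x_tilde / (tau^2 + sigma^2).
tau2, sigma2 = 4.0, 1.0          # illustrative variances
x_tilde = 1.7                    # arbitrary noisy observation

score = -x_tilde / (tau2 + sigma2)        # exact score of p_sigma
tweedie = x_tilde + sigma2 * score        # E[x | x_tilde] via Tweedie
bayes = tau2 / (tau2 + sigma2) * x_tilde  # textbook Gaussian posterior mean

assert abs(tweedie - bayes) < 1e-12
```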

Learning the score $\equiv$ learning to denoise. If we parameterize a denoiser $D_\theta(\tilde{x}, \sigma) \approx \mathbb{E}[x \mid \tilde{x}]$, then the implied score estimate is:

$$s_\theta(\tilde{x}, \sigma) = \frac{D_\theta(\tilde{x}, \sigma) - \tilde{x}}{\sigma^2}$$

The denoising score matching objective — minimize over $\theta$:

$$J_{\text{DSM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}},\, \varepsilon \sim \mathcal{N}(0,I)}\!\left[\left\|D_\theta(x + \sigma\varepsilon,\, \sigma) - x\right\|^2\right]$$

This is a standard supervised learning objective: given noisy input $\tilde{x} = x + \sigma\varepsilon$, predict the clean $x$. No score or log-density appears explicitly.
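To see the regression view in action, a small sketch: fit the best linear denoiser on synthetic Gaussian data and check that it recovers the Bayes-optimal posterior mean $\mathbb{E}[x \mid \tilde{x}] = \frac{\tau^2}{\tau^2 + \sigma^2}\tilde{x}$ (the linear model class and all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Draw clean x ~ N(0, tau^2), corrupt with N(0, sigma^2) noise, and fit the
# best linear denoiser D(x_tilde) = a * x_tilde by least squares. Minimizing
# E||D(x_tilde) - x||^2 over all functions yields E[x | x_tilde], which for
# this Gaussian toy model is the shrinkage tau^2 / (tau^2 + sigma^2) * x_tilde.
tau2, sigma2, n = 4.0, 1.0, 1_000_000
x = np.sqrt(tau2) * rng.standard_normal(n)
x_tilde = x + np.sqrt(sigma2) * rng.standard_normal(n)

a = (x_tilde @ x) / (x_tilde @ x_tilde)   # least-squares slope
assert abs(a - tau2 / (tau2 + sigma2)) < 0.01
```

The fitted slope lands on the shrinkage factor $\tau^2/(\tau^2+\sigma^2) = 0.8$, which is exactly the denoiser-to-score correspondence above, specialized to a Gaussian.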


5. Diffusion Models: Many Noise Levels

A single noise level $\sigma$ is not enough. At large $\sigma$, the noisy distribution has full support but is far from $p_{\text{data}}$. At small $\sigma$, it is close to $p_{\text{data}}$ but has poor coverage. We need all scales simultaneously.

The Forward Process

Define a continuous-time noise schedule from $t = 0$ (clean) to $t = T$ (pure noise):

$$dx = f(x,t)\, dt + g(t)\, dW$$

where $W$ is a Wiener process (Brownian motion). For DDPM (Ho et al., 2020), the specific choice $f(x,t) = -\frac{\beta(t)}{2} x$ and $g(t)^2 = \beta(t)$ gives a closed-form marginal:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t)\, I\right)$$

where $\bar{\alpha}_t = \exp\!\left(-\int_0^t \beta(s)\, ds\right)$ is the noise schedule. As $t \to T$, $\bar{\alpha}_T \approx 0$ and $x_T \approx \mathcal{N}(0, I)$ — pure noise.

At any intermediate time $t$, we can write $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$. Tweedie's formula applies at each $t$ with $\sigma_t^2 = 1 - \bar{\alpha}_t$:
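The closed-form marginal is easy to verify by simulation. A sketch with a constant schedule $\beta(t) = 2$ (an illustrative choice) and Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Constant schedule beta(t) = 2 gives alpha_bar(t) = exp(-2t). For Gaussian
# data x0 ~ N(0, tau^2), the marginal variance of x_t should be
# alpha_bar * tau^2 + (1 - alpha_bar).
beta, t, tau2, n = 2.0, 0.5, 4.0, 200_000
abar = np.exp(-beta * t)

x0 = np.sqrt(tau2) * rng.standard_normal(n)
xt = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * rng.standard_normal(n)

assert abs(xt.var() - (abar * tau2 + 1.0 - abar)) < 0.05
```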

$$\nabla_{x_t} \log q(x_t) = \frac{\mathbb{E}[x_0 \mid x_t] - x_t/\sqrt{\bar{\alpha}_t}}{(1 - \bar{\alpha}_t)/\sqrt{\bar{\alpha}_t}} = -\frac{\mathbb{E}[\varepsilon \mid x_t]}{\sqrt{1 - \bar{\alpha}_t}}$$

The Reverse Process

Anderson's formula (1982) gives the reverse-time SDE:

$$dx = \left[f(x,t) - g(t)^2\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar{W}$$

where $\bar{W}$ is a reverse-time Wiener process. This SDE runs backward from $t = T$ to $t = 0$, progressively removing noise.

We learn a neural network $s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t)$ for all $t$ simultaneously by minimizing:

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, \varepsilon}\!\left[\left\|s_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\varepsilon,\; t\right) + \frac{\varepsilon}{\sqrt{1 - \bar{\alpha}_t}}\right\|^2\right]$$

Equivalently, learn to predict the noise $\varepsilon$ from the noisy image $x_t$. This is the denoising diffusion training objective.

Sampling: Start from $x_T \sim \mathcal{N}(0, I)$. Integrate the reverse SDE from $t = T$ to $t = 0$ using the learned $s_\theta$, producing a sample from (approximately) $p_{\text{data}}$.
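Putting the pieces together, a sketch of reverse-SDE sampling via Euler–Maruyama for a toy Gaussian $p_{\text{data}} = \mathcal{N}(0, \tau^2)$, where the exact score $\nabla \log p_t(x) = -x/v_t$ stands in for the learned $s_\theta$ (the constant $\beta$, step count, and $\tau^2$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy setting: p_data = N(0, tau^2) with constant beta = 2, so the marginal
# is p_t = N(0, v_t) with v_t = abar * tau^2 + (1 - abar), abar = exp(-beta*t),
# and the *exact* score -x / v_t stands in for a learned s_theta.
beta, tau2, T, n_steps, n = 2.0, 4.0, 5.0, 500, 20_000
dt = T / n_steps

x = rng.standard_normal(n)   # x_T ~ N(0, I): pure noise
for k in range(n_steps, 0, -1):
    t = k * dt
    abar = np.exp(-beta * t)
    v_t = abar * tau2 + 1.0 - abar
    score = -x / v_t
    # Euler-Maruyama step of dx = [f - g^2 * score] dt + g dW_bar, run backward
    drift = -0.5 * beta * x - beta * score
    x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n)

assert abs(x.var() - tau2) < 0.3   # samples match p_data = N(0, 4)
```

Starting from pure noise, the integrated samples end with variance close to $\tau^2$, i.e. they match the data distribution up to discretization error.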


6. Comparison to GANs

Both GANs and diffusion models aim to sample from $p_{\text{data}}$.

| | GANs | Diffusion Models |
|---|---|---|
| Objective | Minimax game $\min_G \max_D$ | Denoising regression (MSE) |
| Training stability | Unstable; mode collapse | Stable; standard gradient descent |
| Sampling speed | Fast (single pass) | Slow ($T$ denoising steps) |
| Theoretical grounding | Divergence minimization (JS/Wasserstein) | Maximum likelihood via score |
| Scalability | Hard to scale | Scales well (U-Net + attention) |

GANs minimize a divergence between $p_\theta$ and $p_{\text{data}}$ via an adversarial game — the Jensen–Shannon divergence for the original GAN, the Wasserstein distance for WGAN — mathematically elegant but notoriously difficult to train. The generator and discriminator must stay in balance: if the discriminator dominates, gradients vanish; if the generator dominates, mode collapse occurs.

Diffusion models reduce generation to a sequence of denoising steps, each of which is a standard regression problem. No adversarial game, no mode collapse, no balancing act. The price is speed: generating one image requires on the order of $T = 1000$ denoising steps. Recent work (consistency models, DDIM, flow matching) reduces this to 1–10 steps while preserving quality.

Score matching is the theoretical unification: both GANs and diffusion models are estimating properties of $p_{\text{data}}$, just through different mathematical lenses.


Part of the Mathematics of Data series — mathematical notes on EE-556 at EPFL.