Score-based models

6 minute read

Relevant papers:

[1] : Generative Modeling by Estimating Gradients of the Data Distribution

[2] : Adversarial score matching and improved sampling for image generation

[3] : Sliced Score Matching: A Scalable Approach to Density and Score Estimation

[4] : Score-based generative modeling through stochastic differential equations

[5] : Denoising Diffusion Probabilistic Models (Jonathan Ho, Ajay Jain, Pieter Abbeel)

[6] : Bayesian Learning via Stochastic Gradient Langevin Dynamics

[7] : Estimation of Non-Normalized Statistical Models by Score Matching

[8] : A connection between score-matching and denoising auto-encoders


Definition (score): \(\textrm{score}(x) = \nabla \log p(x).\)

Definition (typical set) – informal: the region that carries most of the probability mass, i.e. points with both non-negligible density and non-negligible volume.

Let $x_0$ be a point in the typical set. Let $\{x_k\}_{k=1}^{T}$ be a sequence of points obtained by following the gradient of the log density. Specifically,

\[x_{k+1} = x_{k} + a\nabla_{x}\log p(x_k).\]

Using this rule, as $T\to \infty$ we end up at a mode of the distribution and not in the typical set.
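To see this concretely, here is a tiny numpy sketch (a toy example, not from the papers) on a standard 2-D Gaussian, whose score is $-x$: the noiseless iterates collapse onto the mode at the origin rather than onto the typical set (for a $d$-dimensional standard Gaussian, a shell of radius roughly $\sqrt{d}$).

```python
import numpy as np

# Noiseless update x_{k+1} = x_k + a * grad log p(x_k) on a standard
# 2-D Gaussian, whose score is grad log p(x) = -x.  The iterates collapse
# onto the mode at the origin, not onto the typical set.
a = 0.1
x = np.array([3.0, -2.0])          # start somewhere in the tails
for _ in range(200):
    x = x + a * (-x)               # gradient step, no noise term
print(np.linalg.norm(x))           # essentially 0: stuck at the mode
```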

Metropolis Adjusted Langevin Algorithm (MALA)

A slight modification of this rule provably gets us to the typical set. We use a scheme similar to Metropolis-Hastings.

First, we propose a new state with the rule: \(x_{k+1} = x_{k} + a\nabla_{x}\log p(x_k) + \sqrt{2a}z, \quad z \sim \mathcal N(0, I)\)

Then, we accept this state with probability: \(\textrm{Pr}(\textrm{accept}) = \min\left( 1, \frac{p(x_{k+1}) \cdot q(x_{k}|x_{k+1})}{p(x_k) \cdot q(x_{k+1} | x_k) }\right)\)

where $q(x' \mid x) = \mathcal N\left(x';\ x + a\nabla_x \log p(x),\ 2aI\right)$ is the proposal distribution induced by the Langevin step above.
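As a sanity check, here is a minimal MALA implementation on a 1-D standard Gaussian (a toy example with analytic score $-x$; the step size and chain length are arbitrary choices):

```python
import numpy as np

# MALA targeting a standard 1-D Gaussian: log p(x) = -x^2/2 (+ const),
# so grad log p(x) = -x.  The proposal q(x'|x) is the Gaussian induced by
# one Langevin step: N(x + a * grad log p(x), 2a).
rng = np.random.default_rng(0)
a = 0.1

def log_p(x):
    return -0.5 * x**2

def log_q(x_new, x_old):
    mean = x_old + a * (-x_old)            # Langevin drift
    return -((x_new - mean) ** 2) / (4 * a)

x, samples = 3.0, []
for _ in range(20000):
    z = rng.standard_normal()
    prop = x + a * (-x) + np.sqrt(2 * a) * z
    log_alpha = (log_p(prop) + log_q(x, prop)) - (log_p(x) + log_q(prop, x))
    if np.log(rng.random()) < log_alpha:   # Metropolis accept/reject
        x = prop
    samples.append(x)
samples = np.array(samples[2000:])         # drop burn-in
print(samples.mean(), samples.var())       # close to 0 and 1
```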

Removing the Metropolis-Hastings scheme

The Metropolis-Hastings correction requires evaluating the density function, and learning a density estimate is hard because of the normalization constraint. [6] (stochastic gradient Langevin dynamics) proposes to drop the accept/reject step and instead shrink the step size towards 0 over time. This technique is called annealing.

The new scheme is:

\[x_{k+1} = x_{k} + a_t\nabla_{x}\log p(x_k) + \sqrt{2a_t}z, \quad z \sim \mathcal N(0, I)\]

Generative modeling by estimating gradients of the data distribution ([1])

It seems that if we can estimate $\nabla_x \log p(x)$ for any $x$, then we can sample from the typical set of the distribution.

First idea (Score-matching [7])

Train a neural network $s_{\theta}(x)$ to approximate $\nabla_x \log p(x)$.

Loss function:

\[\frac{1}{2}E_{x \sim p}\left[ ||s_{\theta}(x) - \nabla_x\log p(x)||^2\right]\]

This can be proven [7] to be equivalent, up to a constant independent of $\theta$, to: \(E_{x \sim p}\left[ \textrm{tr}\left( \nabla_x s_{\theta}(x)\right) + \frac{1}{2}||s_{\theta}(x)||_2^2\right]\)
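A toy illustration of this objective (my own example, not from [7]): for a linear model $s_{\theta}(x) = -\theta x$ on 1-D standard Gaussian data, $\textrm{tr}(\nabla_x s_{\theta}) = -\theta$, so the objective is $E[-\theta + \frac{1}{2}\theta^2 x^2]$, minimized at $\theta = 1/E[x^2] = 1$, i.e. the true score $-x$:

```python
import numpy as np

# Exact score matching for a toy linear model s_theta(x) = -theta * x on
# 1-D data x ~ N(0, 1).  Here tr(grad_x s_theta) = -theta, so the objective
# E[tr(grad s) + 0.5 ||s||^2] becomes E[-theta + 0.5 theta^2 x^2], which is
# minimized at theta = 1 / E[x^2] = 1, recovering the true score -x.
rng = np.random.default_rng(0)
x = rng.standard_normal(100000)

thetas = np.linspace(0.0, 2.0, 201)
losses = [np.mean(-t + 0.5 * t**2 * x**2) for t in thetas]
theta_best = thetas[int(np.argmin(losses))]
print(theta_best)   # close to 1.0
```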


Problems with this approach:

  1. The manifold hypothesis. TLDR: we cannot estimate the score well in low-density regions because we get very few samples there.
  2. The trace term requires computing the Jacobian of $s_{\theta}$, which is expensive (and happens in every forward pass). Sliced score matching [3] addresses this with random projections.

Second idea (Denoising Score-Matching [8])

We will add noise to the data and try to estimate the score for the noisy samples.

New loss: \(\frac{1}{2}E_{\tilde x \sim q}\left[ ||s_{\theta}(\tilde x) - \nabla_{\tilde x}\log q(\tilde x)||^2\right]\)

[8] proves that the latter is equivalent to minimizing:

\[\frac{1}{2}E_{x \sim p,\ \tilde x \sim q_{\sigma}(\cdot | x)}\left[ ||s_{\theta}(\tilde x) - \nabla_{\tilde x} \log q_{\sigma}(\tilde x | x)||^2\right]\]
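A small worked example (my own, assuming Gaussian noise $q_{\sigma}(\tilde x | x) = \mathcal N(\tilde x; x, \sigma^2 I)$, so the target is $-(\tilde x - x)/\sigma^2$): fitting a linear model $s_{\theta}(\tilde x) = -\theta \tilde x$ by least squares recovers $\theta = 1/(1+\sigma^2)$, which is exactly the score of the smoothed marginal $\mathcal N(0, 1+\sigma^2)$.

```python
import numpy as np

# Denoising score matching with a linear model s_theta(xt) = -theta * xt.
# Data x ~ N(0, 1), noisy sample xt = x + sigma * z, and the DSM target is
# grad_xt log q_sigma(xt | x) = -(xt - x) / sigma^2.  Least squares gives
# theta = 1 / (1 + sigma^2): the score of the marginal N(0, 1 + sigma^2).
rng = np.random.default_rng(0)
sigma = 0.5
x = rng.standard_normal(200000)
xt = x + sigma * rng.standard_normal(200000)

target = -(xt - x) / sigma**2
# minimize E[(-theta * xt - target)^2] over theta in closed form
theta = -np.sum(xt * target) / np.sum(xt**2)
print(theta, 1 / (1 + sigma**2))   # both close to 0.8
```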


How to choose the noise scale? For small noise, the manifold-hypothesis problem persists. For large noise, the perturbed distribution no longer resembles the data, so we end up generating noisy samples.

Main idea of [1]

Learn multiple noisy scores jointly. During sampling, smoothly transition between the scales.

\[l(\theta, \sigma) = \frac{1}{2}E_{p_{data}(x)}E_{\tilde x \sim \mathcal N(x, \sigma^2 I)}\left[ \left|\left| s_{\theta}(\tilde x, \sigma) + \frac{\tilde x - x}{\sigma^2}\right|\right|^2\right]\]

New loss: \(\mathcal L(\theta, \{ \sigma_i\}_{i=1}^L) = \frac{1}{L}\sum_{i=1}^{L}\lambda(\sigma_i)l(\theta, \sigma_i),\) where $\lambda(\sigma_i) > 0$ is a weighting coefficient; [1] picks $\lambda(\sigma) = \sigma^2$ so that every term has roughly the same magnitude.

Sampling of [1]

Initialize x_0 to a random image
for i in [L]:    (with sigma_1 > sigma_2 > ... > sigma_L)
  a_i = epsilon * sigma_i^2 / sigma_L^2
  for t in [T]:
    sample z ~ N(0, I)
    x_t = x_{t-1} + 0.5 * a_i * s(x_{t-1}, sigma_i) + sqrt(a_i) * z
  x_0 = x_T    (warm-start the next noise scale)
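The procedure above can be sketched in numpy with an analytic score standing in for $s_{\theta}$ (a toy example; the target $\mathcal N(2, 0.25)$, the sigma schedule, and epsilon are arbitrary choices, and the perturbed score is $-(x-2)/(0.25+\sigma^2)$ in closed form):

```python
import numpy as np

# Annealed Langevin sampling with an analytic score in place of s_theta.
# Data distribution: N(2, 0.25).  The sigma-perturbed marginal is
# N(2, 0.25 + sigma^2), so its score is -(x - 2) / (0.25 + sigma^2).
rng = np.random.default_rng(0)
sigmas = np.geomspace(2.0, 0.01, 10)       # sigma_1 > ... > sigma_L
eps, T = 1e-4, 100
n_chains = 5000

def score(x, sigma):
    return -(x - 2.0) / (0.25 + sigma**2)

x = rng.standard_normal(n_chains)          # random initialization
for sigma in sigmas:                       # chains are warm-started
    a = eps * sigma**2 / sigmas[-1] ** 2
    for _ in range(T):
        z = rng.standard_normal(n_chains)
        x = x + 0.5 * a * score(x, sigma) + np.sqrt(a) * z
print(x.mean(), x.var())                   # close to 2 and 0.25
```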

Inpainting with [1]

Input: y (observed image) and m (binary mask; m = 1 on known pixels)
Initialize x_0 to a random image
for i in [L]:
  a_i = epsilon * sigma_i^2 / sigma_L^2
  for t in [T]:
    sample z ~ N(0, I)
    x_t = x_{t-1} + 0.5 * a_i * s(x_{t-1}, sigma_i) + sqrt(a_i) * z
    sample z' ~ N(0, I)
    x_t = (1 - m) * x_t + (y + sigma_i * z') * m
  x_0 = x_T


  1. We do not have any guarantee that this procedure produces samples from $p(x \mid y)$.
  2. Can you do inversion?
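A minimal numerical sketch of the inpainting loop (my own toy example, not from [1]): the target is a 2-D Gaussian with independent coordinates, the first coordinate is observed, and the second is filled in by annealed Langevin dynamics with the analytic perturbed score.

```python
import numpy as np

# Inpainting sketch with an analytic score.  Target: x ~ N(mu, I) in 2-D
# with independent coordinates; the first coordinate is observed (y = 3),
# the second is filled in by annealed Langevin dynamics.
rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0])
y, m = 3.0, np.array([1.0, 0.0])           # observed value and its mask
sigmas = np.geomspace(2.0, 0.01, 10)
eps, T, n = 1e-4, 100, 5000

x = rng.standard_normal((n, 2))
for sigma in sigmas:
    a = eps * sigma**2 / sigmas[-1] ** 2
    for _ in range(T):
        z = rng.standard_normal((n, 2))
        # perturbed score of N(mu, (1 + sigma^2) I)
        x = x + 0.5 * a * (-(x - mu) / (1 + sigma**2)) + np.sqrt(a) * z
        # overwrite the observed coordinate with a noisy copy of y
        noisy_y = y + sigma * rng.standard_normal(n)
        x = (1 - m) * x + m * noisy_y[:, None]
print(x[:, 0].mean(), x[:, 1].mean())      # ~3 (pinned) and ~-1 (sampled)
```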

Score-based generative modeling through stochastic differential equations ([4])


Instead of perturbing data with a finite number of noise distributions, we consider a continuum of distributions that evolve over time according to a diffusion process, given by a prescribed SDE. By reversing this process, we can smoothly mold random noise into data for sample generation. Crucially, this reverse process satisfies a reverse-time SDE.

Forward SDE:

\[dx = f(x, t)dt + g(t)dw\]

Quantities involved in forward SDE

$w$: standard Wiener process (a.k.a. Brownian motion)

$f(\cdot, t): \mathbb R^d \to \mathbb R^d$, drift coefficient

$g(\cdot): \mathbb R \to \mathbb R$, diffusion coefficient


$p_t(x)$ the probability density of $x(t)$.

$p_{st}(x(t) \mid x(s))$: the transition kernel; how likely it is to be at $x(t)$ at time $t$ given that we were at $x(s)$ at time $s$.

$x(0) \sim p_0$ (the data distribution)

$x(T) \sim p_T$ (usually Gaussian noise)
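A forward SDE can be simulated with Euler-Maruyama. The sketch below (my own toy example) uses the Ornstein-Uhlenbeck choice $f(x, t) = -\frac{1}{2}x$, $g(t) = 1$, whose stationary distribution is $\mathcal N(0, 1)$, so a sharp data distribution diffuses into Gaussian noise:

```python
import numpy as np

# Euler-Maruyama simulation of the forward SDE dx = f(x,t) dt + g(t) dw
# with the Ornstein-Uhlenbeck choice f(x,t) = -x/2, g(t) = 1, whose
# stationary distribution is N(0, 1).  Starting from a sharp "data"
# distribution, the marginal p_t drifts toward Gaussian noise.
rng = np.random.default_rng(0)
n, dt, T = 10000, 0.01, 10.0
steps = int(T / dt)

x = 3.0 + 0.1 * rng.standard_normal(n)     # x(0) ~ p_0, the data
for _ in range(steps):
    dw = np.sqrt(dt) * rng.standard_normal(n)
    x = x + (-0.5 * x) * dt + dw           # dx = f dt + g dw
print(x.mean(), x.var())                   # close to 0 and 1
```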

Reverse SDE

\[dx = [f(x, t) - g^2(t) \nabla_x\log p_t(x)]dt + g(t)d\tilde w\]

where $dt$ is a negative infinitesimal timestep and $\tilde w$ is a Wiener process with time flowing backwards from $T$ to $0$.

Observe that if we can approximate the score $\nabla_x \log p_t(x)$ for any $x, t$, then we can mold noise into data.
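As a sanity check, here is a numpy sketch (my own toy example) that integrates the reverse SDE for the OU forward process above, using the closed-form score of the Gaussian marginal as a stand-in for a learned $s_{\theta}(x, t)$:

```python
import numpy as np

# Reverse-time SDE sampling for the OU forward SDE dx = -x/2 dt + dw with
# data p_0 = N(2, 0.25).  The perturbed marginal is Gaussian,
#   p_t = N(m_t, v_t),  m_t = 2 e^{-t/2},  v_t = 0.25 e^{-t} + 1 - e^{-t},
# so the score -(x - m_t) / v_t is available in closed form.
rng = np.random.default_rng(0)
n, dt, T = 10000, 0.002, 8.0
steps = int(T / dt)

def score(x, t):
    m = 2.0 * np.exp(-t / 2)
    v = 0.25 * np.exp(-t) + 1 - np.exp(-t)
    return -(x - m) / v

x = rng.standard_normal(n)                 # x(T) ~ N(0, 1), pure noise
t = T
for _ in range(steps):                     # integrate backwards in time
    drift = -0.5 * x - score(x, t)         # f(x,t) - g^2(t) * score
    x = x - drift * dt + np.sqrt(dt) * rng.standard_normal(n)
    t -= dt
print(x.mean(), x.var())                   # close to 2 and 0.25
```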

Loss function

\[E_t\left( \lambda(t) E_{x(0)} E_{x(t) | x(0)}\left[ \left|\left| s_{\theta}(x(t), t) - \nabla_{x(t)}\log p_{0t}(x(t) | x(0)) \right|\right|^2\right] \right)\]

When $f(\cdot , t)$ is affine, then the transition kernel is a Gaussian distribution for which we can typically find the mean and the variance in closed form.
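For example (a standard worked case, not taken from the post): with the affine drift $f(x, t) = -\frac{1}{2}\beta x$ and $g(t) = \sqrt{\beta}$ for a constant $\beta > 0$, solving the linear SDE gives the Gaussian transition kernel

\[p_{0t}(x(t) | x(0)) = \mathcal N\left(x(t);\ x(0)e^{-\beta t / 2},\ \left(1 - e^{-\beta t}\right)I\right),\]

so the denoising target $\nabla_{x(t)}\log p_{0t}(x(t) | x(0))$ in the loss above is available in closed form.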

Solving general inverse problems

Basic idea: We can sample from $p_0(x(0) \mid y)$ if $p_t(y \mid x(t))$ is known.

Consider the forward SDE:

\[dx = f(x, t)dt + g(t)dw\]
with $x(0) \sim p_0(x(0) \mid y)$.

Then, the reverse SDE is:

\[dx = \{f(x, t) - g^2(t)\nabla_x \log p_t(x | y) \}dt + g(t)d\tilde w.\]

Now, observe that:

\[p_t(x(t) | y) \propto p_t(x(t)) \cdot p_t(y | x(t))\]

Hence, we can solve the following SDE:

\[dx = \{f(x, t) - g^2(t)\cdot \left[ \nabla_x \log p_t(x) + \nabla_x \log p_t(y|x) \right] \}dt + g(t)d\tilde w\]

Idea to solve the reverse SDE

Train a classifier to estimate $p_t(y \mid x(t))$.

Solve the reverse SDE without training a classifier

We need an estimate of $\nabla_x \log p_t(x(t) \mid y)$.