# Score based models

## Relevant papers:

[1] : Generative modeling by estimating gradients of the data

[2] : Adversarial score matching and improved sampling for image generation

[3] : Sliced-score matching: A Scalable Approach to Density and Score Estimation

[4] : Score-based generative modeling through stochastic differential equations

[5] : Denoising Diffusion Probabilistic Models (Jonathan Ho, Ajay Jain, Pieter Abbeel)

[6] : Bayesian Learning via Stochastic Gradient Langevin Dynamics

[7] : Estimation of Non-Normalized Statistical Models by Score Matching

[8] : A connection between score-matching and denoising auto-encoders

## Background

*Definition (score)*: \(\textrm{score}(x) = \nabla \log p(x).\)

*Definition (typical set) – informal*:Points in the distribution with a good amount of density and volume.

Let $x_0$ be a point in the typical set. Let ${x_i}_{i=1}^{T}$ be a sequence of points that we get by following the gradient of the log likelihood. Specifically,

\[x_{k+1} = x_{k} + a\nabla_{x}\log p(x_k).\]Using this rule, for $T\to \infty$ we end up in some mode of the distribution and not in the typical set.

### Metropolis Adjusted Langevin Algorithm (MALA)

A slight modification of this rule, will provably get us to the typical set. We use a scheme similar to Metropolis-Hastings.

First, we propose a new state with the rule: \(x_{k+1} = x_{k} + a\nabla_{x}\log p_(x_k) + \sqrt{2a}z, \quad z \sim \mathcal N(0, 1)\)

Then, we accept this state with probability: \(\textrm{Pr}(\textrm{accept}) = \min\left( 1, \frac{p(x_{k+1}) \cdot q(x_{k}|x_{k+1})}{p(x_k) \cdot q(x_{k+1} | x_k) }\right)\)

where $q$ is a pre-defined distribution that models the transitions.

### Removing the Metropolis-Hastings scheme

The Metropolis-Hastings scheme requires access to the density function. Estimating the density function is difficult because of the normalization constraint. [6] proposes to shrink the learning rate towards 0 over time. This technique is called annealing.

The new scheme is:

\[x_{k+1} = x_{k} + a_t\nabla_{x}\log p_(x_k) + \sqrt{2a_t}z, \quad z \sim \mathcal N(0, 1)\]## Generative modeling by estimating gradients of the data

It seems that if we can estimate $\nabla_x \log p(x)$ for any $x$, then we can sample from the typical set of the distribution.

### First idea (Score-matching [7])

Neural network $s_{\theta}(x)$ that tries to approximate $\nabla_x \log p(x)$.

Loss function:

\[\frac{1}{2}E_{x \sim p}\left[ ||s_{\theta}(x) - \nabla_x\log p(x)||^2\right]\]This can be proven [7] equivalent to: \(E_{x \sim p}\left[ \textrm{tr}\left( \nabla_x s_{\theta}(x)\right) + \frac{1}{2}||s_{\theta}(x)||_2^2\right]\)

#### Challenges

- The manifold hypothesis. TLDR: we cannot estimate well the score in the low-density regions because we don’t get a lot of samples there.
- The first term of the loss requires a lot of computation (that happens in every forward pass).

### Second idea (Denoising Score-Matching [8])

We will add noise to the data and try to estimate the score for the noisy samples.

New loss: \(\frac{1}{2}E_{\tilde x \sim q}\left[ ||s_{\theta}(\tilde x) - \nabla_{\tilde x}\log q(\tilde x)||^2\right]\)

[8] proves that the latter is equivalent to minimizing:

\[\frac{1}{2}E\left[ ||s_{\theta}(\tilde x) - \nabla_{\tilde x} \log q_{\sigma}(\tilde x | x)||^2\right]\]#### Challenges

How to choose the noise-scale? For small noises, we still have the manifold hypothesis problem. For large noises, we will end up getting noisy samples.

### Main idea of [1]

Learn multiple noisy scores jointly. During sampling, smoothly transition between the scales.

\[l(\theta, \sigma) = \frac{1}{2}E_{p_{data}(x)}E_{\tilde x \sim \mathcal N(x, \sigma^2 I)}\left[ \left|\left| s_{\theta}(\tilde x, \sigma) + \frac{\tilde x - x}{\sigma^2}\right|\right|^2\right]\]New loss: \(\mathcal L(\theta, \{ \sigma_i\}_{i=1}^L) = \frac{1}{L}\sum_{i=1}^{L}\lambda(\sigma_i)l(\theta, \sigma_i)\)

### Sampling of [1]

```
Initialize x_0 to a random image
for i in [L]:
a_i = epsilon * sigma_i^2 / sigma_L^2
for t in [T]:
x_t = x_{t - 1} + 0.5 a_i * s(x_{t-1}, sigma_i) + sqrt{a_i} z
```

### Inpainting with [1]

```
Input: y (given image) and m binary mask
Initialize x_0 to a random image
for i in [L]:
a_i = epsilon * sigma_i^2 / sigma_L^2
for t in [T]:
x_t = x_{t - 1} + 0.5 a_i * s(x_{t-1}, sigma_i) + sqrt{a_i} z
x_t = (1 - m) * x_t + (y + z) * m
```

### Open-problems

We do not have any guarantee that by doing that we will get samples from $p(x y)$. - Can you do inversion?

## Score-based generative modeling through stochastic differential equations ([4])

### Idea

Instead of pertubing data with a finite number of noise distributions, we consider a continuum of distributions that evolve over time according to a diffusion process, given by a prescribed SDE. By reversing this process, we can smoothly mold random noise into data for sample generation. Crucially, this reverse process satisfies a reverse-time SDE.

### Forward SDE:

\[dx = f(x, t)dt + g(t)dw\]#### Quantities involved in forward SDE

$w$: standard Wiener process (a.k.a. Brownian motion)

$f(\cdot, t): \mathbb R^d \to \mathbb R^d$, drift coefficient

$g(\cdot): \mathbb R \to \mathbb R$, diffusion coefficient

#### Semantics

$p_t(x)$ the probability density of $x(t)$.

$p_{st}(x(t) | x(s))$: quantifies how likely is to be at step $t$ in $x(t)$ if we were at $x(s)$ at step $s$. |

$x(0) \sim p_0$ (the data distribution)

$x(T) \sim p_T$ (usually Gaussian noise)

### Reverse SDE

\[dx = [f(x, t) - g^2(t) \nabla_x\log p_t(x)]dt + g(t)d\tilde w\]Observe that if we can approximate $p_t(x)$ for any $x, t$, then we can mold noise to data.

### Loss function

\[E_t\left( \lambda(t) E_{x(0)} E_{x(t) | x(0)}\left[ \left|\left| s_{\theta}(x(t), t) - \nabla_{x(t)}\log p_{0t}(x(t) | x(0)) \right|\right|^2\right] \right)\]When $f(\cdot , t)$ is affine, then the transition kernel is a Gaussian distribution for which we can typically find the mean and the variance in closed form.

### Solving general inverse problems

Basic idea: We can sample from $p_0(x(0) | y)$ if $p_t(y | x(t))$ is known. |

Consider the forward SDE:

\[dx = f(x, t)dt + g(t)dw\]with $x_0 \sim p_0(x(0) | y)$. |

Then, the reverse SDE is:

\[dx = \{f(x, t) - g^2(t)\nabla_x \log p_t(x | y) \}dt + g(t)d\tilde w.\]Now, observe that:

\[p_t(x(t) | y) \propto p_t(x(t)) \cdot p_t(y | x(t))\]Hence, we can solve the following SDE:

\[dx = \{f(x, t) - g^2(t)\cdot \left[ \nabla_x \log p_t(x) + \nabla_x \log p_t(y|x) \right] \}dt + g(t)d\tilde w\]#### Idea to solve the reverse SDE

Train a classifier to estimate $p_t(y | x_t)$. |

#### Solve the reverse SDE without training a classifier

We need an estimate of $\nabla_x p_t(x(t) | y)$.