Score-based models
Relevant papers:
[1] : Generative Modeling by Estimating Gradients of the Data Distribution
[2] : Adversarial score matching and improved sampling for image generation
[3] : Sliced Score Matching: A Scalable Approach to Density and Score Estimation
[4] : Score-based generative modeling through stochastic differential equations
[5] : Denoising Diffusion Probabilistic Models (Jonathan Ho, Ajay Jain, Pieter Abbeel)
[6] : Bayesian Learning via Stochastic Gradient Langevin Dynamics
[7] : Estimation of Non-Normalized Statistical Models by Score Matching
[8] : A connection between score-matching and denoising auto-encoders
Background
Definition (score): \(\textrm{score}(x) = \nabla_x \log p(x).\)
Definition (typical set), informal: the set of points that have both a reasonable amount of density and a reasonable amount of volume; together they carry most of the probability mass of the distribution.
Let $x_0$ be a point in the typical set, and let $\{x_k\}_{k=1}^{T}$ be the sequence of points we get by following the gradient of the log-likelihood. Specifically,
\[x_{k+1} = x_{k} + a\nabla_{x}\log p(x_k).\]
Using this rule, for $T\to \infty$ we end up in some mode of the distribution and not in the typical set.
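As a sanity check, here is a minimal NumPy sketch of this update rule, assuming a toy 1-D two-component Gaussian mixture (the mixture, step size, and starting point are arbitrary illustrations): iterating the rule is just gradient ascent on $\log p$, so the iterate settles at the nearest mode instead of exploring the typical set.

```python
import numpy as np

# Toy 1-D mixture: 0.5 * N(-4, 1) + 0.5 * N(4, 1)
def log_p(x):
    return (np.logaddexp(-0.5 * (x + 4) ** 2, -0.5 * (x - 4) ** 2)
            + np.log(0.5) - 0.5 * np.log(2 * np.pi))

def score(x, eps=1e-4):
    # Numerical gradient of log p (a learned network would replace this)
    return (log_p(x + eps) - log_p(x - eps)) / (2 * eps)

x, a = 1.0, 0.1
for _ in range(1000):
    x = x + a * score(x)   # x_{k+1} = x_k + a * grad_x log p(x_k)
print(x)                   # ~4.0: stuck at the nearest mode
```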
Metropolis Adjusted Langevin Algorithm (MALA)
A slight modification of this rule will provably get us to the typical set. We use a scheme similar to Metropolis-Hastings.
First, we propose a new state with the rule: \(x_{k+1} = x_{k} + a\nabla_{x}\log p(x_k) + \sqrt{2a}\,z, \quad z \sim \mathcal N(0, I)\)
Then, we accept this state with probability: \(\textrm{Pr}(\textrm{accept}) = \min\left( 1, \frac{p(x_{k+1}) \cdot q(x_{k}|x_{k+1})}{p(x_k) \cdot q(x_{k+1} | x_k) }\right)\)
where $q$ is the proposal distribution induced by the Langevin step above, i.e. $q(x_{k+1} | x_k) = \mathcal N\left(x_{k+1};\, x_k + a\nabla_x \log p(x_k),\, 2aI\right)$.
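A minimal MALA sketch in NumPy, assuming the (unnormalized) target is a standard 2-D Gaussian so that `log_p` and `score` are analytic placeholders; the proposal `q` is the Gaussian induced by the Langevin step, exactly as in the acceptance ratio above.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):                 # unnormalized log density: standard 2-D Gaussian
    return -0.5 * np.dot(x, x)

def score(x):                 # grad_x log p(x)
    return -x

def log_q(x_new, x_old, a):   # log of the Langevin proposal N(x_old + a*score(x_old), 2a I)
    mean = x_old + a * score(x_old)
    return -np.sum((x_new - mean) ** 2) / (4 * a)

a, x = 0.1, np.zeros(2)
samples = []
for _ in range(5000):
    prop = x + a * score(x) + np.sqrt(2 * a) * rng.standard_normal(2)
    log_alpha = log_p(prop) + log_q(x, prop, a) - log_p(x) - log_q(prop, x, a)
    if np.log(rng.uniform()) < log_alpha:   # accept with probability min(1, alpha)
        x = prop
    samples.append(x)
```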
Removing the Metropolis-Hastings scheme
The Metropolis-Hastings scheme requires access to the density function itself, and estimating the density is difficult because of the normalization constant. Instead, [6] proposes to shrink the learning rate towards 0 over time; with such a decaying schedule the rejection step becomes unnecessary. This technique is called annealing.
The new scheme is:
\[x_{k+1} = x_{k} + a_k\nabla_{x}\log p(x_k) + \sqrt{2a_k}\,z, \quad z \sim \mathcal N(0, I)\]
Generative modeling by estimating gradients of the data distribution ([1])
It seems that if we can estimate $\nabla_x \log p(x)$ for any $x$, then we can sample from the typical set of the distribution.
First idea (Score-matching [7])
Neural network $s_{\theta}(x)$ that tries to approximate $\nabla_x \log p(x)$.
Loss function:
\[\frac{1}{2}E_{x \sim p}\left[ ||s_{\theta}(x) - \nabla_x\log p(x)||^2\right]\]
This can be shown [7] to be equivalent (up to a constant independent of $\theta$) to: \(E_{x \sim p}\left[ \textrm{tr}\left( \nabla_x s_{\theta}(x)\right) + \frac{1}{2}||s_{\theta}(x)||_2^2\right]\)
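A sketch of this objective in PyTorch; `s_theta` is a hypothetical score network and the data are random placeholders. Note that the trace of the Jacobian is accumulated one coordinate at a time with autograd, which is exactly the cost issue raised under Challenges below.

```python
import torch

def score_matching_loss(s_theta, x):
    # E[ tr(grad_x s_theta(x)) + 0.5 * ||s_theta(x)||^2 ]
    x = x.requires_grad_(True)
    s = s_theta(x)                                   # (batch, d)
    sq_norm = 0.5 * (s ** 2).sum(dim=1)
    trace = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):                      # one backward pass per dimension
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]
    return (trace + sq_norm).mean()

# Example usage with a toy score network on 2-D data
s_theta = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Softplus(),
                              torch.nn.Linear(64, 2))
loss = score_matching_loss(s_theta, torch.randn(128, 2))
loss.backward()
```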
Challenges
- The manifold hypothesis: data concentrate on a low-dimensional manifold, so we cannot estimate the score well in low-density regions because we see very few samples there.
- The first term of the loss, the trace of the Jacobian $\nabla_x s_{\theta}(x)$, is expensive to compute for high-dimensional data, and this cost is paid at every training step (as in the sketch above).
Second idea (Denoising Score-Matching [8])
We will add noise to the data and try to estimate the score for the noisy samples.
New loss: \(\frac{1}{2}E_{\tilde x \sim q_{\sigma}}\left[ ||s_{\theta}(\tilde x) - \nabla_{\tilde x}\log q_{\sigma}(\tilde x)||^2\right]\), where $q_{\sigma}(\tilde x) = \int q_{\sigma}(\tilde x | x)\, p(x)\, dx$ is the noise-perturbed data distribution.
[8] proves that the latter is equivalent to minimizing:
\[\frac{1}{2}E_{x \sim p,\, \tilde x \sim q_{\sigma}(\cdot | x)}\left[ ||s_{\theta}(\tilde x) - \nabla_{\tilde x} \log q_{\sigma}(\tilde x | x)||^2\right]\]
Challenges
How do we choose the noise scale? For small noise levels, we still face the manifold-hypothesis problem; for large noise levels, the perturbed distribution no longer resembles the data, so we end up generating noisy samples.
Main idea of [1]
Learn multiple noisy scores jointly. During sampling, smoothly transition between the scales.
\[l(\theta, \sigma) = \frac{1}{2}E_{p_{data}(x)}E_{\tilde x \sim \mathcal N(x, \sigma^2 I)}\left[ \left|\left| s_{\theta}(\tilde x, \sigma) + \frac{\tilde x - x}{\sigma^2}\right|\right|^2\right]\]
New (combined) loss: \(\mathcal L(\theta, \{ \sigma_i\}_{i=1}^L) = \frac{1}{L}\sum_{i=1}^{L}\lambda(\sigma_i)\, l(\theta, \sigma_i)\), with $\lambda(\sigma) = \sigma^2$ in [1].
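A sketch of the combined objective in PyTorch, assuming a noise-conditional score network `s_theta(x_tilde, sigma)` (a placeholder) and the weighting $\lambda(\sigma) = \sigma^2$ from [1]; the geometric noise schedule below is illustrative.

```python
import math
import torch

def ncsn_loss(s_theta, x, sigmas):
    # L(theta) = (1/L) * sum_i lambda(sigma_i) * l(theta, sigma_i), lambda(sigma) = sigma^2
    total = 0.0
    for sigma in sigmas:
        noise = torch.randn_like(x) * sigma
        x_tilde = x + noise                      # x_tilde ~ N(x, sigma^2 I)
        target = -noise / sigma ** 2             # = -(x_tilde - x) / sigma^2
        s = s_theta(x_tilde, sigma)
        l = 0.5 * ((s - target) ** 2).sum(dim=1).mean()
        total = total + sigma ** 2 * l           # lambda(sigma) = sigma^2
    return total / len(sigmas)

# Geometric noise schedule sigma_1 > ... > sigma_L
sigmas = torch.exp(torch.linspace(math.log(1.0), math.log(0.01), 10))
```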
Sampling of [1]
Initialize x_0 to a random image
for i in [L]:                                  # sigma_1 > sigma_2 > ... > sigma_L
    a_i = epsilon * sigma_i^2 / sigma_L^2      # step size for scale i
    for t in [T]:
        sample z ~ N(0, I)
        x_t = x_{t-1} + 0.5 * a_i * s(x_{t-1}, sigma_i) + sqrt(a_i) * z
    x_0 = x_T                                  # warm-start the next noise scale
return x_T
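The same procedure as a runnable NumPy sketch, assuming a trained score function `score(x, sigma)` (a placeholder); note that the sample is carried over from one noise scale to the next.

```python
import numpy as np

def annealed_langevin(score, shape, sigmas, eps=2e-5, T=100,
                      rng=np.random.default_rng()):
    x = rng.uniform(size=shape)                   # x_0: random image
    for sigma in sigmas:                          # sigma_1 > ... > sigma_L
        a = eps * sigma ** 2 / sigmas[-1] ** 2    # step size for this scale
        for _ in range(T):
            z = rng.standard_normal(shape)
            x = x + 0.5 * a * score(x, sigma) + np.sqrt(a) * z
    return x
```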
Inpainting with [1]
Input: y (given image) and a binary mask m (1 on the known pixels)
Initialize x_0 to a random image
for i in [L]:
    a_i = epsilon * sigma_i^2 / sigma_L^2
    for t in [T]:
        sample z ~ N(0, I)
        x_t = x_{t-1} + 0.5 * a_i * s(x_{t-1}, sigma_i) + sqrt(a_i) * z
        sample z' ~ N(0, I)
        x_t = (1 - m) * x_t + m * (y + sigma_i * z')   # keep known pixels, perturbed at the current noise scale
    x_0 = x_T
return x_T
Open problems
- We do not have any guarantee that this procedure yields samples from $p(x | y)$.
- Can you do inversion?
Score-based generative modeling through stochastic differential equations ([4])
Idea
Instead of perturbing data with a finite number of noise distributions, we consider a continuum of distributions that evolve over time according to a diffusion process, given by a prescribed SDE. By reversing this process, we can smoothly mold random noise into data for sample generation. Crucially, this reverse process satisfies a reverse-time SDE.
Forward SDE:
\[dx = f(x, t)dt + g(t)dw\]
Quantities involved in the forward SDE
$w$: standard Wiener process (a.k.a. Brownian motion)
$f(\cdot, t): \mathbb R^d \to \mathbb R^d$, drift coefficient
$g(\cdot): \mathbb R \to \mathbb R$, diffusion coefficient
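As a concrete instance from [4], the variance-exploding (VE) SDE, which generalizes the discrete noise scales of [1], has no drift and a time-dependent diffusion coefficient:
\[f(x, t) = 0, \qquad g(t) = \sqrt{\frac{d[\sigma^2(t)]}{dt}}, \qquad dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}}\, dw.\]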
Semantics
$p_t(x)$: the probability density of $x(t)$.
$p_{st}(x(t) | x(s))$: the transition kernel, i.e. how likely it is to be at $x(t)$ at time $t$ given that we were at $x(s)$ at time $s$.
$x(0) \sim p_0$ (the data distribution)
$x(T) \sim p_T$ (usually Gaussian noise)
Reverse SDE
\[dx = [f(x, t) - g^2(t) \nabla_x\log p_t(x)]dt + g(t)d\tilde w\]
where $\tilde w$ is a Wiener process running backwards in time. Observe that if we can approximate the score $\nabla_x \log p_t(x)$ for any $x, t$, then we can mold noise into data.
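A sketch of simulating the reverse SDE with a simple Euler-Maruyama discretization from $t = T$ down to $t = 0$, assuming a learned score model `s_theta(x, t)` and callables `f(x, t)`, `g(t)` (all placeholders; the step count and time horizon are arbitrary).

```python
import numpy as np

def reverse_sde_sample(s_theta, f, g, shape, T=1.0, N=1000,
                       rng=np.random.default_rng()):
    dt = T / N
    x = rng.standard_normal(shape)                    # x(T) ~ p_T (Gaussian prior)
    for i in range(N, 0, -1):
        t = i * dt
        z = rng.standard_normal(shape)
        drift = f(x, t) - g(t) ** 2 * s_theta(x, t)   # f - g^2 * grad_x log p_t
        x = x - drift * dt + g(t) * np.sqrt(dt) * z   # Euler-Maruyama step backwards in time
    return x
```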
Loss function
\[E_t\left( \lambda(t) E_{x(0)} E_{x(t) | x(0)}\left[ \left|\left| s_{\theta}(x(t), t) - \nabla_{x(t)}\log p_{0t}(x(t) | x(0)) \right|\right|^2\right] \right)\]
When $f(\cdot , t)$ is affine, the transition kernel $p_{0t}(x(t) | x(0))$ is a Gaussian distribution whose mean and variance can typically be found in closed form.
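For example, for the VE SDE above (zero drift, hence trivially affine), the transition kernel is
\[p_{0t}(x(t) | x(0)) = \mathcal N\!\left(x(t);\, x(0),\, [\sigma^2(t) - \sigma^2(0)]\, I\right),\]
so the target $\nabla_{x(t)} \log p_{0t}(x(t) | x(0)) = -\frac{x(t) - x(0)}{\sigma^2(t) - \sigma^2(0)}$ is available in closed form.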
Solving general inverse problems
Basic idea: We can sample from $p_0(x(0) | y)$ if $p_t(y | x(t))$ is known.
Consider the forward SDE:
\[dx = f(x, t)dt + g(t)dw\]
with $x(0) \sim p_0(x(0) | y)$.
Then, the reverse SDE is:
\[dx = \{f(x, t) - g^2(t)\nabla_x \log p_t(x | y) \}dt + g(t)d\tilde w.\]
Now, observe that:
\[p_t(x(t) | y) \propto p_t(x(t)) \cdot p_t(y | x(t))\]
Hence, we can solve the following SDE:
\[dx = \{f(x, t) - g^2(t)\cdot \left[ \nabla_x \log p_t(x) + \nabla_x \log p_t(y|x) \right] \}dt + g(t)d\tilde w\]
Idea to solve the reverse SDE
Train a classifier to estimate $p_t(y | x_t)$, and use its gradient $\nabla_x \log p_t(y | x_t)$ in the SDE above.
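A sketch of combining the two gradients at sampling time, assuming an unconditional score model `s_theta(x, t)` and a time-dependent classifier `classifier(x, t)` that returns per-class log-probabilities (both placeholders; `y` is a tensor of class labels).

```python
import torch

def conditional_score(s_theta, classifier, x, t, y):
    # grad_x log p_t(x | y) = grad_x log p_t(x) + grad_x log p_t(y | x)
    x = x.detach().requires_grad_(True)
    log_probs = classifier(x, t)                        # (batch, num_classes) log-probabilities
    log_p_y = log_probs[torch.arange(x.shape[0]), y]    # log p_t(y | x) per sample
    grad_log_p_y = torch.autograd.grad(log_p_y.sum(), x)[0]
    return s_theta(x, t) + grad_log_p_y
```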
Solve the reverse SDE without training a classifier
We need an estimate of $\nabla_x \log p_t(x(t) | y)$.