I’ve become really interested in strategies for sampling from intractable probability distributions, especially MCMC. Not only does sampling crop up all over machine learning but I also find the problem intrinsically intellectually appealing. Sampling is deceptively difficult - a very easy problem to state but hugely complex to solve.

Recently my attention has been caught by MCMC strategies that involve learning how to sample. There have been a couple of papers in this vein: Generalising Hamiltonian Monte Carlo with Neural Nets and A-NICE-MC are both excellent examples.

In this post I will:

  • Describe the problem MCMC is trying to solve.
  • Recap the basics of Metropolis Hastings.

Next post:

  • Delve into some of the cutting edge strategies for learning samplers.

The Problem

Concretely the problem that frequently crops up in ML is either:

  • Draw samples from a probability distribution $p(x)$, or
  • Compute the expectation $\mathbb{E}_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx$,

where the distribution $p(x)$ is very complicated. Often we only know $p(x)$ up to some normalisation constant, i.e. $p(x) = \tilde{p}(x) / Z$ with $Z = \int \tilde{p}(x)\, dx$ unknown, and we can only evaluate $\tilde{p}(x)$ at a point.

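To make this concrete, here is a minimal sketch (with a made-up toy density, not taken from any particular application) of the kind of object we are typically handed: a function we can evaluate point-wise, whose normalising constant we never get to see.

```python
import numpy as np

def log_p_tilde(x):
    """Log of an unnormalised 1D 'double well' density p_tilde(x) = exp(-(x^2 - 4)^2 / 2).

    The normalising constant Z = integral of p_tilde(x) dx has no simple closed
    form, so we only ever know p(x) = p_tilde(x) / Z up to Z.
    """
    x = np.asarray(x, dtype=float)
    return -0.5 * (x ** 2 - 4.0) ** 2

# Point-wise evaluation is easy...
print(log_p_tilde(2.0), log_p_tilde(0.0))
# ...but "what fraction of the total mass lies near x = 2?" is a global question
# that needs the unknown normaliser, or a set of samples.
```
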
This problem comes up in inference when marginalising over variables, learning with intractable normalisers and in model comparison when computing the model evidence.

One Concrete Example - Bayesian Deep Learning

I find it helps to have a motivating example in mind. One cool example of MCMC that bridges the worlds of traditional probabilistic learning and Deep Learning is sampling the posterior over the weights of a neural network. I think Radford Neal was one of the first to notice the opportunity here; he applied an MCMC algorithm called Hamiltonian Monte Carlo to the problem and got excellent results.

Imagine you’re using a neural net for a regression task. Given an input $x$, your neural net outputs a prediction $f_\theta(x)$, where $\theta$ are the network weights. Typically neural nets are trained by optimising a cost function, maybe with some kind of regulariser:

$$C(\theta) = \sum_{i=1}^{N} \big(y_i - f_\theta(x_i)\big)^2 + \lambda \lVert \theta \rVert^2$$

Another, equivalent, view is that this cost function corresponds to MAP inference in the following probabilistic model. The likelihood is just a Gaussian whose mean is the network output $f_\theta(x)$:

$$p(y \mid x, \theta) = \mathcal{N}\big(y \mid f_\theta(x), \sigma^2\big)$$

and the prior is:

$$p(\theta) = \mathcal{N}\big(\theta \mid 0, \sigma_\theta^2 I\big)$$

and the data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ are assumed to be drawn i.i.d. In this case our cost function is equivalent to the negative log posterior, up to an additive constant and an overall scale (with $\lambda = \sigma^2 / \sigma_\theta^2$):

$$-\log p(\theta \mid \mathcal{D}) = \frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - f_\theta(x_i)\big)^2 + \frac{1}{2\sigma_\theta^2}\lVert \theta \rVert^2 + \text{const}$$

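To make the object we will eventually need to sample from a little more tangible, here is a rough code sketch of the unnormalised log posterior over the weights, assuming the Gaussian likelihood and prior above. The tiny architecture, the names predict and log_posterior_unnorm, and the default variances are all my own illustrative choices, not anything from the original model.

```python
import numpy as np

def predict(theta, X):
    """Tiny one-hidden-layer network; theta = (W1, b1, w2, b2)."""
    W1, b1, w2, b2 = theta
    h = np.tanh(X @ W1 + b1)
    return h @ w2 + b2

def log_posterior_unnorm(theta, X, y, sigma=0.1, sigma_theta=1.0):
    """log p(theta | D) up to an unknown constant: log likelihood + log prior."""
    resid = y - predict(theta, X)
    log_lik = -0.5 * np.sum(resid ** 2) / sigma ** 2
    log_prior = -0.5 * sum(np.sum(p ** 2) for p in theta) / sigma_theta ** 2
    return log_lik + log_prior
```
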
Rather than doing MAP inference in this model we could actually draw samples from the posterior over model parameters, $p(\theta \mid \mathcal{D})$. If we do this, not only will our model predictions likely be more accurate than just taking the MAP but we can also calculate uncertainty estimates for all of our predictions. Our predictions now become:

$$\hat{y}(x) = \mathbb{E}_{p(\theta \mid \mathcal{D})}\big[f_\theta(x)\big] \approx \frac{1}{S}\sum_{s=1}^{S} f_{\theta_s}(x)$$

where:

$$\theta_s \sim p(\theta \mid \mathcal{D}),$$

and our uncertainty becomes:

$$\mathrm{Var}_{p(\theta \mid \mathcal{D})}\big[f_\theta(x)\big] \approx \frac{1}{S}\sum_{s=1}^{S}\big(f_{\theta_s}(x) - \hat{y}(x)\big)^2$$

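In code, once we have a set of (approximate) posterior samples from MCMC, the predictive mean and its uncertainty are just Monte Carlo averages. A minimal sketch, reusing the hypothetical predict function above:

```python
import numpy as np

def predictive_mean_and_std(theta_samples, X):
    """Monte Carlo estimate of the posterior predictive mean and its spread.

    theta_samples: weight settings drawn (approximately) from p(theta | D).
    """
    preds = np.stack([predict(theta, X) for theta in theta_samples])  # (S, N)
    return preds.mean(axis=0), preds.std(axis=0)
```
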
Being able to estimate uncertainty is incredibly important when deploying deep learning in the real world. For example, when used in medical diagnosis, we may not want to trust our deep learning model when the uncertainty is high.

In order to get these uncertainty estimates we need to be able to sample from $p(\theta \mid \mathcal{D})$. This is far from trivial.

What makes this so hard?

Sampling requires global information but if we can only evaluate the density point-wise, our information is inherently local. By definition, to be able to draw samples from a distribution we need samples on average to come from places that contain a large fraction of the total probability mass. However, knowing the fraction of probability mass in a given region is an inherently global property. We need to know not just how big the unnormalised density is around a point of interest but how this compares to other parts of this space. Since we can only evaluate the density locally, we have to somehow combine sparse local information to get a picture of the whole. This becomes increasingly hard as the dimensionality of the problem grows.

Markov Chain Monte Carlo (MCMC) and Metropolis-Hastings

Markov Chain Monte Carlo is a strategy for approximately solving the sampling problem. MCMC works by simulating a particle roaming around the sampling space in such a way that, in the limit of infinite exploration, the amount of time it spends in each part of the space is proportional to the probability of that part of the space. In practice a good approximation can be achieved after a finite amount of exploration.

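Concretely, once the simulation has visited points $x_1, \dots, x_S$, any expectation under $p$ is approximated by a simple time average over the chain:

$$\mathbb{E}_{p(x)}\big[f(x)\big] \approx \frac{1}{S}\sum_{s=1}^{S} f(x_s)$$
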
How do we run the simulation?

We initialise our particle at some random point $x_0$ and then repeatedly sample from a carefully chosen proposal distribution $q(x' \mid x)$. As long as we are judicious in our choice of proposal we can prove that the points the particle visits will eventually be a collection of samples from the target distribution $p(x)$.

In order for this to work we need our Markov transition distribution $T(x' \mid x)$ to have two very important properties. First we need to know that our target distribution is a fixed point of this transition operator, i.e. that if we sample a point from $p(x)$ and then repeatedly sample from $T(x' \mid x)$, the marginal distribution over our samples will still be $p(x)$. Formally we require:

$$p(x') = \int T(x' \mid x)\, p(x)\, dx$$

One sufficient (but not necessary) condition for this to be true is that $T$ satisfies detailed balance:

$$p(x)\, T(x' \mid x) = p(x')\, T(x \mid x')$$

If detailed balance holds then we are guaranteed that the target distribution is a fixed point. You can see this by substituting the condition into the stationarity equation: $\int T(x' \mid x)\, p(x)\, dx = \int T(x \mid x')\, p(x')\, dx = p(x')$, since $T(x \mid x')$ integrates to one over $x$. (The downside of detailed balance is that we’re always as likely to go forward in our simulation as we are to go backwards, and this slows us down.)

Once we’ve established that $p(x)$ is a fixed point of our operator, we need to know that no matter where we start we will end up at this fixed point. That is, we need our chain of samples to be “ergodic”. We can guarantee ergodicity by making sure that no point in the sample space is visited with a fixed period and that no parts of the sample space are inaccessible from one another.

Metropolis-Hastings

The Metropolis-Hastings algorithm is a strategy for shoe-horning any Markov proposal distribution $q(x' \mid x)$ into a transition operator that has the above desired properties. The way that this is done is to introduce an accept-reject step at each stage of the simulation. Roughly, we sample from the proposal and then either accept that point as our next sample or reject it and stay where we are, depending on how likely the proposed point is under our target distribution. Concretely we sample $x' \sim q(x' \mid x)$ and then accept it with probability:

$$A(x', x) = \min\left(1,\; \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)$$

(Note that the normalisation constant $Z$ cancels in this ratio, so we only ever need to evaluate the unnormalised density $\tilde{p}$.)

The transition resulting from this combined process of sampling and accepting or rejecting is then, for $x' \neq x$:

$$T(x' \mid x) = q(x' \mid x)\, A(x', x)$$

and this transition has the property that NO MATTER what $q$ is we have satisfied detailed balance! This is because (assuming w.l.o.g. that the acceptance fraction $A(x', x)$ is less than 1, in which case the reverse acceptance $A(x, x') = 1$):

$$p(x)\, T(x' \mid x) = p(x)\, q(x' \mid x)\, \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)} = p(x')\, q(x \mid x')$$

and

$$p(x')\, T(x \mid x') = p(x')\, q(x \mid x')\, A(x, x') = p(x')\, q(x \mid x'),$$

so detailed balance holds.

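To make the recipe concrete, here is a minimal sketch of random-walk Metropolis-Hastings in Python. It assumes a symmetric Gaussian proposal, so the $q$ terms cancel in the acceptance ratio, and it only ever touches the unnormalised log-density (the toy log_p_tilde above would do); the function name and the step_size parameter are my own.

```python
import numpy as np

def metropolis_hastings(log_p_tilde, x0, n_steps=10000, step_size=0.5, rng=None):
    """Random-walk Metropolis-Hastings targeting p(x) proportional to exp(log_p_tilde(x))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    log_p_x = log_p_tilde(x)
    samples, n_accept = [], 0
    for _ in range(n_steps):
        # Propose x' ~ N(x, step_size^2 I); symmetric, so q(x | x') / q(x' | x) = 1.
        x_prop = x + step_size * rng.standard_normal(x.shape)
        log_p_prop = log_p_tilde(x_prop)
        # Accept with probability min(1, p(x') / p(x)), computed in log space.
        if np.log(rng.uniform()) < log_p_prop - log_p_x:
            x, log_p_x = x_prop, log_p_prop
            n_accept += 1
        samples.append(x.copy())
    return np.array(samples), n_accept / n_steps
```
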
The power of Metropolis-Hastings is that it gives us enormous flexibility over our choice of proposal distribution whilst still guaranteeing that we’ll eventually get accurate samples. This is why it forms the backbone of many popular sampling algorithms, including the learning-to-sample methods that I will talk about in my next post, but also Gibbs sampling, Hamiltonian Monte Carlo, NUTS and slice sampling.

However, if we make a poor choice of $q$ we will find that the probability of accepting a new point is very low and it’ll take a very long time for our simulation to converge to the distribution of interest. Also, we shoe-horn $q$ into a valid transition operator by enforcing detailed balance, and as was mentioned before, detailed balance can seriously slow down the generation of samples.

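For example, running the sketch above on the toy double-well density, an overly bold proposal gets rejected most of the time while a timid one accepts almost everything but barely moves (exact numbers vary from run to run):

```python
for step_size in (0.01, 0.5, 5.0):
    _, accept_rate = metropolis_hastings(log_p_tilde, x0=0.0,
                                         n_steps=5000, step_size=step_size)
    print(f"step_size={step_size:<5} acceptance rate ~ {accept_rate:.2f}")
```
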
Summary

MCMC is one very powerful technique for solving the sampling problem. I personally find it pretty amazing that a single point particle wandering around a high-dimensional sampling space can gather enough information to give us a representative summary of the overall distribution.

Whilst Metropolis-Hastings represents a good starting place for building sampling algorithms, its performance depends heavily on the choice of proposal distribution. MCMC only gives us exact samples in the limit of infinite time; how good an approximation we get in finite time is highly variable and actually quite hard to even assess.

There are many variants of MCMC algorithms but a lot of them can be viewed as Metropolis-Hastings with different carefully crafted choices for the proposal distribution $q(x' \mid x)$. One strategy that’s recently been proposed is to parameterise the proposal as $q_\phi(x' \mid x)$ and somehow learn the parameters $\phi$ to provide efficient sampling for a given distribution.