Machine learning is full of seemingly simple mathematical “tricks” that are disproportionately useful relative to their complexity. The reparameterisation trick is one of these. In the next few posts I want to go into more detail about variational auto-encoders, where the trick will come up again, but for the time being I just want to present a neat and useful “trick” that I think can be confusing when you first come across it.

Often in machine learning we want to estimate the gradient of an expectation using sampling. This happens in reinforcement learning when doing policy gradients, and in stochastic variational inference. The problem we are trying to solve is to find:

$$\nabla_\theta \, \mathbb{E}_{x \sim p(x)}\left[f_\theta(x)\right]$$
As long as $p(x)$ is independent of $\theta$ there's no problem here: we can use the Leibniz rule and simply bring the derivative inside the expectation as follows:

$$\nabla_\theta \, \mathbb{E}_{x \sim p(x)}\left[f_\theta(x)\right] = \mathbb{E}_{x \sim p(x)}\left[\nabla_\theta f_\theta(x)\right]$$
and then construct a Monte Carlo estimate of the gradient:

$$\mathbb{E}_{x \sim p(x)}\left[\nabla_\theta f_\theta(x)\right] \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta f_\theta(x_i)$$
where each $x_i$ is sampled from $p(x)$. If, however, as is often the case, the distribution depends on $\theta$, i.e. $p_\theta(x)$, then the above trick won't work, because after you bring the derivative inside the integral (or sum) the new integral is no longer an expectation, and we can't construct a straightforward Monte Carlo estimate.

There are two ways to get around this. The first is used in the REINFORCE algorithm and is known as the log derivative trick. It relies on the following identity, which is easy to verify:

$$\nabla_\theta \, p_\theta(x) = p_\theta(x) \nabla_\theta \log p_\theta(x)$$
If we substitute this identity for $\nabla_\theta \, p_\theta(x)$ when expanding the gradient of the expectation, then we get:

$$\nabla_\theta \, \mathbb{E}_{x \sim p_\theta(x)}\left[f(x)\right] = \int f(x) \nabla_\theta \, p_\theta(x) \, dx = \int f(x) p_\theta(x) \nabla_\theta \log p_\theta(x) \, dx = \mathbb{E}_{x \sim p_\theta(x)}\left[f(x) \nabla_\theta \log p_\theta(x)\right]$$
and since this is still an expectation, we can again construct a simple Monte Carlo estimate of the gradient:

$$\nabla_\theta \, \mathbb{E}_{x \sim p_\theta(x)}\left[f(x)\right] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i) \nabla_\theta \log p_\theta(x_i)$$
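As a concrete sketch, here is the score-function estimator in a few lines of NumPy. The objective $f(x) = x^2$ and the distribution $p_\mu(x) = \mathcal{N}(\mu, 1)$ are toy choices of mine for illustration; they are not part of the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 1.5       # the parameter (theta) we differentiate with respect to
N = 200_000    # number of Monte Carlo samples

# Sample x_i ~ N(mu, 1), then average f(x_i) * grad_mu log p_mu(x_i).
# For a unit-variance Gaussian, grad_mu log p_mu(x) = (x - mu).
x = rng.normal(mu, 1.0, size=N)
grad_estimate = np.mean(x**2 * (x - mu))

# The true gradient is d/dmu E[x^2] = d/dmu (mu^2 + 1) = 2 * mu.
print(grad_estimate)
```

Note that nothing here differentiated through the sampling step; the gradient information comes entirely from the $\nabla_\mu \log p_\mu(x_i)$ weights.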

Job done, right? Well, sometimes yes, but sometimes this naive substitution yields a gradient estimate whose variance is too high to be useful. In those cases we can turn to the reparameterisation trick.

The idea behind the reparameterisation trick is to write $x$ as a function of a variable $\epsilon$, where the distribution of $\epsilon$ does not depend on $\theta$. If we can do this, then we are back in the first situation we considered, where the gradient operator can simply be brought inside the expectation without any complication. Put another way, we want to find a differentiable function $g_\theta$ such that:

$$x = g_\theta(\epsilon)$$

and

$$\mathbb{E}_{x \sim p_\theta(x)}\left[f(x)\right] = \mathbb{E}_{\epsilon \sim q(\epsilon)}\left[f(g_\theta(\epsilon))\right]$$
where $q(\epsilon)$ is a distribution that has no $\theta$ dependence. An example of such a function would be the following reparameterisation of a normal distribution $x \sim \mathcal{N}(\mu, \sigma^2)$, where what we were calling $\theta$ is now $\mu$ and $\sigma$. In this case, if we define:

$$g_{\mu,\sigma}(\epsilon) = \mu + \sigma\epsilon$$

and then sample $\epsilon \sim \mathcal{N}(0, 1)$, we then have that $x = g_{\mu,\sigma}(\epsilon) \sim \mathcal{N}(\mu, \sigma^2)$.
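As a quick sanity check (a minimal sketch; the particular values of $\mu$ and $\sigma$ are arbitrary), sampling $\epsilon \sim \mathcal{N}(0, 1)$ and pushing it through $g_{\mu,\sigma}$ really does produce samples from $\mathcal{N}(\mu, \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 0.5   # arbitrary illustrative parameters

# Sample epsilon from a fixed, parameter-free distribution...
eps = rng.normal(0.0, 1.0, size=100_000)

# ...and apply g_{mu,sigma}(eps) = mu + sigma * eps.
x = mu + sigma * eps

# The empirical mean and standard deviation should match mu and sigma.
print(x.mean(), x.std())
```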

Once we've made this reparameterisation, we can return to our original problem and see that the issues have disappeared:

$$\nabla_\theta \, \mathbb{E}_{x \sim p_\theta(x)}\left[f(x)\right] = \nabla_\theta \, \mathbb{E}_{\epsilon \sim q(\epsilon)}\left[f(g_\theta(\epsilon))\right] = \mathbb{E}_{\epsilon \sim q(\epsilon)}\left[\nabla_\theta f(g_\theta(\epsilon))\right] \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta f(g_\theta(\epsilon_i))$$
with $\epsilon_i \sim q(\epsilon)$. Or, more concretely, for our Gaussian case:

$$\nabla_{\mu,\sigma} \, \mathbb{E}_{x \sim \mathcal{N}(\mu, \sigma^2)}\left[f(x)\right] \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{\mu,\sigma} f(\mu + \sigma\epsilon_i)$$
where $\epsilon_i \sim \mathcal{N}(0, 1)$.
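Continuing the same toy example as before ($f(x) = x^2$, which is my illustrative choice rather than anything from the derivation), the reparameterised estimate of $\nabla_\mu \mathbb{E}[f(x)]$ is just an average of pathwise derivatives:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 1.0
N = 200_000

eps = rng.normal(0.0, 1.0, size=N)

# f(x) = x^2 with x = mu + sigma * eps, so df/dmu = 2 * (mu + sigma * eps).
pathwise_grads = 2.0 * (mu + sigma * eps)
grad_estimate = pathwise_grads.mean()

# True gradient: d/dmu E[x^2] = d/dmu (mu^2 + sigma^2) = 2 * mu.
print(grad_estimate)

# For this toy problem, the per-sample variance is much lower than that
# of the score-function estimator of the same quantity.
print(pathwise_grads.var())
```

The key difference from the score-function estimator is that here the gradient flows through the sample itself, which is what typically makes the variance so much lower.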

It might at first seem that constructing these reparameterisations would be very hard. For discrete distributions it is indeed very hard, and in a later post we may discuss ways around this, such as the Gumbel-Softmax. However, for continuous distributions, when you consider the fact that most random number generators first generate uniformly distributed variables and then transform them, you realise that a lot of complex distributions can be built straightforwardly from simpler distributions.
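For example (a sketch using an exponential distribution with an arbitrarily chosen rate $\lambda$), inverse-CDF sampling is itself a reparameterisation: $x = -\log(1 - u)/\lambda$ with $u \sim \mathrm{Uniform}(0, 1)$, and this transform is differentiable in $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 2.0  # arbitrary rate parameter

# u has no dependence on lambda...
u = rng.uniform(0.0, 1.0, size=100_000)

# ...and the inverse CDF of the exponential maps it to x ~ Exp(lambda).
x = -np.log(1.0 - u) / lam

# The empirical mean should be close to 1 / lambda = 0.5.
print(x.mean())
```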