Exact evidence for Variational Auto-Encoders
adapted from arXiv:1912.10309 §4.2
Every article about Variational Auto-Encoders talks about the evidence lower bound (ELBO), a lower bound on the log evidence: the log of the probability $p_{\bm{\theta}}(\mathbf{x})$ the model assigns to the data, which measures how strongly the data support the model parameters $\bm{\theta}$. We make do with a lower bound because the exact value is presumably too difficult to work with. But now I am going to tell you that the exact log evidence is actually super easy. Oh, and the ELBO is actually a tight bound after all.
First let's look at the ELBO: $$ \text{ELBO} = -\mathrm{KL}(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) + \e_{q(\mathbf{z}|\mathbf{x})} \left[\log p(\mathbf{x}|\mathbf{z})\right]. $$
Typically $q(\mathbf{z}|\mathbf{x})\!=\!\mathcal{N}(\mathbf{z}; \bm{\mu}(\mathbf{x}), \bm{\Sigma}(\mathbf{x}))$ and $p(\mathbf{z})$ is standard normal. Let's also use a unit normal likelihood $p(\mathbf{x}|\mathbf{z})\!=\!\mathcal{N}(\mathbf{x}; \bm{\nu}(\mathbf{z}), \mathbf{I})$ which acts like an L2 loss. So: $$ \text{ELBO} = \tfrac{1}{2} \! \left( \log \det{e \bm{\Sigma}} - \tr \bm{\Sigma} - |\bm{\mu}|^2 \right) + \e_{\mathcal{N}(\mathbf{z}; \bm{\mu}, \bm{\Sigma})} [-\tfrac{m}{2} \log 2 \pi - \tfrac{1}{2} \! |\mathbf{x} - \bm{\nu}(\mathbf{z})|^2]. $$
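To make this concrete, here is a minimal JAX sketch that evaluates this Gaussian ELBO for a single datapoint: the closed-form $-\mathrm{KL}$ term plus a Monte Carlo estimate of the reconstruction term. The tiny decoder `nu`, the dimensions, and all the numbers are illustrative toy values I made up, not anything from the paper.

```python
import jax
import jax.numpy as jnp

def elbo(x, mu, Sigma, nu, key, n_samples=10_000):
    """Gaussian ELBO: closed-form -KL term + Monte Carlo reconstruction term."""
    k, m = mu.shape[0], x.shape[0]
    # -KL(N(mu, Sigma) || N(0, I)) = 1/2 (log det(e Sigma) - tr Sigma - |mu|^2)
    neg_kl = 0.5 * (k + jnp.linalg.slogdet(Sigma)[1]
                    - jnp.trace(Sigma) - jnp.dot(mu, mu))
    # E_q[log N(x; nu(z), I)], estimated by sampling z ~ N(mu, Sigma)
    z = jax.random.multivariate_normal(key, mu, Sigma, shape=(n_samples,))
    sq_err = jnp.sum((x - jax.vmap(nu)(z)) ** 2, axis=-1)
    recon = jnp.mean(-0.5 * m * jnp.log(2 * jnp.pi) - 0.5 * sq_err)
    return neg_kl + recon

# Toy setup (illustrative values only): 2-D latent, 3-D data, nonlinear decoder.
W = jnp.array([[1.0, 0.2], [0.3, -1.0], [0.5, 0.5]])
nu = lambda z: jnp.tanh(W @ z)
x = jnp.array([0.4, -0.2, 0.1])
mu = jnp.array([0.3, -0.1])
Sigma = jnp.array([[0.20, 0.02], [0.02, 0.10]])
print(elbo(x, mu, Sigma, nu, jax.random.PRNGKey(0)))
```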
(Don't worry: the paper has a proof that other likelihoods can be transformed into a unit normal likelihood by appropriately warping space.)
Now everybody knows (by a first-order Taylor expansion of $\bm{\nu}$ around $\bm{\mu}$) that $\e_{\mathcal{N}(\mathbf{z}; \bm{\mu}, \bm{\Sigma})} [|\mathbf{x} - \bm{\nu}(\mathbf{z})|^2] \simeq |\mathbf{x} - \bm{\nu}(\bm{\mu})|^2 + \tr(\mathbf{J}^\tran\!\mathbf{J}\bm{\Sigma})$, where $\mathbf{J}$ is the decoder Jacobian $\pder{\bm{\nu}}{\mathbf{z}}$ at $\mathbf{z}\!=\!\bm{\mu}.$ So we get: $$ \text{ELBO} \simeq - \tfrac{1}{2} \big( m \log 2 \pi + |\bm{\mu}|^2 + |\mathbf{x} - \bm{\nu}(\bm{\mu})|^2 + \tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}) - \log \det{e \bm{\Sigma}} \big). $$
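Here is a quick numerical sanity check of that approximation, reusing the toy `nu`, `x`, `mu`, and `Sigma` from the sketch above (all illustrative assumptions): the Jacobian term versus a brute-force Monte Carlo estimate.

```python
# Reuses nu, x, mu, Sigma from the previous sketch (toy values, not the paper's).
J = jax.jacfwd(nu)(mu)                           # decoder Jacobian at z = mu
approx = jnp.sum((x - nu(mu)) ** 2) + jnp.trace(J.T @ J @ Sigma)

zs = jax.random.multivariate_normal(jax.random.PRNGKey(1), mu, Sigma,
                                    shape=(100_000,))
mc = jnp.mean(jnp.sum((x - jax.vmap(nu)(zs)) ** 2, axis=-1))
print(approx, mc)   # should agree closely when Sigma is small / nu is near-linear
```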
That's a lower bound on the log evidence. Here is the exact log evidence: $$ \log p_{\bm{\theta}}(\mathbf{x}) = - \tfrac{1}{2} \big( m \log 2 \pi + |\mathbf{z}|^2 + |\mathbf{x} - \bm{\nu}(\mathbf{z})|^2 + \log \det{\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}} \big), $$ with $\mathbf{J}$ now evaluated at this $\mathbf{z}$. It's derived in the paper under assumptions similar to a VAE's: that $\mathbf{x}$ comes from some random latent process $\mathbf{z}$ via some mapping $\bm{\nu}$ plus noise.
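Under the same toy setup as above, this formula is one `slogdet` away from code. A minimal sketch, evaluated at $\mathbf{z}\!=\!\bm{\mu}$ for illustration (my own toy evaluation, not the paper's experiments):

```python
def exact_log_evidence(x, z, nu):
    """Exact log evidence from the formula above, for the toy decoder nu."""
    m, k = x.shape[0], z.shape[0]
    J = jax.jacfwd(nu)(z)                              # Jacobian at this z
    logdet = jnp.linalg.slogdet(J.T @ J + jnp.eye(k))[1]
    return -0.5 * (m * jnp.log(2 * jnp.pi) + jnp.dot(z, z)
                   + jnp.sum((x - nu(z)) ** 2) + logdet)

print(exact_log_evidence(x, mu, nu))   # reuses the toy x, mu, nu from above
```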
Now here's something cool. Look what happens if we subtract the evidence lower bound from the exact log evidence (at $\mathbf{z}\!=\!\bm{\mu}$): $$ \log p_{\bm{\theta}}(\mathbf{x}) - \text{ELBO} \simeq \tfrac{1}{2} \big( \tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}) - \log \det{e (\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}} \big). $$ This gap is nonnegative, as it should be for a lower bound, and it is exactly zero if $\bm{\Sigma} = (\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})^\inv.$ Now get this: the paper has a proof that this is true at the stationary points of the ELBO! That means if we optimize the ELBO to convergence, we actually get a tight bound on the log evidence.
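And the gap itself is easy to check numerically. A small sketch with the same toy decoder (illustrative values only): it is positive for a generic $\bm{\Sigma}$ and vanishes when $\bm{\Sigma} = (\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})^\inv$.

```python
def gap(Sigma, J):
    """1/2 (tr((J^T J + I) Sigma) - log det(e (J^T J + I) Sigma))."""
    k = J.shape[1]
    A = (J.T @ J + jnp.eye(k)) @ Sigma
    return 0.5 * (jnp.trace(A) - k - jnp.linalg.slogdet(A)[1])

J = jax.jacfwd(nu)(mu)                                 # toy decoder from above
print(gap(Sigma, J))                                   # > 0 for the generic toy Sigma
print(gap(jnp.linalg.inv(J.T @ J + jnp.eye(2)), J))    # ~ 0 at the optimal Sigma
```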