Variational Auto-Encoders are a popular kind of generative machine learning model.
Unlike a regular encoder that produces a single code given an input,
the VAE encoder produces a distribution over code space with a mean and variance,
both learned as outputs of a neural network.
But actually the variance can be *fixed* instead of learned,
without impacting the quality of the VAE! Read on to see how and why.

more

First let's review what a VAE is real quick.
We've got a dataset with samples $\mathbf{x}$ that come from some unknown distribution $p(\mathbf{x}).$
The *fundamental assumption of machine learning* is that our data comes from some simpler underlying process.
In our case we'll assume that first a random latent vector $\mathbf{z}$ is sampled
following a distribution $p(\mathbf{z})$ (called the *prior*),
and then $\mathbf{x}$ is sampled according to a distribution $p(\mathbf{x}|\mathbf{z})$ (called the *likelihood*).
Sampling the likelihood can be described as a nonlinear mapping $\bm\nu(\mathbf{z})$ plus some noise.
In a VAE, the mapping $\bm\nu$ is called the *decoder*.

That's our idea of how $\mathbf{x}$ came to be.
Now we're going to estimate what $\mathbf{z}$ was, given $\mathbf{x}.$
But we don't know what the mapping $\bm\nu$ is, so we'll have to estimate that too.
This is all considered tricky to do directly, so we use a *variational* method meaning
we map $\mathbf{x}$ to a *distribution* $q(\mathbf{z}|\mathbf{x})$ (called the *posterior*) instead of one particular $\mathbf{z}.$
Sampling the posterior can be described as a nonlinear mapping $\bm\mu(\mathbf{x})$ plus some noise with variance $\bm{\Sigma(\mathbf{x})}.$
You may have already guessed that the mapping $\bm\mu$ is called the *encoder*.
We pack this all up to get a lower bound on the evidence in the data in support of our model, which is humorously called the ELBO:
$$\begin{align}
\text{ELBO} & = -\mathrm{KL}(q(\mathbf{z}|\mathbf{x}) | p(\mathbf{z}))
+ \e_{q(\mathbf{z}|\mathbf{x})}\! \left[\log p(\mathbf{x}|\mathbf{z})\right] \\
& = \e_{q(\mathbf{z}|\mathbf{x})}\! \left[\log p(\mathbf{z}) - \log q(\mathbf{z}|\mathbf{x}) + \log p(\mathbf{x}|\mathbf{z})\right].
\end{align}$$
The $\mathrm{KL}$ term is a way to compare two distributions called *KL divergence*
to get the posterior $q(\mathbf{z}|\mathbf{x})$ to resemble the prior $p(\mathbf{z})$,
and the expected log likelihood term is to get the encode-decode reconstruction to resemble the original $\mathbf{x}.$
These terms compete, so what happens is the posteriors of different $\mathbf{x}$ jockey for position near the center of the latent space without overlapping too much.
The ELBO is a lower bound because it's not exactly equal to the log evidence $\log p(\mathbf{x})$, but it's close.

The error turns out to be: $$\begin{align} \log p_{\bm{\theta}}(\mathbf{x}) - \text{ELBO} \simeq \tfrac{1}{4} \tr^2 ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})\bm{\Sigma} - \mathbf{I}) + O\!\left( \tr^3 ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})\bm{\Sigma} - \mathbf{I}) \right), \end{align}$$ where $\mathbf{J}$ is the decoder Jacobian $\pder{\mathbf{x}}{\mathbf{z}}.$ (This assumes a simple L2 loss for the likelihood, but an analogous formula exists for any likelihood.) Now, at the stationary points of the ELBO it turns out that $\bm{\Sigma} = (\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})^\inv$, meaning the squared and cubed trace terms in the error approach zero as we optimize the ELBO. But these terms will still approach zero quite quickly even if $\bm{\Sigma}$ is a little off, due to the squaring. That means we can tolerate some error in $\bm{\Sigma}$ without impacting our result!

So then, what would happen if we forced all posterior variances to be equal?
The optimal choice is $\tilde{\bm{\Sigma}} = \e_{p(\mathbf{x})} [\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}]^\inv$,
and the error over the whole dataset is:
$$\begin{align}
\e_{p(\mathbf{x})}\! \left[ \log p_{\bm{\theta}}(\mathbf{x}) - \text{ELBO}_{\tilde{\bm{\Sigma}}} \right]
\simeq
\tfrac{1}{4} \tr \cv^2_{p(\mathbf{x})} [\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}],
\end{align}$$
where $\cv^2$ represents the squared coefficient of variation, or *relative variance*.
This error vanishes when the decoder Hessian is homogeneous. Otherwise, it acts as a regularization term to penalize large variations in the decoder Hessian.
Keep in mind, a homogeneous decoder Hessian is not that restrictive.
For example, we can still *freely rotate the Jacobian* across different regions of space.

We can take advantage of constant $\bm{\Sigma}$ to simplify any VAE.
For example, if we consider the expectation of the ELBO over the whole dataset,
some of the terms simplify when all $\bm{\Sigma}$ are the same:
$$\begin{align}
\e_{p(\mathbf{x})} [\text{ELBO}] =
-\tfrac{1}{2} \log \det{\mathbf{I} + \bm{\Sigma}^\inv \e_{p(\mathbf{x})} [\bm{\mu}\bm{\mu}^\tran]}
+ \e_{p(\mathbf{x})} \! \left[\e_{\mathcal{N}(\mathbf{z}; \bm{\mu}, \bm{\Sigma})} [
\log p_{\bm{\theta}}(\mathbf{x}|\mathbf{z}) ]\right].
\end{align}$$
We can then reformulate the ELBO as an expectation over a batch of data $B$
in a way that is *invariant to our choice of* $\bm{\Sigma}$,
which is in all seriousness called the *Batch Information Lower Bound* (BILBO):
$$\begin{align}
\text{BILBO} =
-\tfrac{1}{2} \log \det{\mathbf{I} + \bm{\Sigma}^\inv \e_B [\bm{\mu}\bm{\mu}^\tran]}
+ \e_B [ \e_{\mathcal{N}(\mathbf{z}; \bm{\mu}, \bm{\Sigma})} [
\log p_{\bm{\theta}}(\mathbf{x}|\mathbf{z})
] ].
\end{align}$$

So now you can use the BILBO instead of the ELBO to train your VAE, and you can just set $\bm{\Sigma}$ to a constant matrix instead of learning it. See the paper for more details.

adapted from arXiv:1912.10309 §4.2

Every article about Variational Auto-Encoders talks about the *evidence lower bound* (ELBO),
a lower bound on the log evidence in the data in support of the model parameters.
We make do with a lower bound because the exact value is presumably too difficult to work with.
But now I am going to tell you the exact log evidence is *actually super easy*.
Oh, and the ELBO is actually a tight bound after all.

more

First let's look at the ELBO: $$ \text{ELBO} = -\mathrm{KL}(q(\mathbf{z}|\mathbf{x}) | p(\mathbf{z})) + \e_{q(\mathbf{z}|\mathbf{x})} \left[\log p(\mathbf{x}|\mathbf{z})\right]. $$

Typically $q(\mathbf{z}|\mathbf{x})\!=\!\mathcal{N}(\mathbf{z}; \bm{\mu}(\mathbf{x}), \bm{\Sigma}(\mathbf{x}))$ and $p(\mathbf{z})$ is standard normal. Let's also use a unit normal likelihood $p(\mathbf{x}|\mathbf{z})\!=\!\mathcal{N}(\mathbf{x}; \bm{\nu}(\mathbf{z}), \mathbf{I})$ which acts like an L2 loss. So: $$ \text{ELBO} = \tfrac{1}{2} \! \left( \log \det{e \bm{\Sigma}} - \tr \bm{\Sigma} - |\bm{\mu}|^2 \right) + \e_{\mathcal{N}(\mathbf{z}; \bm{\mu}, \bm{\Sigma})} [-\tfrac{m}{2} \log 2 \pi - \tfrac{1}{2} \! |\mathbf{x} - \bm{\nu}(\mathbf{z})|^2]. $$

(Don't worry - the paper has a proof that other likelihoods can be transformed to unit normal likelihood by appropriately warping space.)

Now everybody knows $\e_{\mathcal{N}(\mathbf{z}; \bm{\mu}, \bm{\Sigma})} [|\mathbf{x} - \bm{\nu}(\mathbf{z})|^2] \simeq |\mathbf{x} - \bm{\nu}(\bm{\mu})|^2 + \tr\mathbf{J}^\tran\!\mathbf{J}\bm{\Sigma}$, where $\mathbf{J}$ is the decoder Jacobian $\pder{\bm{\nu}}{\mathbf{z}}$ at $\mathbf{z}\!=\!\bm{\mu}.$ So we get: $$ \text{ELBO} \simeq - \tfrac{1}{2} \big( m \log 2 \pi + |\bm{\mu}|^2 + |\mathbf{x} - \bm{\nu}(\bm{\mu})|^2 + \tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}) - \log \det{e \bm{\Sigma}} \big). $$

That's a lower bound on the log evidence. Here is the *exact log evidence*:
$$
\log p_{\bm{\theta}}(\mathbf{x})
=
- \tfrac{1}{2} \big(
m \log 2 \pi
+ |\mathbf{z}|^2
+ |\mathbf{x} - \bm{\nu}(\mathbf{z})|^2
+ \log \det{\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}}
\big).
$$
It's derived in the paper using similar assumptions as a VAE:
that $\mathbf{x}$ comes from some random latent process $\mathbf{z}$ via some mapping $\bm{\nu}$ plus noise.

Now here's something cool.
Look what happens if we subtract the evidence lower bound from the exact log evidence (at $\mathbf{z}\!=\!\bm{\mu}$):
$$
\log p_{\bm{\theta}}(\mathbf{x}) - \text{ELBO}
\simeq
\tfrac{1}{2} \big(
\tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma})
- \log \det{e (\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}}
\big).
$$
This is zero if $\bm{\Sigma} = (\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})^\inv.$
Now get this - the paper has a proof that this is *true* at the stationary points of the ELBO!
That means if we optimize the ELBO to convergence, we actually get a *tight* bound on the log evidence.

adapted from arXiv:1912.10309 §4.2

The *log Jacobian determinant* is a concept that comes up a lot in machine learning.
It's like a mash-up of three very mathy things.
It shows up whenever we consider the impact that warping space has on entropy.
And warping space is like ninety percent of machine learning.
(The other ninety percent is data.)
It turns out we can compute this thing using only traces.

more

First let's talk about traces.

The *trace* of a matrix is the sum of its eigenvalues.
Suppose we have a matrix $\mathbf{A}$,
and for some reason we can't look directly at it, but we are allowed to know $\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}$ for any vector $\mathbf{v}.$
We can still learn something about $\mathbf{A}$ using a method called probing.
If we randomly sample vectors $\mathbf{v} \sim \mathcal{N}(\bm{0}, \bm{\Sigma})$,
we can estimate the *trace* of the matrix product $\mathbf{A} \bm{\Sigma}$:
$$
\e_{\mathcal{N}(\mathbf{v}; \bm{0}, \bm{\Sigma})} [\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}] = \tr \mathbf{A} \bm{\Sigma}.
$$

Now let's talk about log determinants.

The *log determinant* of a matrix is the sum of its log eigenvalues.
What if we want the log determinant of $\mathbf{A}$, but we can only estimate traces?
We can actually do it with a minimization problem, introducing an auxiliary matrix $\bm{\Sigma}$:
$$
\min_{\bm{\Sigma}} \tr \mathbf{A} \bm{\Sigma} - \log \det{e \bm{\Sigma}}
= \log \det{\mathbf{A}}.
$$
This means we can estimate the log determinant of $\mathbf{A}$ using e.g. stochastic gradient descent over
$\bm{\Sigma}$ by probing $\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}$ with vectors $\mathbf{v} \sim \mathcal{N}(\bm{0}, \bm{\Sigma})$,
which works assuming $\mathbf{A}$ is symmetric positive definite.
And since we control $\bm{\Sigma}$, we can construct it from some low-dimensional factorization
such that computing $\log \det{e \bm{\Sigma}}$ is barely an inconvenience.

Finally let's talk about Jacobians, and auto-encoders!

In the previous article on exact evidence for Variational Auto-Encoders,
we saw that the evidence lower bound (ELBO) is pretty similar to the *exact* log evidence.
The only difference is that the ELBO has $-\tfrac{1}{2}(\tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}) - \log \det{e \bm{\Sigma}})$
where the exact log evidence has $-\tfrac{1}{2}\log\det{\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}}$,
which is the negative log of the Jacobian determinant measuring how much the decoder stretches space (plus noise).

Well guess what happens when we maximize the ELBO:
$$
\min_{\bm{\Sigma}}
\tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}) - \log \det{e \bm{\Sigma}}
= \log \det{\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}}.
$$
That's just what we had before, but with $\mathbf{A} = \mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}.$
This is the only place that $\bm{\Sigma}$ appears in the ELBO, so this is its *only role*.
When we derive the math for variational auto-encoders, it's tempting to imagine $\bm{\Sigma}$ has some intrinsic meaning that we discover via the ELBO.
But we've just shown that $\bm{\Sigma}$ is merely an auxiliary matrix in a stochastic estimator for the log determinant of the decoder Jacobian.

I've pared down wavelet image compression to the bare essentials, producing a simple implementation with compression ratios similar to JPEG 2000, but several times faster. It's fast, it's 1000 lines of C++ with no dependencies, and it's free to use!

more

Check it out at gfwx.org or check it out (literally) from the github!

Oh, and read the white paper. It's like four pages long and it explains everything.

That's it. That's the whole website.