Computing the log Jacobian determinant using only traces
adapted from arXiv:1912.10309 §4.2
The log Jacobian determinant is a concept that comes up a lot in machine learning. It's like a mash-up of three very mathy things. It shows up whenever we consider the impact that warping space has on entropy. And warping space is like ninety percent of machine learning. (The other ninety percent is data.) It turns out we can compute this thing using only traces.
First let's talk about traces.
The trace of a matrix is the sum of its eigenvalues. Suppose we have a matrix $\mathbf{A}$, and for some reason we can't look directly at it, but we are allowed to know $\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}$ for any vector $\mathbf{v}.$ We can still learn something about $\mathbf{A}$ using a method called probing. If we randomly sample vectors $\mathbf{v} \sim \mathcal{N}(\bm{0}, \bm{\Sigma})$, we can estimate the trace of the matrix product $\mathbf{A} \bm{\Sigma}$: $$ \e_{\mathcal{N}(\mathbf{v}; \bm{0}, \bm{\Sigma})} [\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}] = \tr \mathbf{A} \bm{\Sigma}. $$
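Here's a minimal sketch of that probing estimator in JAX (the dimension, the test matrices, and the probe count are all made up): we sample a batch of vectors $\mathbf{v} \sim \mathcal{N}(\bm{0}, \bm{\Sigma})$, average the quadratic forms $\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}$, and compare against the exact trace.

```python
import jax
import jax.numpy as jnp

d = 4
k_a, k_c, k_eps = jax.random.split(jax.random.PRNGKey(0), 3)

A = jax.random.normal(k_a, (d, d))            # the matrix we pretend we can't look at
C = jax.random.normal(k_c, (d, d))
Sigma = C @ C.T + jnp.eye(d)                  # covariance of the probing distribution

# Probe: sample v ~ N(0, Sigma) and average the quadratic forms v^T A v.
chol = jnp.linalg.cholesky(Sigma)
eps = jax.random.normal(k_eps, (100_000, d))
vs = eps @ chol.T                             # each row is v = chol @ eps
estimate = jnp.mean(jax.vmap(lambda v: v @ (A @ v))(vs))

print(estimate, jnp.trace(A @ Sigma))         # agree up to Monte Carlo error
```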
Now let's talk about log determinants.
The log determinant of a matrix is the sum of its log eigenvalues. What if we want the log determinant of $\mathbf{A}$, but we can only estimate traces? We can actually do it with a minimization problem, introducing an auxiliary matrix $\bm{\Sigma}$: $$ \min_{\bm{\Sigma}} \tr \mathbf{A} \bm{\Sigma} - \log \det{e \bm{\Sigma}} = \log \det{\mathbf{A}}. $$ This means we can estimate the log determinant of $\mathbf{A}$ using e.g. stochastic gradient descent over $\bm{\Sigma}$ by probing $\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}$ with vectors $\mathbf{v} \sim \mathcal{N}(\bm{0}, \bm{\Sigma})$, which works assuming $\mathbf{A}$ is symmetric positive definite. And since we control $\bm{\Sigma}$, we can construct it from some low-dimensional factorization such that computing $\log \det{e \bm{\Sigma}}$ is barely an inconvenience.
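Here's a rough sketch of that recipe in JAX (test matrix, step size, and probe count all made up). It parameterizes $\bm{\Sigma} = \mathbf{L} \mathbf{L}^\tran$ through a lower-triangular factor $\mathbf{L}$, so computing $\log \det{e \bm{\Sigma}}$ really is barely an inconvenience, evaluates $\mathbf{A}$ only through the quadratic form $\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}$ (autodiff supplies its gradient), and runs plain stochastic gradient descent:

```python
import jax
import jax.numpy as jnp

d = 5
key, k_a = jax.random.split(jax.random.PRNGKey(0))
B = jax.random.normal(k_a, (d, d))
A = B @ B.T + d * jnp.eye(d)                  # a symmetric positive definite matrix to recover log det of

def objective(L, eps):
    """Stochastic estimate of tr(A Sigma) - log det(e Sigma) with Sigma = L L^T."""
    vs = eps @ L.T                                         # rows are v = L eps ~ N(0, L L^T)
    trace_est = jnp.mean(jax.vmap(lambda v: v @ (A @ v))(vs))
    log_det_e_sigma = d + 2.0 * jnp.sum(jnp.log(jnp.abs(jnp.diag(L))))
    return trace_est - log_det_e_sigma

grad_fn = jax.jit(jax.grad(objective))        # gradient with respect to L

L = jnp.eye(d)                                # start from Sigma = I
step, n_probes = 1e-2, 64
for _ in range(2000):
    key, k = jax.random.split(key)
    eps = jax.random.normal(k, (n_probes, d))
    L = L - step * jnp.tril(grad_fn(L, eps))  # plain SGD, keeping L lower triangular

print(objective(L, jax.random.normal(key, (100_000, d))))  # estimated minimum
print(jnp.linalg.slogdet(A)[1])                             # exact log det A
```

The two printed numbers should roughly match; the gap shrinks with more probes and a smaller step size.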
Finally let's talk about Jacobians and auto-encoders!
In the previous article on exact evidence for Variational Auto-Encoders, we saw that the evidence lower bound (ELBO) is pretty similar to the exact log evidence. The only difference is that the ELBO has $-\tfrac{1}{2}(\tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}) - \log \det{e \bm{\Sigma}})$ where the exact log evidence has $-\tfrac{1}{2}\log\det{\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}}$: the negative log determinant of the decoder Jacobian (plus noise), which measures how much the decoder stretches space.
Well guess what happens when we maximize the ELBO: $$ \min_{\bm{\Sigma}} \tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}) - \log \det{e \bm{\Sigma}} = \log \det{\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}}. $$ That's just what we had before, but with $\mathbf{A} = \mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}.$ This is the only place that $\bm{\Sigma}$ appears in the ELBO, so this is its only role. When we derive the math for variational auto-encoders, it's tempting to imagine $\bm{\Sigma}$ has some intrinsic meaning that we discover via the ELBO. But we've just shown that $\bm{\Sigma}$ is merely an auxiliary matrix in a stochastic estimator for the log determinant of the decoder Jacobian.
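To see the identity with $\mathbf{A} = \mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}$ concretely, here's a tiny check with a made-up stand-in for the decoder Jacobian: setting the gradient $\mathbf{A} - \bm{\Sigma}^{-1}$ to zero gives the minimizer $\bm{\Sigma} = \mathbf{A}^{-1}$, and plugging it in recovers $\log \det{\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}}$ exactly.

```python
import jax
import jax.numpy as jnp

m, n = 8, 3                                   # made-up data and latent dimensions
J = jax.random.normal(jax.random.PRNGKey(1), (m, n))   # stand-in for the decoder Jacobian
A = J.T @ J + jnp.eye(n)

Sigma = jnp.linalg.inv(A)                     # the minimizer: set the gradient A - Sigma^{-1} to zero
value = jnp.trace(A @ Sigma) - (n + jnp.linalg.slogdet(Sigma)[1])   # tr(A Sigma) - log det(e Sigma)
print(value, jnp.linalg.slogdet(A)[1])        # both equal log det(J^T J + I)
```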