Back in 2015 we were rendering point clouds without gaps in AtomView. Nurulize hired me and Lynn through our company Happy Digital to create a new kind of VR experience - rendering huge immersive point clouds in such a way that the gaps between points are filled in, producing the illusion of an extremely detailed solid model that you could explore up close.
Scott Metzger at Nurulize envisioned that Lidar scans - 360 degree environments represented as a point cloud with RGB colors - should be visualized directly in VR with no further processing, if only there were a way to render them as solid looking models. He asked us if it was possible. Methods for rendering points as filled-in splats existed but lacked the performance needed for 90 fps stereo views on an Oculus. So Lynn and I put on our inventor hats and created a new point filling algorithm with the breakthrough performance that was called for. This thing could render a 360-degree Lidar scan with one billion points at 90 fps in stereo on a decent NVidia card back in 2016. (See the patent.)
We wrote the whole application in C++ and OpenGL with custom GLSL shaders to implement the point rendering for maximum compatibility - no CUDA. We supported both Oculus SDK and OpenVR so we could run demos on Oculus and Vive. We used OpenColorIO for real-time color management on GPU. We built a custom vector graphics UI framework from scratch, leveraging NVidia Path Rendering for crisp vector graphics overlaid transparently on the 3D content, with fluid menu transitions animated at full frame rate. Every element of the experience was free from lag or jitter. All of this polish helped sell the experience, placing it firmly in the future. We implemented a scene graph and supported point cloud import from a growing number of formats including FLS, PLY, PTX, XYZ, FBX, E57, Alembic, and V-Ray vrmesh, along with multi-frame animations. Tobias Anderberg took over development in 2017 after Lynn and I left to pursue other projects. He remade our circular radial menus into a slick hex design and continued to update the software in general.
The first public demos of AtomView were at SIGGRAPH 2015 in a hotel room, and companies like NVidia took notice. In 2017, AtomView won the NVidia VR Content Showcase, significantly raising Nurulize's profile in the industry. In 2019, Sony Innovation Studios acquired Nurulize to create new virtual production workflows built on AtomView, and it was further recognized with an Entertainment Technology Lumiere Award. While Lynn and I are no longer directly involved with AtomView, the technology we innovated continues to power virtual production at Sony.
Before Gaussian Splats, there was NeRF. Before NeRF, there was MPI. In 2019 I was part of the DeepView project (subtitle: view synthesis with learned gradient descent). We trained a neural network to turn a handful of photographs into a simple volumetric scene representation called a multiplane image, or MPI. MPIs were poised to take the world by storm - they're virtually artifact-free, and render in real time with a simple JavaScript viewer.
MPIs opened up numerous applications like accurate 6-DOF video stabilization, depth of field, and deep compositing, or simply enjoying the scene in 3D. The one limitation was you couldn't move the viewpoint very far.
Then in 2020, NeRFs landed. (NeRFs, or neural radiance fields, were the hot topic until Gaussian Splats took the spotlight in 2023.) They were slow to render, even slower to prepare, and required a lot of input photos. But they didn't have any restriction on moving the viewpoint, so MPIs were forgotten. 3D Gaussian Splatting landed next, which saw the return of in-browser rendering and the artifact-free feeling we had with MPIs. They're somewhat more expensive to render, but you can move the viewpoint anywhere.
Really, all three methods have merit and seeded a garden of variants.
Fun facts: We didn't invent MPIs; we created a better way to produce them using learned gradient descent. The authors of NeRF didn't invent radiance fields; they developed a radiance field parameterization that was friendlier to stochastic gradient descent. And the authors of 3D Gaussian Splatting didn't invent 3D Gaussian splatting; they implemented an efficient differentiable splat renderer that allowed the Gaussians to be fitted to real photographs.
See the DeepView project page and don't miss the in-browser demos.
I've pared down wavelet image compression to the bare essentials, producing a simple implementation with compression ratios similar to JPEG 2000, but several times faster. It's fast, it's 1000 lines of C++ with no dependencies, and it's free to use!
Check it out at gfwx.org or check it out (literally) from the github!
Oh, and read the white paper. It's like four pages long and it explains everything.
Variational Auto-Encoders are a popular kind of generative machine learning model. Unlike a regular encoder that produces a single code given an input, the VAE encoder produces a distribution over code space with a mean and variance, both learned as outputs of a neural network. But actually the variance can be fixed instead of learned, without impacting the quality of the VAE! Read on to see how and why.
First let's review what a VAE is real quick. We've got a dataset with samples $\mathbf{x}$ that come from some unknown distribution $p(\mathbf{x}).$ The fundamental assumption of machine learning is that our data comes from some simpler underlying process. In our case we'll assume that first a random latent vector $\mathbf{z}$ is sampled following a distribution $p(\mathbf{z})$ (called the prior), and then $\mathbf{x}$ is sampled according to a distribution $p(\mathbf{x}|\mathbf{z})$ (called the likelihood). Sampling the likelihood can be described as a nonlinear mapping $\bm\nu(\mathbf{z})$ plus some noise. In a VAE, the mapping $\bm\nu$ is called the decoder.
That's our idea of how $\mathbf{x}$ came to be. Now we're going to estimate what $\mathbf{z}$ was, given $\mathbf{x}.$ But we don't know what the mapping $\bm\nu$ is, so we'll have to estimate that too. This is all considered tricky to do directly, so we use a variational method, meaning we map $\mathbf{x}$ to a distribution $q(\mathbf{z}|\mathbf{x})$ (called the posterior) instead of one particular $\mathbf{z}.$ Sampling the posterior can be described as a nonlinear mapping $\bm\mu(\mathbf{x})$ plus some noise with variance $\bm{\Sigma}(\mathbf{x}).$ You may have already guessed that the mapping $\bm\mu$ is called the encoder. We pack this all up to get a lower bound on the evidence in the data in support of our model, which is humorously called the ELBO: $$\begin{align} \text{ELBO} & = -\mathrm{KL}(q(\mathbf{z}|\mathbf{x}) | p(\mathbf{z})) + \e_{q(\mathbf{z}|\mathbf{x})}\! \left[\log p(\mathbf{x}|\mathbf{z})\right] \\ & = \e_{q(\mathbf{z}|\mathbf{x})}\! \left[\log p(\mathbf{z}) - \log q(\mathbf{z}|\mathbf{x}) + \log p(\mathbf{x}|\mathbf{z})\right]. \end{align}$$ The $\mathrm{KL}$ term is the KL divergence, a way to compare two distributions; it encourages the posterior $q(\mathbf{z}|\mathbf{x})$ to resemble the prior $p(\mathbf{z})$, while the expected log likelihood term encourages the encode-decode reconstruction to resemble the original $\mathbf{x}.$ These terms compete, so what happens is the posteriors of different $\mathbf{x}$ jockey for position near the center of the latent space without overlapping too much. The ELBO is a lower bound because it's not exactly equal to the log evidence $\log p(\mathbf{x})$, but it's close.
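If code is easier to read than notation, here's what that ELBO looks like for a toy VAE in PyTorch, with a diagonal Gaussian posterior, a standard normal prior, and a unit-normal likelihood. This is just an illustrative sketch, not the setup from any particular paper, and the little `enc_mu`, `enc_logvar`, and `dec` networks are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 2, 8

# Hypothetical toy encoder (mean and log-variance heads) and decoder.
enc_mu = nn.Sequential(nn.Linear(data_dim, 32), nn.Tanh(), nn.Linear(32, latent_dim))
enc_logvar = nn.Sequential(nn.Linear(data_dim, 32), nn.Tanh(), nn.Linear(32, latent_dim))
dec = nn.Sequential(nn.Linear(latent_dim, 32), nn.Tanh(), nn.Linear(32, data_dim))

def elbo(x):
    # Posterior q(z|x) = N(mu(x), diag(exp(logvar(x)))).
    mu, logvar = enc_mu(x), enc_logvar(x)
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian, per sample.
    kl = 0.5 * (logvar.exp() + mu**2 - 1.0 - logvar).sum(dim=-1)
    # Reparameterized sample z ~ q(z|x).
    z = mu + logvar.mul(0.5).exp() * torch.randn_like(mu)
    # Unit-normal likelihood log p(x|z) = log N(x; dec(z), I), constant dropped.
    log_lik = -0.5 * ((x - dec(z)) ** 2).sum(dim=-1)
    return (log_lik - kl).mean()

x = torch.randn(16, data_dim)   # stand-in batch
(-elbo(x)).backward()           # maximizing the ELBO = minimizing its negative
```

So how close is "close"?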
The error turns out to be: $$\begin{align} \log p_{\bm{\theta}}(\mathbf{x}) - \text{ELBO} \simeq \tfrac{1}{4} \tr^2 ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})\bm{\Sigma} - \mathbf{I}) + O\!\left( \tr^3 ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})\bm{\Sigma} - \mathbf{I}) \right), \end{align}$$ where $\mathbf{J}$ is the decoder Jacobian $\pder{\mathbf{x}}{\mathbf{z}}.$ (This assumes a simple L2 loss for the likelihood, but an analogous formula exists for any likelihood.) Now, at the stationary points of the ELBO it turns out that $\bm{\Sigma} = (\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})^\inv$, meaning the squared and cubed trace terms in the error approach zero as we optimize the ELBO. But these terms will still approach zero quite quickly even if $\bm{\Sigma}$ is a little off, due to the squaring. That means we can tolerate some error in $\bm{\Sigma}$ without impacting our result!
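To get a feel for how forgiving that squared trace is, here's a little numerical sketch. The decoder and the numbers are made up; the point is just to compute $\mathbf{J}$ with autograd, form the stationary-point covariance $\bm{\Sigma} = (\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})^\inv$, and see what a deliberately wrong $\bm{\Sigma}$ costs.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)

# Hypothetical toy decoder nu: R^2 -> R^5.
dec = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(), torch.nn.Linear(16, 5))

mu = torch.randn(2)                 # posterior mean for one sample
J = jacobian(dec, mu)               # decoder Jacobian at z = mu, shape (5, 2)
A = J.T @ J + torch.eye(2)          # J^T J + I
sigma_opt = torch.linalg.inv(A)     # the stationary-point posterior covariance

def leading_error(sigma):
    # (1/4) tr^2((J^T J + I) Sigma - I), the leading term of log p(x) - ELBO.
    return 0.25 * torch.trace(A @ sigma - torch.eye(2)) ** 2

print(leading_error(sigma_opt).item())        # 0
print(leading_error(1.1 * sigma_opt).item())  # a 10% error in Sigma costs only 0.01 nats here
```

The error grows quadratically with how far $\bm{\Sigma}$ strays from the optimum, so a sloppy $\bm{\Sigma}$ barely dents the bound.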
So then, what would happen if we forced all posterior variances to be equal? The optimal choice is $\tilde{\bm{\Sigma}} = \e_{p(\mathbf{x})} [\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}]^\inv$, and the error over the whole dataset is: $$\begin{align} \e_{p(\mathbf{x})}\! \left[ \log p_{\bm{\theta}}(\mathbf{x}) - \text{ELBO}_{\tilde{\bm{\Sigma}}} \right] \simeq \tfrac{1}{4} \tr \cv^2_{p(\mathbf{x})} [\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}], \end{align}$$ where $\cv^2$ represents the squared coefficient of variation, or relative variance. This error vanishes when the decoder Hessian is homogeneous. Otherwise, it acts as a regularization term to penalize large variations in the decoder Hessian. Keep in mind, a homogeneous decoder Hessian is not that restrictive. For example, we can still freely rotate the Jacobian across different regions of space.
We can take advantage of constant $\bm{\Sigma}$ to simplify any VAE. For example, if we consider the expectation of the ELBO over the whole dataset, some of the terms simplify when all $\bm{\Sigma}$ are the same: $$\begin{align} \e_{p(\mathbf{x})} [\text{ELBO}] = -\tfrac{1}{2} \log \det{\mathbf{I} + \bm{\Sigma}^\inv \e_{p(\mathbf{x})} [\bm{\mu}\bm{\mu}^\tran]} + \e_{p(\mathbf{x})} \! \left[\e_{\mathcal{N}(\mathbf{z}; \bm{\mu}, \bm{\Sigma})} [ \log p_{\bm{\theta}}(\mathbf{x}|\mathbf{z}) ]\right]. \end{align}$$ We can then reformulate the ELBO as an expectation over a batch of data $B$ in a way that is invariant to our choice of $\bm{\Sigma}$, which is in all seriousness called the Batch Information Lower Bound (BILBO): $$\begin{align} \text{BILBO} = -\tfrac{1}{2} \log \det{\mathbf{I} + \bm{\Sigma}^\inv \e_B [\bm{\mu}\bm{\mu}^\tran]} + \e_B [ \e_{\mathcal{N}(\mathbf{z}; \bm{\mu}, \bm{\Sigma})} [ \log p_{\bm{\theta}}(\mathbf{x}|\mathbf{z}) ] ]. \end{align}$$
So now you can use the BILBO instead of the ELBO to train your VAE, and you can just set $\bm{\Sigma}$ to a constant matrix instead of learning it. See the paper for more details.
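Here's roughly what that looks like in code - a hedged sketch under my own toy assumptions (mean-only encoder, fixed $\bm{\Sigma} = \sigma^2\mathbf{I}$, unit-normal likelihood), not the paper's implementation.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 2, 8

# Hypothetical mean-only encoder and decoder; since Sigma is fixed, no variance head is needed.
enc = nn.Sequential(nn.Linear(data_dim, 32), nn.Tanh(), nn.Linear(32, latent_dim))
dec = nn.Sequential(nn.Linear(latent_dim, 32), nn.Tanh(), nn.Linear(32, data_dim))

sigma2 = 0.1   # fixed posterior variance, Sigma = sigma2 * I (a made-up choice)

def bilbo(x):
    mu = enc(x)                                 # (batch, latent_dim)
    # -1/2 log det(I + Sigma^{-1} E_B[mu mu^T])
    second_moment = mu.T @ mu / x.shape[0]      # E_B[mu mu^T]
    kl_term = -0.5 * torch.logdet(torch.eye(latent_dim) + second_moment / sigma2)
    # E_B[E_{N(z; mu, Sigma)}[log p(x|z)]] with a unit-normal likelihood, constant dropped.
    z = mu + sigma2 ** 0.5 * torch.randn_like(mu)
    log_lik = -0.5 * ((x - dec(z)) ** 2).sum(dim=-1).mean()
    return kl_term + log_lik

x = torch.randn(16, data_dim)   # stand-in batch
(-bilbo(x)).backward()          # maximize the BILBO
```

Notice the encoder doesn't need a variance head at all anymore - that's the simplification.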
adapted from arXiv:1912.10309 §4.2
Every article about Variational Auto-Encoders talks about the evidence lower bound (ELBO), a lower bound on the log evidence in the data in support of the model parameters. We make do with a lower bound because the exact value is presumably too difficult to work with. But now I am going to tell you that the exact log evidence is actually super easy to compute. Oh, and the ELBO is actually a tight bound after all.
First let's look at the ELBO: $$ \text{ELBO} = -\mathrm{KL}(q(\mathbf{z}|\mathbf{x}) | p(\mathbf{z})) + \e_{q(\mathbf{z}|\mathbf{x})} \left[\log p(\mathbf{x}|\mathbf{z})\right]. $$
Typically $q(\mathbf{z}|\mathbf{x})\!=\!\mathcal{N}(\mathbf{z}; \bm{\mu}(\mathbf{x}), \bm{\Sigma}(\mathbf{x}))$ and $p(\mathbf{z})$ is standard normal. Let's also use a unit normal likelihood $p(\mathbf{x}|\mathbf{z})\!=\!\mathcal{N}(\mathbf{x}; \bm{\nu}(\mathbf{z}), \mathbf{I})$ which acts like an L2 loss. So: $$ \text{ELBO} = \tfrac{1}{2} \! \left( \log \det{e \bm{\Sigma}} - \tr \bm{\Sigma} - |\bm{\mu}|^2 \right) + \e_{\mathcal{N}(\mathbf{z}; \bm{\mu}, \bm{\Sigma})} [-\tfrac{m}{2} \log 2 \pi - \tfrac{1}{2} \! |\mathbf{x} - \bm{\nu}(\mathbf{z})|^2]. $$
(Don't worry - the paper has a proof that other likelihoods can be transformed to unit normal likelihood by appropriately warping space.)
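The first term of that ELBO is $-\mathrm{KL}(q(\mathbf{z}|\mathbf{x}) | p(\mathbf{z}))$ in closed form, so if you want a quick sanity check, its negative should match an off-the-shelf KL divergence. Here's a sketch with made-up values of $\bm{\mu}$ and $\bm{\Sigma}$.

```python
import math
import torch
from torch.distributions import MultivariateNormal, kl_divergence

torch.manual_seed(0)
k = 3
mu = torch.randn(k)             # made-up posterior mean
C = torch.randn(k, k)
Sigma = C @ C.T + torch.eye(k)  # made-up SPD posterior covariance

q = MultivariateNormal(mu, Sigma)                      # q(z|x)
p = MultivariateNormal(torch.zeros(k), torch.eye(k))   # standard normal prior p(z)

# KL(q || p) written as -(1/2)(log det(e Sigma) - tr Sigma - |mu|^2).
closed_form = -0.5 * (torch.logdet(math.e * Sigma) - torch.trace(Sigma) - mu @ mu)

print(kl_divergence(q, p).item(), closed_form.item())  # the two should match
```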
Now everybody knows $\e_{\mathcal{N}(\mathbf{z}; \bm{\mu}, \bm{\Sigma})} [|\mathbf{x} - \bm{\nu}(\mathbf{z})|^2] \simeq |\mathbf{x} - \bm{\nu}(\bm{\mu})|^2 + \tr\mathbf{J}^\tran\!\mathbf{J}\bm{\Sigma}$, where $\mathbf{J}$ is the decoder Jacobian $\pder{\bm{\nu}}{\mathbf{z}}$ at $\mathbf{z}\!=\!\bm{\mu}.$ So we get: $$ \text{ELBO} \simeq - \tfrac{1}{2} \big( m \log 2 \pi + |\bm{\mu}|^2 + |\mathbf{x} - \bm{\nu}(\bm{\mu})|^2 + \tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}) - \log \det{e \bm{\Sigma}} \big). $$
That's a lower bound on the log evidence. Here is the exact log evidence: $$ \log p_{\bm{\theta}}(\mathbf{x}) = - \tfrac{1}{2} \big( m \log 2 \pi + |\mathbf{z}|^2 + |\mathbf{x} - \bm{\nu}(\mathbf{z})|^2 + \log \det{\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}} \big). $$ It's derived in the paper using assumptions similar to a VAE's: that $\mathbf{x}$ comes from some random latent process $\mathbf{z}$ via some mapping $\bm{\nu}$ plus noise.
Now here's something cool. Look what happens if we subtract the evidence lower bound from the exact log evidence (at $\mathbf{z}\!=\!\bm{\mu}$): $$ \log p_{\bm{\theta}}(\mathbf{x}) - \text{ELBO} \simeq \tfrac{1}{2} \big( \tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}) - \log \det{e (\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}} \big). $$ This is zero if $\bm{\Sigma} = (\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})^\inv.$ Now get this - the paper has a proof that this is true at the stationary points of the ELBO! That means if we optimize the ELBO to convergence, we actually get a tight bound on the log evidence.
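Here's a toy numerical version of that, under made-up assumptions (a random little decoder, one data point, and $\mathbf{z} = \bm{\mu}$): compute $\mathbf{J}$ with autograd, plug $\bm{\Sigma} = (\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})^\inv$ into the approximate ELBO from above, and compare with the exact log evidence.

```python
import math
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
m, k = 5, 2                                 # data and latent dimensions
dec = torch.nn.Sequential(torch.nn.Linear(k, 16), torch.nn.Tanh(), torch.nn.Linear(16, m))

mu = torch.randn(k)                         # treat this as both z and the posterior mean
x = dec(mu) + 0.1 * torch.randn(m)          # a stand-in data point

J = jacobian(dec, mu)                       # decoder Jacobian at z = mu
A = J.T @ J + torch.eye(k)
base = m * math.log(2 * math.pi) + (mu ** 2).sum() + ((x - dec(mu)) ** 2).sum()

# Exact log evidence: -(1/2)(base + log det(J^T J + I)).
log_evidence = -0.5 * (base + torch.logdet(A))

# Approximate ELBO with the stationary-point covariance Sigma = (J^T J + I)^{-1}.
Sigma = torch.linalg.inv(A)
elbo = -0.5 * (base + torch.trace(A @ Sigma) - torch.logdet(math.e * Sigma))

print(log_evidence.item(), elbo.item())     # identical up to floating point
```

They agree because $\tr((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I})\bm{\Sigma}) - \log \det{e \bm{\Sigma}}$ collapses to $\log \det{\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}}$ at that particular $\bm{\Sigma}$.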
adapted from arXiv:1912.10309 §4.2
The log Jacobian determinant is a concept that comes up a lot in machine learning. It's like a mash-up of three very mathy things. It shows up whenever we consider the impact that warping space has on entropy. And warping space is like ninety percent of machine learning. (The other ninety percent is data.) It turns out we can compute this thing using only traces.
First let's talk about traces.
The trace of a matrix is the sum of its eigenvalues. Suppose we have a matrix $\mathbf{A}$, and for some reason we can't look directly at it, but we are allowed to know $\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}$ for any vector $\mathbf{v}.$ We can still learn something about $\mathbf{A}$ using a method called probing. If we randomly sample vectors $\mathbf{v} \sim \mathcal{N}(\bm{0}, \bm{\Sigma})$, we can estimate the trace of the matrix product $\mathbf{A} \bm{\Sigma}$: $$ \e_{\mathcal{N}(\mathbf{v}; \bm{0}, \bm{\Sigma})} [\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}] = \tr \mathbf{A} \bm{\Sigma}. $$
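In code, probing looks something like this - a self-contained sketch with made-up matrices; in a real application you'd only have access to the quadratic form $\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}$, not $\mathbf{A}$ itself.

```python
import torch

torch.manual_seed(0)
n = 4
M = torch.randn(n, n)
A = M @ M.T + torch.eye(n)      # pretend we can only evaluate v^T A v, never A directly

C = torch.randn(n, n)
Sigma = C @ C.T + torch.eye(n)  # the probing covariance N(0, Sigma) is our choice

# Sample probes v ~ N(0, Sigma) and average the quadratic form v^T A v.
v = torch.randn(200000, n) @ torch.linalg.cholesky(Sigma).T
estimate = torch.einsum('bi,ij,bj->b', v, A, v).mean()

print(estimate.item(), torch.trace(A @ Sigma).item())   # these should be close
```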
Now let's talk about log determinants.
The log determinant of a matrix is the sum of its log eigenvalues. What if we want the log determinant of $\mathbf{A}$, but we can only estimate traces? We can actually do it with a minimization problem, introducing an auxiliary matrix $\bm{\Sigma}$: $$ \min_{\bm{\Sigma}} \tr \mathbf{A} \bm{\Sigma} - \log \det{e \bm{\Sigma}} = \log \det{\mathbf{A}}. $$ This means we can estimate the log determinant of $\mathbf{A}$ using e.g. stochastic gradient descent over $\bm{\Sigma}$ by probing $\mathbf{v}^\tran\!\mathbf{A} \mathbf{v}$ with vectors $\mathbf{v} \sim \mathcal{N}(\bm{0}, \bm{\Sigma})$, which works assuming $\mathbf{A}$ is symmetric positive definite. And since we control $\bm{\Sigma}$, we can construct it from some low-dimensional factorization such that computing $\log \det{e \bm{\Sigma}}$ is barely an inconvenience.
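And here's a sketch of that minimization with a made-up $\mathbf{A}$, touching $\mathbf{A}$ only through stochastic probes and still recovering its log determinant.

```python
import torch

torch.manual_seed(0)
n = 4
M = torch.randn(n, n)
A = M @ M.T + torch.eye(n)          # symmetric positive definite, probed but never inspected

# Parameterize Sigma = L L^T through a learnable factor L.
L = torch.eye(n, requires_grad=True)
opt = torch.optim.Adam([L], lr=0.02)

for step in range(3000):
    # Stochastic probe of tr(A Sigma): average v^T A v over v ~ N(0, L L^T).
    v = torch.randn(256, n) @ L.T
    probe = torch.einsum('bi,ij,bj->b', v, A, v).mean()
    # Objective: tr(A Sigma) - log det(e Sigma), where log det(e Sigma) = n + log det(L L^T).
    loss = probe - n - torch.logdet(L @ L.T)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    Sigma = L @ L.T
    minimized = torch.trace(A @ Sigma) - n - torch.logdet(Sigma)
print(minimized.item(), torch.logdet(A).item())   # these should roughly agree
```

In practice you'd build $\bm{\Sigma}$ from a smaller factorization, as mentioned above, so that $\log \det{e \bm{\Sigma}}$ stays cheap; the full $n \times n$ factor here is just for the demo.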
Finally let's talk about Jacobians, and auto-encoders!
In the previous article on exact evidence for Variational Auto-Encoders, we saw that the evidence lower bound (ELBO) is pretty similar to the exact log evidence. The only difference is that the ELBO has $-\tfrac{1}{2}(\tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}) - \log \det{e \bm{\Sigma}})$ where the exact log evidence has $-\tfrac{1}{2}\log\det{\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}}$, which is the negative log of the Jacobian determinant measuring how much the decoder stretches space (plus noise).
Well guess what happens when we maximize the ELBO: $$ \min_{\bm{\Sigma}} \tr ((\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}) \bm{\Sigma}) - \log \det{e \bm{\Sigma}} = \log \det{\mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}}. $$ That's just what we had before, but with $\mathbf{A} = \mathbf{J}^\tran\!\mathbf{J} + \mathbf{I}.$ This is the only place that $\bm{\Sigma}$ appears in the ELBO, so this is its only role. When we derive the math for variational auto-encoders, it's tempting to imagine $\bm{\Sigma}$ has some intrinsic meaning that we discover via the ELBO. But we've just shown that $\bm{\Sigma}$ is merely an auxiliary matrix in a stochastic estimator for the log determinant of the decoder Jacobian.
That's it. That's the whole website.