Let's derive some things related to variational autoencoders (VAEs).
Evidence Lower Bound (ELBO)
First, we'll state some assumptions. We have a dataset of images, $x$. We'll assume that each image is generated from some unseen latent code $z$, and that there's an underlying distribution of latents $p(z)$. We'd like to discover parameters $\theta$ that maximize the likelihood of the data under the model $p_\theta(x|z)$. In practical terms, we want to train a neural network that will generate images like those in the dataset.
Let's apply Bayes' rule to this likelihood term and look at each piece.
$p_\theta(x|z) = \dfrac{p(z|x)p(x)}{p(z)}$
- We can calculate $p_\theta(x|z)$ for any $z$.
- We can assume a form for $p(z)$. In the VAE case, it's the standard Gaussian.
- We don't know $p(x)$. It is intractable to compute $p(x) = \int p_\theta(x|z)p(z)\,dz$, because the space of all $z$ is large.
- We also don't know $p(z|x)$.
To help us, we'll introduce an approximate model: $q_\phi(z|x)$. The idea is to closely match the true posterior $p(z|x)$, but in a way that lets us sample from it. We can quantify the distance between the two distributions via the KL divergence.
$KL(q_\phi(z|x), p(z|x)) = E_q[\log q_\phi(z|x) - \log p(z|x)]$
In fact, breaking down this term will give us a hint on how to handle $\log p(x)$.
$= E_q[\log q_\phi(z|x) - \log p(z|x)]$
$= E_q[\log q_\phi(z|x) - \log (p(z,x)/p(x))]$
$= E_q[\log q_\phi(z|x) - \log p(z,x) + \log p(x)]$
$= E_q[\log q_\phi(z|x) - \log p(z,x)] + \log p(x)$
$= E_q[\log q_\phi(z|x) - \log p_\theta(x|z)p(z)] + \log p(x)$
We can rearrange terms as:
$\log p(x) = E_q[\log p_\theta(x|z)p(z) - \log q_\phi(z|x)] + KL(q_\phi(z|x), p(z|x))$
So, $\log p(x)$ breaks down into two terms.

The KL term here is intractable because it involves $p(z|x)$, which we can't evaluate. At least we know that a KL divergence is always greater than or equal to zero.

The expectation is tractable! It involves three functions that we can evaluate. Note that $E_q$ means $x$ is sampled from the data, then $z$ is sampled from $q_\phi(z|x)$. Since the KL term is non-negative, this expectation is a lower bound on $\log p(x)$. We call it the evidence lower bound, or ELBO.
The ELBO can be further broken down into two components.
$E_q[\log p_\theta(x|z)p(z) - \log q_\phi(z|x)]$
$= E_q[\log p_\theta(x|z) + \log p(z) - \log q_\phi(z|x)]$
$= E_q[\log p_\theta(x|z)] + E_q[\log p(z) - \log q_\phi(z|x)]$
$= E_q[\log p_\theta(x|z)] - KL(q_\phi(z|x), p(z))$
The ELBO is equal to:
- A reconstruction objective. For each $x$ in the dataset, encoding it via $q_\phi$ then decoding via $p_\theta$ should give high probability to the original $x$.
- A prior-matching objective. For each $x$ in the dataset, the distribution $q_\phi(z|x)$ should stay close to the prior $p(z)$.
Here's a practical way to look at the ELBO objective. We can't maximize $p(x)$ directly because we don't have access to $p(z|x)$. But if we approximate $p(z|x)$ with $q_\phi(z|x)$, we can get a lower bound on $\log p(x)$ and maximize that instead. The lower bound depends on 1) how well decoding a sample from $q_\phi(z|x)$ through $p_\theta$ recreates the data, and 2) how well $q_\phi(z|x)$ matches the prior $p(z)$.
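To make this concrete, here's a toy sketch in plain Python (not a neural network; all names and the model itself are hypothetical): a 1-D setup where the prior is $p(z) = N(0, 1)$, the assumed decoder is $p(x|z) = N(x; z, 1)$, and the encoder $q(z|x)$ is a Gaussian we pick by hand. Because everything is Gaussian, the true marginal is $p(x) = N(0, 2)$, so we can check that a Monte Carlo estimate of the ELBO really sits at or below $\log p(x)$.

```python
import math
import random

# Toy 1-D model (hypothetical): prior p(z) = N(0, 1),
# decoder p(x|z) = N(x; z, 1), encoder q(z|x) = N(enc_mu, enc_sigma^2).
# The implied marginal is p(x) = N(0, 2).

def log_normal(x, mu, sigma):
    # log-density of N(mu, sigma^2)
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - 0.5 * ((x - mu) / sigma) ** 2

def elbo_estimate(x, enc_mu, enc_sigma, n_samples=1000):
    # Monte Carlo estimate of E_q[log p(x|z) + log p(z) - log q(z|x)]
    total = 0.0
    for _ in range(n_samples):
        z = random.gauss(enc_mu, enc_sigma)            # z ~ q(z|x)
        total += (log_normal(x, z, 1.0)                # reconstruction term
                  + log_normal(z, 0.0, 1.0)            # prior term
                  - log_normal(z, enc_mu, enc_sigma))  # entropy of q
    return total / n_samples
```

With the exact posterior $q(z|x) = N(x/2, 1/2)$, every sample of the integrand equals $\log p(x)$ exactly, so the bound is tight; any other choice of encoder lands strictly below, with a gap of $KL(q_\phi(z|x), p(z|x))$.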
Analytical KL divergence for Gaussians
In the classic VAE setup, $p(z)$ is a standard Gaussian. The prior-matching objective above can then be computed analytically, without the need to sample from $q_\phi(z|x)$.
Since the VAE setup defines an independent Gaussian for each latent dimension, we only need to consider the univariate Gaussian case (instead of a multivariate one). The encoder network will output a mean $\mu$ and standard deviation $\sigma$. We'll refer to this distribution as $P = N(\mu, \sigma^2)$. The standard Gaussian is $Q = N(0, 1)$.
The KL divergence measures how different two probability distributions are:
$KL(P, Q) = E_{P}[\log P(x) - \log Q(x)]$
We'll also need the probability density for a Gaussian:
$p(x \mid \mu, \sigma) = \dfrac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$
For Q, $\mu=0$ and $\sigma=1$ so the term simplifies.
$q(x) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}$
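As a quick numerical sanity check (a throwaway sketch, not part of the derivation), we can transcribe this density into Python and confirm it behaves like a probability distribution:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Density of N(mu, sigma^2), transcribing the formula above.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# The standard normal should peak at 1/sqrt(2*pi) ~= 0.3989
# and integrate to 1 (Riemann sum over [-8, 8] with step 0.001).
peak = normal_pdf(0.0)
step = 0.001
area = step * sum(normal_pdf(-8.0 + i * step) for i in range(16_000))
```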
Given the above, let's plug in to the KL divergence.
$KL(P, Q) = E_P\left[\log\left(\dfrac{1}{\sigma\sqrt{2\pi}}\right) - \dfrac{1}{2}\left(\dfrac{x - \mu}{\sigma}\right)^2 - \log\left(\dfrac{1}{\sqrt{2\pi}}\right) + \dfrac{1}{2}x^2\right]$
Let's break down this expectation and deal with each term one-by-one.
$KL(P, Q) = E_P\left[\log\left(\dfrac{1}{\sigma\sqrt{2\pi}}\right) - \log\left(\dfrac{1}{\sqrt{2\pi}}\right)\right] + E_P\left[-\dfrac{1}{2}\left(\dfrac{x - \mu}{\sigma}\right)^2\right] + E_P\left[\dfrac{1}{2}x^2\right]$
The first expectation involves two logs that can be combined into one. There is no $x$ term, so the expectation can be dropped.
$= E_P\left[\log\left(\dfrac{\sqrt{2\pi}}{\sigma\sqrt{2\pi}}\right)\right]$
$= E_P[\log(1/\sigma)]$
$= -\log(\sigma) = -(1/2)\log(\sigma^2)$
The second expectation can be simplified by knowing that $E_P[(x - \mu)^2]$ is the definition of the variance $\sigma^2$.
$E_P\left[-\dfrac{1}{2}\left(\dfrac{(x - \mu)^2}{\sigma^2}\right)\right]$
$= -(1/2)\, E_P[(x - \mu)^2]\, (1/\sigma^2)$
$= -(1/2)\, \sigma^2\, (1/\sigma^2)$
$= -(1/2)$
The third expectation can also be explained in terms of variance. An equivalent equation for the variance is $\sigma^2 = E[X^2] - E[X]^2$. In our case, $E[X] = \mu$, and $E[X^2]$ is what we want to find. So,
$\sigma^2 = E[X^2] - \mu^2$
$E[X^2] = \sigma^2 + \mu^2$
$(1/2)E[X^2] = (1/2)(\sigma^2 + \mu^2)$
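This identity is easy to spot-check numerically. The sketch below (with arbitrarily chosen $\mu$ and $\sigma$) draws samples from $N(\mu, \sigma^2)$ and compares the empirical second moment against $\sigma^2 + \mu^2$:

```python
import math
import random

random.seed(0)
mu, sigma = 1.5, 0.7
n = 200_000

# Empirical E[X^2] from samples of N(mu, sigma^2).
second_moment = sum(random.gauss(mu, sigma) ** 2 for _ in range(n)) / n

# The identity predicts sigma^2 + mu^2 = 0.49 + 2.25 = 2.74.
predicted = sigma ** 2 + mu ** 2
```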
Nice, all of our expectations are now expressed as functions of $\mu$ and $\sigma$. Put together, we get the final equation for the KL divergence loss:
$KL\_Loss(\mu, \sigma) = (1/2)\left[-\log(\sigma^2) - 1 + \sigma^2 + \mu^2\right]$
Note that our encoder would give us a batch-sized vector of $\mu$ and $\sigma$ values. Because we assume each $(\mu, \sigma)$ pair parametrizes an independent Gaussian, the total loss is the sum of the above equation applied element-wise.
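Here's a minimal implementation of this loss (the function names are ours, not from any library), along with a Monte Carlo cross-check of the closed form against the defining expectation $E_P[\log P(x) - \log Q(x)]$:

```python
import math
import random

def kl_loss(mu, sigma):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), as derived above.
    return 0.5 * (-math.log(sigma ** 2) - 1.0 + sigma ** 2 + mu ** 2)

def log_normal(x, mu, sigma):
    # log-density of N(mu, sigma^2)
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - 0.5 * ((x - mu) / sigma) ** 2

# Monte Carlo estimate of E_P[log P(x) - log Q(x)] with P = N(mu, sigma^2).
random.seed(0)
mu, sigma = 0.8, 1.3
n = 200_000
total = 0.0
for _ in range(n):
    x = random.gauss(mu, sigma)
    total += log_normal(x, mu, sigma) - log_normal(x, 0.0, 1.0)
mc_estimate = total / n
```

In a real VAE, `kl_loss` would be applied to every latent dimension of every item in the batch and summed.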
Sanity check: the KL Loss should be minimized when $\mu=0, \sigma=1$.
$(d/d\mu)\, [-\log(\sigma^2) - 1 + \sigma^2 + \mu^2] = 2\mu \rightarrow$ min. at $\mu = 0$.
$(d/d\sigma)\, [-\log(\sigma^2) - 1 + \sigma^2 + \mu^2] = 2\sigma - 2/\sigma \rightarrow$ min. at $\sigma = 1$.
All good!
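The same sanity check can be run numerically with a coarse grid search over $(\mu, \sigma)$; this is just a throwaway verification sketch:

```python
import math

def kl_loss(mu, sigma):
    # KL( N(mu, sigma^2) || N(0, 1) ) from the derivation above.
    return 0.5 * (-math.log(sigma ** 2) - 1.0 + sigma ** 2 + mu ** 2)

# Evaluate on a grid (mu in [-2, 2], sigma in (0, 3]) and keep the minimizer.
candidates = [(kl_loss(m / 10.0, s / 10.0), m / 10.0, s / 10.0)
              for m in range(-20, 21) for s in range(1, 31)]
best_val, best_mu, best_sigma = min(candidates)
```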
An aside on batch-wise KL vs. element-wise KL
A common misconception about this KL loss is how it relates to each batch of data. One could imagine that the KL loss is meant to shape a given batch of latent vectors so that the empirical distribution within the batch is standard Gaussian. Under this interpretation, the latents should have mean 0 and variance 1 on average, but each individual sample could vary. It would then be possible to nearly perfectly match the standard Gaussian while still conveying information about each input image.
The above interpretation is incorrect. Instead, the KL loss is applied element-wise. Each image is encoded into a mean and variance pair, and the loss encourages this explicit mean and variance to resemble 0 and 1. Thus, to nearly match the standard Gaussian, each image would have to encode to a standard Gaussian (and thus convey no unique information). In this way, the KL loss and the reconstruction objective are in conflict, and a balance is struck depending on the relative scaling of the two objectives.
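A tiny illustration of the distinction, reusing the closed-form loss from above (the numbers are made up): a batch of four latent means that happens to have mean 0 and spread close to 1 across the batch still pays a large element-wise KL penalty, because each individual Gaussian is narrow and off-center.

```python
import math

def kl_loss(mu, sigma):
    # Element-wise KL( N(mu, sigma^2) || N(0, 1) )
    return 0.5 * (-math.log(sigma ** 2) - 1.0 + sigma ** 2 + mu ** 2)

# Hypothetical encoder outputs for a batch of 4 images (1-D latents).
mus    = [-1.2, -0.4, 0.4, 1.2]   # batch mean is 0, batch std is ~0.9
sigmas = [0.1, 0.1, 0.1, 0.1]     # but each element is a narrow spike

# Element-wise loss is large even though the batch statistics look standard.
total_kl = sum(kl_loss(m, s) for m, s in zip(mus, sigmas))
```

Under the (incorrect) batch-wise reading, this batch would be nearly free; under the actual element-wise loss, it is heavily penalized.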