🎓 Math Rationale for ELBO/KL in Bayesian Inference and VAEs

May 27, 2025 · Jiyuan (Jay) Liu · 5 min read

🔍 Marginal likelihood is often intractable in Bayesian Inference

Bayesian inference and Variational Autoencoders (VAEs), a marriage of deep learning and Bayesian inference, are powerful and foundational tools for probabilistic modeling in computational biology. Both rely on the marginal likelihood, which is often intractable to compute directly: each framework involves integrating over latent variables or parameters, and these high-dimensional integrals rarely have a closed form.

Specifically, in Bayesian inference, we want the posterior distribution:

$$p(\theta \mid D) = \frac{p(D \mid \theta) \, p(\theta)}{p(D)} \tag{1}$$

But the denominator—the evidence or marginal likelihood—is:

$$p(D) = \int p(D \mid \theta) \, p(\theta) \, d\theta \tag{2}$$

This integral is often intractable.
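To make this concrete, here is a minimal sketch using a hypothetical 1-D Bayesian logistic regression (all names and numbers are illustrative). In one dimension we can brute-force the integral on a grid, but the cost of that grid grows exponentially with the dimension of $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)                                   # covariates
y = (rng.uniform(size=20) < 1 / (1 + np.exp(-1.5 * x))).astype(float)  # labels

def log_joint(theta):
    """log p(D | theta) + log p(theta), with a N(0, 1) prior on theta."""
    logits = theta * x
    log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))
    log_prior = -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)
    return log_lik + log_prior

# In 1-D, p(D) = ∫ p(D|θ) p(θ) dθ can be brute-forced on a grid ...
grid = np.linspace(-10.0, 10.0, 10_001)
dx = grid[1] - grid[0]
log_pD = np.log(np.sum(np.exp([log_joint(t) for t in grid])) * dx)
print(f"log p(D) ≈ {log_pD:.4f}")
# ... but a comparable grid in d dimensions needs ~10^(4d) evaluations,
# which is exactly the intractability described above.
```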

💡 Variational Inference: The Workaround

We introduce a simpler variational distribution $q(\theta)$ and try to make it as close as possible to the true posterior $p(\theta \mid D)$.

We measure closeness using KL divergence:

$$\text{KL}(q(\theta) \| p(\theta \mid D)) = \int q(\theta) \log \frac{q(\theta)}{p(\theta \mid D)} d\theta \tag{3}$$

This is hard to compute directly because $p(\theta \mid D)$ involves the intractable evidence $p(D)$, so we rearrange terms.
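Note that when both densities can be evaluated pointwise, the KL in Eq. (3) is easy to estimate; the problem here is specifically that $p(\theta \mid D)$ is not such a density. A quick sketch with two Gaussians as stand-ins (all numbers illustrative):

```python
import numpy as np
from scipy.stats import norm

q = norm(loc=0.0, scale=1.0)       # our variational distribution
p = norm(loc=1.0, scale=2.0)       # a stand-in density we CAN evaluate

theta = q.rvs(size=200_000, random_state=0)
kl_mc = np.mean(q.logpdf(theta) - p.logpdf(theta))   # E_q[log q - log p]

# Closed form for KL(N(m1, s1^2) || N(m2, s2^2)), for comparison.
kl_exact = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0) ** 2) / (2 * 2.0**2) - 0.5
print(kl_mc, kl_exact)             # both ≈ 0.4431
```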

We can rewrite $\log p(D)$ as:

$$\log p(D) = \mathbb{E}_{q(\theta)} \left[ \log \frac{p(D, \theta)}{q(\theta)} \right] + \text{KL}(q(\theta) \| p(\theta \mid D)) \tag{4}$$

Thus:

$$\log p(D) = \text{ELBO}(q) + \text{KL}(q(\theta) \| p(\theta \mid D)) \tag{5}$$

Since the KL divergence is always ≥ 0:

$$\text{ELBO}(q) \leq \log p(D) \tag{6}$$

That’s why it’s called a lower bound.
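A quick numerical check of the non-negativity that drives this bound (Gibbs' inequality); the discrete distributions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet(np.ones(10))         # random discrete distribution
    p = rng.dirichlet(np.ones(10))
    print(np.sum(q * np.log(q / p)))       # KL(q || p): always >= 0,
                                           # and 0 only when q == p
```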

🔑 Why KL is Computable in Practice

At first glance, the KL divergence:

$$ \text{KL}(q(\theta) \| p(\theta \mid D)) = \mathbb{E}_{q(\theta)} \left[ \log \frac{q(\theta)}{p(\theta \mid D)} \right] \tag{7} $$

appears incomputable: it involves $p(\theta \mid D)$, the very posterior we cannot evaluate (since it depends on the intractable evidence $p(D)$).

Computable Indirectly

We never compute KL directly. Instead, we use the ELBO identity:

$$ \log p(D) = \text{ELBO}(q) + \text{KL}(q(\theta) \| p(\theta \mid D)) \tag{5} $$

Rearrange:

$$ \text{KL}(q(\theta) \| p(\theta \mid D)) = \log p(D) - \text{ELBO}(q) \tag{8} $$
  • $\log p(D)$ is intractable, but it is a constant with respect to $q$.
  • The ELBO is tractable, because it only requires expectations under $q(\theta)$ of terms involving $p(D, \theta)$, not $p(\theta \mid D)$.

Specifically:

$$ \text{ELBO}(q) = \mathbb{E}_{q(\theta)}[\log p(D, \theta)] - \mathbb{E}_{q(\theta)}[\log q(\theta)] \tag{9} $$

Both terms are computable:

  • $q(\theta)$ is chosen by us (so we can evaluate and sample from it).
  • $p(D, \theta)$ is the joint model (likelihood × prior), which we know by assumption.

Thus, we avoid direct use of $p(\theta \mid D)$ .
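In code, the tractability of Eq. (9) is one line of Monte Carlo: draw $\theta \sim q$, then average $\log p(D, \theta) - \log q(\theta)$. A minimal sketch, assuming a toy conjugate model (prior $N(0,1)$, likelihood $x_i \sim N(\theta, 1)$) whose joint can be evaluated pointwise; the setup is purely illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=30)    # toy data: x_i ~ N(theta, 1)

def log_joint(theta):
    """log p(D, theta) = log p(D | theta) + log p(theta), prior N(0, 1)."""
    return norm.logpdf(data, loc=theta, scale=1.0).sum() + norm.logpdf(theta)

q = norm(loc=1.8, scale=0.2)                      # a q(theta) we chose
theta = q.rvs(size=5_000, random_state=1)
elbo = np.mean([log_joint(t) - q.logpdf(t) for t in theta])
print(f"ELBO estimate: {elbo:.3f}")               # uses only q and the joint,
                                                  # never p(theta | D)
```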

The Trick

  • The KL involves an intractable posterior.
  • But the ELBO replaces it with quantities we can compute (likelihood, prior, and variational distribution).
  • Maximizing the ELBO → indirectly minimizing the KL (their sum, $\log p(D)$, is a fixed constant).

That’s why all variational inference methods focus on the ELBO, not the raw KL.
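Here is a sketch of that program in the simplest possible setting: the same conjugate Gaussian model as above, where both ELBO terms have closed forms, so we can watch the maximizer land exactly on the true posterior. The optimizer choice is incidental:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=30)
n = x.size

def neg_elbo(params):
    m, log_s = params                  # q(theta) = N(m, s^2), s = exp(log_s)
    s2 = np.exp(2 * log_s)
    # E_q[log p(D | theta)] for x_i ~ N(theta, 1), in closed form.
    exp_loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * (np.sum((x - m) ** 2) + n * s2)
    # KL(N(m, s^2) || N(0, 1)), also in closed form.
    kl_to_prior = 0.5 * (s2 + m**2 - 1.0 - 2 * log_s)
    return -(exp_loglik - kl_to_prior)

m, log_s = minimize(neg_elbo, x0=[0.0, 0.0]).x
print(m, np.exp(log_s))                            # optimized q(theta)
print(x.sum() / (n + 1), (1 / (n + 1)) ** 0.5)     # exact posterior mean, sd
# The two lines match: KL(q || posterior) has been driven to zero.
```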

🧮 Derive ELBO and KL

Let $q(\theta)$ be any distribution over $\theta$ such that its support covers that of $p(\theta \mid D)$ . We’ll exploit a classic trick: insert $q(\theta)$ into the log marginal likelihood using expectation and apply properties of KL divergence.

Step 1: Start with log evidence

We take the logarithm of $p(D)$ , and “multiply and divide” inside by $q(\theta)$ :

$$\log p(D) = \log \int \frac{q(\theta)}{q(\theta)} p(D \mid \theta) p(\theta) \, d\theta \tag{10}$$

$$= \log \int q(\theta) \cdot \frac{p(D \mid \theta) p(\theta)}{q(\theta)} \, d\theta \tag{11}$$

$$= \log \mathbb{E}_{q(\theta)} \left[ \frac{p(D \mid \theta) p(\theta)}{q(\theta)} \right] \tag{12}$$

This is where Jensen’s inequality comes in.

Step 2: Apply Jensen’s Inequality

$$\log \mathbb{E}_{q(\theta)} \left[ \frac{p(D \mid \theta) p(\theta)}{q(\theta)} \right] \geq \mathbb{E}_{q(\theta)} \left[ \log \frac{p(D \mid \theta) p(\theta)}{q(\theta)} \right] \tag{13}$$

That gives us the ELBO:

$$\text{ELBO}(q) = \mathbb{E}_{q(\theta)} \left[ \log \frac{p(D, \theta)}{q(\theta)} \right] \tag{14}$$

So:

$$\log p(D) \geq \text{ELBO}(q) \tag{15}$$
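As a two-line sanity check of Jensen's inequality for the concave $\log$, note that $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$ for any positive random variable; the log-normal choice here is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # any positive X works
print(np.log(np.mean(x)), np.mean(np.log(x)))          # ≈ 0.5 vs ≈ 0.0
```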

But we can go further — let’s rewrite $\log p(D)$ exactly in terms of ELBO + KL divergence.

Step 3: Add and Subtract the Same Quantity

We now write:

$$\log p(D) = \mathbb{E}_{q(\theta)} \left[ \log \frac{p(D, \theta)}{q(\theta)} \right] + \\ \left( \log p(D) - \mathbb{E}_{q(\theta)} \left[ \log \frac{p(D, \theta)}{q(\theta)} \right] \right) \tag{16}$$

Now we observe that the term in parentheses is exactly the KL divergence between $q(\theta)$ and the true posterior:

$$\text{KL}(q(\theta) \| p(\theta \mid D)) = \mathbb{E}_{q(\theta)} \left[ \log \frac{q(\theta)}{p(\theta \mid D)} \right] \tag{17}$$

But recall:

$$p(\theta \mid D) = \frac{p(D, \theta)}{p(D)} \Rightarrow \\ \log p(\theta \mid D) = \log p(D, \theta) - \log p(D) \tag{18} $$

Then:

$$\log \frac{q(\theta)}{p(\theta \mid D)} = \log \frac{q(\theta)}{p(D, \theta)} + \log p(D) \tag{19}$$

Take expectation over $q(\theta)$ :

$$\text{KL}(q(\theta) \| p(\theta \mid D)) = -\mathbb{E}_{q(\theta)} \left[ \log \frac{p(D, \theta)}{q(\theta)} \right] + \log p(D) \tag{20}$$

Rearranged:

$$\log p(D) = \mathbb{E}_{q(\theta)} \left[ \log \frac{p(D, \theta)}{q(\theta)} \right] + \text{KL}(q(\theta) \| p(\theta \mid D)) \tag{21}$$
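This identity can be verified numerically in a model where every term has a closed form; a sketch with the conjugate Gaussian model again (prior $N(0,1)$, likelihood $x_i \sim N(\theta, 1)$, and a deliberately mismatched $q$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=30)
n = x.size

post_mean, post_var = x.sum() / (n + 1), 1.0 / (n + 1)   # exact posterior

# Evidence via log p(D) = log p(D | θ) + log p(θ) - log p(θ | D), at θ = 0.
log_pD = (norm.logpdf(x, loc=0.0).sum() + norm.logpdf(0.0)
          - norm.logpdf(0.0, loc=post_mean, scale=post_var**0.5))

m, s2 = 1.0, 0.5                                          # a mismatched q(theta)
elbo = (-0.5 * n * np.log(2 * np.pi) - 0.5 * (np.sum((x - m) ** 2) + n * s2)
        - 0.5 * (s2 + m**2 - 1.0 - np.log(s2)))           # E_q[log lik] - KL(q || prior)
kl = (0.5 * np.log(post_var / s2)
      + (s2 + (m - post_mean) ** 2) / (2 * post_var) - 0.5)  # KL(q || posterior)
print(log_pD, elbo + kl)                                  # equal, up to float error
```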

Definition of Expectation used above

The derivations above repeatedly use the definition of the expectation of a function $f(\theta)$ under a probability distribution $q(\theta)$:

$$\mathbb{E}_{q(\theta)}[f(\theta)] = \int q(\theta) \, f(\theta) \, d\theta \tag{22}$$
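Its Monte Carlo counterpart, averaging $f$ over samples drawn from $q$, is what makes every expectation above estimable in practice (the $f$ and $q$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=100_000)      # samples from q = N(0, 1)
print(np.mean(theta**2))              # ≈ 1.0 = E[theta^2] under N(0, 1)
```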

📐 ELBO Expression used in VAEs

$$\text{ELBO}(q) = \mathbb{E}_{q(\theta)}[\log p(D \mid \theta)] - \text{KL}(q(\theta) \| p(\theta)) \tag{23}$$

This is the most widely used form in variational inference and VAEs. It comes from expanding the joint $p(D, \theta)$, and interpreting the ELBO as a trade-off between reconstruction and regularization.

Interpretation:

  • The first term encourages $q(\theta)$ to explain the data well.
  • The second term encourages $q(\theta)$ to stay close to the prior.
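In a VAE this form is the training objective almost verbatim: $D$ is a data point $x$, $\theta$ is the latent code $z$, $q$ is the encoder's output distribution, and $p(D \mid \theta)$ is the decoder. A minimal PyTorch sketch of the (negative) ELBO, assuming a diagonal-Gaussian encoder and a Bernoulli decoder; the function and argument names are illustrative:

```python
import torch

def vae_loss(x, decoder_logits, mu, logvar):
    """Negative ELBO for one batch; decoder_logits come from a single
    reparameterized sample z = mu + exp(0.5 * logvar) * eps upstream."""
    # Reconstruction term: E_q[log p(x | z)] for a Bernoulli decoder,
    # estimated with that single sample.
    recon = -torch.nn.functional.binary_cross_entropy_with_logits(
        decoder_logits, x, reduction="sum")
    # KL(N(mu, sigma^2) || N(0, I)) in closed form for a diagonal Gaussian.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return -(recon - kl)               # minimizing this maximizes the ELBO
```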

Its derivation starts from:

1. ELBO–KL decomposition:

$$\log p(D) = \text{ELBO}(q) + \text{KL}(q(\theta) \| p(\theta \mid D)) \tag{5}$$

This identity always holds; it follows from the definition of the Kullback–Leibler divergence, as derived above. Rearranging:

$$\text{ELBO}(q) = \log p(D) - \text{KL}(q(\theta) \| p(\theta \mid D)) \tag{24}$$

2. Definition of ELBO via expected joint:

Alternatively, the ELBO is often defined as:

$$\text{ELBO}(q) = \mathbb{E}_{q(\theta)} \left[ \log \frac{p(D, \theta)}{q(\theta)} \right] \\ = \mathbb{E}_{q(\theta)}[\log p(D, \theta)] - \mathbb{E}_{q(\theta)}[\log q(\theta)] \tag{25}$$

Now recall:

$$\log p(D, \theta) = \log p(D \mid \theta) + \log p(\theta) \tag{26}$$

So:

$$\text{ELBO}(q) = \mathbb{E}_{q(\theta)}[\log p(D \mid \theta)] + \mathbb{E}_{q(\theta)}[\log p(\theta)] - \mathbb{E}_{q(\theta)}[\log q(\theta)] \tag{27}$$

Group terms:

$$\text{ELBO}(q) = \mathbb{E}_{q(\theta)}[\log p(D \mid \theta)] - \text{KL}(q(\theta) \| p(\theta)) \tag{23}$$
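As a final sanity check, the grouping step can be confirmed numerically: $\mathbb{E}_{q(\theta)}[\log p(\theta)] - \mathbb{E}_{q(\theta)}[\log q(\theta)]$ should equal $-\text{KL}(q(\theta) \| p(\theta))$. A sketch with Gaussians (numbers illustrative):

```python
import numpy as np
from scipy.stats import norm

q, prior = norm(loc=1.0, scale=0.5), norm(loc=0.0, scale=1.0)
theta = q.rvs(size=200_000, random_state=0)
grouped = np.mean(prior.logpdf(theta)) - np.mean(q.logpdf(theta))

kl_exact = np.log(1.0 / 0.5) + (0.5**2 + 1.0**2) / 2.0 - 0.5   # KL(q || prior)
print(grouped, -kl_exact)                                       # both ≈ -0.818
```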