Deriving the Poisson Log-Likelihood and MLEs for Single-Cell RNA-seq

Nov 17, 2025 · 3 min read
Credit: Nano Banana

Introduction

In single-cell RNA-sequencing (scRNA-seq) analysis, gene expression counts are often modeled using the Poisson distribution. If we assume that the count $X_{cg}$ for cell $c$ and gene $g$ follows a Poisson distribution with rate parameter $n_c p_g$, we can write:

$$X_{cg} \sim \text{Poisson}(n_c p_g)$$

where:

  • $n_c$ represents the total molecular count (or sequencing depth) for cell $c$
  • $p_g$ represents the probability (relative expression) of gene $g$
  • $\sum_g p_g = 1$ (probabilities sum to one)

In this post, we’ll derive the log-likelihood for this model and find the maximum likelihood estimators (MLEs) for both $n_c$ and $p_g$.

The Poisson Log-Likelihood

The Poisson probability mass function is:

$$P(X_{cg}=x) = \frac{(n_c p_g)^x e^{-n_c p_g}}{x!}$$

For all cells and genes, the joint likelihood is:

$$\mathcal{L}(\{n_c\},\{p_g\}) = \prod_{c,g}\frac{(n_c p_g)^{X_{cg}} e^{-n_c p_g}}{X_{cg}!}$$

Taking the natural logarithm:

$$\ell(\{n_c\},\{p_g\}) = \sum_{c,g}\Big[ X_{cg}\ln(n_c p_g) - n_c p_g - \ln(X_{cg}!)\Big]$$

Since $\ln(X_{cg}!)$ doesn’t depend on our parameters, we can drop it from the optimization:

$$\boxed{\ell(\{n_c\},\{p_g\}) = \sum_{c,g}\big[ X_{cg}\ln(n_c p_g) - n_c p_g\big] + \text{const.}}$$

This is the key result referenced in equation (13).

Expanding the Log-Likelihood

We can expand the logarithm of the product:

$$\ell = \sum_{c,g} X_{cg}\ln n_c + \sum_{c,g} X_{cg}\ln p_g - \sum_{c,g} n_c p_g + \text{const.}$$

Notice that if $\sum_g p_g = 1$, then:

$$\sum_g n_c p_g = n_c \sum_g p_g = n_c$$

This simplification will be useful for optimization.

MLE for Gene Probabilities $p_g$

To find the MLE for $\{p_g\}$ under the constraint $\sum_g p_g = 1$, we use Lagrange multipliers. The relevant terms are:

$$\mathcal{J} = \sum_{c,g} X_{cg}\ln p_g - \lambda\Big(\sum_g p_g - 1\Big)$$

Taking the derivative with respect to $p_g$ and setting it to zero:

$$\frac{\partial\mathcal{J}}{\partial p_g} = \sum_c\frac{X_{cg}}{p_g} - \lambda = 0$$

This gives us:

$$p_g = \frac{1}{\lambda}\sum_c X_{cg}$$

Enforcing the constraint $\sum_g p_g = 1$:

$$1 = \sum_g p_g = \frac{1}{\lambda}\sum_g\sum_c X_{cg} \quad\Rightarrow\quad \lambda = \sum_{c,g} X_{cg} \equiv N_{\text{tot}}$$

Therefore, the MLE is:

$$\boxed{\hat{p}_g = \frac{\sum_c X_{cg}}{\sum_{c,g} X_{cg}}}$$

Interpretation: The estimated probability for gene $g$ is simply the fraction of total observed counts that belong to that gene.

MLE for Cell Size Factors $n_c$

Now we maximize with respect to $n_c$, treating $\{p_g\}$ as fixed. The relevant terms are:

$$\ell = \sum_c\Big[\Big(\sum_g X_{cg}\Big)\ln n_c - n_c\sum_g p_g\Big] + \text{const.}$$

Taking the derivative:

$$\frac{\partial\ell}{\partial n_c} = \frac{\sum_g X_{cg}}{n_c} - \sum_g p_g = 0$$

Since $\sum_g p_g = 1$:

$$\boxed{\hat{n}_c = \sum_g X_{cg}}$$

Interpretation: The estimated size factor for cell $c$ is simply the total observed count across all genes in that cell.

Verification: Concavity

The log-likelihood is concave in both $\{p_g\}$ and $\{n_c\}$:

  • For $p_g$: The term $\sum X_{cg}\ln p_g$ is concave on the probability simplex
  • For $n_c$: The second derivative is $\frac{\partial^2\ell}{\partial n_c^2} = -\frac{\sum_g X_{cg}}{n_c^2} < 0$

Therefore, our stationary points are indeed global maxima.

Summary

For the Poisson model $X_{cg} \sim \text{Poisson}(n_c p_g)$:

  1. Log-likelihood (up to constant):

    $$\ell = \sum_{c,g}\big[ X_{cg}\ln(n_c p_g) - n_c p_g\big]$$
  2. MLE for gene probabilities:

    $$\hat{p}_g = \frac{\text{total counts for gene }g}{\text{total counts overall}}$$
  3. MLE for cell sizes:

    $$\hat{n}_c = \text{total counts in cell }c$$

These simple closed-form solutions make the Poisson model attractive for initial modeling of scRNA-seq count data, though more sophisticated models (negative binomial, zero-inflated, etc.) are often needed to account for overdispersion and other biological complexities.


This derivation is foundational for understanding normalization methods and differential expression testing in single-cell genomics.