Deriving the Poisson Log-Likelihood and MLEs for Single-Cell RNA-seq
Credit: Nano BananaIntroduction
In single-cell RNA-sequencing (scRNA-seq) analysis, gene expression counts are often modeled using the Poisson distribution. If we assume that the count $X_{cg}$ for cell $c$ and gene $g$ follows a Poisson distribution with rate parameter $n_c p_g$, we can write:
$$X_{cg} \sim \text{Poisson}(n_c p_g)$$where:
- $n_c$ represents the total molecular count (or sequencing depth) for cell $c$
- $p_g$ represents the probability (relative expression) of gene $g$
- $\sum_g p_g = 1$ (probabilities sum to one)
In this post, we’ll derive the log-likelihood for this model and find the maximum likelihood estimators (MLEs) for both $n_c$ and $p_g$.
The Poisson Log-Likelihood
The Poisson probability mass function is:
$$P(X_{cg}=x) = \frac{(n_c p_g)^x e^{-n_c p_g}}{x!}$$For all cells and genes, the joint likelihood is:
$$\mathcal{L}(\{n_c\},\{p_g\}) = \prod_{c,g}\frac{(n_c p_g)^{X_{cg}} e^{-n_c p_g}}{X_{cg}!}$$Taking the natural logarithm:
$$\ell(\{n_c\},\{p_g\}) = \sum_{c,g}\Big[ X_{cg}\ln(n_c p_g) - n_c p_g - \ln(X_{cg}!)\Big]$$Since $\ln(X_{cg}!)$ doesn’t depend on our parameters, we can drop it from the optimization:
$$\boxed{\ell(\{n_c\},\{p_g\}) = \sum_{c,g}\big[ X_{cg}\ln(n_c p_g) - n_c p_g\big] + \text{const.}}$$This is the key result referenced in equation (13).
Expanding the Log-Likelihood
We can expand the logarithm of the product:
$$\ell = \sum_{c,g} X_{cg}\ln n_c + \sum_{c,g} X_{cg}\ln p_g - \sum_{c,g} n_c p_g + \text{const.}$$Notice that if $\sum_g p_g = 1$, then:
$$\sum_g n_c p_g = n_c \sum_g p_g = n_c$$This simplification will be useful for optimization.
MLE for Gene Probabilities $p_g$
To find the MLE for $\{p_g\}$ under the constraint $\sum_g p_g = 1$, we use Lagrange multipliers. The relevant terms are:
$$\mathcal{J} = \sum_{c,g} X_{cg}\ln p_g - \lambda\Big(\sum_g p_g - 1\Big)$$Taking the derivative with respect to $p_g$ and setting it to zero:
$$\frac{\partial\mathcal{J}}{\partial p_g} = \sum_c\frac{X_{cg}}{p_g} - \lambda = 0$$This gives us:
$$p_g = \frac{1}{\lambda}\sum_c X_{cg}$$Enforcing the constraint $\sum_g p_g = 1$:
$$1 = \sum_g p_g = \frac{1}{\lambda}\sum_g\sum_c X_{cg} \quad\Rightarrow\quad \lambda = \sum_{c,g} X_{cg} \equiv N_{\text{tot}}$$Therefore, the MLE is:
$$\boxed{\hat{p}_g = \frac{\sum_c X_{cg}}{\sum_{c,g} X_{cg}}}$$Interpretation: The estimated probability for gene $g$ is simply the fraction of total observed counts that belong to that gene.
MLE for Cell Size Factors $n_c$
Now we maximize with respect to $n_c$, treating $\{p_g\}$ as fixed. The relevant terms are:
$$\ell = \sum_c\Big[\Big(\sum_g X_{cg}\Big)\ln n_c - n_c\sum_g p_g\Big] + \text{const.}$$Taking the derivative:
$$\frac{\partial\ell}{\partial n_c} = \frac{\sum_g X_{cg}}{n_c} - \sum_g p_g = 0$$Since $\sum_g p_g = 1$:
$$\boxed{\hat{n}_c = \sum_g X_{cg}}$$Interpretation: The estimated size factor for cell $c$ is simply the total observed count across all genes in that cell.
Verification: Concavity
The log-likelihood is concave in both $\{p_g\}$ and $\{n_c\}$:
- For $p_g$: The term $\sum X_{cg}\ln p_g$ is concave on the probability simplex
- For $n_c$: The second derivative is $\frac{\partial^2\ell}{\partial n_c^2} = -\frac{\sum_g X_{cg}}{n_c^2} < 0$
Therefore, our stationary points are indeed global maxima.
Summary
For the Poisson model $X_{cg} \sim \text{Poisson}(n_c p_g)$:
Log-likelihood (up to constant):
$$\ell = \sum_{c,g}\big[ X_{cg}\ln(n_c p_g) - n_c p_g\big]$$MLE for gene probabilities:
$$\hat{p}_g = \frac{\text{total counts for gene }g}{\text{total counts overall}}$$MLE for cell sizes:
$$\hat{n}_c = \text{total counts in cell }c$$
These simple closed-form solutions make the Poisson model attractive for initial modeling of scRNA-seq count data, though more sophisticated models (negative binomial, zero-inflated, etc.) are often needed to account for overdispersion and other biological complexities.
This derivation is foundational for understanding normalization methods and differential expression testing in single-cell genomics.