Understanding edgeR: A Deep Dive into Statistical Methods for RNA-seq Differential Expression Analysis

Introduction

edgeR is a popular R/Bioconductor software package for differential gene expression (DEG) analysis of sequencing data, particularly RNA-seq. Unlike older methods designed for microarrays that assumed normally distributed data, edgeR is specifically designed for count-based data. It models read counts using a negative binomial distribution, which accounts for the overdispersion commonly observed in RNA-seq data, where the variance is greater than the mean.

Key Concepts and Workflow

The typical workflow for a DEG analysis in edgeR involves several critical steps:

1. Data Import and DGEList Object

The process begins with a matrix of raw read counts, with rows representing genes and columns representing samples. This data is loaded into an R object called a DGEList, which stores the counts along with sample information.

2. Filtering

Genes with very low or zero counts across all samples are filtered out to improve statistical power and reduce the burden of multiple testing.

3. Normalization

RNA-seq data is susceptible to biases in sequencing depth and composition. edgeR uses the Trimmed Mean of M-values (TMM) method, which assumes that most genes are not differentially expressed between samples and uses a subset of genes to calculate normalization factors.

4. Dispersion Estimation

This is the core of edgeR’s statistical model. The dispersion parameter measures biological variability between replicates, and edgeR uses empirical Bayes methods to estimate it through a three-step process.

5. Differential Expression Testing

With dispersion estimates in place, edgeR uses statistical tests to identify genes with significant differences in expression between conditions.

6. Multiple Testing Correction

P-values are adjusted to control the False Discovery Rate (FDR), typically using the Benjamini-Hochberg method.

Dispersion Estimation: The Heart of edgeR

Dispersion estimation in edgeR is a sophisticated, tiered approach using empirical Bayes methods. This process involves three main steps:

Common Dispersion

The first step estimates a single dispersion value shared across all genes, representing overall biological variation in the dataset.

Example: Imagine two groups of samples (A and B) with a single gene:

Group A: 100, 110, 105
Group B: 200, 220, 215

Through analysis of all genes, suppose we determine the common dispersion is 0.05. This corresponds to a biological coefficient of variation (BCV) of $\sqrt{0.05} \approx 0.224$ , meaning typical biological variation between replicates is about 22.4%.

Trended Dispersion

edgeR recognizes that gene variability often depends on expression level. Highly expressed genes tend to be less variable than lowly expressed genes. A trended dispersion is estimated by plotting dispersion against average expression (logCPM) and fitting a curve.

Example:

Gene 1 (low expression): raw dispersion = 0.15
Gene 2 (medium expression): raw dispersion = 0.08
Gene 3 (high expression): raw dispersion = 0.03

A downward trend curve is fit showing that as average expression increases, dispersion decreases.

Tagwise (Gene-Specific) Dispersion

This step applies empirical Bayes shrinkage, taking raw gene-specific dispersion estimates and “shrinking” them toward the trended dispersion curve. The amount of shrinkage is determined by the prior degrees of freedom parameter.

Example shrinkage:

Gene	Raw Dispersion	Trended Dispersion	Final Shrinkage
Gene 1	0.15	0.12	0.13 (shrunk down)
Gene 2	0.08	0.08	0.08 (minimal change)
Gene 3	0.03	0.05	0.04 (shrunk up)

Empirical Bayes Moderation in Detail

The computation of tagwise dispersion uses empirical Bayes shrinkage to leverage information from the entire dataset. This is a classic Bayesian framework where we combine prior information with observed data to obtain a posterior estimate.

The Bayesian Framework

In Bayesian statistics, we update our beliefs about a parameter using:

$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$

For dispersion estimation in edgeR:

Prior: The trended dispersion curve (what we expect based on expression level)
Likelihood: The gene-specific data (raw dispersion estimate)
Posterior: The final moderated tagwise dispersion

Mathematical Formulation

The moderated tagwise dispersion ($\phi_g$ ) can be conceptually understood as a weighted average:

$\text{Final Dispersion} = \frac{(raw\ disp. \times n_e) + (trend\ disp. \times n_0)}{n_e + n_0}$

This formula represents the posterior mean under a conjugate prior setup, where:

$n_e$ = effective number of replicates for the gene (precision of the likelihood)
$n_0$ = prior degrees of freedom (precision of the prior)
The weights $n_e$ and $n_0$ determine how much we trust the individual gene’s data versus the population trend

Why “Empirical” Bayes?

The approach is called “Empirical” Bayes because:

The prior is estimated from the data: The trended dispersion curve is not assumed but learned from all genes in the dataset
The prior precision is estimated: $n_0$ is not fixed but estimated by assessing how variable individual dispersions are around the trend
Data-driven shrinkage: The amount of shrinkage adapts to the dataset characteristics

Bayesian Interpretation of the Weights

High $n_e$ (many replicates, stable counts): The gene’s own data is reliable → less shrinkage toward the trend
Low $n_e$ (few replicates, low counts): The gene’s own data is unreliable → more shrinkage toward the trend
High $n_0$ : Strong confidence in the population trend → more shrinkage
Low $n_0$ : Weak confidence in the population trend → less shrinkage

Numeric Example:

Gene	Raw Dispersion	Trend Dispersion	$n_e$	$n_0$	Final Dispersion
Gene 1	0.15	0.12	2	10	(2×0.15 + 10×0.12)/(2+10) = 0.123
Gene 2	0.08	0.08	3	10	(3×0.08 + 10×0.08)/(3+10) = 0.08
Gene 3	0.03	0.05	2	10	(2×0.03 + 10×0.05)/(2+10) = 0.047

Differential Expression Testing

Statistical Model

edgeR models RNA-seq counts using a Negative Binomial (NB) distribution:

$$Y_{gi} \sim NB(\mu_{gi}, \phi_g)$$

Where:

$Y_{gi}$ = observed count for gene g in sample i
$\mu_{gi}$ = expected mean count
$\phi_g$ = gene-specific dispersion

Variance is: $\text{Var}(Y_{gi}) = \mu_{gi} + \phi_g\mu_{gi}^2$

Two Testing Approaches

Exact Test (Simple Two-Group Comparisons)

Based on the NB distribution, analogous to Fisher’s exact test but accounting for overdispersion.

GLM Approach (Complex Designs)

Uses a log-linear model: $\log(\mu_{gi}) = x_i\beta_g$

Numeric Example: Two-Group Comparison

Consider 3 samples in each of 2 groups (A & B) for Gene X:

Sample	Group	Counts
S1	A	100
S2	A	110
S3	A	105
S4	B	200
S5	B	220
S6	B	215

Assume tagwise dispersion $\phi = 0.05$ .

Step 1: Estimate Group Means

$\bar{Y}_A = \frac{100 + 110 + 105}{3} = 105$
$\bar{Y}_B = \frac{200 + 220 + 215}{3} = 211.67$

Step 2: Calculate Variances

$\text{Var}_A = 105 + 0.05 \times 105^2 = 105 + 551.25 = 656.25$
$\text{Var}_B = 211.67 + 0.05 \times 211.67^2 \approx 2451.92$

Step 3: Compute Test Results

Fold change = $211.67/105 \approx 2.02$
The exact test computes p-value using NB variance (hypothetical p-value ≈ 0.001)

GLM Example with Batch Effects

Consider 6 RNA-seq samples with both Group (A vs B) and Batch (1 vs 2) factors:

Sample	Group	Batch	Count
S1	A	1	100
S2	A	1	110
S3	A	2	105
S4	B	1	200
S5	B	2	220
S6	B	2	215

Design Matrix

Sample	Intercept	Group	Batch
S1	1	0	0
S2	1	0	0
S3	1	0	1
S4	1	1	0
S5	1	1	1
S6	1	1	1

GLM Model

$$\log(\mu_i) = \beta_0 + \beta_1 \cdot \text{Group}_i + \beta_2 \cdot \text{Batch}_i$$

Coefficient Estimates:

$\beta_0 \approx \log(105) \approx 4.654$ (baseline)
$\beta_1 \approx \log(211.67/105) \approx 0.701$ (group effect)
$\beta_2 \approx 0$ (batch effect)

Fitted Means: $$\mu_i = \exp(\beta_0 + \beta_1 \cdot \text{Group}_i + \beta_2 \cdot \text{Batch}_i)$$

Likelihood Ratio Test (LRT)

The LRT compares full and reduced models:

Full model: includes $\beta_1$ (group) + $\beta_2$ (batch)
Reduced model: excludes $\beta_1$ , includes only $\beta_0 + \beta_2 \cdot \text{Batch}$

LRT Statistic: $D = 2 \times (\ell_{\text{full}} - \ell_{\text{reduced}})$

Under $H_0: \beta_1 = 0$ , $D \sim \chi^2_1$ .

Quasi-Likelihood F-Test: A More Robust Approach

For small sample sizes, edgeR offers the Quasi-Likelihood (QL) F-test, which is more robust than LRT by incorporating additional uncertainty in dispersion estimates.

Purpose

RNA-seq data often have small sample sizes, making variance estimates unstable. The QL F-test incorporates an extra layer of uncertainty in the dispersion estimate, making the test more conservative and better controlling Type I error.

Scale Factor Computation

The quasi-likelihood approach uses a scale factor computed from residual deviance:

$$d_i^2 = \frac{(Y_i - \hat{\mu}_i)^2}{\hat{\mu}_i + \phi_g\hat{\mu}_i^2}$$

This represents the squared Pearson residual for observation i.

Scale Factor: $\text{scale factor} = \frac{\sum_i d_i^2}{\text{residual df}}$

Numeric Example

Using our previous data:

Sample	$Y_i$	$\hat{\mu}_i$	$d_i^2$
S1	100	105	0.038
S2	110	105	0.038
S3	105	105	0
S4	200	211.67	0.0555
S5	220	211.67	0.0283
S6	215	211.67	0.0045

Calculations:

Residual deviance = $\sum d_i^2 = 0.164$
Residual df = 6 - 2 = 4
Scale factor = 0.164/4 = 0.041

F-Statistic

The F-statistic for testing $\beta_1 = 0$ is:

$$F = \frac{(\hat{\beta}_1)^2}{\text{Var}(\hat{\beta}_1) \times \text{scale factor}}$$

With $\hat{\beta}_1 = 0.701$ and suppose quasi-variance = 0.05:

$$F = \frac{0.701^2}{0.05} \approx 9.82$$

With 1 numerator df and ~4 residual df, p-value ≈ 0.034.

Comparison of Testing Methods

Feature	Exact Test	LRT (glmLRT)	QL F-Test (glmQLFTest)
Use Case	Simple two-group	Complex designs	Small samples, robust
Statistic	NB-based exact	Chi-squared	F-statistic
Approach	Conditional	Maximum likelihood	Quasi-likelihood
Robustness	Good for simple designs	Can be anti-conservative	More conservative, better Type I control

Understanding Pearson Residuals

The quantity $d_i^2 = \frac{(Y_i - \hat{\mu}_i)^2}{\hat{\mu}_i + \phi_g\hat{\mu}_i^2}$ represents the squared Pearson residual for a Negative Binomial GLM.

For any GLM with variance function $V(\mu_i)$ , the Pearson residual is:

$$r_i = \frac{Y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}$$

In the NB model: $V(\mu_i) = \mu_i + \phi_g\mu_i^2$

Therefore: $d_i^2 = r_i^2$

Practical Implementation

edgeR Workflow for QL F-Test

library(edgeR)
y <- DGEList(counts=counts, group=group)
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)          # Fit quasi-likelihood GLM
qlf <- glmQLFTest(fit, coef=2)      # Test Group effect
topTags(qlf)

Expected Output

Gene	logFC	F-stat	p-value	FDR
Gene X	0.701	9.82	0.034	0.04

Conclusion

edgeR’s sophisticated statistical framework provides robust methods for RNA-seq differential expression analysis through:

Three-step dispersion estimation using empirical Bayes shrinkage
Multiple testing approaches (exact test, LRT, QL F-test) suitable for different experimental designs
Proper handling of overdispersion through negative binomial modeling
Conservative testing procedures that control false discovery rates

The choice between testing methods depends on experimental design complexity and sample size, with the quasi-likelihood F-test providing the most robust results for challenging datasets with small sample sizes.