Understanding edgeR: A Deep Dive into Statistical Methods for RNA-seq Differential Expression Analysis

May 27, 2025·
Jiyuan (Jay) Liu
Jiyuan (Jay) Liu
· 8 min read

Introduction

edgeR is a popular R/Bioconductor software package for differential gene expression (DEG) analysis of sequencing data, particularly RNA-seq. Unlike older methods designed for microarrays that assumed normally distributed data, edgeR is specifically designed for count-based data. It models read counts using a negative binomial distribution, which accounts for the overdispersion commonly observed in RNA-seq data, where the variance is greater than the mean.

Key Concepts and Workflow

The typical workflow for a DEG analysis in edgeR involves several critical steps:

1. Data Import and DGEList Object

The process begins with a matrix of raw read counts, with rows representing genes and columns representing samples. This data is loaded into an R object called a DGEList, which stores the counts along with sample information.

2. Filtering

Genes with very low or zero counts across all samples are filtered out to improve statistical power and reduce the burden of multiple testing.

3. Normalization

RNA-seq data is susceptible to biases in sequencing depth and composition. edgeR uses the Trimmed Mean of M-values (TMM) method, which assumes that most genes are not differentially expressed between samples and uses a subset of genes to calculate normalization factors.

4. Dispersion Estimation

This is the core of edgeR’s statistical model. The dispersion parameter measures biological variability between replicates, and edgeR uses empirical Bayes methods to estimate it through a three-step process.

5. Differential Expression Testing

With dispersion estimates in place, edgeR uses statistical tests to identify genes with significant differences in expression between conditions.

6. Multiple Testing Correction

P-values are adjusted to control the False Discovery Rate (FDR), typically using the Benjamini-Hochberg method.

Dispersion Estimation: The Heart of edgeR

Dispersion estimation in edgeR is a sophisticated, tiered approach using empirical Bayes methods. This process involves three main steps:

Common Dispersion

The first step estimates a single dispersion value shared across all genes, representing overall biological variation in the dataset.

Example: Imagine two groups of samples (A and B) with a single gene:

  • Group A: 100, 110, 105
  • Group B: 200, 220, 215

Through analysis of all genes, suppose we determine the common dispersion is 0.05. This corresponds to a biological coefficient of variation (BCV) of $\sqrt{0.05} \approx 0.224$ , meaning typical biological variation between replicates is about 22.4%.

Trended Dispersion

edgeR recognizes that gene variability often depends on expression level. Highly expressed genes tend to be less variable than lowly expressed genes. A trended dispersion is estimated by plotting dispersion against average expression (logCPM) and fitting a curve.

Example:

  • Gene 1 (low expression): raw dispersion = 0.15
  • Gene 2 (medium expression): raw dispersion = 0.08
  • Gene 3 (high expression): raw dispersion = 0.03

A downward trend curve is fit showing that as average expression increases, dispersion decreases.

Tagwise (Gene-Specific) Dispersion

This step applies empirical Bayes shrinkage, taking raw gene-specific dispersion estimates and “shrinking” them toward the trended dispersion curve. The amount of shrinkage is determined by the prior degrees of freedom parameter.

Example shrinkage:

GeneRaw DispersionTrended DispersionFinal Shrinkage
Gene 10.150.120.13 (shrunk down)
Gene 20.080.080.08 (minimal change)
Gene 30.030.050.04 (shrunk up)

Empirical Bayes Moderation in Detail

The computation of tagwise dispersion uses empirical Bayes shrinkage to leverage information from the entire dataset. This is a classic Bayesian framework where we combine prior information with observed data to obtain a posterior estimate.

The Bayesian Framework

In Bayesian statistics, we update our beliefs about a parameter using:

$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$

For dispersion estimation in edgeR:

  • Prior: The trended dispersion curve (what we expect based on expression level)
  • Likelihood: The gene-specific data (raw dispersion estimate)
  • Posterior: The final moderated tagwise dispersion

Mathematical Formulation

The moderated tagwise dispersion ($\phi_g$ ) can be conceptually understood as a weighted average:

$\text{Final Dispersion} = \frac{(raw\ disp. \times n_e) + (trend\ disp. \times n_0)}{n_e + n_0}$

This formula represents the posterior mean under a conjugate prior setup, where:

  • $n_e$ = effective number of replicates for the gene (precision of the likelihood)
  • $n_0$ = prior degrees of freedom (precision of the prior)
  • The weights $n_e$ and $n_0$ determine how much we trust the individual gene’s data versus the population trend

Why “Empirical” Bayes?

The approach is called “Empirical” Bayes because:

  1. The prior is estimated from the data: The trended dispersion curve is not assumed but learned from all genes in the dataset
  2. The prior precision is estimated: $n_0$ is not fixed but estimated by assessing how variable individual dispersions are around the trend
  3. Data-driven shrinkage: The amount of shrinkage adapts to the dataset characteristics

Bayesian Interpretation of the Weights

  • High $n_e$ (many replicates, stable counts): The gene’s own data is reliable → less shrinkage toward the trend
  • Low $n_e$ (few replicates, low counts): The gene’s own data is unreliable → more shrinkage toward the trend
  • High $n_0$ : Strong confidence in the population trend → more shrinkage
  • Low $n_0$ : Weak confidence in the population trend → less shrinkage

Numeric Example:

GeneRaw DispersionTrend Dispersion$n_e$$n_0$Final Dispersion
Gene 10.150.12210(2×0.15 + 10×0.12)/(2+10) = 0.123
Gene 20.080.08310(3×0.08 + 10×0.08)/(3+10) = 0.08
Gene 30.030.05210(2×0.03 + 10×0.05)/(2+10) = 0.047

Differential Expression Testing

Statistical Model

edgeR models RNA-seq counts using a Negative Binomial (NB) distribution:

$$Y_{gi} \sim NB(\mu_{gi}, \phi_g)$$

Where:

  • $Y_{gi}$ = observed count for gene g in sample i
  • $\mu_{gi}$ = expected mean count
  • $\phi_g$ = gene-specific dispersion

Variance is: $\text{Var}(Y_{gi}) = \mu_{gi} + \phi_g\mu_{gi}^2$

Two Testing Approaches

Exact Test (Simple Two-Group Comparisons)

Based on the NB distribution, analogous to Fisher’s exact test but accounting for overdispersion.

GLM Approach (Complex Designs)

Uses a log-linear model: $\log(\mu_{gi}) = x_i\beta_g$

Numeric Example: Two-Group Comparison

Consider 3 samples in each of 2 groups (A & B) for Gene X:

SampleGroupCounts
S1A100
S2A110
S3A105
S4B200
S5B220
S6B215

Assume tagwise dispersion $\phi = 0.05$ .

Step 1: Estimate Group Means

  • $\bar{Y}_A = \frac{100 + 110 + 105}{3} = 105$
  • $\bar{Y}_B = \frac{200 + 220 + 215}{3} = 211.67$

Step 2: Calculate Variances

  • $\text{Var}_A = 105 + 0.05 \times 105^2 = 105 + 551.25 = 656.25$
  • $\text{Var}_B = 211.67 + 0.05 \times 211.67^2 \approx 2451.92$

Step 3: Compute Test Results

  • Fold change = $211.67/105 \approx 2.02$
  • The exact test computes p-value using NB variance (hypothetical p-value ≈ 0.001)

GLM Example with Batch Effects

Consider 6 RNA-seq samples with both Group (A vs B) and Batch (1 vs 2) factors:

SampleGroupBatchCount
S1A1100
S2A1110
S3A2105
S4B1200
S5B2220
S6B2215

Design Matrix

SampleInterceptGroupBatch
S1100
S2100
S3101
S4110
S5111
S6111

GLM Model

$$\log(\mu_i) = \beta_0 + \beta_1 \cdot \text{Group}_i + \beta_2 \cdot \text{Batch}_i$$

Coefficient Estimates:

  • $\beta_0 \approx \log(105) \approx 4.654$ (baseline)
  • $\beta_1 \approx \log(211.67/105) \approx 0.701$ (group effect)
  • $\beta_2 \approx 0$ (batch effect)

Fitted Means: $$\mu_i = \exp(\beta_0 + \beta_1 \cdot \text{Group}_i + \beta_2 \cdot \text{Batch}_i)$$

Likelihood Ratio Test (LRT)

The LRT compares full and reduced models:

  • Full model: includes $\beta_1$ (group) + $\beta_2$ (batch)
  • Reduced model: excludes $\beta_1$ , includes only $\beta_0 + \beta_2 \cdot \text{Batch}$

LRT Statistic: $D = 2 \times (\ell_{\text{full}} - \ell_{\text{reduced}})$

Under $H_0: \beta_1 = 0$ , $D \sim \chi^2_1$ .

Quasi-Likelihood F-Test: A More Robust Approach

For small sample sizes, edgeR offers the Quasi-Likelihood (QL) F-test, which is more robust than LRT by incorporating additional uncertainty in dispersion estimates.

Purpose

RNA-seq data often have small sample sizes, making variance estimates unstable. The QL F-test incorporates an extra layer of uncertainty in the dispersion estimate, making the test more conservative and better controlling Type I error.

Scale Factor Computation

The quasi-likelihood approach uses a scale factor computed from residual deviance:

$$d_i^2 = \frac{(Y_i - \hat{\mu}_i)^2}{\hat{\mu}_i + \phi_g\hat{\mu}_i^2}$$

This represents the squared Pearson residual for observation i.

Scale Factor: $\text{scale factor} = \frac{\sum_i d_i^2}{\text{residual df}}$

Numeric Example

Using our previous data:

Sample$Y_i$$\hat{\mu}_i$$d_i^2$
S11001050.038
S21101050.038
S31051050
S4200211.670.0555
S5220211.670.0283
S6215211.670.0045

Calculations:

  • Residual deviance = $\sum d_i^2 = 0.164$
  • Residual df = 6 - 2 = 4
  • Scale factor = 0.164/4 = 0.041

F-Statistic

The F-statistic for testing $\beta_1 = 0$ is:

$$F = \frac{(\hat{\beta}_1)^2}{\text{Var}(\hat{\beta}_1) \times \text{scale factor}}$$

With $\hat{\beta}_1 = 0.701$ and suppose quasi-variance = 0.05:

$$F = \frac{0.701^2}{0.05} \approx 9.82$$

With 1 numerator df and ~4 residual df, p-value ≈ 0.034.

Comparison of Testing Methods

FeatureExact TestLRT (glmLRT)QL F-Test (glmQLFTest)
Use CaseSimple two-groupComplex designsSmall samples, robust
StatisticNB-based exactChi-squaredF-statistic
ApproachConditionalMaximum likelihoodQuasi-likelihood
RobustnessGood for simple designsCan be anti-conservativeMore conservative, better Type I control

Understanding Pearson Residuals

The quantity $d_i^2 = \frac{(Y_i - \hat{\mu}_i)^2}{\hat{\mu}_i + \phi_g\hat{\mu}_i^2}$ represents the squared Pearson residual for a Negative Binomial GLM.

For any GLM with variance function $V(\mu_i)$ , the Pearson residual is:

$$r_i = \frac{Y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}$$

In the NB model: $V(\mu_i) = \mu_i + \phi_g\mu_i^2$

Therefore: $d_i^2 = r_i^2$

Practical Implementation

edgeR Workflow for QL F-Test

library(edgeR)
y <- DGEList(counts=counts, group=group)
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)          # Fit quasi-likelihood GLM
qlf <- glmQLFTest(fit, coef=2)      # Test Group effect
topTags(qlf)

Expected Output

GenelogFCF-statp-valueFDR
Gene X0.7019.820.0340.04

Conclusion

edgeR’s sophisticated statistical framework provides robust methods for RNA-seq differential expression analysis through:

  1. Three-step dispersion estimation using empirical Bayes shrinkage
  2. Multiple testing approaches (exact test, LRT, QL F-test) suitable for different experimental designs
  3. Proper handling of overdispersion through negative binomial modeling
  4. Conservative testing procedures that control false discovery rates

The choice between testing methods depends on experimental design complexity and sample size, with the quasi-likelihood F-test providing the most robust results for challenging datasets with small sample sizes.