Understanding edgeR: A Deep Dive into Statistical Methods for RNA-seq Differential Expression Analysis
Image credit: ** Logan Voss on Unsplash**Introduction
edgeR is a popular R/Bioconductor software package for differential gene expression (DEG) analysis of sequencing data, particularly RNA-seq. Unlike older methods designed for microarrays that assumed normally distributed data, edgeR is specifically designed for count-based data. It models read counts using a negative binomial distribution, which accounts for the overdispersion commonly observed in RNA-seq data, where the variance is greater than the mean.
Key Concepts and Workflow
The typical workflow for a DEG analysis in edgeR involves several critical steps:
1. Data Import and DGEList Object
The process begins with a matrix of raw read counts, with rows representing genes and columns representing samples. This data is loaded into an R object called a DGEList, which stores the counts along with sample information.
2. Filtering
Genes with very low or zero counts across all samples are filtered out to improve statistical power and reduce the burden of multiple testing.
3. Normalization
RNA-seq data is susceptible to biases in sequencing depth and composition. edgeR uses the Trimmed Mean of M-values (TMM) method, which assumes that most genes are not differentially expressed between samples and uses a subset of genes to calculate normalization factors.
4. Dispersion Estimation
This is the core of edgeR’s statistical model. The dispersion parameter measures biological variability between replicates, and edgeR uses empirical Bayes methods to estimate it through a three-step process.
5. Differential Expression Testing
With dispersion estimates in place, edgeR uses statistical tests to identify genes with significant differences in expression between conditions.
6. Multiple Testing Correction
P-values are adjusted to control the False Discovery Rate (FDR), typically using the Benjamini-Hochberg method.
Dispersion Estimation: The Heart of edgeR
Dispersion estimation in edgeR is a sophisticated, tiered approach using empirical Bayes methods. This process involves three main steps:
Common Dispersion
The first step estimates a single dispersion value shared across all genes, representing overall biological variation in the dataset.
Example: Imagine two groups of samples (A and B) with a single gene:
- Group A: 100, 110, 105
- Group B: 200, 220, 215
Through analysis of all genes, suppose we determine the common dispersion is 0.05. This corresponds to a biological coefficient of variation (BCV) of $\sqrt{0.05} \approx 0.224$ , meaning typical biological variation between replicates is about 22.4%.
Trended Dispersion
edgeR recognizes that gene variability often depends on expression level. Highly expressed genes tend to be less variable than lowly expressed genes. A trended dispersion is estimated by plotting dispersion against average expression (logCPM) and fitting a curve.
Example:
- Gene 1 (low expression): raw dispersion = 0.15
- Gene 2 (medium expression): raw dispersion = 0.08
- Gene 3 (high expression): raw dispersion = 0.03
A downward trend curve is fit showing that as average expression increases, dispersion decreases.
Tagwise (Gene-Specific) Dispersion
This step applies empirical Bayes shrinkage, taking raw gene-specific dispersion estimates and “shrinking” them toward the trended dispersion curve. The amount of shrinkage is determined by the prior degrees of freedom parameter.
Example shrinkage:
| Gene | Raw Dispersion | Trended Dispersion | Final Shrinkage |
|---|---|---|---|
| Gene 1 | 0.15 | 0.12 | 0.13 (shrunk down) |
| Gene 2 | 0.08 | 0.08 | 0.08 (minimal change) |
| Gene 3 | 0.03 | 0.05 | 0.04 (shrunk up) |
Empirical Bayes Moderation in Detail
The computation of tagwise dispersion uses empirical Bayes shrinkage to leverage information from the entire dataset. This is a classic Bayesian framework where we combine prior information with observed data to obtain a posterior estimate.
The Bayesian Framework
In Bayesian statistics, we update our beliefs about a parameter using:
$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$For dispersion estimation in edgeR:
- Prior: The trended dispersion curve (what we expect based on expression level)
- Likelihood: The gene-specific data (raw dispersion estimate)
- Posterior: The final moderated tagwise dispersion
Mathematical Formulation
The moderated tagwise dispersion ($\phi_g$ ) can be conceptually understood as a weighted average:
$\text{Final Dispersion} = \frac{(raw\ disp. \times n_e) + (trend\ disp. \times n_0)}{n_e + n_0}$This formula represents the posterior mean under a conjugate prior setup, where:
- $n_e$ = effective number of replicates for the gene (precision of the likelihood)
- $n_0$ = prior degrees of freedom (precision of the prior)
- The weights $n_e$ and $n_0$ determine how much we trust the individual gene’s data versus the population trend
Why “Empirical” Bayes?
The approach is called “Empirical” Bayes because:
- The prior is estimated from the data: The trended dispersion curve is not assumed but learned from all genes in the dataset
- The prior precision is estimated: $n_0$ is not fixed but estimated by assessing how variable individual dispersions are around the trend
- Data-driven shrinkage: The amount of shrinkage adapts to the dataset characteristics
Bayesian Interpretation of the Weights
- High $n_e$ (many replicates, stable counts): The gene’s own data is reliable → less shrinkage toward the trend
- Low $n_e$ (few replicates, low counts): The gene’s own data is unreliable → more shrinkage toward the trend
- High $n_0$ : Strong confidence in the population trend → more shrinkage
- Low $n_0$ : Weak confidence in the population trend → less shrinkage
Numeric Example:
| Gene | Raw Dispersion | Trend Dispersion | $n_e$ | $n_0$ | Final Dispersion |
|---|---|---|---|---|---|
| Gene 1 | 0.15 | 0.12 | 2 | 10 | (2×0.15 + 10×0.12)/(2+10) = 0.123 |
| Gene 2 | 0.08 | 0.08 | 3 | 10 | (3×0.08 + 10×0.08)/(3+10) = 0.08 |
| Gene 3 | 0.03 | 0.05 | 2 | 10 | (2×0.03 + 10×0.05)/(2+10) = 0.047 |
Differential Expression Testing
Statistical Model
edgeR models RNA-seq counts using a Negative Binomial (NB) distribution:
$$Y_{gi} \sim NB(\mu_{gi}, \phi_g)$$Where:
- $Y_{gi}$ = observed count for gene g in sample i
- $\mu_{gi}$ = expected mean count
- $\phi_g$ = gene-specific dispersion
Variance is: $\text{Var}(Y_{gi}) = \mu_{gi} + \phi_g\mu_{gi}^2$
Two Testing Approaches
Exact Test (Simple Two-Group Comparisons)
Based on the NB distribution, analogous to Fisher’s exact test but accounting for overdispersion.
GLM Approach (Complex Designs)
Uses a log-linear model: $\log(\mu_{gi}) = x_i\beta_g$
Numeric Example: Two-Group Comparison
Consider 3 samples in each of 2 groups (A & B) for Gene X:
| Sample | Group | Counts |
|---|---|---|
| S1 | A | 100 |
| S2 | A | 110 |
| S3 | A | 105 |
| S4 | B | 200 |
| S5 | B | 220 |
| S6 | B | 215 |
Assume tagwise dispersion $\phi = 0.05$ .
Step 1: Estimate Group Means
- $\bar{Y}_A = \frac{100 + 110 + 105}{3} = 105$
- $\bar{Y}_B = \frac{200 + 220 + 215}{3} = 211.67$
Step 2: Calculate Variances
- $\text{Var}_A = 105 + 0.05 \times 105^2 = 105 + 551.25 = 656.25$
- $\text{Var}_B = 211.67 + 0.05 \times 211.67^2 \approx 2451.92$
Step 3: Compute Test Results
- Fold change = $211.67/105 \approx 2.02$
- The exact test computes p-value using NB variance (hypothetical p-value ≈ 0.001)
GLM Example with Batch Effects
Consider 6 RNA-seq samples with both Group (A vs B) and Batch (1 vs 2) factors:
| Sample | Group | Batch | Count |
|---|---|---|---|
| S1 | A | 1 | 100 |
| S2 | A | 1 | 110 |
| S3 | A | 2 | 105 |
| S4 | B | 1 | 200 |
| S5 | B | 2 | 220 |
| S6 | B | 2 | 215 |
Design Matrix
| Sample | Intercept | Group | Batch |
|---|---|---|---|
| S1 | 1 | 0 | 0 |
| S2 | 1 | 0 | 0 |
| S3 | 1 | 0 | 1 |
| S4 | 1 | 1 | 0 |
| S5 | 1 | 1 | 1 |
| S6 | 1 | 1 | 1 |
GLM Model
$$\log(\mu_i) = \beta_0 + \beta_1 \cdot \text{Group}_i + \beta_2 \cdot \text{Batch}_i$$Coefficient Estimates:
- $\beta_0 \approx \log(105) \approx 4.654$ (baseline)
- $\beta_1 \approx \log(211.67/105) \approx 0.701$ (group effect)
- $\beta_2 \approx 0$ (batch effect)
Fitted Means: $$\mu_i = \exp(\beta_0 + \beta_1 \cdot \text{Group}_i + \beta_2 \cdot \text{Batch}_i)$$
Likelihood Ratio Test (LRT)
The LRT compares full and reduced models:
- Full model: includes $\beta_1$ (group) + $\beta_2$ (batch)
- Reduced model: excludes $\beta_1$ , includes only $\beta_0 + \beta_2 \cdot \text{Batch}$
LRT Statistic: $D = 2 \times (\ell_{\text{full}} - \ell_{\text{reduced}})$
Under $H_0: \beta_1 = 0$ , $D \sim \chi^2_1$ .
Quasi-Likelihood F-Test: A More Robust Approach
For small sample sizes, edgeR offers the Quasi-Likelihood (QL) F-test, which is more robust than LRT by incorporating additional uncertainty in dispersion estimates.
Purpose
RNA-seq data often have small sample sizes, making variance estimates unstable. The QL F-test incorporates an extra layer of uncertainty in the dispersion estimate, making the test more conservative and better controlling Type I error.
Scale Factor Computation
The quasi-likelihood approach uses a scale factor computed from residual deviance:
$$d_i^2 = \frac{(Y_i - \hat{\mu}_i)^2}{\hat{\mu}_i + \phi_g\hat{\mu}_i^2}$$This represents the squared Pearson residual for observation i.
Scale Factor: $\text{scale factor} = \frac{\sum_i d_i^2}{\text{residual df}}$
Numeric Example
Using our previous data:
| Sample | $Y_i$ | $\hat{\mu}_i$ | $d_i^2$ |
|---|---|---|---|
| S1 | 100 | 105 | 0.038 |
| S2 | 110 | 105 | 0.038 |
| S3 | 105 | 105 | 0 |
| S4 | 200 | 211.67 | 0.0555 |
| S5 | 220 | 211.67 | 0.0283 |
| S6 | 215 | 211.67 | 0.0045 |
Calculations:
- Residual deviance = $\sum d_i^2 = 0.164$
- Residual df = 6 - 2 = 4
- Scale factor = 0.164/4 = 0.041
F-Statistic
The F-statistic for testing $\beta_1 = 0$ is:
$$F = \frac{(\hat{\beta}_1)^2}{\text{Var}(\hat{\beta}_1) \times \text{scale factor}}$$With $\hat{\beta}_1 = 0.701$ and suppose quasi-variance = 0.05:
$$F = \frac{0.701^2}{0.05} \approx 9.82$$With 1 numerator df and ~4 residual df, p-value ≈ 0.034.
Comparison of Testing Methods
| Feature | Exact Test | LRT (glmLRT) | QL F-Test (glmQLFTest) |
|---|---|---|---|
| Use Case | Simple two-group | Complex designs | Small samples, robust |
| Statistic | NB-based exact | Chi-squared | F-statistic |
| Approach | Conditional | Maximum likelihood | Quasi-likelihood |
| Robustness | Good for simple designs | Can be anti-conservative | More conservative, better Type I control |
Understanding Pearson Residuals
The quantity $d_i^2 = \frac{(Y_i - \hat{\mu}_i)^2}{\hat{\mu}_i + \phi_g\hat{\mu}_i^2}$ represents the squared Pearson residual for a Negative Binomial GLM.
For any GLM with variance function $V(\mu_i)$ , the Pearson residual is:
$$r_i = \frac{Y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}$$In the NB model: $V(\mu_i) = \mu_i + \phi_g\mu_i^2$
Therefore: $d_i^2 = r_i^2$
Practical Implementation
edgeR Workflow for QL F-Test
library(edgeR)
y <- DGEList(counts=counts, group=group)
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design) # Fit quasi-likelihood GLM
qlf <- glmQLFTest(fit, coef=2) # Test Group effect
topTags(qlf)
Expected Output
| Gene | logFC | F-stat | p-value | FDR |
|---|---|---|---|---|
| Gene X | 0.701 | 9.82 | 0.034 | 0.04 |
Conclusion
edgeR’s sophisticated statistical framework provides robust methods for RNA-seq differential expression analysis through:
- Three-step dispersion estimation using empirical Bayes shrinkage
- Multiple testing approaches (exact test, LRT, QL F-test) suitable for different experimental designs
- Proper handling of overdispersion through negative binomial modeling
- Conservative testing procedures that control false discovery rates
The choice between testing methods depends on experimental design complexity and sample size, with the quasi-likelihood F-test providing the most robust results for challenging datasets with small sample sizes.