GSEA Input Guide: Should You Use All Genes or Only DEGs with p < 0.05?
Use all tested genes for GSEA
Use all tested genes, ranked by a continuous statistic—not just DEGs with p < 0.05. Filtering to differentially expressed genes (DEGs) first throws away valuable information and turns your enrichment test into something closer to an over-representation analysis (ORA) rather than true GSEA.
Which Gene List for Which Method?
GSEA / fgsea / clusterProfiler::GSEA
- Input: Ranked vector of all genes (after basic QC/mapping)
- Typical ranking metrics:
- DESeq2:
res$stat(Wald statistic) - limma:
topTable$t(t-statistic) - edgeR (LRT):
sign(logFC) * sqrt(LR)orsign(logFC) * sqrt(F)(signed score)
- DESeq2:
ORA / Hypergeometric Tests
- Input: DEG list with a cutoff, plus background universe of all tested genes
- Tools:
clusterProfiler::enrichGO,enrichKEGG
Working with edgeR Results
edgeR offers two main statistical tests, each producing different output columns. Here’s how to handle both:
Understanding edgeR Test Types
🔹 Case 1: Likelihood Ratio Test (LRT)
lrt <- glmLRT(fit, coef=2)
- Output columns:
"logFC""logCPM""LR""PValue""FDR" - The
LRcolumn is the likelihood ratio statistic (χ² with df = 1)
🔹 Case 2: Quasi-Likelihood F-test (QLF)
qlf <- glmQLFTest(fit, coef=2)
- Output columns:
"logFC""logCPM""F""PValue""FDR" - The
Fcolumn is the quasi-likelihood F-statistic
👉 If your results show an F column, you’re using QLF test. If you see LR, you used LRT.
How to Rank for GSEA in Each Case
Both test statistics are always ≥ 0 and lack directional information. We restore direction using the sign of logFC:
- LRT:
rank = sign(logFC) × √LR - QLF:
rank = sign(logFC) × √F
Both formulas give you a signed, continuous ranking statistic suitable for GSEA.
Why Use √F or √LR?
sign(logFC)→ gives direction (+1 for upregulated, -1 for downregulated)√F→ makes the statistic comparable to a t-statistic scale√LR→ makes the statistic comparable to a z-statistic scale
Understanding the Square Root: Statistical Theory
Nature of edgeR Test Statistics
🔹 LRT (Likelihood Ratio Test)
- The
LRcolumn is a likelihood ratio statistic - Under the null hypothesis:
LR ~ χ²(df = 1). The degrees of freedom (df) is 1 for single-coefficient tests, e.g. comparing two groups. See details in another blog post. - A χ² with 1 df is basically the square of a standard normal: χ²₁ = Z²
- So
LRis essentially Z² (magnitude only, no sign)
🔹 QLF (Quasi-Likelihood F-test)
- The
Fcolumn is an F-statistic. The degrees of freedom (df) is 1 for single contrast tests, e.g. comparing two groups. See details in another blog post. - With 1 numerator degree of freedom:
F₁,df₂ = t² - So the reported
Fis essentially a t²
Restoring Symmetry and Linearity
Both LR and F are squared test statistics that have lost their sign. Here’s why taking the square root matters:
Without square root:
- All values are ≥ 0 (no direction)
- The scale is quadratic: a gene twice as significant doesn’t just double the score—it grows nonlinearly
- Makes ranking unfair for GSEA
With square root:
√LRbrings it back to ≈ |Z| scale√Fbrings it back to ≈ |t| scale- The score behaves linearly and symmetrically around 0
- Multiplying by
sign(logFC)restores direction
Consistency Across Methods
This approach makes edgeR comparable to other tools:
- limma: reports t-statistic directly
- DESeq2: reports Wald statistic directly
- edgeR: reports squared stats (F or LR)
Taking sign(logFC) × √stat makes edgeR’s ranking directly comparable to limma/DESeq2 outputs.
Important Caveat: Degrees of Freedom Matter
✅ Simple Case: 1 Numerator df
For two-group comparisons or single coefficient tests:
F₁,df₂ = t²- The F-statistic is literally the squared t-statistic
√F × sign(logFC)works perfectly
⚠️ General Case: >1 Numerator df
For multi-coefficient contrasts (testing multiple conditions simultaneously):
F_{k,df₂} = (t₁² + t₂² + ... + t_k²)/k(simplified)- F is not equal to any single t², but combines multiple t-statistics
- You cannot simply take
√F × sign(logFC)because F is multivariate
For multi-df contrasts, consider:
- Taking one coefficient at a time
- Using
glmTreatto get a 1-df statistic for each gene - Using logFC as the rank metric if no simple 1-df test statistic is available
Summary Table
| Case | F vs t relationship | GSEA ranking safety |
|---|---|---|
| 1 numerator df | F = t² | ✅ Safe: use √F × sign(logFC) |
| >1 numerator df | F ≠ t²; F combines multiple t’s | ⚠️ Unsafe: need alternative approach |
Complete edgeR to GSEA Workflow
# 1. Build the ranked vector from edgeR results
# `res` is your edgeR results table
# For QLF results (with F column):
if("F" %in% colnames(res)) {
# Remove NAs and genes with missing stats
res2 <- res[!is.na(res$logFC) & !is.na(res$F), ]
# Construct signed score
score <- sign(res2$logFC) * sqrt(res2$F)
}
# For LRT results (with LR column):
if("LR" %in% colnames(res)) {
# Remove NAs and genes with missing stats
res2 <- res[!is.na(res$logFC) & !is.na(res$LR), ]
# Construct signed score
score <- sign(res2$logFC) * sqrt(res2$LR)
}
# Name and sort (decreasing order)
names(score) <- rownames(res2)
score <- score[!is.na(score)]
score <- sort(score, decreasing = TRUE)
# 2. Run fgsea
library(fgsea)
fg <- fgsea(
pathways = pathways_list,
stats = score,
minSize = 15,
maxSize = 500,
nperm = 10000
)
# View results
head(fg[order(padj), ])
Handling Gene Duplicates
If you have duplicate gene names, keep the one with the largest absolute score:
library(dplyr)
df <- data.frame(gene = names(score),
score = as.numeric(score),
stringsAsFactors = FALSE)
df2 <- df %>%
group_by(gene) %>%
slice_max(abs(score), with_ties = FALSE)
score_unique <- df2$score
names(score_unique) <- df2$gene
score_unique <- sort(score_unique, decreasing = TRUE)
Why Not Use P-values for Ranking?
This is a common question, but p-values make poor ranking statistics for several reasons:
1. P-values Lose Direction
- A p-value only indicates evidence strength, not whether a gene is up- or down-regulated
- GSEA requires directional information to properly order genes
2. P-values Distort Magnitude
- In large datasets, tiny fold-changes can get extremely small p-values
- In small datasets, large fold-changes might appear “non-significant”
- Sample size artifacts dominate biological signal
3. P-values Are Not Linear or Symmetric
- Bounded between 0 and 1 with highly skewed distribution
- Over-emphasizes a few ultra-significant genes while flattening the rest
- Test statistics (t, Wald, signed √F) are symmetric and unbounded—ideal for GSEA
4. Original GSEA Design
The original GSEA algorithm explicitly recommends using signal-to-noise ratios or test statistics, not p-values.
Best Practices Summary
✅ DO:
- Use all tested genes (don’t pre-filter by FDR)
- Use proper directional test statistics:
- edgeR:
sign(logFC) * sqrt(F) - DESeq2:
res$stat(Wald statistic) - limma:
tstatistic fromtopTable
- edgeR:
- Ensure gene IDs match between your ranking and pathway databases
- Set reasonable gene set size bounds (15-500 genes)
- Use sufficient permutations (≥10,000) for stable p-values
❌ DON’T:
- Pre-filter genes by p-value cutoffs for GSEA
- Use p-values as ranking statistics
- Ignore gene ID mismatches
- Forget to handle duplicate gene names
Conclusion
GSEA is a powerful method for functional enrichment analysis, but it requires proper input preparation. By using all genes ranked with appropriate test statistics rather than filtered DEG lists, you’ll get more reliable and interpretable results. The continuous ranking preserves the full spectrum of expression changes and enables GSEA to detect subtle but coordinated changes across gene sets.
Remember: more genes with proper ranking = better GSEA results than fewer genes with artificial cutoffs.