GSEA Input Guide: Should You Use All Genes or Only DEGs with p < 0.05?

Jan 15, 2024·
Jiyuan (Jay) Liu
Jiyuan (Jay) Liu
· 6 min read

Use all tested genes for GSEA

Use all tested genes, ranked by a continuous statistic—not just DEGs with p < 0.05. Filtering to differentially expressed genes (DEGs) first throws away valuable information and turns your enrichment test into something closer to an over-representation analysis (ORA) rather than true GSEA.

Which Gene List for Which Method?

GSEA / fgsea / clusterProfiler::GSEA

  • Input: Ranked vector of all genes (after basic QC/mapping)
  • Typical ranking metrics:
    • DESeq2: res$stat (Wald statistic)
    • limma: topTable$t (t-statistic)
    • edgeR (LRT): sign(logFC) * sqrt(LR) or sign(logFC) * sqrt(F) (signed score)

ORA / Hypergeometric Tests

  • Input: DEG list with a cutoff, plus background universe of all tested genes
  • Tools: clusterProfiler::enrichGO, enrichKEGG

Working with edgeR Results

edgeR offers two main statistical tests, each producing different output columns. Here’s how to handle both:

Understanding edgeR Test Types

🔹 Case 1: Likelihood Ratio Test (LRT)

lrt <- glmLRT(fit, coef=2)
  • Output columns: "logFC" "logCPM" "LR" "PValue" "FDR"
  • The LR column is the likelihood ratio statistic (χ² with df = 1)

🔹 Case 2: Quasi-Likelihood F-test (QLF)

qlf <- glmQLFTest(fit, coef=2)
  • Output columns: "logFC" "logCPM" "F" "PValue" "FDR"
  • The F column is the quasi-likelihood F-statistic

👉 If your results show an F column, you’re using QLF test. If you see LR, you used LRT.

How to Rank for GSEA in Each Case

Both test statistics are always ≥ 0 and lack directional information. We restore direction using the sign of logFC:

  • LRT: rank = sign(logFC) × √LR
  • QLF: rank = sign(logFC) × √F

Both formulas give you a signed, continuous ranking statistic suitable for GSEA.

Why Use √F or √LR?

  • sign(logFC) → gives direction (+1 for upregulated, -1 for downregulated)
  • √F → makes the statistic comparable to a t-statistic scale
  • √LR → makes the statistic comparable to a z-statistic scale

Understanding the Square Root: Statistical Theory

Nature of edgeR Test Statistics

🔹 LRT (Likelihood Ratio Test)

  • The LR column is a likelihood ratio statistic
  • Under the null hypothesis: LR ~ χ²(df = 1). The degrees of freedom (df) is 1 for single-coefficient tests, e.g. comparing two groups. See details in another blog post.
  • A χ² with 1 df is basically the square of a standard normal: χ²₁ = Z²
  • So LR is essentially Z² (magnitude only, no sign)

🔹 QLF (Quasi-Likelihood F-test)

  • The F column is an F-statistic. The degrees of freedom (df) is 1 for single contrast tests, e.g. comparing two groups. See details in another blog post.
  • With 1 numerator degree of freedom: F₁,df₂ = t²
  • So the reported F is essentially a t²

Restoring Symmetry and Linearity

Both LR and F are squared test statistics that have lost their sign. Here’s why taking the square root matters:

Without square root:

  • All values are ≥ 0 (no direction)
  • The scale is quadratic: a gene twice as significant doesn’t just double the score—it grows nonlinearly
  • Makes ranking unfair for GSEA

With square root:

  • √LR brings it back to ≈ |Z| scale
  • √F brings it back to ≈ |t| scale
  • The score behaves linearly and symmetrically around 0
  • Multiplying by sign(logFC) restores direction

Consistency Across Methods

This approach makes edgeR comparable to other tools:

  • limma: reports t-statistic directly
  • DESeq2: reports Wald statistic directly
  • edgeR: reports squared stats (F or LR)

Taking sign(logFC) × √stat makes edgeR’s ranking directly comparable to limma/DESeq2 outputs.

Important Caveat: Degrees of Freedom Matter

✅ Simple Case: 1 Numerator df

For two-group comparisons or single coefficient tests:

  • F₁,df₂ = t²
  • The F-statistic is literally the squared t-statistic
  • √F × sign(logFC) works perfectly

⚠️ General Case: >1 Numerator df

For multi-coefficient contrasts (testing multiple conditions simultaneously):

  • F_{k,df₂} = (t₁² + t₂² + ... + t_k²)/k (simplified)
  • F is not equal to any single t², but combines multiple t-statistics
  • You cannot simply take √F × sign(logFC) because F is multivariate

For multi-df contrasts, consider:

  • Taking one coefficient at a time
  • Using glmTreat to get a 1-df statistic for each gene
  • Using logFC as the rank metric if no simple 1-df test statistic is available

Summary Table

CaseF vs t relationshipGSEA ranking safety
1 numerator dfF = t²✅ Safe: use √F × sign(logFC)
>1 numerator dfF ≠ t²; F combines multiple t’s⚠️ Unsafe: need alternative approach

Complete edgeR to GSEA Workflow

# 1. Build the ranked vector from edgeR results
# `res` is your edgeR results table

# For QLF results (with F column):
if("F" %in% colnames(res)) {
  # Remove NAs and genes with missing stats
  res2 <- res[!is.na(res$logFC) & !is.na(res$F), ]
  # Construct signed score
  score <- sign(res2$logFC) * sqrt(res2$F)
}

# For LRT results (with LR column):
if("LR" %in% colnames(res)) {
  # Remove NAs and genes with missing stats  
  res2 <- res[!is.na(res$logFC) & !is.na(res$LR), ]
  # Construct signed score
  score <- sign(res2$logFC) * sqrt(res2$LR)
}

# Name and sort (decreasing order)
names(score) <- rownames(res2)
score <- score[!is.na(score)]
score <- sort(score, decreasing = TRUE)

# 2. Run fgsea
library(fgsea)

fg <- fgsea(
  pathways = pathways_list,
  stats    = score,
  minSize  = 15,
  maxSize  = 500,
  nperm    = 10000
)

# View results
head(fg[order(padj), ])

Handling Gene Duplicates

If you have duplicate gene names, keep the one with the largest absolute score:

library(dplyr)

df <- data.frame(gene = names(score), 
                 score = as.numeric(score), 
                 stringsAsFactors = FALSE)

df2 <- df %>% 
  group_by(gene) %>% 
  slice_max(abs(score), with_ties = FALSE)

score_unique <- df2$score
names(score_unique) <- df2$gene
score_unique <- sort(score_unique, decreasing = TRUE)

Why Not Use P-values for Ranking?

This is a common question, but p-values make poor ranking statistics for several reasons:

1. P-values Lose Direction

  • A p-value only indicates evidence strength, not whether a gene is up- or down-regulated
  • GSEA requires directional information to properly order genes

2. P-values Distort Magnitude

  • In large datasets, tiny fold-changes can get extremely small p-values
  • In small datasets, large fold-changes might appear “non-significant”
  • Sample size artifacts dominate biological signal

3. P-values Are Not Linear or Symmetric

  • Bounded between 0 and 1 with highly skewed distribution
  • Over-emphasizes a few ultra-significant genes while flattening the rest
  • Test statistics (t, Wald, signed √F) are symmetric and unbounded—ideal for GSEA

4. Original GSEA Design

The original GSEA algorithm explicitly recommends using signal-to-noise ratios or test statistics, not p-values.

Best Practices Summary

DO:

  • Use all tested genes (don’t pre-filter by FDR)
  • Use proper directional test statistics:
    • edgeR: sign(logFC) * sqrt(F)
    • DESeq2: res$stat (Wald statistic)
    • limma: t statistic from topTable
  • Ensure gene IDs match between your ranking and pathway databases
  • Set reasonable gene set size bounds (15-500 genes)
  • Use sufficient permutations (≥10,000) for stable p-values

DON’T:

  • Pre-filter genes by p-value cutoffs for GSEA
  • Use p-values as ranking statistics
  • Ignore gene ID mismatches
  • Forget to handle duplicate gene names

Conclusion

GSEA is a powerful method for functional enrichment analysis, but it requires proper input preparation. By using all genes ranked with appropriate test statistics rather than filtered DEG lists, you’ll get more reliable and interpretable results. The continuous ranking preserves the full spectrum of expression changes and enables GSEA to detect subtle but coordinated changes across gene sets.

Remember: more genes with proper ranking = better GSEA results than fewer genes with artificial cutoffs.