Quantify Un-/Under- Annotated Features (hERVs, Transgenes) from (Single-Cell) Transcriptomics: Math Algorithms, Implementation and Turnkey Workflows

May 25, 2025 · 5 min read
Image credit: Nano Banana

Introduction

Transformative innovations in biomedical science and therapeutics may lie within underappreciated regions of the DNA, which are only now beginning to be annotated.

Human Endogenous Retroviruses (hERVs) are ancient viral sequences embedded in the human genome. The roles of hERVs in gene regulation, immunity, development and cancer are under intense research, though their annotation was just started and has not been assembled into genome references. Transgenes including GFP, CRE, Luciferase, rtTA/Tet-On/Tet-Off and epitopically expressed genes are essential targets to be quantified in samples from transgenic mouse models to reveal the mechanisms of pathogenesis and therapeutic targets.

To quantify them, we need modify and create fasta and gtf files to include the their sequencing and annotation. Also, because their sequencing reads are frequently repetitive or overlaping with host genome, we need a feature quantification algorithm to handle multimapping commonly seen for HERV and transgenes.

Here I discussed the algorithms for feature quantification, and successfully quantified hERV and transgenes by implementing EM algorithms and generating gtf files that match the latest reference genome GRCm39 for below modalities:

  • Bulk-RNAseq
  • 10x single-cell RNAseq
  • Well-based single-cell RNAseq

And I complied turnkey pipelines using Snakemake.


📚 Table of Contents

Click on 🧭 Intro for a quick overview, 📝 Blog for an intuitive walkthrough with mathematic reasoning, or 🐙 Git for code and use examples.
#TitleIntroBlogGitHub
1Quantify hERV and transgenes from Bulk-RNAseq🧭📝🐙
2Generate and validate ERV gtf that matches the latest reference genome🧭📝🐙
3A snakemake pipline– quantify hERV, transgenes and genomics from 10x scRNAseq🧭📝🐙
4A snakemake pipline– quantify hERV, transgenes and genomics from well-based single-cell RNAseq🧭📝🐙

Contents

Please click 📝 Blog in the section titles for intuitive walkthrough of the content in each section, and click 🐙 Git for source codes and use examples.

1. Quantify hERV and transgenes from Bulk-RNAseq 📝Blog | 🐙Git

RNA Seq Analysis
Image credit: Round Icons on Unsplash

What are hERVs and transgenes

Human Endogenous Retroviruses (hERVs) are ancient viral sequences embedded in the human genome. The roles of hERVs in gene regulation, immunity, development and cancer are under intense research. Transgenes including GFP, CRE, Luciferase, rtTA/Tet-On/Tet-Off and epitopically expressed genes are common targets to be quantified in sample from transgenic mouse models.

How to quantify them

To quantify them, we of course need modify fasta and gtf files to include the their sequencing and annotation. Also, because their sequencing reads are frequently repetitive or overlaping with host genome, we need a feature quantification algorithm to handle multimapping commonly seen for HERV and transgenes. Here I discussed the algorithms for feature quantification, and successfully quantified hERV and transgenes by implementing an EM algorithm.


2. Generate and validate ERV gtf that matches the latest reference genome 📝Blog | 🐙Git

RNA Seq Analysis
Image credit: Heather Green on Unsplash

ERV gtf is the annotation file describes the start and end position of each of ERV in the genome. And itself or its corresponding fasta file (containing ERV sequences) is required to quantify the ERV from the RNAseq reads.

GRCm39 is preferred to have most accurate and up-to-date quantification of genes or other features from RNAseq reads, because of the key improvements made in GRCm39 compared with GRCm38:

  • Improved accuracy in assembly with updated sequencing data
  • More corrected sequencing gaps and mis-assemblies
  • Better representation of segmental duplications and structural variations
  • Enhanced annotation coverage

Moreover, ERV and genes have to be quantify simultaneously. Therefore, using the GRCm38 instead of the GRCm39 pair results in less accurate quantification of both genes and ERVs

But the ERV gtf compatible with the latest mouse genome GRCm39 is not publicly available. Here I generated and validated the ERV gtf based on GRCm39 using RepeatMasker and customized scripts.


3. A snakemake pipline to quantify hERV, transgenes and genomics from 10x single-cell RNAseq data 📝Blog | 🐙Git

RNA Seq Analysis
Image credit: Creative Idea on Unsplash

The EM algorithm is well-suited for quantifying them as elucidated in my previous blog. Briefly, EM algorithm is advantageous in dealing with multimaping, which is commonly seen for hERV and transgene quantification, because:

  • mathematically, EM is well suitable for estimating parameters in Gaussian Mixture Models, which is similar to the model describing multimapping
  • biologically, EM can takes into consideration the mapping probability and mapping bias, though Starsolo dose not use bias correction

To quantify them from 10x scRNAseq, here I carefully tuned the (hyper)parameters to allow Starsolo to:

  • implement EM algorithm
  • produce nearly identical gene counts in the same format as the 10x CellRanger does
  • empower downstream analysis to leverage a substantial and growing body of biological interpretations derived from CellRanger processed datasets

And complied a turnkey pipeline using Snakemake.


4. A snakemake pipline– quantify hERV, transgenes and genomics from well-based single-cell RNAseq 📝Blog | 🐙Git

RNA Seq Analysis
Image credit: Heather Green on Unsplash

Compared with that for 10x scRNAseq, the Snakemake file to quantify hERV, transgenes and genome from well-based single-cell RNAseq data has more details to pay attention to. Here I discussed the key unique lines of the Snakemake file for well-based single-cell RNAseq, and please see the common parts of the Snakemake files for both well-based and 10x scRNAseq data in my previous blog.