🧬 A snakemake pipline-- quantify hERV, transgenes and genomics from well-based single-cell RNAseq

May 27, 2025·
Jiyuan (Jay) Liu
Jiyuan (Jay) Liu
· 5 min read

Github source code

Please see github repo.

What are hERVs and transgenes

Human Endogenous Retroviruses (hERVs) are ancient viral sequences embedded in the human genome. The roles of hERVs in gene regulation, immunity, development and cancer are under intense research. Transgenes including GFP, CRE, Luciferase, rtTA/Tet-On/Tet-Off and epitopically expressed genes are common targets to be quantified in sample from transgenic mouse models.

How to quantify them from Well-based scRNAseq

The EM algorithm is well-suited for quantifying them as elucidated in my previous blog. Briefly, EM algorithm is advantageous in dealing with multimaping, which is commonly seen for hERV and transgene quantification, because:

  • mathematically, EM is well suitable for estimating parameters in Gaussian Mixture Models, which is similar to the model describing multimapping
  • biologically, EM can takes into consideration the mapping probability and mapping bias, though Starsolo dose not use bias correction

Starsolo is particularly handy for this task because it:

  • implements EM algorithm
  • follows CellRanger logic for cell barcode whitelisting and UMI deduplication to produce nearly identical gene counts in the same format as the CellRanger does
  • is the foundation of alignment in CellRanger

Why resembling CellRanger helps

The similarity of STARsolo with CellRanger is important. While CellRanger is not a gold standard, a substantial and growing body of biological interpretations is derived from data processed by it. Aligning with CellRanger enables more meaningful comparisons with such datasets. This is particularly crucial when data integration is required, as no current method can fully correct for technical artifacts without distorting true biological signals.

Snakemake Workflow Management

Compared with that for 10x scRNAseq, the Snakemake file to quantify hERV, transgenes and genome from well-based single-cell RNAseq data has more details to pay attention to. Below discussed the key unique lines of the Snakemake file for well-based single-cell RNAseq, and please see the common parts of the Snakemake files for both well-based and 10x scRNAseq data in my previous blog.

Snakemake is used to manage a workflow as illustrated below. Briefly, I generated star index and wrangled all fastqs belonging to the same end reading and the same sample into one fastq file. The combined fastqs and star index were input to starsolo in rule star_mapping. After making sure both input fastq wrangling and rule star_mapping were done, I moved the input fastqs to a specified directory to safely carry out cleaning up.

Align and count genes, hERV and transgenes

Rule star_mapping calls STARsolo with a detailed configuration for single-cell RNA-seq analysis, using a customized CB/UMI structure and enabling EM-based handling of multimappers.

The soloType for seqWell(well-based scRNAseq) is CB_UMI_Complex as excerpted below from star-solo github

- soloType input options:
    - CB_UMI_Simple   ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.
    - CB_UMI_Complex  ... one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.

Microwell-seq uses three separate barcodes with adaptors aside them rather than a single barcode. Generating their three white lists requires reading through the methods in Mapping the Mouse Cell Atlas by Microwell-Seq. NOTE the results only showed 5’ to 3’ strands, while barcodes 2 and 3 were presented on the 3’ to 5’ strands and reversed when being transcribed onto the beads, we have to take the complement and reverse the resultant sequence. Below successfully accomplished the three white lists step by step:

The rest of the configurations are copied from the STARsolo guide on github. That gives

rule star_mapping:
    input:
        "combined_reads/{input_reads}.r1.fastq.gz",
        "combined_reads/{input_reads}.r2.fastq.gz",
        idx=expand("{star_index}", star_index=config["star_index"]),
    output: 
        directory("mapped_reads/{input_reads}/")
    message:
        "star mapping..."
    threads: 24
    resources:
        mem = 64000,
        time = '4:00:00'
    shell:
        """STAR --genomeDir {input.idx} --runThreadN {threads} \
            --soloType CB_UMI_Complex \
            --soloCBposition 0_0_0_5 0_21_0_26 0_42_0_47 \
            --soloUMIposition 0_48_0_53 \
            --soloCBwhitelist {config[wl]} \
            --clipAdapterType CellRanger4 --outFilterScoreMin 10 \
            --soloCBmatchWLtype 1MM_multi \
            --soloUMIfiltering MultiGeneUMI_CR --soloUMIdedup 1MM_CR \
            --soloFeatures Gene GeneFull SJ --soloCellFilter EmptyDrops_CR \
            --soloMultiMappers EM \
            --readFilesIn {input[1]} {input[0]} --readFilesCommand 'gunzip -c'\
            --outFileNamePrefix {output}/"""