🧬 Quantify hERV and transgenes from Bulk-RNAseq
Image credit: Illustration by Round Icons on unsplashGithub source code
Please see source codes in github that can be deploied in Cloud Cluster Computational platform and local desktops, and allows for:
- batch computaion with tunable batch size
- automatic environment setting up
- file preprocessing
- QC
- index buidling, alignment and feature counting by either
subreadorsalmon
The github repo also contains usage examples using SLURM or LSF job scheduler and handy utility tools.
What are hERVs and transgenes
Human Endogenous Retroviruses (hERVs) are ancient viral sequences embedded in the human genome. The roles of hERVs in gene regulation, immunity, development and cancer are under intense research. Transgenes including GFP, CRE, Luciferase, rtTA/Tet-On/Tet-Off and epitopically expressed genes are common targets to be quantified in sample from transgenic mouse models.
How to quantify them
To quantify them, we of course need modify fasta and gtf files to include the their sequencing and annotation. Also, because their sequencing reads are frequently repetitive or overlaping with host genome, we need a feature quantification algorithm to handle multimapping commonly seen for HERV and transgenes. Here I discussed the algorithms for feature quantification, and successfully quantified hERV and transgenes by implementing an EM algorithm.
Their fasta and gtf
The transgenes can be included as chromosomes in fasta as below
> EGFP
ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCG
ACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCT
GACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACC
CTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCA
AGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTA
CAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGC
ATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACA
ACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAA
CATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGC
CCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACG
AGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGA
CGAGCTGTACAAGTAA
The transgenes are formated as below in gtf
EGFP self exon 1 720 500 + . gene_id "custom_3"; gene_name "EGFP"
Cre self exon 1 1023 500 + . gene_id "custom_4"; gene_name "Cre"
ERT2 self exon 1 931 500 + . gene_id "custom_5"; gene_name "ERT2"
The hERV sequences are already in reference genome, but their gtf need be added as below. We will talk about how to generate the hERV gtf matching the latest reference genome in another blog.
chr10 RepeatMasker exon 3083835 3085549 4.5 - . gene_id "RepMasker_dc1fbd39-e43b-4ac4-90ef-1da20b1d6249"; gene_name "chr10:LTR/ERVL-MaLR:ORR1A0:3"
chr10 RepeatMasker exon 3114480 3115385 6 - . gene_id "RepMasker_c49d8024-0b45-46a9-bcaa-c44f7a305a03"; gene_name "chr10:LTR/ERVK:RLTR13F:8"
chr10 RepeatMasker exon 3122665 3123530 13.8 + . gene_id "RepMasker_5154950a-1e53-4528-a5cf-37d121e3e1a4"; gene_name "chr10:LTR/ERVK:RLTR13D6:11"
chr11 RepeatMasker exon 3055489 3061982 25.9 + . gene_id "RepMasker_f6a34047-357d-4ed5-9744-eaf4c862be26"; gene_name "chr11:LTR/ERVK:IAPLTR3-int:1"
chr11 RepeatMasker exon 3139464 3140302 11.7 + . gene_id "RepMasker_ad8b23ab-de54-4f73-be4d-f8405ad13e66"; gene_name "chr11:LTR/ERVK:RLTR33:3"
chr11 RepeatMasker exon 3410867 3413285 16.9 + . gene_id "RepMasker_ec06076e-c8d0-4a94-8cfa-358f352bb1e8"; gene_name "chr11:LTR/ERV1:RLTR14-int:7"
chr12 RepeatMasker exon 3052672 3059141 1.6 + . gene_id "RepMasker_8ea0db28-2cd0-4d7e-b151-b5b6480ee320"; gene_name "chr12:LTR/ERVK:IAPEz-int:1"
chr12 RepeatMasker exon 3065049 3072337 27.3 - . gene_id "RepMasker_f3379b4a-cdfa-4b24-a4c9-5dacbdaf004c"; gene_name "chr12:LTR/ERV1:MMERGLN-int:2"
chr12 RepeatMasker exon 3084084 3084938 5.8 - . gene_id "RepMasker_062a5737-d991-4220-9648-33c43ad467a2"; gene_name "chr12:LTR/ERVK:IAPEz-int:4"
How tools deal with multimapping
Below list common feature count/qunatification tools emphasizing on the core algorithms they wrapped to deal with multimapping. These common tools have similar quantification accuracy.
| Tool | Default(option) | Core algorithm |
|---|---|---|
| HTSeq-count | discard(–nonunique=all/fraction) | fraction |
| featureCounts | discard(-M –fraction) | fraction |
| RSEM | EM Probabilistic | EM |
| Salmon | EM Probabilistic | (Bayesian) EM |
| kallisto | EM Probabilistic | EM |
| StringTie | EM Probabilistic | EM |
| TEtranscripts | EM Probabilistic | EM |
| Telescope | EM Probabilistic | EM with MAP |
They by default may simply discard multimapping reads, which ignores most of hERVs. With carefully specified arguments, they either distribute read counts equally among certian features (fraction) or estimate using EM algorithm.
Why EM algorithm for multimapping
“Fraction” algorithm overlooked how likely the reads can match with reference sequences. The biological prior information, if reasonably exploited, can significantly improve the multimapping quantification. Beatifully, multimapping by nature is similar to the problem of Gaussian Mixtures and EM has been proved to resolve Gaussian Mixture Model (GMM) well.
EM is proved to resolve GMM well
GMM describes a set of data points of which each comes from one of several Gaussian distributions, but we don’t know which one. This missing information makes direct estimation (like via maximum likelihood) of the Gaussian parameters hard. EM solve this by iterating below E and M steps until converge:
E-step (Expectation): Use initiallized parameter to compute the posterior probability (sounds like bayesian, but it is not in standard EM) that each data point was generated by each Gaussian (i.e., the responsibility γ for each component).
M-step (Maximization): Update the parameters (π, μ, σ²) by performing weighted MLE using the expected values (responsibilities) computed in the E-step.
Multimapping is modeled as mixtures of distributions that account for mapping bias
Salmon model describes a set of fragments (seqeuncing reads) of which each fragment f originates from one of the several transcripts t at a probability distribution $P(t \mid f)$ , but we don’t know which one. Thus EM goes:
E-Step
For each fragment $f$ and transcript $t$ :
$$ P(t \mid f) = \frac{\alpha_t \cdot P(f \mid t)}{\sum_{t'} \alpha_{t'} \cdot P(f \mid t')} $$Where:
- $\alpha_t$ : Current estimate of the relative abundance of transcript $t$ .
- $P(f \mid t)$ : Probability of observing fragment $f$ given transcript $t$ , incorporating bias corrections.
Bias Corrections
The fragment-transcript probability is modeled as:
$$ P(f \mid t) = P_{\text{len}}(l_f \mid t) \cdot P_{\text{pos}}(p_f \mid t) \cdot P_{\text{seq}}(s_f \mid t) $$Where:
- $l_f$ : Length of fragment $f$ .
- $p_f$ : Position of fragment $f$ on transcript $t$ .
- $s_f$ : Sequence content of fragment $f$ .
- $P_{\text{len}}$ : Probability of fragment length, modeling the empirical fragment length distribution.
- $P_{\text{pos}}$ : Positional bias model, capturing biases in fragment start locations.
- $P_{\text{seq}}$ : Sequence-specific bias model, correcting for nucleotide composition effects.
Importantly, the bias models: $P_{\text{len}}$ , $P_{\text{pos}}$ and $P_{\text{seq}}$ are not directly updated by the EM algorithm during transcript abundance estimation. Instead, these bias models are learned once, typically in a preprocessing or auxiliary phase before the EM iterations begin. The EM algorithm then uses the fixed values of these bias models to compute the probability $P(f \mid t)$ in the E-step.
The only parameters updated by EM are the relative abundances $\alpha_t$ of the transcripts.
M-Step
Update the abundance estimates:
$$ \alpha_t = \sum_f P(t \mid f) $$Then normalize:
$$ \sum_t \alpha_t = 1 $$Quantificaiton using Salmon
Now we see EM algorithm superior than fraction algorithms in quantifing multimappings since EM
- is well suitable for estimating parameters in Gaussian Mixture Models, which is similar to the multimapping mathematical model
- takes multimapping mathematical models of mapping probability that accounts for the mapping bias
We implemented Salmon since it allows both EM and full bayesian EM. We will talk about EM (with MLE), EM with MAP and bayesian EM as well as Frequentist vs Bayesian in other blogs.
Please see source codes in github that can be deploied in Cloud Cluster Computational platform and local desktops, and allows for:
- batch computaion with tunable batch size
- automatic environment setting up
- file preprocessing
- QC
- index buidling, alignment and feature counting by either
subreadorsalmon
The github repo also contains usage examples using SLURM or LSF job scheduler and handy utility tools.
Excerpt scripts
Below excerpt scripts call each needed analysis step specified as ANALYSIS_STEP array, including file preprocessing, index buidling by build_index(), QC by qc_all(), alignment by align_all() if using subread, and feature counting by count_all().
for i in ${ANALYSIS_STEP[@]}; do
timed_print "$i-ing..."
case "$i" in
# for the first 2, do not run if they are in a child/forked process
index) if [ $CHILD = false ]; then build_index; fi ;;
convert) if [ $CHILD = false ]; then get_pairs_all; fi ;;
qc) qc_all ;;
align) align_all ;;
count) count_all ;;
esac
timed_print "finished $i"
done
Some of the core functions in the anylysis are below:
build index
salmon_build_index(){ # supports multiple transcripts
# get dir name for the index, which is the transcript names seperated by a hyphen(-)
# eg. for erv.fa and line.fa, you get: erv-line
local index_name="salmon/$(IFS=-; echo "${TRANSCRIPTS[*]%.*}")"
if [ "$OVER_WRITE" = "true" ] || [ ! -d "${index_name}_index" ]; then
timed_print "building salmon index @: ${index_name}_index"
# salmon requires a decoy, which in this case is the entire reference genome
if [ "$OVER_WRITE" = "true" ] || [ ! -f "${index_name}_decoys.txt" ]; then
timed_print "building decoys..."
grep "^>" "${REF_GENOME}" | cut -d " " -f 1 > "${index_name}_decoys.txt"
sed -i.bak -e 's/>//g' "${index_name}_decoys.txt"
fi
# salmon requires all the transcripts being quantified to be concat into one file
if [ "$OVER_WRITE" = "true" ] || [ ! -f "${index_name}_gentrome.fa" ]; then
timed_print "building gentrome..."
cat "${TRANSCRIPTS[@]}" "${REF_TRANSCRIPT}" "${REF_GENOME}" > "${index_name}.fa"
fi
# adding --gencode flag due to salmon doing some preprocessing for it
# this flag does nothing for none gencode files (I think)
salmon index -t ${index_name}.fa -d "${index_name}_decoys.txt" -i "${index_name}_index" --gencode
fi
}
build_index() {
case $ALIGN_METHOD in
subread) subread_build_index ;;
salmon) salmon_build_index ;;
esac
}
QC
fastp_qc(){
# results stored in /tmp/{name of the source}/qc/* and all end in *.qc.fq.gz
if [ "$OVER_WRITE" = "true" ] || [ ! -f "tmp/${SOURCE}/qc/$1.qc.fq.gz" ]; then
if [[ $# -eq 2 ]]; then # if pair ended
./fastp -i "tmp/${SOURCE}/$1.fq.gz" -o "tmp/${SOURCE}/qc/$1.qc.fq.gz" \
-I "tmp/${SOURCE}/$2.fq.gz" -O "tmp/${SOURCE}/qc/$2.qc.fq.gz" \
-j "results/fastp/$1.json" -h "results/fastp/$1.html"
elif [[ $# -eq 1 ]]; then # if single ended
./fastp -i "tmp/${SOURCE}/$1.fq.gz" -o "tmp/${SOURCE}/qc/$1.qc.fq.gz" \
-j "results/fastp/$1.json" -h "results/fastp/$1.html"
fi
fi
}
qc_all(){ # only supports fastp as of now
if [[ ! -d "tmp/${SOURCE}/qc" ]]; then
mkdir "tmp/${SOURCE}/qc"
fi
if [[ ! -d "results/${QC_METHOD}" ]]; then
mkdir "results/${QC_METHOD}"
fi
timed_print "qc-ing with $QC_METHOD"
while IFS=, read -r r1 r2; do # loop over pair file, csv
if [ "$OVER_WRITE" = "true" ] || [ ! -d "results/${ALIGN_METHOD}/$r1" ]; then
#if salmon quant already exists
timed_print "qc-ing ${r1} and ${r2}"
case $QC_METHOD in
fastp) fastp_qc $r1 $r2 ;;
esac
fi
done < $PAIR_FILE
}
quantification
salmon_quant() {
# get the same index_name as specified in "salmon_build_index()"
local index_name="salmon/$(IFS=-; echo "${TRANSCRIPTS[*]%.*}")"
#salmon outputs into a dir, so we check for that instead
#results stored in results/salmon/*
if [ "$OVER_WRITE" = "true" ] || [ ! -d "results/salmon/$1" ]; then
mkdir results/salmon/$1
if [[ $# -eq 2 ]]; then # if pair ended
salmon quant -i "${index_name}_index" -l A -1 "tmp/${SOURCE}/qc/$1.qc.fq.gz" -2 "tmp/${SOURCE}/qc/$2.qc.fq.gz" --validateMappings -o "results/salmon/$1"
#sleep 1
elif [[ $# -eq 1 ]]; then # if single ended
salmon quant -i "${index_name}_index" -l A -r "tmp/${SOURCE}/qc/$1.qc.fq.gz" --validateMappings -o "results/salmon/$1"
#sleep 1
fi
fi
}
salmon_count() {
while IFS=, read -r r1 r2; do
salmon_quant $r1 $r2
done < $PAIR_FILE
}
count_all(){
if [ ! -d "results/${ALIGN_METHOD}" ]; then
mkdir "results/${ALIGN_METHOD}"
fi
case $ALIGN_METHOD in
subread) subread_count ;;
salmon) salmon_count
esac
}