🛠️ Generate and validate ERV gtf that matches the latest reference genome

Github source code

What are ERV

Endogenous Retroviruses (ERV) are ancient viral sequences embedded in the mammalian genome. The roles of hERVs in gene regulation, immunity, development and cancer are under intense research.

Why is ERV gtf needed

ERV gtf is the annotation file describes the start and end position of each of ERV in the genome. And itself or its corresponding fasta file (containing ERV sequences) is required to quantify the ERV from the RNAseq reads.

Why is ERV gtf compatible with GRCm39 needed

GRCm39 is preferred to have most accurate and up-to-date quantification of genes or other features from RNAseq reads, because of the key improvements made in GRCm39 compared with GRCm38:

Improved accuracy in assembly with updated sequencing data
More corrected sequencing gaps and mis-assemblies
Better representation of segmental duplications and structural variations
Enhanced annotation coverage

Moreover, ERV and genes have to be quantify simultaneously for below reasons. Therefore, using the GRCm38 and its ERV annotation instead of the GRCm39 pair results in less accurate quantification of both genes and ERVs

properly handling multi-mapping: ERV sequences can be similar or overlap with endogenous genes. If ERVs and genes aren’t quantified together, reads may be incorrectly assigned to one while the other is completely ignored.
accurately read assignment: regardless multimappings, quantifying ERV with genes ensures a good control of false positives by including a competitive pool of unmatched features.

How to generate and validate the ERV gtf compatible with GRCm39

gEVE database was generates using RepeatMasker, thus we were encouraged to generate the latest ERV gtf in the same way. We hypothesized that the ERV gtf generated by us should show a good degree of conservation when compared with GRCm38 ERV gtf in gEVE database, and validated it correspondingly.

Generate GRCm39 ERV

1. Install RepeatMasker

Please refer to its official manuel

2. File wrangling and run RepeatMasker

Download fasta of GRCm39 (i.e. mm39), and split fasta by chromosome names by faidx provided by samtools or pyfaidx.

#!/bin/bash 
CHRS=($(seq 1 19) "X" "Y")

mkdir mus39_split

for chr in "${CHRS[@]}"
do
  echo "extracting chr${chr}"
	faidx --regex ".*chr${chr}.*" Mmus39.fa -o "mus39_split/chr${chr}.fa"
done

Then call run_repeatmasker.sh as below to run repeatmasker on each mouse chromosome.

chroms=( $(for i in $(seq 1 19) X Y; do echo "chr$i"; done) )
bash run_repeatmasker.sh "${chroms[@]}"

The run_repeatmasker.sh is below

#!/bin/bash

# save this file as run_repeatmasker.sh

# for every chr in listed in input
for chr in $@
do
    # ./repeatmasker/${chr} is output dir
    mkdir ./repeatmasker/${chr}

    # hmmer use 2 cores for each instance, 24 cores means 12 instances of hmmer
    RepeatMasker -engine hmmer -parallel 12 -species "Mus musculus"\
      -gccalc -dir ./repeatmasker/${chr} -gff ./chrs/${chr}.fa
done

3. Generate GTF file from `RepeatMasker` outputs

Below is R function repmask2gtf() to generate GTF from RepeatMasker output files. We will dissect it later.

library(data.table)
repmask2gtf <- function (name) {
  out_name <- sprintf("chr%1$s/chr%1$s.fa.out", name)
  gff_name <- sprintf("chr%1$s/chr%1$s.fa.out.gff", name)
  gtf_name <- sprintf("gtfs/chr%1$s.gtf", name)
  header <- c("bit score", "perc div.", "perc del.", "perc ins.", "chr", "begin", "end", "(left)",
              "side", "matching repeat", "repeat class/family", "rep begin", "rep end", "rep (left)",
              "ID", "inc higer score")
  data <- read.table(out_name, skip = 2, header = FALSE, fill = TRUE, col.names = paste0("V",seq_len(16)))
  names(data) <- header
  gff <- read.table(gff_name, skip=2, header=F)
  gff <- gff[!grepl("Simple|Unknown", data$`repeat class/family`),]
  data <- data[!grepl("Simple|Unknown", data$`repeat class/family`),]
  threshold <- 750
  gff <- gff[(data$`bit score` > threshold),]
  data <- data[(data$`bit score` > threshold),]
  if(nrow(gff) == 0 || nrow(data) == 0) {
    cat('No Repeatmasker annotation after filtering for',name,'\n')
    return()
  }
  seqname <- data$chr
  source <- rep(c("RepeatMasker"), each=length(seqname))
  feature <- rep(c("mRNA"), each=length(seqname))
  start <- data$begin
  end <- data$end
  score <- data$`perc div.`
  strand <- gff$V7
  frame <- rep(c("."), each=length(seqname))
  gene_id <- sapply(1:length(seqname), function(x) paste0("gene_id \"RepMasker", x, "\"", collapse=''))
  gene_name <- mapply(
    function(i, x, y, z) paste0("gene_name \"", x, ":", y, ":", z, ":", i, "\"", collapse=''),
    1:length(seqname), 
    data$chr, 
    data$`repeat class/family`, 
    data$`matching repeat`
    )
  ID <- sapply(
    1:length(seqname), 
    function(x) paste0("ID \"RepMasker", x, "\"", collapse='')
    )
  attr <- mapply(
    function(id, x, y) paste0(id, "; ", x, "; ", y),
    ID, 
    gene_id, 
    gene_name
    )
    gtf <- data.table(seqname=seqname,source=source,feature=feature,start=start,end=end,score=score,strand=strand,
    frame=frame,attribute=attr)
  write.table(gtf, file=gtf_name, quote=FALSE, sep='\t', col.names=F, row.names=F)
}

The below lines get the names of .out and .out.gff files output by repeatmasker, and define the name of the GTF to return.

  out_name <- sprintf("chr%1$s/chr%1$s.fa.out", name)
  gff_name <- sprintf("chr%1$s/chr%1$s.fa.out.gff", name)
  gtf_name <- sprintf("gtfs/chr%1$s.gtf", name)

The next few lines simply reads the files that are generated by RepeatMasker. Note that we are using the -gff setting in RepeatMasker so that it generates a .out.gff file, which besides .out file contains information required for GTF. We’re skipping the first two lines of the .out and .gff files because they are just headers. I define a header for the data read from .out.gff file.

  # make sure we're reading 16 columns, because the last column is an optional one 
  # and some rows many not have them 
  data <- read.table(out_name, skip = 2, header = FALSE, fill = TRUE,
                     col.names = paste0("V",seq_len(16)))

  names(data) <- c("bit score", "perc div.", "perc del.", "perc ins.", "chr", "begin", "end", "(left)",
              "side", "matching repeat", "repeat class/family", "rep begin", "rep end", "rep (left)",
              "ID", "inc higer score")
  
  gff <- read.table(gff_name, skip=2, header=F)

Next, we’re filtering the ERV output by RepeatMasker to only keep the sequences more likely to be genetically significant

many of the identified sequences overlap and often the classification that best represents the true origin of the sequence has the highest score. So we are filtering for that by having a threshold for the bit score.
the .out.gff file is ordered the same way as the .out file, so can also filter based on the Simple/Unknown classification in .out file. Simple repeats tend to score higher than ERV or LINE.

  gff <- gff[!grepl("Simple|Unknown", data$`repeat class/family`),]
  data <- data[!grepl("Simple|Unknown", data$`repeat class/family`),]
  
  if(length(gff) == 0 || length(data) == 0) {return()}
  
  threshold <- 750
  gff <- gff[(data$`bit score` > threshold),]
  data <- data[(data$`bit score` > threshold),]
  
  if(length(gff) == 0 || length(data) == 0) {return()}

Then, we generate each column for the GTF. A major issue is that AGAT, the tool we use to turn GTF files into FASTA files for validating our ERV GTF, requires that each sequence has an ID and be marked as mRNA.

  seqname <- data$chr
  source <- rep(c("RepeatMasker"), each=length(seqname))
  feature <- rep(c("mRNA"), each=length(seqname))
  start <- data$begin
  end <- data$end
  score <- data$`perc div.`
  strand <- gff$V7
  frame <- rep(c("."), each=length(seqname))
  gene_id <- sapply(1:length(seqname), 
                    function(x){
                      paste0("gene_id \"RepMasker", 
                             x, 
                             "\"", 
                             collapse='')}
                    )
  
  gene_name <- mapply(
    function(i, x, y, z){
      paste0("gene_name \"", 
             x, ":",
             y, ":", 
             z, ":", 
             i, "\"",
             collapse='')
    },
    1:length(seqname), 
    data$chr,
    data$`repeat class/family`, 
    data$`matching repeat`)
  
  ID <- sapply(
    1:length(seqname), 
    function(x){
      paste0("ID \"RepMasker", x, "\"", collapse='')}
    )
  
  attr <- mapply(
    function(id, x, y) paste0(id, "; ", x, "; ", y), 
    ID,
    gene_id, 
    gene_name)

Last, output the GTF

  gtf <- data.table(
    seqname=seqname,
    source=source,
    feature=feature,
    start=start,
    end=end,
    score=score,
    strand=strand,
    frame=frame,
    attribute=attr
  )
  write.table(gtf, file=gtf_name, quote=FALSE, sep='\t', col.names=F, row.names=F)

A loop below to call the above repmask2gtf() to generate the GTF for all the files that are in the current dir.

for (chr in list.dirs(path=".", full.names=TRUE, recursive=FALSE)){
  if (grepl("chr", chr, fixed=TRUE)){
    print(paste0("generating GTF for ", gsub("./chr", "", chr)))
    if (!file.exists(sprintf("gtfs/chr%1$s.gtf", gsub("./chr", "", chr)))){
      repmask2gtf(gsub("./chr", "", chr))
    }else{
      cat(sprintf("gtfs/chr%1$s.gtf", gsub("./chr", "", chr)), 'already exist\n')
    }
  }
}

4. Modify GTF for STARsolo

Modify GTF for a couple of reasons:

STAR does not accept that we’re using mRNA instead of exons
The gene id, which are RepeatMasker + a number, may not work when we concat all chromosomes together due to duplicated IDs.
STAR also does not need a ID besides gene id, so we’ll remove that as well.

change_feature_gtf <- function(gtf_tab) {
  gtf_tab$V3 <- rep("exon", nrow(gtf_tab))
  gtf_tab
}
rm_id_gtf <- function(gtf_tab) {
  gtf_tab$V9 <- lapply(
    gtf_tab$V9, 
    function(x) gsub("(ID \"[A-Za-z0-9]*\"; )", "", x))
  gtf_tab
}
rename_gtf <- function(gtf_tab) {
  gtf_tab$V9 <- lapply(gtf_tab$V9, function(x) {
    new_id <- sprintf("gene_id \"RepMasker_%s\"; ", uuid::UUIDgenerate())
    gsub("(gene_id \"[A-Za-z0-9]*\"; )", new_id, x)
  })
  gtf_tab
}
mod_gtf <- function(chr_name) {
  gtf <- read.table(chr_name, header = FALSE, sep = "\t", quote = "")
  gtf <- change_feature_gtf(gtf)
  gtf <- rm_id_gtf(gtf)
  gtf <- rename_gtf(gtf)
  gtf <- as.data.frame(lapply(gtf,unlist))
  write.table(gtf, file=gsub("./gtfs/", "./gtfs2/", chr_name), quote=FALSE, sep='\t', col.names=F, row.names=F)
}

A for loop below to apply mod_gtf() to every GTF file

for (chr in list.files(path="./gtfs", full.names=T, recursive=F)) {
  if (!file.exists(gsub("./gtfs/", "./gtfs2/", chr))){
    print(paste0("generating new gtfs for: ", chr))
    mod_gtf(chr)
  }
}

5. Generate FASTA from GTF for validation

We first modify the GTF generated for STAR

mod_gtf_for_AGAT <- function(chr_name,input_dir="./gtfs/",output_dir="./gtfs2/") {
  gtf <- read.table(chr_name, header = FALSE, sep = "\t", quote = "")
  t_uuid <- stringr::str_extract(string = gtf$V9,pattern = 'gene_id( "RepMasker_.*?"; )',group = 1)
  gtf$V9 <- paste0('ID',t_uuid,gtf$V9)
  gtf$V3 <- rep("mRNA", nrow(gtf))
  gtf <- as.data.frame(lapply(gtf,unlist))
  write.table(gtf,file=gsub(input_dir, output_dir, chr_name),quote=F,sep='\t',col.names=F,row.names=F)
}

Then convert GTF into fasta for comparing our GRCm39 ERV GTF with from GRCm38 gEVE database using AGATinstalled by mamba/conda.

#!/bin/bash
# see also in 'gtf2fa.sh'

# CHRS=($(seq 1 19) "X" "Y")
CHRS=("Y")
for chr in "${CHRS[@]}"
do
  echo "extracting .fa from chr${chr}.gtf"
	agat_sp_extract_sequences.pl -g ./gtfs2/chr${chr}.gtf -t mRNA --merge 0\
    -f ../single_cell/mouse/GRCm39.primary_assembly.genome.fa -o ./fa/chr${chr}.fa
done

Vadlidate our GRCm39 ERV

We compared ERV using RepeatMasker and its built-in Dfam 4.1 on GRCm39 and GRCm38 ERV in gEVE.

1. Wrange files of GRCm38 ERV

ERV sequences for the mouse genome assembly GRCm38 are downloaded from the gEVE website. They are split by chromosome names using pyfaidx or faidx provided by samtools

# download gEVE mouse annots
wget http://geve.med.u-tokai.ac.jp/download_data/nt_fasta/Mmus38.geve.nt_v1.fa.bz2
bunzip Mmus38.geve.nt_v1.fa.bz2

# extract just chromosome Y (we're going to compare this one)
mkdir mus38_split
faidx --regex ".*chrY.*" Mmus38.geve.nt_v1.fa -o "mus38_split/chrY.fa"

2. BLASTn: gEVE as reference, our GRCm39 ERV as query

Then, a BLASTn database is created based on the extracted chromosome Y sequences from GRCm38 ERV. Make sure BLAST is installed on your system and is in system PATHS by adding the below to your user bashrc or else makeblastdb and rBLAST would not work. It won’t be achieved by running module load blast in eristwo command line interface or rsutdio terminal

export PATH=$PATH:/apps/lib/blast/ncbi-blast-2.13.0+/bin/
mkdir ./blastdb/mus38_chrY
makeblastdb -in mus38_split/chrY.fa -title mus38_chrY \
      -dbtype nucl -parse_seqids -out ./blastdb/mus38_chrY/mus38_chrY
# the first mus38_chrY of "./blastdb/mus38_chrY/mus38_chrY " specify dir, the second specify database name

In order to compare the ERV sequence generated, we first use the BLASTn to build reference using gEVE sequences and use RepeatMasker sequences are query. We will then do the reverse and use RepeatMasker sequences as reference and gEVE sequences as query. This step simply reads the RepeatMasker sequences and the BLAST database.

library(rBLAST) # make sure your default PATH includes BLAST, or else it wouldn't work
library(Biostrings) # read .fa files
library(data.table)

# allow us to parallelize the rBLAST operations
library(foreach) 
library(doParallel)

reptmask_chrY <- readDNAStringSet('repeatmasker/chrY.fa')
bla <- blast(db='./blastdb/mus38_chrY/mus38_chrY', type = "blastn")#read makeblastdb-generated database into R enviorment

Below test if the sequences can be read and if rBLAST is working.

reptmask_chrY[1,]
predict(bla, reptmask_chrY[1,], BLAST_args = "-perc_identity 99")

If it looks good, we can speed it up via multi-threading parrellel computation

parrellel computation

# run in cluster, this may take 15~30 minutes with a good (48 core) computer
reptmask_matches <- foreach(i=1:length(reptmask_chrY)) %dopar% {
  library('rBLAST')
  library('data.table')
  
  match_i = data.table(predict(bla, reptmask_chrY[i,], BLAST_args = "-perc_identity 99"))
  match_i 
}

# change the name to something more intuitive
names(reptmask_matches)<- gsub(pattern=' .*$',
                               replacement='',
                               reptmask_chrY@ranges@NAMES)
saveRDS(reptmask_matches,'repeatmasker_mus38_full_named_matches.rds')

To analyse the matches, we first find number of sequences with matches, and also extract unique matches from the reference.

length(reptmask_matches[lapply(reptmask_matches, nrow) > 0])
length(unique(do.call(c, lapply(reptmask_matches, function (tbl) tbl$'SubjectID'))))

We can also specify families and see how many matches exist between gEVE and our sequences.

gtflist <- read.table("repeatmasker/chrY.gtf")
erv_matches <- full_matches[grepl("ERV", gtflist$V16)]
length(erv_matches[lapply(erv_matches, nrow) > 0])

erv_matches_list <- unique(do.call(c, lapply(erv_matches, function (tbl) tbl$'SubjectID')))
length(erv_matches_list)

3. BLASTn: our GRCm39 ERV as reference, gEVE as query

mus38_chrY <- readDNAStringSet('mus38_split/chrY.fa')
bla <- blast(db='./blastdb/rm_mus39_chrY/rm_mus39_chrY', type = "blastn")

rev_reptmask_matches <- foreach(i=1:length(mus38_chrY)) %dopar% {
  library('rBLAST')
  library('data.table')
  
  match_i = data.table(predict(bla, mus38_chrY[i,], BLAST_args = "-perc_identity 99")) 
  match_i #Equivalent to finalMatrix = cbind(finalMatrix, tempMatrix)
}

#comparison
length(full_matches[lapply(rev_reptmask_matches, nrow) > 0])
length(unique(do.call(c, lapply(rev_reptmask_matches, function (tbl) tbl$'SubjectID'))))

#family-specific comparison
erv_matches <- rev_reptmask_matches[grepl("ERV", gtflist$V16)]
length(erv_matches[lapply(erv_matches, nrow) > 0])
erv_matches_list <- unique(do.call(c, lapply(erv_matches, function (tbl) tbl$'SubjectID')))
length(erv_matches_list)

stopCluster(cl)

Conclusion

For chrY:

gEVE GRCm38 ERV has 4633 unique sequences; our GRCm39 ERV have 8841 unique sequences including 3701 unique ERV sequences
gEVE sequences as BLASTn query and our GRCm39 ERV as reference:
- Out of 9043 total sequences in gEVE, 8768 has a BLASTn >99% matched sequence in our GRCm39 ERV, corresponding to 4321 unique GRCm39 sequences.

Thus, our GRCm39 ERV capture nearly all information in gEVE GRCm38 ERV, and add ~4000 more unique ERV sequences. We attribute this improvement to the up-to-date GRCm39 and the sensitivity settings available in the newer version of RepeatMasker. Below observation further revealed that the newly identified sequences in GRCm39 ERV predominantly belong to ERV and LINE, both of significant interest in biomedical research.

our GRCm39 ERV as BLASTn query and gEVE as reference: only 55% of GRCm39 ERV segments and 27% of GRCm39 LINE segments have matched sequences in gEVE, specifically

2794 (out of 5077) of our GRCm39 ERV segments have BLASTn matched sequence in gEVE (corresponding to 6134 sequences);
1803 (out of 6670) of our GRCm39 LINE segments have BLASTn matched sequence in gEVE (corresponding to 1454 sequences)