In some common genome annotation GTF/GFF files, rRNA repeats are not properly marked. A detailed description and discussion of this problem can be seen here. This can lead to ineffective identification and filtering of rRNA reads in RNA-seq studies. To address this problem, this repo hosts a set of GTF files obtained from the UCSC Table Browser, along with modified versions of them, to assist rRNA-related quality-check steps in RNA-seq data analysis pipelines.
Original GTF files were obtained from the UCSC Table Browser by selecting:
- Genome and version
- Repeats (or Variation or Repeats) under "group"
- RepeatMasker under "track"
- Enter repClass does match rRNA under "filter"
- Choose GTF under "output format"
Each line in the GTF file was duplicated, with the third column changed from 'exon' to 'gene'. This increases compatibility for users who choose to count 'gene' instead of 'exon' features with featureCounts.
To convert the files from UCSC format to Ensembl format, conversion tables from ChromosomeMappings were used. Chromosomes not in the conversion tables were omitted.
Additionally, gene_biotype "rRNA" was added to the end of every line.
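The processing steps described above can be sketched with awk. This is only an illustration, not necessarily how the hosted files were produced: the file names (ucsc_rRNA.gtf, ucsc2ensembl.txt, ensembl_rRNA.gtf) are hypothetical, the two toy GTF lines stand in for a real UCSC export, and the mapping table is assumed to have two tab-separated columns (UCSC name, Ensembl name).

```shell
# Toy input standing in for a UCSC Table Browser export (hypothetical file names).
printf 'chr1\thg38_rmsk\texon\t100\t200\t0.0\t+\t.\tgene_id "5S"; transcript_id "5S";\n' > ucsc_rRNA.gtf
printf 'chrUn_x\thg38_rmsk\texon\t5\t50\t0.0\t-\t.\tgene_id "LSU"; transcript_id "LSU";\n' >> ucsc_rRNA.gtf

# Assumed mapping table layout: UCSC name <TAB> Ensembl name.
printf 'chr1\t1\n' > ucsc2ensembl.txt

awk 'BEGIN { FS = OFS = "\t" }
     # First file: load the chromosome name mapping, skipping empty targets.
     NR == FNR { if ($2 != "") map[$1] = $2; next }
     # Second file: omit chromosomes that are not in the mapping table.
     !($1 in map) { next }
     {
       $1 = map[$1]                          # UCSC name -> Ensembl name
       $9 = $9 " gene_biotype \"rRNA\";"     # append the biotype attribute
       print                                 # original "exon" line
       $3 = "gene"
       print                                 # duplicated line as "gene"
     }' ucsc2ensembl.txt ucsc_rRNA.gtf > ensembl_rRNA.gtf

cat ensembl_rRNA.gtf
```

In this toy run, the chrUn_x record is dropped because it has no entry in the mapping table, and the chr1 record is written twice, once as 'exon' and once as 'gene'.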
There are many ways to use these files, of course. One way that I have tested and found quite effective is to simply concatenate these GTF files to the original annotation and run featureCounts with the options -g gene_biotype -M -O --fraction. The -g gene_biotype option tells featureCounts to tabulate reads by the gene_biotype attribute; -M -O --fraction tells featureCounts to count reads aligned to multiple locations and/or overlapping multiple genes by assigning them to the target biotypes fractionally, so that no read is counted more than once. The final tally is a good approximation of the biotype composition of the RNA-seq library.
Optionally, you can add -t gene to the featureCounts command. This asks featureCounts to count reads overlapping genes instead of exons. Counting only reads that overlap exons biases against protein-coding genes, because reads overlapping their introns are not counted, while other biotypes such as rRNA have few or no introns. This bias is stronger in samples with more intronic content. For example, libraries prepared with an rRNA depletion protocol generally contain more pre-mRNA than libraries prepared with a polyA selection protocol, so you might want to consider the -t gene option for the former library type.
cat GRCh38.gtf >> GRCh38_original.gtf
featureCounts -a GRCh38_original.gtf -g gene_biotype -M -O --fraction -p -o sample_biotype.featureCounts.txt -s 0 sample.bam
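To turn the resulting table into a biotype composition, the counts can be converted to percentages. The sketch below assumes the usual featureCounts output layout for a single BAM file (a leading comment line, a header row starting with Geneid, and the counts in column 7). The toy table and the file names toy_biotype.featureCounts.txt and biotype_percent.tsv are made up for illustration; they are not real results.

```shell
# Toy table mimicking featureCounts output for one BAM file (made-up numbers).
{
  printf '# Program:featureCounts (toy example)\n'
  printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.bam\n'
  printf 'protein_coding\t1\t1\t2\t+\t2\t750.5\n'
  printf 'rRNA\t1\t1\t2\t+\t2\t200\n'
  printf 'lncRNA\t1\t1\t2\t+\t2\t49.5\n'
} > toy_biotype.featureCounts.txt

# Percentage of assigned reads per biotype, largest first.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/ || $1 == "Geneid" { next }         # skip comment and header lines
     { count[$1] = $7; total += $7 }         # column 7 = counts (single BAM assumed)
     END {
       for (b in count)
         printf "%s\t%.2f\n", b, (total > 0 ? 100 * count[b] / total : 0)
     }' toy_biotype.featureCounts.txt | sort -k2,2nr > biotype_percent.tsv

cat biotype_percent.tsv
```

The percentages are relative to assigned reads only; unassigned reads are reported separately in the .summary file that featureCounts writes next to its output.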
Testing was done using two RNA-seq samples, one with rRNA depletion and one without. The FASTQ files were analyzed using a modified version of the nf-core/rnaseq pipeline (version 1.4.2) with GRCh38 as the reference. The analysis was run three times:
- With the original featureCounts code, almost no rRNA was detected.
- With the -M -O --fraction options, a little rRNA was detected, but still very low.
- With both the rRNA GTF file and -M -O --fraction, rRNA was detected correctly (close to expectations).
Test data and the MultiQC reports of the test runs are available upon request.