alexdobin / STAR

RNA-seq aligner

Pass 2 is 50 times slower depending on reindexed genome

maressyl opened this issue

Hi,

I use STAR 2.7.11a in several targeted RNA-seq projects, and can't figure out why, in some of them, mapping speed differs hugely between pass 1 (around 50 million reads / hour) and pass 2 (40-50 million reads / hour in some projects, but only 1 million / hour in others).

I apply a fully decoupled two-pass approach (full commands below; there are quite a lot of options because I'm looking for splicing and fusion events):

  • aligning each sample separately against GRCh38 in a first pass,
  • feeding all *.SJ.out.tab files to a "reindex" run with a dummy FASTQ to get a reindexed genome,
  • using the reindexed genome to align each sample separately again (pass 2).

I narrowed the problem down to two projects with similar characteristics (10 to 15 million reads per sample, ~2x150 bp after adapter trimming, ~150 targeted genes, UMIs in the "fast" project but not in the "slow" one), and aligned the reads of the slow project against two distinct reindexed genomes: the one produced by its own pass 1 and another one produced by pass 1 of a "fast" project. Mapping speed was 50 M/h with the "fast" project genome, and still 1 M/h with the "slow" project genome.

I saw previous discussions on the subject (#733, #1034) suggesting reducing the number of junctions injected into the genome, but in this case the log file of my genome reindexing step reports more junctions in the "fast" project (1 125 058 after SJ and GTF injection) than in the "slow" project (977 952).

Any idea what I am missing?

Best regards,
Sylvain


Pass 1

STAR \
		--runThreadN 6 \
		--twopassMode None \
		--genomeDir "$rawGenome" \
		--genomeLoad NoSharedMemory \
		--readFilesIn $readFilesIn \
		--readFilesCommand gunzip -c \
		--outFileNamePrefix "./" \
		--outSAMunmapped Within \
		--outSAMtype BAM Unsorted \
		--chimOutType Junctions WithinBAM \
		--quantMode TranscriptomeSAM \
		--outSAMattrRGline $RG \
		--sjdbGTFfile "$genomeGTF" \
		--alignEndsProtrude ${params.umi_length} ConcordantPair \ ### -1 in "slow" project, 6 in "fast" project
		--alignInsertionFlush Right \
		--alignSJDBoverhangMin 4 \
		--alignSJstitchMismatchNmax 3 -1 3 3 \
		--alignSplicedMateMapLmin 16 \
		--alignSplicedMateMapLminOverLmate 0 \
		--chimJunctionOverhangMin 8 \
		--chimScoreJunctionNonGTAG -4 \
		--chimSegmentMin 10 \
		--chimMultimapNmax 1 \
		--chimNonchimScoreDropMin 10 \
		--outFilterMultimapNmax 3 \
		--outFilterMismatchNmax 5 \
		--outSJfilterOverhangMin 16 8 8 8 \
		--outSJfilterDistToOtherSJmin 0 0 0 0 \
		--peOverlapNbasesMin 12 \
		--peOverlapMMp 0.1

Reindex

STAR \
		--runThreadN 2 \
		--genomeDir "$rawGenome" \
		--readFilesIn "$R1" "$R2" \
		--sjdbFileChrStartEnd $SJ \
		--limitSjdbInsertNsj 5000000 \
		--sjdbInsertSave All \
		--sjdbGTFfile "$genomeGTF" \
		--outFileNamePrefix "./reindex/" \
		--outSAMtype None

Pass 2

STAR \
		--runThreadN 6 \
		--twopassMode None \
		--genomeDir "$reindexedGenome" \
		--genomeLoad NoSharedMemory \
		--readFilesIn $readFilesIn \
		--readFilesCommand gunzip -c \
		--outFileNamePrefix "./${sample}/" \
		--outSAMunmapped Within \
		--outSAMtype BAM Unsorted \
		--chimOutType Junctions WithinBAM \
		--quantMode TranscriptomeSAM \
		--outSAMattrRGline $RG \
		--sjdbGTFfile "$genomeGTF" \
		--alignEndsProtrude -1 ConcordantPair \ ### -1 for both during the test
		--alignInsertionFlush Right \
		--alignSJDBoverhangMin 4 \
		--alignSJstitchMismatchNmax 3 -1 3 3 \
		--alignSplicedMateMapLmin 16 \
		--alignSplicedMateMapLminOverLmate 0 \
		--chimJunctionOverhangMin 8 \
		--chimScoreJunctionNonGTAG -4 \
		--chimSegmentMin 10 \
		--chimMultimapNmax 1 \
		--chimNonchimScoreDropMin 10 \
		--outFilterMultimapNmax 3 \
		--outFilterMismatchNmax 5 \
		--outSJfilterOverhangMin 16 8 8 8 \
		--outSJfilterDistToOtherSJmin 0 0 0 0 \
		--peOverlapNbasesMin 12 \
		--peOverlapMMp 0.1

Here are the Log.out files found in the reindexed genome directories:
Log_28358_fast.txt
Log_29870_slow.txt

Log.progress.out of pass 2 with the "fast" genome :

           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Feb 05 12:10:37     44.1      808302      291    88.2%    289.2     0.3%     4.0%     1.4%     2.3%     3.0%     0.0%
Feb 05 12:11:37     47.3     1656787      291    88.1%    289.2     0.3%     4.0%     1.4%     2.3%     3.1%     0.0%
Feb 05 12:12:45     48.1     2589510      291    88.1%    289.2     0.3%     4.0%     1.4%     2.4%     3.1%     0.0%
Feb 05 12:13:51     48.8     3522403      291    88.0%    289.1     0.3%     4.0%     1.4%     2.4%     3.2%     0.0%
Feb 05 12:14:52     50.4     4497709      291    88.0%    289.1     0.3%     4.0%     1.4%     2.4%     3.2%     0.0%
Feb 05 12:15:55     50.1     5345938      291    87.9%    289.1     0.3%     4.0%     1.4%     2.4%     3.2%     0.0%
Feb 05 12:17:01     49.6     6193987      291    87.9%    289.1     0.3%     4.0%     1.4%     2.4%     3.2%     0.0%
Feb 05 12:18:04     49.7     7084344      291    88.0%    289.1     0.3%     4.0%     1.4%     2.4%     3.2%     0.0%
...
Feb 05 12:27:19     51.1    15167755      291    88.1%    289.2     0.3%     4.0%     1.4%     2.4%     3.1%     0.0%
Feb 05 12:28:19     51.1    16013581      291    88.1%    289.2     0.3%     4.0%     1.4%     2.4%     3.1%     0.0%
Feb 05 12:29:23     51.3    16986118      291    88.1%    289.2     0.3%     4.0%     1.4%     2.4%     3.1%     0.0%
Feb 05 12:30:26     51.5    17958489      291    88.1%    289.2     0.3%     4.0%     1.4%     2.3%     3.1%     0.0%
Feb 05 12:31:26     51.7    18888455      291    88.1%    289.2     0.3%     4.0%     1.4%     2.3%     3.1%     0.0%
Feb 05 12:32:28     51.7    19776267      291    88.1%    289.2     0.3%     4.0%     1.4%     2.3%     3.1%     0.0%
Feb 05 12:33:30     51.7    20664174      291    88.1%    289.2     0.3%     4.0%     1.4%     2.3%     3.1%     0.0%
Feb 05 12:34:34     51.7    21594288      291    88.1%    289.2     0.3%     4.0%     1.4%     2.3%     3.1%     0.0%
Feb 05 12:35:42     51.6    22524679      291    88.1%    289.2     0.3%     4.0%     1.4%     2.3%     3.1%     0.0%
ALL DONE!

Log.progress.out of pass 2 with the "slow" genome :

           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Feb 05 12:20:29      0.2       42597      291    78.7%    289.0     0.2%    12.1%     3.7%     1.8%     2.7%     0.2%
Feb 05 12:33:04      0.7      297976      291    78.4%    289.2     0.2%    12.0%     3.8%     2.0%     2.8%     0.2%
Feb 05 12:34:41      1.1      510592      291    78.0%    289.1     0.3%    12.0%     3.7%     2.3%     3.0%     0.2%
Feb 05 12:46:18      0.9      553125      291    77.9%    289.1     0.3%    12.1%     3.7%     2.3%     3.0%     0.2%
Feb 05 12:47:39      1.1      723274      291    77.9%    289.0     0.3%    12.1%     3.7%     2.3%     3.0%     0.2%
Feb 05 12:59:34      0.9      808302      291    77.9%    289.0     0.3%    12.1%     3.7%     2.3%     3.0%     0.2%
Feb 05 13:00:41      1.1      935787      291    77.9%    289.0     0.3%    12.1%     3.7%     2.3%     3.0%     0.2%

They both started at roughly the same time (12:07 / 12:09) on the same otherwise idle server (886 GB RAM, 114 CPUs).

Hi Sylvain,

The mapping speed goes down when there is a complex locus with a large number of novel inserted junctions all close together. It often correlates with the overall number of junctions, but in your case it does not, apparently. The solution is still to filter the junctions.

Hi Alex,

I tried the filter you suggested here ($1 ~ /^chr[0-9XY]+$/ && $6 == 0 && $5 > 0 && $7 > 0): I go down to 804 615 junctions in total (-18%) and speed is around 5-8 M/h. That is better, but still far slower than pass 1 or the other genome (50 M/h).

It also means filtering out non-canonical sites, but we are looking for DNA mutations that introduce new canonical splice sites; wouldn't they appear as non-canonical to STAR, since it relies on the reference genome? The $7 filter is marginal in this data. I also tried $1 ~ /^chr[0-9XY]+$/ && $7 > 0 && $9 >= 8, but it has no significant impact on junction count (-3%) or speed.
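
For anyone following along, the first filter above can be wrapped in a small shell function. This is only a sketch: the function name filter_sj and the adjustable read threshold are assumptions added for illustration, and the column meanings in the comments are taken from the SJ.out.tab description in the STAR manual.

```shell
# Hypothetical helper (not part of STAR): keep only the junctions worth
# injecting during the reindex step. Reads SJ.out.tab lines on stdin.
# SJ.out.tab columns: 1 chromosome, 2 intron start, 3 intron end, 4 strand,
# 5 intron motif (0 = non-canonical), 6 annotated (0 = no, 1 = yes),
# 7 uniquely mapped reads, 8 multi-mapped reads, 9 maximum overhang.
filter_sj() {
    # $1: minimum number of uniquely mapped reads (assumed default: 1)
    local min_unique="${1:-1}"
    # Keep main chromosomes, unannotated junctions, canonical motifs,
    # and junctions supported by at least min_unique unique reads.
    awk -F '\t' -v min="$min_unique" \
        '$1 ~ /^chr[0-9XY]+$/ && $6 == 0 && $5 > 0 && $7 >= min'
}
```

Something like `cat pass1/*/SJ.out.tab | filter_sj 1 > filtered.SJ.tab` (paths are placeholders) would then produce a single file to hand to --sjdbFileChrStartEnd. Raising the threshold is one way to make the filter stricter without touching the motif column.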

I will have a look at junction coordinates to see if I can identify some of the complex loci you mention.
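
One possible way to look for such loci is to count distinct novel junctions per fixed-size genomic window and list the densest windows. The sketch below is an assumption-laden starting point, not a method taken from STAR: the function name dense_sj_windows, the default 10 kb window and the binning by intron start are all arbitrary choices.

```shell
# Hypothetical helper (not part of STAR): list the genomic windows holding the
# most distinct unannotated junctions. Reads SJ.out.tab lines on stdin and
# prints "chromosome <TAB> window start <TAB> junction count".
dense_sj_windows() {
    # $1: window size in bp (assumed default: 10000)
    # $2: number of windows to report (assumed default: 20)
    local window="${1:-10000}" top="${2:-20}"
    awk -F '\t' -v OFS='\t' -v w="$window" '
        # Count each distinct unannotated junction once, even when several
        # per-sample files are concatenated, binned by intron start.
        $6 == 0 && !seen[$1, $2, $3]++ {
            start = int($2 / w) * w
            count[$1 OFS start]++
        }
        END {
            for (key in count) print key, count[key]
        }
    ' | LC_ALL=C sort -k3,3nr -k1,1 -k2,2n | head -n "$top"
}
```

Running `cat pass1/*/SJ.out.tab | dense_sj_windows 10000 20` (placeholder paths) on the "slow" and "fast" projects and comparing the top windows should show whether one project has a locus with an unusually high concentration of novel junctions.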