Pass 2 is 50 times slower depending on reindexed genome
maressyl opened this issue · comments
Hi,
I use STAR 2.7.11a in several targeted RNA-seq projects, and can't figure out why in some of them there is a huge difference in mapping speed between pass 1 (around 50 million reads / hour) and pass 2 (either 40-50 million reads / hour in some projects or 1 million reads / hour in others).
I apply a fully decoupled two-pass approach (full commands below, quite a lot of options as I'm looking for splicing and fusion events) :
- aligning each sample separately against GRCh38 in a first pass,
- feeding all *.SJ.out.tab files to a "reindex" launch with dummy FASTQ to get a reindexed genome,
- using the reindexed genome to align again each sample separately (pass 2).
I narrowed down the problem to two projects with similar characteristics (10 to 15 million reads per sample, ~2x150 bp after adapter trimming, ~150 targeted genes, UMIs in the "fast" project but not in the "slow" one), and ended up aligning the reads of the slow project with two distinct reindexed genomes : the one produced by the corresponding pass 1 and another one produced by pass 1 of a "fast" project. I ended up with a mapping speed of 50 M/h with the "fast" project genome, and still 1 M/h with the "slow" project genome.
I saw previous discussions on the subject (#733, #1034) suggesting reducing the number of junctions inserted into the genome, but in this case the log file of my genome reindexing step indicated more junctions in the "fast" project (1 125 058 after SJ and GTF insertion) than in the "slow" project (977 952).
Any idea on what I am missing ?
Best regards,
Sylvain
Pass 1
STAR \
--runThreadN 6 \
--twopassMode None \
--genomeDir "$rawGenome" \
--genomeLoad NoSharedMemory \
--readFilesIn $readFilesIn \
--readFilesCommand gunzip -c \
--outFileNamePrefix "./" \
--outSAMunmapped Within \
--outSAMtype BAM Unsorted \
--chimOutType Junctions WithinBAM \
--quantMode TranscriptomeSAM \
--outSAMattrRGline $RG \
--sjdbGTFfile "$genomeGTF" \
--alignEndsProtrude ${params.umi_length} ConcordantPair \ ### -1 in "slow" project, 6 in "fast project"
--alignInsertionFlush Right \
--alignSJDBoverhangMin 4 \
--alignSJstitchMismatchNmax 3 -1 3 3 \
--alignSplicedMateMapLmin 16 \
--alignSplicedMateMapLminOverLmate 0 \
--chimJunctionOverhangMin 8 \
--chimScoreJunctionNonGTAG -4 \
--chimSegmentMin 10 \
--chimMultimapNmax 1 \
--chimNonchimScoreDropMin 10 \
--outFilterMultimapNmax 3 \
--outFilterMismatchNmax 5 \
--outSJfilterOverhangMin 16 8 8 8 \
--outSJfilterDistToOtherSJmin 0 0 0 0 \
--peOverlapNbasesMin 12 \
--peOverlapMMp 0.1
Reindex
STAR \
--runThreadN 2 \
--genomeDir "$rawGenome" \
--readFilesIn "$R1" "$R2" \
--sjdbFileChrStartEnd $SJ \
--limitSjdbInsertNsj 5000000 \
--sjdbInsertSave All \
--sjdbGTFfile "$genomeGTF" \
--outFileNamePrefix "./reindex/" \
--outSAMtype None
Pass 2
STAR \
--runThreadN 6 \
--twopassMode None \
--genomeDir "$reindexedGenome" \
--genomeLoad NoSharedMemory \
--readFilesIn $readFilesIn \
--readFilesCommand gunzip -c \
--outFileNamePrefix "./${sample}/" \
--outSAMunmapped Within \
--outSAMtype BAM Unsorted \
--chimOutType Junctions WithinBAM \
--quantMode TranscriptomeSAM \
--outSAMattrRGline $RG \
--sjdbGTFfile "$genomeGTF" \
--alignEndsProtrude -1 ConcordantPair \ ### -1 for both during the test
--alignInsertionFlush Right \
--alignSJDBoverhangMin 4 \
--alignSJstitchMismatchNmax 3 -1 3 3 \
--alignSplicedMateMapLmin 16 \
--alignSplicedMateMapLminOverLmate 0 \
--chimJunctionOverhangMin 8 \
--chimScoreJunctionNonGTAG -4 \
--chimSegmentMin 10 \
--chimMultimapNmax 1 \
--chimNonchimScoreDropMin 10 \
--outFilterMultimapNmax 3 \
--outFilterMismatchNmax 5 \
--outSJfilterOverhangMin 16 8 8 8 \
--outSJfilterDistToOtherSJmin 0 0 0 0 \
--peOverlapNbasesMin 12 \
--peOverlapMMp 0.1
Here are the Log.out files found in the reindexed genome directories :
Log_28358_fast.txt
Log_29870_slow.txt
Log.progress.out of pass 2 with the "fast" genome :
Time Speed Read Read Mapped Mapped Mapped Mapped Unmapped Unmapped Unmapped Unmapped
M/hr number length unique length MMrate multi multi+ MM short other
Feb 05 12:10:37 44.1 808302 291 88.2% 289.2 0.3% 4.0% 1.4% 2.3% 3.0% 0.0%
Feb 05 12:11:37 47.3 1656787 291 88.1% 289.2 0.3% 4.0% 1.4% 2.3% 3.1% 0.0%
Feb 05 12:12:45 48.1 2589510 291 88.1% 289.2 0.3% 4.0% 1.4% 2.4% 3.1% 0.0%
Feb 05 12:13:51 48.8 3522403 291 88.0% 289.1 0.3% 4.0% 1.4% 2.4% 3.2% 0.0%
Feb 05 12:14:52 50.4 4497709 291 88.0% 289.1 0.3% 4.0% 1.4% 2.4% 3.2% 0.0%
Feb 05 12:15:55 50.1 5345938 291 87.9% 289.1 0.3% 4.0% 1.4% 2.4% 3.2% 0.0%
Feb 05 12:17:01 49.6 6193987 291 87.9% 289.1 0.3% 4.0% 1.4% 2.4% 3.2% 0.0%
Feb 05 12:18:04 49.7 7084344 291 88.0% 289.1 0.3% 4.0% 1.4% 2.4% 3.2% 0.0%
...
Feb 05 12:27:19 51.1 15167755 291 88.1% 289.2 0.3% 4.0% 1.4% 2.4% 3.1% 0.0%
Feb 05 12:28:19 51.1 16013581 291 88.1% 289.2 0.3% 4.0% 1.4% 2.4% 3.1% 0.0%
Feb 05 12:29:23 51.3 16986118 291 88.1% 289.2 0.3% 4.0% 1.4% 2.4% 3.1% 0.0%
Feb 05 12:30:26 51.5 17958489 291 88.1% 289.2 0.3% 4.0% 1.4% 2.3% 3.1% 0.0%
Feb 05 12:31:26 51.7 18888455 291 88.1% 289.2 0.3% 4.0% 1.4% 2.3% 3.1% 0.0%
Feb 05 12:32:28 51.7 19776267 291 88.1% 289.2 0.3% 4.0% 1.4% 2.3% 3.1% 0.0%
Feb 05 12:33:30 51.7 20664174 291 88.1% 289.2 0.3% 4.0% 1.4% 2.3% 3.1% 0.0%
Feb 05 12:34:34 51.7 21594288 291 88.1% 289.2 0.3% 4.0% 1.4% 2.3% 3.1% 0.0%
Feb 05 12:35:42 51.6 22524679 291 88.1% 289.2 0.3% 4.0% 1.4% 2.3% 3.1% 0.0%
ALL DONE!
Log.progress.out of pass 2 with the "slow" genome :
Time Speed Read Read Mapped Mapped Mapped Mapped Unmapped Unmapped Unmapped Unmapped
M/hr number length unique length MMrate multi multi+ MM short other
Feb 05 12:20:29 0.2 42597 291 78.7% 289.0 0.2% 12.1% 3.7% 1.8% 2.7% 0.2%
Feb 05 12:33:04 0.7 297976 291 78.4% 289.2 0.2% 12.0% 3.8% 2.0% 2.8% 0.2%
Feb 05 12:34:41 1.1 510592 291 78.0% 289.1 0.3% 12.0% 3.7% 2.3% 3.0% 0.2%
Feb 05 12:46:18 0.9 553125 291 77.9% 289.1 0.3% 12.1% 3.7% 2.3% 3.0% 0.2%
Feb 05 12:47:39 1.1 723274 291 77.9% 289.0 0.3% 12.1% 3.7% 2.3% 3.0% 0.2%
Feb 05 12:59:34 0.9 808302 291 77.9% 289.0 0.3% 12.1% 3.7% 2.3% 3.0% 0.2%
Feb 05 13:00:41 1.1 935787 291 77.9% 289.0 0.3% 12.1% 3.7% 2.3% 3.0% 0.2%
They both started at roughly the same time (12:07 / 12:09) on the same otherwise unoccupied server (886 GB RAM, 114 CPUs).
Hi Sylvain,
The mapping speed goes down when there is a complex locus with a large number of novel inserted junctions all close together. It often correlates with the overall number of junctions, but in your case it does not, apparently. The solution is still to filter the junctions.
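One way to look for such loci is to count novel junctions per fixed-size window across the pass 1 SJ.out.tab files. The sketch below is only an illustration: the function name dense_sj_windows, the window size and the number of windows reported are hypothetical choices, not something STAR provides. It relies only on the documented SJ.out.tab columns (1: chromosome, 2: intron start, 3: intron end, 4: strand, 6: 0 if unannotated).

```shell
# Count distinct novel (unannotated) junctions per fixed-size genomic window
# and print the densest windows first as: chromosome, window start, count.
# Usage: dense_sj_windows <window_bp> <top_n> file1.SJ.out.tab [file2 ...]
# (hypothetical helper, not part of STAR)
dense_sj_windows() {
    window=$1
    top=$2
    shift 2
    awk -v w="$window" 'BEGIN { FS = OFS = "\t" }
        $6 == 0 {                            # column 6 == 0: junction absent from the GTF
            key = $1 OFS $2 OFS $3 OFS $4    # same junction seen in several samples counts once
            if (!(key in seen)) {
                seen[key] = 1
                bin = int($2 / w) * w        # window start derived from the intron start
                n[$1 OFS bin]++
            }
        }
        END { for (k in n) print k, n[k] }' "$@" |
    sort -t "$(printf '\t')" -k3,3nr -k1,1 -k2,2n |
    head -n "$top"
}
```

For instance, dense_sj_windows 10000 20 followed by the list of pass 1 SJ.out.tab files would report the 20 windows of 10 kb holding the most novel junctions. Comparing that list between the "fast" and "slow" projects may point to candidate loci, although a high count alone does not prove that a window is the cause of the slowdown.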
Hi Alex,
I tried with the filter you suggested here ($1 ~ /^chr[0-9XY]+$/ && $6 == 0 && $5 > 0 && $7 > 0): I go down to 804 615 junctions in total (-18%) and speed is around 5-8 M/h. It's better, but still far slower than pass 1 or the other genome (50 M/h).
It also means filtering out non-canonical sites, but we are looking for DNA mutations introducing new canonical splice sites; wouldn't they appear as non-canonical to STAR based on the reference genome ? The $7 filter is marginal in this data. I also tried $1 ~ /^chr[0-9XY]+$/ && $7 > 0 && $9 >= 8, but it has no significant impact on junction count (-3%) or speed.
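A possible middle ground between the two filters above, sketched under assumptions: instead of dropping every non-canonical junction, keep those supported by a larger number of uniquely mapped reads. The function name filter_sj and both thresholds are hypothetical and would need tuning on real data. The column meanings are those of SJ.out.tab (5: intron motif, 0 meaning non-canonical; 6: 0 if unannotated; 7: uniquely mapped reads crossing the junction).

```shell
# Keep unannotated junctions on the main chromosomes, requiring more unique
# reads for non-canonical motifs (column 5 == 0) than for canonical ones.
# Usage: filter_sj <min_canonical> <min_noncanonical> file.SJ.out.tab [...]
# (hypothetical helper; thresholds are illustrative, not recommended values)
filter_sj() {
    min_can=$1
    min_noncan=$2
    shift 2
    awk -v c="$min_can" -v nc="$min_noncan" '
        $1 ~ /^chr[0-9XY]+$/ && $6 == 0 &&
        (($5 > 0 && $7 >= c) || ($5 == 0 && $7 >= nc))' "$@"
}
```

With a canonical threshold of 1, the canonical branch matches the $5 > 0 && $7 > 0 condition of the first filter, so only the handling of non-canonical junctions changes.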
I will have a look at junction coordinates to see if I can identify some of the complex loci you mention.