Metatranscriptomics slow mapping rate
ohickl opened this issue
Oskar Hickl commented
Hi, I am trying to map (short) reads back to metatranscriptome assemblies, but the mapping speed is extremely low for some samples.
I tried reducing --seedPerWindowNmax as described in e.g. #1414, but it didn't seem to affect the speed at all.
Problematic sample example read stats:
file                   format  type  num_seqs    sum_len        min_len  avg_len  max_len  Q1  Q2  Q3  sum_gap  N50  Q20(%)  Q30(%)  AvgQual  GC(%)
mt.r1.preprocessed.fq  FASTQ   DNA   93,701,920  5,527,380,584  40       59       60       59  60  60  0        60   96.98   92.94   26.19    37.51
mt.r2.preprocessed.fq  FASTQ   DNA   93,701,920  5,533,235,876  40       59.1     60       60  60  60  0        60   97.28   94.24   26       37.43
Index cmd used:
# Get stats
r_max_l=$( seqkit stats -T {input.r_1} | cut -f 8 | tail -n +2 )
assembly_stats=$( seqkit stats -T {input.assembly} | tail -n +2 )
a_contigs=$( echo ${{assembly_stats}} | cut -d ' ' -f 4 )
a_length=$( echo ${{assembly_stats}} | cut -d ' ' -f 5 )
genomeChrBinNbits=$( python -c "import math; print( round(min(18, math.log2(max(${{a_length}}/${{a_contigs}}, ${{r_max_l}})))) )" )
genomeSAindexNbases=$( python -c "import math; print( round(min(14, math.log2(${{a_length}}) / 2 - 1)))" )
echo "r_max_l: ${{r_max_l}}" >> {log}
echo "assembly_stats: ${{assembly_stats}}" >> {log}
echo "genomeChrBinNbits: ${{genomeChrBinNbits}}" >> {log}
echo "genomeSAindexNbases: ${{genomeSAindexNbases}}" >> {log}
STAR --runThreadN {threads} \
--runMode genomeGenerate \
--genomeDir Assembly \
--genomeChrBinNbits ${{genomeChrBinNbits}} \
--genomeSAindexNbases 10 \
--limitGenomeGenerateRAM {params.max_mem_bytes} \
--genomeFastaFiles {input.assembly} >> {log} 2>&1
Index log:
r_max_l: 60
< file format type num_seqs sum_len min_len avg_len max_len >
assembly_stats: Assembly/mgmt.assembly.merged.fa FASTA DNA 5248352 3450289293 200 657.4 383322
genomeChrBinNbits: 9
genomeSAindexNbases: 14
.../bin/STAR-avx2 --runThreadN 43 --runMode genomeGenerate --genomeDir Assembly --genomeChrBinNbits 9 --genomeSAindexNbases 10 --limitGenomeGenerateRAM 75000000000 --genomeFastaFiles Assembly/mgmt.assembly.merged.fa
STAR version: 2.7.10b compiled: 2023-05-25T06:56:23+0000 :/opt/conda/conda-bld/star_1684997536154/work/source
Apr 25 23:30:00 ..... started STAR run
Apr 25 23:30:00 ... starting to generate Genome files
Apr 25 23:31:23 ... starting to sort Suffix Array. This may take a long time...
Apr 25 23:31:44 ... sorting Suffix Array chunks and saving them to disk...
Apr 25 23:42:12 ... loading chunks from disk, packing SA...
Apr 25 23:45:18 ... finished generating suffix array
Apr 25 23:45:18 ... generating Suffix Array index
Apr 25 23:45:24 ... completed Suffix Array index
Apr 25 23:45:25 ... writing Genome to disk ...
Apr 25 23:45:31 ... writing Suffix Array to disk ...
Apr 25 23:46:07 ... writing SAindex to disk
Apr 25 23:46:08 ..... finished successfully
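For reference, the two `python -c` one-liners from the index command can be reproduced as a small standalone script. This is only a sketch of the same arithmetic: the function name `star_index_params` is made up for illustration, and the inputs are the values reported in the index log above. It gives 9 and 14, matching the logged values. Note that the STAR call itself passes a fixed --genomeSAindexNbases 10, not the computed value.

```python
import math


def star_index_params(assembly_length, n_contigs, max_read_length):
    """Reproduce the two inline calculations from the index command.

    Hypothetical helper for illustration only; it is not part of STAR.
    """
    mean_contig_length = assembly_length / n_contigs
    # genomeChrBinNbits: log2 of the larger of mean contig length and max read length, capped at 18
    chr_bin_nbits = round(min(18, math.log2(max(mean_contig_length, max_read_length))))
    # genomeSAindexNbases: log2(assembly length) / 2 - 1, capped at 14
    sa_index_nbases = round(min(14, math.log2(assembly_length) / 2 - 1))
    return chr_bin_nbits, sa_index_nbases


# Values taken from the index log above.
chr_bin_nbits, sa_index_nbases = star_index_params(
    assembly_length=3_450_289_293,
    n_contigs=5_248_352,
    max_read_length=60,
)
print(f"genomeChrBinNbits:   {chr_bin_nbits}")
print(f"genomeSAindexNbases: {sa_index_nbases}")
```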
Map cmd used:
# Increase max number of open file descriptors for STAR
ulimit -n 100000
# Map paired
echo "Mapping PE reads and sorting them ..." >> {log} 2>&1
# --seedPerWindowNmax 30
STAR --runThreadN {threads} \
--runMode alignReads \
--genomeDir Assembly \
--outFilterMultimapNmax 100 \
--winAnchorMultimapNmax 150 \
--alignIntronMax 1 \
--outFileNamePrefix {params.out_dir}/ \
--outSAMtype BAM SortedByCoordinate \
--limitBAMsortRAM {params.max_mem} \
--outSAMheaderHD '@RG\tID:{params.sample}\tSM:mt\tPL:platform\tLB:library' \
--outBAMcompression -1 \
--outReadsUnmapped Fastx \
--outFilterScoreMinOverLread 0.33 \
--outFilterMatchNminOverLread 0.33 \
--readFilesIn {input.r_1} {input.r_2} >> {log} 2>&1
Log.out:
STAR version=2.7.10b
STAR compilation time,server,dir=2023-05-25T06:56:23+0000 :/opt/conda/conda-bld/star_1684997536154/work/source
##### Command Line:
.../bin/STAR-avx2 --runThreadN 128 --runMode alignReads --genomeDir Assembly --outFilterMultimapNmax 100 --winAnchorMultimapNmax 150 --alignIntronMax 1 --outFileNamePrefix Assembly/ --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 224000000000 --outSAMheaderHD "@RG ID:<sample> SM:mt PL:platform LB:library" --outBAMcompression -1 --outReadsUnmapped Fastx --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --readFilesIn Preprocessing/mt.r1.preprocessed.fq Preprocessing/mt.r2.preprocessed.fq
##### Initial USER parameters from Command Line:
outFileNamePrefix Assembly/
###### All USER parameters from Command Line:
runThreadN 128 ~RE-DEFINED
runMode alignReads ~RE-DEFINED
genomeDir Assembly ~RE-DEFINED
outFilterMultimapNmax 100 ~RE-DEFINED
winAnchorMultimapNmax 150 ~RE-DEFINED
alignIntronMax 1 ~RE-DEFINED
outFileNamePrefix Assembly/ ~RE-DEFINED
outSAMtype BAM SortedByCoordinate ~RE-DEFINED
limitBAMsortRAM 224000000000 ~RE-DEFINED
outSAMheaderHD "@RG ID:<sample> SM:mt PL:platform LB:library" ~RE-DEFINED
outBAMcompression -1 ~RE-DEFINED
outReadsUnmapped Fastx ~RE-DEFINED
outFilterScoreMinOverLread 0.33 ~RE-DEFINED
outFilterMatchNminOverLread 0.33 ~RE-DEFINED
readFilesIn Preprocessing/mt.r1.preprocessed.fq Preprocessing/mt.r2.preprocessed.fq ~RE-DEFINED
##### Finished reading parameters from all sources
##### Final user re-defined parameters-----------------:
runMode alignReads
runThreadN 128
genomeDir Assembly
readFilesIn Preprocessing/mt.r1.preprocessed.fq Preprocessing/mt.r2.preprocessed.fq
limitBAMsortRAM 224000000000
outFileNamePrefix Assembly/
outReadsUnmapped Fastx
outSAMtype BAM SortedByCoordinate
outSAMheaderHD "@RG ID:<sample> SM:mt PL:platform LB:library"
outBAMcompression -1
outFilterMultimapNmax 100
outFilterScoreMinOverLread 0.33
outFilterMatchNminOverLread 0.33
winAnchorMultimapNmax 150
alignIntronMax 1
-------------------------------
##### Final effective command line:
.../bin/STAR-avx2 --runMode alignReads --runThreadN 128 --genomeDir Assembly --readFilesIn Preprocessing/mt.r1.preprocessed.fq Preprocessing/mt.r2.preprocessed.fq --limitBAMsortRAM 224000000000 --outFileNamePrefix Assembly/ --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --outSAMheaderHD "@RG ID:<sample> SM:mt PL:platform LB:library" --outBAMcompression -1 --outFilterMultimapNmax 100 --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --winAnchorMultimapNmax 150 --alignIntronMax 1
----------------------------------------
Number of fastq files for each mate = 1
ParametersSolo: --soloCellFilterType CellRanger2.2 filtering parameters: 3000 0.99 10
Finished loading and checking parameters
Reading genome generation parameters:
### .../bin/STAR-avx2 --runMode genomeGenerate --runThreadN 43 --genomeDir Assembly --genomeFastaFiles Assembly/mgmt.assembly.merged.fa --genomeSAindexNbases 10 --genomeChrBinNbits 9 --limitGenomeGenerateRAM 75000000000
### GstrandBit=33
versionGenome 2.7.4a ~RE-DEFINED
genomeType Full ~RE-DEFINED
genomeFastaFiles Assembly/mgmt.assembly.merged.fa ~RE-DEFINED
genomeSAindexNbases 10 ~RE-DEFINED
genomeChrBinNbits 9 ~RE-DEFINED
genomeSAsparseD 1 ~RE-DEFINED
genomeTransformType None ~RE-DEFINED
genomeTransformVCF - ~RE-DEFINED
sjdbOverhang 0 ~RE-DEFINED
sjdbFileChrStartEnd - ~RE-DEFINED
sjdbGTFfile - ~RE-DEFINED
sjdbGTFchrPrefix - ~RE-DEFINED
sjdbGTFfeatureExon exon ~RE-DEFINED
sjdbGTFtagExonParentTranscript transcript_id ~RE-DEFINED
sjdbGTFtagExonParentGene gene_id ~RE-DEFINED
sjdbInsertSave Basic ~RE-DEFINED
genomeFileSizes 4574533632 29327458994 ~RE-DEFINED
Genome version is compatible with current STAR
Number of real (reference) chromosomes= 5248352
...
Started loading the genome: Thu Apr 25 23:46:53 2024
Genome: size given as a parameter = 4574533632
SA: size given as a parameter = 29327458994
SAindex: size given as a parameter = 1
Read from SAindex: pGe.gSAindexNbases=10 nSAi=1398100
nGenome=4574533632; nSAbyte=29327458994
GstrandBit=33 SA number of indices=6900578586
Shared memory is not used for genomes. Allocated a private copy of the genome.
Genome file size: 4574533632 bytes; state: good=1 eof=0 fail=0 bad=0
Loading Genome ... done! state: good=1 eof=0 fail=0 bad=0; loaded 4574533632 bytes
SA file size: 29327458994 bytes; state: good=1 eof=0 fail=0 bad=0
Loading SA ... done! state: good=1 eof=0 fail=0 bad=0; loaded 29327458994 bytes
Loading SAindex ... done: 6291549 bytes
Finished loading the genome: Thu Apr 25 23:47:43 2024
To accommodate alignIntronMax=1 redefined winBinNbits=17
winBinNbits=17 > pGe.gChrBinNbits=9 redefining:
winBinNbits=9
To accommodate alignIntronMax=1 and alignMatesGapMax=0, redefined winFlankNbins=1 and winAnchorDistNbins=2
...
BAM sorting: 251056 mapped reads
BAM sorting bins genomic start loci:
...
Log_progress.out:
Time Speed Read Read Mapped Mapped Mapped Mapped Unmapped Unmapped Unmapped Unmapped
M/hr number length unique length MMrate multi multi+ MM short other
Apr 26 04:54:41 0.0 84749 117 58.1% 96.3 3.3% 40.1% 0.0% 0.0% 1.8% 0.0%
...
Apr 26 05:14:31 0.1 338887 117 58.1% 96.2 3.3% 39.9% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 05:28:54 0.2 932243 117 58.0% 96.2 3.3% 40.1% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 05:38:20 0.3 1694432 117 58.1% 96.2 3.3% 40.1% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 05:45:42 0.4 2203086 117 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 05:50:13 0.5 2795870 118 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 05:54:54 0.6 3388615 118 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 05:59:39 0.7 4066270 118 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 06:10:00 0.8 4913180 118 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 06:19:11 0.9 5759268 118 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 06:26:27 1.0 6437858 118 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 06:34:39 1.1 7285238 118 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 06:44:18 1.2 8133274 118 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 06:48:46 1.3 8810789 118 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 07:04:06 1.4 9826523 118 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
...
Apr 26 08:36:28 1.2 10842616 118 58.1% 96.3 3.3% 40.0% 0.0% 0.0% 1.9% 0.0%
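To put the progress log in perspective, here is a rough back-of-the-envelope estimate of the overall throughput and remaining runtime. It is a sketch under several assumptions: that STAR counts one read pair as one read in Log_progress.out (the read length of 117 to 118 suggests both mates are summed), that mapping started when genome loading finished, and that the speed stays constant. The helper name `estimate_remaining_hours` is made up for illustration.

```python
from datetime import datetime

# Figures copied from the read stats and logs above.
total_read_pairs = 93_701_920                      # num_seqs of mt.r1.preprocessed.fq
mapping_start = datetime(2024, 4, 25, 23, 47, 43)  # "Finished loading the genome" (assumed mapping start)
last_progress = datetime(2024, 4, 26, 8, 36, 28)   # last line of Log_progress.out
reads_processed = 10_842_616                       # "Read number" on that line


def estimate_remaining_hours(total, processed, start, now):
    """Return (reads per hour, hours remaining), assuming constant speed."""
    elapsed_hours = (now - start).total_seconds() / 3600
    reads_per_hour = processed / elapsed_hours
    return reads_per_hour, (total - processed) / reads_per_hour


reads_per_hour, remaining_hours = estimate_remaining_hours(
    total_read_pairs, reads_processed, mapping_start, last_progress
)
print(f"speed: {reads_per_hour / 1e6:.1f} M/hr")
print(f"estimated time remaining: {remaining_hours:.0f} h")
```

Under these assumptions the computed speed agrees with the Speed column of the last progress line, and the sample would need well over two more days to finish.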
Should I try reducing --seedPerWindowNmax below 30?
Are any of the indexing or mapping parameters inappropriate for the contig and read length properties of this assembly?
Best
Oskar