Metatranscriptomics slow mapping rate

ohickl opened this issue 2 months ago

Hi, I am trying to map (short) reads back to metatranscriptome assemblies, but the mapping speed is extremely low for some samples.
I tried reducing --seedPerWindowNmax as described in e.g. #1414, but it didn't seem to affect the speed at all.
Read stats for an example problematic sample:

file                   format  type    num_seqs        sum_len  min_len  avg_len  max_len  Q1  Q2  Q3  sum_gap  N50  Q20(%)  Q30(%)  AvgQual  GC(%)
mt.r1.preprocessed.fq  FASTQ   DNA   93,701,920  5,527,380,584       40       59       60  59  60  60        0   60   96.98   92.94    26.19  37.51
mt.r2.preprocessed.fq  FASTQ   DNA   93,701,920  5,533,235,876       40     59.1       60  60  60  60        0   60   97.28   94.24       26  37.43

Index cmd used:

# Get stats
r_max_l=$( seqkit stats -T {input.r_1} | cut -f 8 | tail -n +2 )
assembly_stats=$( seqkit stats -T {input.assembly} | tail -n +2 )
a_contigs=$( echo ${{assembly_stats}} | cut -d ' ' -f 4 )
a_length=$( echo ${{assembly_stats}} | cut -d ' ' -f 5 )

genomeChrBinNbits=$( python -c "import math; print( round(min(18, math.log2(max(${{a_length}}/${{a_contigs}}, ${{r_max_l}})))) )" )
genomeSAindexNbases=$( python -c "import math; print( round(min(14, math.log2(${{a_length}}) / 2 - 1)))" )

echo "r_max_l: ${{r_max_l}}" >> {log}
echo "assembly_stats: ${{assembly_stats}}" >> {log}
echo "genomeChrBinNbits: ${{genomeChrBinNbits}}" >> {log}
echo "genomeSAindexNbases: ${{genomeSAindexNbases}}" >> {log}

STAR --runThreadN {threads} \
        --runMode genomeGenerate \
        --genomeDir Assembly \
        --genomeChrBinNbits ${{genomeChrBinNbits}} \
        --genomeSAindexNbases 10 \
        --limitGenomeGenerateRAM {params.max_mem_bytes} \
        --genomeFastaFiles {input.assembly} >> {log} 2>&1

Index log:

r_max_l: 60
<               file                                    format  type    num_seqs   sum_len      min_len   avg_len  max_len >
assembly_stats: Assembly/mgmt.assembly.merged.fa	FASTA	DNA	5248352	   3450289293	200	  657.4	   383322
genomeChrBinNbits: 9
genomeSAindexNbases: 14
	.../bin/STAR-avx2 --runThreadN 43 --runMode genomeGenerate --genomeDir Assembly --genomeChrBinNbits 9 --genomeSAindexNbases 10 --limitGenomeGenerateRAM 75000000000 --genomeFastaFiles Assembly/mgmt.assembly.merged.fa
	STAR version: 2.7.10b   compiled: 2023-05-25T06:56:23+0000 :/opt/conda/conda-bld/star_1684997536154/work/source
Apr 25 23:30:00 ..... started STAR run
Apr 25 23:30:00 ... starting to generate Genome files
Apr 25 23:31:23 ... starting to sort Suffix Array. This may take a long time...
Apr 25 23:31:44 ... sorting Suffix Array chunks and saving them to disk...
Apr 25 23:42:12 ... loading chunks from disk, packing SA...
Apr 25 23:45:18 ... finished generating suffix array
Apr 25 23:45:18 ... generating Suffix Array index
Apr 25 23:45:24 ... completed Suffix Array index
Apr 25 23:45:25 ... writing Genome to disk ...
Apr 25 23:45:31 ... writing Suffix Array to disk ...
Apr 25 23:46:07 ... writing SAindex to disk
Apr 25 23:46:08 ..... finished successfully
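
For reference, the two derived index parameters can be recomputed outside the Snakemake rule. The sketch below is a standalone Python version of the two inline `python -c` expressions from the index command, fed with the values from the index log; the function name `star_index_params` is only illustrative. Note that the STAR call itself passes a fixed `--genomeSAindexNbases 10`, so the computed value is logged but not used.

```python
import math


def star_index_params(assembly_length, n_contigs, max_read_length):
    """Recompute the two values derived in the indexing rule (hypothetical helper)."""
    # min(18, log2(max(mean contig length, max read length)))
    chr_bin_nbits = round(
        min(18, math.log2(max(assembly_length / n_contigs, max_read_length)))
    )
    # min(14, log2(assembly length) / 2 - 1)
    sa_index_nbases = round(min(14, math.log2(assembly_length) / 2 - 1))
    return chr_bin_nbits, sa_index_nbases


# Values taken from the index log: sum_len, num_seqs, r_max_l
chr_bin_nbits, sa_index_nbases = star_index_params(3_450_289_293, 5_248_352, 60)
print("genomeChrBinNbits:", chr_bin_nbits)
print("genomeSAindexNbases:", sa_index_nbases)
```

With these inputs the result agrees with the logged values (9 and 14).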

Map cmd used:

# Increase max number of open file descriptors for STAR
ulimit -n 100000

# Map paired
echo "Mapping PE reads and sorting them ..." >> {log} 2>&1
# --seedPerWindowNmax 30
STAR --runThreadN {threads} \
        --runMode alignReads \
        --genomeDir Assembly \
        --outFilterMultimapNmax 100 \
        --winAnchorMultimapNmax 150 \
        --alignIntronMax 1 \
        --outFileNamePrefix {params.out_dir}/ \
        --outSAMtype BAM SortedByCoordinate \
        --limitBAMsortRAM {params.max_mem} \
        --outSAMheaderHD '@RG\tID:{params.sample}\tSM:mt\tPL:platform\tLB:library' \
        --outBAMcompression -1 \
        --outReadsUnmapped Fastx \
        --outFilterScoreMinOverLread 0.33 \
        --outFilterMatchNminOverLread 0.33 \
        --readFilesIn {input.r_1} {input.r_2} >> {log} 2>&1

Log.out:

STAR version=2.7.10b
STAR compilation time,server,dir=2023-05-25T06:56:23+0000 :/opt/conda/conda-bld/star_1684997536154/work/source
##### Command Line:
.../bin/STAR-avx2 --runThreadN 128 --runMode alignReads --genomeDir Assembly --outFilterMultimapNmax 100 --winAnchorMultimapNmax 150 --alignIntronMax 1 --outFileNamePrefix Assembly/ --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 224000000000 --outSAMheaderHD "@RG	ID:<sample>	SM:mt	PL:platform	LB:library" --outBAMcompression -1 --outReadsUnmapped Fastx --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --readFilesIn Preprocessing/mt.r1.preprocessed.fq Preprocessing/mt.r2.preprocessed.fq
##### Initial USER parameters from Command Line:
outFileNamePrefix                 Assembly/
###### All USER parameters from Command Line:
runThreadN                    128     ~RE-DEFINED
runMode                       alignReads        ~RE-DEFINED
genomeDir                     Assembly     ~RE-DEFINED
outFilterMultimapNmax         100     ~RE-DEFINED
winAnchorMultimapNmax         150     ~RE-DEFINED
alignIntronMax                1     ~RE-DEFINED
outFileNamePrefix             Assembly/     ~RE-DEFINED
outSAMtype                    BAM   SortedByCoordinate        ~RE-DEFINED
limitBAMsortRAM               224000000000     ~RE-DEFINED
outSAMheaderHD                "@RG	ID:<sample>	SM:mt	PL:platform	LB:library"        ~RE-DEFINED
outBAMcompression             -1     ~RE-DEFINED
outReadsUnmapped              Fastx     ~RE-DEFINED
outFilterScoreMinOverLread    0.33     ~RE-DEFINED
outFilterMatchNminOverLread   0.33     ~RE-DEFINED
readFilesIn                   Preprocessing/mt.r1.preprocessed.fq   Preprocessing/mt.r2.preprocessed.fq        ~RE-DEFINED
##### Finished reading parameters from all sources

##### Final user re-defined parameters-----------------:
runMode                           alignReads   
runThreadN                        128
genomeDir                         Assembly
readFilesIn                       Preprocessing/mt.r1.preprocessed.fq   Preprocessing/mt.r2.preprocessed.fq   
limitBAMsortRAM                   224000000000
outFileNamePrefix                 Assembly/
outReadsUnmapped                  Fastx
outSAMtype                        BAM   SortedByCoordinate   
outSAMheaderHD                    "@RG	ID:<sample>	SM:mt	PL:platform	LB:library"   
outBAMcompression                 -1
outFilterMultimapNmax             100
outFilterScoreMinOverLread        0.33
outFilterMatchNminOverLread       0.33
winAnchorMultimapNmax             150
alignIntronMax                    1

-------------------------------
##### Final effective command line:
.../bin/STAR-avx2   --runMode alignReads      --runThreadN 128   --genomeDir Assembly   --readFilesIn Preprocessing/mt.r1.preprocessed.fq   Preprocessing/mt.r2.preprocessed.fq      --limitBAMsortRAM 224000000000   --outFileNamePrefix Assembly/   --outReadsUnmapped Fastx   --outSAMtype BAM   SortedByCoordinate      --outSAMheaderHD "@RG	ID:<sample>	SM:mt	PL:platform	LB:library"      --outBAMcompression -1   --outFilterMultimapNmax 100   --outFilterScoreMinOverLread 0.33   --outFilterMatchNminOverLread 0.33   --winAnchorMultimapNmax 150   --alignIntronMax 1
----------------------------------------

Number of fastq files for each mate = 1
ParametersSolo: --soloCellFilterType CellRanger2.2 filtering parameters:  3000 0.99 10
Finished loading and checking parameters
Reading genome generation parameters:
### .../bin/STAR-avx2   --runMode genomeGenerate      --runThreadN 43   --genomeDir Assembly   --genomeFastaFiles Assembly/mgmt.assembly.merged.fa      --genomeSAindexNbases 10   --genomeChrBinNbits 9   --limitGenomeGenerateRAM 75000000000
### GstrandBit=33
versionGenome                 2.7.4a     ~RE-DEFINED
genomeType                    Full     ~RE-DEFINED
genomeFastaFiles              Assembly/mgmt.assembly.merged.fa        ~RE-DEFINED
genomeSAindexNbases           10     ~RE-DEFINED
genomeChrBinNbits             9     ~RE-DEFINED
genomeSAsparseD               1     ~RE-DEFINED
genomeTransformType           None     ~RE-DEFINED
genomeTransformVCF            -     ~RE-DEFINED
sjdbOverhang                  0     ~RE-DEFINED
sjdbFileChrStartEnd           -        ~RE-DEFINED
sjdbGTFfile                   -     ~RE-DEFINED
sjdbGTFchrPrefix              -     ~RE-DEFINED
sjdbGTFfeatureExon            exon     ~RE-DEFINED
sjdbGTFtagExonParentTranscripttranscript_id     ~RE-DEFINED
sjdbGTFtagExonParentGene      gene_id     ~RE-DEFINED
sjdbInsertSave                Basic     ~RE-DEFINED
genomeFileSizes               4574533632   29327458994        ~RE-DEFINED
Genome version is compatible with current STAR
Number of real (reference) chromosomes= 5248352
...
Started loading the genome: Thu Apr 25 23:46:53 2024

Genome: size given as a parameter = 4574533632
SA: size given as a parameter = 29327458994
SAindex: size given as a parameter = 1
Read from SAindex: pGe.gSAindexNbases=10  nSAi=1398100
nGenome=4574533632;  nSAbyte=29327458994
GstrandBit=33   SA number of indices=6900578586
Shared memory is not used for genomes. Allocated a private copy of the genome.
Genome file size: 4574533632 bytes; state: good=1 eof=0 fail=0 bad=0
Loading Genome ... done! state: good=1 eof=0 fail=0 bad=0; loaded 4574533632 bytes
SA file size: 29327458994 bytes; state: good=1 eof=0 fail=0 bad=0
Loading SA ... done! state: good=1 eof=0 fail=0 bad=0; loaded 29327458994 bytes
Loading SAindex ... done: 6291549 bytes
Finished loading the genome: Thu Apr 25 23:47:43 2024

To accommodate alignIntronMax=1 redefined winBinNbits=17
winBinNbits=17 > pGe.gChrBinNbits=9   redefining:
winBinNbits=9
To accommodate alignIntronMax=1 and alignMatesGapMax=0, redefined winFlankNbins=1 and winAnchorDistNbins=2
...
BAM sorting: 251056 mapped reads
BAM sorting bins genomic start loci:
...
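
Two numbers in the genome-loading part of Log.out can be cross-checked against the index settings. The sketch below assumes (this is an interpretation of the log, not something taken from the STAR documentation) that nSAi counts all k-mer prefixes of length 1 to genomeSAindexNbases, and that the suffix array holds one entry per assembly base on each strand.

```python
# Values copied from the logs above
sa_index_nbases = 10            # pGe.gSAindexNbases
logged_n_sai = 1_398_100        # nSAi
assembly_length = 3_450_289_293  # sum_len of the assembly
logged_sa_indices = 6_900_578_586  # "SA number of indices"

# Assumption: one SAindex slot per prefix of length 1..Nbases over {A,C,G,T}
n_sai = sum(4**k for k in range(1, sa_index_nbases + 1))

# Assumption: suffix array covers forward and reverse strand
sa_indices = 2 * assembly_length

print("nSAi matches:", n_sai == logged_n_sai)
print("SA indices match:", sa_indices == logged_sa_indices)
```

Both checks agree with the log, which suggests the index really was built with a prefix length of 10 for a roughly 3.45 Gb assembly.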

Log_progress.out:

           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Apr 26 04:54:41      0.0       84749      117    58.1%     96.3     3.3%    40.1%     0.0%     0.0%     1.8%     0.0%
...
Apr 26 05:14:31      0.1      338887      117    58.1%     96.2     3.3%    39.9%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 05:28:54      0.2      932243      117    58.0%     96.2     3.3%    40.1%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 05:38:20      0.3     1694432      117    58.1%     96.2     3.3%    40.1%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 05:45:42      0.4     2203086      117    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 05:50:13      0.5     2795870      118    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 05:54:54      0.6     3388615      118    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 05:59:39      0.7     4066270      118    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 06:10:00      0.8     4913180      118    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 06:19:11      0.9     5759268      118    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 06:26:27      1.0     6437858      118    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 06:34:39      1.1     7285238      118    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 06:44:18      1.2     8133274      118    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 06:48:46      1.3     8810789      118    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 07:04:06      1.4     9826523      118    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
...
Apr 26 08:36:28      1.2    10842616      118    58.1%     96.3     3.3%    40.0%     0.0%     0.0%     1.9%     0.0%
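
To put the speed column into perspective, the projected runtime for the whole sample can be estimated from the last progress line. This is a rough sketch: it assumes the Speed column is reads (pairs) processed per hour since mapping started, takes the "Finished loading the genome" timestamp from Log.out as the start, and assumes the speed stays constant. The helper name `reads_per_hour` is illustrative.

```python
from datetime import datetime


def reads_per_hour(start, now, reads_done):
    """Average throughput since the start of mapping (hypothetical helper)."""
    elapsed_hours = (now - start).total_seconds() / 3600
    return reads_done / elapsed_hours


# Timestamps and counts taken from Log.out and Log_progress.out
start = datetime(2024, 4, 25, 23, 47, 43)  # genome finished loading
last = datetime(2024, 4, 26, 8, 36, 28)    # last progress line
reads_done = 10_842_616
total_reads = 93_701_920                   # num_seqs per mate file

speed = reads_per_hour(start, last, reads_done)
projected_total_hours = total_reads / speed

print(f"speed: {speed / 1e6:.1f} M reads/hr")
print(f"projected total runtime: {projected_total_hours:.0f} h")
```

With these numbers the average comes out at about 1.2 M reads/hr, matching the last progress line, and the full sample would need roughly 76 hours at that pace.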

Should I try reducing --seedPerWindowNmax below 30?
Are any of the indexing or mapping parameters unsuitable for the contig and read lengths of this assembly?

Best

Oskar