alexdobin / STAR

RNA-seq aligner

STARsolo BAM sorting required too much RAM

cliff891 opened this issue · comments

Hi STAR team,

I'm sorry to bother you, but I've read through as many related issues as I could and still couldn't find a solution for my situation.

I'm trying to use STARsolo in my single-cell analysis and need to add some tags to my BAM file. So, inevitably, I have to use the BAM sorting built into STAR, as the documentation indicates. My data is paired-end FASTQ, with each file around 30-40 GB. After running the command, STAR reports that I need 400 GB+ of RAM for BAM sorting. My question is: is this really the amount of RAM required, or is there some way to avoid this or reduce the RAM usage?

EXITING because of fatal ERROR: not enough memory for BAM sorting:
SOLUTION: re-run STAR with at least --limitBAMsortRAM 450666107356

I also read in the documentation that STARsolo accepts BAM files as input. Would it be possible to have STAR output an unsorted BAM, sort it with samtools, and then feed the sorted BAM into STARsolo to add those tags?
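On the idea of sorting outside STAR: below is a minimal sketch of writing an unsorted BAM and coordinate-sorting it afterwards with samtools, rather than feeding a BAM back into STARsolo. Caveat: per the documentation cited above, some solo tags may only be emitted when STAR performs the coordinate sort itself, so check that the tags you need actually appear in the unsorted BAM for your STAR version. `$solo_opts` is a hypothetical shorthand for the solo/barcode options shown in the full command; the other variables are the same placeholders used elsewhere in this thread.

```shell
# Hypothetical shorthand for the solo/barcode/attribute options from the
# full command (minus the sorting-related flags).
solo_opts="--soloType CB_UMI_Simple --soloCBwhitelist $whitelist --outSAMattributes $tags"  # etc.

# Map with STARsolo, but skip STAR's internal coordinate sort.
STAR \
 --runThreadN $threads \
 --genomeDir $index \
 --readFilesIn $work_dir/$fq1 $work_dir/$fq2 \
 --readFilesCommand zcat \
 $solo_opts \
 --outFileNamePrefix $output_dir/$prefix \
 --outSAMtype BAM Unsorted

# Sort afterwards with samtools, capping per-thread memory with -m.
samtools sort -@ $threads -m 2G \
 -o "${output_dir}/${prefix}Aligned.sortedByCoord.out.bam" \
 "${output_dir}/${prefix}Aligned.out.bam"
```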

Here is my original code:

STAR \
 --runThreadN $threads \
 --genomeDir $index \
 --readFilesIn $work_dir/$fq1 $work_dir/$fq2 \
 --soloBarcodeMate $BarcodeMate --clip5pNbases $sc1 $sc2 \
 --soloCBwhitelist $whitelist \
 --readFilesCommand zcat \
 --soloType CB_UMI_Simple \
 --soloCBstart $CBstart --soloCBlen $CBlen \
 --soloUMIstart $UMIstart --soloUMIlen $UMIlen \
 --soloStrand $Strand \
 --outFileNamePrefix $output_dir/$prefix \
 --soloCBmatchWLtype $WLtype \
 --soloUMIdedup $UMItools \
 --outSAMattributes $tags \
 --outBAMsortingBinsN 400 \
 --outBAMsortingThreadN 20 \
 --outSAMtype BAM SortedByCoordinate \
 --limitBAMsortRAM 200666107356 \
 --outFilterMultimapNmax 1

Thanks in advance,
Cliff

Hi Cliff,

Please send me the Log.out file.
Were these FASTQ files preprocessed in any way?
If not, then it means there is a locus in the genome that contains most of the reads, which will require large RAM for sorting.
If you run it without BAM generation, what are the mapping stats in Log.final.out?

Hi Alex,

Thanks for the quick reply. I believe the FASTQ data was not preprocessed, but I could shuffle it and run it again.

Here is the log after re-running STAR with 400 GB+ for --limitBAMsortRAM. It finished successfully, but I don't think that's an ideal solution for me.
test_out_Log.out.txt

And here are the mapping stats:
test_out_Log.final.out.txt

Hi Cliff,

It appears the reads are ordered for some reason. The first ~70k reads all map to chr1 around position 1,000,000, so STAR creates a lot of sorting bins around that locus, but most of the remaining reads end up in the last bin outside it, which takes a lot of RAM and time to sort.
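A quick way to check for this kind of ordering is to look at where the first alignments of the unsorted BAM land. A sketch, assuming samtools is on PATH; the chr1 / ~1,000,000 window is the one identified above and is illustrative. The printf here feeds a tiny demo SAM stream; in practice, replace it with `samtools view Aligned.out.bam`.

```shell
# Demo SAM-like stream (fields used: $3 = RNAME, $4 = POS).
# In practice: samtools view Aligned.out.bam | head -n 70000 | awk ...
printf 'r1\t0\tchr1\t1000123\nr2\t0\tchr1\t999877\nr3\t0\tchr5\t42\n' \
  | head -n 70000 \
  | awk '$3 == "chr1" && $4 >= 900000 && $4 <= 1100000 { n++ }
         END { printf "%d of %d alignments near chr1:1,000,000\n", n, NR }'
# → 2 of 3 alignments near chr1:1,000,000
```

If a large fraction of the first reads cluster at one locus, the input was probably position-ordered (e.g. converted from a sorted BAM) and shuffling it should spread the sorting bins evenly.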

Hi Alex,
I really appreciate you pointing that out. We confirmed this FASTQ file was converted from a BAM file, so the reads were in coordinate order; I'd suggest anyone facing this issue check their FASTQ files first.
Anyway, after reshuffling the FASTQ files, I managed to reduce STAR's RAM usage for sorting to 60 GB. In case anyone faces the same problem, here is my solution using fastq-shuffle:
fastq-shuffle.pl -1 <fq1> -2 <fq2> -s 10G -d <tmp_dir> -r 123 -o <out_dir>
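If fastq-shuffle.pl isn't available, the same paired shuffle can be sketched with standard coreutils: linearize each 4-line FASTQ record with paste, join the mates side by side, shuffle whole pairs, then split back out. This is an alternative I'm suggesting, not part of the original solution; it assumes gzipped 4-line-record FASTQ, and the tiny printf inputs are a demo in place of real files.

```shell
# Demo inputs: two tiny gzipped paired FASTQ files (replace with real data).
printf '@a/1\nAAAA\n+\nIIII\n@b/1\nCCCC\n+\nIIII\n' | gzip > r1.fq.gz
printf '@a/2\nGGGG\n+\nIIII\n@b/2\nTTTT\n+\nIIII\n' | gzip > r2.fq.gz

# One tab-separated line per 4-line FASTQ record.
zcat r1.fq.gz | paste - - - - > r1.flat
zcat r2.fq.gz | paste - - - - > r2.flat

# Join mates side by side, shuffle whole pairs, split back into two FASTQs.
paste r1.flat r2.flat \
  | shuf \
  | awk -F'\t' '{
      print $1"\n"$2"\n"$3"\n"$4 > "r1.shuffled.fq"
      print $5"\n"$6"\n"$7"\n"$8 > "r2.shuffled.fq"
    }'
```

Because entire record pairs travel through shuf together, mate pairing is preserved while the genomic ordering is destroyed, which is exactly what spreads STAR's sorting bins evenly.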