alexdobin / STAR

RNA-seq aligner

Segmentation fault STARsolo SMARTseq

LJK1991 opened this issue

Hello STAR team,
Thank you for the nice tool; I use it a lot.

Currently I am trying to run STARsolo on a SMART-seq2 dataset, GSE209742.
I have downloaded all the FASTQ files and created a manifest:

/home/l.kuijpers/data/fastq/GSE209742/fastqFiles/SRR20639501/SRR20639501.fastq.gz	-	Cell_1
/home/l.kuijpers/data/fastq/GSE209742/fastqFiles/SRR20639502/SRR20639502.fastq.gz	-	Cell_2
/home/l.kuijpers/data/fastq/GSE209742/fastqFiles/SRR20639503/SRR20639503.fastq.gz	-	Cell_3
/home/l.kuijpers/data/fastq/GSE209742/fastqFiles/SRR20639504/SRR20639504.fastq.gz	-	Cell_4
/home/l.kuijpers/data/fastq/GSE209742/fastqFiles/SRR20639505/SRR20639505.fastq.gz	-	Cell_5
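
A manifest like the one above can be generated rather than typed by hand. The following is a minimal sketch, assuming the `<fastq_dir>/<SRR>/<SRR>.fastq.gz` layout seen in these paths and the three tab-separated columns (read file, `-` for the absent mate, cell ID). The function name `make_manifest` and the `Cell_N` numbering are illustrative choices, not part of STAR.

```shell
#!/usr/bin/env bash
# Sketch: print a single-end manifest for STAR --readFilesManifest.
# make_manifest is a hypothetical helper name; the directory layout is assumed.
make_manifest() {
    local fastq_dir="$1"
    local n=0
    local fq
    for fq in "$fastq_dir"/*/*.fastq.gz; do
        [ -e "$fq" ] || continue   # skip if the glob matched nothing
        n=$((n + 1))
        # Columns: read file <TAB> "-" (no mate file) <TAB> cell ID
        printf '%s\t-\tCell_%d\n' "$fq" "$n"
    done
}

# Example usage (path is a placeholder):
#   make_manifest /path/to/fastqFiles > manifest.txt
```

Using `printf` with explicit `\t` avoids the easy mistake of separating the columns with spaces.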

Subsequently I run this command:

STAR --runThreadN 16 \
         --genomeDir ${genome} \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes NH HI nM AS GX GN \
         --sjdbGTFfile ${annotation} \
         --runDirPerm All_RWX \
         --outFileNamePrefix ${output}/ \
         --soloType SmartSeq \
         --readFilesManifest ${ReadFileManifest} \
         --soloUMIdedup ${dedup} \
         --soloStrand ${Stranded} \
         --soloFeatures Gene GeneFull \
         --soloCellReadStats Standard \
         --soloOutFileNames solo.out/ features.tsv barcodes.tsv matrix.mtx \
         --soloMultiMappers Unique

It seems to align, as I can see in Log.progress.out.
However, I get a segmentation fault (core dumped) error.

/home/l.kuijpers/scripts/STARscripts/SMARTseq2.sh: line 43: 3983846 Segmentation fault      (core dumped) STAR --runThreadN 8 --genomeDir ${genome} --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI nM AS GX GN --sjdbGTFfile ${annotation} --runDirPerm All_RWX --outFileNamePrefix ${output}/ --soloType SmartSeq --readFilesManifest ${ReadFileManifest} --soloUMIdedup ${dedup} --soloStrand ${Stranded} --soloFeatures Gene GeneFull SJ --soloCellReadStats Standard --soloOutFileNames solo.out/ features.tsv barcodes.tsv matrix.mtx --soloMultiMappers Unique

The Log.out stops at the GeneFull counting step.
Almost all output files are created up to that point, including the SJ output data.
I have attached the full Log.out.

I run it through Snakemake in a conda env where STAR is version 2.7.11a.

Thanks in advance for your help.
Log.out.txt
manifest.txt

Update.

  1. I tried removing GeneFull and using just Gene, but it also crashes. It seems that SJ is the only feature that works.
  2. I use an HPC with 370G of RAM. Normally I run this script with 64G, but I have also tried 128G and no limit. It still gets the segmentation fault. Either it requires more than this, which I doubt, or it is something else.
  3. I have set --runDirPerm All_RWX and checked that I have write permission on the folders.

Not sure where to look next.

Hi @LJK1991

please try to run just a few files (1,2,3...), and send me the Log.out file when it fails.
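
One way to produce the reduced runs requested above is to cut the manifest down to its first few rows. This is a small sketch; the helper name `subset_manifest` and the `_N.txt` output naming are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: write <name>_1.txt, <name>_2.txt and <name>_3.txt next to the full
# manifest, holding its first 1, 2 and 3 rows respectively.
# subset_manifest is a hypothetical helper name.
subset_manifest() {
    local manifest="$1"
    local n
    for n in 1 2 3; do
        head -n "$n" "$manifest" > "${manifest%.txt}_${n}.txt"
    done
}

# Example usage: subset_manifest manifest.txt
# then pass e.g. --readFilesManifest manifest_1.txt to STAR.
```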

I get the same error when I run STAR with three files.
Here is the log file:

Log_out.txt

@alexdobin, I also tried with one file in the manifest and got the same error.
Log.out.txt

Subsequently, I tried without the manifest using the following command:

STAR --runThreadN 16 \
         --genomeDir /home/l.kuijpers/genomes/GRCm39 \
         --readFilesCommand zcat \
         --readFilesIn /home/l.kuijpers/data/fastq/GSE209742/fastqFiles/SRR20639501/SRR20639501.fastq.gz \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes NH HI nM AS GX GN \
         --limitOutSJcollapsed 10000000 \
         --sjdbGTFfile /home/l.kuijpers/genomes/GRCm39/gencode.vM33.chr_patch_hapl_scaff.annotation.gtf \
         --runDirPerm All_RWX \
         --outFileNamePrefix /home/l.kuijpers/data/fastq/GSE209742/alignment/manifest \
         --soloType SmartSeq \
         --soloUMIdedup NoDedup \
         --soloStrand Unstranded \
         --soloFeatures Gene \
         --soloCellReadStats Standard \
         --soloOutFileNames solo.out/ features.tsv barcodes.tsv matrix.mtx \
         --soloMultiMappers Unique \
         --soloCellFilter None

However the same error occurs.

STAR --runThreadN 16 --genomeDir /home/l.kuijpers/genomes/GRCm39 --readFilesCommand zcat --readFilesIn /home/l.kuijpers/data/fastq/GSE209742/fastqFiles/SRR20639501/SRR20639501.fastq.gz --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI nM AS GX GN --limitOutSJcollapsed 10000000 --sjdbGTFfile /home/l.kuijpers/genomes/GRCm39/gencode.vM33.chr_patch_hapl_scaff.annotation.gtf --runDirPerm All_RWX --outFileNamePrefix /home/l.kuijpers/data/fastq/GSE209742/alignment/manifest --soloType SmartSeq --soloUMIdedup NoDedup --soloStrand Unstranded --soloFeatures Gene --soloCellReadStats Standard --soloOutFileNames solo.out/ features.tsv barcodes.tsv matrix.mtx --soloMultiMappers Unique --soloCellFilter None
        STAR version: 2.7.11a   compiled: 2023-12-15T16:21:49+01:00 :/home/l.kuijpers/git_repos/STAR/source
Feb 20 09:07:37 ..... started STAR run
Feb 20 09:07:37 ..... loading genome
Feb 20 09:07:51 ..... processing annotations GTF
Feb 20 09:07:58 ..... inserting junctions into the genome indices
Feb 20 09:09:04 ..... started mapping
Segmentation fault (core dumped)

Interestingly, though, it now occurs during the mapping, which makes everything somewhat more confusing.
manifestLog.out.txt

I have also tried other files from different datasets, and the same problem occurs both with the manifest and with --readFilesIn.

I am running on a large cluster and I have enough space on my account (over 1TB). Changing the number of threads does not help either.

Maybe you have an idea?
Thanks in advance

Hi @LJK1991 ,

I also have the same problem with STAR. Did you manage to solve it?

@mariaklv04 no, I have not. So far it has only become more confusing.

I am trying to figure out if it is RAM related. I am using Snakemake and Slurm, so total RAM should be capped at the requested values. But I am too much of a beginner to quickly figure these things out.

Maybe you have some suggestions.

Regards,

OK, I have run some additional tests using /usr/bin/time.
Normally I run the STAR command through a bash script within a Snakemake pipeline that uses Slurm's sbatch to run the rule. I specify that the rule cannot take more than 64 GB.

However, in all of the above scripts, the last few lines before it crashes are:

Feb 23 17:09:50 ..... finished mapping
RAM after mapping:
VmPeak:	37720804 kB; VmSize:	36858028 kB; VmHWM:	34904868 kB; VmRSS:	34716292 kB; 
RAM after freeing genome index memory:
VmPeak:	37720804 kB; VmSize:	 8950636 kB; VmHWM:	34904868 kB; VmRSS:	 7003236 kB; 
Feb 23 17:09:53 ..... started Solo counting
Feb 23 17:09:53 ... Starting Solo post-map for Gene
     Redistributing reads into 48files; nReadRec=878925766;   nReadRecBin=18310953

The high VmPeak numbers bothered me, as they seemed way above what Slurm should allow.

When running it outside of Snakemake under /usr/bin/time to look at its VmPeak independently (Slurm on the HPC does not have job accounting, so I cannot use sacct), I find a similarly high number, which equals the maximum RAM available.
Somehow STAR is using insane amounts of RAM. Importantly, this high RAM usage also happens when running with 3 SMART-seq files in the manifest as well as with 2000 SMART-seq files in the manifest.
This suggests a possible issue with STAR.
However, I will investigate further to see whether Slurm is somehow not limiting RAM properly.

@alexdobin maybe you have some suggestions on how to properly limit RAM usage for STAR, or know of a different solution?

These numbers do not look huge; peak RAM is ~37GB.
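
To double-check figures like these, the kB values that STAR writes to Log.out can be converted to GiB directly. The sketch below assumes the log line format quoted above (`VmPeak:` and `VmHWM:` each followed by a number and `kB;`) and divides by 1024 × 1024. The helper name `vm_to_gib` is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: report VmPeak and VmHWM from STAR Log.out lines in GiB.
# Reads the named files, or standard input when no file is given.
# vm_to_gib is a hypothetical helper name.
vm_to_gib() {
    awk '/VmPeak:/ {
        for (i = 1; i < NF; i++) {
            # The value in kB is the field right after the label.
            if ($i ~ /^Vm(Peak|HWM):$/) {
                printf "%s %.1f GiB\n", $i, $(i + 1) / 1048576
            }
        }
    }' "$@"
}

# Example usage: vm_to_gib Log.out
```

For the values in the log above, 37720804 kB comes out at about 36 GiB and 34904868 kB at about 33 GiB, both well below a 64G limit.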

@LJK1991
Using the full path to the installed STAR binary, I was able to solve the issue. I also added the option that specifies the maximum amount of RAM to be used when sorting the BAM file.

/home/maria/STAR-2.7.0e/bin/Linux_x86_64_static/STAR --runThreadN 10 --genomeDir STAR_index_chr19/ --readFilesIn trim_output/SRR10045016_1P trim_output/SRR10045016_2P --sjdbGTFfile genome/chr19_Homo_sapiens.GRCh38.95.gtf --outFileNamePrefix STAR_output/SRR10045016/ --limitBAMsortRAM 10000000000

I hope it helps.

@alexdobin, you're right. It is indeed not as high; I miscalculated.

@mariaklv04, thanks for the tip. The problem no longer occurs during the final stages of Solo counting, but during the mapping.

When running it without the manifest as before, it gets a segmentation fault during the mapping. This seems to be irrespective of what data I use. I have multiple SMART-seq datasets; some have a read length of 25 while others have a read length of 125. In both cases it crashes, even when using one cell as follows (the example has a read length of 125 bases).

I have also tried shortening the file to 1000 reads, upon which the error also occurs.
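
For anyone wanting to reproduce a shortened input like this, here is a minimal sketch. It relies on each FASTQ record spanning exactly four lines, so the first N reads are the first 4 × N lines. The helper name `subsample_fastq` and the file names in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: keep the first N reads (default 1000) of a gzipped FASTQ file.
# subsample_fastq is a hypothetical helper name.
subsample_fastq() {
    local in_fq="$1"
    local out_fq="$2"
    local n_reads="${3:-1000}"
    # One FASTQ record = 4 lines (header, sequence, '+', qualities).
    gzip -dc "$in_fq" | head -n "$((n_reads * 4))" | gzip -c > "$out_fq"
}

# Example usage (placeholder file names):
#   subsample_fastq reads_1.fastq.gz small_1.fastq.gz 1000
#   subsample_fastq reads_2.fastq.gz small_2.fastq.gz 1000
```

For paired-end data, applying the same read count to both mate files keeps the pairs in sync. Note that `head` closes the pipe early, so under `set -o pipefail` the pipeline may report a non-zero status even though the output is complete.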

/home/l.kuijpers/git_repos/STAR/bin/Linux_x86_64_static/STAR --runThreadN 16 --genomeDir ${genome} --readFilesIn /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_1.fastq.gz /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_2.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI nM AS GX GN --limitOutSJcollapsed 10000000 --sjdbGTFfile ${annotation} --runDirPerm All_RWX --outFileNamePrefix ${output}/ --soloType SmartSeq --soloUMIdedup ${dedup} --soloStrand ${Stranded} --soloFeatures Gene --soloCellReadStats Standard --soloOutFileNames solo.out/ features.tsv barcodes.tsv matrix.mtx --soloMultiMappers Unique --soloCellFilter None
        /home/l.kuijpers/git_repos/STAR/bin/Linux_x86_64_static/STAR --runThreadN 16 --genomeDir /home/l.kuijpers/genomes/GRCm39 --readFilesIn /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_1.fastq.gz /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_2.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI nM AS GX GN --limitOutSJcollapsed 10000000 --sjdbGTFfile /home/l.kuijpers/genomes/GRCm39/gencode.vM33.chr_patch_hapl_scaff.annotation.gtf --runDirPerm All_RWX --outFileNamePrefix /home/l.kuijpers/data/fastq/GSE59114/alignment/manifest/ --soloType SmartSeq --soloUMIdedup NoDedup --soloStrand Unstranded --soloFeatures Gene --soloCellReadStats Standard --soloOutFileNames solo.out/ features.tsv barcodes.tsv matrix.mtx --soloMultiMappers Unique --soloCellFilter None
        STAR version: 2.7.11b   compiled: 2024-01-25T16:12:02-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Feb 26 14:31:16 ..... started STAR run
Feb 26 14:31:16 ..... loading genome
Feb 26 14:31:30 ..... processing annotations GTF
Feb 26 14:31:37 ..... inserting junctions into the genome indices
Feb 26 14:33:02 ..... started mapping
Segmentation fault (core dumped)

I have updated STAR to the latest version, from 2.7.11a to 2.7.11b. I have not regenerated the index; not sure if that matters.

I have also tried to align one cell without the STARsolo part, e.g.:

/home/l.kuijpers/git_repos/STAR/bin/Linux_x86_64_static/STAR --runThreadN 16          --genomeDir ${genome}          --readFilesIn /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_1.fastq.gz /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_2.fastq.gz         --readFilesCommand zcat          --outSAMtype BAM SortedByCoordinate          --outSAMattributes NH HI nM AS GX GN                   --sjdbGTFfile ${annotation}          --runDirPerm All_RWX          --outFileNamePrefix ${output}
        /home/l.kuijpers/git_repos/STAR/bin/Linux_x86_64_static/STAR --runThreadN 16 --genomeDir /home/l.kuijpers/genomes/GRCm39 --readFilesIn /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_1.fastq.gz /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_2.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI nM AS GX GN --sjdbGTFfile /home/l.kuijpers/genomes/GRCm39/gencode.vM33.chr_patch_hapl_scaff.annotation.gtf --runDirPerm All_RWX --outFileNamePrefix /home/l.kuijpers/data/fastq/GSE59114/alignment/manifest
        STAR version: 2.7.11b   compiled: 2024-01-25T16:12:02-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Feb 26 14:36:35 ..... started STAR run
Feb 26 14:36:35 ..... loading genome
Feb 26 14:36:49 ..... processing annotations GTF
Feb 26 14:36:56 ..... inserting junctions into the genome indices
Feb 26 14:38:09 ..... started mapping
Feb 26 14:38:47 ..... finished mapping
Feb 26 14:38:48 ..... started sorting BAM
Feb 26 14:38:49 ..... finished successfully

This runs successfully, suggesting the problem lies somewhere within the 'solo' part.

When adding the minimal options for STARsolo with --soloType SmartSeq, it already gets the segmentation fault.

(base) l.kuijpers@hpc:~/data/fastq/GSE87631$ /home/l.kuijpers/git_repos/STAR/bin/Linux_x86_64_static/STAR --runThreadN 16          --genomeDir ${genome}          --readFilesIn /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_1.fastq.gz /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_2.fastq.gz         --readFilesCommand zcat          --outSAMtype BAM SortedByCoordinate          --outSAMattributes NH HI nM AS GX GN                   --sjdbGTFfile ${annotation}          --runDirPerm All_RWX          --outFileNamePrefix ${output} --soloType SmartSeq --soloUMIdedup NoDedup
        /home/l.kuijpers/git_repos/STAR/bin/Linux_x86_64_static/STAR --runThreadN 16 --genomeDir /home/l.kuijpers/genomes/GRCm39 --readFilesIn /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_1.fastq.gz /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_2.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI nM AS GX GN --sjdbGTFfile /home/l.kuijpers/genomes/GRCm39/gencode.vM33.chr_patch_hapl_scaff.annotation.gtf --runDirPerm All_RWX --outFileNamePrefix /home/l.kuijpers/data/fastq/GSE59114/alignment/manifest --soloType SmartSeq --soloUMIdedup NoDedup
        STAR version: 2.7.11b   compiled: 2024-01-25T16:12:02-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Feb 26 14:50:02 ..... started STAR run
Feb 26 14:50:02 ..... loading genome
Feb 26 14:50:16 ..... processing annotations GTF
Feb 26 14:50:24 ..... inserting junctions into the genome indices
Feb 26 14:51:37 ..... started mapping
Segmentation fault (core dumped)
(base) l.kuijpers@hpc:~/data/fastq/GSE87631$ /home/l.kuijpers/git_repos/STAR/bin/Linux_x86_64_static/STAR --runThreadN 16          --genomeDir ${genome}          --readFilesIn /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_1.fastq.gz /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_2.fastq.gz         --readFilesCommand zcat          --outSAMtype BAM SortedByCoordinate          --outSAMattributes NH HI nM AS GX GN                   --sjdbGTFfile ${annotation}          --runDirPerm All_RWX          --outFileNamePrefix ${output} --soloType SmartSeq --soloUMIdedup Exact
        /home/l.kuijpers/git_repos/STAR/bin/Linux_x86_64_static/STAR --runThreadN 16 --genomeDir /home/l.kuijpers/genomes/GRCm39 --readFilesIn /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_1.fastq.gz /home/l.kuijpers/data/fastq/GSE87631/fastqFiles/SRR4344274/SRR4344274_2.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI nM AS GX GN --sjdbGTFfile /home/l.kuijpers/genomes/GRCm39/gencode.vM33.chr_patch_hapl_scaff.annotation.gtf --runDirPerm All_RWX --outFileNamePrefix /home/l.kuijpers/data/fastq/GSE59114/alignment/manifest --soloType SmartSeq --soloUMIdedup Exact
        STAR version: 2.7.11b   compiled: 2024-01-25T16:12:02-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Feb 26 14:51:57 ..... started STAR run
Feb 26 14:51:57 ..... loading genome
Feb 26 14:52:11 ..... processing annotations GTF
Feb 26 14:52:18 ..... inserting junctions into the genome indices
Feb 26 14:53:43 ..... started mapping
Segmentation fault (core dumped)

All other STARsolo options use UMI and CB, which I can run successfully on 10X datasets.

I will try a workaround: align each cell with normal STAR options, then use featureCounts and umi_tools to create a matrix.
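
A rough sketch of how the per-cell part of that workaround could be scripted follows. It is a dry run that only prints commands: one plain STAR alignment per manifest row, then a single featureCounts call over all resulting BAM files. The helper name `print_per_cell_commands`, the thread count and the output layout (`<outdir>/<cell>/`) are assumptions, and the umi_tools step is left out.

```shell
#!/usr/bin/env bash
# Sketch (dry run): print per-cell STAR commands and one featureCounts command.
# print_per_cell_commands is a hypothetical helper name.
# Arguments: manifest, genome index directory, GTF annotation, output directory.
print_per_cell_commands() {
    manifest="$1"; genome="$2"; gtf="$3"; outdir="$4"
    bams=""
    tab="$(printf '\t')"
    while IFS="$tab" read -r r1 r2 cell; do
        [ -n "$r1" ] || continue          # skip empty lines
        reads="$r1"
        # "-" in column 2 marks single-end reads; otherwise add the mate file.
        if [ "$r2" != "-" ]; then reads="$r1 $r2"; fi
        printf 'mkdir -p %s/%s && STAR --runThreadN 16 --genomeDir %s --readFilesIn %s --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix %s/%s/\n' \
            "$outdir" "$cell" "$genome" "$reads" "$outdir" "$cell"
        # STAR names the sorted BAM Aligned.sortedByCoord.out.bam under the prefix.
        bams="$bams $outdir/$cell/Aligned.sortedByCoord.out.bam"
    done < "$manifest"
    printf 'featureCounts -T 16 -a %s -o %s/counts.txt%s\n' "$gtf" "$outdir" "$bams"
}

# Example usage: review the printed commands, then pipe them to a shell.
#   print_per_cell_commands manifest.txt "${genome}" "${annotation}" "${output}"
```

Printing the commands first makes it easy to inspect them before running anything. For paired-end libraries featureCounts would additionally need its paired-end option (`-p`).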