broadinstitute / Drop-seq

Java tools for analyzing Drop-seq data

Requested array size exceeds VM limit

drneavin opened this issue

Affected tool(s)

GenerateSyntheticDoublets

Affected version(s)

  • 2.5.1, not sure if it is fixed in other versions

Description

I'm not sure whether this is directly a Drop-seq issue or an issue with one of the tools it calls, but I'm receiving the following Java error on only a subset of GenerateSyntheticDoublets jobs:

[Tue Jul 18 11:31:18 AEST 2023] GenerateSyntheticDoublets --INPUT /directflow/SCCGGroupShare/projects/DrewNeavin/Demultiplex_Benchmark/output/Consortium/benchmark_rerun/SimulatedPools/size8_SimulatedPool3_unevenN_0.5pctl/combined_singlets.bam --OUTPUT /directflow/SCCGGroupShare/projects/DrewNeavin/Demultiplex_Benchmark/output/Consortium/benchmark_rerun/SimulatedPools/size8_SimulatedPool3_unevenN_0.5pctl/simulated_doublets.bam --EMIT_SINGLETONS false --NUMBER_MULTICELL 518 --CELL_BARCODE_TAG CB --CELL_BC_FILE /directflow/SCCGGroupShare/projects/DrewNeavin/Demultiplex_Benchmark/output/Consortium/benchmark_rerun/SimulatedPools/size8_SimulatedPool3_unevenN_0.5pctl/barcodes4simulation.tsv --MULTIPLICITY 2 --READ_MQ 10 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Tue Jul 18 11:31:18 AEST 2023] Executing as drenea@beefy-4-4.local on Linux 3.10.0-1160.42.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: Version:2.5.1(680c2ea_1642084299)
INFO    2023-07-18 11:31:18     GenerateSyntheticDoublets       Found 7526 cell barcodes in file
[Tue Jul 18 11:31:39 AEST 2023] org.broadinstitute.dropseqrna.barnyard.digitalallelecounts.sampleassignment.GenerateSyntheticDoublets done. Elapsed time: 0.36 minutes.
Runtime.totalMemory()=12348030976
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
        at java.lang.StringBuffer.append(StringBuffer.java:369)
        at java.io.StringWriter.write(StringWriter.java:94)
        at java.io.BufferedWriter.flushBuffer(BufferedWriter.java:129)
        at java.io.BufferedWriter.write(BufferedWriter.java:230)
        at java.io.Writer.write(Writer.java:157)
        at java.io.Writer.append(Writer.java:227)
        at htsjdk.samtools.SAMTextHeaderCodec.println(SAMTextHeaderCodec.java:455)
        at htsjdk.samtools.SAMTextHeaderCodec.encode(SAMTextHeaderCodec.java:420)
        at htsjdk.samtools.SAMTextHeaderCodec.encode(SAMTextHeaderCodec.java:395)
        at htsjdk.samtools.SAMFileWriterImpl.writeHeader(SAMFileWriterImpl.java:254)
        at htsjdk.samtools.SAMFileWriterImpl.setHeader(SAMFileWriterImpl.java:149)
        at htsjdk.samtools.SAMFileWriterFactory.initializeBAMWriter(SAMFileWriterFactory.java:316)
        at htsjdk.samtools.SAMFileWriterFactory.makeBAMWriter(SAMFileWriterFactory.java:301)
        at htsjdk.samtools.SAMFileWriterFactory.makeBAMWriter(SAMFileWriterFactory.java:263)
        at htsjdk.samtools.SAMFileWriterFactory.makeSAMOrBAMWriter(SAMFileWriterFactory.java:444)
        at htsjdk.samtools.SAMFileWriterFactory.makeSAMOrBAMWriter(SAMFileWriterFactory.java:425)
        at org.broadinstitute.dropseqrna.barnyard.digitalallelecounts.sampleassignment.GenerateSyntheticDoublets.doWork(GenerateSyntheticDoublets.java:107)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

This is for generating 518 doublets from a relatively small BAM file (12 GB), and I've had successful runs for another ~275 BAMs, some of which are much larger and require generating many more doublets.

Steps to reproduce

I'm not sure how to reproduce this, since I don't know what about this BAM is causing the issue. There doesn't appear to be anything obviously wrong with the file, since samtools quickcheck -v returns nothing.

Expected behavior

I would expect it to load the BAM, generate doublets, and write a BAM of simulated doublets.

Actual behavior

I would normally receive messages like the following while it loads data, before it reports org.broadinstitute.dropseqrna.barnyard.digitalallelecounts.sampleassignment.GenerateSyntheticDoublets done. Elapsed time: 0.36 minutes.:

INFO    2023-03-08 18:23:43     GenerateSyntheticDoublets       Found 93739 cell barcodes in file
INFO    2023-03-08 18:24:14     GenerateSyntheticDoublets       Processed     1,000,000 records.  Elapsed time: 00:00:07s.  Time for last 1,000,000:    7s.  Last read position: 1:629,325
INFO    2023-03-08 18:24:20     GenerateSyntheticDoublets       Processed     2,000,000 records.  Elapsed time: 00:00:14s.  Time for last 1,000,000:    6s.  Last read position: 1:630,374
INFO    2023-03-08 18:24:27     GenerateSyntheticDoublets       Processed     3,000,000 records.  Elapsed time: 00:00:20s.  Time for last 1,000,000:    6s.  Last read position: 1:632,488
INFO    2023-03-08 18:24:34     GenerateSyntheticDoublets       Processed     4,000,000 records.  Elapsed time: 00:00:27s.  Time for last 1,000,000:    6s.  Last read position: 1:632,553
INFO    2023-03-08 18:24:40     GenerateSyntheticDoublets       Processed     5,000,000 records.  Elapsed time: 00:00:34s.  Time for last 1,000,000:    6s.  Last read position: 1:634,125

Let me know if there's any additional information that I can provide to help with resolving this issue.

Hi!

I haven't seen an error quite like this before. The out-of-memory error is occurring while setting up the output BAM writer, before any "work" of writing records has started. Have you tried increasing memory with -m at the start of your arguments? I also wonder whether something in the input BAM header itself is problematic. You might try to create a smaller, reproducible test data set by capturing the header plus the first 5 reads of the BAM (see the sketch below). If that reproduces the error, maybe you can share it? Otherwise, it'll be hard to debug what's going on.
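A minimal sketch of one way to build such a test file with samtools (file names here are placeholders, not paths from this thread):

samtools view -H combined_singlets.bam > mini.sam          # header only
samtools view combined_singlets.bam | head -5 >> mini.sam  # plus the first 5 reads
samtools view -b -o mini.bam mini.sam                      # convert back to BAM
samtools quickcheck -v mini.bam                            # sanity-check the result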

@alecw any ideas?

commented

Apparently this error occurs when the heap size is smaller than the array being allocated. A couple of questions:

  • What heap size are you requesting? Are you using -m, or the default 4g?
  • How big is the header on your input BAM? I.e., what does samtools view -H /directflow/SCCGGroupShare/projects/DrewNeavin/Demultiplex_Benchmark/output/Consortium/benchmark_rerun/SimulatedPools/size8_SimulatedPool3_unevenN_0.5pctl/combined_singlets.bam | wc -c produce? (See the sketch after this list.)
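A sketch of both checks (the -m flag and the 4g default are as described above; 16g is an arbitrary example, file names are shortened placeholders, and the trailing options are elided):

samtools view -H combined_singlets.bam | wc -c   # header size in bytes
samtools view -H combined_singlets.bam | wc -l   # number of header lines
GenerateSyntheticDoublets -m 16g --INPUT combined_singlets.bam --OUTPUT simulated_doublets.bam ...  # larger heap; -m comes first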

Is this problem solved, so we can close this issue?

Hi @jamesnemesh, mostly yes. It turns out the BAMs I was having trouble with had >1 million header lines. When I reduced the header lines (along the lines of the sketch below), most of the BAMs processed without issue. There is one exception, but I'm guessing that is an issue specific to that BAM, or to its header, that I haven't yet identified. Thanks for your help!
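For anyone who hits the same limit, one possible way to slim a bloated header with samtools (a sketch only; it assumes the excess lines are @PG/@CO entries, which this thread didn't confirm, so keep whatever header lines your downstream tools need):

samtools view -H combined_singlets.bam | grep -v -e '^@PG' -e '^@CO' > slim_header.sam  # drop @PG/@CO lines
samtools reheader slim_header.sam combined_singlets.bam > slim.bam                      # reads unchanged, new header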