tools/htseq: Add override or fix htseq-count max-reads-in-buffer option

zaeleus opened this issue 4 years ago

When the input records have mates, htseq-count buffers unmatched records until their mates are seen. In extreme cases, the default buffer size (--max-reads-in-buffer 30000000) is too small, causing the following error:

Error occured when processing SAM input (record #396226907 in file sample.bam):
  Maximum alignment buffer size exceeded while pairing SAM alignments.

I propose either adding an input to override the value (Int max_reads_in_buffer = 30000000) or fixing the value to an infeasibly high record count, e.g., 2^63-1. The latter is then simply bounded by memory.
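For illustration, here is a minimal sketch of the two proposals. It is written in Python rather than as the actual task definition. The function name htseq_count_command and the file names are hypothetical. The --format and --max-reads-in-buffer flags are htseq-count's own. Whether htseq-count behaves well with a limit as large as 2^63-1 is an assumption taken from this thread and has not been verified here.

```python
# Largest signed 64-bit integer, the "infeasibly high" value from proposal 2.
I64_MAX = 2**63 - 1

# htseq-count's default, as quoted in this issue.
DEFAULT_MAX_READS_IN_BUFFER = 30_000_000


def htseq_count_command(bam, gtf, max_reads_in_buffer=DEFAULT_MAX_READS_IN_BUFFER):
    """Build an htseq-count argument list (hypothetical helper).

    Proposal 1: callers pass their own max_reads_in_buffer.
    Proposal 2: callers always pass I64_MAX, leaving memory as the only bound.
    """
    if max_reads_in_buffer < 1:
        raise ValueError("max_reads_in_buffer must be a positive integer")
    return [
        "htseq-count",
        "--format", "bam",
        "--max-reads-in-buffer", str(max_reads_in_buffer),
        bam,
        gtf,
    ]


if __name__ == "__main__":
    # Proposal 1: user-supplied override.
    print(" ".join(htseq_count_command("sample.bam", "genes.gtf", 100_000_000)))
    # Proposal 2: effectively unbounded.
    print(" ".join(htseq_count_command("sample.bam", "genes.gtf", I64_MAX)))
```

Either way the generated command differs only in the number passed to the flag, so the choice is about what to expose to users, not about how htseq-count is invoked.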

Jobin Sunny · Answer 1 · Wed Oct 07 2020 22:31:31 GMT+0800 (China Standard Time)

How extreme are the cases with the error? <1% of the samples? I like adding an input for the option - that way users can dial up the max buffered reads and memory as needed. But if leaving the value really high has no impact other than reaching the memory limit quicker, then option 2 sounds better to me since we can just set this option once and be done.

Andrew Frantz · Answer 2 · Wed Oct 07 2020 22:35:25 GMT+0800 (China Standard Time)

I'm guessing that the limit doesn't have any impact on performance, and is meant as a cap to the memory usage in shared environments. I don't think there will be any downsides to removing the limit by setting it arbitrarily high.

Michael Macias · Answer 3 · Thu Oct 08 2020 01:00:42 GMT+0800 (China Standard Time)

> How extreme are the cases with the error? <1% of the samples?

It's rare in the samples I work with: ~1/250. It typically occurs when a sample has excessive coverage in a small region.
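To illustrate why a dense pileup can exhaust the buffer, here is a simplified Python model of mate pairing. It is not HTSeq's implementation. The function peak_pairing_buffer and the toy read names are hypothetical, and the model assumes each read name appears exactly twice, in file order.

```python
def peak_pairing_buffer(read_names, max_reads_in_buffer):
    """Return the peak number of reads held while waiting for their mates.

    Simplified model: a read is buffered when first seen and released when
    its mate (same name) arrives. Raises if the buffer outgrows the limit.
    """
    pending = set()
    peak = 0
    for name in read_names:
        if name in pending:
            # Mate found: the pair is complete and leaves the buffer.
            pending.remove(name)
        else:
            pending.add(name)
            if len(pending) > max_reads_in_buffer:
                raise RuntimeError(
                    "Maximum alignment buffer size exceeded "
                    "while pairing SAM alignments."
                )
            peak = max(peak, len(pending))
    return peak


if __name__ == "__main__":
    # Mates arrive right after each other: the buffer stays small.
    print(peak_pairing_buffer(["a", "a", "b", "b", "c", "c"], 10))
    # Many first mates pile up before any second mate arrives.
    print(peak_pairing_buffer(["a", "b", "c", "a", "b", "c"], 10))
```

In this model the peak depends on how many reads are simultaneously waiting for a mate, not on the total record count, which is consistent with the error appearing only in samples with excessive coverage in a small region.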

> I like adding an input for the option - that way users can dial up the max buffered reads and memory as needed.

One thing to consider is that max-reads-in-buffer is not a very good user option, as its effects are opaque.