stjudecloud / workflows

Bioinformatics workflows developed for and used on the St. Jude Cloud project.

Home Page: https://stjudecloud.github.io/workflows

tools/htseq: Add override or fix htseq-count max-reads-in-buffer option

zaeleus opened this issue

Workflow: RNA-Seq Standard 2.0.0

When the input records have mates, htseq-count holds unmatched records in a buffer until their mates are found. In extreme cases, the default limit on that buffer (--max-reads-in-buffer 30000000) is too small, causing the following error:

Error occured when processing SAM input (record #396226907 in file sample.bam):
  Maximum alignment buffer size exceeded while pairing SAM alignments.

I propose either adding an input to override the value (Int max_reads_in_buffer = 30000000) or fixing the value to an infeasibly high record count, e.g., 2^63-1. The latter is then simply bounded by memory.
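To make the two options concrete, here is a minimal Python sketch of how the htseq-count command line could be assembled under each one. The function name, constant names, and file names are hypothetical and used only for illustration; the actual workflow defines the task in WDL. Only the --format and --max-reads-in-buffer flags come from htseq-count itself, and whether htseq-count behaves well with a limit as large as 2^63-1 is an assumption that should be checked against the installed version.

```python
import shlex

# Default cited in the issue for --max-reads-in-buffer.
DEFAULT_MAX_READS_IN_BUFFER = 30_000_000
# Option 2: an infeasibly high record count (largest signed 64-bit integer),
# so the buffer is effectively bounded by available memory only.
UNBOUNDED_MAX_READS_IN_BUFFER = 2**63 - 1


def htseq_count_args(bam, gtf, max_reads_in_buffer=DEFAULT_MAX_READS_IN_BUFFER):
    """Build an htseq-count argument list (hypothetical helper, not part of the workflow)."""
    if max_reads_in_buffer < 1:
        raise ValueError("max_reads_in_buffer must be a positive integer")
    return [
        "htseq-count",
        "--format", "bam",
        "--max-reads-in-buffer", str(max_reads_in_buffer),
        bam,
        gtf,
    ]


# Option 1: the caller overrides the value per sample.
overridden = htseq_count_args("sample.bam", "genes.gtf", max_reads_in_buffer=90_000_000)
# Option 2: the value is fixed once to the effectively unbounded limit.
unbounded = htseq_count_args("sample.bam", "genes.gtf", UNBOUNDED_MAX_READS_IN_BUFFER)

print(shlex.join(overridden))
print(shlex.join(unbounded))
```

With option 1 the value becomes one more knob that has to be tuned per sample alongside memory; with option 2 the command is the same for every sample and only the memory request varies.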

How extreme are the cases with the error? <1% of the samples? I like adding an input for the option - that way users can dial up the max buffered reads and memory as needed. But if leaving the value really high has no impact other than reaching the memory limit quicker, then option 2 sounds better to me since we can just set this option once and be done.

I'm guessing that the limit doesn't have any impact on performance, and is meant as a cap to the memory usage in shared environments. I don't think there will be any downsides to removing the limit by setting it arbitrarily high.

> How extreme are the cases with the error? <1% of the samples?

It's rare in the samples I work with: ~1/250. It typically occurs when a sample has excessive coverage in a small region.

> I like adding an input for the option - that way users can dial up the max buffered reads and memory as needed.

One thing to consider is that max-reads-in-buffer is not a very good user option, as its effects are opaque.