snakemake-workflows / dna-seq-gatk-variant-calling

This Snakemake pipeline implements the GATK best-practices workflow.


rule plot_stats fails with "OverflowError: value too large to convert to npy_uint32"

gernophil opened this issue

Hey everyone,

I am trying to run this pipeline with 144 samples, so the resulting files are quite big. I managed to get it almost to the end, but the last rule (plot_stats) fails with OverflowError: value too large to convert to npy_uint32. I guess I just have too many rows in my calls.tsv.gz to be handled. The complete error log is:

Traceback (most recent call last):
  File "/[PATH]/workflow_var_calling/.snakemake/scripts/tmp10j_ba31.plot-depths.py", line 16, in <module>
    sample_info = calls.loc[:, samples].stack([0, 1]).unstack().reset_index(1, drop=False)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/series.py", line 2899, in unstack
    return unstack(self, level, fill_value)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 501, in unstack
    constructor=obj._constructor_expanddim)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 116, in __init__
    self.index = index.remove_unused_levels()
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1494, in remove_unused_levels
    uniques = algos.unique(lab)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/algorithms.py", line 367, in unique
    table = htable(len(values))
  File "pandas/_libs/hashtable_class_helper.pxi", line 937, in pandas._libs.hashtable.Int64HashTable.__cinit__
OverflowError: value too large to convert to npy_uint32

Any ideas?

Sorry, I have no quick and easy fix. One way to make this script work better on such large datasets would be to replace the pandas code with polars, which should be faster and more memory-efficient:
https://pola-rs.github.io/polars-book/user-guide/

As this script is not very long or complicated, switching the library used for handling the dataframes should not be overly involved. But unless you already know polars, it will surely take a moment to find all the right syntax (with the added benefit of learning polars ;)).

Also, one caveat: switching to polars does not guarantee that this will run through; it just makes it more likely.

Thanks for that. I will definitely take a look at polars; I have never used it before. For now, I took a different approach: I split calls.tsv.gz in half (copying the header to the second half) and ran the rule separately on the two files. It has been running for 4 hours now with no error so far. Fingers crossed :).
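For anyone hitting the same wall, the split-in-half workaround can be sketched with the standard library alone (file names are hypothetical; adapt to your paths). It streams the file rather than loading all rows at once:

```python
import gzip
import itertools

def split_tsv_gz(path, out1, out2):
    """Split a gzipped TSV into two halves, copying the header to both.

    Streams the file twice, so no more than one row is held in memory.
    """
    with gzip.open(path, "rt") as fh:
        n_rows = sum(1 for _ in fh) - 1  # data rows, excluding the header
    mid = n_rows // 2
    with gzip.open(path, "rt") as fh:
        header = fh.readline()
        with gzip.open(out1, "wt") as oh:
            oh.write(header)
            oh.writelines(itertools.islice(fh, mid))  # first half
        with gzip.open(out2, "wt") as oh:
            oh.write(header)
            oh.writelines(fh)  # remaining rows
```

Each half can then be fed to the rule separately, e.g. `split_tsv_gz("calls.tsv.gz", "calls.1.tsv.gz", "calls.2.tsv.gz")`.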

Fingers crossed! 😅

As a more general solution, some meaningful way of (programmatically) stratifying samples might make sense: for example, an annotation column in config/samples.tsv that defines groups to split the samples into, a rule that splits calls.tsv.gz into those groups, and a plot_stats rule that then works on the resulting smaller files.
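The splitting rule's script could look roughly like this minimal sketch, assuming a hypothetical layout where config/samples.tsv has "sample" and "group" columns and calls.tsv.gz is wide, with one column per sample name (stdlib only, no pandas):

```python
import csv
import gzip
from collections import defaultdict

def split_calls_by_group(samples_tsv, calls_tsv_gz, out_pattern="calls.{}.tsv.gz"):
    """Write one calls file per sample group, keeping non-sample columns."""
    # Read the sample -> group annotation (column names are assumptions).
    groups = defaultdict(list)
    with open(samples_tsv, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            groups[row["group"]].append(row["sample"])

    with gzip.open(calls_tsv_gz, "rt", newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)
        sample_cols = {s for members in groups.values() for s in members}
        # Columns shared by every group (e.g. CHROM, POS).
        fixed = [i for i, col in enumerate(header) if col not in sample_cols]
        writers, handles = {}, []
        for group, members in groups.items():
            idx = fixed + [header.index(s) for s in members]
            oh = gzip.open(out_pattern.format(group), "wt", newline="")
            handles.append(oh)
            w = csv.writer(oh, delimiter="\t")
            w.writerow([header[i] for i in idx])
            writers[group] = (w, idx)
        # Stream the calls row by row into every group's file.
        for row in reader:
            for w, idx in writers.values():
                w.writerow([row[i] for i in idx])
        for oh in handles:
            oh.close()
```

This is only an illustration of the idea; a real rule would also need to handle per-sample subcolumns (the script's MultiIndex) and wildcard-based output paths.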

We don't currently have the capacity to provide something like this ourselves, but we always welcome pull requests and will try to review and merge them quickly.

And if you are looking for a more actively maintained Snakemake workflow for variant calling, we are putting a lot of effort into this one:
https://snakemake.github.io/snakemake-workflow-catalog/?usage=snakemake-workflows/dna-seq-varlociraptor