rule plot_stats fails with "OverflowError: value too large to convert to npy_uint32"
gernophil opened this issue · comments
Hey everyone,
I am trying to run this pipeline with 144 samples, so the resulting files are quite big. I managed to get it almost to the end, but the last rule (`plot_stats`) fails with `OverflowError: value too large to convert to npy_uint32`. I guess I just have too many rows in my `calls.tsv.gz` to be handled. The complete error log is:
```
Traceback (most recent call last):
  File "/[PATH]/workflow_var_calling/.snakemake/scripts/tmp10j_ba31.plot-depths.py", line 16, in <module>
    sample_info = calls.loc[:, samples].stack([0, 1]).unstack().reset_index(1, drop=False)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/series.py", line 2899, in unstack
    return unstack(self, level, fill_value)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 501, in unstack
    constructor=obj._constructor_expanddim)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 116, in __init__
    self.index = index.remove_unused_levels()
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1494, in remove_unused_levels
    uniques = algos.unique(lab)
  File "/[PATH]/workflow_var_calling/.snakemake/conda/5e32b1f022a698680d2667be14f8a58a/lib/python3.6/site-packages/pandas/core/algorithms.py", line 367, in unique
    table = htable(len(values))
  File "pandas/_libs/hashtable_class_helper.pxi", line 937, in pandas._libs.hashtable.Int64HashTable.__cinit__
OverflowError: value too large to convert to npy_uint32
```
Any ideas?
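For context, a toy version of the reshape on line 16 of `plot-depths.py` (with invented sample and field names) shows why the intermediate index can blow up on large inputs:

```python
import pandas as pd

# Toy illustration (sample/field names are made up). `calls` has
# MultiIndex columns of (sample, field); stacking both levels builds an
# index with one entry per variant x sample x field. With 144 samples
# and millions of variant rows, that intermediate index can outgrow the
# 32-bit sizes used internally by the pandas hash table, which is what
# surfaces as the OverflowError above.
cols = pd.MultiIndex.from_product([["s1", "s2"], ["DP", "AF"]])
calls = pd.DataFrame([[10.0, 0.5, 20.0, 0.25]], columns=cols)
sample_info = calls.stack([0, 1]).unstack().reset_index(1, drop=False)
# One row per (variant, sample), one column per field plus the sample label.
```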
Sorry, I have no quick and easy ideas for a fix. One way to make this script work better on such large datasets could be to exchange the pandas code for polars, which should be quicker and more memory-efficient: https://pola-rs.github.io/polars-book/user-guide/

As this is not such a long script, and not too complicated, switching the library used for handling the dataframes should not be overly complicated either. But unless you already know polars, it will surely take a moment to find all the right syntax (with the added benefit of learning polars ;).

One caveat, though: switching to polars does not guarantee that this will run through. It just makes it more likely.
Thanks for that. I will definitely take a look at polars; I've never used it before. For now I took a different approach: I simply split `calls.tsv.gz` in half (and copied the header to the 2nd half) and ran the rule on these files separately. It's been running for 4h now, but no error so far. Fingers crossed :).
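For reference, that split-with-header workaround can be scripted with just the standard library (file names and the row-count argument here are illustrative, not part of the workflow):

```python
import gzip

def split_tsv_gz(path, out1, out2, n_rows):
    """Stream a gzipped TSV into two gzipped halves, writing the header
    line to both outputs. `n_rows` is the number of data rows in `path`
    (counted beforehand); all names/paths are illustrative."""
    mid = n_rows // 2
    with gzip.open(path, "rt") as fh, \
         gzip.open(out1, "wt") as o1, \
         gzip.open(out2, "wt") as o2:
        header = fh.readline()
        o1.write(header)
        o2.write(header)
        # Stream line by line so the full table never sits in memory.
        for i, line in enumerate(fh):
            (o1 if i < mid else o2).write(line)
```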
Fingers crossed! 😅
As a more general solution, some meaningful way of (programmatically) stratifying samples might make sense: an annotation column in `config/samples.tsv` that defines the groups you want to split your samples into, a rule that splits `calls.tsv.gz` by those groups, and then having the rule `plot_stats` work on those smaller files.
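A hedged sketch of that idea (the column names `sample_name` and `group` are assumptions, not the workflow's actual schema):

```python
import pandas as pd

def split_calls_by_group(samples: pd.DataFrame, calls: pd.DataFrame) -> dict:
    """Partition the per-sample columns of `calls` by a hypothetical
    `group` annotation column in the samples sheet. Returns a dict of
    {group: sub-DataFrame with shared columns + that group's samples};
    each value could then be written to its own calls.<group>.tsv.gz."""
    sample_cols = set(samples["sample_name"])
    # Non-sample columns (e.g. chrom/pos) are kept in every subset.
    shared = [c for c in calls.columns if c not in sample_cols]
    return {
        group: calls[shared + list(members)]
        for group, members in samples.groupby("group")["sample_name"]
    }
```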
We don't currently have the capacity to provide something like this ourselves, but we always welcome pull requests and will try to review and merge them quickly.
And if you are looking for a more actively maintained snakemake workflow for variant calling, we are putting a lot of effort into this one:
https://snakemake.github.io/snakemake-workflow-catalog/?usage=snakemake-workflows/dna-seq-varlociraptor