dfguan / purge_dups

haplotypic duplication identification tool

Bug in the Python script + question about what purge_dups does when run without -c and -T

ArthurPERE opened this issue

Hello,

We found a bug in one of the Python scripts, and we also have a question.

Python bug

Even though using the Python script run_purge_dups.py is optional, it has a bug.
The script works, but not completely, when the read files (PacBio HiFi) are gzip-compressed (".gz"). Maybe this could be stated in the README.md.

The error for the ".gz" files is:

Traceback (most recent call last):
 File "/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
   self.run()
 File "/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run
   self._target(*self._args, **self._kwargs)
 File "scripts/run_purge_dups.py", line 119, in cal_cov
   for fl in f:
 File "/anaconda3/lib/python3.8/codecs.py", line 322, in decode
   (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

As a result, the reads are not mapped to the assembly, and none of the coverage calculation is done.
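The byte 0x8b at position 1 in the traceback matches the second byte of the gzip magic number (0x1f 0x8b), which suggests the compressed file is being read as plain UTF-8 text. Below is a minimal sketch of a reader that handles both plain and gzip-compressed input. `open_text_maybe_gzip` is a hypothetical helper, not a function from run_purge_dups.py, and how cal_cov actually opens its input is an assumption here.

```python
import gzip

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def open_text_maybe_gzip(path):
    """Open `path` for text reading, decompressing it if it is gzip-compressed.

    Hypothetical helper for illustration; it checks the file content
    rather than trusting the ".gz" extension.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        # "rt" makes gzip.open return decoded text lines, like open(..., "r").
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

A loop such as `for fl in f:` would then work unchanged on the returned file object, whether or not the reads are compressed.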

Question

When we use the Python script anyway with ".gz" files, the script continues and purges the duplicates without running pbcstat and calcuts.

purge_dups runs even though the coverage and cutoffs files are missing (that we have in ), and this does not seem to bother the program, which finishes without any error or warning message. So in effect we only mapped the assembly onto itself and gave that to purge_dups.

And when we run get_seqs, it just works, and we get the purged and hap files.

What happens when we run purge_dups without the coverage and cutoffs files?
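Until that is clarified, a wrapper could refuse to continue when the coverage or cutoffs file is missing, instead of silently running without them. The sketch below is only an illustration under assumptions: `coverage_args` is a hypothetical helper, and it assumes "-c" takes the coverage file and "-T" the cutoffs file, as the options are described in this thread.

```python
import os


def coverage_args(cov_path, cutoffs_path):
    """Return the -c/-T arguments, or raise if either input is unusable.

    Hypothetical pre-flight check: a file counts as unusable when it
    does not exist or is empty (e.g. an earlier step failed).
    """
    missing = [
        p for p in (cov_path, cutoffs_path)
        if not (os.path.isfile(p) and os.path.getsize(p) > 0)
    ]
    if missing:
        raise FileNotFoundError(
            "missing or empty coverage inputs: " + ", ".join(missing)
        )
    # Assumed mapping: -c = coverage file, -T = cutoffs file.
    return ["-c", cov_path, "-T", cutoffs_path]
```

Raising an error is a deliberate choice here: a run that stops early is easier to notice than a purge that completes on self-alignment alone.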

Thanks,
Regards.

I tried Purge_dups on two genomes without specifying coverage and cutoffs files (which means, without options "-c" and "-T") and, quite unexpectedly, Purge_dups produced much better results without them. I took a peek into the code and wasn't able to understand what it does when I don't specify the files.