uqfoundation / dill

serialize all of Python

Home Page: http://dill.rtfd.io

UnpicklingError using dill but not stdlib pickle

xzy3 opened this issue

dill version 0.3.7
centos stream
python 3.10.4

I've run into a situation where the standard library pickle successfully serializes and restores an object, but the same round trip with dill raises an UnpicklingError.

In [7]:  dill.loads(dill.dumps(iterable[0]))
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
File <ipython-input-7-fc7606c36bd3>:1
----> 1 dill.loads(dill.dumps(iterable[0]))

File ~/.local/virtualenvs/stantz-2022-update/lib/python3.10/site-packages/dill/_dill.py:301, in loads(str, ignore, **kwds)
    290 """
    291 Unpickle an object from a string.
    292
   (...)
    298 Default values for keyword arguments can be set in :mod:`dill.settings`.
    299 """
    300 file = StringIO(str)
--> 301 return load(file, ignore, **kwds)

File ~/.local/virtualenvs/stantz-2022-update/lib/python3.10/site-packages/dill/_dill.py:287, in load(file, ignore, **kwds)
    281 def load(file, ignore=None, **kwds):
    282     """
    283     Unpickle an object from a file.
    284
    285     See :func:`loads` for keyword arguments.
    286     """
--> 287     return Unpickler(file, ignore=ignore, **kwds).load()

File ~/.local/virtualenvs/stantz-2022-update/lib/python3.10/site-packages/dill/_dill.py:442, in Unpickler.load(self)
    441 def load(self): #NOTE: if settings change, need to update attributes
--> 442     obj = StockUnpickler.load(self)
    443     if type(obj).__module__ == getattr(_main_module, '__name__', '__main__'):
    444         if not self._ignore:
    445             # point obj class to main

UnpicklingError: NEWOBJ class argument must be a type, not NoneType

In [8]:  import pickle

In [9]: pickle.loads(pickle.dumps(iterable[0]))
Out[9]: <cdcdvh.ghost.preprocess.Preprocess |CasperStrategy| out: |preprocess/GHOST_EP10.gh5|>

Here is the data serialized by dill:

b'\x80\x04\x95\xbf\x10\x00\x00\x00\x00\x00\x00\x8c\x17cdcdvh.ghost.preprocess\x94\x8c\nPreprocess\x94\x93\x94)\x81\x94}\x94(\x8c\x08seq_file\x94\x8c\x1acdcdvh.ghost.util.inputset\x94\x8c\x0cPairedEndSet\x94\x93\x94)\x81\x94}\x94(
\x8c\x05_open\x94\x8c\x17cdcdvh.ghost.util.files\x94\x8c\x0fopen_compressed\x94\x93\x94\x8c\x06format\x94\x8c\x05fastq\x94\x8c\x07r1_file\x94\x8ck/scicomp/groups-pure/OID/NCHHSTP/DVH/testdata/TrainingDataset_A/rawfiles/GHOST_EP10_S1_L001
_R1_001.fastq.gz\x94\x8c\x07r2_file\x94\x8ck/scicomp/groups-pure/OID/NCHHSTP/DVH/testdata/TrainingDataset_A/rawfiles/GHOST_EP10_S1_L001_R2_001.fastq.gz\x94ub\x8c\x0boutput_path\x94N)\x81\x94}\x94(\x8c\x04path\x94\x8c\x19preprocess/GHOST_
EP10.gh5\x94\x8c\x06kwargs\x94}\x94\x8c\x0bsample_name\x94\x8c\nGHOST_EP10\x94s\x8c\x04mode\x94\x8c\x01w\x94ub\x8c\x05clean\x94\x8c\x19cdcdvh.ghost.clean.casper\x94\x8c\x0eCasperStrategy\x94\x93\x94)\x81\x94}\x94(\x8c\x14min_major_propor
tion\x94N\x8c\x05steps\x94(\x8c\x1acdcdvh.ghost.util.seqtools\x94\x8c\x0efastq_id_match\x94\x93\x94)\x81\x94h(\x8c\x18drop_ambiguous_sequences\x94\x93\x94)\x81\x94}\x94(\x8c\x07maximum\x94G?\xef\xae\x14z\xe1G\xae\x8c\x06reason\x94\x8c\x1
2more than 0.99% Ns\x94ubh(\x8c\x0bphix_filter\x94\x93\x94)\x81\x94}\x94\x8c\x07ref_dir\x94\x8c_/scicomp/home-pure/xzy3/.cache/ghost_reference_db/compiled-refs/phix174-ref-gj0drr_e-BWA/bwa-db\x94sbh(\x8c\x14short_product_filter\x94\x93\x
94)\x81\x94}\x94(\x8c\rforward_regex\x94\x8c\x0cregex._regex\x94\x8c\x07compile\x94\x93\x94(\x8c+(GGATATGATGATGAACTGGT){s<=2,i<=1,d<=1,e<=3}\x94M 0C-,\x1e\x01\x01\x01\x1b\x00\x00\x01\x00\x01\x00\x02\x00\x03\x01\x01\x01\x03J\x04\x14GGATAT
GATGATGAACTGGT\x14\x14\x01\x94}\x94}\x94}\x94]\x94K\x00)K\x00K\x01t\x94R\x94\x8c\rreverse_regex\x94h@(\x8c-(ATGTGCCAGCTGCCGTTGGTGT){s<=2,i<=1,d<=1,e<=3}\x94M 0C/.\x1e\x01\x01\x01\x1b\x00\x00\x01\x00\x01\x00\x02\x00\x03\x01\x01\x01\x03J\x04\x16ATGTGCCAGCTGCCGTTGGTGT\x14\x14\x01\x94}\x94}\x94}\x94]\x94K\x00)K\x00K\x01t\x94R\x94\x8c\x08min_size\x94G@g\x19\x99\x99\x99\x99\x99ubh(\x8c\x16remove_short_sequences\x94\x93\x94)\x81\x94}\x94(hRG@g\x19\x99\x99\x99\x99\x99h1\x8c+se$uence is shorter than 184.79999999999998\x94ubh(\x8c\x13mid_distance_filter\x94\x93\x94)\x81\x94}\x94(\x8c\x08mid_list\x94]\x94(\x8c\nACGAGTGCGT\x94\x8c\nACGCTCGACA\x94\x8c\nAGACGCACTC\x94\x8c\nAGCACTGTAG\x94\x8c\nATCAGACACG\x94\x8c\nAT$TCGCGAG\x94\x8c\nCGTGTCTCTA\x94\x8c\nCTCGCGTGTC\x94\x8c\nTAGTATCAGC\x94\x8c\nTCTCTATGCG\x94\x8c\nTGATACGTCT\x94\x8c\nTACTGAGCTA\x94\x8c\nCATAGTAGTG\x94\x8c\nCGAGAGATAC\x94\x8c\nATACGACGTA\x94\x8c\nTCACGTACTA\x94\x8c\nCGTCTAGTAC\x94\x8c\$TCTACGTAGC\x94\x8c\nTGTACTACTC\x94\x8c\nACGACTACAG\x94\x8c\nCGTAGACTAG\x94\x8c\nTACGAGTATG\x94\x8c\nTACTCTCGTG\x94\x8c\nTAGAGACGAG\x94\x8c\nTCGTCGCTCG\x94\x8c\nACATACGCGT\x94\x8c\nACGCGAGTAT\x94\x8c\nACTACTATGT\x94\x8c\nACTGTACAGT\x94\x$c\nAGACTATACT\x94\x8c\nAGCGTCGTCT\x94\x8c\nAGTACGCTAT\x94\x8c\nATAGAGTACT\x94\x8c\nCACGCTACGT\x94\x8c\nCAGTAGACGT\x94\x8c\nCGACGTGACT\x94\x8c\nTACACACACT\x94\x8c\nTACACGTGAT\x94\x8c\nTACAGATCGT\x94\x8c\nTACGCTGTCT\x94\x8c\nTAGTGTAGAT\x9$\x8c\nTCGATCACGT\x94\x8c\nTCGCACTAGT\x94\x8c\nTCTAGCGACT\x94\x8c\nTCTATACTAT\x94\x8c\nTGACGTATGT\x94\x8c\nTGTGAGTAGT\x94\x8c\nACAGTATATA\x94\x8c\nACGCGATCGA\x94\x8c\nACTAGCAGTA\x94\x8c\nAGCTCACGTA\x94\x8c\nAGTATACATA\x94\x8c\nAGTCGAGAGA$x94\x8c\nAGTGCTACGA\x94\x8c\nCGATCGTATA\x94\x8c\nCGCAGTACGA\x94\x8c\nCGCGTATACA\x94\x8c\nCGTACAGTCA\x94\x8c\nCGTACTCAGA\x94\x8c\nCTACGCTCTA\x94\x8c\nCTATAGCGTA\x94\x8c\nTACGTCATCA\x94\x8c\nTAGTCGCATA\x94\x8c\nTATATATACA\x94\x8c\nTATGCTA$TA\x94\x8c\nTCACGCGAGA\x94\x8c\nTCGATAGTGA\x94\x8c\nTCGCTGCGTA\x94\x8c\nTCTGACGTCA\x94\x8c\nTGAGTCAGTA\x
94\x8c\nTGTAGTGTGA\x94\x8c\nTGTCACACGA\x94\x8c\nTGTCGTCGCA\x94\x8c\nACACATACGC\x94\x8c\nACAGTCGTGC\x94\x8c\nACATGACGAC\x94\x8c\nACGA$AGCTC\x94\x8c\nACGTCTCATC\x94\x8c\nACTCATCTAC\x94\x8c\nACTCGCGCAC\x94\x8c\nAGAGCGTCAC\x94\x8c\nAGCGACTAGC\x94\x8c\nAGTAGTGATC\x94\x8c\nAGTGACACAC\x94\x8c\nAGTGTATGTC\x94\x8c\nATAGATAGAC\x94\x8c\nATATAGTCGC\x94\x8c\nATCTACTGAC\x94\x8c\nC$CGTAGATC\x94\x8c\nCACGTGTCGC\x94\x8c\nCATACTCTAC\x94\x8c\nCGACACTATC\x94\x8c\nCGAGACGCGC\x94\x8c\nCGTATGCGAC\x94\x8c\nCGTCGATCTC\x94\x8c\nCTACGACTGC\x94\x8c\nCTAGTCACTC\x94\x8c\nCTCTACGCTC\x94\x8c\nCTGTACATAC\x94\x8c\nTAGACTGCAC\x94\x8c$nTAGCGCGCGC\x94\x8c\nTAGCTCTATC\x94\x8c\nTATAGACATC\x94\x8c\nTATGATACGC\x94\x8c\nTCACTCATAC\x94\x8c\nTCATCGAGTC\x94\x8c\nTCGAGCTCTC\x94\x8c\nTCGCAGACAC\x94\x8c\nTCTGTCTCGC\x94\x8c\nTGAGTGACGC\x94\x8c\nTGATGTGTAC\x94\x8c\nTGCTATAGAC\x94\$8c\nTGCTCGCTAC\x94\x8c\nACGTGCAGCG\x94\x8c\nACTCACAGAG\x94\x8c\nAGACTCAGCG\x94\x8c\nAGAGAGTGTG\x94\x8c\nAGCTATCGCG\x94\x8c\nAGTCTGACTG\x94\x8c\nAGTGAGCTCG\x94\x8c\nATAGCTCTCG\x94\x8c\nATCACGTGCG\x94\x8c\nATCGTAGCAG\x94\x8c\nATCGTCTGTG\x$4\x8c\nATGTACGATG\x94\x8c\nATGTGTCTAG\x94\x8c\nCACACGATAG\x94\x8c\nCACTCGCACG\x94\x8c\nCAGACGTCTG\x94\x8c\nCAGTACTGCG\x94\x8c\nCGACAGCGAG\x94\x8c\nCGATCTGTCG\x94\x8c\nCGCGTGCTAG\x94\x8c\nCGCTCGAGTG\x94\x8c\nCGTGATGACG\x94\x8c\nCTATGTACA$\x94\x8c\nCTCGATATAG\x94\x8c\nCTCGCACGCG\x94\x8c\nCTGCGTCACG\x94\x8c\nCTGTGCGTCG\x94\x8c\nTAGCATACTG\x94\x8c\nTATACATGTG\x94\x8c\nTATCACTCAG\x94\x8c\nTATCTGATAG\x94\x8c\nTCGTGACATG\x94\x8c\nTCTGATCGAG\x94\x8c\nTGACATCTCG\x94\x8c\nTGAGCT$GAG\x94\x8c\nTGATAGAGCG\x94\x8c\nTGCGTGTGCG\x94\x8c\nTGCTAGTCAG\x94\x8c\nTGTATCACAG\x94\x8c\nTGTGCGCGTG\x94e\x8c\x06metric\x94\x8c\x1acdcdvh.pyseqdist.cDistance\x94\x8c\redit_distance\x94\x93\x94\x8c\x07mid_len\x94K\n\x8c\x08max_dist\x9$K\x00\x8c\x12disable_mid_filter\x94\x89ubh(\x8c\x0fmost_common_mid\x94\x93\x94)\x81\x94}\x94(\x8c\x17contamination_threshold\x94G?\xd0\x00\x00\x00\x00\x00\x00h\xfd\x89ubh(\x8c\x11reservoir_sampler\x94\x93\x94
)\x81\x94}\x94(\x8c\x04size\$94M NhRM\x88\x13ubh(\x8c\x12canonify_read_pair\x94\x93\x94)\x81\x94}\x94(h=hHhIhQ\x8c\x0camplicon_len\x94M\x08\x01ubh!\x8c\x06Casper\x94\x93\x94)\x81\x94}\x94(h1\x8c#casper too much mismatch in overlap\x94\x8c\x0equal_threshold\x94K\x0f$x8c\x11kmer_neighborhood\x94K\x08\x8c\x08kmer_len\x94K\x11\x8c\x12mismatch_threshold\x94G?\xa9\x99\x99\x99\x99\x99\x9a\x8c\x13minimum_overlap_len\x94K\n\x8c\x10max_assembly_len\x94M\x0e\x01ubh(\x8c\x19filter_nonsense_sequences\x94\x93\x$4)\x81\x94h(\x8c\x13collapse_haplotypes\x94\x93\x94h(\x8c\x08genotype\x94\x93\x94)\x81\x94}\x94j\x1d\x01\x00\x00\x8c#cdcdvh.ghost.genotyping.blasttyping\x94\x8c\nBlastTyper\x94\x93\x94)\x81\x94}\x94(\x8c\nblast_args\x94]\x94(\x8c\x03-db$x94\x8cq/scicomp/home-pure/xzy3/.cache/ghost_reference_db/compiled-refs/ghost-hcv-genotyping-5pasbb4g-GENOTYPING/blast-db\x94e\x8c\x11reference_version\x94\x8c(02b66e58e1e2586830018776c43c172b688e9514\x94\x8c\x13unmatched_threshold\x94J$\xff\xff\xffubsbt\x94h\x1a}\x94ub\x8c\x12alignment_strategy\x94\x8c\x11profile-and-align\x94h\x1a}\x94ub.'
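The error message names the NEWOBJ opcode, so one way to narrow things down is to walk the stream with the standard library's pickletools and look for a NEWOBJ whose class operand is not a class. The sketch below is a heuristic, not part of dill: `find_none_newobj` is a hypothetical helper, and the test stream wraps the bytes `\x8c\x0boutput_path\x94N)\x81\x94` (which appear in the dump above) in a PROTO and a STOP opcode so that the fragment parses on its own.

```python
import pickle
import pickletools


def find_none_newobj(data):
    """Return byte offsets of NEWOBJ opcodes that directly follow NONE, EMPTY_TUPLE.

    Heuristic only: it catches the case where the class operand is a literal
    None and the argument tuple is empty, which is the pattern of interest here.
    """
    ops = list(pickletools.genops(data))  # (opcode, arg, position) triples
    hits = []
    for i in range(2, len(ops)):
        names = [ops[j][0].name for j in (i - 2, i - 1, i)]
        if names == ["NONE", "EMPTY_TUPLE", "NEWOBJ"]:
            hits.append(ops[i][2])
    return hits


# A well-formed pickle: object() is rebuilt through NEWOBJ with a real class.
good = pickle.dumps(object(), protocol=4)

# Fragment taken from the posted dump, wrapped in PROTO 4 and STOP (assumption:
# the wrapping exists only so the fragment can be scanned in isolation).
bad = b"\x80\x04" + b"\x8c\x0boutput_path\x94N)\x81\x94" + b"."

print("good:", find_none_newobj(good))
print("bad: ", find_none_newobj(bad))
```

In the posted stream this sequence follows the `output_path` key, which suggests that whatever was written in the class position for that attribute came out as None. That is a reading of the bytes, not a confirmed diagnosis.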

Can you post code that reproduces the error you are seeing?
I tried a few guesses at what iterable is, and the code works as expected.

Python 3.10.13 (main, Aug 25 2023, 02:21:32) [Clang 13.1.6 (clang-1316.0.21.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> iterable = [0,1,2,3,4,5]
>>> dill.loads(dill.dumps(iterable[0]))
0
>>> iterable = 'GATTACA'
>>> dill.loads(dill.dumps(iterable[0]))
'G'

I'm going to assume what you are experiencing is a case where pickle is serializing something in iterable by reference, while dill is storing the same object's contents. A minimal example to reproduce the error you are seeing would enable me to test it out and potentially do something.

Can you also try running with dill.settings['byref'] = True, and alternately, with dill.settings['recurse'] = True?
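A minimal sketch of trying those two settings, assuming dill is installed. `roundtrip` and `sample` are hypothetical stand-ins for the real object, and the settings are passed as per-call keywords to `dill.dumps` rather than by changing `dill.settings` globally, so each attempt is independent of the others.

```python
import dill


def roundtrip(obj, **settings):
    """Pickle and unpickle obj with dill; return (result, error)."""
    try:
        return dill.loads(dill.dumps(obj, **settings)), None
    except Exception as exc:  # report the failure instead of stopping the loop
        return None, exc


# Hypothetical stand-in; replace with the object that fails to unpickle.
sample = {"reads": ["GATTACA"], "min_size": 184.8}

for settings in ({}, {"byref": True}, {"recurse": True}):
    result, error = roundtrip(sample, **settings)
    print(settings, "ok" if error is None else repr(error))
```

When the pickling happens inside another library, such as a multiprocess Pool, the per-call keywords are not reachable, and setting `dill.settings['byref']` or `dill.settings['recurse']` before the call is the way to apply the same options.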

It's actually from uqfoundation's multiprocess Pool.imap_unordered adding a work unit to the queue. But I think I found the problem.

Quite a while ago I added some code to hack around dill issue #332. It is apparently no longer needed and is what causes this new issue. I commented that code out while working on a minimal example, and that resolved things.